It wasn’t until the IBM mainframe ↗System/360 (1964) that a byte was officially defined as 8 bits. This finally allowed the encoding tables to include lowercase letters and more punctuation. IBM’s proprietary encoding was called ↗EBCDIC (developed in 1963); US-ASCII also appeared in 1963 as a general, vendor-independent encoding standard, but the two were not compatible. ASCII used only 7 bits (values 0 to 127) and reserved the 8th bit as a parity bit: a check bit for detecting transmission errors when communicating with printers and terminals.
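The parity trick itself is simple: the sender counts the 1-bits of the 7-bit code and sets the 8th bit so that the total comes out even (or odd, depending on the convention). A small sketch in Python, purely for illustration; the helper name with_even_parity is made up for this example:

```python
def with_even_parity(code: int) -> int:
    """Return the 7-bit ASCII code with bit 7 set so that the number of 1-bits is even."""
    if not 0 <= code < 128:
        raise ValueError("not a 7-bit ASCII code")
    parity = bin(code).count("1") % 2   # 1 if the 7-bit code has an odd number of 1-bits
    return code | (parity << 7)         # set the 8th bit only when needed

print(hex(with_even_parity(0x41)))  # 0x41 -> 'A' already has an even number of 1-bits
print(hex(with_even_parity(0x43)))  # 0xc3 -> 'C' gets the parity bit set
```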
For the IBM PC (introduced in 1981), IBM developed the ASCII-based 8-bit code page ↗CP437, which included umlauts and accented characters. Later, other ASCII-compatible code pages were created for different regions and languages. Starting with ↗PC-DOS and MS-DOS 3.3 (1987), users could define the code page themselves in the CONFIG.SYS file.
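Why the active code page matters is easy to see at the byte level. The snippet below uses Python’s built-in codecs merely as a convenient lookup table; the byte values themselves are what counted on DOS:

```python
# The same byte value means different things depending on the code page in effect.
print(bytes([0x84]).decode("cp437"))   # 'ä'  -> umlauts live above 0x7F in CP437
print(bytes([0x9B]).decode("cp437"))   # '¢'  on the original IBM PC code page
print(bytes([0x9B]).decode("cp850"))   # 'ø'  on the later Western European DOS code page
```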
Around the same time, Microsoft also shifted toward ISO 8859. Since Windows 3.1 (1992), Windows no longer relied on the active DOS code page, but instead used "ANSI Code Pages", which are based on the ISO 8859 family. For example, CP1252 (Windows-1252, Western European) corresponds to ISO-8859-1. However, because Windows was a graphical user interface, there was now a desire to display typographic symbols for the first time; these “ ” quotation marks, for instance, could now be shown. So Microsoft replaced the unused, non-printable C1 control codes between 0x80 and 0x9F with such typographic characters.
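The difference shows up directly in the 0x80–0x9F range. A brief sketch, again using Python’s codecs only as a lookup:

```python
raw = bytes([0x93, 0x48, 0x69, 0x94])        # bytes 0x93 / 0x94 around the text "Hi"
print(raw.decode("cp1252"))                  # “Hi”  -> curly quotes in Windows-1252
decoded = raw.decode("latin-1")              # ISO-8859-1 maps bytes 1:1 to U+0000..U+00FF
print([hex(ord(c)) for c in decoded])        # ['0x93', '0x48', '0x69', '0x94'] -> 0x93/0x94 remain C1 control codes
```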
Another problem arose in Asia, with its complex writing systems: some scripts consist of several thousand characters, which simply can’t be represented within 8 bits (256 code points).
After several years of development, version 1.0 of the Unicode standard was released in 1991, containing over 7,000 characters. To support the project, the Unicode Consortium had already been founded, with members including well-known companies such as Adobe, Apple, Borland, DEC, IBM, Lotus, Microsoft, NeXT, Novell, Sun Microsystems, Symantec, Unisys, WordPerfect, and Xerox.
In the years that followed, more and more languages and characters were added to the Unicode tables, so that by 1993 the number had already reached 34,000 characters. It became clear that the 64K limit of 16 bits (UCS-2) would be exceeded — especially if historical writing systems were also to be included.
That’s why, in 1996, Unicode version 2.0 allowed a single character to occupy up to 32 bits. The most commonly used characters can still be encoded in one 16-bit word, and only rare or historical characters require two 16-bit words (a so-called surrogate pair). This flexible memory usage is implemented in the UTF-16 encoding format, which replaced the fixed-width UCS-2.
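On the byte level, that looks roughly like this (UTF-16, big-endian; Python is used here only to make the bytes visible):

```python
print('A'.encode('utf-16-be').hex())    # '0041'     -> one 16-bit unit
print('€'.encode('utf-16-be').hex())    # '20ac'     -> one 16-bit unit (U+20AC)
print('😀'.encode('utf-16-be').hex())   # 'd83dde00' -> two units: the surrogate pair for U+1F600
```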
UTF-8, by contrast, encodes each character as one to four bytes and leaves every ASCII character unchanged as a single byte. This makes it highly memory-efficient and, thanks to its ASCII compatibility, easy to integrate into existing systems, which was a key factor in its widespread adoption on Linux systems and the Web. Although the Unicode Consortium originally promoted UTF-16 as the primary encoding, UTF-8 was officially added to the Unicode standard later (starting with Unicode 3.0 in 1999).
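Both properties, the memory efficiency and the ASCII compatibility, are visible directly in the encoded bytes; a quick sketch (any language would do, Python is used here only for brevity):

```python
for ch in 'A', 'ä', '€', '😀':
    encoded = ch.encode('utf-8')
    print(ch, encoded.hex(), f'{len(encoded)} byte(s)')
# A   41        1 byte(s)  -> identical to the ASCII byte
# ä   c3a4      2 byte(s)
# €   e282ac    3 byte(s)
# 😀  f09f9880  4 byte(s)
```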
From the mid-2000s, most Linux distributions (Debian, Red Hat, later Ubuntu) began using UTF-8 by default. On the Internet as well, UTF-8 is now the dominant character encoding.
However, some APIs switched from ANSI/code pages to UTF-16 before UTF-8 became widespread. These include the Windows API starting with Windows NT (from 1993), Apple’s Cocoa API and Core Foundation, and Android (in the Java context). UTF-16 is also the internal string representation in programming languages such as Java, Kotlin, C# (.NET), Delphi (since 2009), and JavaScript.
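One practical consequence is that these languages report string lengths in 16-bit code units, so a character outside the Basic Multilingual Plane counts as two. The rough sketch below only mimics this in Python (whose own len() counts code points) to show the difference:

```python
s = 'Hi 😀'
print(len(s))                            # 4 -> Python counts code points
print(len(s.encode('utf-16-le')) // 2)   # 5 -> number of UTF-16 code units,
                                         #      which is what "Hi 😀".length() reports in Java or JavaScript
```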