Why Unicode and UTF-8?

Anyone with programming experience from, say, the 1990s still remembers how easy it was to evaluate or modify a string — because each byte stood for exactly one character. But with UTF-8, things can be a bit trickier today. Here’s a look back at the good reasons behind the development of Unicode and UTF-8 ...
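
A minimal Python sketch illustrates the difference (the string "Grüße" is just an arbitrary example):

    # With UTF-8, one character no longer corresponds to exactly one byte.
    text = "Grüße"                # 5 characters; 'ü' and 'ß' need 2 bytes each in UTF-8
    data = text.encode("utf-8")   # b'Gr\xc3\xbc\xc3\x9fe'

    print(len(text))              # 5 (characters / code points)
    print(len(data))              # 7 (bytes)
    print(data[2])                # 195: indexing the bytes no longer yields a whole character
    print(text[2])                # 'ü': indexing the decoded string still does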

Uppercase Letters Only

It’s almost forgotten now: the popular home computer C-64 (released in 1982) didn’t support ↗European umlauts. Another limitation: if a game wanted to use many graphical characters, it had to use the character set containing the full set of graphic symbols, and that set had no lowercase letters. That’s why many games displayed their text in uppercase only, and German text adventures replaced umlauts with AE, OE, and UE.

6 Bits or 7 Bits

Things were even simpler with early computers in the 1950s, where a byte might consist of only 6 bits. Yes, really: a byte hadn’t yet been standardized as 8 bits. With only 6-bit values (0 to 63), you could just fit in a few control characters (like line feed and carriage return for printers), plus the 10 digits, some punctuation, and only the uppercase Latin letters. That wasn’t a problem at the time, since mainframes — controlled via punched tape — were used for computing, not text.

It wasn’t until the IBM mainframe ↗System/360 (1964) that a byte was officially defined as 8 bits. This allowed the encoding tables to finally include lowercase letters and more punctuation. IBM’s proprietary encoding system was called ↗EBCDIC (developed in 1963), while US-ASCII also appeared in 1963 as a general encoding recommendation — but it was not compatible with EBCDIC. ASCII used only 7 bits (values 0 to 127), reserving the 8th bit as a parity bit — a check bit for detecting transmission errors when communicating with printers and terminals.
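
Both points can be illustrated with a small Python sketch; Python still ships EBCDIC codecs such as cp037, and the parity calculation below is only a schematic illustration of the idea:

    # EBCDIC and ASCII assign different byte values even to plain letters.
    print("A".encode("cp037"))   # b'\xc1' -> 0xC1 in EBCDIC (code page 037)
    print("A".encode("ascii"))   # b'A'    -> 0x41 in ASCII

    # Schematic even-parity bit: the 8th bit is set so that the total
    # number of 1-bits in the byte becomes even.
    value = ord("A")                          # 0x41 = 0b1000001, two 1-bits
    parity = bin(value).count("1") % 2        # 0 -> already even
    byte_with_parity = value | (parity << 7)  # the check bit goes into the 8th bit
    print(hex(byte_with_parity))              # 0x41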

Code Pages

In the 1960s, IBM gained more and more customers in Europe, where language-specific characters — such as German umlauts — were needed. To address this, IBM created over 200 national code pages (character tables) as language-specific, regional variants of EBCDIC for use on its mainframes.

For the IBM PC (introduced in 1981), IBM developed the ASCII-based 8-bit code page ↗CP437, which included umlauts and accented characters. Later, other ASCII-compatible code pages were created for different regions and languages. Starting with ↗PC-DOS and MS-DOS 3.3 (1987), users could define the code page themselves in the CONFIG.SYS file.
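
Python still ships a cp437 codec, so the layout of this code page can be inspected directly; the byte values below are just a small sample:

    # CP437, the original IBM PC code page, places umlauts and box-drawing
    # characters in the upper half of the byte range (0x80-0xFF).
    dos_bytes = bytes([0x81, 0x84, 0x94, 0xC9, 0xCD, 0xBB])
    print(dos_bytes.decode("cp437"))   # üäö╔═╗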

ISO 8859

In 1987, a new international standard called ↗ISO 8859 was introduced for the languages of Europe and the Middle East. It was based on 8 bits, remained ASCII-compatible, and was split into separate code pages called "parts" (sub-standards).
But for DOS, ISO 8859 came a few years too late, since DOS already had several of its own ASCII-compatible code pages. Linux, on the other hand, was first developed in 1991, when ISO-8859-1 (Latin-1) was already well established, so it could adopt the ISO standard from the start. The ISO 8859 encodings were also widely used for email and HTML.

Around the same time, Microsoft also shifted toward ISO 8859. Since Windows 3.1 (1992), Windows no longer relied on the active DOS code page, but instead used "ANSI Code Pages", which are based on the ISO family. For example, CP1252 (Windows-1252, Western European) corresponds to ISO-8859-1.
However, because Windows was a graphical user interface, there was a desire to include typographic symbols for the first time — for instance, these “ ” quotation marks could now be displayed. So Microsoft replaced the unused, non-printable control codes in the range 0x80 to 0x9F with such typographic characters.
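
A small Python comparison makes the difference in the 0x80 to 0x9F range visible (the sample bytes are arbitrary):

    # 0x93 and 0x94 are unused C1 control codes in ISO-8859-1,
    # but typographic quotation marks in Windows-1252.
    data = b"\x93Hello\x94"

    print(data.decode("cp1252"))    # “Hello”
    print(data.decode("latin-1"))   # same bytes become invisible control characters
    print(data.decode("latin-1").encode("unicode_escape"))   # b'\\x93Hello\\x94'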


Code Page Issues

A major issue with code pages is that — beyond ASCII — they are not compatible with one another. You can encode and display either Western European, or Greek, or Cyrillic characters, etc. — but not all together in one text. And if you don’t know which code page a text or file was encoded with, it may not be readable at all.
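
A short Python sketch shows this classic failure mode, using a German word as an arbitrary example:

    # The same bytes read with the wrong code page produce garbage ("mojibake").
    data = "für".encode("cp1252")     # b'f\xfcr'

    print(data.decode("cp1252"))      # für  (correct code page)
    print(data.decode("cp437"))       # fⁿr  (DOS code page: 0xFC is 'ⁿ' there)
    print(data.decode("iso8859-7"))   # fόr  (Greek code page: 0xFC is 'ό' there)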

Another problem arises in the Asian region with its complex writing systems. Some scripts consist of several thousand characters, which simply can’t be represented within 8 bits (256 code points).


Unicode Started with 16 Bits

Starting in 1987, employees from various computer companies began developing the Unicode standard based on 16 bits. They wrote: "Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as 'wide-body ASCII' that has been stretched to 16 bits to encompass the characters of all the world's living languages."

After several years of development, version 1.0 of the Unicode standard was released in 1991, containing over 7,000 characters. To support the project, the Unicode Consortium had already been formed, with members including well-known companies such as Adobe, Apple, Borland, DEC, IBM, Lotus, Microsoft, NeXT, Novell, Sun Microsystems, Symantec, Unisys, WordPerfect, and Xerox.


Unicode Actually Needs 32 Bits

In the years that followed, more and more languages and characters were added to the Unicode tables, so that by 1993 the number had already reached 34,000 characters. It became clear that the 64K limit of 16 bits (UCS-2) would be exceeded — especially if historical writing systems were also to be included.

That’s why, in 1996, Unicode version 2.0 defined how the standard scales beyond 16 bits: the most commonly used characters can still be encoded in a single 16-bit word, while rare or historical characters use a second 16-bit word (a so-called surrogate pair). This flexible memory usage is implemented in the UTF-16 encoding format, which replaced the fixed-width UCS-2.
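
This can be observed directly in Python; the treble clef U+1D11E below merely serves as an example of a "rare" character:

    # In UTF-16, common characters take one 16-bit unit, rare ones two (a surrogate pair).
    print(len("€".encode("utf-16-le")))    # 2 bytes -> one 16-bit unit (U+20AC)
    print(len("𝄞".encode("utf-16-le")))    # 4 bytes -> surrogate pair  (U+1D11E)

    print("𝄞".encode("utf-16-be").hex())   # d834dd1e (the two surrogate code units)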


UTF-8 Is Space-Efficient

UTF-8 was developed in 1992 at Bell Labs, independently of the Unicode Consortium. UTF-8 is backward-compatible with ASCII, meaning ASCII characters still occupy only one byte. Thanks to its variable length of up to 4 bytes, it can encode all Unicode code points; the Unicode code space now comprises over 1 million possible code points.
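
A quick Python loop over a few sample characters (chosen arbitrarily) shows the variable byte lengths:

    # UTF-8 uses 1 to 4 bytes per code point; pure ASCII stays at 1 byte.
    for ch in ("A", "ü", "€", "𝄞"):
        encoded = ch.encode("utf-8")
        print(ch, f"U+{ord(ch):04X}", len(encoded), "byte(s),", encoded.hex())

    # Output:
    # A U+0041 1 byte(s), 41
    # ü U+00FC 2 byte(s), c3bc
    # € U+20AC 3 byte(s), e282ac
    # 𝄞 U+1D11E 4 byte(s), f09d849e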

UTF-8 is space-efficient: texts that consist mostly of ASCII characters need only about one byte per character. Combined with its ASCII compatibility, this made it easy to integrate into existing systems — a key factor in its widespread adoption on Linux systems and the Web. Although the Unicode Consortium originally promoted UTF-16 as the primary encoding, UTF-8 was officially added to the Unicode standard later (starting with Unicode 3.0 in 1999).

From the mid-2000s, most Linux distributions (Debian, Red Hat, later Ubuntu) began using UTF-8 by default. On the Internet as well, UTF-8 is now the dominant character encoding.

However, some APIs were switched from ANSI/code pages to UTF-16 before UTF-8 became widespread. These include the Windows API starting with Windows NT (from 1993), Apple’s Cocoa API and Core Foundation, and Android (in the Java context). UTF-16 is also the internal string representation in programming languages such as Java, Kotlin, C# (.NET), Delphi (since 2009), and JavaScript.
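
As a rough illustration in Python: Python strings count code points, but the length that a UTF-16-based string API (such as Java's String.length()) reports corresponds to the number of 16-bit code units, which can be reproduced by encoding to UTF-16:

    # Languages with UTF-16 strings (Java, C#, JavaScript, ...) count 16-bit
    # code units; Python counts code points.  The code-unit count can be
    # reproduced by encoding to UTF-16.
    text = "a𝄞"                                        # 'a' plus a character outside the BMP

    code_points = len(text)                            # 2
    utf16_units = len(text.encode("utf-16-le")) // 2   # 3 (what a UTF-16 API would report)

    print(code_points, utf16_units)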


Please send feedback, suggestions, etc. via email.