Character Encoding: ASCII, UTF-8, and Why Mojibake Happens

The problem encoding solves

Computers store bytes, which are 8-bit values from 0 to 255. Text is a sequence of characters, which are abstract symbols like A, é, 中, or 🎉. A character encoding is a mapping between characters and the byte sequences that represent them. Without an encoding, a sequence of bytes is just numbers; with an encoding, those numbers become readable text. The problem is that there are many encodings, they disagree with each other, and the same bytes can mean different things depending on which encoding you assume.

The history of encodings is a story of trying to fit more characters into fewer bytes, with each region and each vendor inventing their own solution. The result was decades of garbled text, broken email, and unreadable files. The modern solution is Unicode, a single character set that aims to include every character from every writing system ever used, paired with the UTF-8 encoding that represents Unicode characters as variable-length byte sequences. UTF-8 has won so thoroughly that more than 98 percent of the web uses it, but the legacy encodings are still around and still cause problems.

The reason encoding matters even today is that the bytes on disk do not declare what encoding they are in. A file of bytes could be ASCII, ISO-8859-1, UTF-8, UTF-16, or any of dozens of legacy code pages. The reader has to guess, and if it guesses wrong, the text is garbled. This garbling is called mojibake, from the Japanese moji (character) and bake (transformed), and it is what turns café into cafÃ© or résumé into rÃ©sumÃ©.

ASCII: the 7-bit foundation

The American Standard Code for Information Interchange, ASCII, was first published in 1963 and finalized in its current form in 1968. It defines 128 characters numbered 0 to 127, fitting in 7 bits per character. The first 32 codes (0 to 31) are control characters: 9 is tab, 10 is line feed, 13 is carriage return, 27 is escape, and so on. Code 127 is the delete character. Codes 32 to 126 are the printable characters: space, punctuation, digits, uppercase letters (65 to 90), lowercase letters (97 to 122), and a few symbols.
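The code ranges above are easy to check interactively. The following is a minimal sketch using Python's built-in ord and chr; the names codes, printable, and case_bit are made up for this illustration.

```python
# ord() gives a character's numeric code; chr() goes the other way.
codes = {
    name: ord(ch)
    for name, ch in [
        ("tab", "\t"),
        ("line feed", "\n"),
        ("carriage return", "\r"),
        ("escape", "\x1b"),
        ("space", " "),
        ("A", "A"),
        ("a", "a"),
    ]
}

# Codes 32 to 126 are the printable range: 95 characters in total.
printable = "".join(chr(c) for c in range(32, 127))

# Each lowercase letter sits exactly 32 positions after its uppercase
# form, so the two cases differ in a single bit.
case_bit = ord("a") ^ ord("A")

print(codes)
print(printable)
print(hex(case_bit))
```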

ASCII was a brilliant design for its time. The choice of 7 bits left the high bit of each byte free for parity checking, which mattered on noisy telephone lines. Placing each case of the alphabet in one contiguous, alphabetically ordered run meant that a simple byte-by-byte comparison sorted text correctly, at least within a single case (every uppercase letter sorts before every lowercase one). The inclusion of control characters for terminal operations meant that ASCII could drive both printers and screens. For English-language computing, ASCII was complete and sufficient.

The limitation was obvious from the start: 128 characters are barely enough for English and cannot cover much else. The accented characters of Western European languages, the Cyrillic alphabet, the Greek alphabet, Arabic, Hebrew, and the CJK characters (Chinese, Japanese, Korean) all fall outside ASCII. Every language community needed more than 128 characters, and the high bit of the byte, unused by ASCII, was the obvious place to put them.

The 8-bit era: ISO-8859 and the code-page mess

The 8-bit byte could hold 256 characters, double what ASCII offered. The ISO-8859 series of standards, published starting in 1987, defined 15 parts, each covering a different region. ISO-8859-1 (Latin-1) covered Western European languages: French, Spanish, Portuguese, German, and the Nordic languages. ISO-8859-2 covered Central European. ISO-8859-5 covered Cyrillic. ISO-8859-7 covered Greek. Each part kept the ASCII characters in positions 0 to 127, reserved positions 128 to 159 for additional control codes, and defined its own printable characters in positions 160 to 255. An ASCII text was therefore valid in any ISO-8859 part, but a text with accented characters was only valid in the part it was written for.
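To make the consequence concrete, here is a small sketch that decodes the same four bytes under three ISO-8859 parts using Python's built-in codecs. The variable names are illustrative only.

```python
# Three ASCII bytes followed by one byte from the upper half of the range.
raw = bytes([0x63, 0x61, 0x66, 0xE9])

as_latin1 = raw.decode("iso-8859-1")    # Western European reading
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic reading
as_greek = raw.decode("iso-8859-7")     # Greek reading

# The ASCII prefix is identical in every part; only the last byte changes.
for label, text in [("8859-1", as_latin1),
                    ("8859-5", as_cyrillic),
                    ("8859-7", as_greek)]:
    print(label, text, hex(ord(text[-1])))
```

Nothing in the bytes themselves says which of the three readings the author intended.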

Meanwhile, IBM and Microsoft had their own code pages, which were incompatible with ISO-8859 and with each other. IBM's EBCDIC was a completely different mapping from ASCII. Code page 437 was the original IBM PC character set, and code page 850 was its multinational variant. Code page 1252 was the Windows counterpart of ISO-8859-1, but it put printable characters such as curly quotes and the euro sign in positions 128 to 159, breaking compatibility. The Macintosh used its own Roman encoding, different again. Email used a 7-bit transport and required quoted-printable or Base64 encoding for any non-ASCII character. The result was that a text file written on one system might be unreadable on another, and there was no way to know which encoding to use.

This is the world that mojibake came from. A text file written in ISO-8859-1, opened in a system assuming Windows-1252, would mostly work because the two encodings are similar. A file written in Shift-JIS (a Japanese encoding), opened in a system assuming ISO-8859-1, would be unintelligible garbage. Any ASCII portions would survive, but the Japanese text would collapse into accented Latin characters and stray symbols that meant nothing. The Japanese coined the term mojibake for this phenomenon, and it spread as the problem became universal.
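The difference between code page 1252 and ISO-8859-1 can be seen with a couple of bytes from the 128 to 159 range. This is a minimal sketch using Python's cp1252 and iso-8859-1 codecs; the sample bytes are chosen for illustration.

```python
# 0x93 and 0x94 are curly quotes in Windows-1252,
# but control codes in ISO-8859-1.
raw = b"\x93quoted\x94"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

print([hex(ord(c)) for c in (as_cp1252[0], as_cp1252[-1])])  # U+201C, U+201D
print([hex(ord(c)) for c in (as_latin1[0], as_latin1[-1])])  # U+0093, U+0094

# From 0xA0 upward the two encodings agree byte for byte.
upper = bytes(range(0xA0, 0x100))
same_upper = upper.decode("cp1252") == upper.decode("iso-8859-1")
print(same_upper)
```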

Unicode and how UTF-8 works

Unicode was started in 1987 and first published in 1991, with the goal of assigning a unique number (a code point) to every character in every writing system. Code points are written as U+ followed by four to six hexadecimal digits: U+0041 is A, U+00E9 is é, U+4E2D is 中, U+1F389 is 🎉. The current version of Unicode (15.1, as of 2023) defines about 150,000 characters, including all modern scripts, many historical scripts, mathematical symbols, and thousands of emoji.

Unicode defines code points, but the code points themselves say nothing about bytes; that is the job of an encoding form. There are three main encodings. UTF-32 uses 4 bytes per code point, which is simple but wasteful: an English text file becomes four times larger than ASCII. UTF-16 uses 2 bytes for most characters and 4 bytes for the rest, which is a compromise but has endianness issues (UTF-16LE vs UTF-16BE) and doubles the size of English text. UTF-8 uses 1 byte for ASCII, 2 bytes for code points up to U+07FF (accented Latin, Greek, Cyrillic, Hebrew, Arabic), 3 bytes for the rest of the Basic Multilingual Plane (including most CJK characters), and 4 bytes for everything above U+FFFF, such as emoji.
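The size trade-offs can be measured directly. The sketch below encodes the four sample characters used earlier in each encoding and counts the bytes; it uses the little-endian codec names so that Python does not prepend a byte order mark to the output.

```python
samples = ["A", "é", "中", "🎉"]
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Map each character to its encoded length in bytes per encoding.
sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch, row in sizes.items():
    print(f"U+{ord(ch):04X}", row)
```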

UTF-8, designed by Ken Thompson and Rob Pike in 1992, is the encoding that won. The design is elegant: every ASCII character is a single byte with the same value as in ASCII, so any ASCII text is valid UTF-8 and any UTF-8 text that contains only ASCII is byte-for-byte identical to ASCII. The first byte of a multi-byte sequence indicates how many bytes follow, using a prefix of 110 for a 2-byte sequence, 1110 for 3 bytes, 11110 for 4 bytes; continuation bytes start with 10. This self-synchronizing property means that if a byte is lost or corrupted, the decoder can resync at the next character boundary, which older encodings could not do.
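The bit layout is simple enough to write out by hand. The following is an illustrative sketch, not production code: utf8_encode and next_boundary are hypothetical helper names, and real programs should use the built-in str.encode instead.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point as UTF-8 by hand (illustration only)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),   # 11110xxx + 3 continuations
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


def next_boundary(data: bytes, i: int) -> int:
    """Skip continuation bytes (10xxxxxx) to reach the next character start."""
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i


for ch in "Aé中🎉":
    print(ch, utf8_encode(ord(ch)).hex(" "))
```

The second helper shows the self-synchronizing property: starting from any offset, a decoder only has to skip bytes of the form 10xxxxxx to land on the start of a character.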

UTF-8 won for several reasons. It is backward-compatible with ASCII, which means existing English text and English-language protocols keep working. It is compact for English text, which is the majority of legacy data. It has no endianness issue, unlike UTF-16. It is self-synchronizing, which makes it robust to corruption. The XML and HTML specifications made UTF-8 the default encoding, the JSON specification requires UTF-8, and modern operating systems and programming languages default to it. The victory was slow but decisive: in the late 2000s, UTF-8 still accounted for a minority of web pages; today it is over 98 percent.

Why mojibake happens

Mojibake happens when bytes are decoded using the wrong encoding. The classic case is a UTF-8 file opened as ISO-8859-1. The character é is encoded in UTF-8 as two bytes: 0xC3 0xA9. ISO-8859-1 reads each byte separately: 0xC3 is Ã and 0xA9 is ©. So café becomes cafÃ©. The reverse case is an ISO-8859-1 file opened as UTF-8. In UTF-8, the byte 0xE9 announces a 3-byte sequence and must be followed by two continuation bytes; in Latin-1 text it is followed by an ordinary character or the end of the file, so the sequence is invalid. The decoder either inserts the replacement character U+FFFD (usually drawn as a diamond with a question mark) or aborts with an error.
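Both directions can be reproduced in a few lines. This sketch uses only built-in codecs; the variable names are made up for the example, and the repair step works only because ISO-8859-1 maps every byte to a character, so no information is lost in the wrong decode.

```python
original = "café"

# Direction 1: UTF-8 bytes misread as ISO-8859-1.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)

# Undo it by reversing the wrong step, then decoding correctly.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)

# Direction 2: ISO-8859-1 bytes misread as UTF-8.
latin1_bytes = original.encode("iso-8859-1")
lenient = latin1_bytes.decode("utf-8", errors="replace")
print(lenient)  # ends with U+FFFD, the replacement character

try:
    latin1_bytes.decode("utf-8")  # strict mode is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)
```

Note that the second direction is not reversible: once a byte has been replaced by U+FFFD, the original character is gone.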

The fix is to know the encoding of the source and convert correctly. If you have a string of bytes and you know it is ISO-8859-1, decode it as ISO-8859-1 to get Unicode code points, then encode those code points as UTF-8 for storage or transmission. Most modern languages have functions for this: in JavaScript, TextDecoder can decode bytes as any of dozens of encodings; in Python, the str type has an encode method and the bytes type has a decode method, both taking an encoding argument.
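In Python, the decode-then-encode step described above might look like the following sketch. The function name transcode is hypothetical; the encoding names are standard Python codec names.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Convert bytes from a known source encoding to a target encoding."""
    text = data.decode(source)   # bytes -> Unicode code points
    return text.encode(target)   # code points -> bytes in the new encoding


legacy = b"r\xe9sum\xe9"                 # "résumé" in ISO-8859-1
converted = transcode(legacy, "iso-8859-1")
print(converted)
```

Both steps raise an exception by default if the data does not fit the stated encoding, which is usually what you want: a loud failure is easier to fix than silently stored mojibake.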

The hard part is detecting the encoding when it is not declared. There is no completely reliable way to detect encoding from the bytes alone, because the same bytes can be valid in multiple encodings. Heuristic detectors, such as the charset-detection libraries built into Firefox and Chrome, guess correctly most of the time for common cases, but they are not perfect. The right answer is to always declare the encoding: in HTML, the meta charset tag should appear in the first 1024 bytes; in email, the Content-Type header specifies the charset; in HTTP, the Content-Type header does the same. If everyone declared the encoding, mojibake would be a historical curiosity.
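A common pattern is to trust the declaration when there is one and fall back to a cautious guess when there is not. The sketch below is one possible approach, not a full detector: declared_charset and decode_body are hypothetical names, the choice of cp1252 as the last resort is an assumption, and an unknown charset name in the header would raise LookupError.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased name, or None


def decode_body(body: bytes, content_type: str,
                fallback: str = "cp1252") -> str:
    charset = declared_charset(content_type)
    if charset:
        return body.decode(charset)
    # No declaration: valid UTF-8 is rarely valid by accident, so try it
    # first, then fall back to a single-byte guess (an assumption).
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        return body.decode(fallback, errors="replace")


print(declared_charset("text/html; charset=UTF-8"))
print(decode_body(b"caf\xe9", "text/plain"))
```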

The BOM, normalization, and other gotchas

The Byte Order Mark (BOM) is a Unicode character (U+FEFF) placed at the start of a text file to indicate the encoding and byte order. For UTF-8, the BOM is the three bytes EF BB BF. For UTF-16, it is FF FE (little-endian) or FE FF (big-endian), and the BOM matters because UTF-16 has endianness. For UTF-8, the BOM is optional and controversial: UTF-8 has no byte order to mark, so the BOM serves only as an encoding signature and conveys no information that an encoding declaration does not. It appears as the characters ï»¿ if the file is opened as ISO-8859-1, and tools that do not expect it treat it as stray bytes at the start of the file, which has caused countless bugs in shell scripts and config files. UTF-8 files for the web should not have a BOM; UTF-16 files should have one unless the byte order is declared some other way.
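Python's standard library exposes the BOM byte sequences as constants in the codecs module and provides a utf-8-sig codec that strips a leading BOM on decode. A short sketch, with a made-up one-line shell script as the payload:

```python
import codecs

# A UTF-8 file that was saved with a BOM in front of the shebang line.
with_bom = codecs.BOM_UTF8 + "#!/bin/sh\n".encode("utf-8")

plain = with_bom.decode("utf-8")         # keeps U+FEFF as the first character
stripped = with_bom.decode("utf-8-sig")  # drops a leading BOM if present

# The three BOM bytes as they appear when misread as ISO-8859-1.
as_latin1 = with_bom[:3].decode("iso-8859-1")

print(repr(plain[0]), repr(stripped[:2]), as_latin1)
print(codecs.BOM_UTF16_LE.hex(" "), codecs.BOM_UTF16_BE.hex(" "))
```

The utf-8-sig codec is a reasonable choice when reading files of unknown origin, since it also decodes UTF-8 input that has no BOM.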

Unicode normalization is the second gotcha. The character é can be represented in two ways in Unicode: as the single code point U+00E9 (precomposed, the NFC form), or as the letter e (U+0065) followed by the combining acute accent (U+0301) (decomposed, the NFD form). Both render identically, but they are different code point sequences, and therefore different bytes, and will not compare as equal in a naive string comparison. Databases and search indexes need to normalize text to a consistent form before comparing. Unicode defines four normalization forms (NFC, NFD, NFKC, NFKD), and NFC is the recommended default for the web.
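The standard library's unicodedata module implements all four forms. A minimal sketch of the comparison problem and its fix (the ligature example for NFKC is an added illustration, not from the text above):

```python
import unicodedata

composed = "\u00e9"      # é as one code point
decomposed = "e\u0301"   # e followed by a combining acute accent

naive_equal = composed == decomposed

# Normalize both sides to the same form before comparing.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# NFD goes the other way and splits the precomposed character.
nfd = unicodedata.normalize("NFD", composed)

# The K (compatibility) forms also fold presentation variants,
# for example the "fi" ligature U+FB01 into two plain letters.
nfkc_ligature = unicodedata.normalize("NFKC", "\ufb01")

print(naive_equal, nfc_equal, len(nfd), nfkc_ligature)
```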

The third gotcha is grapheme clusters. The emoji 👨‍👩‍👧‍👦 (family with four members) is a single grapheme cluster but seven Unicode code points: four person emoji joined by three zero-width joiners (U+200D). A naive string length function does not return 1: it returns 7 in languages that count code points, such as Python, and 11 in languages that count UTF-16 code units, such as JavaScript. The user perceives one character. Most programming languages have libraries for grapheme cluster iteration (the Intl.Segmenter API in JavaScript, the grapheme library in Python, the \X regex construct in PCRE), but the default string functions give the wrong answer. If you are counting characters for a UI limit, use grapheme clusters, not code points.
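The different "lengths" of the family emoji can be computed with the standard library alone. This sketch deliberately stops short of counting grapheme clusters, because that requires a third-party package; the escapes spell out the seven code points so the example does not depend on how an editor displays the emoji.

```python
# man, ZWJ, woman, ZWJ, girl, ZWJ, boy
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

code_points = len(family)                           # what Python's len() counts
utf16_units = len(family.encode("utf-16-le")) // 2  # what JavaScript's .length counts
utf8_bytes = len(family.encode("utf-8"))            # storage size in UTF-8

print(code_points, utf16_units, utf8_bytes)

# Counting user-perceived characters needs a grapheme-aware library,
# for example the third-party `regex` module's \X pattern (not run here).
```

None of the three numbers is the one a user would give, which is why UI limits should be enforced with a grapheme-aware count.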