Have you ever opened a text file only to find it full of mysterious characters like "锟斤拷烫烫烫" or "â¥人"? Or received an API response where all the non-Latin characters turned into "??"? Nearly all of these frustrating problems share the same root cause: character encoding mismatches. Understanding character encoding is not just about fixing bugs — it's about understanding the fundamental mechanism by which computers process text.
1. Characters and Numbers: How Does a Computer Store Text?
At the hardware level, computers only understand 0s and 1s. To store text, we need a mapping table that says "65 means uppercase A" or "20013 means the Chinese character 中." This system of rules is called character encoding.
Character encoding has two distinct components:
- Character Set: Defines which characters exist and assigns each a unique number (called a code point)
- Encoding Scheme: Defines how to represent those numbers as a sequence of bytes
This distinction is crucial: Unicode is a character set (it assigns numbers to over 140,000 characters), while UTF-8 is an encoding scheme (it defines how to convert those numbers into bytes). They are not the same thing.
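The distinction can be seen directly in Python: ord() returns a character's Unicode code point, while str.encode() produces its bytes under a particular encoding scheme. A minimal sketch:

```python
# One character, one code point -- but different bytes per encoding scheme.
cp = ord('中')                          # Unicode code point: 20013 (U+4E2D)
utf8_bytes = '中'.encode('utf-8')       # b'\xe4\xb8\xad' (3 bytes)
utf16_bytes = '中'.encode('utf-16-be')  # b'\x4e\x2d' (2 bytes)

print(hex(cp), utf8_bytes, utf16_bytes)
```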
2. ASCII: Where It All Began
In 1963, the American Standards Association (ASA, the predecessor of today's ANSI) published ASCII (American Standard Code for Information Interchange), the foundation of all modern text encoding. Using 7 bits, ASCII defines 128 characters (0–127):
- 0–31: Control characters (newline LF=10, carriage return CR=13, Tab=9)
- 32–47: Punctuation and space
- 48–57: Digits 0–9
- 65–90: Uppercase A–Z
- 97–122: Lowercase a–z
ASCII is perfect for English — but with only 128 characters, it can't accommodate accented European letters, let alone Chinese, Japanese, Arabic, or Korean.
The designers made the offset between uppercase and lowercase exactly 32, a difference of a single bit (the 6th bit, value 32), so toggling case only requires flipping that bit. This elegant design lives on today: in Python, ord('A') == 65 and ord('a') == 97.
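The bit trick is a few lines of Python (a sketch for illustration; real case conversion should of course use str.upper() and str.lower()):

```python
def toggle_ascii_case(ch: str) -> str:
    # XOR with 0x20 (binary 00100000) flips the 6th bit,
    # converting 'A' (65) <-> 'a' (97) for ASCII letters.
    return chr(ord(ch) ^ 0x20)

print(toggle_ascii_case('A'))  # a
print(toggle_ascii_case('z'))  # Z
```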
3. The Fragmented Era: Big5, GBK, Shift-JIS, and ISO 8859
As ASCII proved insufficient, different regions created their own incompatible extensions. This fragmentation is the origin of most garbled text problems.
3.1 Europe: ISO 8859 Series
ISO 8859 used the 8th bit (unused by ASCII) to extend the range from 128 to 256 possible characters. But it came in 15 published variants (ISO 8859-1 for Western European languages, 8859-7 for Greek, and so on), with the same byte value meaning different characters in different variants.
3.2 Traditional Chinese: Big5
Big5, created in Taiwan in 1984, uses 2 bytes per Chinese character and encodes about 13,060 Traditional Chinese characters. It dominated the Taiwanese and Hong Kong computing environments for decades.
3.3 Simplified Chinese: GB2312 / GBK
Mainland China published GB2312 in 1980 (6,763 Chinese characters), later extended by GBK (21,003 characters). Windows Simplified Chinese editions defaulted to GBK (code page 936).
3.4 Japanese: Shift-JIS and EUC-JP
Japanese computing was split between Shift-JIS (Windows default) and EUC-JP (Unix/Linux standard), which are mutually incompatible.
| Encoding | Language | Characters | Bytes/Char | Primary Region |
|---|---|---|---|---|
| ASCII | English | 128 | 1 | Global (English) |
| ISO 8859-1 | Western European | 256 | 1 | Europe |
| Big5 | Traditional Chinese | 13,060 | 2 | Taiwan, Hong Kong |
| GBK | Simplified Chinese | 21,003 | 2 | Mainland China |
| Shift-JIS | Japanese | ~6,879 | 1–2 | Japan |
| UTF-8 | All languages | 140,000+ | 1–4 | Global |
4. Unicode: The Unified Standard That Ended the Chaos
In 1987, engineers from Xerox and Apple began an ambitious project: creating a single character set that could encompass every writing system on Earth. Unicode 1.0 was published in 1991.
Unicode assigns each character a unique code point, written in the format U+ followed by a hexadecimal number:
- U+0041 → Uppercase A
- U+4E2D → Chinese character 中 (middle)
- U+AC00 → Korean syllable 가
- U+1F600 → 😀 (grinning face emoji)
- U+1F4A9 → 💩 (pile of poo emoji)
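In Python, chr() maps a code point back to its character, so these examples can be reproduced directly:

```python
# chr() converts a Unicode code point (an integer) to the character it names.
for cp in (0x41, 0x4E2D, 0xAC00, 0x1F600, 0x1F4A9):
    print(f"U+{cp:04X} -> {chr(cp)}")
```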
Unicode 14.0 defines 144,697 characters across 159 writing systems, from ancient Egyptian hieroglyphs to modern emoji.
Unicode vs. UTF-8 is the most common source of confusion. Unicode is the standard that assigns numbers to characters (the code points). UTF-8 is one way to serialize those numbers into bytes. Other encoding schemes for Unicode include UTF-16 and UTF-32.
5. UTF-8: Why It Became the Global Standard
Once Unicode defined the code points, a serialization method was needed. UTF-8, designed by Ken Thompson and Rob Pike in 1992, uses an ingenious variable-length encoding of 1 to 4 bytes:
5.1 UTF-8 Encoding Rules
- 1 byte (0xxxxxxx): U+0000 to U+007F — the full ASCII range. UTF-8 is 100% backward compatible with ASCII
- 2 bytes (110xxxxx 10xxxxxx): U+0080 to U+07FF — most European characters
- 3 bytes (1110xxxx 10xxxxxx 10xxxxxx): U+0800 to U+FFFF — CJK characters (Chinese, Japanese, Korean)
- 4 bytes (11110xxx ...): U+10000 and above — emoji and rare/ancient scripts
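These length classes are easy to verify in Python by encoding one character from each range (a quick sketch):

```python
# One character from each UTF-8 length class: ASCII, Latin-1 supplement,
# CJK (Basic Multilingual Plane), and supplementary plane (emoji).
for ch in ('A', 'é', '中', '😀'):
    b = ch.encode('utf-8')
    print(f"{ch!r}: {len(b)} byte(s) -> {b.hex(' ')}")
```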
5.2 Why UTF-8 Won
UTF-8 now accounts for over 97% of all web pages globally. Its advantages:
- ASCII backward compatibility: Pure ASCII content is identical in both ASCII and UTF-8. Existing systems required no modification
- Self-synchronizing: Multi-byte sequence headers (11xxxxxx) and continuation bytes (10xxxxxx) are distinct, allowing decoding to start at any point in a stream
- No byte-order issues: UTF-16 and UTF-32 require a Byte Order Mark (BOM) to indicate endianness. UTF-8 operates on individual bytes and has no such problem
- Balanced space efficiency: English text uses 1 byte per character (same as ASCII). CJK characters use 3 bytes each
| Encoding | Bytes/Char | ASCII Compatible | Byte-order Issue | Primary Use |
|---|---|---|---|---|
| UTF-8 | 1–4 (variable) | Yes | None | Web, file storage |
| UTF-16 | 2 or 4 | No | Yes (needs BOM) | Windows internals, Java strings |
| UTF-32 | 4 (fixed) | No | Yes | Internal processing, fast indexing |
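The trade-offs in the table can be observed by encoding the same string three ways (a sketch; the exact BOM bytes Python emits for 'utf-16' and 'utf-32' depend on the platform's native byte order):

```python
s = 'Hi中'  # two ASCII characters plus one CJK character
for enc in ('utf-8', 'utf-16', 'utf-32'):
    b = s.encode(enc)
    # The 'utf-16' and 'utf-32' codecs prepend a BOM; 'utf-8' does not.
    print(f"{enc}: {len(b)} bytes -> {b.hex(' ')}")
```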
6. How Does Garbled Text Actually Happen?
Garbled text (known as mojibake in Japanese) is simply the result of interpreting a byte sequence with the wrong encoding. Typical scenarios:
6.1 Opening a GBK File as UTF-8
The bytes of GBK-encoded Chinese text rarely form valid UTF-8 sequences, so a UTF-8 decoder turns them into replacement characters (�) or, depending on the software, question marks.
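This failure mode can be reproduced in a couple of lines of Python:

```python
raw = '中文'.encode('gbk')               # GBK bytes: d6 d0 ce c4
garbled = raw.decode('utf-8', errors='replace')
print(garbled)  # every invalid byte becomes U+FFFD, the � character
```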
6.2 Big5 vs GBK Cross-Reading
Big5 and GBK have overlapping byte spaces — the same byte value represents entirely different characters in each encoding. This was the source of countless data corruption incidents when exchanging files across the Taiwan Strait.
6.3 The "??" Database Problem
If your MySQL column or connection is configured as latin1 but your application stores UTF-8 text, MySQL converts every character that latin1 can't represent into ?. This is irreversible; the original data is permanently lost.
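The same one-way loss can be simulated in Python with the 'replace' error handler, standing in here for the database's conversion step (a simulation, not MySQL itself):

```python
# Characters with no latin-1 mapping are replaced by '?', the same
# irreversible substitution described above.
lossy = '台灣'.encode('latin-1', errors='replace')
print(lossy)                    # b'??'
print(lossy.decode('latin-1'))  # '??' -- the original text is gone
```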
6.4 Missing BOM in UTF-16 Files
UTF-16 needs a BOM (Byte Order Mark) to indicate whether it's Big-Endian or Little-Endian. Without it, software may parse with the wrong byte order, corrupting every character.
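A small Python sketch shows how the same two bytes change meaning with byte order:

```python
be = '中'.encode('utf-16-be')  # b'\x4e\x2d'
le = '中'.encode('utf-16-le')  # b'\x2d\x4e'
# Reading big-endian bytes with a little-endian decoder yields
# a completely different character:
print(be.decode('utf-16-le'))  # not 中
```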
6.5 The "锟斤拷" and "烫烫烫" Mystery
This famous garbled text appears when a GBK decoder processes UTF-8-encoded replacement characters (U+FFFD, bytes EF BF BD). Since replacement characters often appear in pairs, the six bytes EF BF BD EF BF BD decode in GBK as the repeating "锟斤拷" pattern. "烫烫烫" appears when GBK decodes 0xCC CC, the "uninitialized stack memory" fill pattern used by Visual C++ in debug mode (the corresponding heap-fill pattern, 0xCD CD, decodes as "屯屯屯").
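Both patterns can be reproduced with Python's codec machinery:

```python
# Two U+FFFD replacement characters serialized as UTF-8 (ef bf bd ef bf bd),
# then mis-decoded as GBK, yield the famous three characters:
print(('\ufffd' * 2).encode('utf-8').decode('gbk'))  # 锟斤拷

# Visual C++ debug builds fill uninitialized stack memory with 0xCC:
print((b'\xcc\xcc' * 3).decode('gbk'))  # 烫烫烫
```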
7. URL Encoding: Another Battlefield for Character Encoding
URLs may only contain specific ASCII characters. Non-ASCII characters must be percent-encoded: each byte of the character's UTF-8 representation is written as %XX (two hex digits).
For example, the Chinese characters "台灣" (Taiwan) encode as:
- 台 → UTF-8 bytes E5 8F B0 → %E5%8F%B0
- 灣 → UTF-8 bytes E7 81 A3 → %E7%81%A3
RFC 3986 specifies that percent-encoding for new URI schemes should be based on UTF-8. Earlier systems that used Big5 or GBK for URL encoding produced URLs that were interpreted differently on different systems.
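Python's urllib.parse follows this convention, percent-encoding each UTF-8 byte as %XX:

```python
from urllib.parse import quote, unquote

encoded = quote('台灣')   # percent-encodes each UTF-8 byte
print(encoded)            # %E5%8F%B0%E7%81%A3
print(unquote(encoded))   # 台灣
```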
8. How to Permanently Eliminate Garbled Text
The root solution is to enforce UTF-8 consistently at every layer of your stack:
8.1 Files and Source Code
- Save all text files as UTF-8 without BOM
- Add <meta charset="UTF-8"> at the very top of your HTML <head>
8.2 Database
- Create databases and tables with CHARACTER SET utf8mb4 (important: MySQL's utf8 is a broken 3-byte variant that can't store emoji or certain rare characters; always use utf8mb4)
- Run SET NAMES utf8mb4 on each connection, or specify charset=utf8mb4 in your PDO DSN
8.3 HTTP Headers
- Serve all text responses with Content-Type: text/html; charset=utf-8
- Add accept-charset="UTF-8" to your HTML forms
8.4 The MySQL utf8 Trap
MySQL has a famous historical bug: its utf8 character set only supports UTF-8 sequences up to 3 bytes long, while full UTF-8 requires up to 4 bytes (for emoji and supplementary characters). If you store an emoji like 😀 in a utf8 column, MySQL will either throw an error or silently truncate the data. The fix: always use utf8mb4.
9. Summary
The history of character encoding is the story of fragmentation giving way to unification:
- ASCII (1963): 128 characters; laid the digital foundation for English text and remains a universal subset of all modern encodings
- The Fragmented Era (1970s–1990s): Big5, GBK, Shift-JIS, and dozens of regional encodings emerged in parallel, leading to rampant incompatibility and garbled text
- Unicode (1991+): A single character set that assigns unique code points to every character in every writing system on Earth
- UTF-8 (1992+): The best encoding for Unicode — ASCII-compatible, space-efficient, and free of byte-order issues. Now the standard for 97%+ of the world's web pages
If you enforce UTF-8 consistently across your entire stack — files, code, database, and HTTP headers — garbled text becomes nearly impossible. And when it does appear, the diagnosis is clear: find the layer where the declared encoding doesn't match the stored bytes, fix it, and the problem disappears.