WHAT IS UTF-8
This document contains a minimal collection of information for you to understand
UTF-8 (which is the encoding used by this website).
For more information, please check out the
Unicode Consortium website.
UTF-8 is a popular encoding form of the Unicode/ISO-10646 standard.
Don't worry if that doesn't make much sense to you yet; read on and things will become clear.
In the early days, there were two independent attempts to create a unified character
set. One was the ISO 10646 project of the
International Organization for Standardization (ISO),
the other was the Unicode Project
organized by a consortium later known as the
Unicode Consortium. ISO came up with a standard called
ISO 10646 which defines a huge 31-bit
Universal Character Set (UCS). A 16-bit subset of
UCS (which contains 65534 characters) is called the
Basic Multilingual Plane (BMP)
and is the part that was populated first. The Unicode Consortium,
on the other hand, was working on its own standard, called the Unicode standard.
Having two independent standards is certainly not something
people would call "unified". Both ISO and the Unicode Consortium realized that and
decided to form a joint effort in 1991. Since then, new versions of the Unicode standard
have been kept fully compatible and synchronized with the corresponding versions of ISO
10646. All characters are located at the same positions and have the same names in both
standards.
Theoretically, the 31-bit UCS can contain about two billion characters. The number
of characters that are actually defined, however, is much smaller (but has been
growing over time). Version 3.2 of the Unicode standard,
for instance, provided codes for 95221 characters (which already goes
beyond the BMP). Unicode is stable:
its growth is strictly additive, meaning that
only new characters will be added, and no existing characters will be removed or renamed
in the future.
Unicode and ISO 10646 are first of all code tables that assign integer numbers to characters.
Hexadecimal numbers for those integer values are commonly preceded by "U+".
For instance, U+0041 is the character "Latin capital letter A".
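As a small illustration of the "U+" notation, the following Python sketch looks up the integer value and the official name of a character. It relies only on the standard library; the variable names are illustrative.

```python
import unicodedata

ch = "A"
code_point = ord(ch)              # the integer the code table assigns to "A"
label = f"U+{code_point:04X}"     # conventional "U+" hexadecimal notation
name = unicodedata.name(ch)       # the character's official name

print(label, name)                # U+0041 LATIN CAPITAL LETTER A
print(chr(0x2260))                # the reverse direction: integer to character
```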
Given the integer values, it is up to the character encoding standards
(encoding forms) to define how these values should be represented as a byte sequence.
The Unicode standard defines three encoding forms that allow the same character data
to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16,
or 32 bits per code unit). These three encoding forms are called UTF-8, UTF-16
and UTF-32 respectively. There are other encoding forms defined by ISO. The abbreviation
UTF stands for Unicode (or UCS) Transformation Format.
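To make the difference between the three encoding forms concrete, here is a minimal sketch that encodes one character with Python's built-in codecs. The big-endian codec variants are chosen here only so that no byte order mark is added to the output; that choice is an assumption of this example, not something the text prescribes.

```python
ch = "\u2260"  # the "not equal to" sign, U+2260

# Same code point, three different byte representations.
encoded = {codec: ch.encode(codec) for codec in ("utf-8", "utf-16-be", "utf-32-be")}

for codec, data in encoded.items():
    hex_bytes = " ".join(f"{b:02X}" for b in data)
    print(f"{codec:10} {len(data)} byte(s): {hex_bytes}")
```

The character takes three bytes in UTF-8, one 16-bit unit in UTF-16, and one 32-bit unit in UTF-32.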
UTF-8 transforms every Unicode character into a variable-length byte sequence. It
has the following properties:
- Characters U+0000 to U+007F (ASCII) are encoded as a single byte 0x00 to 0x7F. This
  means UTF-8 is fully compatible with ASCII.
- All characters greater than U+007F are encoded as a sequence of several bytes, all
  of which are above 0x7F (that is, none of them is an ASCII byte). This makes it
  unambiguous whether a byte belongs to a multi-byte character or is an ASCII character.
- The first byte of a multi-byte sequence (which represents a non-ASCII character) is always
  in the range 0xC0 to 0xFD. All further bytes in the sequence are in the range
  0x80 to 0xBF. This makes the boundaries of multi-byte characters unambiguous
  (in fact, the first byte also encodes how many bytes follow for the character).
- The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
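The byte ranges above are enough to find character boundaries without decoding anything. The sketch below shows one way to do that; `classify` and `split_characters` are hypothetical helper names, and the code checks only the ranges stated above, so it is not a full validity check.

```python
def classify(byte):
    """Label a single byte using the ranges described above."""
    if byte <= 0x7F:
        return "ascii"
    if byte <= 0xBF:
        return "continuation"   # 0x80..0xBF: second or later byte of a sequence
    if byte <= 0xFD:
        return "lead"           # 0xC0..0xFD: first byte of a multi-byte sequence
    return "invalid"            # 0xFE and 0xFF never occur in UTF-8


def split_characters(data):
    """Split UTF-8 bytes into one chunk per character, using byte classes only."""
    chunks = []
    for b in data:
        if classify(b) == "continuation" and chunks:
            chunks[-1] += bytes([b])   # belongs to the character already started
        else:
            chunks.append(bytes([b]))  # ASCII byte or lead byte starts a new character
    return chunks


print(split_characters("a\u00e9\u2260".encode("utf-8")))
```

Because continuation bytes can never be mistaken for ASCII or lead bytes, a reader that starts in the middle of a stream can resynchronize by skipping forward to the next non-continuation byte.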
To fully encode all 2^31 possible characters in UCS, a UTF-8 encoded
character can be up to six bytes long, but characters in the 16-bit BMP are at most
three bytes long. The following byte sequence formats are used to
represent a character in UTF-8:
|U-00000000 - U-0000007F:
||0xxxxxxx
|U-00000080 - U-000007FF:
||110xxxxx 10xxxxxx
|U-00000800 - U-0000FFFF:
||1110xxxx 10xxxxxx 10xxxxxx
|U-00010000 - U-001FFFFF:
||11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
|U-00200000 - U-03FFFFFF:
||111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
|U-04000000 - U-7FFFFFFF:
||1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
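The table can be turned into code fairly directly. The following is a minimal sketch of an encoder that follows the six formats as listed; `FORMATS` and `encode_code_point` are hypothetical names. Python's built-in "utf-8" codec stops at U+10FFFF (four bytes), in line with later revisions of the standard, so the five- and six-byte rows can only be exercised by a hand-written routine like this one.

```python
# (highest code point in the range, total bytes, marker bits of the first byte)
FORMATS = [
    (0x7F,       1, 0x00),  # 0xxxxxxx
    (0x7FF,      2, 0xC0),  # 110xxxxx
    (0xFFFF,     3, 0xE0),  # 1110xxxx
    (0x1FFFFF,   4, 0xF0),  # 11110xxx
    (0x3FFFFFF,  5, 0xF8),  # 111110xx
    (0x7FFFFFFF, 6, 0xFC),  # 1111110x
]


def encode_code_point(cp):
    """Encode one code point using the byte formats from the table."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    for limit, length, marker in FORMATS:
        if cp <= limit:
            break
    else:
        raise ValueError("code point does not fit in 31 bits")

    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # 10xxxxxx: the lowest six remaining bits
        cp >>= 6
    # Whatever bits are left go into the first byte, after its marker bits.
    return bytes([marker | cp] + tail[::-1])


print(encode_code_point(0x2260).hex())  # e289a0
```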
The "x" bits are filled with the bits of the character code number as
assigned by the Unicode standard. For instance, the Unicode character U+2260
= 0010 0010 0110 0000
(the symbol "not equal to", ≠)
belongs to the third category in the table. Filling these 16 bits into the 16
"x" positions of that format gives the UTF-8 encoding of the character:
11100010 10001001 10100000 = 0xE2 0x89 0xA0
Notice that the bits following the "1110" and "10" markers, read from left to right,
are exactly the 16 bits assigned by the Unicode standard.
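The bit-filling step for U+2260 can also be reproduced mechanically. This sketch takes the three-byte format as a template string and drops the 16 code point bits into the "x" positions, one by one; the variable names are illustrative only.

```python
code_point = 0x2260
bits = f"{code_point:016b}"                 # the 16 bits of the code point
template = "1110xxxx 10xxxxxx 10xxxxxx"     # three-byte format from the table

# Replace each "x" with the next code point bit, left to right.
payload = iter(bits)
filled = "".join(next(payload) if c == "x" else c for c in template)

# Turn each group of eight bits into one byte.
encoded = bytes(int(group, 2) for group in filled.split())

print(filled)          # 11100010 10001001 10100000
print(encoded.hex())   # e289a0
```

The result agrees with what Python's own codec produces for the same character, `"\u2260".encode("utf-8")`.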
As mentioned before, UTF-8 is a popular encoding
form for Unicode. Why is that? The reason is that all ASCII
characters are encoded as a single byte in UTF-8, which is not only fully backward
compatible, but also space efficient for US and many European users. In general,
compared with the encodings traditionally used for each language,
UTF-8 costs no extra space for US ASCII, only a few
percent more for ISO-8859-1 (aka Latin-1, which covers most West European languages),
50% more for Chinese/Japanese/Korean, and 100% more for Greek and Cyrillic.
By comparison, UTF-16 costs no more space for Chinese/Japanese/Korean, but 100%
more for US ASCII, ISO-8859-1, Greek and Cyrillic. UTF-32 is a fixed-width
encoding that takes the most space. Since US and West European users account
for most internet users, and English accounts for most of the information
distributed on the web (at the time of this writing),
UTF-8 has quickly become the most popular Unicode
encoding form for the web.
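The space figures can be checked on small samples. The sketch below compares UTF-8 and UTF-16 against a legacy encoding for each script; the sample strings and the choice of ISO-8859-7 and GB2312 as the legacy encodings are assumptions made for this example. Very short samples exaggerate the Latin-1 overhead, which shrinks to a few percent on real text where most letters are plain ASCII.

```python
# (label, legacy codec, sample text)
samples = [
    ("US ASCII", "ascii",      "Hello"),
    ("Latin-1",  "latin-1",    "caf\u00e9"),
    ("Greek",    "iso-8859-7", "\u03b1\u03b2\u03b3\u03b4"),
    ("Chinese",  "gb2312",     "\u4e2d\u6587"),
]


def overhead(text, legacy, codec):
    """Extra space needed by `codec` relative to the legacy encoding, in percent."""
    base = len(text.encode(legacy))
    return 100 * (len(text.encode(codec)) - base) / base


for label, legacy, text in samples:
    utf8 = overhead(text, legacy, "utf-8")
    utf16 = overhead(text, legacy, "utf-16-be")  # big-endian variant: no byte order mark
    print(f"{label:9} UTF-8 {utf8:+5.0f}%   UTF-16 {utf16:+5.0f}%")
```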
Finally, a note about a universal way to enter Unicode characters in a web document.
For instance, to put U+2014 (the em dash "—") into a web document, one
can use either "&#x2014;" (the "x" means that what follows is in hexadecimal form)
or "&#8212;" (8212 is 0x2014 in decimal form). Any Unicode character can
be entered this way (it is not convenient, but it is helpful to know
if you don't have any language-specific input software).
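For completeness, here is a short sketch that produces both forms of such a reference and turns them back into the character. `to_reference` is a hypothetical helper; `html.unescape` comes from Python's standard library.

```python
import html


def to_reference(ch, hexadecimal=True):
    """Build the numeric reference for one character, in hex or decimal form."""
    cp = ord(ch)
    return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"


dash = "\u2014"  # the em dash, U+2014
print(to_reference(dash))                      # &#x2014;
print(to_reference(dash, hexadecimal=False))   # &#8212;

# Both forms decode to the same character.
print(html.unescape("&#x2014;") == html.unescape("&#8212;") == dash)  # True
```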
posted on July 26, 2002