Welcome to Changhai Lu's Homepage

WHAT IS UTF-8

- Collected by Changhai Lu -

This document contains a minimal collection of information for you to understand UTF-8 (which is the encoding used by this website). For more information, please check out the Unicode Consortium website.

UTF-8 is a popular encoding form of the Unicode/ISO-10646 standard. Don't worry if that doesn't make much sense to you yet; read on and things will become clear.

In the early days, there were two independent attempts to create a unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode Project, organized by a consortium later known as the Unicode Consortium. ISO came up with a standard called ISO 10646, which defines a huge 31-bit Universal Character Set (UCS). A 16-bit subset of UCS (which contains 65,536 code positions) is called the Basic Multilingual Plane (BMP) and is the part that was populated first. The Unicode Consortium, on the other hand, was working on its own standard, called the Unicode standard.

Having two independent standards is certainly not something people would call "unified". Both ISO and the Unicode Consortium realized that and decided to form a joint effort in 1991. Since then, new versions of the Unicode standard have been kept fully compatible and synchronized with the corresponding versions of ISO 10646. All characters are located at the same positions and have the same names in both standards.

Theoretically, the 31-bit UCS can contain about two billion characters. The number of characters actually defined, however, is much smaller (but has been growing over time). Version 3.2 of the Unicode standard, for instance, provided codes for 95,221 characters (which already goes beyond the BMP). Unicode is stable: its growth is strictly additive, meaning that new characters will be added but no existing characters will be removed or renamed in the future.

Unicode and ISO 10646 are first of all code tables that assign integer numbers to characters. Hexadecimal numbers for those integer values are commonly preceded by "U+". For instance, U+0041 is the character "Latin capital letter A". Given the integer values, it is up to the character encoding standards (encoding forms) to define how these values should be represented as a byte sequence.
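The mapping between characters and integers can be inspected directly. The following is a minimal Python sketch; the helper name `u_plus` is made up for this illustration and is not part of any standard.

```python
def u_plus(ch: str) -> str:
    """Return the conventional U+XXXX notation for a single character."""
    # ord() gives the integer assigned to the character by Unicode;
    # the notation uses at least four hexadecimal digits.
    return f"U+{ord(ch):04X}"


print(u_plus("A"))        # U+0041
print(u_plus("\u2260"))   # U+2260 (the "not equal to" sign)
print(chr(0x0041))        # A
```

Note that nothing here says how the character is stored as bytes; that is the job of the encoding forms.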

The Unicode standard defines three encoding forms that allow the same character data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8, 16 or 32 bits per code unit). These three encoding forms are called UTF-8, UTF-16 and UTF-32 respectively. There are other encoding forms defined by ISO. The abbreviation UTF stands for Unicode (or UCS) Transformation Format.
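To see the three encoding forms side by side, one can encode the same text with each of them. This is only a sketch: the function name `encode_forms` is hypothetical, and the big-endian codec variants are chosen here (as an assumption of the example) so that Python does not prepend a byte order mark.

```python
def encode_forms(text: str) -> dict:
    """Encode `text` in UTF-8, UTF-16 and UTF-32 (big-endian, no BOM)."""
    return {name: text.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}


for name, data in encode_forms("A\u2260").items():
    # bytes.hex(" ") prints the bytes as space-separated hex pairs
    print(f"{name:10} {data.hex(' ')}")
# utf-8      41 e2 89 a0
# utf-16-be  00 41 22 60
# utf-32-be  00 00 00 41 00 00 22 60
```

The same two characters (U+0041 and U+2260) take 4, 4 and 8 bytes respectively; only the byte representation changes, not the code numbers.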

UTF-8 transforms all Unicode characters into variable-length byte sequences. It has the following properties:

  • Characters U+0000 to U+007F (ASCII) are encoded as a single byte 0x00 to 0x7F. This means UTF-8 is fully compatible with ASCII.
  • All characters greater than U+007F are encoded as a sequence of several bytes, all of which are above 0x7F (that is, none of them is an ASCII byte). This makes it unambiguous whether a byte belongs to a multi-byte character or is an ASCII character.
  • The first byte of a multi-byte sequence (which represents a non-ASCII character) is always in the range 0xC0 to 0xFD. All further bytes in the sequence are in the range 0x80 to 0xBF. This makes it unambiguous where each multi-byte character begins and ends (in fact, the first byte also indicates how many bytes follow for the character).
  • The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
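The properties listed above mean that every byte can be classified on its own, without looking at its neighbours. The sketch below uses the byte ranges exactly as given in this article; the function names `classify` and `char_starts` are invented for the example.

```python
from typing import List


def classify(byte: int) -> str:
    """Classify a single byte using the ranges listed above."""
    if byte <= 0x7F:
        return "ascii"           # a complete one-byte character
    if byte <= 0xBF:
        return "continuation"    # 0x80..0xBF: second or later byte
    if byte <= 0xFD:
        return "lead"            # 0xC0..0xFD: first byte of a sequence
    return "unused"              # 0xFE and 0xFF never appear


def char_starts(data: bytes) -> List[int]:
    """Offsets at which a character begins: every non-continuation byte."""
    return [i for i, b in enumerate(data) if classify(b) != "continuation"]


data = "a\u2260b".encode("utf-8")       # 61 e2 89 a0 62
print([classify(b) for b in data])
print(char_starts(data))                # [0, 1, 4]
```

Because continuation bytes are recognizable by themselves, a program that lands in the middle of a byte stream can find the next character boundary by skipping forward to the first non-continuation byte.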

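To fully encode all the 2^31 characters in UCS, a UTF-8 encoded character can be up to six bytes long, but the 16-bit BMP characters are at most three bytes long. (The current Unicode standard restricts code points to U+0000 through U+10FFFF, so UTF-8 sequences in present-day use are at most four bytes long.) The following byte sequence formats are used to represent a character in UTF-8: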

Unicode characters:         UTF-8 encoding:
U-00000000 - U-0000007F:    0xxxxxxx
U-00000080 - U-000007FF:    110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:    1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
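The table above can be read as a procedure: pick the row by the size of the code number, then distribute the code bits over the "x" positions from left to right. The following is a minimal sketch of that procedure, not a production encoder. The name `encode_ucs` is made up, and the function deliberately follows the six-row table as printed, including the five- and six-byte rows that Python's built-in codec does not produce.

```python
def encode_ucs(code: int) -> bytes:
    """Encode one code number following the six-row table above."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code number outside the 31-bit UCS range")
    if code <= 0x7F:
        return bytes([code])                      # row 1: 0xxxxxxx
    # (upper limit of the row, marker bits of the first byte,
    #  number of 10xxxxxx continuation bytes)
    rows = (
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    )
    for limit, marker, n in rows:
        if code <= limit:
            # The first byte carries the topmost bits of the code number.
            lead = marker | (code >> (6 * n))
            # Each continuation byte carries the next six bits.
            tail = [0x80 | ((code >> (6 * i)) & 0x3F) for i in range(n - 1, -1, -1)]
            return bytes([lead] + tail)
    raise AssertionError("unreachable")


print(encode_ucs(0x2260).hex(" "))   # e2 89 a0
```

For code numbers that Python itself can encode, the result can be compared with `chr(code).encode("utf-8")` as a sanity check.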

The "x" positions are filled with the bits of the character code number as assigned by the Unicode standard. For instance, the Unicode character U+2260 = 0010 0010 0110 0000 (the symbol "not equal to", ≠) belongs to the third category in the table. Filling these 16 bits into the 16 "x" positions of that format, one obtains the UTF-8 encoding of the character:

11100010 10001001 10100000

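Notice that once the marker bits (the leading 1110 of the first byte and the leading 10 of each following byte) are set aside, the remaining digits 0010 001001 100000 are exactly the bits assigned by the Unicode standard. In hexadecimal, the three bytes are E2 89 A0.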
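The same worked example can be run in reverse: strip the marker bits from each byte and join what is left. This is a sketch for the three-byte format only, and `decode_three_bytes` is a hypothetical name chosen for the illustration.

```python
def decode_three_bytes(data: bytes) -> int:
    """Recover the code number from a 1110xxxx 10xxxxxx 10xxxxxx sequence."""
    b0, b1, b2 = data
    # Check the marker bits before trusting the payload bits.
    if b0 >> 4 != 0b1110 or b1 >> 6 != 0b10 or b2 >> 6 != 0b10:
        raise ValueError("not a three-byte UTF-8 sequence")
    # Keep 4 + 6 + 6 payload bits and concatenate them.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code = decode_three_bytes(bytes([0b11100010, 0b10001001, 0b10100000]))
print(f"U+{code:04X}")        # U+2260
print(format(code, "016b"))   # 0010001001100000
```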

As mentioned before, UTF-8 is a popular encoding form for Unicode. Why is that? The reason is that all ASCII characters are encoded as a single byte in UTF-8, which makes it not only fully backward compatible but also space efficient for US and many European users. Compared with the legacy encodings for each language, UTF-8 costs no extra space for US ASCII, only a few percent more for ISO-8859-1 (aka Latin-1, which covers most West European languages), 50% more for Chinese/Japanese/Korean, and 100% more for Greek and Cyrillic. By comparison, UTF-16 costs no extra space for Chinese/Japanese/Korean, but 100% more for US ASCII, ISO-8859-1, Greek and Cyrillic. UTF-32 is a fixed-width encoding that costs the most space. Since US and West European users account for most internet users, and English accounts for most of the information distributed on the web (at the time of this writing), UTF-8 has quickly become the most popular Unicode encoding form for the web.
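These space costs are easy to measure on small samples. The sketch below is only an illustration: the sample strings, the choice of legacy codec for each language, and the name `encoded_sizes` are assumptions of this example, and real documents that mix in ASCII markup or punctuation will give different ratios.

```python
def encoded_sizes(text: str, legacy: str) -> dict:
    """Byte counts of `text` in a legacy encoding and in the three UTFs."""
    # The "-be" codec variants are used so that no byte order mark is counted.
    return {
        name: len(text.encode(name))
        for name in (legacy, "utf-8", "utf-16-be", "utf-32-be")
    }


# Hypothetical sample texts, each paired with a common legacy encoding.
SAMPLES = {
    "English": ("Hello, world", "ascii"),
    "German": ("Grüße aus Köln", "latin-1"),
    "Greek": ("Καλημέρα", "iso-8859-7"),
    "Russian": ("Привет", "koi8-r"),
    "Chinese": ("你好世界", "gb2312"),
}

for language, (text, legacy) in SAMPLES.items():
    print(language, encoded_sizes(text, legacy))
```

For the Chinese sample, for instance, the counts are 8 bytes in GB2312, 12 in UTF-8, 8 in UTF-16 and 16 in UTF-32, which matches the "50% more" and "no extra space" figures quoted above.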

Finally, a note about a universal way to enter Unicode characters in a web document. For instance, to input U+2014 (the em dash "—"), one can use either "&#x2014;" (the "x" means that what follows is in hexadecimal form) or "&#8212;" (8212 is 0x2014 in decimal form). Any Unicode character can be entered in this form (not a convenient way, but helpful to know if you don't have any language-specific input software).
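Such numeric references can be generated and checked mechanically. The following sketch uses Python's standard `html` module to decode them; the helper name `char_refs` is made up for this example.

```python
import html


def char_refs(ch: str) -> tuple:
    """Return the hexadecimal and decimal HTML references for one character."""
    code = ord(ch)
    return f"&#x{code:X};", f"&#{code};"


hex_ref, dec_ref = char_refs("\u2014")   # U+2014, the em dash
print(hex_ref, dec_ref)                  # &#x2014; &#8212;

# Both forms decode back to the same character.
print(html.unescape(hex_ref) == html.unescape(dec_ref) == "\u2014")   # True
```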
