Woodstock Blog

a tech blog for general algorithmic interview questions

[Question] ASCII, UTF-8, UTF-16 and Unicode

First Word

ASCII and the UTF family (UTF-8, UTF-16, UTF-32) are encoding schemes. Unicode is a character set.

Or in other words: Unicode is a character set, used to translate characters into numbers (code points). UTF-8 is an encoding, used to translate those numbers into binary data (bytes) and back.
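As a quick illustration (a minimal Python sketch; any language with proper Unicode strings would show the same thing):

    # Unicode: character <-> code point (a number)
    code_point = ord('€')           # 8364, i.e. U+20AC
    char = chr(0x20AC)              # '€'

    # UTF-8: code points <-> bytes
    data = '€'.encode('utf-8')      # b'\xe2\x82\xac' (3 bytes)
    text = data.decode('utf-8')     # back to '€'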

ASCII

ASCII is a character-encoding scheme based on the English alphabet. It encodes 128 characters into 7-bit integers.
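In practice the 7-bit limit means anything outside those 128 characters simply cannot be represented; a small Python sketch:

    # ASCII covers exactly the 128 code points 0..127 (7 bits)
    print('A'.encode('ascii'))      # b'A' -- a single byte, value 65 (0x41)

    try:
        'é'.encode('ascii')         # U+00E9 is outside the 7-bit range
    except UnicodeEncodeError as err:
        print(err)                  # 'ascii' codec can't encode character ...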

ASCII was the most commonly used character encoding on the World Wide Web until December 2007, when it was surpassed by UTF-8, which includes ASCII as a subset.

UTF-8

UCS Transformation Format—8-bit

UTF-8 is a variable-width encoding that can represent every character in the Unicode character set. It uses 1 byte for ASCII characters, and up to 4 bytes for others.
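For example, in Python (any UTF-8-capable language behaves the same), the byte count grows with the code point:

    for ch in ('A', 'é', '€', '😀'):        # U+0041, U+00E9, U+20AC, U+1F600
        print(ch, hex(ord(ch)), len(ch.encode('utf-8')), 'byte(s)')
    # A 0x41 1 byte(s)
    # é 0xe9 2 byte(s)
    # € 0x20ac 3 byte(s)
    # 😀 0x1f600 4 byte(s)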

UTF-8 is the dominant character encoding for the World Wide Web, accounting for more than half of all Web pages.

UTF-8 is a good default choice because:

  1. it is backward compatible with ASCII

  2. it avoids the complications of endianness and byte order marks (BOM) that come with UTF-16 and UTF-32 (illustrated just below)

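To make point 2 concrete, here is a small Python sketch: UTF-16 output normally starts with a byte order mark and differs between little- and big-endian machines, while UTF-8 has only one byte order:

    s = 'hi'
    print(s.encode('utf-16'))       # b'\xff\xfeh\x00i\x00' -- BOM FF FE, then little-endian
    print(s.encode('utf-16-le'))    # b'h\x00i\x00'         -- little-endian, no BOM
    print(s.encode('utf-16-be'))    # b'\x00h\x00i'         -- big-endian, no BOM
    print(s.encode('utf-8'))        # b'hi'                 -- one byte order, no BOM needed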

UTF-16 and UTF-32

UTF-16 is the encoding used for text in the OS API of Microsoft Windows 2000/XP/2003/Vista/7/8/CE. Note that in the Windows world, the word “Unicode” has taken on a narrower meaning, as this explanation puts it:

Windows uses UTF-16LE encoding internally as the memory storage format for Unicode strings, and it considers this to be the natural encoding of Unicode text. In the Windows world, there are ANSI strings and there are Unicode strings (stored internally as UTF-16LE).

This was all devised in the early days of Unicode, before UTF-8 was invented. This is why Windows’s support for UTF-8 is all-round poor.
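As a rough illustration of what “stored internally as UTF-16LE” looks like at the byte level (a Python sketch; the codec names below are Python’s, not a Windows API):

    print('A'.encode('utf-16-le').hex())    # 4100     -- 2 bytes, low byte first
    print('€'.encode('utf-16-le').hex())    # ac20     -- U+20AC, still 2 bytes
    print('😀'.encode('utf-16-le').hex())   # 3dd800de -- a surrogate pair (D83D DE00), 4 bytes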

UTF-8: optimized for size

Best suited for Latin/ASCII-based text, where most characters take only 1 byte.

UTF-16: a balance

Each character takes at least 2 bytes, which is enough to cover the mainstream languages at a fixed width and keeps character handling simple; the width is still variable, however, and can grow to 4 bytes per character.

UTF-32: optimized for performance

Fixed-size characters (4 bytes each) allow simple algorithms, at the cost of extra memory.
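The trade-off is easy to measure; a quick Python comparison of byte counts (using the explicit -le codecs so no BOM is added):

    samples = {'latin': 'hello', 'chinese': '你好', 'emoji': '😀'}
    for name, text in samples.items():
        print(f"{name}: utf-8={len(text.encode('utf-8'))} "
              f"utf-16={len(text.encode('utf-16-le'))} "
              f"utf-32={len(text.encode('utf-32-le'))}")
    # latin: utf-8=5 utf-16=10 utf-32=20
    # chinese: utf-8=6 utf-16=4 utf-32=8
    # emoji: utf-8=4 utf-16=4 utf-32=4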

Unicode

Unicode is a computing-industry standard for the consistent encoding, representation, and handling of text. It covers most of the world’s writing systems, with more than 110,000 characters spanning 100 scripts, plus a wide range of symbols.

Unicode can be implemented by UTF-8, UTF-16, and the now-obsolete UCS-2. When text is encoded with UTF-8, every character in the ASCII range is encoded as exactly the same byte value as in ASCII.
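The ASCII-compatibility claim can be checked exhaustively in one line of Python:

    # every ASCII code point encodes to the identical single byte in UTF-8
    assert all(chr(cp).encode('utf-8') == bytes([cp]) for cp in range(128))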

Table of the scheme

Bits of       First        Last         Bytes in
code point    code point   code point   sequence   Byte 1     Byte 2     Byte 3     Byte 4     Byte 5     Byte 6
 7            U+0000       U+007F       1          0xxxxxxx
11            U+0080       U+07FF       2          110xxxxx   10xxxxxx
16            U+0800       U+FFFF       3          1110xxxx   10xxxxxx   10xxxxxx
21            U+10000      U+1FFFFF     4          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
26            U+200000     U+3FFFFFF    5          111110xx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx
31            U+4000000    U+7FFFFFFF   6          1111110x   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx

Example

Translate ‘€’ into its UTF-8 byte sequence.

  1. The Unicode code point for “€” is U+20AC.

  2. The code point needs 16 bits, so according to the scheme table above it falls in the 3-byte row.

  3. Convert hexadecimal 20AC into a 16-bit binary 0010000010101100.

  4. Fill those 16 bits into the pattern 1110xxxx 10xxxxxx 10xxxxxx, which gives 11100010 10000010 10101100.

  5. Almost done. The result can be concisely written in hexadecimal, as E2 82 AC.
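The same steps can be written as a tiny hand-rolled encoder (a Python sketch covering the 1- to 4-byte rows of the table, i.e. code points up to U+1FFFFF; in real code you would just call the built-in codec):

    def utf8_encode(cp: int) -> bytes:
        """Encode one code point by filling the bit patterns from the table above."""
        if cp <= 0x7F:                        # 1 byte:  0xxxxxxx
            return bytes([cp])
        if cp <= 0x7FF:                       # 2 bytes: 110xxxxx 10xxxxxx
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp <= 0xFFFF:                      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
        return bytes([0xF0 | cp >> 18,        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                      0x80 | cp >> 12 & 0x3F, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

    print(utf8_encode(0x20AC).hex())          # e282ac -- matches E2 82 AC above
    assert utf8_encode(0x20AC) == '€'.encode('utf-8')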

Note how the leading bits make the encoding self-describing: a single-byte (ASCII) character starts with ‘0’, the first byte of a multi-byte sequence starts with as many 1s as there are bytes in the sequence (110, 1110, 11110, …), and every continuation byte starts with ‘10’. This is how a computer can easily identify the length of every character. Also note that ASCII is natively supported by UTF-8: ASCII text is valid UTF-8, byte for byte.
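For completeness, here is how a decoder could recover that length from the first byte alone (a Python sketch of the common 1- to 4-byte cases, not a full validator):

    def utf8_seq_len(first_byte: int) -> int:
        """Length of a UTF-8 sequence, judged from the first byte's leading bits."""
        if first_byte < 0x80:       # 0xxxxxxx -> single-byte (ASCII) character
            return 1
        if first_byte >= 0xF0:      # 11110xxx -> 4 bytes
            return 4
        if first_byte >= 0xE0:      # 1110xxxx -> 3 bytes
            return 3
        if first_byte >= 0xC0:      # 110xxxxx -> 2 bytes
            return 2
        raise ValueError('10xxxxxx is a continuation byte, not a first byte')

    print(utf8_seq_len(0xE2))       # 3 -- the first byte of '€'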