May 01, 2020

Unicode Explained

Translating text to a binary format, called character encoding, is an important part of software development. Remnants of ASCII are still around today, but the majority of systems default to UTF-8. In today’s post we will look at what ASCII, Unicode and UTF-8 are.

ASCII
UTF-8
Hexadecimal
Base64

ASCII

ASCII stands for American Standard Code for Information Interchange and was introduced in 1963. It was designed specifically for the English alphabet and encodes 128 characters in 7 bits. The first 32 characters are control characters used for special instructions, and the rest cover letters, digits and punctuation (plus the delete character at 127).

For example, A is 65, which is 1000001 in binary, and u is 117, which is 1110101. With ASCII, we were able to transfer documents over the wire, but only documents written with the English alphabet. Countries using a different alphabet had their own standards, which was a problem because documents could not be exchanged reliably between computers in different countries. The world needed a new worldwide standard that would cater for all characters.
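The two values above can be checked with a few lines of Python. This is only a small sketch; the helper name to_ascii_bits is made up for this post.

```python
def to_ascii_bits(ch: str) -> str:
    """Return the 7-bit binary form of an ASCII character (hypothetical helper)."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    # "07b" means: binary, zero-padded to 7 digits
    return format(code, "07b")


print(ord("A"), to_ascii_bits("A"))  # 65 1000001
print(ord("u"), to_ascii_bits("u"))  # 117 1110101
```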

UTF-8

For that purpose, the Unicode standard was created to gather all the characters in the world. Unicode works with code points, numerical values that each identify a single character (or another special value). A code point is written with the prefix U+ followed by hexadecimal digits; for example, the code point of y is U+0079.

y is part of the Basic Latin block, U+0000 - U+007F, which is itself part of the Basic Multilingual Plane (BMP), U+0000 - U+FFFF.
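As a quick illustration, Python's built-in ord gives the code point of a character, which we can then print in the U+ notation. The helper names below are invented for this sketch, and the range checks simply reuse the block and plane boundaries quoted above.

```python
def code_point_label(ch: str) -> str:
    """Format a character's code point in U+ notation (hypothetical helper)."""
    # At least four uppercase hex digits, as in U+0079
    return f"U+{ord(ch):04X}"


def in_basic_latin(ch: str) -> bool:
    """True if the character sits in the Basic Latin block, U+0000 - U+007F."""
    return 0x0000 <= ord(ch) <= 0x007F


def in_bmp(ch: str) -> bool:
    """True if the character sits in the Basic Multilingual Plane, U+0000 - U+FFFF."""
    return 0x0000 <= ord(ch) <= 0xFFFF


print(code_point_label("y"))                  # U+0079
print(in_basic_latin("y"), in_bmp("y"))       # True True
print(code_point_label("\u20ac"))             # U+20AC (the euro sign)
print(in_basic_latin("\u20ac"), in_bmp("\u20ac"))  # False True
```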

There are over a million code points (1,114,112, from U+0000 to U+10FFFF), so expressing the full range of Unicode takes 21 bits. To encode those code points, Unicode defines several encoding forms: UTF-8, UTF-16 and UTF-32.
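The 21-bit figure and the difference between the three encodings can be checked directly. This is a rough sketch: encoded_sizes is a made-up helper, and the big-endian codec names are used only so that Python does not prepend a byte order mark to the output.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point in Unicode

print(MAX_CODE_POINT + 1)           # 1114112 possible code points
print(MAX_CODE_POINT.bit_length())  # 21 bits are enough for the largest one


def encoded_sizes(ch: str) -> dict:
    """Number of bytes one character takes in each encoding (hypothetical helper)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


print(encoded_sizes("y"))           # {'utf-8': 1, 'utf-16-be': 2, 'utf-32-be': 4}
print(encoded_sizes("\U0001F600"))  # {'utf-8': 4, 'utf-16-be': 4, 'utf-32-be': 4}
```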

The most commonly used encoding is UTF-8. UTF stands for Unicode Transformation Format and 8 stands for 8 bits. The 8 bits are the size of the code unit, which is the minimum unit of storage used to represent a code point (or part of a code point). UTF-8 was designed to be backward compatible with ASCII. The first 128 code points match the ASCII characters one for one, and since the code unit is 8 bits wide, putting a 0 in front of the 7 ASCII bits lets us transmit the data exactly the way we would with an ASCII encoding.

For example, A in ASCII is 65 (1000001) and A in UTF-8 is also 65 (01000001).

To encode the rest of the Unicode characters, UTF-8 uses one to four bytes (one byte being 8 bits). The following table from Wikipedia shows how the bits of a code point are spread across the one to four bytes.
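A short check of that backward compatibility, using only Python's built-in codecs: for text made of ASCII characters, both encodings should produce the very same bytes.

```python
text = "Unicode Explained"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Pure ASCII text yields identical bytes in both encodings
print(ascii_bytes == utf8_bytes)  # True

# A is 65 in both, stored in UTF-8 as one byte with a leading zero bit
first_byte = "A".encode("utf-8")[0]
print(first_byte, format(first_byte, "08b"))  # 65 01000001
```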

Number of bytes	Bits for code point	First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
1	7	U+0000	U+007F	0xxxxxxx	-	-	-
2	11	U+0080	U+07FF	110xxxxx	10xxxxxx	-	-
3	16	U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx	-
4	21	U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

An ASCII character needs a single byte whose leading bit is zero. Every other character uses a multi-byte sequence: the leading byte starts with 110, 1110 or 11110, where the number of leading ones gives the total number of bytes, and each continuation byte starts with 10.

So with UTF-8, we are able to encode every character in the Unicode standard, transmit the data over the wire and display documents on any computer in the world!
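To make the table more concrete, here is a hand-rolled sketch that follows its four rows. It is for illustration only (in practice you would just call str.encode), and the function name utf8_encode is made up for this post. The rejection of U+D800 - U+DFFF is an extra not shown in the table: those values are reserved for UTF-16 surrogates and are not valid in UTF-8.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, row by row of the table (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")

    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


# Compare against Python's own encoder for one character per table row
for ch in ["y", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    bits = " ".join(format(b, "08b") for b in encoded)
    print(f"U+{ord(ch):04X}", bits, encoded == ch.encode("utf-8"))
```

The last column printed should be True on every line, meaning the manual bit shuffling agrees with the built-in encoder.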

Hexadecimal

Now when looking at ASCII or UTF-8, we can’t get away from seeing hexadecimal notation. Hexadecimal notation is used to group bits into fewer, more readable symbols. The notation is made of 16 symbols, the 10 digits 0-9 and the 6 letters A-F, and is usually prefixed with 0x. 16 symbols cover a group of 4 bits, called a nibble. To write a sequence of bits in hex, we therefore split it into groups of 4 and look up the symbol for each group, so one byte is represented by 2 hexadecimal symbols.

Hexadecimal encoding has a very different purpose from UTF-8. UTF-8 was created to encode all possible characters, supporting all languages, to and from a binary format, while hexadecimal encoding turns each group of 4 bits of binary data into one hexadecimal character to make it easier to read.

For example, 101010111100 becomes 0xABC, which is much easier to read. Hexadecimal is also known as base16.
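The nibble grouping can be sketched in a few lines. The helper bits_to_hex is invented for this post, and it assumes the input is a non-empty string of 0s and 1s, padded on the left when its length is not a multiple of 4.

```python
def bits_to_hex(bits: str) -> str:
    """Convert a string of 0s and 1s to hex, one nibble at a time (hypothetical helper)."""
    # Left-pad with zeros so the length is a multiple of 4
    width = (len(bits) + 3) // 4 * 4
    padded = bits.zfill(width)
    # Split into 4-bit groups and map each group to one hex symbol
    nibbles = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    return "0x" + "".join(format(int(nibble, 2), "X") for nibble in nibbles)


print(bits_to_hex("101010111100"))  # 0xABC
print(bits_to_hex("1000001"))       # 0x41, the letter A
print("A".encode("utf-8").hex())    # 41, the built-in way for bytes
```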

Base64

While hexadecimal is good for small values, it becomes quite heavy when the data gets large. If size matters, another widely used encoding is base64, whose notation is made of 64 symbols: the 26 uppercase letters A-Z, the 26 lowercase letters a-z, the 10 digits 0-9 and the 2 special characters + and /. A special padding character = is added at the end when the input does not fill a whole group of 3 bytes. 64 symbols cover a group of 6 bits of data, so 3 bytes of data (24 bits) are represented by 4 base64 symbols - noticeably smaller than hexadecimal, which needs 6 symbols for the same 3 bytes.

Base64 works the same way as a hexadecimal conversion, except that we group the data in chunks of 6 bits and look up the character each chunk maps to.

For example, A is 01000001. Padded with zeros to a multiple of 6 bits, it becomes 010000 010000, which is 16 and 16, and 16 maps to Q. With two = characters to complete the group of 4 symbols, we get QQ== in base64.
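A rough size comparison using Python's standard library; encoded_lengths is a made-up helper that returns how many symbols each notation needs for the same bytes.

```python
import base64


def encoded_lengths(data: bytes) -> tuple:
    """Return (hex symbols, base64 symbols) needed for the same data (hypothetical helper)."""
    return len(data.hex()), len(base64.b64encode(data))


print(encoded_lengths(bytes(3)))   # (6, 4)
print(encoded_lengths(bytes(30)))  # (60, 40)
print(encoded_lengths(b"A"))       # (2, 4)
```

The last line shows that for a single byte the padding makes base64 longer than hex, so the saving only shows up as the data grows.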
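The 6-bit grouping can be written out by hand and compared with Python's base64 module. This is a sketch for illustration, and the names ALPHABET and base64_encode are invented for this post.

```python
import base64
import string

# The 64 symbols in index order: A-Z, a-z, 0-9, + and /
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def base64_encode(data: bytes) -> str:
    """Encode bytes to base64 by grouping bits by 6 (illustrative sketch)."""
    # Write every byte as 8 bits
    bits = "".join(format(byte, "08b") for byte in data)
    # Right-pad with zeros so the length is a multiple of 6
    bits += "0" * (-len(bits) % 6)
    # Map each 6-bit group to its symbol
    symbols = "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    # Complete the last group of 4 symbols with =
    return symbols + "=" * (-len(symbols) % 4)


print(base64_encode(b"A"))    # QQ==
print(base64_encode(b"Man"))  # TWFu
print(base64_encode(b"A") == base64.b64encode(b"A").decode("ascii"))  # True
```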

And that concludes today’s post!

Conclusion

Today we looked into Unicode. We started by looking at what ASCII is, then moved on to Unicode and UTF-8, and finished by touching quickly on hexadecimal and base64 conversion. I hope you liked this post and I'll see you in the next one!