May 1st, 2020 - written by Kimserey.
The step of translating text to binary format, called character encoding, is an important part of software development. Nowadays, remnants of ASCII are still around, but the majority of systems default to UTF-8. In today’s post we will look at what ASCII, Unicode and UTF-8 are.
ASCII stands for American Standard Code for Information Interchange and was introduced in 1963. It was designed specifically for the English alphabet and encodes 128 characters in 7 bits. The first 32 characters are control characters used for special instructions, and the remaining ones are mostly letters, digits and punctuation.
For example, A is 65 which is 1000001, and u is 117 which is 01110101 once padded to a full byte.
With ASCII, we were able to transfer documents over the wire, but only documents written with the English alphabet. Countries using a different alphabet had their own standards, which was a problem because documents could not be exchanged between computers in different countries. The world needed a new worldwide standard that would cater for all characters.
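The ASCII values quoted above can be checked with a few lines of Python. This is only a sketch: `ascii_bits` is a hypothetical helper name, and it prints the 7-bit pattern, so `u` shows up without the leading zero used when the value is padded to a full byte.

```python
def ascii_bits(char: str) -> str:
    """Return the 7-bit ASCII pattern of a single character."""
    code = ord(char)
    if code > 127:
        # ASCII only defines the values 0 to 127.
        raise ValueError(f"{char!r} is not an ASCII character")
    return format(code, "07b")


for char in ("A", "u"):
    print(char, ord(char), ascii_bits(char))
# A 65 1000001
# u 117 1110101
```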
For that purpose, the Unicode standard was created to catalogue all the characters in the world.
Unicode works with code points, each of which is a numerical value identifying a single character (or another special value).
The notation of a Unicode code point starts with U+; for example, the code point of y is U+0079. y is part of the block Basic Latin, U+0000 - U+007F, which is part of the plane Basic Multilingual Plane (BMP), U+0000 - U+FFFF.
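To make the notation concrete, here is a small Python sketch that formats a character as a code point and works out which plane it belongs to. The function names are made up for illustration; the only ranges assumed are the ones quoted above, plus the fact that each plane holds 0x10000 code points.

```python
def code_point_notation(char: str) -> str:
    """Format a character as U+XXXX (at least four hex digits)."""
    return f"U+{ord(char):04X}"


def plane_number(char: str) -> int:
    """Each plane holds 0x10000 code points; plane 0 is the BMP."""
    return ord(char) >> 16


def in_basic_latin(char: str) -> bool:
    """The Basic Latin block spans U+0000 to U+007F."""
    return 0x0000 <= ord(char) <= 0x007F


print(code_point_notation("y"), in_basic_latin("y"), plane_number("y"))
# U+0079 True 0

emoji = chr(0x1F600)  # a character outside the BMP
print(code_point_notation(emoji), in_basic_latin(emoji), plane_number(emoji))
# U+1F600 False 1
```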
There are over a million code points and, as of today, we need 21 bits to express the full range of Unicode characters. To encode those characters, Unicode has multiple encoding standards: UTF-8, UTF-16 and UTF-32.
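As a rough illustration, the sketch below checks the 21-bit figure and compares how many bytes the three encodings need for a few characters. It uses Python's built-in codecs; the `-le` variants are chosen here only so that no byte order mark is added to the output, and `encoded_sizes` is a hypothetical helper name.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point defined by Unicode

print(MAX_CODE_POINT + 1)           # 1114112 possible code points
print(MAX_CODE_POINT.bit_length())  # 21


def encoded_sizes(char: str) -> dict:
    """Number of bytes needed for one character in each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(char.encode(name)) for name in encodings}


for char in ("y", chr(0x20AC), chr(0x1F600)):
    print(f"U+{ord(char):04X}", encoded_sizes(char))
# U+0079 {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# U+20AC {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```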
The most commonly used standard is UTF-8. UTF stands for Unicode Transformation Format and 8 stands for 8 bits. The 8 bits are the size of the code unit, which is the minimum unit of storage used to represent a code point (or part of a code point).
UTF-8 was designed to be backward compatible with ASCII. The first 128 characters in UTF-8 map one-to-one to the ASCII characters, and since the code unit is 8 bits, putting 0 as the first bit lets us transmit data exactly as we would with an ASCII encoding.
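A quick way to see this compatibility, sketched with Python's built-in `ascii` and `utf-8` codecs: encoding an ASCII character gives the same byte either way, and any run of ASCII bytes decodes to the same text under both.

```python
text = "A"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(ascii_bytes == utf8_bytes)      # True
print(format(utf8_bytes[0], "08b"))   # 01000001 (leading bit is 0)

# Every one of the 128 ASCII values decodes identically under both codecs.
all_ascii = bytes(range(128))
print(all_ascii.decode("ascii") == all_ascii.decode("utf-8"))  # True
```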
For example, A in ASCII is 65 (1000001) and A in UTF-8 is also 65 (01000001).
In order to encode the rest of the Unicode characters, UTF-8 uses one to four bytes (one byte being 8 bits). The following table from Wikipedia explains how the bits of a code point are spread across those one to four bytes.
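| Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
| 2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| 3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |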
For any ASCII character, a single byte is needed, with the leading bit being zero. For all other characters, a continuation pattern is followed, with the leading byte starting with 11 and continuation bytes starting with 10.
So with UTF-8, we are able to encode every character in the Unicode standard, transmit that data over the wire, and display documents on any computer in the world!
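To see how code point bits end up spread over one to four bytes, here is a minimal hand-written encoder in Python. It is a sketch for illustration rather than how any real library implements UTF-8: `utf8_encode` is a hypothetical name, and the result is compared against Python's built-in `str.encode` as a sanity check.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point into UTF-8 bytes by hand."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx (same as ASCII)
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


euro = utf8_encode(0x20AC)
print(" ".join(format(byte, "08b") for byte in euro))
# 11100010 10000010 10101100

# Sanity check against the built-in encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

Each branch only moves bits around with shifts and masks: the leading byte announces how many bytes follow, and every continuation byte carries 6 bits of the code point.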
Now when looking at ASCII or UTF-8, we can’t get away from seeing hexadecimal notation. Hexadecimal notation is used to group bits into fewer, more readable symbols.
The notation is made of 16 symbols, 10 digits 0-9 and 6 letters A-F, and is usually prefixed with 0x.
16 symbols cover a group of 4 bits, called a nibble. To compose the hex value of a sequence of bits, we split it into groups of 4 and find the corresponding hex symbol for each group, so one byte is represented by 2 hexadecimal symbols.
Hexadecimal encoding has a very different purpose from UTF-8. UTF-8 was created to encode all possible characters, supporting all languages, to and from binary format, while hexadecimal encoding maps each group of 4 bits to a hexadecimal symbol to make binary data easier to read.
For example, 101010111100 will be 0xABC, much easier to read. Hexadecimal is also another name for base16.
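The nibble grouping can be sketched in a few lines of Python. `bits_to_hex` is a hypothetical helper written for this post; in practice Python's built-in `hex()` or `format()` does the same job.

```python
HEX_SYMBOLS = "0123456789ABCDEF"


def bits_to_hex(bits: str) -> str:
    """Convert a string of 0s and 1s to hex, one symbol per nibble."""
    bits = bits.replace(" ", "")
    # Left-pad with zeros so the length is a multiple of 4.
    padded_length = -(-len(bits) // 4) * 4
    bits = bits.zfill(padded_length)
    nibbles = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "0x" + "".join(HEX_SYMBOLS[int(nibble, 2)] for nibble in nibbles)


print(bits_to_hex("101010111100"))  # 0xABC
print(bits_to_hex("0100 0001"))     # 0x41, one byte is two hex symbols
```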
While hexadecimal is good for small values, it becomes quite heavy when the data grows large. If size matters, another widely used encoding is base64, where the notation is made of 64 symbols: 26 uppercase letters A-Z, 26 lowercase letters a-z, 10 digits 0-9 and 2 special characters + and /. A special padding character = is also appended when the data does not fill a complete group of 4 symbols.
64 symbols cover a group of 6 bits of data, so 3 bytes of data (24 bits) are represented by 4 base64 symbols, which is more compact than hexadecimal, where the same 3 bytes require 6 symbols.
Base64 works the same way as a hexadecimal conversion, except that we group every 6 bits of data and find the character mapped to each group.
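For example, the byte 0100 0001 (A) is split into the 6-bit groups 010000 and 010000 (the second one padded with zeros), which gives QQ== in base64.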
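The 6-bit grouping can be sketched by hand in Python and compared with the standard library. `base64_encode` below is a hypothetical, deliberately naive implementation that works on a string of bits for readability; real code should simply call `base64.b64encode`.

```python
import base64
import string

# 26 uppercase + 26 lowercase + 10 digits + 2 special characters = 64 symbols
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def base64_encode(data: bytes) -> str:
    """Encode bytes to base64 by grouping the bits 6 at a time."""
    bits = "".join(format(byte, "08b") for byte in data)
    # Pad the bit string with zeros up to a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    symbols = "".join(
        ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)
    )
    # Pad the output with = up to a multiple of 4 symbols.
    return symbols + "=" * (-len(symbols) % 4)


print(base64_encode(b"A"))  # QQ==

data = b"abc"  # 3 bytes
print(len(data.hex()), len(base64_encode(data)))  # 6 4

# Sanity check against the standard library.
for sample in (b"", b"A", b"AB", b"abc", bytes(range(256))):
    assert base64_encode(sample) == base64.b64encode(sample).decode("ascii")
```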
And that concludes today’s post!
Today we looked into character encoding. We started by looking at what ASCII was, then moved on to Unicode and UTF-8. We finished by touching quickly on hexadecimal and base64 conversion. I hope you liked this post and I'll see you in the next one!