bits and text (intro)

character sets and character encodings

To represent a character of text as bits, we first assign it a number, and so we must (arbitrarily) decide which numbers stand for which characters. A character set is a standardized selection of characters, each given a designated number.
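For example, Python's built-in ord and chr functions expose this mapping directly; a minimal illustration, using the ASCII/Unicode assignment in which 'A' is number 65:

    # ord() maps a character to its assigned number; chr() is the inverse.
    print(ord('A'))   # 65
    print(chr(97))    # 'a'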

When expressing characters as numbers, we need to decide how exactly to write the numbers as bits. How many bits do we use to represent each character? Do we use the same number of bits for every character? In other words, how should we encode the characters? A character encoding is a standardized way of encoding text.
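To make the question concrete, here is a small Python sketch using the str.encode method: the same two-character string takes up a different number of bytes depending on which encoding is chosen.

    # The same two characters, written as bits in two different ways.
    print(len('hi'.encode('ascii')))      # 2 bytes  -> 8 bits per character
    print(len('hi'.encode('utf-32-le')))  # 8 bytes  -> 32 bits per character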

ASCII and Unicode

ASCII (American Standard Code for Information Interchange) was the most widely used character set for several decades. ASCII contains just 128 characters: the upper- and lower-case English alphabet, numerals, punctuation, and a few dozen control characters.
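A consequence of that small repertoire, shown here in a short Python sketch: a character outside those 128 simply cannot be written in ASCII at all.

    # 'é' is not among ASCII's 128 characters, so encoding it fails.
    try:
        'é'.encode('ascii')
    except UnicodeEncodeError as error:
        print(error)   # 'ascii' codec can't encode character '\xe9' ...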

The Unicode character set has now supplanted ASCII as the most widely used character set. Created in the 1990s, Unicode has room for over a million characters, covering essentially every symbol of every written language in history.
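To give a sense of the range, a quick look in Python (where strings are Unicode) at characters far outside ASCII:

    # Unicode assigns numbers well beyond ASCII's 0-127.
    print(ord('あ'))       # 12354 (Japanese hiragana "a")
    print(chr(0x1F600))   # 😀 (an emoji, code point 128512)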

UTF-8, UTF-16, UTF-32

Unicode text is most commonly encoded in one of three standard encodings: UTF-8, UTF-16, and UTF-32. UTF-8, as the name implies, uses as few as 8 bits to represent a single character, though some characters require as many as 32 bits. In UTF-16, the most commonly used characters are represented in 16 bits and the rest in 32. In UTF-32, every character is represented in 32 bits. The choice of encoding comes down to a trade-off between space efficiency and processing efficiency.
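A small Python sketch makes the size differences concrete. The byte counts below use the explicit little-endian variants ('utf-16-le', 'utf-32-le') so that no byte-order mark is added to the output.

    # Compare how many bytes each encoding needs for the same character.
    for text in ('A', '€', '😀'):
        utf8 = text.encode('utf-8')
        utf16 = text.encode('utf-16-le')
        utf32 = text.encode('utf-32-le')
        print(text, len(utf8), len(utf16), len(utf32))
    # 'A'  -> 1 byte  in UTF-8, 2 in UTF-16, 4 in UTF-32
    # '€'  -> 3 bytes in UTF-8, 2 in UTF-16, 4 in UTF-32
    # '😀' -> 4 bytes in UTF-8, 4 in UTF-16, 4 in UTF-32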
