bits and text (notes)

 

byte: the size of a cell of addressable memory. Varies from system to system, but basically all modern systems have 8-bit bytes.

octet: 8 bits

megabyte: 1,000,000 (10^6) bytes

gigabyte: 1,000,000,000 (10^9) bytes

terabyte: 10^12 bytes

petabyte: 10^15 bytes

exabyte: 10^18 bytes

zettabyte: 10^21 bytes

kibibyte: 2^10 bytes

mebibyte: 2^20 bytes

gibibyte: 2^30 bytes

These prefixes can all be used with –bit as well as –byte, e.g. a mebibit is 2^20 bits. In abbreviations, uppercase B should mean byte while lowercase b should mean bit.

When talking about throughput, it’s traditional to talk in terms of bits, not bytes, e.g. the download speed of my internet connection is 10 Mbps (10 megabits per second), which works out to 1.25 megabytes per second.
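To make the decimal versus binary prefixes and the bit versus byte distinction concrete, here is a small Python sketch. The constant names and the helper `mbps_to_megabytes_per_second` are made up for illustration, and the conversion assumes 8-bit bytes.

```python
# Decimal (SI) prefixes are powers of 10; binary (IEC) prefixes are powers of 2.
MEGABYTE = 10**6
MEBIBYTE = 2**20
GIGABYTE = 10**9
GIBIBYTE = 2**30

def mbps_to_megabytes_per_second(megabits_per_second):
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return megabits_per_second / 8

print(MEBIBYTE - MEGABYTE)               # 48576 more bytes in a mebibyte
print(GIBIBYTE / GIGABYTE)               # 1.073741824
print(mbps_to_megabytes_per_second(10))  # 1.25
```

The gap between the two families of prefixes grows with size: a mebibyte is under 5% larger than a megabyte, while a gibibyte is over 7% larger than a gigabyte.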

§

Characters can be arbitrarily mapped to numbers such that we can represent a sequence of characters as a sequence of numbers.

character set: a set of characters in which each character is mapped to a unique number.

ASCII (“ass-key”): American Standard Code for Information Interchange was the most widely used character set for many decades. Contains 128 characters, numbered 0 to 127: the 26 lowercase letters of English, the 26 uppercase letters of English, the 10 digits (0 1 2 3 4 5 6 7 8 9), punctuation marks, whitespace characters, and control characters.
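As a quick illustration of characters mapped to numbers, Python's built-in `ord` and `chr` convert between a character and its number, and `str.encode("ascii")` turns a string into the corresponding bytes. The sample strings are arbitrary.

```python
# ord() gives the number a character is mapped to; chr() goes the other way.
print(ord("A"))                      # 65
print(ord("a"))                      # 97
print(chr(48))                       # 0

# A string of ASCII characters becomes a sequence of numbers, one per character.
print(list("Hi!".encode("ascii")))   # [72, 105, 33]

# Every ASCII number is in the range 0 to 127, so it fits in 7 bits.
assert all(ord(c) < 128 for c in "Hello, world!")
```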

whitespace character: A character representing spacing in text, e.g. space, tab, line break, etc. To represent a line break in ASCII, we use the linefeed character (Unix convention), or the carriage return character (classic Mac convention), or a carriage return followed by a linefeed (Windows convention).
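The three line-break conventions can be seen directly in code. This is a minimal sketch; `normalize_newlines` is a hypothetical helper name, not a standard library function.

```python
LF = "\n"   # linefeed, ASCII 10
CR = "\r"   # carriage return, ASCII 13

def normalize_newlines(text):
    """Rewrite Windows (CR LF) and classic Mac (CR) line breaks as Unix (LF)."""
    # Replace the two-character sequence first so it is not turned into two breaks.
    return text.replace(CR + LF, LF).replace(CR, LF)

print(ord(LF), ord(CR))                                # 10 13
mixed = "one\r\ntwo\rthree\n"
print(normalize_newlines(mixed).split(LF))             # ['one', 'two', 'three', '']
```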

control character: A character meant to signal an action to the program/hardware that reads the text, e.g. when a teletype machine read the ASCII bell character, it would sound a bell. In ASCII, the whitespace characters other than space itself are considered control characters. The other control characters are generally archaic and so tend to be ignored in modern programs.

§

plain text: text data that consists of just characters, not formatting.

character: an abstract unit of written language.

glyph: a particular visual representation of a character, e.g. the lowercase letter j is a character which can be represented by many different glyphs:

j j j j

character encoding: a scheme for representing a sequence of characters as bits.

Unicode: A standard character set, first published in 1991, that aims to encompass the writing systems of all the world’s languages. Contains 17 “planes”, each of which consists of 65,536 (2^16) “codepoints” (so in total: 1,114,112 codepoints). A codepoint is a number to which a character can be assigned. By convention, Unicode codepoints are denoted by U+ followed by at least four hex digits:

U+0000              (the first codepoint of Unicode)
U+10FFFF            (the last codepoint of Unicode)
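The U+ notation is easy to produce in code. In this sketch, `format_codepoint` is an illustrative helper name; the sample characters are arbitrary.

```python
def format_codepoint(ch):
    """Return the U+ notation for a single character, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"

print(format_codepoint("A"))             # U+0041
print(format_codepoint("\u20ac"))        # U+20AC (euro sign)
print(format_codepoint("\U0001D11E"))    # U+1D11E (musical symbol G clef)
print(format_codepoint(chr(0x10FFFF)))   # U+10FFFF

# 17 planes of 2**16 codepoints each.
print(17 * 2**16)                        # 1114112
```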

The first plane of Unicode, plane 0, the BMP (Basic Multilingual Plane), U+0000 to U+FFFF, is the most important because it contains the characters of nearly all written languages in modern use. The main exception is that CJK (Chinese, Japanese, and Korean) languages don’t entirely fit in the BMP, so tens of thousands of less common Chinese ideographs are put in plane 2, the SIP (Supplementary Ideographic Plane), U+20000 to U+2FFFF.

Plane 1, the SMP (Supplementary Multilingual Plane), U+10000 to U+1FFFF, contains symbols like mathematical and musical notation.

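Plane 3, the TIP (Tertiary Ideographic Plane), U+30000 to U+3FFFF, has held further CJK ideographs since Unicode 13.0 (2020). Planes 4 to 13, U+40000 to U+DFFFF, are totally unused.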

Plane 14, the SSP (Supplementary Special-purpose Plane), U+E0000 to U+EFFFF, contains only a few hundred arcane characters having to do with data protocols.

Planes 15 and 16 are the PUAs (Private Use Areas), U+F0000 to U+10FFFF. These planes are reserved for custom use by any program: they will never be assigned any characters, and so you’re free to use these codepoints for whatever purpose in your own applications.
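Because each plane holds 2^16 codepoints, the plane number is just the codepoint with its low 16 bits dropped. A minimal sketch, where `plane_of` and `PLANE_NAMES` are names chosen for this example and the table covers only the abbreviations used in these notes:

```python
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP", 15: "PUA", 16: "PUA"}

def plane_of(codepoint):
    """Return the plane number (0 to 16) that a Unicode codepoint belongs to."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    return codepoint >> 16   # drop the 16 bits that index within the plane

for cp in (0x0065, 0x1D11E, 0x20065, 0xE0001, 0x10FFFF):
    plane = plane_of(cp)
    label = PLANE_NAMES.get(plane, "no abbreviation listed")
    print(f"U+{cp:04X} is in plane {plane} ({label})")
```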

§

UTF-32: UTF stands for Unicode Transformation Format. UTF-32 is the simplest encoding for Unicode. In UTF-32, each character is simply encoded in four bytes. For instance, the codepoint U+40077 is encoded as:

0000_0000   0000_0100   0000_0000   0111_0111
00          04          00          77
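The same bytes can be reproduced in Python. This sketch assumes the byte order shown above, most significant byte first (big-endian), which Python calls `utf-32-be`; `utf32_be` is a helper name invented for the example.

```python
def utf32_be(codepoint):
    """Encode one codepoint as four bytes, most significant byte first."""
    return codepoint.to_bytes(4, "big")

encoded = utf32_be(0x40077)
print(encoded.hex(" "))   # 00 04 00 77

# Cross-check against Python's built-in big-endian UTF-32 codec.
assert encoded == chr(0x40077).encode("utf-32-be")
```

Note that Python's plain `"utf-32"` codec also prepends a byte order mark, so the explicit `-be` variant is the closer match to the layout shown here.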

UTF-16: A Unicode encoding that uses 2 bytes to encode the codepoints of the BMP but 4 bytes to encode all other codepoints. For example, U+0065 is in the BMP, so it is encoded as:

0000_0000   0110_0101
00          65

To encode a character outside the BMP, we use two pairs of bytes, the first always beginning with the bits 110110, the second always beginning with the bits 110111. That leaves 10 free bits in each pair, 20 in total. The first four of those 20 bits represent the plane minus 1, e.g. plane 3 is represented as 2 while plane 7 is represented as 6. The remaining 16 bits represent the codepoint within the plane: the first 6 of them fill out the first pair and the last 10 go in the second pair. So for example, U+20065 is a codepoint in plane 2, so the plane bits are 0001, and the codepoint within the plane is 0065, so we get:

1101_1000   0100_0000     1101_1100   0110_0101
D8          40            DC          65

These two pairs are not mistaken for the two characters U+D840 and U+DC65 because those codepoints in the BMP are “surrogates”: the codepoints U+D800 to U+DFFF are reserved just for this purpose of 4-byte encodings in UTF-16.
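The surrogate scheme described above can be sketched in a few lines of Python. The function name `utf16_be` is made up for this example, and it assumes big-endian byte order to match the byte layouts shown in these notes. Subtracting 0x10000 yields the 20-bit value whose top four bits are the plane minus 1.

```python
def utf16_be(codepoint):
    """Encode one codepoint as UTF-16, most significant byte first."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded on their own")
    if codepoint <= 0xFFFF:
        return codepoint.to_bytes(2, "big")    # BMP: one pair of bytes
    offset = codepoint - 0x10000               # 20 bits: (plane - 1), then position in plane
    high = 0xD800 | (offset >> 10)             # 110110 followed by the top 10 bits
    low = 0xDC00 | (offset & 0x3FF)            # 110111 followed by the bottom 10 bits
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")

print(utf16_be(0x0065).hex(" "))    # 00 65
print(utf16_be(0x20065).hex(" "))   # d8 40 dc 65

# Cross-check against Python's built-in big-endian UTF-16 codec.
assert utf16_be(0x20065) == chr(0x20065).encode("utf-16-be")
```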

§

UTF-8: A Unicode encoding that uses 1 to 3 bytes per character in the BMP, but 4 bytes for every other character:

The codepoints U+0000 to U+007F are encoded in one byte, like so:

0xxx_xxxx

The codepoints U+0080 to U+07FF are encoded in two bytes, like so:

110x_xxxx   10xx_xxxx

The codepoints U+0800 to U+FFFF are encoded in three bytes, like so:

1110_xxxx   10xx_xxxx   10xx_xxxx

The codepoints U+10000 to U+10FFFF are encoded in four bytes, like so:

1111_0xxx   10xx_xxxx   10xx_xxxx   10xx_xxxx
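The four bit patterns above translate directly into shifts and masks. Here is a minimal sketch of a hand-written encoder; the function name `utf8` is chosen for this example, and the results are checked against Python's built-in codec rather than taken on faith.

```python
def utf8(codepoint):
    """Encode one codepoint as UTF-8 by filling in the x bits of each pattern."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogates are not encoded in UTF-8")
    if codepoint <= 0x7F:                       # 0xxx_xxxx
        return bytes([codepoint])
    if codepoint <= 0x7FF:                      # 110x_xxxx 10xx_xxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint <= 0xFFFF:                     # 1110_xxxx 10xx_xxxx 10xx_xxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    return bytes([0xF0 | (codepoint >> 18),     # 1111_0xxx and three 10xx_xxxx bytes
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])

print(utf8(0x0065).hex(" "))   # 65
print(utf8(0x00E9).hex(" "))   # c3 a9
print(utf8(0x20AC).hex(" "))   # e2 82 ac

# Cross-check one codepoint from each length class against the built-in codec.
for cp in (0x0065, 0x00E9, 0x20AC, 0x20065, 0x10FFFF):
    assert utf8(cp) == chr(cp).encode("utf-8")
```

One consequence of the one-byte pattern is that text containing only ASCII characters has exactly the same bytes in UTF-8 as in ASCII.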

text editor: A program for editing text files, i.e. files containing plain text (text data with no formatting).
