Python – strings and collections

string literals

The only difference between single-quote and double-quote strings is whether you must escape single-quote characters or double-quote characters:

'In a single-quote string, escape the \' character but not the " character.'
"In a double-quote string, escape the \" character but not the ' character."

Python also allows strings delimited by three single-quote marks or three double-quote marks. Such strings can run across multiple lines:

'''Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor
 incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
 exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure
 dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
 Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit
 anim id est laborum.'''
"""Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor
 incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
 exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure
 dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
 Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit
 anim id est laborum."""

When a string is spread this way onto multiple lines, Python’s indentation rules do not apply to the additional lines; the normal indentation rules resume on the line after the string is closed. Any newlines inside the string literal become newline characters in the string, and any leading spaces on the additional lines become part of the string as well.
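To see what actually ends up inside a multi-line literal, here is a minimal sketch (the variable name text is made up for illustration):

```python
# Line breaks and leading spaces inside the literal are kept in the string.
text = """first line
  second line"""

print(repr(text))               # 'first line\n  second line'
print(len(text.splitlines()))   # 2
```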

comparisons with the relational operators

The <, >, <=, and >= operators invoke the methods __lt__, __gt__, __le__, and __ge__, respectively, which normally return True or False. For numbers, the behavior should be obvious, e.g. 3 < 5 returns True.

While all built-in types include these methods, invoking these methods on some types simply triggers a TypeError exception. For example, comparing two dictionaries with any of these operators gives a TypeError exception (with the message ‘unorderable types’ in older Python 3 versions, or a message saying the operator is not supported between instances of 'dict' and 'dict' in newer ones), because ordering two dictionaries doesn’t make any sense. (So why give dictionaries these methods in the first place? Simply to give a more helpful exception message than ‘AttributeError: no such attribute’ when you mistakenly try to compare dictionaries.)
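A small sketch of the dictionary case. The exact wording of the message varies between Python versions, so the code only checks the exception type; the names outcome and same are made up for illustration:

```python
# Ordering comparisons between dictionaries raise TypeError.
try:
    {1: 'a'} < {2: 'b'}
    outcome = 'no exception'
except TypeError as exc:
    outcome = 'TypeError'
    print(exc)               # wording depends on the Python version

# Equality, by contrast, is defined for dictionaries.
same = {1: 'a'} == {1: 'a'}
print(outcome, same)         # TypeError True
```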

Comparisons between sequences (ordered collections of items, such as lists) are defined to find the first pair of items at corresponding indexes that differ and then return the result based upon their comparison:

[5, 7] > [5, 4]           # True (because 7 is greater than 4)
[5, 7] < [3, 7]           # False (because 5 is not less than 3)
[5, 7] <= [3, 7]          # False (because 5 is not less than or equal to 3)

When the sequences match, neither is greater or less than the other:

[5, 7] > [5, 7]           # False
[5, 7] < [5, 7]           # False
[5, 7] <= [5, 7]          # True (because they are equal)
[5, 7] >= [5, 7]          # True (because they are equal)

When both sequences match except one is longer, the longer sequence is considered greater:

[5, 7] < [5, 7, -9]       # True
[5, 7] > [5, 7, -9]       # False
[5, 7] <= [5, 7, -9]      # True
[5, 7] >= [5, 7, -9]      # False

Be clear that each pair of items in the sequences is compared using the same comparison methods (__lt__ et al.), so if the items are themselves sequences, the items of those sequences are compared in the same manner recursively.

When defining your own sequence type, its __lt__, __gt__, __le__, and __ge__ methods should probably conform to these semantics.

A string is a sequence in which each item is a single character, and characters are compared numerically using their code point values, e.g. ‘A’ is less than ‘B’ because ‘A’ is 65 while ‘B’ is 66. Otherwise they are compared just like other sequences:

'hello' < 'hi'            # True ('e' is less than 'i')
'hello' > 'hi'            # False ('e' is not greater than 'i')
'hello' > 'hell'          # True ('hello' is longer than 'hell')
'hello' < 'hell'          # False ('hello' is not shorter than 'hell')
'hello' <= 'hello'        # True (both are equal)
'hello' <= 'hell'         # False ('hello' is longer than 'hell')
'hello' >= 'hell'         # True ('hello' is longer than 'hell')
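The code point values behind these comparisons can be inspected with the built-in function ord. A brief sketch (the name upper_first is made up for illustration):

```python
# ord returns the code point that string comparison is based on.
print(ord('A'), ord('B'))    # 65 66
print(ord('Z'), ord('a'))    # 90 97

# Every uppercase ASCII letter therefore compares less than every lowercase one.
upper_first = 'Zebra' < 'apple'
print(upper_first)           # True, because 90 < 97
```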

tuples

The built-in tuple type is a sequence type which is basically an immutable list: the objects referenced by a tuple may themselves be mutable, but a tuple can change neither which objects it references nor the number of objects it references. The tuple type has almost all the same operations and methods as the list type except, of course, for the ones that modify the contents, e.g. you can’t assign to an index of a tuple.
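The distinction between the tuple being fixed and its contents being mutable can be sketched like this (the names t and rebound are made up for illustration):

```python
t = ([1, 2], 'hi')       # a tuple whose first item is a mutable list

try:
    t[0] = [9]           # assigning to an index of the tuple is not allowed
    rebound = True
except TypeError:
    rebound = False

t[0].append(3)           # but the list it references can still change
print(rebound)           # False
print(t)                 # ([1, 2, 3], 'hi')
```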

Calling the tuple class (‘tuple’ in the builtins namespace) with no arguments creates an empty tuple:

tuple()      # a new empty tuple

Calling tuple with a sequence argument creates a tuple with the same contents:

tuple([7, 3, 5])        # a tuple with the items: 7, 3, 5

A tuple can also be created with literal syntax like a list but with () instead of []. This can be terribly confusing until you get used to distinguishing tuple literals from other uses of parens. While an empty tuple is written as just an empty pair of parens, a tuple with just one item must be written specially with a comma after the item to distinguish the parens as a tuple literal rather than just an expression surrounded in parens:

(6, 2, 'hi')      # a tuple with the items: 6, 2, 'hi'
()                # an empty tuple
(7,)              # a tuple with one item: 7
(7)               # just an expression of the value 7

 

dictionary keys

As already mentioned, Python dictionaries offer more flexibility than JavaScript objects because dictionary keys needn’t be just strings or numbers. However, not just any object can be a dictionary key. The rule is that a dictionary key must have a fixed value. This effectively means the key not only must be immutable, it must itself only be composed of other immutable objects, which in turn must only be composed of immutable objects; it’s immutable objects all the way down. So to use a tuple as a key in a dictionary, that tuple cannot, say, reference any list or any object with mutable attributes, or for that matter cannot reference any other tuples if those tuples themselves contain mutable objects:

{(3, 4): 5}             # OK
{(3, []): 5}            # exception: the tuple key contains a mutable list
{(3, (8, []), 7): 5}    # exception: the tuple key contains a tuple which itself contains a mutable list

In Python terminology, objects with a fixed value are called ‘hashable’. We’ll discuss the term ‘hash’ in a later unit when we cover hash functions.
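One way to test whether an object is hashable is to pass it to the built-in function hash and see whether a TypeError is raised. A minimal sketch, where is_hashable is a hypothetical helper and not part of Python:

```python
def is_hashable(obj):
    """Return True if obj could serve as a dictionary key (hypothetical helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable((3, 4)))            # True
print(is_hashable((3, [])))           # False
print(is_hashable((3, (8, []), 7)))   # False
```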

sequence and mapping operations

Collection types in Python are grouped into sequence types (collections with a concept of order, indexed numerically from 0) and mapping types (collections indexed by keys rather than by position). Sequence types include string, list, tuple, and a few others we haven’t yet discussed, but Python includes only one built-in mapping type, dictionary. (Even if there’s just one mapping type that comes stock in Python, the concept has value because programmers may create their own mapping types.) In object-oriented programming terms, ‘sequence’ and ‘mapping’ effectively refer to informal interfaces, types defined as a set of operations. First we’ll look at the operations which sequences and mappings have in common:

get and set items

To retrieve items from a sequence or map, we postfix the collection object with the [] operator and specify an index inside:

a = [6, 2, 14]
a[0]                    # 6 (1st item in the sequence)
a[2]                    # 14 (3rd item in the sequence)
b = {71: 'moose', 'North Dakota': 11}
b[71]                   # 'moose' (the value of key 71)

(As said before, don’t be confused by the interchangeability of the [] and . operators in JavaScript. These operators are unrelated in Python: [] is for items of collections while . is for object attributes.)

Indexing a string returns a single-character string:

s = 'pickled herring'
s[3]                 # return 'k' (the 4th character in the string)

For sequences, you can specify negative indexes to retrieve items from the end:

a = [6, 2, 14]
a[-1]                 # return 14 (last item in the sequence)
a[-2]                 # return 2 (second-to-last item in the sequence)
a[-3]                 # return 6 (third-to-last item in the sequence)

Retrieving an index that does not exist in the collection triggers an exception:

a = [6, 2, 14]
a[7]                 # exception: out of bounds

To set items in a sequence or mapping, we use [] in an assignment context:

a = [6, 2, 14]
a[1] = 'hi'                 # set the 2nd item to 'hi'
a                           # [6, 'hi', 14]

For sequences, you can only set indexes that already exist, but for maps, the key needn’t already exist:

a = [6, 2, 14]
a[5] = 'hi'                 # exception: list doesn't already have a 6th item
b = {71: 'moose', 'North Dakota': 11}
b['goat'] = 2               # OK, adds new key-value pair
b                           # {71: 'moose', 'North Dakota': 11, 'goat': 2}

Strings and tuples are immutable, so attempting to modify a string or tuple raises an exception:

a = (6, 2, 14)
a[1] = ‘hi’                 # exception: can’t modify a tuple

Uses of the [] operator are really method calls in disguise. Getting an item with [] invokes the __getitem__ method while setting an item with [] invokes the __setitem__ method:

a = [6, 2, 14]
a[2]                        # a.__getitem__(2)
a[1] = ‘hi’                 # a.__setitem__(1, ‘hi’)

If you were to create your own collection type, you would define its getting and setting behavior by defining its __getitem__ and __setitem__ methods.
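As a rough sketch of such a type, the class below answers obj[index] with the square of the index unless a value has been assigned there. The class name Squares and its _overrides attribute are invented for this illustration:

```python
class Squares:
    """A toy collection (hypothetical) showing __getitem__ and __setitem__."""

    def __init__(self):
        self._overrides = {}

    def __getitem__(self, index):
        # Invoked by: obj[index]
        return self._overrides.get(index, index ** 2)

    def __setitem__(self, index, value):
        # Invoked by: obj[index] = value
        self._overrides[index] = value

s = Squares()
print(s[4])        # 16
s[4] = 'hi'
print(s[4])        # hi
```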

get the number of items

Collection types include a method __len__ (short for ‘length’) which returns the number of items in the collection:

a = [6, 2, 14]
a.__len__()                 # 3

(For dictionaries, an item is defined as a key-value pair, so __len__ returns the number of key-value pairs.)

Rather than call this method directly, the preferred practice is to call the builtins module function len, which itself calls __len__:

a = [6, 2, 14]
len(a)                      # 3

(The reason for having this function is simply stylistic. Getting the length of a collection is so common that Python decided that invoking a method wasn’t succinct enough. Recall that, in general, any attributes surrounded in double underscores are only meant to be used indirectly.)

test for membership with in and not in

Collection types have a method __contains__ which returns True or False depending upon if the collection contains an item (or key, in the case of mappings) equal (not necessarily identical!) to the specified value:

a = [6, 2, 14]
a.__contains__(14)                     # True
a.__contains__(‘hello’)                # False
b = {71: ‘moose’, ‘North Dakota’: 11}
b.__contains__(‘North Dakota’)         # True
b.__contains__(33)                     # False
b.__contains__(11)                     # False (11 is a value in the dictionary, not a key)

Rather than call __contains__ directly, preferred practice is to use the in operator, which itself invokes __contains__:

a = [6, 2, 14]
14 in a                               # True
b = {71: ‘moose’, ‘North Dakota’: 11}
‘North Dakota’ in b                   # True
33 in b                               # False

The not in operator (written as two separate words but considered a single operator) simply returns the logical inverse of in:

a = [6, 2, 14]
14 not in a                               # False
b = {71: ‘moose’, ‘North Dakota’: 11}
‘North Dakota’ not in b                   # False
33 not in b                               # True
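The point that membership is based on equality rather than identity can be sketched as follows (the names x and y are made up for illustration):

```python
a = [6, 2, 14]
print(2.0 in a)            # True: 2.0 == 2 even though one is a float and one an int

x = [1, 2]
y = [1, 2]                 # equal to x but a distinct list object
print(x is y)              # False
print(y in [x])            # True, because y == x
```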

 

removing items

Items can be removed by index from mutable collections using the __delitem__ method:

a = [6, 2, 14]
a.__delitem__(1)                    # remove item at index 1
a                                   # [6, 14]
b = {71: ‘moose’, ‘North Dakota’: 11}
b.__delitem__(‘North Dakota’)       # removes item with the key ‘North Dakota’
b                                   # {71: ‘moose’}

Rather than invoke __delitem__ directly, preferred practice is to use the del statement with an indexing expression:

a = [6, 2, 14]
del a[1]                      # a.__delitem__(1)
b = {71: ‘moose’, ‘North Dakota’: 11}
del b[‘North Dakota’]         # b.__delitem__(‘North Dakota’)

The del statement is used for a few other purposes, as we’ll see later.

iterators

An iterator, as the name implies, is an object useful for iterating over every item in a collection. In Python, iterators are used most commonly in conjunction with the for-in loop (discussed later).

The collection types include a method __iter__ for returning an iterator object. Each call to an iterator’s __next__ method returns, in order, one item of the collection; once all the values have been iterated through, subsequent calls to __next__ will throw a StopIteration exception:

a = [6, 2, 14]
i = a.__iter__()               # assign i an iterator over the list
i.__next__()                   # 6
i.__next__()                   # 2
i.__next__()                   # 14
i.__next__()                   # exception: StopIteration

For dictionaries, the iterator returned by __iter__ iterates over just the keys, not the values.

b = {71: 'moose', 'North Dakota': 11}
i = b.__iter__()               # assign i an iterator over the dictionary
i.__next__()                   # 71
i.__next__()                   # 'North Dakota'
i.__next__()                   # exception: StopIteration

(Above, the keys come back in the order in which they were inserted, which is what Python 3.7 and later guarantee. In earlier versions, the order was determined only by the details of Python’s dictionary implementation and was effectively arbitrary, although deterministic for a given dictionary. So if your code must run on older versions, don’t write code that expects the keys in any particular order.)

Rather than call __iter__ directly, preferred practice is to call the builtins module function iter, which itself calls __iter__. The builtins module also includes the function next, which itself calls __next__:

a = [6, 2, 14]
i = iter(a)                    # assign i an iterator over the list
next(i)                        # 6

Iterators in Python are generally ‘live’ or ‘active’, meaning that they reflect changes in the collections over which they iterate. For example, given an iterator over a list, changes made to the list after creation of the iterator show up in calls to next on the iterator:

a = [6, 2, 14]
i = iter(a)                    # assign i an iterator over the list
next(i)                        # 6
a[1] = 77
next(i)                        # 77

Effectively, the iterator advances through the indexes each time next is called and only stops when the following index goes past the last valid index of the underlying collection. However, once an iterator has thrown StopIteration, it is effectively dead and will not return any more values, even if we append more items to the underlying collection.
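Both behaviors can be observed with a list iterator. This is a sketch of how the built-in list iterator behaves in CPython; the names exhausted and leftover are made up for illustration:

```python
a = [6, 2]
i = iter(a)
print(next(i))     # 6
a.append(14)       # the list grows before the iterator reaches the end
print(next(i))     # 2
print(next(i))     # 14 (the appended item is seen)

try:
    next(i)
    exhausted = False
except StopIteration:
    exhausted = True

a.append(99)       # appending after exhaustion does not revive the iterator
leftover = list(i)
print(exhausted, leftover)    # True []
```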

sequence operations

These operations apply to sequences but not mappings:

concatenate sequences together

The __add__ method (and hence the + operator) is defined for sequence types to return a new sequence which is the concatenation of two sequences. The two sequences must be of the same type, and the type of sequence returned is the same:

‘hello, ’ + ‘world!’                      # ‘hello, world!’
[6, 2, 14] + [7, 33]                      # [6, 2, 14, 7, 33]
(7, 33) + (6, 2, 14)                      # (7, 33, 6, 2, 14)

Note that the order of the operands matters. Also be clear that the operation produces a new sequence object; the original sequence objects remain unmodified.
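A short sketch of both points: the result is a new object, and operands of different sequence types are rejected (the name mixed_ok is made up for illustration):

```python
a = [6, 2, 14]
b = [7, 33]
c = a + b
print(c)               # [6, 2, 14, 7, 33]
print(a, b)            # [6, 2, 14] [7, 33]  (operands untouched)
print(c is a)          # False

try:
    [1, 2] + (3, 4)    # list + tuple: different sequence types
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)        # False
```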

concatenate a sequence with itself

The __mul__ method (and hence the * operator) is defined for sequence types to return a new sequence which is the concatenation of the original with itself some number of times. One operand is the original sequence, and the other is an integer zero or greater:

‘moo’ * 2                     # ‘moomoo’
‘moo’ * 4                     # ‘moomoomoomoo’
‘moo’ * 1                     # ‘moo’
‘moo’ * 0                     # ‘’
[11, 8] * 5                   # [11, 8, 11, 8, 11, 8, 11, 8, 11, 8]
[11, 8] * 0                   # []

Like +, this operation always returns a new object without modifying the original.

When you make the integer the left operand, Python first tries __mul__ of the integer type, not of any sequence type. The integer’s __mul__ doesn’t know how to multiply by a sequence and returns NotImplemented, so Python then falls back to the sequence’s reflected method, __rmul__, which does the same kind of repeated concatenation:

5 * [11, 8]                   # effectively the same as [11, 8] * 5
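For readers curious about the mechanics of the integer-on-the-left case, the methods involved can be called directly. This is only an exploratory sketch; in normal code you would just write the * expression:

```python
# Calling the integer's method directly shows that it declines the operation.
print((5).__mul__([11, 8]))          # NotImplemented

# The list's reflected method performs the repetition instead.
print([11, 8].__rmul__(2))           # [11, 8, 11, 8]
print(2 * [11, 8] == [11, 8] * 2)    # True
```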

getting the largest and smallest items

To return the smallest and largest items from a sequence, we have the builtins module functions min and max:

min([17, 8, -3, 200])                     # -3
max([17, 8, -3, 200])                     # 200

For min and max to work on a sequence, all the items in it must be comparable with each other using the comparison operators (__lt__ et al.).

find an item in a sequence and its number of occurrences

Sequence types have a method index which, given an object, returns the index of the first occurrence in the sequence of an item equal to the argument (as determined with the == operator):

[9, ‘hello’, -3, ‘hello’, 87].index(‘hello’)          # 1 (the index of the first occurrence of ‘hello’)

When no item equals the argument, an exception is thrown:

[9, ‘hello’, -3, ‘hello’, 87].index(100)        # exception: ValueError

By default, index searches from index 0, but you can specify the starting index with a second argument:

[9, ‘hello’, -3, ‘hello’, 87].index(‘hello’, 2) # 3 (search started from index 2, so found second occurrence of ‘hello’)

You can also specify the index at which to stop searching with an optional third argument.
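A sketch of the three-argument form using a list, where the search covers the indexes from the start up to but not including the stop (the names items and found are made up for illustration):

```python
items = [9, 'hello', -3, 'hello', 87]
print(items.index('hello', 2, 4))     # 3 (only indexes 2 and 3 are searched)

try:
    items.index('hello', 2, 3)        # only index 2 is searched; no match there
    found = True
except ValueError:
    found = False
print(found)                          # False
```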

Sequences also have a method count, which returns the number of occurrences of a value in a sequence:

[9, ‘hello’, -3, ‘hello’, 87].count(‘hello’)          # 2 (‘hello’ found twice in the list)

mutable sequence operations

These operations apply only to mutable sequences, namely lists and bytearrays (discussed later).

add items to the end

Mutable sequences can expand in length, but we can’t simply use [] to assign to indexes beyond the current end of the sequence (as we could in JavaScript). Instead, we must use the methods append or extend. The append method appends its single argument to the end of the sequence:

a = [3, 61, 9]
a.append(‘yo’)
a                 # [3, 61, 9, ‘yo’]

The extend method appends all of the items of another sequence:

a = [3, 61, 9]
b = [22, ‘hi’, 88, 5]
a.extend(b)
a                 # [3, 61, 9, 22, ‘hi’, 88, 5]
b                 # [22, ‘hi’, 88, 5]      (Note that extend does not modify the other sequence.)
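The difference between the two methods is easiest to see when the argument is itself a list. A brief sketch:

```python
a = [3, 61, 9]
a.append([22, 'hi'])      # the list itself becomes one new item
print(a)                  # [3, 61, 9, [22, 'hi']]
print(len(a))             # 4

b = [3, 61, 9]
b.extend([22, 'hi'])      # each item of the argument is appended
print(b)                  # [3, 61, 9, 22, 'hi']
print(len(b))             # 5
```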

insert an item at an index

If you wish to add an item to somewhere in the middle rather than the end, use the insert method, which takes as argument an index and the value to insert there:

a = [3, 61, 9]
a.insert(1, ‘hi’)        # insert ‘hi’ at index 1
a                        # [3, ‘hi’, 61, 9]

remove items by value

The remove method removes the first occurrence of an item equal to the argument:

a = [3, 61, 9, 8, 61, 7]
a.remove(61)
a                      # [3, 9, 8, 61, 7]

If no matching value is found, remove throws a ValueError exception.

Note that for the same effect we could simply use index and del:

a = [3, 61, 9, 8, 61, 7]
i = a.index(61)             # assign index of first occurrence of 61 to i
del a[i]                    # delete item at index i
a                           # [3, 9, 8, 61, 7]

pop items

A pop operation returns and removes an item from a collection. The pop method returns and removes the item at an index, by default the last index:

a = [3, 61, 9, 8, 61, 7]
x = a.pop()                 # pop last index
a                           # [3, 61, 9, 8, 61]
x                           # 7
x = a.pop(2)                # pop index 2
a                           # [3, 61, 8, 61]
x                           # 9

reverse the items

The reverse method reverses the order of the items:

a = [3, 61, 9, 8, 61, 7]
a.reverse()
a                         # [7, 61, 8, 9, 61, 3]

sorting the items

The sort method sorts the items using the comparison methods of the items, e.g. numbers are sorted by numeric value:

a = [3, 61, 9, 8, 61, 7]
a.sort()
a                         # [3, 7, 8, 9, 61, 61]

To get a reverse sorting, provide a keyword argument reverse with the value True:

a = [3, 61, 9, 8, 61, 7]
a.sort(reverse=True)
a                         # [61, 61, 9, 8, 7, 3]

To modify the values themselves which are used in the comparisons, provide a function to the keyword argument key. The function should take one argument. When the values are compared, each value is passed to this function, and the value returned is used in the comparison instead—but just for the comparison: the sorted list still contains the original values. For example, we can sort a list of numbers by their sine using the math module function sin:

import math
a = [3, 61, 9, 8, 61, 7]
a.sort(key=math.sin)
a                    # [61, 61, 3, 9, 7, 8] (so apparently 61 has the least sine while 8 has the greatest)

Because sorting relies upon comparison, you’ll get an exception if you sort a list containing objects which can’t be compared with each other:

a = [3, 61, ‘hello’, 8, 61, 7]
a.sort()                          # exception: can’t compare numbers to strings

The bytearray type (discussed later) doesn’t support sort, as it doesn’t make sense to sort bytes by their value.

mapping operations

These operations apply to mappings, namely the built-in dictionary type. (While Python includes just one mapping type, you may write your own additional mapping types, in which case these are the operations you should have it support.)

remove all items

The clear method removes all items:

a = {3: ‘hi’, ‘North Dakota’: 234, -24: 6}
a.clear()
a                                     # {}

copy the mapping

The copy method returns a shallow copy of the mapping:

a = {3: ‘hi’, ‘North Dakota’: 234, -24: 6}
b = a.copy()
a == b                 # True
a is b                 # False

You can also copy a dictionary by passing it to an invocation of the dict class:

a = {3: ‘hi’, ‘North Dakota’: 234, -24: 6}
b = dict(a)
a == b                 # True
a is b                 # False

(For a plain dictionary the two have the same effect. The copy method states the intent most directly, and any speed difference between the two is an implementation detail. On the other hand, dict is more general because it can create a dictionary from any kind of mapping, not just other dictionaries.)

retrieve a value

Retrieving values with [] throws an exception if the mapping doesn’t have the specified key. The get method retrieves a value, but upon failing to find the key, it returns a value rather than throw an exception. By default, this value is None, but another value can be specified by an optional second argument:

a = {3: ‘hi’}
a.get(3)                    # ‘hi’
a.get(88)                   # no key 88, so return None
a.get(88, ‘yo’)             # no key 88, so return ‘yo’

get a collection of the keys and values

The keys method returns a view object of all the keys. A view object is like a wrapper around a collection, in this case presenting the keys like a sequence but only allowing us to perform three operations: len to get the number of keys, iter to get an iterator over the keys, and in to test whether a value is found in the keys. If you need a proper list or tuple of the keys, pass the view object as argument to list or tuple:

a = {3: ‘hi’, 7: -9}
a.keys()                     # a view object of the keys: 3, 7
list(a.keys())               # [3, 7], a list of the keys

The reason Python doesn’t produce a list or tuple from the keys directly is mainly for efficiency: producing a sequence of the keys requires a lot of work iterating through the mapping whereas producing the view object takes a fixed amount of work no matter how large the mapping. Another reason is that the view objects are ‘live’ representations of the wrapped collection: as the mapping gets updated, these changes get reflected in the view object. (In many ways, view objects are much like iterators.) The same cannot be said of a list object:

a = {3: 'hi', 7: -9}
kv = a.keys()                 # a view object of the keys
kl = list(a.keys())           # a list of the keys
len(kv)                       # 2
len(kl)                       # 2
a[8] = 'bye'                  # add a new key
len(kv)                       # 3 (reflects the change in the dictionary)
len(kl)                       # 2 (doesn't reflect the change in the dictionary)

The values method returns a view object of the values in the mapping:

a = {3: ‘hi’, 7: -9}
a.values()                    # a view object of the values: ‘hi’, -9

The items method returns a view object of the items in the mapping, each item expressed as a tuple of the key and value:

a = {3: ‘hi’, 7: -9}
a.items()                     # a view object of the items: (3, ‘hi’), (7, -9)

popping keys and items

The pop method of mappings returns the value of the specified key and removes that item. If the key is not found, a KeyError exception is thrown, unless an optional second argument is provided for a default return value:

a = {3: ‘hi’, 7: -9}
a.pop(7)                     # -9
a                            # {3: ‘hi’}
a.pop(‘avast’)               # exception: KeyError
a.pop(‘avast’, 88)           # 88 (key ‘avast’ wasn’t found)

The popitem method returns and removes an item (expressed as a tuple). No key is specified: in Python 3.7 and later, the most recently inserted item is removed, while earlier versions chose an item arbitrarily. If the mapping is empty, popitem throws a KeyError exception:

a = {3: ‘hi’, 7: -9}
a.popitem()                  # (7, -9)
a.popitem()                  # (3, ‘hi’)
a.popitem()                  # exception: KeyError

set a key only if no such key already exists

The setdefault method returns the value of the specified key, but if no such key is found, it first sets the key and then returns the key’s new value; this value is None if not specified by an optional second argument:

a = {3: 'hi', 7: -9}
a.setdefault(3)              # 'hi'
a.setdefault(4)              # None
a                            # {3: 'hi', 7: -9, 4: None}
a.setdefault(8, 'yo')        # 'yo'
a                            # {3: 'hi', 7: -9, 4: None, 8: 'yo'}
a.setdefault(3, 'ahoy')      # 'hi' (key found, so the mapping is not modified)
a                            # {3: 'hi', 7: -9, 4: None, 8: 'yo'}

copy items from one mapping into another

The update method modifies a mapping to incorporate all the items from another mapping. Where both mappings have a key in common, the key in the original gets set to the value of that key in the other mapping:

a = {3: 'hi', 7: -9}
a.update({'dog': 6, 3: 77})
a                             # {3: 77, 7: -9, 'dog': 6}     (notice that key 3 gets set to 77)

shallow vs. deep copies

When talking about copying objects, we need to distinguish between ‘shallow’ copies and ‘deep’ copies. A shallow copy copies only the object itself, not any of the other objects referenced in that object, and so all the references in the copy point to the same objects as in the original. A deep copy, in contrast, copies not just the object itself but all of the objects it references such that the new copy consists of references to all new objects: effectively, nothing in the original object and the new copy is shared.

The obvious problem with deep copies is that they are much less efficient, requiring more time and space, so shallow copies are much more common. When we talk about copying objects in Python, you should assume we’re talking about shallow copies unless otherwise stated. The + and * sequence operations, for example, perform only shallow copies.
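The difference can be demonstrated with the standard library’s copy module, which provides copy.copy for shallow copies and copy.deepcopy for deep copies. A minimal sketch (the variable names are made up for illustration):

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)        # new outer list, same inner lists
deep = copy.deepcopy(original)       # new outer list and new inner lists

original[0].append(99)
print(shallow[0])                    # [1, 2, 99]  (shared with the original)
print(deep[0])                       # [1, 2]      (independent)

# The * operator also copies shallowly: all three items are the same list.
row = [[0]] * 3
row[0].append(1)
print(row)                           # [[0, 1], [0, 1], [0, 1]]
```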

for-in loop

The for-in loop is a more syntactically compact way than the while loop of iterating over the items of an iterable object. The general form of for-in is:

for target in iterable:
    body

…where iterable is an object with an __iter__ method for producing an iterator. A for-in starts by calling __iter__ to get an iterator of the iterable. In each iteration, the value returned by the iterator’s __next__ method is assigned to the variable target before the body is executed. The loop ends when __next__ raises a StopIteration exception, which is then caught by the loop, and execution leaves the loop. For example,

for x in [8, 3, “moo”]:
    print(x)

This for-in will call print three times, first with argument 8, then with 3, then with “moo”. If the object to iterate over doesn’t have an __iter__ method, for-in will try to use __getitem__ instead, starting with an index of 0 and incrementing from there until __getitem__ raises an IndexError because we’ve gone past the last index. Assignment to the for-in target is just like a local assignment because the for-in target variable is a regular variable just like any other in its scope. So after a for-in loop ends, the target variable will still exist in that scope:

# will print 5, 6, then 7
for x in [5, 6, 7]:
    print(x)
x                           # 7
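The __getitem__ fallback mentioned above can be sketched with a small class that deliberately has no __iter__ method. Countdown and seen are made-up names for this illustration:

```python
class Countdown:
    """Hypothetical class with __getitem__ but no __iter__."""

    def __getitem__(self, index):
        if index >= 3:
            raise IndexError(index)   # signals the end of the iteration
        return 3 - index

seen = []
for n in Countdown():                 # for-in calls __getitem__ with 0, 1, 2, 3
    seen.append(n)
print(seen)                           # [3, 2, 1]
```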

The established idiom for looping n times in Python is to use a range with for-in:

# will iterate 5 times with the values: 0, 1, 2, 3, and then 4
for x in range(5):
    …
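As a rough sketch of the steps a for-in performs (get an iterator, call next until StopIteration is raised), the loop over [8, 3, "moo"] shown earlier could be spelled out with while. This is only an approximation for illustration, and the names it and collected are made up:

```python
items = [8, 3, "moo"]
collected = []

it = iter(items)               # what the loop does first
while True:
    try:
        x = next(it)           # assign the next item to the target
    except StopIteration:
        break                  # the loop ends quietly
    collected.append(x)        # stands in for the loop body

print(collected)               # [8, 3, 'moo']
```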

list comprehensions

When we create a list, we very often create it based upon some other existing sequence. For example, we may create a new list in which each item is the square of each number in another list:

orig = [5, 3, 2, 10]
new = []
for x in orig:
    new.append(x ** 2)                  # append the square of the item to the new list (** is the power operator)
new                                 # [25, 9, 4, 100]

A list comprehension can do all of this work in a single expression that returns the new list:

[output for target in input]

…where each item of input (a sequence) gets assigned in turn to the target (a variable), and then the items of the new list are produced from output (an expression). For example, we can get the same new list as in the above example with this list comprehension:

[x ** 2 for x in [5, 3, 2, 10]]                       # [25, 9, 4, 100]

However, there is one subtle semantic difference between using for-in statements and list comprehensions. With a for-in statement, the target variable is a regular variable of the current scope:

x = ‘hi’
for x in [5, 3, 2, 10]:
    print(x)
x                                  # 10, the last value assigned to x

A list comprehension, in contrast, is actually its own local scope, so the list comprehension does not assign to x of the scope containing it:

x = ‘hi’
[x ** 2 for x in [5, 3, 2, 10]]
x                                  # ‘hi’ (x unchanged by the comprehension)

A list comprehension may optionally include an if clause which can filter out iterations from the resultant list:

[output for target in input if condition]

For each iteration, the condition is evaluated, and if false, the output is neither evaluated nor added to the new list.

[x ** 2 for x in [5, 3, 2, 10] if x < 9]                # [25, 9, 4]

Above, because the condition tested false in the last iteration, no output was generated for that iteration. We get the same effect writing:

new = []
for x in [5, 3, 2, 10]:
    if x < 9:
        new.append(x ** 2)
new                                 # [25, 9, 4]

A list comprehension can get busier still with any number of additional for-in clauses (each optionally with its own if clause), e.g.:

[output for target in sequence if condition for target in sequence if condition for target in sequence if condition]

To picture what such complicated comprehensions do, simply imagine them written out as a series of nested for-in and if statements, read left to right, with the output in the innermost block. For example:

[(x + y + z) ** 2 for x in [5, 2] if x < 4 for y in [7, 3, -1] for z in [2, 9, 15] if z <= 12]

…would be written out in statements as:

new = []
for x in [5, 2]:
    if x < 4:
        for y in [7, 3, -1]:
            for z in [2, 9, 15]:
                if z <= 12:
                    new.append((x + y + z) ** 2)
new         # [121, 324, 49, 196, 9, 100]

Complicated comprehensions can be quite hard to read, so it’s often best to favor the more verbose but clearer form of a series of nested for-in’s and if’s.
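That said, a comprehension with just two for-in clauses is often still readable. As a sketch (matrix, flat, and pairs are illustrative names, not from the text above), flattening a list of lists and pairing up two sequences both read naturally left to right:

```python
# Hypothetical nested data, for illustration only.
matrix = [[1, 2, 3], [4, 5], [6]]

# Outer loop first (rows), inner loop second (items of each row).
flat = [item for row in matrix for item in row]
# [1, 2, 3, 4, 5, 6]

# Every combination of a letter and a number, in nested-loop order.
pairs = [(a, b) for a in 'ab' for b in [1, 2]]
# [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
```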

bytes and bytearray

 

For dealing with raw sequences of bytes, Python has the bytes (immutable) and bytearray (mutable) types. Each byte is represented as an integer from 0 to 255, e.g. 8 represents the byte 0000_1000 (8 in binary is 1000). We can create bytes objects and bytearray objects from sequences of such integers:

x = bytes([240, 102, 133, 7])             # an immutable sequence of 4 bytes
y = bytearray([240, 102, 133, 7])         # a mutable sequence of 4 bytes
y[0]                                      # 240
y[1] = 47                                 # change second byte to 47
x[1] = 47                                 # exception: can’t modify a bytes object
y[1] = 702                                # exception: not a valid byte value

Alternatively, we can create a bytes object with a literal syntax of a string prefixed with b or B:

b‘asdf’            # an immutable sequence of 4 bytes (one for each character)

Each character represents its Unicode code point as a byte, e.g. ‘a’ represents the byte 97. Only ASCII characters (code points 0 to 127) are allowed in these literals; any other byte value must be written with an escape sequence, e.g. \xff for the byte 255.

The bytes and bytearray types are sequences and so support most of the usual sequence operations.
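As a rough sketch of what those sequence operations look like in practice (the variable names are made up for illustration):

```python
data = b'asdf'

first = data[0]             # 97: indexing yields an integer
middle = data[1:3]          # b'sd': slicing yields a new bytes object
has_s = 115 in data         # True: 115 is the byte for 's'
escaped = list(b'a\xff')    # [97, 255]: \xff writes a non-ASCII byte

# A bytearray supports the same operations plus in-place changes.
buf = bytearray(data)
buf.append(33)              # add the byte for '!'
buf[0] = 65                 # replace 'a' with 'A'
result = bytes(buf)         # b'Asdf!'
```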

raw string literals

Normally when we write a string literal, \ denotes the start of an escape sequence, e.g. \n. A string literal preceded with r or R is a ‘raw’ string literal in which \ just denotes itself:

‘bla\n\\bla’                  # the text ‘bla(newline)\bla’
r‘bla\n\\bla’                 # the text ‘bla\n\\bla’

Raw strings can come in handy in a few cases, such as writing regular expressions.
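To illustrate the regular expression case, here is a small sketch using the standard re module (the pattern and sample text are made up). The raw literal and the escaped literal produce exactly the same string, but the raw one is easier to read:

```python
import re

raw_pattern = r'\d+\.\d+'           # digits, a literal dot, digits
escaped_pattern = '\\d+\\.\\d+'     # the same text without the r prefix

same = raw_pattern == escaped_pattern
matches = re.findall(raw_pattern, 'pi is 3.14 and e is 2.71')
# matches is ['3.14', '2.71']

# In a raw literal, \n is two characters: a backslash and an n.
raw_length = len(r'\n')             # 2
```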

string operations

As already discussed, a string is a kind of immutable sequence and so supports many of the usual sequence operations. In addition, the str type also includes several methods:

find, rfind, index, and rindex

The find method returns the first index at which the argument is found as a substring (returning -1 if no match is found):

‘hello, there’.find(‘hell’)               # 0 (the substring ‘hell’ found starting at index 0)
‘hello, there’.find(‘the’)                      # 7 (the substring ‘the’ found starting at index 7)
‘hello, there’.find(‘then’)               # -1 (the substring ‘then’ not found)
‘hello, there’.find(‘he’)                       # 0 (the substring ‘he’ found first starting at index 0)

Search of the string can be constrained by optional start and end index arguments (where end is non-inclusive):

‘hello, there’.find(‘hell’, 4)                  # -1 (‘hell’ not found in range starting at index 4)
‘hello, there’.find(‘h’, 4)               # 8 (the substring ‘h’ found starting at index 8 in range starting at index 4)
‘hello, there’.find(‘h’, 4, 8)                  # -1 (the substring ‘h’ not found in range of index 4 to index 7)
‘hello, there’.find(‘the’, 0, 5)                # -1 (the substring ‘the’ not found in range of index 0 to index 4)

The rfind method (‘right find’) works the same as find but searches starting from the right, not the left:

‘hello, there’.rfind(‘he’)                      # 8 (the substring ‘he’ found first from right at index 8)

The index and rindex methods work just like find and rfind, respectively, except index and rindex raise a ValueError exception rather than return -1 when no match is found:

‘hello, there’.index(‘purple’)            # exception ValueError (the substring ‘purple’ not found)
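The two styles of reporting failure call for different handling. A minimal sketch (variable names are illustrative): the result of find must be compared against -1 explicitly, because a match at index 0 is falsy while -1 is truthy.

```python
text = 'hello, there'

# find: check against -1 rather than relying on truthiness.
position = text.find('hell')            # 0
found_by_find = position != -1          # True, even though position is 0

# index: a missing substring raises ValueError instead.
try:
    text.index('purple')
    found_by_index = True
except ValueError:
    found_by_index = False

# When only presence matters, the in operator is simpler.
present = 'the' in text                 # True
```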

startswith and endswith

The startswith method tests whether the string begins with the argument as substring:

‘hello, there’.startswith(‘hell’)               # True
‘hello, there’.startswith(‘the’)                      # False
‘hello, there’.startswith(‘help’)               # False

The test can be modified by optional start and end index arguments (where end is non-inclusive):

‘hello, there’.startswith(‘the’, 7)             # True (‘the’ found starting at index 7)
‘hello, there’.startswith(‘the’, 5)             # False
‘hello, there’.startswith(‘hello’, 0, 4)        # False (‘hello’ found but exceeds the end index)

The first argument may optionally be a tuple of strings to match such that startswith returns True if any match:

‘hello, there’.startswith((‘hell’, ‘he’, ‘derp’))     # True
‘hello, there’.startswith((‘herp’, ‘derp’))           # False

The endswith method works like startswith but matches at the end of the string:

‘hello, there’.endswith(‘ere’)                  # True
‘hello, there’.endswith(‘were’)                 # False
‘hello, there’.endswith(‘the’, 0, -2)           # True

(Notice that the optional start and end indexes constrain the range of the search just like in startswith.)
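The tuple form is handy for filtering names by prefix or suffix. A sketch with made-up file names:

```python
# Hypothetical file names, for illustration only.
filenames = ['report.txt', 'photo.jpg', 'notes.md', 'tmp_cache.txt']

# Keep names ending in either of two suffixes.
documents = [n for n in filenames if n.endswith(('.txt', '.md'))]
# ['report.txt', 'notes.md', 'tmp_cache.txt']

# Then drop the ones with a temporary-file prefix.
keepers = [n for n in documents if not n.startswith('tmp_')]
# ['report.txt', 'notes.md']
```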

isalnum, isalpha, isdecimal, isdigit, isnumeric, isidentifier, islower, isupper, istitle, and isspace

The isalnum method tests whether every character in the string is alphanumeric (either a letter or a decimal digit):

‘hello3’.isalnum()                 # True
‘hello, there’.isalnum()           # False (comma and space are not alphanumeric)
‘’.isalnum()                       # False

The isalpha method tests whether every character in the string is a letter of the alphabet:

‘Washington’.isalpha()              # True
‘hello3’.isalpha()                  # False
‘hello, there’.isalpha()            # False (comma and space are not letters)
‘’.isalpha()                        # False

The isdecimal method tests whether every character in the string is a decimal digit:

‘334’.isdecimal()                   # True
‘33.4’.isdecimal()                  # False
‘33 4’.isdecimal()                  # False
‘334 tacos’.isdecimal()             # False
‘’.isdecimal()                      # False

(Two other methods, isdigit and isnumeric, test mostly the same thing as isdecimal, but a distinction is made between ‘digit’, ‘decimal’, and ‘numeric’ characters, as explained in the documentation.)
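To give a feel for that distinction, here is a sketch comparing the three methods on an ordinary digit, a superscript two (U+00B2), and the fraction one half (U+00BD). The sample characters and names are chosen just for illustration:

```python
samples = {
    'five': '5',
    'superscript two': '\u00b2',
    'one half': '\u00bd',
}

# Each entry is (isdecimal, isdigit, isnumeric) for that character.
checks = {
    name: (ch.isdecimal(), ch.isdigit(), ch.isnumeric())
    for name, ch in samples.items()
}
# 'five'            -> (True, True, True)
# 'superscript two' -> (False, True, True)
# 'one half'        -> (False, False, True)
```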

The isidentifier method tests whether the string characters form a valid Python identifier:

‘Apple’.isidentifier()            # True
‘ban_ana4’.isidentifier()         # True
‘3foo’.isidentifier()             # False
‘foo bar’.isidentifier()          # False

The islower method tests whether every letter in the string is lowercase (returning False if the string contains no letters at all):

‘washington’.islower()              # True
‘Washington’.islower()              # False
‘hello3’.islower()                  # True
‘hello, there’.islower()            # True
‘35 ,%’.islower()                   # False

Conversely, the isupper method tests whether every letter in the string is uppercase (returning False if the string contains no letters at all).
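A few isupper cases, sketched with sample strings of our own choosing:

```python
samples = ['WASHINGTON', 'Washington', 'HELLO3', 'HELLO, THERE', '35 ,%']

# Non-letters are ignored, but at least one letter must be present.
results = [s.isupper() for s in samples]
# [True, False, True, True, False]
```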

The istitle method tests whether the string contains at least one letter and whether every letter preceded by another letter is lowercase while all other letters are uppercase:

‘Washington’.istitle()              # True
‘WashIngTon’.istitle()              # False
‘Washington Dc’.istitle()           # True
‘Washington 3Dc’.istitle()          # True
‘Washington 3DC’.istitle()          # False
‘’.istitle()                        # False

The isspace method tests whether the string is not empty and every character in the string is a whitespace character:

‘    ’.isspace()                    # True
‘  \n  ’.isspace()                  # True
‘    \t\t’.isspace()                # True
‘ 3 ’.isspace()                     # False
‘’.isspace()                        # False

upper, lower, title, swapcase, and capitalize

The upper method returns a new string in which all the lowercase letters have been made uppercase:

‘herp Derp’.upper()                 # ‘HERP DERP’

The lower method returns a new string in which all the uppercase letters have been made lowercase:

‘heRp Derp’.lower()                # ‘herp derp’

The title method returns a new string in which all the letter cases have changed to conform to the title pattern:

‘heRp 3Derp’.title()               # ‘Herp 3Derp’

The swapcase method returns a new string in which all the letters have switched case:

‘heRp Derp’.swapcase()             # ‘HErP dERP’

The capitalize method returns a new string in which the first character is made uppercase if it is a letter and all the other letters are made lowercase:

‘heRp Derp’.capitalize()          # ‘Herp derp’
‘3heRp Derp’.capitalize()         # ‘3herp derp’

center, rjust, and ljust

The center method returns a new string padded to a minimum length, placing as many extra spaces as needed on both sides:

‘herp derp’.center(12)         # ‘ herp derp  ’
‘herp derp’.center(4)          # ‘herp derp’

(Notice that the number of spaces added here is odd, so the padding cannot be split evenly; in this example the extra space goes on the end. Also note that when the width argument is less than or equal to the length of the string, the original string itself is returned.)

The rjust (‘right justify’) method returns a new string padded to a minimum length, placing the extra spaces in front:

‘herp derp’.rjust(12)             # ‘   herp derp’

The ljust (‘left justify’) method returns a new string padded to a minimum length, placing the extra spaces on the end:

‘herp derp’.ljust(12)             # ‘herp derp   ’
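These methods are useful for lining up columns of text. All three also accept an optional second argument, a single fill character to use instead of a space. A sketch with made-up data:

```python
# Hypothetical rows of (name, count), for illustration only.
rows = [('apple', 3), ('kiwi', 12), ('banana', 107)]

# Names left-justified in 8 columns, counts right-justified in 5.
lines = [name.ljust(8) + str(count).rjust(5) for name, count in rows]
# Each line is 13 characters wide; the counts line up on the right.

starred = 'herp'.center(10, '*')        # '***herp***'
zero_padded = '42'.rjust(5, '0')        # '00042'
```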

strip, lstrip, and rstrip

The strip method returns a new string with any leading and trailing whitespace removed:

‘ \n   herp derp\t’.strip()         # ‘herp derp’
‘herp derp’.strip()                 # ‘herp derp’

(Note that strip simply returns the original string when it contains no leading or trailing whitespace.)

The lstrip (‘left strip’) method returns a new string with only the leading whitespace removed, while the rstrip (‘right strip’) method returns a new string with only the trailing whitespace removed:

‘ \n   herp derp\t’.lstrip()         # ‘herp derp\t’
‘ \n   herp derp\t’.rstrip()         # ‘ \n   herp derp’

These three methods all take an optional string argument that specifies a set of characters to strip other than whitespace:

‘y%%herp derp9\ty’.strip(‘9y\t%’)      # ‘herp derp’ (removed leading and trailing y’s, 9’s, %’s, and tabs)

(Note that in the string argument, the order of characters doesn’t matter.)
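Because the argument is a set of characters rather than a substring, these methods are a poor fit for removing a fixed suffix. A sketch of the pitfall and one possible workaround using endswith and slicing (the names here are illustrative):

```python
name = 'test.txt'

# Strips any trailing '.', 't', or 'x', including the last 't' of 'test'.
wrong = name.rstrip('.txt')             # 'tes'

# One way to remove an exact suffix: test for it, then slice it off.
suffix = '.txt'
right = name[:-len(suffix)] if name.endswith(suffix) else name
# 'test'
```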

partition, rpartition, split, rsplit, and splitlines

The partition method takes a separator argument and returns a tuple of three strings: the part before the first occurrence of the separator, the separator itself, and the part after the separator:

‘herp derp’.partition(‘er’)            # (‘h’, ‘er’, ‘p derp’)
‘herp derp’.partition(‘ ’)             # (‘herp’, ‘ ’, ‘derp’)
‘herp derp’.partition(‘meow’)          # (‘herp derp’, ‘’, ‘’)

(Notice that the string itself gets returned as the first item of the tuple if no occurrence of the separator is found.)

The rpartition method does the same as partition, but it searches for an occurrence of the separator starting from the right rather than the left:

‘herp derp’.rpartition(‘er’)            # (‘herp d’, ‘er’, ‘p’)
‘herp derp’.rpartition(‘meow’)          # (‘’, ‘’, ‘herp derp’)

The split method takes a separator argument and returns a list of all substrings divided by the separator:

‘supercalifragilistic herp derp’.split(‘er’)         # [‘sup’, ‘califragilistic h’, ‘p d’, ‘p’]

When no separator is specified, the string is split by all whitespace:

‘ \nherp  \t  derp ’.split()                  # [‘herp’, ‘derp’]

An optional second argument specifies a maximum number of times to split the string:

‘supercalifragilisticexpialadocious’.split(‘a’, 2)      # [‘superc’, ‘lifr’, ‘gilisticexpialadocious’]

The rsplit method does the same as split, but it splits starting from the right rather than the left.

The splitlines method returns a list of the substrings which are divided by newlines:

‘hello \ngoodbye’.splitlines()          # [‘hello ’, ‘goodbye’]
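The difference between split and rsplit only shows up when the number of splits is limited. A sketch with made-up strings, which also shows partition pulling apart a key and a value:

```python
path = 'archive.tar.gz'

from_left = path.split('.', 1)          # ['archive', 'tar.gz']
from_right = path.rsplit('.', 1)        # ['archive.tar', 'gz']

# partition splits only at the first separator, so the value may
# itself contain the separator.
key, sep, value = 'name=herp=derp'.partition('=')
# key is 'name', sep is '=', value is 'herp=derp'

# splitlines recognizes \r\n as well as \n.
lines = 'one\ntwo\r\nthree'.splitlines()
# ['one', 'two', 'three']
```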

join

The join method returns the concatenation of every string in a sequence, with the string itself placed in between:

‘ASDF’.join([‘cat’, ‘dog’, ‘bird’, ‘rat’])          # ‘catASDFdogASDFbirdASDFrat’
‘ ’.join([‘cat’, ‘dog’, ‘bird’, ‘rat’])             # ‘cat dog bird rat’
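Every item must already be a string (join raises a TypeError otherwise), so other values need converting first. A sketch with illustrative names, which also shows join undoing a split:

```python
numbers = [5, 3, 2, 10]

# Convert each number to a string before joining.
csv_line = ','.join(str(n) for n in numbers)    # '5,3,2,10'
fields = csv_line.split(',')                    # ['5', '3', '2', '10']

# Splitting on whitespace and rejoining collapses runs of whitespace.
normalized = ' '.join('  herp \t derp\n'.split())   # 'herp derp'
```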

expandtabs and zfill

The expandtabs method returns a string with each tab character replaced by spaces in a way that is consistent with tab boundaries. The optional argument specifies the number of characters in each tab boundary, which defaults to 8:

‘cat\tdog’.expandtabs()             # ‘cat     dog’
‘\tmouse’.expandtabs(4)             # ‘    mouse’

The zfill method returns a string padded with leading zeros to a designated size (returning the original string if it is already that size or longer). For strings beginning with -, the padding is placed after the -:

‘herp derp’.zfill(12)           # ‘000herp derp’
‘herp derp’.zfill(5)            # ‘herp derp’ (the very same string)
‘-herp derp’.zfill(12)          # ‘-00herp derp’

replace

The replace method returns a new string with all occurrences of the first string argument replaced with the second:

‘herp derp’.replace(‘er’, ‘zzz’)        # ‘hzzzp dzzzp’

translate and maketrans

The translate method returns a new string with each character translated according to the translation map argument, a dictionary of Unicode code points mapped to other Unicode code points, to strings, or to None:

‘herp derp’.translate({100: None, 114: 65, 101: ‘?’})      # ‘h?Ap ?Ap’

Above, e’s (code point 101) become ?’s, r’s (code point 114) become A’s (code point 65), and d’s (code point 100) get removed.

The maketrans static method is a convenience for making a translation map. With one dictionary argument, you may specify the keys as single-character strings instead of code points:

str.maketrans({‘d’: None, ‘r’: 65, ‘e’: ‘?’})         # {100: None, 114: 65, 101: ‘?’}

With two string arguments, each character in the first is mapped to the respective character in the second; the two strings must be of equal length:

str.maketrans(‘at%’, ‘8NN’)         # {97: 56, 116: 78, 37: 78}

The characters of an optional third string argument get mapped to None:

str.maketrans(‘at%’, ‘8NN’, ‘WQ’)   # {97: 56, 116: 78, 37: 78, 87: None, 81: None}
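Putting the two methods together, a table from maketrans is passed straight to translate. A sketch using the same three arguments on a made-up sentence:

```python
# 'a' -> '8', 't' -> 'N', '%' -> 'N'; 'W' and 'Q' are deleted.
table = str.maketrans('at%', '8NN', 'WQ')

translated = 'Wait, 100% Quiet'.translate(table)
# '8iN, 100N uieN'
```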

format

The format method returns a new string that interpolates the arguments into the string according to the rules of a fairly complex syntax (documented under ‘Format String Syntax’ in the Python documentation), e.g.:

‘Dear {0} {1},’.format(‘Mr.’, ‘Thompson’)       # ‘Dear Mr. Thompson,’

Here the first argument replaces ‘{0}’ while the second replaces ‘{1}’. This is only a simple example, but understand that format is a very powerful tool when you wish to insert variable data into designated parts of a template string.
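A few more forms the replacement fields can take, as a rough sketch rather than a tour of the whole syntax (the sample values are made up):

```python
# Fields may be named and filled from keyword arguments.
greeting = 'Dear {title} {name},'.format(title='Ms.', name='Lee')
# 'Dear Ms. Lee,'

# Empty braces are filled from the arguments in order.
equation = '{} + {} = {}'.format(2, 3, 2 + 3)       # '2 + 3 = 5'

# A numbered field may be reused.
repeated = '{0}{1}{0}'.format('a', 'b')             # 'aba'

# After a colon comes a format spec: here, right-align in 8 columns
# with 2 decimal places, and zero-pad an integer to 5 digits.
aligned = '{0:>8.2f}'.format(3.14159)               # '    3.14'
padded = '{:05d}'.format(42)                        # '00042'
```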

encode

The encode method returns a bytes object of the string encoded into a specified encoding, such as UTF-8 or UTF-16; the encoding is expressed by name as a string (the possible values are listed under ‘Standard Encodings’ in the documentation of the codecs module):

‘abc’.encode(‘utf_8’)        # the bytes: 97, 98, 99
‘abc’.encode(‘utf_16’)       # the bytes: 255, 254, 97, 0, 98, 0, 99, 0

(Notice that the UTF-16 encoding began with a byte-order mark.)
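As a sketch of a round trip, the example below encodes a string containing a non-ASCII character and decodes it again with the bytes method decode (not otherwise covered here). It also uses the ‘utf_16_le’ encoding name, which fixes the byte order and so writes no byte-order mark:

```python
text = 'caf\u00e9'                      # 'café'

# In UTF-8, the accented letter takes two bytes.
utf8 = text.encode('utf_8')
utf8_values = list(utf8)                # [99, 97, 102, 195, 169]

# Decoding with the same encoding recovers the original string.
round_trip = utf8.decode('utf_8')

# Explicit little-endian UTF-16: two bytes per character, no mark.
utf16_le_values = list('abc'.encode('utf_16_le'))
# [97, 0, 98, 0, 99, 0]
```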
