Python insights

This page collects some Python (2.5) insights for my own use, but it may be useful to other people too.

xrange vs. range

Always use xrange for iteration, for example:

for i in xrange(10):
  ...

xrange is more efficient because it returns a lazy object that produces its values on demand, whereas range builds the whole list in memory up front.

>>> k = range(10)
>>> k
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> m = xrange(10)
>>> m
xrange(10)

In Py3K, xrange will be renamed to range, and the functionality of the old range will be available as list(range(n)).
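
A rough way to see the difference is to compare the sizes of the two objects. This is only a sketch: it assumes sys.getsizeof is available (it was added in Python 2.6, so it is not in 2.5), and the name lazy_range is made up here so that the same code also runs on Py3K.

```python
import sys

try:
    lazy_range = xrange   # Python 2
except NameError:
    lazy_range = range    # Py3K: range is already lazy

n = 100000
lazy = lazy_range(n)          # small object, values produced on demand
full = list(lazy_range(n))    # all n values stored in memory

lazy_size = sys.getsizeof(lazy)
full_size = sys.getsizeof(full)
print("lazy: %d bytes, list: %d bytes" % (lazy_size, full_size))
```

The exact numbers depend on the interpreter and platform, but the lazy object stays the same size no matter how large n is.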

Initializing a 2D list

While this can be done safely to initialize a list:

lst = [0] * 3

The same trick won’t work for a 2D list (list of lists):

>>> lst_2d = [[0] * 3] * 3
>>> lst_2d
[[0, 0, 0], [0, 0, 0], [0, 0, 0]]
>>> lst_2d[0][0] = 5
>>> lst_2d
[[5, 0, 0], [5, 0, 0], [5, 0, 0]]

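The * operator does not copy its operand; it repeats references to it. The outer list therefore holds three references to the same inner list, so a change made through one row shows up in all of them. The correct way to do this is to build a new inner list for each row: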

>>> lst_2d = [[0] * 3 for i in xrange(3)]
>>> lst_2d
[[0, 0, 0], [0, 0, 0], [0, 0, 0]]
>>> lst_2d[0][0] = 5
>>> lst_2d
[[5, 0, 0], [0, 0, 0], [0, 0, 0]]
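
The aliasing can be checked directly with the is operator. A small sketch (the variable names are just for illustration; range is used instead of xrange so it also runs on Py3K):

```python
shared = [[0] * 3] * 3                      # three references to ONE inner list
independent = [[0] * 3 for _ in range(3)]   # three separate inner lists

same_row = shared[0] is shared[1]                  # True: same object
separate_rows = independent[0] is independent[1]   # False: distinct objects

shared[0][0] = 5
independent[0][0] = 5
print(shared)        # [[5, 0, 0], [5, 0, 0], [5, 0, 0]]
print(independent)   # [[5, 0, 0], [0, 0, 0], [0, 0, 0]]
```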

Detecting empty lines

To find out if a line is empty (i.e. it either has length 0 or contains only whitespace), use the string method strip in a condition, as follows:

if not line.strip():    # if line is empty
    continue            # skip it
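
The same test fits nicely into a list comprehension. A minimal sketch, with non_blank_lines being a name I made up for the example:

```python
def non_blank_lines(lines):
    """Return the lines that contain something other than whitespace."""
    return [line for line in lines if line.strip()]

sample = ["first\n", "\n", "   \t\n", "second\n", ""]
kept = non_blank_lines(sample)
print(kept)   # ['first\n', 'second\n']
```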

Iterating a sequence with an index

Use enumerate:

>>> items = ['a', 'b', 'c', 'd']
>>> for i, item in enumerate(items):
...     print i, item
...
0 a
1 b
2 c
3 d
>>>

enumerate will work for any iterable.
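
For instance, it works on a generator, which cannot be indexed at all. A small sketch (squares is a made-up helper):

```python
def squares(n):
    """Generate 0, 1, 4, ... lazily; there is no list to index into."""
    for k in range(n):
        yield k * k

pairs = list(enumerate(squares(4)))
print(pairs)   # [(0, 0), (1, 1), (2, 4), (3, 9)]
```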

Initializing a dictionary from a list of keys

Suppose you have a list of items, and you want a dictionary with these items as the keys. Use fromkeys:

>>> items = ['a', 'b', 'c', 'd']
>>> idict = dict.fromkeys(items, 0)
>>> idict
{'a': 0, 'c': 0, 'b': 0, 'd': 0}
>>>

The second argument of fromkeys is the value assigned to all the newly created keys.
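
One thing to watch out for: the value is not copied per key. With an immutable value such as 0 this doesn’t matter, but with a mutable one every key ends up pointing at the same object. The sketch below illustrates this (the names shared and separate are mine):

```python
items = ['a', 'b', 'c']

shared = dict.fromkeys(items, [])          # every key maps to the SAME list
shared['a'].append(1)

separate = dict((k, []) for k in items)    # a new list is built for each key
separate['a'].append(1)

print(shared['b'])      # [1]
print(separate['b'])    # []
```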

Read-only attributes

In Python, object attributes are R/W accessible to outside code. Sometimes, you may want some of the attributes to be read-only. Although this can be achieved by __setattr__, it will also intercept assignments from inside the object (by its own methods attempting to modify self.attr).

A better way is to use properties. These can be added to new-style classes (classes that derive from object) by calling the built-in property function. The most convenient way to use this function is via a decorator:

class Parrot(object):
    def __init__(self):
        self._voltage = 100000

    @property
    def voltage(self):
        """Get the current voltage."""
        return self._voltage

This class now has a read-only attribute named voltage:

>>> blacky = Parrot()
>>> blacky.voltage
100000
>>> blacky.voltage = 5000
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: can't set attribute
>>>
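
For comparison, here is roughly what the __setattr__ approach looks like, and why it gets in the way. GuardedParrot is a hypothetical class written only for this illustration:

```python
class GuardedParrot(object):
    """Hypothetical variant of Parrot that guards 'voltage' with __setattr__."""
    def __init__(self):
        # A plain "self.voltage = 100000" would be rejected by __setattr__
        # below, so even the class itself has to go around it.
        object.__setattr__(self, 'voltage', 100000)

    def __setattr__(self, name, value):
        if name == 'voltage':
            raise AttributeError("voltage is read-only")
        object.__setattr__(self, name, value)

polly = GuardedParrot()
try:
    polly.voltage = 5000
    blocked = False
except AttributeError:
    blocked = True

print(polly.voltage)   # 100000
print(blocked)         # True
```

Every method that needs to update voltage has to use the same object.__setattr__ detour, which is the inconvenience the property version avoids.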

String reverse

Python doesn’t have a built-in reverse method for strings. Luckily, this can be easily done with slices:

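def reverse(s):
    # A slice with step -1 walks the string backwards.
    # The parameter is named s to avoid shadowing the built-in str.
    return s[::-1]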

Dynamic code evaluation

Python has two constructs for dynamic code evaluation: eval, which works for single expressions, and exec, which is more general. The following example demonstrates the use of exec:

def create_function(code, name='foo'):
    """ Create and return the function defined in 'code'.
        'name' specifies the name of the function, as
        given in the 'def' in the code.
    """
    d = {}
    exec code.strip() in globals(), d
    return d[name]

def make_packet_extract(a, b):
    code = """
        def foo(packet):
            return ord(packet[%d]) + 256 * ord(packet[%d])
        """ % (a, b)

    return create_function(code, 'foo')

foo = make_packet_extract(3, 4)
print foo('abcdefg')

A couple of things to note here:

  • exec is given two dicts, one for global and one for local variables. 99% of the time it’s a good idea to provide it with globals() for the global variables, and a fresh local dict for the locals (unless the function you’re defining modifies the global environment, but this isn’t recommended).
  • The code string passed to exec is stripped of leading and trailing whitespace. This is because the function definition is indented, and Python rejects a first statement that is indented without an enclosing block.
  • create_function may not behave as expected if you place it in a separate file from make_packet_extract, because globals() returns the dictionary of the module where create_function is defined, not of the module it is called from. Generated code that refers to names from the caller’s module will not find them.
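
When all you need is the value of a single expression, eval is enough. Here is a minimal sketch that computes the same value as the packet example above without defining a function; the dict names and the variable expr are made up for the illustration:

```python
names = {'packet': 'abcdefg'}   # local variables visible to the expression
expr = "ord(packet[%d]) + 256 * ord(packet[%d])" % (3, 4)

# Same convention as with exec: globals() first, then a dict of locals.
value = eval(expr, globals(), names)
print(value)   # 25956
```

As with exec, the string is run as code, so it should only ever be built from input you trust.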

Turning a callable into an iterator

Suppose you have the following extremely useful class somewhere. It has already been defined and used, and you can’t change it:

from random import randint

class RandomChunker(object):
    """ Returns random chunks from the string
        provided at creation time.
    """
    def __init__(self, str, a=1, b=4):
        self.str = str
        self.a = a
        self.b = b
        self.pos = 0

    def chunk(self):
        """ Return the next random chunk from the
            input string. When the string's end
            has been reached, None is returned.
        """
        if self.pos >= len(self.str): return None

        chunk_size = randint(self.a, self.b)
        if chunk_size > len(self.str) - self.pos:
            chunk_size = len(self.str) - self.pos

        ret = self.str[self.pos:self.pos+chunk_size]
        self.pos += chunk_size
        return ret

It implements a quite common idiom: return useful values while they exist, and None (or EOF, or any other end value) when there’s nothing more to return.

How do you comfortably iterate over all the values of such a class or function? Here’s one way:

rc = RandomChunker("abracadabra12345")

while 1:
    chunk = rc.chunk()
    if chunk is None: break
    print chunk

This isn’t very comfortable. There’s a better way: the two-argument form of the iter function, which calls rc.chunk repeatedly until it returns the sentinel value None:

rc = RandomChunker("abracadabra12345")

for chunk in iter(rc.chunk, None):
    print chunk

Much prettier, isn’t it? With this method, you can also collect all the chunks at once (starting from a fresh RandomChunker, since the loop above has already exhausted rc):

all_chunks = list(iter(rc.chunk, None))
print all_chunks
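
The same idiom works for any callable with an end value, not only for this class. One case I find handy is reading a stream in fixed-size blocks. In this sketch an in-memory StringIO (from the io module, available since Python 2.6) stands in for a real file:

```python
from io import StringIO

stream = StringIO(u"abracadabra")   # stand-in for a real file object

# read(4) returns an empty string at the end of input; that is the sentinel.
chunks = list(iter(lambda: stream.read(4), u""))
print(chunks)
```

The lambda is needed because iter wants a callable that takes no arguments.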

Default values for a dictionary

Suppose you have a list of words, and you want to create a dict with a word count: each word mapped to the number of times it appears in the list. Here’s a solution with defaultdict:

from collections import defaultdict

def elemcount(elems):
    count = defaultdict(int)    # int() returns 0, the default for missing keys
    for e in elems:
        count[e] += 1
    return count

count = elemcount(['ax', 'ex', 'bx', 'ex', 'ex', 'bx'])

for ec in count:
    print ec, count[ec]

defaultdict implicitly initializes any missing key to a known value the first time it is accessed, which solves this problem gracefully. Without it, we’d have to check for the existence of e in the dict and explicitly initialize it.
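
For comparison, this is roughly what the version without defaultdict looks like (elemcount_plain is a name made up for this sketch):

```python
def elemcount_plain(elems):
    """Count elements with a plain dict, initializing each key by hand."""
    count = {}
    for e in elems:
        if e not in count:    # the check defaultdict makes unnecessary
            count[e] = 0
        count[e] += 1
    return count

plain = elemcount_plain(['ax', 'ex', 'bx', 'ex', 'ex', 'bx'])
print(plain['ex'])   # 3
```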

The immutability of Python strings

Has this ever happened to you?

>>> name = 'big foot'
>>> name[2]
'g'
>>> name[2] = 'G'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: 'str' object does not support item assignment

Strings in Python are immutable, just like numbers and tuples. This means that you can create them and pass them around, but not change them. Why is this so? For a few reasons (you can find a better discussion online):

  • By design, strings in Python are considered elemental and unchangeable. This spurs better, safer programming styles.
  • The immutability of strings has efficiency benefits, chiefly in the area of lower storage requirements.
  • It also makes strings safer to use as dictionary keys.

If you look around the Python web a little, you’ll notice that the most frequent advice to "how to change my string" is "design your code so that you won’t have to change it". Fair enough, but what other options are there? Here are a few:

  • name = name[:2] + 'G' + name[3:] - this is an inefficient way to do the job. Python’s slice semantics ensure that this works correctly in all cases (as long as your index is in range), but since it involves several string copies and concatenations, it’s hardly your best shot at efficient code. If you don’t care about that (and chances are you don’t), it’s a solid solution.
  • Use the MutableString class from module UserString. While no more efficient than the previous method (it performs the same trick under the hood), it is more consistent syntactically with normal string usage.
  • Use a list instead of a string to store mutable data. Convert back and forth using list and join. Depending on what you really need, ord and chr may also be useful.
  • Use an array object. This is perhaps your best option if you use the string to hold constrained data, such as ‘binary’ bytes, and want fast code.
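
Here is a small sketch of the first and third options side by side (the variable names are only for illustration):

```python
name = 'big foot'

# Option 1: build a new string from slices around the changed character.
by_slicing = name[:2] + 'G' + name[3:]

# Option 3: edit a list of characters in place, then join it back.
chars = list(name)
chars[2] = 'G'
by_list = ''.join(chars)

print(by_slicing)   # biG foot
print(by_list)      # biG foot
print(name)         # big foot (the original string is untouched)
```

The list version pays off when many characters are changed before converting back, since only the final join creates a new string.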

Matching in part of a string

Suppose that I want to re.match a pattern at some index in a string. I could say:

m = re.match(pattern, text[idx:])

But this can be very wasteful. String slices make copies, and if text is very large, making this copy for each match may be too costly. The alternative is to compile the pattern and use the pos argument of the compiled object’s match method:

cp = re.compile(pattern)
m = cp.match(text, pos=idx)

This will do just what I need - try to match beginning at index idx, without making any copies.
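
A concrete sketch of the difference, with a made-up input string and pattern. Note that the offsets reported by the match object refer to the whole string, not to the part after idx:

```python
import re

text = "key=value; id=42"
idx = text.index("id=")            # 11

cp = re.compile(r"id=(\d+)")
at_start = cp.match(text)          # None: the text does not start with "id="
m = cp.match(text, pos=idx)        # matching starts at index idx, no slice made

print(m.group(1))   # 42
print(m.start())    # 11
```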

Writing text file filters

A common task for scripting languages is to do some simple filtering on a text file, or a group of them. Python’s fileinput module provides a convenient utility to do such filtering. Consider the basic use:

import fileinput
for line in fileinput.input():
    process(line)

This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty.

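To filter a single file and write the results back to it (i.e. overwrite it with the filtered contents), use the inplace option. While inplace is active, standard output is redirected to the file, so whatever you print ends up in it. Here’s an example that adds line numbers to a file: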

import fileinput

for line in fileinput.input("test.txt", inplace=1):
    print "%d: %s" % (fileinput.filelineno(), line),

Here you can also see how the filename is provided directly as an argument to input. fileinput has a number of utility functions similar to filelineno that can make processing even simpler.
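
As an illustration of those utility functions, the sketch below runs over two throwaway files and records both the cumulative line number (lineno) and the per-file one (filelineno). The file names, their contents and the helper number_lines are all made up for the example:

```python
import fileinput
import os
import shutil
import tempfile

def number_lines(paths):
    """Return (cumulative line no., per-file line no., text) for every line."""
    rows = []
    for line in fileinput.input(paths):
        rows.append((fileinput.lineno(),
                     fileinput.filelineno(),
                     line.rstrip("\n")))
    fileinput.close()
    return rows

# Create two small temporary files to read from.
tmpdir = tempfile.mkdtemp()
paths = []
for fname, content in [("a.txt", "one\ntwo\n"), ("b.txt", "three\n")]:
    path = os.path.join(tmpdir, fname)
    f = open(path, "w")
    f.write(content)
    f.close()
    paths.append(path)

rows = number_lines(paths)
shutil.rmtree(tmpdir)   # clean up the temporary files
print(rows)   # [(1, 1, 'one'), (2, 2, 'two'), (3, 1, 'three')]
```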

Lack of lexical closure in a loop

flist = []

for i in xrange(3):
    def func(x): return x * i
    flist.append(func)

for f in flist:
    print f(2)

When I first ran this code, I was very surprised to see it printed "4 4 4", and not "0 2 4" as I expected.

The reason for this is simple. When func is created, it has an accompanying lexical frame, where i appears. But it’s the same i for all 3 instances of func created. What’s captured is the variable, not its value.

This is no more surprising than the fact that the following prints 60 and not 10:

i = 1
def func(x): return x * i
f = func
i = 6

print f(10)

New lexical frames are only created for functions, not for the insides of for loops (unlike in Perl, by the way, where it’s a special feature that for and foreach loops create fresh lexical closures for each iteration).

If such behavior is required, the simplest solution is to pass the variable you want to capture into the function as a parameter. This way, its value is included in the lexical frame created for the function’s own parameters and locals. Python’s default argument values provide a very natural way to achieve this:

flist = []

for i in xrange(3):
    def func(x, i=i): return x * i
    flist.append(func)

for f in flist:
    print f(2)

This prints "0 2 4".
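
Another way to get a fresh frame per iteration, if you would rather not add a parameter to func, is to create it inside a helper function. A sketch (make_multiplier is a name I made up; range is used so it also runs on Py3K):

```python
def make_multiplier(i):
    # Each call creates a new frame with its own i,
    # so every returned func remembers a different value.
    def func(x):
        return x * i
    return func

flist = [make_multiplier(i) for i in range(3)]
results = [f(2) for f in flist]
print(results)   # [0, 2, 4]
```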


If you have comments on this page, please post them here, or drop me an email.