Python insights
Specifying variable formatting field width
We’re all familiar with the field width specification in a formatting string:
>>> "%6d" % 15
'    15'
However, suppose that the field width (6 in the example above) isn’t constant, but should be computed. This is common when you want to pre-compute some max length of a group of strings/numbers to line them all up nicely.
Python has a nice option: use an asterisk (*) instead of the width. This tells the interpreter to take the width from the list of formatting values:
>>> "%*d" % (6, 15)
'    15'
This can be mixed with other options, like zero padding:
>>> "%0*d" % (6, 15)
'000015'
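Here is a minimal sketch of the use case mentioned above: the width is computed from the data and handed to the format string through the asterisk. The helper name format_column is made up for illustration, and the snippet uses Python 3 print syntax.

```python
def format_column(numbers):
    """Right-align integers in a column just wide enough for the longest one."""
    # Pre-compute the field width from the data itself.
    width = max(len(str(n)) for n in numbers)
    # The asterisk takes the width from the tuple of values.
    return ["%*d" % (width, n) for n in numbers]

column = format_column([7, 1234, 56])
print("\n".join(column))
```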
xrange vs. range
Always use xrange for iteration, i.e.:
for i in xrange(10):
    ...
xrange is more efficient because it produces its values lazily, one at a time, instead of building the whole list in memory like range does.
>>> k = range(10)
>>> k
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> m = xrange(10)
>>> m
xrange(10)
In Python 3, xrange was renamed to range, and the old list-returning behavior of range is available as list(range(n)).
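A short sketch of what this looks like under Python 3 (the variable names are arbitrary): range behaves like the lazy xrange, and a list is built only on request.

```python
r = range(10)            # lazy sequence, no list is built here
as_list = list(r)        # materialize explicitly when a real list is needed

is_list = isinstance(r, list)
print(r, as_list, is_list)
```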
Initializing a 2D list
While this can be done safely to initialize a list:
lst = [0] * 3
The same trick won’t work for a 2D list (list of lists):
>>> lst_2d = [[0] * 3] * 3
>>> lst_2d
[[0, 0, 0], [0, 0, 0], [0, 0, 0]]
>>> lst_2d[0][0] = 5
>>> lst_2d
[[5, 0, 0], [5, 0, 0], [5, 0, 0]]
The * operator does not copy its operand: it repeats references to it, so the outer list ends up holding three references to the same inner list. The correct way to do this is:
>>> lst_2d = [[0] * 3 for i in xrange(3)]
>>> lst_2d
[[0, 0, 0], [0, 0, 0], [0, 0, 0]]
>>> lst_2d[0][0] = 5
>>> lst_2d
[[5, 0, 0], [0, 0, 0], [0, 0, 0]]
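The difference can also be checked directly with the is operator. This is a small sketch in Python 3 syntax (range instead of xrange); the variable names are arbitrary.

```python
rows, cols = 3, 3

shared = [[0] * cols] * rows                       # three references to one row
independent = [[0] * cols for _ in range(rows)]    # three distinct rows

shared_same = shared[0] is shared[1]
independent_same = independent[0] is independent[1]
print(shared_same, independent_same)
```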
Beware of mutable default values for arguments
This may surprise you:
>>> class Foo(object):
...     def __init__(self, name='', stuff=[]):
...         self.name = name
...         self.stuff = stuff
...
...     def add_stuff(self, gadget):
...         self.stuff.append(gadget)
...
>>> f = Foo()
>>> f.add_stuff('tree')
>>> f.stuff
['tree']
>>> g = Foo()
>>> g.stuff
['tree']
Where did this tree come from in g??
This is something that confuses a lot of Python programmers, sometimes even the experienced ones. Almost every newbie is hit by this gotcha at one stage or another (unless they diligently read about it in someone else's blog, book or Python recipe). What happens here is that default values for arguments are created by Python only once per function/method, at the time of its definition.
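The same effect shows up with a plain function. In this sketch (Python 3 syntax; add_gadget is a made-up name) the default list lives on the function object itself, in its __defaults__ attribute, and every call that omits the argument gets that very object.

```python
def add_gadget(gadget, stuff=[]):
    # The default list was created once, when the def statement ran.
    stuff.append(gadget)
    return stuff

first = add_gadget('tree')
second = add_gadget('rock')

# The one and only default object, stored on the function.
stored_default = add_gadget.__defaults__[0]
print(first, second, stored_default)
```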
We can easily verify it by checking the addresses:
>>> object.__repr__(f.stuff)
'<list object at 0x01B35828>'
>>> object.__repr__(g.stuff)
'<list object at 0x01B35828>'
So how can one do it correctly? One solution is to avoid using mutable default values for arguments. But this is hardly satisfactory, as from time to time a new list is a useful default. There are some complex solutions, like defining a decorator for functions that deep-copies all arguments. This is overkill, and the problem can be solved easily as follows:
>>> class Foo(object):
...     def __init__(self, name='', stuff=[]):
...         self.name = name
...         self.stuff = stuff or []
...
...     def add_stuff(self, gadget):
...         self.stuff.append(gadget)
...
>>> f = Foo()
>>> f.add_stuff('tree')
>>> g = Foo()
>>> g.stuff
[]
>>> f.stuff
['tree']
The stuff or [] code does the trick, as the [] in it always creates a fresh new list when the empty list (that is, the default argument) is passed in.
Note that the or operator is quite nondiscriminatory when it comes to booleans, so a lot of values you may consider valid (empty strings, 0, etc.) will be thought of as False. But for most cases this will work just fine (why would anyone pass an empty string instead of a list?)
However, if you're still concerned about the tricky corner cases, this solution may be more robust (though less simple, which is a disadvantage):
>>> class Foo(object):
...     def __init__(self, name='', stuff=None):
...         self.name = name
...         if stuff is None: stuff = []
...         self.stuff = stuff
...
...     def add_stuff(self, gadget):
...         self.stuff.append(gadget)
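One corner case where the two solutions differ is a caller who passes in an empty list and expects it to be the one that gets filled. A minimal sketch (Python 3 syntax; both function names are made up for illustration):

```python
def with_or(stuff=None):
    # Any falsy argument, including an empty list, is replaced.
    return stuff or []

def with_none_check(stuff=None):
    # Only a missing argument (None) is replaced.
    if stuff is None:
        stuff = []
    return stuff

callers_list = []
from_or = with_or(callers_list)
from_check = with_none_check(callers_list)
print(from_or is callers_list, from_check is callers_list)
```

With the or version the caller's empty list is silently swapped for a new one; the None check keeps it.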
Detecting empty lines
To find out if a line is empty (i.e. either has size 0 or contains only whitespace), use the string method strip in a condition, as follows:
if not line.strip(): # if line is empty
    continue # skip it
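The same test fits nicely into a list comprehension. A small sketch in Python 3 syntax (non_empty_lines is a hypothetical helper name):

```python
def non_empty_lines(text):
    """Return the lines of 'text' that contain something besides whitespace."""
    # strip() returns '' for empty and whitespace-only lines, and '' is false.
    return [line for line in text.splitlines() if line.strip()]

kept = non_empty_lines("first\n\n   \t\nsecond\n")
print(kept)
```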
Iterating a sequence with an index
Use enumerate:
>>> items = ['a', 'b', 'c', 'd']
>>> for i, item in enumerate(items):
...     print i, item
...
0 a
1 b
2 c
3 d
>>>
enumerate will work for any iterable.
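For instance, it accepts a generator just as happily as a list, and an optional second argument sets the starting index. A sketch in Python 3 syntax; numbered and shout are names invented for this example.

```python
def shout(words):
    # A generator: it has no length and cannot be indexed.
    for w in words:
        yield w.upper()

def numbered(items):
    # Start counting at 1 instead of the default 0.
    return ["%d: %s" % (n, item) for n, item in enumerate(items, 1)]

result = numbered(shout(['a', 'b', 'c']))
print(result)
```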
Initializing a dictionary from a list of keys
Suppose you have a list of items, and you want a dictionary with these items as the keys. Use fromkeys:
>>> items = ['a', 'b', 'c', 'd']
>>> idict = dict.fromkeys(items, 0)
>>> idict
{'a': 0, 'c': 0, 'b': 0, 'd': 0}
>>>
The second argument of fromkeys is the value assigned to all the newly created keys.
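One thing to watch for: that value is a single object shared by every key, so a mutable value such as a list behaves like the rows of the 2D list above. A minimal sketch (Python 3 syntax, arbitrary variable names):

```python
keys = ['a', 'b']

shared = dict.fromkeys(keys, [])      # every key maps to the same list
shared['a'].append(1)

separate = dict((k, []) for k in keys)   # a fresh list per key
separate['a'].append(1)

print(shared, separate)
```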
Read-only attributes
In Python, object attributes are R/W accessible to outside code. Sometimes, you may want some of the attributes to be read-only. Although this can be achieved by __setattr__, it will also intercept assignments from inside the object (by its own methods attempting to modify self.attr).
A better way is to use properties. These can be added to new-style classes (classes that derive from object) by calling the built-in property function. The most convenient way to use this function is via a decorator:
class Parrot(object):
    def __init__(self):
        self._voltage = 100000

    @property
    def voltage(self):
        """Get the current voltage."""
        return self._voltage
This class now has a read-only attribute named voltage:
>>> blacky = Parrot()
>>> blacky.voltage
100000
>>> blacky.voltage = 5000
Traceback (most recent call last):
File "<input>", line 1, in <module>
AttributeError: can't set attribute
>>>
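To see the contrast with the __setattr__ approach, here is a sketch in which the object's own methods still update the underlying value while assignment from outside fails. It is written in Python 3 syntax; the zap method and the try_set helper are made up for illustration.

```python
class Parrot(object):
    def __init__(self):
        self._voltage = 100000

    @property
    def voltage(self):
        """Get the current voltage."""
        return self._voltage

    def zap(self, extra):
        # Internal code works on the private attribute, untouched by the property.
        self._voltage += extra

def try_set(parrot, value):
    """Return True if assigning to parrot.voltage succeeded."""
    try:
        parrot.voltage = value
    except AttributeError:
        return False
    return True

blacky = Parrot()
blacky.zap(5)
assigned = try_set(blacky, 5000)
print(blacky.voltage, assigned)
```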
String reverse
Python doesn’t have a built-in reverse method for strings. Luckily, this can be easily done with slices:
def reverse(s):
    return s[::-1]
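Since this is ordinary slicing, the same function works on any sequence that supports slices, not only strings. A quick sketch (Python 3 syntax; the parameter is called seq here to stress that point):

```python
def reverse(seq):
    # A step of -1 walks the sequence from the end to the beginning.
    return seq[::-1]

word = reverse('parrot')
numbers = reverse([1, 2, 3])
pair = reverse((1, 2))
print(word, numbers, pair)
```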
Dynamic code evaluation
Python has two constructs for dynamic code evaluation: eval, which works for single expressions, and exec, which is more general. The following example demonstrates the use of exec:
def create_function(code, name='foo'):
    """ Create and return the function defined in 'code'.
        'name' specifies the name of the function, as
        given in the 'def' in the code.
    """
    d = {}
    exec code.strip() in globals(), d
    return d[name]

def make_packet_extract(a, b):
    code = """
        def foo(packet):
            return ord(packet[%d]) + 256 * ord(packet[%d])
        """ % (a, b)
    return create_function(code, 'foo')

foo = make_packet_extract(3, 4)
print foo('abcdefg')
A couple of things to note here:
- exec is given a dict of global and local variables. Most of the time it's a good idea to provide it with globals() for the global variables, and a local dict for the locals (unless the function you're defining modifies the global environment, but this isn't recommended).
- The code string passed to exec is stripped of leading and trailing whitespace. This is because the function definition is indented, and Python rejects code that starts with indentation when no enclosing block calls for it.
- create_function may not behave as expected if you place it in a separate file from make_packet_extract, because globals() returns the dictionary of the module where create_function is defined, not the module from which it is called. Names from the caller's module are then invisible to the generated code.
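One way around the last point is to let the caller pass in the namespace explicitly. The following is only a sketch, written for Python 3, where exec is a function rather than a statement; the env parameter and the bar function are assumptions added for illustration, not part of the original code.

```python
def create_function(code, name, env=None):
    """Execute 'code' and return the function called 'name' that it defines.

    'env' (hypothetical parameter) is the dict the generated function will
    use as its globals; an empty dict is used when it is omitted.
    """
    global_vars = {} if env is None else env
    local_vars = {}
    exec(code.strip(), global_vars, local_vars)
    return local_vars[name]

def make_packet_extract(a, b):
    code = """
def foo(packet):
    return ord(packet[%d]) + 256 * ord(packet[%d])
""" % (a, b)
    return create_function(code, 'foo')

foo = make_packet_extract(3, 4)
value = foo('abcdefg')

# The generated function looks up SCALE in the namespace it was given.
bar = create_function("def bar(x): return x * SCALE", 'bar', {'SCALE': 10})
scaled = bar(4)
print(value, scaled)
```

Built-in names such as ord remain available because exec adds the built-ins to the globals dict it is given.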
Turning a callable into an iterator
Suppose you have the following extremely useful class somewhere. It has already been defined and used, and you can’t change it:
from random import randint

class RandomChunker(object):
    """ Returns random chunks from the string
        provided at creation time.
    """
    def __init__(self, str, a=1, b=4):
        self.str = str
        self.a = a
        self.b = b
        self.pos = 0

    def chunk(self):
        """ Return the next random chunk from the
            input string. When the string's end
            has been reached, None is returned.
        """
        if self.pos >= len(self.str): return None
        chunk_size = randint(self.a, self.b)
        if chunk_size > len(self.str) - self.pos:
            chunk_size = len(self.str) - self.pos
        ret = self.str[self.pos:self.pos+chunk_size]
        self.pos += chunk_size
        return ret
It implements a quite common idiom: return useful values while they exist, and None (or EOF, or any other end value) when there’s nothing more to return.
How do you comfortably iterate over all the values of such a class/function? Here's one way:
rc = RandomChunker("abracadabra12345")
while 1:
    chunk = rc.chunk()
    if chunk is None: break
    print chunk
This isn’t very comfortable… There’s a better way - using the iter function:
rc = RandomChunker("abracadabra12345")
for chunk in iter(rc.chunk, None):
    print chunk
Much prettier, isn't it? With this method, you can also return all chunks at once:
all_chunks = list(iter(rc.chunk, None))
print all_chunks
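The same two-argument form of iter is handy for reading a stream in fixed-size blocks, where read returns an empty string at the end. A minimal sketch in Python 3 syntax; read_blocks is a made-up helper, and io.StringIO stands in for a real file.

```python
import io
from functools import partial

def read_blocks(stream, size):
    """Return the contents of 'stream' as a list of blocks of at most 'size' chars."""
    # iter() keeps calling stream.read(size) until it returns the sentinel ''.
    return list(iter(partial(stream.read, size), ''))

blocks = read_blocks(io.StringIO("abcdefghij"), 4)
print(blocks)
```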
Default values for a dictionary
Suppose you have a list of words, and you want to create a dict with a word count: each word mapped to the number of times it appears in the list. Here's a solution with defaultdict:
from collections import defaultdict

def elemcount(elems):
    count = defaultdict(lambda: 0)
    for e in elems:
        count[e] += 1
    return count

count = elemcount(['ax', 'ex', 'bx', 'ex', 'ex', 'bx'])
for ec in count:
    print ec, count[ec]
defaultdict calls the function it was given to create a value the first time a missing key is accessed, so every entry starts from a known value and the problem is solved gracefully. Without it, we'd have to check for the existence of e in the dict and explicitly initialize it.
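For comparison, this is roughly what the version without defaultdict looks like (a sketch in Python 3 syntax; elemcount_plain is a name invented here):

```python
def elemcount_plain(elems):
    count = {}
    for e in elems:
        # Explicitly initialize the counter the first time a word is seen.
        if e not in count:
            count[e] = 0
        count[e] += 1
    return count

plain = elemcount_plain(['ax', 'ex', 'bx', 'ex', 'ex', 'bx'])
print(plain)
```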
The immutability of Python strings
Did this ever happen to you?
>>> name = 'big foot'
>>> name[2]
'g'
>>> name[2] = 'G'
Traceback (most recent call last):
File "<input>", line 1, in <module>
TypeError: 'str' object does not support item assignment
Strings in Python are immutable, just like numbers and tuples. This means that you can create them and move them around, but not change them. Why is this so? For a few reasons (you can find a better discussion online):
- By design, strings in Python are considered elemental and unchangeable. This spurs better, safer programming styles.
- The immutability of strings has efficiency benefits, chiefly in the area of lower storage requirements.
- It also makes strings safer to use as dictionary keys
If you look around the Python web a little, you'll notice that the most frequent advice to "how to change my string" is "design your code so that you won't have to change it". Fair enough, but what other options are there? Here are a few:
- name = name[:2] + 'G' + name[3:] - this is an inefficient way to do the job. Python's slice semantics ensure that this works correctly in all cases (as long as your index is in range), but since it involves several string copies and concatenations, it's hardly your best shot at efficient code. Although if you don't care for that (and chances are you don't), it's a solid solution.
- Use the MutableString class from module UserString. While no more efficient than the previous method (it performs the same trick under the hood), it is more consistent syntactically with normal string usage.
- Use a list instead of a string to store mutable data. Convert back and forth using list and join. Depending on what you really need, ord and chr may also be useful.
- Use an array object. This is perhaps your best option if you use the string to hold constrained data, such as ‘binary’ bytes, and want fast code.
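The list option from above can be sketched as follows (Python 3 syntax; replace_char is a hypothetical helper name):

```python
def replace_char(s, index, ch):
    """Return a copy of 's' with the character at 'index' replaced by 'ch'."""
    chars = list(s)          # lists are mutable, unlike strings
    chars[index] = ch
    return ''.join(chars)    # convert back to a string

name = replace_char('big foot', 2, 'G')
print(name)
```

If many changes are needed, keep the data as a list and join it only once at the end.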
Matching in part of a string
Suppose that I want to re.match a pattern at some index in a string. I could say:
m = re.match(pattern, str[idx:])
But this can be very wasteful. String slices make copies, and if str is very large, making this copy for each match may be too costly. The alternative is to use the pos argument of the compiled regex form of match:
cp = re.compile(pattern)
m = cp.match(str, pos=idx)
This will do just what I need - try to match beginning at index idx, without making any copies.
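A small sketch of the difference (Python 3 syntax, arbitrary names). Note that the positions reported by the match object refer to the original string, not to a slice.

```python
import re

number = re.compile(r"\d+")
text = "abc123def"

at_start = number.match(text)       # no digits at index 0
at_three = number.match(text, 3)    # start matching at index 3, no copy made

print(at_start, at_three.group(), at_three.span())
```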
Writing text file filters
A common task for scripting languages is to do some simple filtering on a text file, or a group of them. Python’s fileinput module provides a convenient utility to do such filtering. Consider the basic use:
import fileinput
for line in fileinput.input():
    process(line)
This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty.
To filter a single file, returning the results to itself (i.e. overwriting it with the filtered contents), use the inplace option. Here’s an example that adds line numbers to a file:
import fileinput
for line in fileinput.input("test.txt", inplace=1):
    print "%d: %s" % (fileinput.filelineno(), line),
Here you can also see how the filename is provided directly as an argument to input. fileinput has a number of utility functions similar to filelineno that can make processing even simpler.
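As an illustration of those utility functions, this sketch tags every line with the name of the file it came from and its line number within that file. It is written in Python 3 syntax; tag_lines and the two throwaway files are made up for the demo.

```python
import fileinput
import os
import tempfile

def tag_lines(paths):
    """Return the lines of the given files as 'name:lineno: text' strings."""
    tagged = []
    try:
        for line in fileinput.input(paths):
            tagged.append("%s:%d: %s" % (
                fileinput.filename(),     # file currently being read
                fileinput.filelineno(),   # line number within that file
                line.rstrip("\n")))
    finally:
        fileinput.close()
    return tagged

# Demo on two temporary files.
tmp_dir = tempfile.mkdtemp()
first = os.path.join(tmp_dir, "first.txt")
second = os.path.join(tmp_dir, "second.txt")
with open(first, "w") as f:
    f.write("x\ny\n")
with open(second, "w") as f:
    f.write("z\n")

tagged = tag_lines([first, second])
print("\n".join(tagged))
```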
Lack of lexical closure in a loop
flist = []
for i in xrange(3):
    def func(x): return x * i
    flist.append(func)

for f in flist:
    print f(2)
When I first ran this code, I was very surprised to see it printed "4 4 4", and not "0 2 4" as I expected.
The reason for this is simple. When func is created, it has an accompanying lexical frame, where i appears. But it’s the same i for all 3 instances of func created. What’s captured is the variable, not its value.
This is no more surprising than the fact that the following prints 60 and not 10:
i = 1
def func(x): return x * i
f = func
i = 6
print f(10)
New lexical frames are only created for functions, not for the insides of for loops (unlike in Perl, by the way, where it’s a special feature that for and foreach loops create fresh lexical closures for each iteration).
If such behavior is required, the simplest solution is to pass the variable you want to capture into the function as a parameter. This way, its value is included in the lexical frame created for the function’s own parameters and locals. Python’s default argument values provide a very natural way to achieve this:
flist = []
for i in xrange(3):
    def func(x, i=i): return x * i
    flist.append(func)

for f in flist:
    print f(2)
This prints "0 2 4".
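Another way to get a fresh frame per iteration, in the spirit of "new lexical frames are only created for functions", is to build each function inside a factory function. A sketch in Python 3 syntax; make_multiplier is a name invented for this example.

```python
# Late binding: every lambda refers to the same variable i.
late = []
for i in range(3):
    late.append(lambda x: x * i)
late_results = [f(2) for f in late]

def make_multiplier(i):
    # Each call creates a new frame with its own i.
    def func(x):
        return x * i
    return func

flist = [make_multiplier(n) for n in range(3)]
results = [f(2) for f in flist]
print(late_results, results)
```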
Finding out the directory a script is being run from
Here’s a simple, cross-platform way to find out the execution directory of the script (the directory in which the script actually resides, not the current directory from which it’s called):
import os
def script_dir():
    return os.path.dirname(os.path.realpath(__file__))
__file__ is the special module attribute specifying the file in which it resides. os.path.realpath is mainly for Linux compatibility, eliminating symbolic links. os.path.dirname returns just the directory name of a full path containing a file name.
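A typical use is locating a data file that sits next to the script. The sketch below (Python 3 syntax) takes the script path as a parameter so it can be tried anywhere; sibling_path and the file names are assumptions made for the demo.

```python
import os

def sibling_path(script_file, name):
    """Return the path of 'name' in the directory that holds 'script_file'."""
    script_dir = os.path.dirname(os.path.realpath(script_file))
    return os.path.join(script_dir, name)

# In a real script you would call sibling_path(__file__, "data.txt").
fake_script = os.path.join(os.getcwd(), "no_such_script_for_demo.py")
example = sibling_path(fake_script, "data.txt")
print(example)
```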
Processing Unicode paths and filenames
When you're working with paths and filenames that may be Unicode encoded (for example, Hebrew filenames on Windows, which automatically encodes all paths in Unicode), keep in mind that you must pass in Unicode strings to os.listdir, os.walk and kin. Otherwise, they'll return byte strings in the system's default encoding, which may not be able to represent the real names of the files and directories.
On the other hand, when printing out Unicode strings on Windows, make sure to explicitly encode them to ascii. Here’s an example that sums it all up:
import os, sys
files = os.listdir(u'.')
for f in files:
    if not f.startswith('uni'):
        continue
    pf = os.path.join(u'.', f)
    print pf.encode('ascii', 'replace'), os.path.exists(pf)
This code will function correctly in the presence of Unicode-named files and directories in the path.
Here’s a good article explaining this stuff in more depth.
Encoding and decoding binary data strings in hex
A code snippet is better than a thousand words:
>>> "\x12\xAB".encode('hex')
'12ab'
>>> 'DEADBEEF'.decode('hex')
'\xde\xad\xbe\xef'
By the way, while we're at it, the base64 and zlib encodings can also be quite useful.
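The 'hex' string codec shown above is specific to Python 2 strings. As a sketch of an alternative that exists in both Python 2 and Python 3, the binascii module offers the same conversions on byte strings (the variable names here are arbitrary):

```python
import binascii

hex_text = binascii.hexlify(b"\x12\xab")     # bytes -> hex digits
raw = binascii.unhexlify(b"DEADBEEF")        # hex digits -> bytes

print(hex_text, raw)
```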

