
This blog has been unusually quiet lately. Real-life factors such as traveling for work and the sleeping patterns of my daughter aside, the main reason for the quietness has been that I was spending a bit more time working on Python in the past month.

In particular, I'd like to focus on changes in the xml.etree.ElementTree package for Python 3.3.

xml.etree.ElementTree is arguably the most popular standard library package for processing XML. It has a friendly, Pythonic API and a C accelerator with very good performance.

One annoying aspect of using the package is, however, the need to explicitly ask for the C accelerator, and fall back to the (much slower) pure Python implementation if that's not available. In other words, this incantation is very common for code that uses ElementTree for XML processing:

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

What's interesting is that starting with Python 3, the official Python policy is to transparently hide the C accelerators inside the module:

A common pattern in Python 2.x is to have one version of a module implemented in pure Python, with an optional accelerated version implemented as a C extension; for example, pickle and cPickle. This places the burden of importing the accelerated version and falling back on the pure Python version on each user of these modules. In Python 3.0, the accelerated versions are considered implementation details of the pure Python versions. Users should always import the standard version, which attempts to import the accelerated version and falls back to the pure Python version.

This was quite a large task, however, so in practice it was spread across several releases in the 3.x line. In particular, in Python 3.2 cElementTree still has to be imported explicitly to access the C accelerator.

Well, no more. Starting with Python 3.3, all you'll have to do is:

import xml.etree.ElementTree as ET

This will import the accelerated C module if it exists, and the pure Python module otherwise. The cElementTree module is not going to be needed any longer, although it will stay in the standard library as a thin alias, for backwards compatibility.
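To make the mechanism concrete, here is a minimal sketch of how a module can hide its own accelerator. It is not the actual ElementTree source: the function escape_text, the flag ACCELERATED, and the module name _hypothetical_accelerator are all invented for illustration. Since that module does not exist, the import fails and the pure Python definition stays in place.

```python
# Pure Python reference implementation; always available.
def escape_text(text):
    """Escape the characters that are special in XML character data."""
    # "&" must be replaced first so the entities added below are not re-escaped.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


try:
    # Hypothetical C accelerator (the name is made up for this sketch).
    # If the import succeeded, it would rebind escape_text to the fast version.
    from _hypothetical_accelerator import escape_text
    ACCELERATED = True
except ImportError:
    # No accelerator available: keep the pure Python definition above.
    ACCELERATED = False


print(escape_text("a < b & c"), ACCELERATED)
```

Callers of such a module never see the difference: they import one name and get whichever implementation is available, which is exactly what the single ElementTree import gives you.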

This wouldn't be very interesting if ElementTree were an ordinary package. In fact, it was one of the very few externally maintained packages in the standard library. Historically, the package was donated to CPython by its maintainer Fredrik Lundh, who kept the copyright. This made the package somewhat challenging for the Python core developers to maintain, since any change had to be coordinated with Fredrik and his upstream standalone distribution.

Although the standard library ElementTree had de facto already diverged a bit from Fredrik's implementation (especially due to the great efforts of Florent Xicluna), the change discussed here affects the package's interface rather than its implementation, so it raised a lively discussion on the Python core development mailing list. Luckily, Fredrik readily agreed to cede further maintenance of ElementTree to the Python developers, so the copyright/maintenance obstacle disappeared.

Some work remains to further improve ElementTree, and there are a few relevant issues open in the Python bug tracker:

  • Issue #14006: The ElementTree documentation could use some love.
  • Issues #14007 and #14128: some mismatches between documentation and implementation.
  • A few other open issues can be found by searching the tracker for ElementTree.

I'm currently focusing on the latter (#14128). Specifically, while the Element class can be subclassed in the Python implementation, it can't in the C implementation (because there Element is just a factory function for creating new objects). I already have a patch for this attached to the issue, after which I plan to work out the other discrepancies.
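For illustration, this is the kind of code the subclassing fix is meant to allow. It is a small sketch that assumes Python 3.3 or later, where Element is a real class in both implementations. The subclass CountingElement and its child_count method are made-up names, not part of ElementTree.

```python
import xml.etree.ElementTree as ET


class CountingElement(ET.Element):
    """Hypothetical Element subclass that adds one convenience method."""

    def child_count(self):
        # len() on an element returns the number of direct children.
        return len(self)


root = CountingElement("root")
ET.SubElement(root, "a")
ET.SubElement(root, "b")

print(root.tag, root.child_count())
```

With the old factory-function Element of the C implementation, the class statement itself would fail, which is why code relying on subclassing used to break as soon as the accelerator was imported.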

Python development is a cooperative effort, and I'm grateful to many other devs for their help with issues related to ElementTree. More help is needed, though! So if you're thinking of starting to contribute to Python, the ElementTree package is a good place to begin: there is a lot of remaining work, and it is currently the focus of a few core devs, so getting meaningful contributions committed should be relatively easy.