Processing XML in Python with ElementTree

March 15th, 2012 at 6:14 am

When it comes to parsing and manipulating XML, Python lives true to its "batteries included" motto. Looking at the sheer amount of modules and tools it makes available out of the box in the standard library can be somewhat overwhelming for programmers new to Python and/or XML.

A few months ago an interesting discussion took place amongst the Python core developers regarding the relative merits of the XML tools Python makes available, and how to best present them to users. This article (and hopefully a couple of others that will follow) is my humble contribution, in which I plan to provide my view on which tool should be preferred and why, as well as present a friendly tutorial on how to use it.

The code in this article is demonstrated using Python 2.7; it can be adapted for Python 3.x with very few modifications.

Which XML library to use?

Python has quite a few tools available in the standard library to handle XML. In this section I want to give a quick overview of the packages Python offers and explain why ElementTree is almost certainly the one you want to use.

xml.dom.* modules – implement the W3C DOM API. If you’re used to working with the DOM API or have some requirement to do so, this package can help you. Note that there are several modules in the xml.dom package, representing different tradeoffs between performance and expressivity.

xml.sax.* modules – implement the SAX API, which trades convenience for speed and memory consumption. SAX is an event-based API meant to parse huge documents "on the fly" without loading them wholly into memory [1].

xml.parser.expat – a direct, low level API to the C-based expat parser [2]. The expat interface is based on event callbacks, similarly to SAX. But unlike SAX, the interface is non-standard and specific to the expat library.

Finally, there’s xml.etree.ElementTree (from now on, ET in short). It provides a lightweight Pythonic API, backed by an efficient C implementation, for parsing and creating XML. Compared to DOM, ET is much faster [3] and has a more pleasant API to work with. Compared to SAX, there is ET.iterparse which also provides "on the fly" parsing without loading the whole document into memory. The performance is on par with SAX, but the API is higher level and much more convenient to use; it will be demonstrated later in the article.

My recommendation is to always use ET for XML processing in Python, unless you have very specific needs that may call for the other solutions.

ElementTree – one API, two implementations

ElementTree is an API for manipulating XML, and it has two implementations in the Python standard library. One is a pure Python implementation in xml.etree.ElementTree, and the other is an accelerated C implementation in xml.etree.cElementTree. It’s important to remember to always use the C implementation, since it is much, much faster and consumes significantly less memory. If your code can run on platforms that might not have the _elementtree extension module available [4], the import incantation you need is:

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

This is a common practice in Python to choose from several implementations of the same API. Although chances are that you’ll be able to get away with just importing the first module, your code may end up running on some strange platform where this will fail, so you better prepare for the possibility. Note that starting with Python 3.3, this will no longer be needed, since the ElementTree module will look for the C accelerator itself and fall back on the Python implementation if that’s not available. So it will be sufficient to just import xml.etree.ElementTree. But until 3.3 is out and your code runs on it, just use the two-stage import presented above.

Anyway, wherever this article mentions the ElementTree module (ET), I mean the C implementation of the ElementTree API.

Parsing XML into a tree

Let’s start with the basics. XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two objects for this purpose – ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading, writing, finding interesting elements) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements is done on the Element level. The following examples will demonstrate the main uses [5].

We’re going to use the following XML document for the sample code:

<?xml version="1.0"?>
<doc>
    <branch name="testing" hash="1cdf045c">
        text,source
    </branch>
    <branch name="release01" hash="f200013e">
        <sub-branch name="subrelease01">
            xml,sgml
        </sub-branch>
    </branch>
    <branch name="invalid">
    </branch>
</doc>

Let’s load and parse the document:

>>> import xml.etree.cElementTree as ET
>>> tree = ET.ElementTree(file='doc1.xml')

Now let’s fetch the root element:

>>> tree.getroot()
<Element 'doc' at 0x11eb780>

As expected, the root is an Element object. We can examine some of its attributes:

>>> root = tree.getroot()
>>> root.tag, root.attrib
('doc', {})

True, the root element has no attributes [6]. As any Element, it presents an iterable interface for going over its direct children:

>>> for child_of_root in root:
...   print child_of_root.tag, child_of_root.attrib
...
branch {'hash': '1cdf045c', 'name': 'testing'}
branch {'hash': 'f200013e', 'name': 'release01'}
branch {'name': 'invalid'}

We can also access a specific child, by index:

>>> root[0].tag, root[0].text
('branch', '\n        text,source\n    ')

Finding interesting elements

From the examples above it’s obvious how we can reach all the elements in the XML document and query them, with a simple recursive procedure (for each element, recursively visit all its children). However, since this can be a common task, ET presents some useful tools for simplifying it.

The Element object has an iter method that provices depth-first iteration (DFS) over all the sub-elements below it. The ElementTree object also has the iter method as a convenience, calling the root’s iter. Here’s the simplest way to find all the elements in the document:

>>> for elem in tree.iter():
...   print elem.tag, elem.attrib
...
doc {}
branch {'hash': '1cdf045c', 'name': 'testing'}
branch {'hash': 'f200013e', 'name': 'release01'}
sub-branch {'name': 'subrelease01'}
branch {'name': 'invalid'}

This could naturally serve as a basis for arbitrary iteration of the tree – go over all elements, examine those with interesting properties. ET can make this task more convenient and efficient, however. For this purpose, the iter method accepts a tag name, and iterates only over those elements that have the required tag:

>>> for elem in tree.iter(tag='branch'):
...   print elem.tag, elem.attrib
...
branch {'hash': '1cdf045c', 'name': 'testing'}
branch {'hash': 'f200013e', 'name': 'release01'}
branch {'name': 'invalid'}

XPath support for finding elements

A much more powerful way for finding interesting elements with ET is by using its XPath support. Element has some "find" methods that can accept an XPath path as an argument. find returns the first matching sub-element, findall all the matching sub-elements in a list and iterfind provides an iterator for all the matching elements. These methods also exist on ElementTree, beginning the search on the root element.

Here’s an example for our document:

>>> for elem in tree.iterfind('branch/sub-branch'):
...   print elem.tag, elem.attrib
...
sub-branch {'name': 'subrelease01'}

It found all the elements in the tree tagged sub-branch that are below an element called branch. And here’s how to find all branch elements with a specific name attribute:

>>> for elem in tree.iterfind('branch[@name="release01"]'):
...   print elem.tag, elem.attrib
...
branch {'hash': 'f200013e', 'name': 'release01'}

To study the XPath syntax ET supports, see this page.

Building XML documents

ET provides a simple way to build XML documents and write them to files. The ElementTree object has the write method for this purpose.

Now, there are probably two main use scenarios for writing XML documents. You either read one, modify it, and write it back, or create a new document from scratch.

Modifying documents can be done by means of manipulating Element objects. Here’s a simple example:

>>> root = tree.getroot()
>>> del root[2]
>>> root[0].set('foo', 'bar')
>>> for subelem in root:
...   print subelem.tag, subelem.attrib
...
branch {'foo': 'bar', 'hash': '1cdf045c', 'name': 'testing'}
branch {'hash': 'f200013e', 'name': 'release01'}

What we did here is delete the 3rd child of the root element, and add a new attribute to the first child. The tree can then be written back into a file. Here’s how the result would look:

>>> import sys
>>> tree.write(sys.stdout)   # ET.dump can also serve this purpose
<doc>
    <branch foo="bar" hash="1cdf045c" name="testing">
        text,source
    </branch>
    <branch hash="f200013e" name="release01">
        <sub-branch name="subrelease01">
            xml,sgml
        </sub-branch>
    </branch>
    </doc>

Note that the order of the attributes is different than in the original document. This is because ET keeps attributes in a dictionary, which is an unordered collection. Semantically, XML doesn’t care about the order of attributes.

Building whole new elements is easy too. The ET module provides the SubElement factory function to simplify the process:

>>> a = ET.Element('elem')
>>> c = ET.SubElement(a, 'child1')
>>> c.text = "some text"
>>> d = ET.SubElement(a, 'child2')
>>> b = ET.Element('elem_b')
>>> root = ET.Element('root')
>>> root.extend((a, b))
>>> tree = ET.ElementTree(root)
>>> tree.write(sys.stdout)
<root><elem><child1>some text</child1><child2 /></elem><elem_b /></root>

XML stream parsing with iterparse

As I mentioned in the beginning of this article, XML documents tend to get huge and libraries that read them wholly into memory may have a problem when parsing such documents is required. This is one of the reasons to use the SAX API as an alternative to DOM.

We’ve just learned how to use ET to easily read XML into a in-memory tree and manipulate it. But doesn’t it suffer from the same memory hogging problem as DOM when parsing huge documents? Yes, it does. This is why the package provides a special tool for SAX-like, on the fly parsing of XML. This tool is iterparse.

I will now use a complete example to demonstrate both how iterparse may be used, and also measure how it fares against standard tree parsing. I’m auto-generating an XML document to work with. Here’s a tiny portion from its beginning:

<?xml version="1.0" standalone="yes"?>
<site>
  <regions>
    <africa>
      <item id="item0">
        <location>United States</location>    <!-- Counting locations -->
        <quantity>1</quantity>
        <name>duteous nine eighteen </name>
        <payment>Creditcard</payment>
        <description>
          <parlist>
[...]

I’ve emphasized the element I’m going to refer to in the example with a comment. We’ll see a simple script that counts how many such location elements there are in the document with the text value "Zimbabwe". Here’s a standard approach using ET.parse:

tree = ET.parse(sys.argv[2])

count = 0
for elem in tree.iter(tag='location'):
    if elem.text == 'Zimbabwe':
        count += 1

print count

All elements in the XML tree are examined for the desired characteristic. When invoked on a ~100MB XML file, the peak memory usage of the Python process running this script is ~560MB and it takes 2.9 seconds to run.

Note that we don’t really need the whole tree in memory for this task. It would suffice to just detect location elements with the desired value. All the other data can be discarded. This is where iterparse comes in:

count = 0
for event, elem in ET.iterparse(sys.argv[2]):
    if event == 'end':
        if elem.tag == 'location' and elem.text == 'Zimbabwe':
            count += 1
    elem.clear() # discard the element

print count

The loop iterates over iterparse events, detecting "end" events for the location tag, looking for the desired value. The call to elem.clear() is key here – iterparse still builds a tree, doing it on the fly. Clearing the element effectively discards the tree [7], freeing the allocated memory.

When invoked on the same file, the peak memory usage of this script is just 7MB, and it takes 2.5 seconds to run. The speed improvement is due to our traversing the tree only once here, while it is being constructed. The parse approach builds the whole tree first, and then traverses it again to look for interesting elements.

The performance of iterparse is comparable to SAX, but its API is much more useful – since it builds the tree on the fly for you; SAX just gives you the events and you build the tree yourself.

Conclusion

Of the many modules Python offers for processing XML, ElementTree really stands out. It combines a lightweight, Pythonic API with excellent performance through its C accelerator module. All things considered, it almost never makes sense not to use it if you need to parse or produce XML in Python.

This article presents a basic tutorial for ET. I hope it will provide anyone with interest in the subject enough material to start using the module and explore its more advanced capabilities on their own.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg

[1] As opposed to DOM, which loads the whole document into memory and allows "random access" to its elements at any depth.
[2] expat is an open-source C library for parsing XML. Python carries it around in its distribution, and it serves as the base of Python’s XML parsing capabilities.
[3] Fredrik Lundh, the original author of ElementTree, collected some benchmarks here. Scroll down on that page to see them.
[4] When I mention _elementtree in this article, I mean the C accelerator that’s used for cElementTree. _elementtree is a C extension module for Python that is part of the standard distribution.
[5] Be sure to have the module documentation handy and look up the methods I’m calling to better understand the parameters passed.
[6] Attributes is an overloaded term here. There are Python object attributes, and there are XML element attributes. Hopefully it will be obvious from the context which one is implied.
[7] To be precise, the root element of the tree is still alive. In the unlikely event that the root element is very large, you can discard it too, but that would need a bit more code.

Related posts:

  1. more thoughts on previous problem
  2. Python development – improving ElementTree for 3.3
  3. SICP section 2.2.2
  4. SICP section 2.3.4
  5. SICP section 2.3.3

23 Responses to “Processing XML in Python with ElementTree”

  1. Tshepang LekhonkhobeNo Gravatar Says:

    This is quite good, thanks much. It would be kool if something like this can be added to http://docs.python.org/howto/index.html.

    One note:
    You change from doing “ET.ElementTree(file=’file’)” to doing “ET.parse(‘file’)”, without an explanation. Which to use where, or is it mainly a matter of taste?

  2. JoshNo Gravatar Says:

    Thanks so much for this article. I have struggled with a dynamic and elegant way of traversing element trees but this really clarified the process. Great work!

    Josh

  3. UserNo Gravatar Says:

    All this text, and not one word on XSD support?? Usually I want the XML parser also to convert each tag to its corresponding type automatically (using an XSD schema), so that I won’t have to do txt2num() on every numeric entry, deal with enums, bools etc…

    For example, in the entry below, “age” would be automatically stored as a number:

    <person id="1">
        <first-name>John</first-name>
        <middle-name>Mary</middle-name>
        <last-name>Doe</last-name>
        <gender>male</gender>
        <age>33</age>
      </person>

    Most serious languages have support for this. Even statically-typed languages like c++:
    http://www.codesynthesis.com/projects/xsd/documentation/cxx/tree/guide/

    for Java you have Xerces, etc.

    What is the Python equivalent(s)?

  4. elibenNo Gravatar Says:

    Tshepang,

    Thanks for the feedback. The two approaches are exactly equivalent. Here’s the parse function:

    def parse(source, parser=None):
        tree = ElementTree()
        tree.parse(source, parser)
        return tree

    Choosing either is just a matter of taste.

    User

    Interesting, I have never used XSD before. Anyway, this article is very basic and is just meant to describe basic stuff. There are whole books about XML when you need to know “everything”.

  5. eheNo Gravatar Says:

    Thanks for sharing !.

    One small contribution:

    When testing the script “Finding Interesting Elements” under Centos 6 (python 2.6.6) I got this error:

    AttributeError: ElementTree instance has no attribute ‘iter’

    So, I just replace:

    for elem in tree.iter():

    with

    for elem in tree.findall('.//*'):
  6. elibenNo Gravatar Says:

    Ehe,

    Yep, the ElementTree.iter method is new in 2.7

  7. Alby DuahrbNo Gravatar Says:

    Thanks kindly for this; I have been experimenting with all the Python XML tools and didn’t have a good feel for which system to use. I think several of the APIs/systems are probably left over from previous distributions and kept for backwards compatibility. ElementTree has been looking the best for me and with the iterparse method now I can do what I need to.

  8. JWNo Gravatar Says:

    “…it will be sufficient to just import xml.etree.ElemenTree” misspells ET ?

  9. elibenNo Gravatar Says:

    JW,

    Fixed, thanks.

  10. StevNo Gravatar Says:

    Good kick start for ET! I was wandering how to modify a XML file exceed 2 G in file size? With the modification mean require to write back to the existing or new file. I believe this is a critical consideration for people building apps with persistent data stored in XML.

  11. elibenNo Gravatar Says:

    Stev,

    I really, really suggest you reconsider. 2Gb XML files are a bad idea, and processing them in Python is doubly so. Use a proper data store!

  12. EfraimalexanderNo Gravatar Says:

    Great teaching!
    I will keep looking for similar explanation on how ElementTree can be applied to XSD schema files (read and write). The http://www.w3schools.com/schema/schema_example.asp is good but simple tutorials like this one are far better. A second theme to discuss is how to create and apply XSLT format styles. Thanks!

  13. MikeKJNo Gravatar Says:

    Doing some testing with etree to construct an xml and get an error at root.extend as in it has no attribute extend

    import xml.etree.cElementTree as ET
    import sys
    a = ET.Element(‘quotationinput version=”1.0″‘)
    b = ET.SubElement(a, ‘authorisation’)
    c = ET.SubElement(b, ‘password’)
    c.text = pwd
    d = ET.SubElement(a, ‘productdetail’)
    e = ET.SubElement(d, ‘producttype’)
    e.text = “”
    f = ET.SubElement(d, ‘accommodationCode’)
    f.text = “”
    g = ET.SubElement(d, ‘startDate’)
    g.text = “”
    h = ET.SubElement(d, ‘duration’)
    h.text = “”
    i = ET.SubElement(d, ‘unitCode’)
    i.text = unit
    j = ET.SubElement(a, ‘partydetail’)
    k = ET.SubElement(j, ‘adultCount’)
    k.text = “”
    l = ET.SubElement(j, ‘childCount’)
    l.text = “”
    m = ET.SubElement(j, ‘babyCount’)
    m.text = “”
    n = ET.SubElement(j, ‘petCount’)
    n.text = “”
    root = ET.Element(‘root’)
    root.extend([a,b,c,d,e,f,g,h,i,j,k,l,m,n])
    tree = ET.ElementTree(root)
    tree.write(sys.stdout)

  14. elibenNo Gravatar Says:

    MikeKJ,

    As much as I’d like to help, this really isn’t the place for posting possible Python bug reports. I suggest the following:

    1. If you’re very sure you’ve found a problem, open a Python Issue, or send an email to the python-dev mailing list.
    2. If you’re not sure whether this is an actual bug in Python, send an email to the python-list maillist or ask on StackOverflow.

  15. JETNo Gravatar Says:

    Thanks for the tutorial!

    Is there a way to keep the order of the attributes as in the original stream?

  16. LucasNo Gravatar Says:

    Good tutorial!!
    Much more detailed than python docs

    Thanks!

  17. elibenNo Gravatar Says:

    @Lucas,

    Note that the documentation of XML ET has been significantly improved starting with version 3.3

  18. VictorNo Gravatar Says:

    I’m new to Python and ElementTree. I have namespace in my sample xml, eg
    <?xml version="1.0"?>
    <doc xmlns="http://www.fpml.org/FpML-5">
    <branch name="testing" hash="1cdf045c">
    text,source
    </branch>
    <branch name="release01" hash="f200013e">
    <sub-branch name="subrelease01">
    xml,sgml
    </sub-branch>
    </branch>
    <branch name="invalid">
    </branch>
    </doc>

    How do I make
    for elem in tree.iterfind(‘branch/sub-branch’):
    continue to work? (it currently returns nothing)

  19. elibenNo Gravatar Says:

    @Victor,

    I plan to write an article on handling XML namespaces with ET one day. In the meantime, I suggest you ask on Stack Overflow.

  20. Bill MillNo Gravatar Says:

    Thanks so much for pointing me to .clear(), that’s super helpful.

  21. ScottNo Gravatar Says:

    Great tutorial. Thanks for taking the time to write this…

  22. Mark GardnerNo Gravatar Says:

    Did you ever get that page written with the namespaces?

  23. LunaNo Gravatar Says:

    How to disable output escaping when modifying(writing back to) an xml file?
    And also, once the the file is modified and is written back, the xml declaration (proflogue etc is lost).
    For eg:
    ” in strings is replaced by &quot,
    & changes to &amp,
    single quotes are replaced by double quotes
    Can I set some property in the etree.write() method to disable the unwanted escaping of characters?

Leave a Reply

To post code with preserved formatting, enclose it in `backticks` (even multiple lines)