An interesting tree serialization algorithm from DWARF

September 29th, 2011 at 6:41 am

There are several techniques available for serializing trees. In this post I want to present one interesting technique I recently ran into, in the context of the DWARF debugging information format [1]. It allows serializing generic N-ary trees (where each node can have any number of children) into a linear data structure suitable for storage or tramsmission.

First, let’s define the data structures involved. I will use Python code to demonstrate the algorithm, so a simplistic tree node would be:

class TreeNode(object):
    def __init__(self, data):
        self.data = data
        self.children = []

children is a list of other TreeNode objects, which makes each node the root of a sub-tree that can be traversed.

Now let’s build an actual tree which we’re going to use for demonstration purposes:

tree = {}
for n in 'ABCDEFGXY':
    tree[n] = TreeNode(n)

tree['A'].children = [tree['B'], tree['C'], tree['D']]
tree['B'].children = [tree['E'], tree['F']]
tree['F'].children = [tree['G']]
tree['D'].children = [tree['X'], tree['Y']]

Here’s how it looks [2]:

http://eli.thegreenplace.net/wp-content/uploads/2011/09/tree1.png

So how is a tree serialized?

Here’s a quote from the DWARF v3 standard section 2.3 explaining it, slightly rephrased:

The tree itself is represented by flattening it in prefix order. Each node is defined either to have children or not to have children. If a node is defined not to have children, the next physically succeeding node is a sibling. If a node is defined to have children, the next physically succeeding node is its first child. Additional children are represented as siblings of the first child. A chain of sibling entries is terminated by a null node.

After a couple of minutes of thought it should become obvious that this indeed creates a serialized representation that is reversible. For my Python code, the serialized representation is going to be a list of "entries", each entry being either None (to specify the "null node" from the description above), or a pair of (data, has_children_flag), with data being the tree node data, and has_children_flag a boolean specifying whether this node has children. So for the tree depicted above, the serialized representation is:

A,True
B,True
E,False
F,True
G,False
None
None
C,False
D,True
X,False
Y,False
None
None

The algorithms for serializing a tree into this representation and deserializing it back are charmingly simple. Here they are, with Python (as usual) serving as pseudo-code as well as the implementation.

First, serialization. The main idea is to walk the tree recursively in pre-order (first visiting a node, then its children in order), while populating the serialized list which exists outside the recursive visitor:

def serialize_tree(root_node):
    """ Given a tree root node (some object with a 'data' attribute
        and a 'children' attribute which is a list of child nodes),
        serialize it to a list, each element of which is either a
        pair (data, has_children_flag), or None (which signals an
        end of a sibling chain).
    """
    lst = []
    def serialize_aux(node):
        # Recursive visitor function
        if len(node.children) > 0:
            # The node has children, so:
            #  1. add it to the list & mark that it has children
            #  2. recursively serialize its children
            #  3. finally add a null entry to signal that the children
            #     of this node have ended
            lst.append((node.data, True))
            for child in node.children:
                serialize_aux(child)
            lst.append(None)
        else:
            # The node is child-less, so simply add it to
            # the list & mark that it has no chilren
            lst.append((node.data, False))
    serialize_aux(root_node)
    return lst

And now deserialization. It uses a stack of "parents" to collect the nodes into a tree hierarchy. At each step in the loop the invariant is that the node at the top of the stack is the parent node to which nodes have to be added. When an entry with children is encountered, a new node is pushed on top of the stack. When a null entry is encountered, it means the end of children for the current parent, so it’s popped off the stack:

def deserialize_tree(nodelist):
    """ Expects a node list of the form created by serialize_tree.
        Each entry in the list is either None or a pair of the form
        (data, has_children_flag).
        Reconstruct the tree back from it and return its root node.
    """
    # The first item in the nodelist represents the tree root
    root = TreeNode(nodelist[0][0])
    parentstack = [root]
    for item in nodelist[1:]:
        if item is not None:
            # This node is added to the list of children of the current
            # parent.
            node = TreeNode(item[0])
            parentstack[-1].children.append(node)
            if item[1]: # has children?
                parentstack.append(node)
        else:
            # end of children for current parent
            parentstack.pop()
    return root

The DWARF spec just mentions the serialization format (the quote I pasted above) – it doesn’t say how to implement it. If you can think of a simpler algorithm to implement this (de)serialization scheme, please let me know.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg

[1] In DWARF this scheme is used to serialize a tree of DIEs (Debugging Information Entries) into the .debug_info section.
[2] You may be wondering how this image was generated. I’ve used the excellent Google Visualization API to draw it, with the "orgchart" package. It’s simple to write a bit of Python code that automatically generates the data table given a root of the tree. The visualization API renders the image onto a HTML canvas with Javascript. I then took a screenshot and converted the result to a PNG that’s displayed here.

Related posts:

  1. SICP section 3.3.3
  2. 1136 solved
  3. more thoughts on previous problem
  4. SICP section 2.3.3
  5. SICP section 2.3.4

8 Responses to “An interesting tree serialization algorithm from DWARF”

  1. Marii YonovNo Gravatar Says:

    I was wondering for half an hour why you don’t have two None’s at the end of the list (Before seeing last pop :) ). But with that pop the algorithm isn’t correct. Consider the example:
    tree['A'].children = []
    print(serialize_tree(tree['A']))

    Then you will have empty list while the correct is: [('A', False)].

  2. elibenNo Gravatar Says:

    Marii Yonov,

    You’re right – that corner case wasn’t being handled correctly. Fixed now. Thanks for noticing!

  3. Emilio MontiNo Gravatar Says:

    Good article! :-)

    A python implementation for the deserialization of the DWARF DIEs tree:
    http://code.google.com/p/pydevtools/source/browse/trunk/bintools/dwarf/info.py#107

    Cheers,
    Emilio

  4. elibenNo Gravatar Says:

    Emilio,

    Thanks! I see you’ve implemented it pretty much the same way. I’ve also seen another implementation (in LLVM) where each DIE has a parent link, so no explicit parent stack is required (since a chain of parents always exists – you only need to store the “current parent”).

  5. Michael MolNo Gravatar Says:

    And now to get someone to build this as a task on Rosetta Code…

  6. John FialaNo Gravatar Says:

    This was really interesting – I didn’t know about this way of serializing trees, but it makes a lot of sense and I like it.

  7. cybersolNo Gravatar Says:

    I still think you are missing one None at the end. The first one terminates the X,Y level, so you need another to terminate the B,C,D level. The third one to terminate the A level is the one that isn’t needed.

  8. elibenNo Gravatar Says:

    cybersol,

    You’re right – fixed. Thinking about this, a redundant None isn’t generated for the root level at all, because as the algorithm is currently implemented None is only placed after a list of children, and the root isn’t in any list of children.

Leave a Reply

To post code with preserved formatting, enclose it in `backticks` (even multiple lines)