Parsing C++ in Python with Clang

July 3rd, 2011 at 5:15 am

Note (31.05.2014): Clang’s APIs evolve quickly, and this includes libclang and the Python bindings. Therefore, the samples in this post may no longer work. For working samples that are kept up-to-date with upstream Clang, check out my llvm-clang-samples repository on GitHub.

People who need to parse and analyze C code in Python are usually really excited to run into pycparser. However, when the task is to parse C++, pycparser is not the solution. When I get asked about plans to support C++ in pycparser, my usual answer is that there are no such plans [1] and you should look elsewhere. Specifically, at Clang.

Clang is a compiler front end for C, C++ and Objective-C. It’s a liberally licensed open-source project backed by Apple, which uses it for its own tools. Together with its parent project, the LLVM compiler backend, Clang is becoming a formidable alternative to gcc itself these days. The dev team behind Clang (and LLVM) is top-notch, and its source is one of the best-designed bodies of C++ code in the wild. Clang’s development is very active and closely follows the latest C++ standards.

So what I point people to when I’m asked about C++ parsing is Clang. There’s a slight problem with that, however. People like pycparser because it’s Python, and Clang’s API is C++, which is not the most hacking-friendly high-level language out there, to say the least.

libclang

Enter libclang. Not so long ago, the Clang team wisely recognized that Clang can be used not only as a compiler proper, but also as a tool for analyzing C/C++/ObjC code. In fact, Apple’s own Xcode development tools use Clang as a library under the hood for code completion, cross-referencing, and so on.

The component through which Clang enables such usage is called libclang. It’s a C API [2] that the Clang team vows to keep relatively stable, allowing the user to examine parsed code at the level of an abstract syntax tree (AST) [3].

More technically, libclang is a shared library that packages Clang with a public-facing API defined in a single C header file: clang/include/clang-c/Index.h.

Python bindings to libclang

libclang comes with Python bindings, which reside in clang/bindings/python, in module clang.cindex. This module relies on ctypes to load the dynamic libclang library and tries to wrap as much of libclang as possible with a Pythonic API.

Documentation?

Unfortunately, the state of documentation for libclang and its Python bindings is dire. The official documentation according to the devs is the source (and auto-generated Doxygen HTML). In addition, all I could find online is a presentation and a couple of outdated email messages from the Clang dev mailing list.

On the bright side, if you just skim the Index.h header file keeping in mind what it’s trying to achieve, the API isn’t hard to understand (and neither is the implementation, especially if you’re a bit familiar with Clang’s internals). Another place to look things up is the clang/tools/c-index-test tool, which is used to test the API and demonstrates its usage.

For the Python bindings, there is no documentation either, except the source plus a couple of examples that are distributed alongside it. So I hope this article will be helpful!

Setting up

Setting up usage of the Python bindings is very easy:

  • Your script needs to be able to find the clang.cindex module. So either copy it appropriately or set up PYTHONPATH to point to it [4].
  • clang.cindex needs to be able to find the libclang.so shared library. Depending on how you build/install Clang, you will need to copy it appropriately or set up LD_LIBRARY_PATH to point to its location. On Windows, this is libclang.dll and it should be on PATH.

That arranged, you’re ready to import clang.cindex and start rolling.
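As a quick sanity check before importing the bindings, something like the following sketch can tell you whether libclang can be located at all. It uses only the standard library. The helper name find_libclang and its extra_dirs parameter are my own inventions for illustration, not part of clang.cindex.

```python
import ctypes.util
import os


def find_libclang(extra_dirs=()):
    """Return a path or library name for libclang, or None if not found.

    `extra_dirs` is a hypothetical list of directories to probe first,
    e.g. the lib/ directory of a local Clang build.
    """
    # File names libclang is shipped under on Linux, macOS and Windows.
    candidates = ("libclang.so", "libclang.dylib", "libclang.dll")
    for directory in extra_dirs:
        for name in candidates:
            path = os.path.join(directory, name)
            if os.path.isfile(path):
                return path
    # Fall back to the platform's usual shared-library search.
    return ctypes.util.find_library("clang")


print("libclang:", find_libclang() or "not found")
```

If this prints "not found", the bindings will most likely fail to load the library as well. Later versions of the bindings also provide clang.cindex.Config.set_library_file, which accepts such a path directly instead of relying on LD_LIBRARY_PATH.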

Simple example

Let’s start with a simple example. The following script uses the Python bindings of libclang to find all references to some type in a given file:

#!/usr/bin/env python
""" Usage: call with <filename> <typename>
"""

import sys
import clang.cindex

def find_typerefs(node, typename):
    """ Find all references to the type named 'typename'
    """
    if node.kind.is_reference():
        ref_node = clang.cindex.Cursor_ref(node)
        if ref_node.spelling == typename:
            print 'Found %s [line=%s, col=%s]' % (
                typename, node.location.line, node.location.column)
    # Recurse for children of this node
    for c in node.get_children():
        find_typerefs(c, typename)

index = clang.cindex.Index.create()
tu = index.parse(sys.argv[1])
print 'Translation unit:', tu.spelling
find_typerefs(tu.cursor, sys.argv[2])

Suppose we invoke it on this dummy C++ code:

class Person {
};


class Room {
public:
    void add_person(Person person)
    {
        // do stuff
    }

private:
    Person* people_in_room;
};


template <class T, int N>
class Bag {
};


int main()
{
    Person* p = new Person();
    Bag<Person, 42> bagofpersons;

    return 0;
}

Running it to find references to the type Person, we get:

Translation unit: simple_demo_src.cpp
Found Person [line=7, col=21]
Found Person [line=13, col=5]
Found Person [line=24, col=5]
Found Person [line=24, col=21]
Found Person [line=25, col=9]

Understanding how it works

To see what the example does, we need to understand its inner workings on 3 levels:

  • Conceptual level – what is the information we’re trying to pull from the parsed source and how it’s stored
  • libclang level – the formal C API of libclang, since it’s much better documented (albeit only in comments in the source) than the Python bindings
  • The Python bindings, since this is what we directly invoke

Creating the index and parsing the source

We’ll start at the beginning, with these lines:

index = clang.cindex.Index.create()
tu = index.parse(sys.argv[1])

An "index" represents a set of translation units compiled and linked together. We need some way of grouping several translation units if we want to reason across them. For example, we may want to find references to some type defined in a header file, in a set of other source files. Index.create() invokes the C API function clang_createIndex.

Next, we use Index’s parse method to parse a single translation unit from a file. This invokes clang_parseTranslationUnit, which is a key function in the C API. Its comment says:

This routine is the main entry point for the Clang C API, providing the ability to parse a source file into a translation unit that can then be queried by other functions in the API.

This is a powerful function: it can optionally accept the full set of flags normally passed to the command-line compiler. It returns an opaque CXTranslationUnit object, which is encapsulated in the Python bindings as TranslationUnit. This TranslationUnit can be queried; for example, the name of the translation unit is available in the spelling property:

print 'Translation unit:', tu.spelling

Its most important property, however, is cursor. A cursor is a key abstraction in libclang; it represents some node in the AST of a parsed translation unit. The cursor unifies the different kinds of entities in a program under a single abstraction, providing a common set of operations, such as getting its location and child cursors. TranslationUnit.cursor returns the top-level cursor of the translation unit, which serves as the starting point for exploring its AST. I will use the terms cursor and node interchangeably from this point on.
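Going back for a moment to the compiler flags that parse accepts: they matter more than it may seem, because code guarded by preprocessor conditionals is only parsed when the relevant macros are defined. Here is a minimal sketch of passing flags through the args parameter of Index.parse. The helper names clang_args and parse_with_flags are hypothetical, and the flags shown are ordinary Clang command-line options.

```python
def clang_args(std="c++11", defines=None, include_dirs=()):
    """Build a list of command-line flags to hand to the parser."""
    args = ["-x", "c++", "-std=" + std]
    # Each macro becomes a -DNAME=VALUE flag, in a stable order.
    for name, value in sorted((defines or {}).items()):
        args.append("-D%s=%s" % (name, value))
    # Each include directory becomes a -I flag.
    for directory in include_dirs:
        args.append("-I" + directory)
    return args


def parse_with_flags(path, **kwargs):
    """Parse `path`, forwarding the flags built by clang_args()."""
    import clang.cindex  # imported lazily: needs libclang at run time

    index = clang.cindex.Index.create()
    return index.parse(path, args=clang_args(**kwargs))


print(clang_args(defines={"SOME_FLAG": 1}))
```

This prints ['-x', 'c++', '-std=c++11', '-DSOME_FLAG=1']. With these flags, code inside an #if SOME_FLAG block would be seen by the parser.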

Working with cursors

The Python bindings encapsulate the libclang cursor in the Cursor object. It has many attributes, the most interesting of which are:

  • kind – an enumeration specifying the kind of AST node this cursor points at
  • spelling – the source-code name of the node
  • location – the source-code location from which the node was parsed
  • get_children – its children nodes

get_children requires special explanation, because this is a particular point at which the C and Python APIs diverge.

The libclang C API is based on the idea of visitors. To walk the AST from a given cursor, the user code provides a callback function to clang_visitChildren. This callback is then invoked on the children of the given AST node and, if its return value asks for recursion, on their descendants as well.

The Python bindings, on the other hand, encapsulate visiting internally, and provide a more Pythonic iteration API via Cursor.get_children, which returns the children nodes (cursors) of a given cursor. It’s still possible to access the original visitation APIs directly through Python, but using get_children is much more convenient. In our example, we use get_children to recursively visit all the children of a given node:

for c in node.get_children():
    find_typerefs(c, typename)
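To make the difference between the two styles concrete, here is a toy model that doesn't touch libclang at all. The Node class and both functions are stand-ins I wrote for illustration. The three return codes mirror the values of CXChildVisitResult in Index.h, and get_children is built on top of the visitor by collecting direct children into a list, which is roughly the approach the bindings take.

```python
# Return codes mirroring CXChildVisitResult in Index.h.
BREAK, CONTINUE, RECURSE = 0, 1, 2


class Node:
    """A toy stand-in for an AST node; not part of libclang."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def visit_children(parent, visitor):
    """Call visitor(node, parent) on each child, clang_visitChildren style.

    Returns True if the traversal was stopped early by BREAK.
    """
    for child in parent.children:
        result = visitor(child, parent)
        if result == BREAK:
            return True
        if result == RECURSE and visit_children(child, visitor):
            return True
    return False


def get_children(node):
    """Collect the direct children of `node` by way of the visitor."""
    found = []

    def collect(child, parent):
        found.append(child)
        return CONTINUE  # move on to the next sibling, do not descend

    visit_children(node, collect)
    return found


tree = Node("tu", [Node("bar", [Node("call foo")]), Node("main")])
print([n.name for n in get_children(tree)])

seen = []
visit_children(tree, lambda n, p: seen.append(n.name) or RECURSE)
print(seen)
```

The first print shows ['bar', 'main'] (direct children only), while the recursive visit records ['bar', 'call foo', 'main'].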

Some limitations of the Python bindings

Unfortunately, the Python bindings aren’t complete and still have some bugs, because they are a work in progress. As an example, suppose we want to find and report all the function calls in this file:

bool foo()
{
    return true;
}

void bar()
{
    foo();
    for (int i = 0; i < 10; ++i)
        foo();
}

int main()
{
    bar();
    if (foo())
        bar();
}

Let’s write this code:

import sys
import clang.cindex

def callexpr_visitor(node, parent, userdata):
    if node.kind == clang.cindex.CursorKind.CALL_EXPR:
        print 'Found %s [line=%s, col=%s]' % (
                node.spelling, node.location.line, node.location.column)
    return 2 # means continue visiting recursively

index = clang.cindex.Index.create()
tu = index.parse(sys.argv[1])
clang.cindex.Cursor_visit(
        tu.cursor,
        clang.cindex.Cursor_visit_callback(callexpr_visitor),
        None)

This time we’re using the libclang visitation API directly. The result is:

Found None [line=8, col=5]
Found None [line=10, col=9]
Found None [line=15, col=5]
Found None [line=16, col=9]
Found None [line=17, col=9]

While the reported locations are fine, why is the node name None? After some perusal of libclang’s code, it turns out that for expressions, we shouldn’t be printing the spelling, but rather the display name. In the C API it means clang_getCursorDisplayName and not clang_getCursorSpelling. But, alas, the Python bindings don’t have clang_getCursorDisplayName exposed!

We won’t let this stop us, however. The source code of the Python bindings is quite straightforward, and simply uses ctypes to expose additional functions from the C API. All it takes is adding these lines to bindings/python/clang/cindex.py:

Cursor_displayname = lib.clang_getCursorDisplayName
Cursor_displayname.argtypes = [Cursor]
Cursor_displayname.restype = _CXString
Cursor_displayname.errcheck = _CXString.from_result

With that in place, we can use Cursor_displayname. Replacing node.spelling by clang.cindex.Cursor_displayname(node) in the script, we now get the desired output:

Found foo [line=8, col=5]
Found foo [line=10, col=9]
Found bar [line=15, col=5]
Found foo [line=16, col=9]
Found bar [line=17, col=9]
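The four-line ctypes pattern used for Cursor_displayname (fetch the symbol, declare argtypes, declare restype, attach an errcheck hook that converts the result) is completely general. Here is a sketch of the same pattern that runs without Clang installed. It wraps Py_GetVersion from CPython's own C API, which I picked only because it is always available through ctypes.pythonapi; it has nothing to do with libclang.

```python
import ctypes
import sys


def _decode(result, func, arguments):
    """errcheck hook: turn the raw C string (bytes) into a Python str."""
    return result.decode("utf-8")


# Py_GetVersion stands in for a libclang function in this sketch.
get_version = ctypes.pythonapi.Py_GetVersion
get_version.argtypes = []                # the C function takes no arguments
get_version.restype = ctypes.c_char_p    # ... and returns a const char *
get_version.errcheck = _decode           # post-process every result

# Both values should name the same interpreter version.
print(get_version().split()[0], sys.version.split()[0])
```

The errcheck hook plays the same role as _CXString.from_result in the bindings: callers get a ready-to-use Python object and never see the raw C return value.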

Update (06.07.2011): Inspired by this article, I submitted a patch to the Clang project to expose Cursor_displayname, as well as to fix a few other problems with the Python bindings. It was committed by Clang’s core devs in revision 134460 and should now be available from trunk.

Some limitations of libclang

As we have seen above, limitations in the Python bindings are relatively easy to overcome. Since libclang provides a straightforward C API, it’s just a matter of exposing additional functionality with appropriate ctypes constructs. To anyone even moderately experienced with Python, this isn’t a big problem.

Some limitations are in libclang itself, however. For example, suppose we wanted to find all the return statements in a chunk of code. Turns out this isn’t possible through the current API of libclang. A cursory look at the Index.h header file reveals why.

enum CXCursorKind enumerates the kinds of cursors (nodes) we may encounter via libclang. This is the portion related to statements:

/* Statements */
CXCursor_FirstStmt                     = 200,
/**
 * \brief A statement whose specific kind is not exposed via this
 * interface.
 *
 * Unexposed statements have the same operations as any other kind of
 * statement; one can extract their location information, spelling,
 * children, etc. However, the specific kind of the statement is not
 * reported.
 */
CXCursor_UnexposedStmt                 = 200,

/** \brief A labelled statement in a function.
 *
 * This cursor kind is used to describe the "start_over:" label statement in
 * the following example:
 *
 * \code
 *   start_over:
 *     ++counter;
 * \endcode
 *
 */
CXCursor_LabelStmt                     = 201,

CXCursor_LastStmt                      = CXCursor_LabelStmt,

Ignoring the placeholders CXCursor_FirstStmt and CXCursor_LastStmt which are used for validity testing, the only statement recognized here is the label statement. All other statements are going to be represented with CXCursor_UnexposedStmt.

To understand the reason for this limitation, it’s instructive to ponder the main goal of libclang. Currently, this API’s main use is in IDEs, where we want to know everything about types and references to symbols, but don’t particularly care what kind of statement or expression we see [5].

Fortunately, from discussions on the Clang dev mailing lists it can be gathered that these limitations aren’t really intentional. Things get added to libclang on a per-need basis. Apparently no one has needed to discern different statement kinds through libclang yet, so no one has added this feature. If it’s important enough for someone, they should feel free to suggest a patch on the mailing list. In particular, this specific limitation (lack of statement kinds) is especially easy to overcome. Looking at cxcursor::MakeCXCursor in libclang/CXCursor.cpp, it’s obvious how these "kinds" are generated (comments are mine):

CXCursor cxcursor::MakeCXCursor(Stmt *S, Decl *Parent,
                                CXTranslationUnit TU) {
  assert(S && TU && "Invalid arguments!");
  CXCursorKind K = CXCursor_NotImplemented;

  switch (S->getStmtClass()) {
  case Stmt::NoStmtClass:
    break;

  case Stmt::NullStmtClass:
  case Stmt::CompoundStmtClass:
  case Stmt::CaseStmtClass:

  ... // many other statement classes

  case Stmt::MaterializeTemporaryExprClass:
    K = CXCursor_UnexposedStmt;
    break;

  case Stmt::LabelStmtClass:
    K = CXCursor_LabelStmt;
    break;

  case Stmt::PredefinedExprClass:

  ... // many other statement classes

  case Stmt::AsTypeExprClass:
    K = CXCursor_UnexposedExpr;
    break;

  ... // more code

This is simply a mega-switch on the result of Stmt::getStmtClass() (Clang’s internal statement class), and among the statements proper, only Stmt::LabelStmtClass gets a kind other than CXCursor_UnexposedStmt. So recognizing additional "kinds" is trivial:

  1. Add another enum value to CXCursorKind, between CXCursor_FirstStmt and CXCursor_LastStmt
  2. Add another case to the switch in cxcursor::MakeCXCursor to recognize the appropriate class and return this kind
  3. Expose the enumeration value in (1) to the Python bindings
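The three steps above can be modelled in a few lines of Python. This is only a toy model of the idea, not code from Clang or its bindings: the value 202 for a return-statement kind is a number I chose for the sketch, and the lookup table stands in for the C++ switch. Note that in the real header, CXCursor_LastStmt would have to be bumped to the new value as well.

```python
UNEXPOSED_STMT = 200
LABEL_STMT = 201
RETURN_STMT = 202  # step 1: hypothetical new value, chosen for this sketch

# Step 2: the switch in MakeCXCursor, modelled as a lookup table from
# Clang's statement class name to the kind reported through the API.
KIND_FOR_STMT_CLASS = {
    "LabelStmtClass": LABEL_STMT,
    "ReturnStmtClass": RETURN_STMT,  # the added case
}

# Step 3: the names the Python side would expose for each value.
KIND_NAMES = {
    UNEXPOSED_STMT: "UNEXPOSED_STMT",
    LABEL_STMT: "LABEL_STMT",
    RETURN_STMT: "RETURN_STMT",
}


def kind_name(stmt_class):
    """Map a statement class to its exposed kind name.

    Classes without an entry of their own fall back to the unexposed kind.
    """
    return KIND_NAMES[KIND_FOR_STMT_CLASS.get(stmt_class, UNEXPOSED_STMT)]


for cls in ("LabelStmtClass", "ReturnStmtClass", "IfStmtClass"):
    print(cls, "->", kind_name(cls))
```

In this model, return statements get their own kind, while an if statement, for which no case was added, still comes out as UNEXPOSED_STMT.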

Conclusion

Hopefully this article has been a useful introduction to libclang’s Python bindings (and libclang itself along the way). Although there is a dearth of external documentation for these components, they are well written and commented, and their source code is thus straightforward enough to be reasonably self-documenting.

It’s very important to keep in mind that these APIs wrap an extremely powerful C/C++/ObjC parser engine that is being very actively developed. In my personal opinion, Clang is one’s best bet for an up-to-date open-source C++ parsing library these days. Nothing else comes even close.

A small fly in the ointment is the set of limitations in libclang itself and its Python bindings. These are a by-product of libclang being a relatively recent addition to Clang, which itself is a very young project.

Fortunately, as I hope this article demonstrated, these limitations aren’t terribly difficult to work around. Only a small amount of Python and C expertise is required to extend the Python bindings, while a bit of understanding of Clang lays the path to enhancements to libclang itself. In addition, since libclang is still being actively developed, I’m quite confident that this API will keep improving over time, so it will have fewer and fewer limitations and omissions as time goes by.


[1] For me, there are a few reasons for not wanting to get into C++ parsing. First, I like my projects being born from a need. I needed to parse C, so pycparser was created. I have no need to parse C++. Second, as hard as C is to parse, C++ is much harder since its grammar is even more ambiguous. Third, a great tool for parsing C++ already exists: Clang.
[2] C for better interoperability with non C/C++ based languages and tools. For example, the Python bindings would be much harder to implement on top of a C++ API.
[3] The key word here is stable. While Clang as a whole is designed in a library-based approach and its parts can be used directly, these are internal APIs the development team isn’t obliged to keep stable between releases.
[4] Note that the Python bindings are part of the source distribution of Clang.
[5] Expression kinds are also severely limited in libclang.

27 Responses to “Parsing C++ in Python with Clang”

  1. Eike Hein Says:

    Lovely article at a great time, thanks – I was just recently wondering about the Python bindings to clang’s parser for use in static analysis applications.

  2. Rodrigo Chappa Says:

    Great article, congratulations!

    I believe there is a small mistake — I might be wrong — where the article says “But, alas, the Python bindings don’t have clang_getCursorSpelling exposed!” it should be “But, alas, the Python bindings don’t have clang_getCursorDisplayName exposed!”

  3. eliben Says:

    Rodrigo, thanks a lot! Fixed.

  4. Daniel Jennings Says:

    This is an awesome article and it discusses exactly what I’ve been hunting around for on the internet. I’m running into a problem getting their example code to work, so I’m curious what system you’re using successfully. I’m on a 64-bit Windows Vista running Python 2.7 (tried 2.6 as well) 32-bit, with the Mingw32 build of the clang.dll from LLVM release 2.9 renamed to libclang.dll. I then use the Python bindings from the same tag in SVN, but I run into this error when I try to call any function that returns a non-boring type: ValueError: Procedure called with not enough arguments (4 bytes missing) or wrong calling convention when calling _clang_getDiagnosticLocation. I’m definitely loading the DLL as a CDLL (as I should) and I have seen at least one function work fine (a function that just returns a number, e.g. _clang_getDiagnosticSeverity).

    Any thoughts? Is it because I’m running on Windows instead of on Linux and there’s something to do with struct sizes mismatching?

  5. Daniel Jennings Says:

    Aha, I have figured it out. Apparently the msys32 clang.dll wasn’t cutting it, so I built (using VS2010) my own libclang.dll and everything works peachy keen :)

  6. eliben Says:

    Daniel,

    I’m doing all development on Linux (Ubuntu 10.04), but have tested this stuff on Windows too. For that, I’ve been using MSVC 2010 Express – it compiles LLVM&Clang just fine.

  7. Daniel Jennings Says:

    Yeah, the problem was that I was trying to use the msys32 version of clang.dll from LLVM’s binaries as my libclang.dll, but that doesn’t play well at all with Python’s ctypes. I have since started compiling my own libclang.dll, which was worth doing because I’ve been needing to add more CXCursor_ values for different statements (since I actually care about most of the implementation statements inside of function definitions).

  8. eliben Says:

    Daniel,

    Which CXCursor_ values are you adding? I’ve also considered adding some such values and getting them committed into Clang trunk.

  9. Stone Says:

    Great article. I tried clang-index for a while, but found that it needs enough compile flags passed in to make Clang index the full source code. For example:
    #if SOME_FLAG
    ….

    code
    #endif //SOME_FLAG
    If the indexer is created without the SOME_FLAG=1 flag, the code between #if and #endif will not be parsed. I read Index.h but did not find a solution. Could we have the Clang indexer do static analysis without flags?

    Thanks

  10. eliben Says:

    Stone,

    I suggest you ask the clang mailing list (cfe-dev) about that. I’m really not expert enough to answer

  11. Ren Says:

    Hi, thanks for the perfect explanation here.
    I am trying to use the Python bindings to extract information about preprocessor directives, but the C API of Clang does not seem powerful enough to manipulate macros.
    So, do you think it is possible to access the directives using the Python bindings to the Clang API?

    Thanks

  12. eliben Says:

    Ren,

    The Python bindings only have access to the C API of Clang. So if something is not supported by that C API, it can’t be supported via the Python bindings. You may try to ask in the clang mailing list about the stuff you can’t find a way to do.

  13. Arthur Says:

    I’m just getting into trying to use Clang to parse C++ and do things with the AST. I’m lost even before we get to this stage…

    My machine (Ubuntu, 11.10) didn’t come with clang installed, and running “sudo apt-get install clang” does not install libclang (nor any Python bindings for it). I see some references online to a package called “libclang-dev”, but I can’t find any such package in the repositories.

    What commands should I run in order to install libclang? Will I have to run more commands to install the Python binding, and if so, what are they?

    Once I can figure out how to install libclang, I look forward to using the rest of the information in this blog post.

  14. eliben Says:

    Arthur,

    Download the latest available release of LLVM+Clang from http://llvm.org/releases/, and build it from source. It’s the usual configure followed by make (but read the docs for variations), after which you’ll have everything, including libclang.

  15. Bryan Says:

    I have the LLVM and CLANG 3.0 downloaded and working, meaning I can run your Python code above and get the results you have posted. What I am trying to do is to simply parse out the location of a Macro, and for that I am completely lost…

    If I want to get the extent (node.extent) of the existence of the macro _FOO(arg1, arg2) in some source files, what should I be using? Would you have any sample code that could show this? I can see that a Cursor Node (not sure that is the correct usage) node.kind never identifies Macros, and I think that makes sense given that it is a preprocessor function, but I have been unable to figure this out so far.

  16. eliben Says:

    Bryan,

    I really don’t have any samples beyond what was posted here, sorry. I suggest you ask in the Clang mailing list or on Stackoverflow. I’m sure someone there will be able to help you.

  17. Bryan Says:

    OK, Thanks. I have posted the question to Stack Overflow in case I get an answer and someone else comes here with the same problem.

    http://stackoverflow.com/questions/10113586/how-can-i-parse-macros-in-c-code-using-clang-as-the-parser-and-python-as-the

  18. Sebastian Says:

    Thanks for writing this article. It helped a lot to get started with libclang.

    I wrote a small script to dump the whole AST of a translation unit, in order to better understand it by browsing the AST of some sample files. Maybe it’s of some use to someone else. Here it is:
    https://gist.github.com/2503232

    Sebastian

  19. Oeufcoque Says:

    Sebastian,

    Your code is exactly what I am looking for, but for some unknown reason I keep getting

    Python quit unexpectedly while using libclang.dylib plug-in.

    I was able to run the sample code on this page, though.

    Any clues on this?

    Thanks.

  20. Oeufcoque Says:

    Hi,

    I found the problem; you can see the discussion and solution in the cfe-dev archive through this link.

    http://clang-developers.42468.n3.nabble.com/libclang-Is-there-any-method-like-get-children-python-cindex-available-for-libclang-or-alternative-t-td4024637.html

    Thank you for providing both this post and that code; they were all I needed to make sense of how to use everything in so little time :)

  21. Chen Says:

    Thank you. This article is really helpful for someone like me, who doesn’t have much exposure to compiler technologies and is kind of intimidated by them. But this helps a lot. Thank you!

  22. Henrique Says:

    For the initial example to work with a recent clang version, ref_node must be retrieved by using get_definition, i.e.
    ref_node = node.get_definition()

  23. Henrique Says:

    Similarly, get_definition is also used in the second example. Also, the visitors call doesn’t exist, so a get_children recursion must be done.

  24. LiMar Says:

    Great intro!
    It works in my case but I cannot understand how to get the size in bytes of the type.

    The type.kind.name is TypeKind.INT for either ‘int’ or ‘uint8_t’ variables!
    Need help …

  25. voodoo Says:

    def callexpr_visitor(node):
        if node.kind == clang.cindex.CursorKind.CALL_EXPR:
            print 'Found %s [line=%s, col=%s]' % (
                    node.displayname, node.location.line, node.location.column)
    
    def visit(node, func):
        func(node)
        for c in node.get_children():
            visit(c, func)
    
    visit(tu.cursor, callexpr_visitor)

  26. Shining Says:

    In the Python sample, the code
    “ref_node = clang.cindex.Cursor_ref(node)”
    needs to be updated to match the Clang code on the master branch:
    “ref_node = clang.cindex.Cursor.referenced.__get__(node)”.
    This is because the Cursor_ref interface has been deleted.

  27. Gernot Klingler Says:

    Thx for this intro and your samples! Found it very useful!
