Note (01.05.2014): take a look at an updated version of this post that uses libTooling to achieve the same goal.

Note (25.12.2013): this code doesn't work with the newest Clang. For up-to-date code, check out my llvm-clang-samples repository.

Source-to-source transformation of C/C++ code is known to be a hard problem. And yet, with the recent maturity of Clang as a powerful and library-friendly C++ compiler, I think there may finally be some light at the end of the tunnel.

This post serves as a demonstration of basic source-to-source transformations with Clang. Specifically, it builds a simple program that links to Clang's libraries (statically) and directly operates on Clang's C++ API to achieve its goals. The C++ API of Clang is a moving target, so there's a good chance this code will require modifications with future versions of Clang. So far I have verified that it works with release 3.1 and today's trunk.

The transformation done here is trivial and not really interesting in itself - the program just adds comments in a few places (before and after function definitions, and inside if statements). The main goal here is to show how to set up the whole Clang machinery to enable this, and how to build the thing so it compiles and links correctly.

The code

This is rewritersample.cpp:

//-------------------------------------------------------------------------
//
// rewritersample.cpp: Source-to-source transformation sample with Clang,
// using Rewriter - the code rewriting interface.
//
// Eli Bendersky (eliben@gmail.com)
// This code is in the public domain
//
#include <cstdio>
#include <string>
#include <sstream>

#include "clang/AST/ASTConsumer.h"
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Basic/Diagnostic.h"
#include "clang/Basic/FileManager.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Basic/TargetOptions.h"
#include "clang/Basic/TargetInfo.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Lex/Preprocessor.h"
#include "clang/Parse/ParseAST.h"
#include "clang/Rewrite/Rewriter.h"
#include "clang/Rewrite/Rewriters.h"
#include "llvm/Support/Host.h"
#include "llvm/Support/raw_ostream.h"

using namespace clang;
using namespace std;


// By implementing RecursiveASTVisitor, we can specify which AST nodes
// we're interested in by overriding relevant methods.
class MyASTVisitor : public RecursiveASTVisitor<MyASTVisitor>
{
public:
    MyASTVisitor(Rewriter &R)
        : TheRewriter(R)
    {}

    bool VisitStmt(Stmt *s) {
        // Only care about If statements.
        if (isa<IfStmt>(s)) {
            IfStmt *IfStatement = cast<IfStmt>(s);
            Stmt *Then = IfStatement->getThen();

            TheRewriter.InsertText(Then->getLocStart(),
                                   "// the 'if' part\n",
                                   true, true);

            Stmt *Else = IfStatement->getElse();
            if (Else)
                TheRewriter.InsertText(Else->getLocStart(),
                                       "// the 'else' part\n",
                                       true, true);
        }

        return true;
    }

    bool VisitFunctionDecl(FunctionDecl *f) {
        // Only function definitions (with bodies), not declarations.
        if (f->hasBody()) {
            Stmt *FuncBody = f->getBody();

            // Type name as string
            QualType QT = f->getResultType();
            string TypeStr = QT.getAsString();

            // Function name
            DeclarationName DeclName = f->getNameInfo().getName();
            string FuncName = DeclName.getAsString();

            // Add comment before
            stringstream SSBefore;
            SSBefore << "// Begin function " << FuncName << " returning "
                     << TypeStr << "\n";
            SourceLocation ST = f->getSourceRange().getBegin();
            TheRewriter.InsertText(ST, SSBefore.str(), true, true);

            // And after
            stringstream SSAfter;
            SSAfter << "\n// End function " << FuncName << "\n";
            ST = FuncBody->getLocEnd().getLocWithOffset(1);
            TheRewriter.InsertText(ST, SSAfter.str(), true, true);
        }

        return true;
    }

private:
    Rewriter &TheRewriter;
};


// Implementation of the ASTConsumer interface for reading an AST produced
// by the Clang parser.
class MyASTConsumer : public ASTConsumer
{
public:
    MyASTConsumer(Rewriter &R)
        : Visitor(R)
    {}

    // Override the method that gets called for each parsed top-level
    // declaration.
    virtual bool HandleTopLevelDecl(DeclGroupRef DR) {
        for (DeclGroupRef::iterator b = DR.begin(), e = DR.end();
             b != e; ++b)
            // Traverse the declaration using our AST visitor.
            Visitor.TraverseDecl(*b);
        return true;
    }

private:
    MyASTVisitor Visitor;
};


int main(int argc, char *argv[])
{
    if (argc != 2) {
        llvm::errs() << "Usage: rewritersample <filename>\n";
        return 1;
    }

    // CompilerInstance will hold the instance of the Clang compiler for us,
    // managing the various objects needed to run the compiler.
    CompilerInstance TheCompInst;
    TheCompInst.createDiagnostics(0, 0);

    // Initialize target info with the default triple for our platform.
    TargetOptions TO;
    TO.Triple = llvm::sys::getDefaultTargetTriple();
    TargetInfo *TI = TargetInfo::CreateTargetInfo(
        TheCompInst.getDiagnostics(), TO);
    TheCompInst.setTarget(TI);

    TheCompInst.createFileManager();
    FileManager &FileMgr = TheCompInst.getFileManager();
    TheCompInst.createSourceManager(FileMgr);
    SourceManager &SourceMgr = TheCompInst.getSourceManager();
    TheCompInst.createPreprocessor();
    TheCompInst.createASTContext();

    // A Rewriter helps us manage the code rewriting task.
    Rewriter TheRewriter;
    TheRewriter.setSourceMgr(SourceMgr, TheCompInst.getLangOpts());

    // Set the main file handled by the source manager to the input file.
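    const FileEntry *FileIn = FileMgr.getFile(argv[1]);
    if (!FileIn) {
        // getFile returns NULL when the file cannot be found.
        llvm::errs() << "Cannot open file: " << argv[1] << "\n";
        return 1;
    }
    SourceMgr.createMainFileID(FileIn);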
    TheCompInst.getDiagnosticClient().BeginSourceFile(
        TheCompInst.getLangOpts(),
        &TheCompInst.getPreprocessor());

    // Create an AST consumer instance which is going to get called by
    // ParseAST.
    MyASTConsumer TheConsumer(TheRewriter);

    // Parse the file to AST, registering our consumer as the AST consumer.
    ParseAST(TheCompInst.getPreprocessor(), &TheConsumer,
             TheCompInst.getASTContext());

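    // At this point the rewriter's buffer should be full with the rewritten
    // file contents. getRewriteBufferFor returns NULL if nothing in the file
    // was modified, so guard against that before printing.
    const RewriteBuffer *RewriteBuf =
        TheRewriter.getRewriteBufferFor(SourceMgr.getMainFileID());
    if (RewriteBuf)
        llvm::outs() << string(RewriteBuf->begin(), RewriteBuf->end());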

    return 0;
}

The makefile

CXX = g++
CFLAGS = -fno-rtti

LLVM_SRC_PATH = ___PATH-TO-LLVM-SOURCE-DIR___
LLVM_BUILD_PATH = ___PATH-TO-LLVM-BUILD-DIR___

LLVM_BIN_PATH = $(LLVM_BUILD_PATH)/Debug+Asserts/bin
LLVM_LIBS=core mc
LLVM_CONFIG_COMMAND = $(LLVM_BIN_PATH)/llvm-config --cxxflags --ldflags \
                                        --libs $(LLVM_LIBS)
CLANG_BUILD_FLAGS = -I$(LLVM_SRC_PATH)/tools/clang/include \
                                      -I$(LLVM_BUILD_PATH)/tools/clang/include

CLANGLIBS = \
  -lclangFrontendTool -lclangFrontend -lclangDriver \
  -lclangSerialization -lclangCodeGen -lclangParse \
  -lclangSema -lclangStaticAnalyzerFrontend \
  -lclangStaticAnalyzerCheckers -lclangStaticAnalyzerCore \
  -lclangAnalysis -lclangARCMigrate -lclangRewrite \
  -lclangEdit -lclangAST -lclangLex -lclangBasic

all: rewritersample

rewritersample: rewritersample.cpp
      $(CXX) rewritersample.cpp $(CFLAGS) -o rewritersample \
              $(CLANG_BUILD_FLAGS) $(CLANGLIBS) `$(LLVM_CONFIG_COMMAND)`

clean:
      rm -rf *.o *.ll rewritersample

First, let's discuss the makefile and what's important to look for.

You must replace the ___PATH-TO-...___ placeholders with the correct paths. LLVM_SRC_PATH is where the LLVM source root lives. LLVM_BUILD_PATH is where it was built. Note that this implies a source checkout and a build with configure. If you use a CMake build, or build against binaries, you may have to fiddle with the paths a bit (including LLVM_BIN_PATH).

llvm-config does a great job of figuring out the compile and link flags needed for LLVM and Clang. However, it currently only handles LLVM libs, and Clang libs have to be specified explicitly. The problem with this is that linkers, being sensitive to the order of libraries, are fickle, and it's easy to get link errors if the libs are not specified in the correct order. A good place to see the up-to-date library list for Clang is tools/driver/Makefile - the makefile for the main Clang driver.

Note also that the include dirs have to be specified explicitly for Clang. This is important - if you have some version of Clang installed and these are not specified explicitly, you may get nasty linking errors (complaining about things like classof).

What the code does - general

Now, back to the source code. Our goal is to set up the Clang libraries to parse some source code into an AST, and then let us somehow traverse the AST and modify the source code.

A major challenge in writing a tool using Clang as a library is setting everything up. The Clang frontend is a complex beast and consists of many parts. For the sake of modularity and testability, these parts are decoupled and hence take some work to set up. Fortunately, the Clang developers have provided a convenience class named CompilerInstance that helps with this task by collecting together everything needed to set up a functional Clang-based frontend. The bulk of the main function in my sample deals with setting up a CompilerInstance.

The key call in main is to ParseAST. This function parses the input into an AST, and passes this AST to an implementation of the ASTConsumer interface, which represents some entity consuming the AST and acting upon it.

ASTConsumer

My implementation of ASTConsumer is MyASTConsumer. It's a very simple class that only implements one method of the interface - HandleTopLevelDecl. This gets called by Clang whenever a top-level declaration (function definitions count as well) is completed.
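
To make the callback flow concrete without pulling in Clang, here is a minimal standalone sketch of the same shape: a driver walks a list of top-level declarations and hands each one to a consumer. All names in it (ToyDecl, ToyConsumer, NameCollector, toyParse) are hypothetical stand-ins invented for this illustration, not Clang types, and the "parser" is just a loop over a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a top-level declaration (not a Clang type).
struct ToyDecl {
    std::string Name;
    bool HasBody;  // true for a definition, false for a plain declaration
};

// Hypothetical consumer interface, playing the role ASTConsumer plays above.
class ToyConsumer {
public:
    virtual ~ToyConsumer() {}
    // Called once per completed top-level declaration.
    // Returning false asks the driver to stop early.
    virtual bool HandleTopLevelDecl(const ToyDecl &D) = 0;
};

// A consumer that records the names of definitions only.
class NameCollector : public ToyConsumer {
public:
    bool HandleTopLevelDecl(const ToyDecl &D) override {
        assert(!D.Name.empty() && "toy declarations are expected to be named");
        if (D.HasBody)
            Definitions.push_back(D.Name);
        return true;
    }

    std::vector<std::string> Definitions;
};

// Hypothetical stand-in for the parsing driver: delivers each declaration
// to the consumer and returns how many were delivered.
std::size_t toyParse(const std::vector<ToyDecl> &Decls, ToyConsumer &C) {
    std::size_t Delivered = 0;
    for (const ToyDecl &D : Decls) {
        ++Delivered;
        if (!C.HandleTopLevelDecl(D))
            break;
    }
    return Delivered;
}
```

Feeding toyParse the declarations foo (with a body), bar (without) and baz (with a body) delivers all three to the consumer and leaves foo and baz in Definitions. The consumer never sees the whole input at once, only one completed top-level declaration per call.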

RecursiveASTVisitor

The main work-horse of AST traversal is MyASTVisitor, an implementation of RecursiveASTVisitor. This is the classical visitor pattern, with a method per interesting AST node. My code defines only a couple of visitor methods - to handle statements and function declarations. Note how the class itself is defined - this is a nice example of the curiously recurring template pattern (and actually the one I used in my earlier article on CRTP).
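
For readers who haven't met CRTP before, the following standalone sketch shows the mechanism on a toy tree. It is only an illustration under my own assumptions: ToyNode, ToyRecursiveVisitor and IfCounter are hypothetical names, and the traversal is much simpler than what RecursiveASTVisitor really does. The point is that the base class template owns the recursion, while the hook it calls is resolved at compile time through a static_cast to the derived class, with no virtual functions involved.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy node; Clang's real AST nodes are far richer.
struct ToyNode {
    enum Kind { If, Call };
    Kind K;
    std::vector<const ToyNode *> Children;  // non-owning child pointers
};

// CRTP base: implements the recursive walk and calls Derived::Visit
// for every node. Returning false from Visit aborts the traversal.
template <typename Derived>
class ToyRecursiveVisitor {
public:
    bool Traverse(const ToyNode &N) {
        if (!getDerived().Visit(N))
            return false;
        for (const ToyNode *Child : N.Children) {
            assert(Child && "null child in toy tree");
            if (!Traverse(*Child))
                return false;
        }
        return true;
    }

    // Default hook, used when Derived does not define its own Visit.
    bool Visit(const ToyNode &) { return true; }

private:
    Derived &getDerived() { return *static_cast<Derived *>(this); }
};

// A visitor that only cares about If nodes, in the spirit of VisitStmt above.
class IfCounter : public ToyRecursiveVisitor<IfCounter> {
public:
    bool Visit(const ToyNode &N) {
        if (N.K == ToyNode::If)
            ++NumIfs;
        return true;
    }

    int NumIfs = 0;
};
```

Traversing a root If node whose children are one If and one Call leaves NumIfs at 2. Because IfCounter::Visit hides the default in the base class and is reached through getDerived(), the compiler can inline the hook into the traversal loop.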

Rewriter

The Rewriter is a key component in the source-to-source transformation scheme implemented by this code. Instead of handling every possible AST node to spit back code from the AST, the approach taken here is to surgically change the original code at key places to perform the transformation. The Rewriter class is crucial for this. It's a sophisticated buffer manager that uses a rope data structure to enable efficient slicing-and-dicing of the source. Coupled with Clang's excellent preservation of source locations for all AST nodes, Rewriter makes it possible to remove and insert code very accurately. Read its source code for more insights.
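
The core idea, that every edit is expressed against a position in the original text no matter what has been inserted before it, can be sketched in a few lines of standalone C++. This is a deliberately naive illustration and not how Clang implements it: the hypothetical ToyRewriter below uses a std::map keyed by original offset instead of a rope, supports only insertion, and works on plain character offsets rather than SourceLocations.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical, much simplified stand-in for a rewrite buffer.
class ToyRewriter {
public:
    explicit ToyRewriter(std::string Source) : Original(std::move(Source)) {}

    // Queue Text for insertion before the character at Offset, where Offset
    // always refers to the ORIGINAL source. Several insertions at the same
    // offset are concatenated in call order.
    void InsertText(std::size_t Offset, const std::string &Text) {
        assert(Offset <= Original.size() && "offset outside the source");
        Inserts[Offset] += Text;
    }

    // Produce the rewritten text by interleaving the original source
    // with the queued insertions, in increasing offset order.
    std::string str() const {
        std::string Out;
        std::size_t Pos = 0;
        for (const auto &KV : Inserts) {
            Out.append(Original, Pos, KV.first - Pos);  // untouched slice
            Out += KV.second;                           // inserted text
            Pos = KV.first;
        }
        Out.append(Original, Pos, std::string::npos);   // remaining tail
        return Out;
    }

private:
    std::string Original;
    std::map<std::size_t, std::string> Inserts;
};
```

With the source "int f() { return 0; }", inserting "\n// end" at the end offset first and "// begin\n" at offset 0 afterwards still yields "// begin\nint f() { return 0; }\n// end". The order of the calls doesn't matter because neither offset is shifted by the other insertion, which is the property the visitor code above relies on when it inserts comments while walking the AST.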

Other resources

Many thanks to the maintainers of the Clang-tutorial repository - my code is based on one of the examples taken from there.

Another source of information is the "tooling" library that's starting to emerge in Clang (include/clang/Tooling). It's being developed by members of the Clang community that are writing in-house refactoring and code-transformation tools based on Clang as a library, so it's a relevant source.

Finally, due to the scarcity of Clang's external documentation, the best source of information remains the code itself. While at first somewhat formidable, Clang's code is actually very well organized and is readable enough.