Basic source-to-source transformation with Clang

June 8th, 2012 at 7:38 am

Note (25.12.2013): this code doesn’t work with the newest Clang. For up-to-date code, check out my llvm-clang-samples repository.

Source-to-source transformation of C/C++ code is known to be a hard problem. And yet, with the recent maturity of Clang as a powerful and library-friendly C++ compiler, I think there may finally be some light at the end of the tunnel.

This post serves as a demonstration of basic source-to-source transformations with Clang. Specifically, it builds a simple program that links to Clang’s libraries (statically) and directly operates on Clang’s C++ API to achieve its goals. The C++ API of Clang is a moving target, so there’s a good chance this code will require modifications with future versions of Clang. At the time of writing, I have verified that it works with release 3.1 and today’s trunk.

The transformation performed here is trivial and not really interesting – the program just adds comments in a few places (before and after function definitions, and inside if statements). The main goal here is to show how to set up the whole Clang machinery to enable this, and how to build the thing so it compiles and links correctly.

The code

This is rewritersample.cpp:

//-------------------------------------------------------------------------
//
// rewritersample.cpp: Source-to-source transformation sample with Clang,
// using Rewriter - the code rewriting interface.
//
// Eli Bendersky (eliben@gmail.com)
// This code is in the public domain
//
#include <cstdio>
#include <string>
#include <sstream>

#include "clang/AST/ASTConsumer.h"
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Basic/Diagnostic.h"
#include "clang/Basic/FileManager.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Basic/TargetOptions.h"
#include "clang/Basic/TargetInfo.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Lex/Preprocessor.h"
#include "clang/Parse/ParseAST.h"
#include "clang/Rewrite/Rewriter.h"
#include "clang/Rewrite/Rewriters.h"
#include "llvm/Support/Host.h"
#include "llvm/Support/raw_ostream.h"

using namespace clang;
using namespace std;


// By implementing RecursiveASTVisitor, we can specify which AST nodes
// we're interested in by overriding relevant methods.
class MyASTVisitor : public RecursiveASTVisitor<MyASTVisitor>
{
public:
    MyASTVisitor(Rewriter &R)
        : TheRewriter(R)
    {}

    bool VisitStmt(Stmt *s) {
        // Only care about If statements.
        if (isa<IfStmt>(s)) {
            IfStmt *IfStatement = cast<IfStmt>(s);
            Stmt *Then = IfStatement->getThen();

            TheRewriter.InsertText(Then->getLocStart(),
                                   "// the 'if' part\n",
                                   true, true);

            Stmt *Else = IfStatement->getElse();
            if (Else)
                TheRewriter.InsertText(Else->getLocStart(),
                                       "// the 'else' part\n",
                                       true, true);
        }

        return true;
    }

    bool VisitFunctionDecl(FunctionDecl *f) {
        // Only function definitions (with bodies), not declarations.
        if (f->hasBody()) {
            Stmt *FuncBody = f->getBody();

            // Type name as string
            QualType QT = f->getResultType();
            string TypeStr = QT.getAsString();

            // Function name
            DeclarationName DeclName = f->getNameInfo().getName();
            string FuncName = DeclName.getAsString();

            // Add comment before
            stringstream SSBefore;
            SSBefore << "// Begin function " << FuncName << " returning "
                     << TypeStr << "\n";
            SourceLocation ST = f->getSourceRange().getBegin();
            TheRewriter.InsertText(ST, SSBefore.str(), true, true);

            // And after
            stringstream SSAfter;
            SSAfter << "\n// End function " << FuncName << "\n";
            ST = FuncBody->getLocEnd().getLocWithOffset(1);
            TheRewriter.InsertText(ST, SSAfter.str(), true, true);
        }

        return true;
    }

private:
    // Leftover declaration; not used in this sample.
    void AddBraces(Stmt *s);

    Rewriter &TheRewriter;
};


// Implementation of the ASTConsumer interface for reading an AST produced
// by the Clang parser.
class MyASTConsumer : public ASTConsumer
{
public:
    MyASTConsumer(Rewriter &R)
        : Visitor(R)
    {}

    // Override the method that gets called for each parsed top-level
    // declaration.
    virtual bool HandleTopLevelDecl(DeclGroupRef DR) {
        for (DeclGroupRef::iterator b = DR.begin(), e = DR.end();
             b != e; ++b)
            // Traverse the declaration using our AST visitor.
            Visitor.TraverseDecl(*b);
        return true;
    }

private:
    MyASTVisitor Visitor;
};


int main(int argc, char *argv[])
{
    if (argc != 2) {
        llvm::errs() << "Usage: rewritersample <filename>\n";
        return 1;
    }

    // CompilerInstance will hold the instance of the Clang compiler for us,
    // managing the various objects needed to run the compiler.
    CompilerInstance TheCompInst;
    TheCompInst.createDiagnostics(0, 0);

    // Initialize target info with the default triple for our platform.
    TargetOptions TO;
    TO.Triple = llvm::sys::getDefaultTargetTriple();
    TargetInfo *TI = TargetInfo::CreateTargetInfo(
        TheCompInst.getDiagnostics(), TO);
    TheCompInst.setTarget(TI);

    TheCompInst.createFileManager();
    FileManager &FileMgr = TheCompInst.getFileManager();
    TheCompInst.createSourceManager(FileMgr);
    SourceManager &SourceMgr = TheCompInst.getSourceManager();
    TheCompInst.createPreprocessor();
    TheCompInst.createASTContext();

    // A Rewriter helps us manage the code rewriting task.
    Rewriter TheRewriter;
    TheRewriter.setSourceMgr(SourceMgr, TheCompInst.getLangOpts());

    // Set the main file handled by the source manager to the input file.
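    const FileEntry *FileIn = FileMgr.getFile(argv[1]);
    if (!FileIn) {
        // getFile returns NULL if the file cannot be found.
        llvm::errs() << "Cannot open " << argv[1] << "\n";
        return 1;
    }
    SourceMgr.createMainFileID(FileIn);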
    TheCompInst.getDiagnosticClient().BeginSourceFile(
        TheCompInst.getLangOpts(),
        &TheCompInst.getPreprocessor());

    // Create an AST consumer instance which is going to get called by
    // ParseAST.
    MyASTConsumer TheConsumer(TheRewriter);

    // Parse the file to AST, registering our consumer as the AST consumer.
    ParseAST(TheCompInst.getPreprocessor(), &TheConsumer,
             TheCompInst.getASTContext());

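    // At this point the rewriter's buffer should be full with the rewritten
    // file contents. getRewriteBufferFor returns NULL if nothing in the file
    // was modified, so check before dereferencing.
    const RewriteBuffer *RewriteBuf =
        TheRewriter.getRewriteBufferFor(SourceMgr.getMainFileID());
    if (RewriteBuf)
        llvm::outs() << string(RewriteBuf->begin(), RewriteBuf->end());
    else
        llvm::errs() << "No changes were made to " << argv[1] << "\n";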

    return 0;
}

The makefile

CXX = g++
CFLAGS = -fno-rtti

LLVM_SRC_PATH = ___PATH-TO-LLVM-SOURCE-DIR___
LLVM_BUILD_PATH = ___PATH-TO-LLVM-BUILD-DIR___

LLVM_BIN_PATH = $(LLVM_BUILD_PATH)/Debug+Asserts/bin
LLVM_LIBS=core mc
LLVM_CONFIG_COMMAND = $(LLVM_BIN_PATH)/llvm-config --cxxflags --ldflags \
                                        --libs $(LLVM_LIBS)
CLANG_BUILD_FLAGS = -I$(LLVM_SRC_PATH)/tools/clang/include \
                                      -I$(LLVM_BUILD_PATH)/tools/clang/include

CLANGLIBS = \
  -lclangFrontendTool -lclangFrontend -lclangDriver \
  -lclangSerialization -lclangCodeGen -lclangParse \
  -lclangSema -lclangStaticAnalyzerFrontend \
  -lclangStaticAnalyzerCheckers -lclangStaticAnalyzerCore \
  -lclangAnalysis -lclangARCMigrate -lclangRewrite \
  -lclangEdit -lclangAST -lclangLex -lclangBasic

all: rewritersample

rewritersample: rewritersample.cpp
      $(CXX) rewritersample.cpp $(CFLAGS) -o rewritersample \
              $(CLANG_BUILD_FLAGS) $(CLANGLIBS) `$(LLVM_CONFIG_COMMAND)`

clean:
      rm -rf *.o *.ll rewritersample

First, let’s discuss the makefile and what’s important to look for.

You must replace the ___PATH-TO-LLVM-SOURCE-DIR___ and ___PATH-TO-LLVM-BUILD-DIR___ placeholders with the correct paths. The SRC path is where the LLVM source root lives. The BUILD path is where it was built. Note that this implies a source checkout and a build with configure. If you use a CMake build, or build against binaries, you may have to fiddle with the paths a bit (including LLVM_BIN_PATH).

llvm-config does a great job of figuring out the compile and link flags needed for LLVM and Clang. However, it currently only handles LLVM libs, and Clang libs have to be specified explicitly. The problem with this is that linkers, being sensitive to the order of libraries, are fickle, and it’s easy to get link errors if the libs are not specified in the correct order. A good place to see the up-to-date library list for Clang is tools/driver/Makefile – the makefile for the main Clang driver.

Note also that the include dirs have to be specified explicitly for Clang. This is important – if you have some version of Clang installed and these are not specified explicitly, you may get nasty linking errors (complaining about things like classof).

What the code does – general

Now, back to the source code. Our goal is to set up the Clang libraries to parse some source code into an AST, and then let us somehow traverse the AST and modify the source code.

A major challenge in writing a tool using Clang as a library is setting everything up. The Clang frontend is a complex beast and consists of many parts. For the sake of modularity and testability, these parts are decoupled and hence take some work to set up. Fortunately, the Clang developers have provided a convenience class named CompilerInstance that helps with this task by collecting together everything needed to set up a functional Clang-based frontend. The bulk of the main function in my sample deals with setting up a CompilerInstance.

The key call in main is to ParseAST. This function parses the input into an AST, and passes this AST to an implementation of the ASTConsumer interface, which represents some entity consuming the AST and acting upon it.

ASTConsumer

My implementation of ASTConsumer is MyASTConsumer. It’s a very simple class that only implements one method of the interface – HandleTopLevelDecl. This gets called by Clang whenever a top-level declaration (which includes function definitions) is completed.
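The callback shape described here can be sketched without Clang at all. The following is only a toy under loud assumptions: ToyConsumer, ToyParse and CollectingConsumer are made-up names, and the "parser" merely splits the input on semicolons. What it does show is the division of labor: the driver owns the parsing loop and hands each completed top-level chunk to the consumer as soon as it is available.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for clang::ASTConsumer.
class ToyConsumer {
public:
    virtual ~ToyConsumer() {}
    // Called once per completed top-level declaration.
    // Returning false asks the driver to stop early.
    virtual bool HandleTopLevelDecl(const std::string &Decl) = 0;
};

// Hypothetical stand-in for ParseAST. Every ';'-terminated chunk counts as
// one "top-level declaration"; trailing text without ';' is ignored.
// Returns the number of declarations handed to the consumer.
std::size_t ToyParse(const std::string &Source, ToyConsumer &Consumer) {
    std::size_t Delivered = 0;
    std::size_t Start = 0;
    while (Start < Source.size()) {
        std::size_t End = Source.find(';', Start);
        if (End == std::string::npos)
            break;
        // Skip whitespace between declarations.
        std::size_t First = Source.find_first_not_of(" \t\n", Start);
        if (First < End) {
            ++Delivered;
            if (!Consumer.HandleTopLevelDecl(
                    Source.substr(First, End - First + 1)))
                break;
        }
        Start = End + 1;
    }
    return Delivered;
}

// A consumer that just records what it was given, in order.
class CollectingConsumer : public ToyConsumer {
public:
    bool HandleTopLevelDecl(const std::string &Decl) override {
        Decls.push_back(Decl);
        return true;
    }
    std::vector<std::string> Decls;
};
```

Calling ToyParse on the text "int x;\nvoid f();" with a CollectingConsumer hands over two chunks, "int x;" and "void f();", one call each. MyASTConsumer plays the role of CollectingConsumer in the real sample, except that it forwards each declaration to the visitor instead of storing it.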

RecursiveASTVisitor

The main work-horse of AST traversal is MyASTVisitor, an implementation of RecursiveASTVisitor. This is the classical visitor pattern, with a method per interesting AST node. My code defines only a couple of visitor methods – to handle statements and function declarations. Note how the class itself is defined – this is a nice example of the curiously recurring template pattern (and actually the one I used in my earlier article on CRTP).
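To make the CRTP shape concrete without pulling in Clang, here is a minimal sketch of the same idea on a toy tree. Node, ToyRecursiveVisitor, IfCounter, MakeNode and MakeSampleTree are all invented for illustration. Clang’s real RecursiveASTVisitor has many more hooks and a per-node-type Visit method, so treat this only as an outline of how the base class dispatches to the derived class at compile time.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy tree node; a hypothetical stand-in for Clang's Stmt/Decl hierarchy.
struct Node {
    std::string Kind;  // e.g. "Function", "If", "Call"
    std::vector<std::unique_ptr<Node>> Children;
};

// CRTP base: the traversal lives here, and the per-node hook is resolved
// against Derived at compile time, with no virtual functions involved.
template <typename Derived>
class ToyRecursiveVisitor {
public:
    // Visits N, then its children depth-first.
    // Returns false as soon as a hook asks to stop.
    bool Traverse(const Node &N) {
        if (!derived().Visit(N))
            return false;
        for (const auto &Child : N.Children)
            if (!Traverse(*Child))
                return false;
        return true;
    }

    // Default hook: do nothing and keep going. Derived may shadow it.
    bool Visit(const Node &) { return true; }

private:
    Derived &derived() { return static_cast<Derived &>(*this); }
};

// Counts "If" nodes, much like MyASTVisitor only reacts to IfStmt.
class IfCounter : public ToyRecursiveVisitor<IfCounter> {
public:
    bool Visit(const Node &N) {
        if (N.Kind == "If")
            ++Count;
        return true;
    }
    int Count = 0;
};

std::unique_ptr<Node> MakeNode(const std::string &Kind) {
    std::unique_ptr<Node> N(new Node());
    N->Kind = Kind;
    return N;
}

// Builds: Function -> { If -> { Call }, If }
std::unique_ptr<Node> MakeSampleTree() {
    std::unique_ptr<Node> Func = MakeNode("Function");
    std::unique_ptr<Node> OuterIf = MakeNode("If");
    OuterIf->Children.push_back(MakeNode("Call"));
    Func->Children.push_back(std::move(OuterIf));
    Func->Children.push_back(MakeNode("If"));
    return Func;
}
```

Running an IfCounter over the tree from MakeSampleTree leaves Count at 2. The point of the pattern is that ToyRecursiveVisitor never needs to know about IfCounter ahead of time, yet the call to Visit is an ordinary statically bound call.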

Rewriter

The Rewriter is a key component in the source-to-source transformation scheme implemented by this code. Instead of handling every possible AST node to spit back code from the AST, the approach taken here is to surgically change the original code at key places to perform the transformation. The Rewriter class is crucial for this. It’s a sophisticated buffer manager that uses a rope data structure to enable efficient slicing-and-dicing of the source. Coupled with Clang’s excellent preservation of source locations for all AST nodes, Rewriter makes it possible to remove and insert code very accurately. Read its source code for more insights.
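The property that makes this surgical approach workable is that every edit is addressed in terms of the original text, so one insertion does not shift the position of the next. Below is a small sketch of just that property. It is not how Clang’s Rewriter is implemented: ToyRewriter is a made-up class that keeps a plain std::string plus a list of pending insertions instead of a rope, and it supports insertion only.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified stand-in for clang::Rewriter.
class ToyRewriter {
public:
    explicit ToyRewriter(std::string Original)
        : Source(std::move(Original)) {}

    // Records an insertion at Offset, measured in the ORIGINAL text.
    // Returns false (and records nothing) if Offset is out of range.
    bool InsertText(std::size_t Offset, const std::string &Text) {
        if (Offset > Source.size())
            return false;
        Edits.push_back(std::make_pair(Offset, Text));
        return true;
    }

    // Builds the rewritten text in a single pass over the original.
    // Insertions at the same offset keep the order they were requested in.
    std::string GetRewritten() const {
        std::vector<Edit> Sorted(Edits);
        std::stable_sort(Sorted.begin(), Sorted.end(),
                         [](const Edit &A, const Edit &B) {
                             return A.first < B.first;
                         });
        std::string Out;
        std::size_t Pos = 0;
        for (const Edit &E : Sorted) {
            Out.append(Source, Pos, E.first - Pos);  // untouched original
            Out += E.second;                         // inserted text
            Pos = E.first;
        }
        Out.append(Source, Pos, std::string::npos);  // remaining tail
        return Out;
    }

private:
    typedef std::pair<std::size_t, std::string> Edit;
    std::string Source;
    std::vector<Edit> Edits;
};
```

For instance, with the source text "int f() { return 0; }", inserting "\n// End function f\n" at the end offset and then "// Begin function f\n" at offset 0 yields the function wrapped in both comments, regardless of the order in which the two insertions were requested. The sample’s VisitFunctionDecl relies on the same behavior when it inserts the "Begin" and "End" comments using locations taken from the unmodified AST.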

Other resources

Many thanks to the maintainers of the Clang-tutorial repository – my code is based on one of the examples taken from there.

Another source of information is the "tooling" library that’s starting to emerge in Clang (include/clang/Tooling). It’s being developed by members of the Clang community who are writing in-house refactoring and code-transformation tools based on Clang as a library, so it’s a relevant source.

Finally, due to the scarcity of Clang’s external documentation, the best source of information remains the code itself. While at first somewhat formidable, Clang’s code is actually very well organized and is readable enough.

41 Responses to “Basic source-to-source transformation with Clang”

  1. Elazar Leibovich Says:

    Why not use uniq_ptr and save those deletes?

  2. eliben Says:

    Elazar,

    Do you mean the C++11 std::unique_ptr or something else? I really just wanted to minimize the dependencies here on anything external or any specific compiler version. I also wanted to keep it simple – it’s just a sample, not production-quality code (the very little error checking is probably a bigger problem than manual deletes).

  3. Nemanja Trifunovic Says:

    I don’t understand why you are creating all these objects with “new” in the first place. Is there something in the Clang library that prevents creating objects on the stack?

  4. Matthieu M. Says:

    Actually, why dynamic allocation at all ? There is little point allocation something dynamically just to delete it at the end of the function body, or are the instances so big there is fear of a stack overflow ?

  5. Nico Says:

    Hi Eli,

    have you looked at the tooling infrastructure that landed recently? It’s documented here http://clang.llvm.org/docs/Tooling.html , http://clang.llvm.org/docs/LibTooling.html . With that, your main function could probably be a bit shorter.

    Nico

  6. eliben Says:

    Nemanja, Matthieu,

    Oh, this is what happens when borrowing code from another tutorial without paying too much mind to the details. Lest I help propagate this folly, I changed the code to use stack allocation for CompilerInstance and MyASTConsumer :-) .

    Nico,

    Yes, I’m familiar (a bit) with the tooling infrastructure, and have mentioned it in the Other Resources section of the post.

  7. Vespa Says:

    I liked the example, but I need to transform the function name too.
    How do I do?

    For example:
    If the function is named:
    int AddPlaces()
    Transform to:
    int DeletePlaces()

    If anyone can help me. thanks

  8. Elazar Leibovich Says:

    Eli,

    I did mean C++11’s std::unique_ptr, and I think it’ll make the code simpler, shorter for sure. The good thing about c++11, is that it’s a “--std=c++xx” away.

  9. eliben Says:

    Vespa,

    It shouldn’t be hard: use the source location information attached to the function name in the definition to find the range it occupies, and then use the “replace” capabilities of Rewriter to replace text in that range with new text.

  10. Vespa Says:

    Eli,
    Thanks for your advices, I already manage to change the name of the functions. However, I would like the changes to be stored in the file. I have tried the function overwriteChangedFiles() with the rewriter object, but is not working. Any idea of what should I use?

    Thanks a lot for your help, my best regards
    Yadira

  11. Slav Says:

    Thank you very much!
    I tried to use Clang to transform C++ code to AS3 like that, but was sent to tooling/libclang/driver tutorials which wasn’t able to acomplish my task (and telling to that ones who advice to use tooling/libclang/driver):
    driver: enforces to build code into DLL
    libclang: outdated C interface with much less possibilities then any other approach
    tooling: does not allow to accomodate the behaviour
    Why I cannot use adviced approaches and must use that one, which was kindley sentenced by Eli:
    1. Apply custom FileManager, which would allow to search headers within different (not one) folder – like any compiler does
    2. Apply custom Preprocessor to try to keep comments
    3. Full control over the Clang not being dependent on narrow and outdated interfaces, don’t even know what I will need later…
    Eli, hope your example will be commited into Clang’s SVN and help other newbies like me.

  12. Unix Guy Says:

    This is a great example. A couple of things a) The code gives error when input is a C++ file and how can we add some debug information like:

    void show()
    {
    printf("Show");
    }

    int main()
    {
    printf("Test");
    if(1)
    printf("Test1");
    else
    {
    printf("Test2");
    }
    }

    to have an additional statement like :
    printf("%s %s %d\n", , __FILE__, __LINE__); where statement are all regular statement ex printf("Test");
    after Test, Test2 and Show print statements?

  13. Hamid Says:

    That makefile is harder to understand than the code, LOL! Anyway, great article, I will try to compile and run it.

  14. test2 Says:

    Very nice post. I just stumbled upon your blog and wanted to say that I’ve truly loved surfing around your blog posts. After all I’ll be subscribing in your feed and I’m hoping you write again soon!

  15. z Says:

    your makefile doesn’t work , so please clarify it

  16. eliben Says:

    z,

    Clang’s code is constantly changing. This should work with release 3.1

  17. ziad Says:

    Hello eli,

    I don’t have datatypes.h I have datatypes.h.ini, how can I fix this problem??

  18. ziad Says:

    other errors Cannot open include file: ‘clang/Basic/AttrList.inc’, I need help

  19. eliben Says:

    ziad,

    I may eventually get to updating this article for a more recent Clang, but a better investment of your time would be to ask in its mailing list, or on Stack Overflow. Good luck.

  20. ziad Says:

    I don’t need a more recent clang I need the clang that works for your code, please eli point me to the right link:
    http://llvm.org/svn/llvm-project/cfe/tags/RELEASE_31/ Does this one works.
    I see all the release all the releases has issues with Clang/Basic/Attrkinds.h it is pointing to clang/Basic/AttrList.inc that don’t exist

  21. Project Says:

    Great post! I like your blog and visit it a few times a month to see what’s new. You never disappoint.

  22. Carti Says:

    I had the same issues with data types as ziad did and tried the most recent clang but they persisted. I don’t know what to do. Should I do something different from this article only because I use another clang?

  23. eliben Says:

    Carti,

    Clang is constantly changing, that’s a good thing. But it also means problems like these happen. Is there anything in particular you need the newest trunk Clang for? Won’t 3.1 (the latest official release) do? Keeping up with trunk Clang is a difficult task, and would mean formal releases won’t work.

    I will consider updating this article once Clang 3.2 is out.

  24. jj Says:

    Hi Eli,
    Great post, thanks! I have a question.
    I tried it on some code and find that it correctly inserts the ‘if’ statement markers for statements within non-member functions, but does not do so within member functions. I am wondering if there is a way of ensuring that all ‘if’ statements are marked, be it within member functions or template functions.

  25. shushengyu Says:

    Thanks your work!
    I would like to know how can get the comment in the source file, so I can insert it to the correct place, and I
    only want to rewrite the main soure, that is, exclude the include files and reserve the #include directive .

  26. Peeter Joot Says:

    fyi. Your makefile has some html span range stuff in it.

  27. Ira D. Baxter Says:

    Transforming source code (not just C++) is known to be a hard problem.

    Semantic Designs’s DMS Software Reengineering Toolkit is a system for building analysis and transformation tools for arbitrary languages. See http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html DMS parses souce code, applies analyses and transformations to the ASTs using source-to-source transformations written using the surface language (e.g., C++) syntax, and then regenerates compilable code with comments retained.

    DMS has an option for C++, including C++11. DMS and its C++ front end have been used for a variety of commercial C++ transformation tasks over the last 7 years, especially applications involving massive re-architecting of C++ code. Such applications may see as much as 30% of the code automatically modified. See

    Akers, R., Baxter, I., Mehlich, M. , Ellis, B. , Luecke, K., Case Study: Re-engineering C++ Component Models Via Automatic Program Transformation, Information & Software Technology 49(3):275-291 2007. Available from publisher. More accessible, covering much of the same ground, is
    Akers, R., Baxter, I., Mehlich, M., Ellis, B., Luecke, K., Re-engineering C++ Component Models Via Automatic Program Transformation, Twelfth Working Conference on Reverse Engineering, IEEE, 2005
    http://www.semanticdesigns.com/Company/Publications/WCRE05.pdf This particular application was used to support the implementation of UAV avionics software successfully demonstrated at White Sands in 2006.

    I’m glad to see the Clang community now moving along this arc. I’m a little dismayed that we get so little mention in this space.

  28. eliben Says:

    Ira,

    I’m sure your product is impressive, but it’s hardly surprising that people get more excited about an open-source project that starts adding support for this.

  29. Ira D. Baxter Says:

    What made me respond to this blog was the phrase “there might be some light at the end of this tunnel”, as if Clang was the only potential solution out there. I understand it is a likely candidate open source solution, although OpenC++ and Puma were also valiant attempts, and I think Rose is considerably further along in performing significant C++ transformation tasks.

  30. Adam Says:

    When I tried your example in clang 3.2, it crashed at the end of main. I fixed it by putting the TargetOptions variable in an IntrusiveRefCntPtr (it was trying to delete it twice).

  31. Iuri Says:

    Hi Eli,

    Thank you for the tutorial, I just used it recently to take advantage of clang.
    I have one more question, if you happen to know: in the following statement

    llvm::outs() << string(RewriteBuf->begin(), RewriteBuf->end());

    I get the original source code (transformed) on my console, however, I also get the clang diagnostics (like warnings or errors) about the source code. Would you know how to separate these two outputs? I would want to get the transformed source code to store it somewhere and the diagnostics to somewhere else.

    Thank you.

  32. Rambo Says:

    This example is only for single C file, but if multiple files in one project and there are some relationship between files,the relationship maybe caused by “#include”,the example will be infeasible.How are you deal with it?anybody have a idea?

  33. Paul Gai Says:

    Here’s the makefile I adapted for LLVM 3.3; After compilation and execution, I got the following runtime

    *** glibc detected *** ./rewritersample: free(): invalid pointer: 0x00000000030a5238 ***
    ======= Backtrace: =========

    I guess somewhere in the code, a pointer was free’d twice?

    ===================== My Makefile =======================

    CXX = g++
    CFLAGS = -fno-rtti

    LLVM_SRC_PATH = /home/jiading/LLVM/llvm-3.3.src
    LLVM_BUILD_PATH = /home/jiading/LLVM/llvm-3.3.build

    LLVM_BIN_PATH = $(LLVM_BUILD_PATH)/Debug+Asserts/bin
    LLVM_LIBS=core mc
    CLANG_BUILD_FLAGS = -I$(LLVM_SRC_PATH)/tools/clang/include \
    -I$(LLVM_BUILD_PATH)/tools/clang/include

    CLANGLIBS = \
    -lclangFrontendTool -lclangFrontend -lclangDriver \
    -lclangSerialization -lclangCodeGen -lclangParse \
    -lclangSema -lclangStaticAnalyzerFrontend \
    -lclangStaticAnalyzerCheckers -lclangStaticAnalyzerCore \
    -lclangAnalysis -lclangARCMigrate -lclangRewriteFrontend -lclangRewriteCore \
    -lclangEdit -lclangAST -lclangLex -lclangBasic \
    -lLLVMBitReader -lLLVMMCParser -lLLVMSupport

    all: rewritersample

    rewritersample: rewritersample.cpp
    $(CXX) rewritersample.cpp $(CFLAGS) -o rewritersample \
    $(CLANG_BUILD_FLAGS) $(CLANGLIBS) \
    $(LLVM_BIN_PATH)/llvm-config --cxxflags --ldflags --libs $(LLVM_LIBS)

    clean:
    rm -rf *.o *.ll rewritersample

  34. Paul Gai Says:

    How was void AddBraces(Stmt *s); used in this tutorial?

  35. eliben Says:

    @Paul,

    It appears to be a leftover, not used in this sample.

  36. Paul Gai Says:

    Thanks a lot, Eli. You have quite a talent to explain difficult topics in an easy-to-understand manner

  37. Tomas Says:

    Same here the code has a bug:
    free(): invalid pointer: 0x08e32b00 ***

    I’m trying to debug it if someone has the solution that would save me some time :D

  38. Jones Says:

    It’s a very nice blog, but I need your help. When I do some source to source transformation on simple code, it works awesome. However, if the code contains some defines which are in include header file, it cannot recognize those defines. Is there anyway that this custom compiler can find those defines? Thanks.

  39. eliben Says:

    @Jones,

    Is the included file in the same directory? The Clang frontend (cc1) needs -I flags to tell it where to look for includes. The main driver has some default locations it looks at, but the frontend does not.

  40. vinson Says:

    HI, Eli. I recently want to implement a source-to-source compiler from OpenMP to CUDA. Is Clang suitable for it? In other words, dose Clang have any API for transforming from the AST nodes to the high level programming language?

  41. eliben Says:

    @vinson,

    Clang is definitely being used for non-trivial transformations. See tools like clang-modernize, for example. Also, read about the “tooling” library of Clang.
