Welcome


... to my place on the web - a personal website where I keep articles, freeware programs and code snippets I've written, and a weblog where I share my ideas and insights on programming, technology, books, and life in general.

To browse the website, use the site map on the right-hand side of the screen. Below, you will find the most recent posts.

AST matchers and Clang refactoring tools

July 29th, 2014 at 8:22 pm

Clang tooling has seen a lot of interest and development focus in the past few years. At last, we have a convenient, accurate, open-source and well-supported framework for programmatically analyzing and refactoring C++ code; I find this very exciting.

A great outcome of this rapid pace of development is that new APIs and tools spring up all the time. For example, some time ago the Clang tooling developers figured out that folks doing AST traversals have to write a lot of repetitive code to find interesting AST nodes, so they came up with a great new API called AST matchers, which I want to discuss here.

Visitors vs. matchers

Here’s a motivating example. Suppose we’re looking for pointer-typed variables used in comparisons inside if conditions. To make this more specific, let’s say we’re looking for cases where the pointer-typed variable is on the left-hand side of an equality comparison (==). To find such nodes with a recursive visitor, we’d have to write something like this:

bool VisitIfStmt(IfStmt *s) {
  if (const BinaryOperator *BinOP =
          llvm::dyn_cast<BinaryOperator>(s->getCond())) {
    if (BinOP->getOpcode() == BO_EQ) {
      const Expr *LHS = BinOP->getLHS();
      if (const ImplicitCastExpr *Cast =
              llvm::dyn_cast<ImplicitCastExpr>(LHS)) {
        LHS = Cast->getSubExpr();
      }

      if (const DeclRefExpr *DeclRef = llvm::dyn_cast<DeclRefExpr>(LHS)) {
        if (const VarDecl *Var =
                llvm::dyn_cast<VarDecl>(DeclRef->getDecl())) {
          if (Var->getType()->isPointerType()) {
            Var->dump();  // YAY found it!!
          }
        }
      }
    }
  }
  return true;
}

This is quite a bit of code, but nothing out of the ordinary if you’ve been working with Clang ASTs for a while. Perhaps it can be golfed down into a somewhat shorter form, but the main problem is that to write it, one has to go through quite a bit of documentation and header files to figure out which methods to call and what kinds of objects they return.

Here’s the equivalent AST matcher:

Finder.addMatcher(
    ifStmt(hasCondition(binaryOperator(
        hasOperatorName("=="),
        hasLHS(ignoringParenImpCasts(declRefExpr(
            to(varDecl(hasType(pointsTo(anything()))).bind("lhs")))))))),
    &HandlerForIf);

Some difference, right? The declarative nature of matcher definitions makes them very natural to read and to map to the actual problem. HandlerForIf is a MatchCallback object that has direct access to the nodes bound by the matcher:

struct IfStmtHandler : public MatchFinder::MatchCallback {
  virtual void run(const MatchFinder::MatchResult &Result) {
    const VarDecl *lhs = Result.Nodes.getNodeAs<VarDecl>("lhs");
    lhs->dump();   // YAY found it!!
  }
};

There’s actually quite a bit of documentation available about AST matchers on the official Clang website. For a complete example that can be built outside of the LLVM tree, I redid the tooling sample from the previous article, now with AST matchers (all available in the llvm-clang-samples repository).

Using clang-query to test matchers and explore the AST

An interesting new tool in Clang-land is clang-query. It’s an interactive evaluator for AST matchers that can be used both to test your matchers and to explore the AST programmatically. Let’s say we want to develop an AST matcher for pointer comparisons, similar to the one shown above. Here’s a sample input file we’re going to work on:

int foo(int* p, int v) {
  if (p == 0) {
    return v + 1;
  } else {
    return v - 1;
  }
}

Let’s fire up clang-query and see what it can do:

$ clang-query /tmp/iflhsptr.c --
clang-query> set output diag
clang-query> match functionDecl()

Match #1:

/tmp/iflhsptr.c:1:1: note: "root" binds here
int foo(int* p, int v) {
^~~~~~~~~~~~~~~~~~~~~~~~
1 match.

This is a basic smoke test to see how it matches the function declaration. The output mode set in the first command could also ask the tool to dump or print the AST, but for our purposes the diagnostic output is convenient.

Here’s how we can match deeper nodes and bind them:

clang-query> match ifStmt(hasCondition(binaryOperator(hasOperatorName("==")).bind("op")))

Match #1:

/tmp/iflhsptr.c:2:7: note: "op" binds here
  if (p == 0) {
      ^~~~~~
/tmp/iflhsptr.c:2:3: note: "root" binds here
  if (p == 0) {
  ^~~~~~~~~~~~~
1 match.

If we intend to provide our own bindings, the root binding can be turned off:

clang-query> set bind-root false

Let’s see multiple matches:

clang-query> match varDecl().bind("var")

Match #1:

/tmp/iflhsptr.c:1:9: note: "var" binds here
int foo(int* p, int v) {
        ^~~~~~

Match #2:

/tmp/iflhsptr.c:1:17: note: "var" binds here
int foo(int* p, int v) {
                ^~~~~
2 matches.

At this point I’ll stop, because long matchers don’t format conveniently in a blog post, but I’m sure you get the idea. It’s obvious how helpful this tool can be when developing matchers. It’s still new and has some rough edges, but it’s already quite useful.

Refactoring tools and replacements

With the growing usage of libTooling, it’s hardly surprising that its developers keep coming up with higher levels of abstraction that make writing new tools take less and less effort. The AST matchers framework presented above is one example. Another is RefactoringTool, a subclass of ClangTool that makes it possible to craft new tools with very little code. I’ll show an example soon, but first a word about replacements.

The tools I’ve demonstrated so far used a Rewriter to change the underlying source code in response to finding interesting things in the AST. This is a good approach, but it has a problem scaling to large projects. Imagine running a tool over a large project with many source files and many header files. Some of the rewritings may need to happen in the header files, but how do we manage that, given that the same headers get included into multiple translation units? Some edits may end up being duplicated or even conflicting, and that’s a problem.

Replacements are the solution. The source transformation task is divided into two distinct steps:

  1. Custom tools go through the source base, find the refactoring patterns to apply, and generate serialized replacements into files. Think of replacements as something like patch files (precise directions for how to modify a source file), but in a somewhat friendlier format.
  2. clang-apply-replacements can then run with access to all the replacements, perform the required de-duplication and conflict resolution, and actually apply the changes to the source.

This approach also allows nice parallelization of the refactoring over huge code-bases, though there aren’t many projects and companies in the world with source code large enough to make this a real problem.

Back to an example then. I took the simple sample tool from the previous article (just finding interesting if nodes and adding some comments into them) and rewrote it once again, using RefactoringTool and replacements. The full code is available in the samples project, but it’s so short that I can show most of it here.

Here’s the complete main function. For ease of hacking, it just dumps the replacements to stdout instead of serializing or applying them:

int main(int argc, const char **argv) {
  CommonOptionsParser op(argc, argv, ToolingSampleCategory);
  RefactoringTool Tool(op.getCompilations(), op.getSourcePathList());

  // Set up AST matcher callbacks.
  IfStmtHandler HandlerForIf(&Tool.getReplacements());

  MatchFinder Finder;
  Finder.addMatcher(ifStmt().bind("ifStmt"), &HandlerForIf);

  // Run the tool and collect a list of replacements. We could call
  // runAndSave, which would destructively overwrite the files with
  // their new contents. However, for demonstration purposes it's
  // interesting to show the replacements.
  if (int Result = Tool.run(newFrontendActionFactory(&Finder).get())) {
    return Result;
  }

  llvm::outs() << "Replacements collected by the tool:\n";
  for (auto &r : Tool.getReplacements()) {
    llvm::outs() << r.toString() << "\n";
  }

  return 0;
}

IfStmtHandler is just a MatchCallback that is triggered on if statements:

class IfStmtHandler : public MatchFinder::MatchCallback {
public:
  IfStmtHandler(Replacements *Replace) : Replace(Replace) {}

  virtual void run(const MatchFinder::MatchResult &Result) {
    // The matched 'if' statement was bound to 'ifStmt'.
    if (const IfStmt *IfS =
          Result.Nodes.getNodeAs<clang::IfStmt>("ifStmt")) {
      const Stmt *Then = IfS->getThen();
      Replacement Rep(*(Result.SourceManager), Then->getLocStart(), 0,
                      "// the 'if' part\n");
      Replace->insert(Rep);

      if (const Stmt *Else = IfS->getElse()) {
        Replacement Rep(*(Result.SourceManager), Else->getLocStart(), 0,
                        "// the 'else' part\n");
        Replace->insert(Rep);
      }
    }
  }

private:
  Replacements *Replace;
};

Note how little boilerplate this code contains. The tool is set up in just a handful of lines of code, and most of my code deals with the actual refactoring at hand. This definitely makes writing tools quicker and easier than ever before.

Payload server in Python 3 for Github webhooks

July 9th, 2014 at 5:50 am

The Github Webhooks API is powerful and flexible, making it simple to integrate services with your source repository. Lately I’ve been tinkering with it a bit, but all the examples Github has are in Ruby, so I put together a simple demo server in Python 3. Though simple (it’s completely self-contained and only needs Python 3 to run), it’s complete, covering even webhook security by verifying the signature created with the API’s secret token.

Here it is:

import argparse
import hashlib
import hmac
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
import pprint
import os

# It's not recommended to store the key within the code. Following
# http://12factor.net/config, we'll store this in the environment.
# Note that the key must be a bytes object.
HOOK_SECRET_KEY = os.environb[b'HOOK_SECRET_KEY']


class GithubHookHandler(BaseHTTPRequestHandler):
    """Base class for webhook handlers.

    Subclass it and implement 'handle_payload'.
    """
    def _validate_signature(self, data):
        # Reject requests that carry no signature header at all.
        signature_header = self.headers.get('X-Hub-Signature')
        if signature_header is None:
            return False
        sha_name, signature = signature_header.split('=')
        if sha_name != 'sha1':
            return False

        # Compute HMAC-SHA1 over the raw request body; both the key
        # and the message must be bytes objects.
        mac = hmac.new(HOOK_SECRET_KEY, msg=data, digestmod=hashlib.sha1)
        return hmac.compare_digest(mac.hexdigest(), signature)

    def do_POST(self):
        data_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(data_length)

        if not self._validate_signature(post_data):
            self.send_response(401)
            self.end_headers()
            return

        payload = json.loads(post_data.decode('utf-8'))
        self.handle_payload(payload)
        self.send_response(200)
        # send_response only buffers the status line and headers;
        # end_headers actually sends them to the client.
        self.end_headers()


class MyHandler(GithubHookHandler):
    def handle_payload(self, json_payload):
        """Simple handler that pretty-prints the payload."""
        print('JSON payload')
        pprint.pprint(json_payload)


if __name__ == '__main__':
    argparser = argparse.ArgumentParser(description='Github hook handler')
    argparser.add_argument('port', type=int, help='TCP port to listen on')
    args = argparser.parse_args()

    server = HTTPServer(('', args.port), MyHandler)
    server.serve_forever()

Just run it at some port on your server and point the webhook you create to it. Currently it responds on every path of the server (e.g. http://myserver.com:1234), but it should be trivial to restrict it to a specific path.
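Here’s one way that might look – a minimal sketch building on the MyHandler class from the listing above (the '/webhook' path is just an example):

class WebhookPathHandler(MyHandler):
    """Accepts POSTs only on a single, fixed path."""
    WEBHOOK_PATH = '/webhook'   # example path; pick your own

    def do_POST(self):
        # BaseHTTPRequestHandler exposes the request path as self.path.
        if self.path != self.WEBHOOK_PATH:
            self.send_response(404)
            self.end_headers()
            return
        super().do_POST()

Then pass WebhookPathHandler to HTTPServer instead of MyHandler, and point the Github webhook at http://myserver.com:1234/webhook.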

By the way, I found ngrok to be invaluable for testing this. It creates a tunnel from your localhost’s port to a unique URL you can set as the webhook destination on Github. This makes it possible to quickly iterate and test the server on your local machine.
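Another handy trick for quick local testing is forging a signed request yourself. Here’s a minimal sketch (the port, path and payload are arbitrary) that computes the same HMAC-SHA1 signature the server checks for, and POSTs a small JSON body:

import hashlib
import hmac
import json
import os
import urllib.request

# Same environment variable the server reads.
key = os.environb[b'HOOK_SECRET_KEY']
body = json.dumps({'zen': 'Testing is good.'}).encode('utf-8')
signature = 'sha1=' + hmac.new(key, msg=body,
                               digestmod=hashlib.sha1).hexdigest()

req = urllib.request.Request(
    'http://localhost:8000/',   # arbitrary local port
    data=body,
    headers={'X-Hub-Signature': signature,
             'Content-Type': 'application/json'})
print(urllib.request.urlopen(req).status)   # expect 200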

Summary of reading: April – June 2014

June 29th, 2014 at 2:39 pm
  • "A Guide to the Good Life: The Ancient Art of Stoic Joy" by William B. Irvine – An overview of Stoicism, one of the ancient Greece’s schools of philosophy. It’s interesting to draw parallels between it and modern "self help" approaches. Even more interestingly, it’s an important lesson to keep in mind that people more than 2000 years ago were concerned about very similar things as today, and came up with very similar solutions. I can’t say I liked how the book itself was written, but Stoicism is definitely interesting.
  • "Mastery" by Robert Greene – The author explores what it takes to become a master in some profession/craft, and provides mini-biographies of a number of successful historical figures, such as Mozart or Benjamin Franklin. While the biographical parts are pretty interesting, I found the thesis overall not too insightful. Greene tries to demonstrate that mastery comes from practice rather than inborn talent, but his biographical examples mostly show the opposite, it seems to me. That said, the book isn’t bad all in all. A particularly positive mention goes to the chapter about social intelligence which is pretty good.
  • "The Five Major Pieces to the Life Puzzle" by Jim Rohn – not what I expected :-/ This book is a brief rehash of common self-help slogans, without a lot of substance.
  • "Got Fight?" by Forrest Griffin – extremely funny and extremely politically incorrect. Don’t expect too much real advice from this book – mainly jokes and anecdotes. I’ve actually liked the last part of the book, where Griffin shows "tips" for actual MMA moves and techniques the least useful. You can’t really learn martial arts from a book… If you’re up for a quick read and a good laugh, though, this book will certainly deliver.
  • "Dr. Atkins’ New Diet Revolution" by Robert Atkins – The new edition of the classic book that launched the low-carb diet revolution. My expectations weren’t high here, and I was mainly looking for a more practical complement to Gary Taubes’s books, which explain why refined carbs are bad very well, but don’t give much in terms of practical advice of how to eat. While Atkins easily hits all the check-points of a sleazy self-help book, in terms of practical advice and todo-s it’s not bad. It provides reasonably detailed paths to weight loss and maintenance using a ultra low-carb diet, as well as helpful advice in terms of what foods to use to achieve it. One thing that really bugs me about the book, though, is that its claims of "no caloric restrictions" are disingenuous. On one hand the author says you don’t have to count calories at all when limiting carbs; on the other hand, he uses every opportunity to mention that all meals should be small. The "recommended daily menus" at the end of the book are very ascetic indeed. I’d say that it you eat three small meals a day, and your only snack in between is cantaloupe, that’s a diet any way you want to call it, because it will be very low calorie too.
  • "Two Scoops of Django: Best Practices For Django 1.6" by Daniel Greenfeld and Audrey Roy – with the surging popularity of Python for web development, and Django being its most widely used framework, this book fills a needed niche. The Django documentation is known for its high quality, but after having gone through it and having built a few simple applications, one may wonder about some of the more advanced techniques used by experienced Django developers. "Two Scoops" aims to provide a wide overview of such techniques. It has a lot of useful information, which can only be fully appreciated when you intend to apply it to a real project. An interesting fact about this book is that it’s self published – while the efforts of the authors with this aspect are admirable, the quality leaves something to be desired (both proofreading and in general the way the content is laid out graphically). That said, I’ve seen lower-quality books from established publishers, so this may not mean much.
  • "The Invisible Man" by H.G. Wells (Audiobook) – A very light and entertaining read. The audio recording available for free on Librivox is outstanding.
  • "Winter of the World" by Ken Follett – Second part of the trilogy, and also very good. The only thing that really bothered me is how involved the main characters are in the key historic events around World War II. I think the author went a bit overboard on this one. I realize it would be difficult to describe these events with the same amount of intimacy, but he did succeed in one of the cases. For example, Greg was a secret service supervisor on the Manhattan project – he didn’t have to be one of the scientists in it. A similar role could be carved for the other characters, putting them a bit away from the lime-lights. In general though, it’s a very good book.

Moving to Github – one year recap

June 8th, 2014 at 8:50 am

It’s been a year since I switched all my personal projects from Bitbucket / Mercurial to Github / Git, so I think the time is right for a brief recap.


As I mentioned back then, the platform (Github vs. Bitbucket) played a larger role in the decision to switch than the source control system (Git vs. Mercurial). Bitbucket lets you use Git now if you want, so the latter difference is actually moot. It’s the platform that remains the difference, and if I felt that Github had "won" the platform war a year ago, I feel even more so now. Github has even been called the new résumé for developers – it has definitely captured a very prominent place in the jargon and habits of our craft.

Amazingly, Bitbucket still hasn’t caught up in terms of front-page information. On Github it’s fairly easy to assess, at a quick glance, how active a developer is, which projects they contribute to, etc. On Bitbucket, all you see is a dull list of project names, not sorted in any useful way. Organizations, another cool Github concept, are also nowhere to be seen. Perhaps Bitbucket has given up on this fight, though, focusing on corporate users.

My main qualm with Bitbucket was that, due to its lack of popularity, I didn’t get as many contributions to my projects as I hoped for. Has this improved? Certainly! Instead of "why don’t you use Github" complaints, I now routinely get pull requests (or at least opened issues) for my projects. Several of my projects are starred by dozens of developers, and the most popular one – pycparser – now has over 40 forks!

Github pull requests are indeed a thing of beauty. Not that there’s anything technically complex about them – it’s a cultural thing. If you find a problem in a project, just send in a pull request; chances are it will be accepted. Pull requests have become a ubiquitous contribution currency because of one very subtle, yet powerful factor: Github pull requests are part of your online identity. Once your pull request is merged into a project, it’s forever retained in the project’s log that you contributed, with a direct link to your Github account / profile. For example, I committed a small fix to Django a few months ago – here it is, forever recorded and linked to my Github profile. The power of Github as the de-facto social network of developers cannot be ignored. It is way, way more significant than Linkedin.

Let’s put it this way – actual commits beat empty endorsements, any day.

Github indeed serves the résumé role well, at least for me. I routinely receive job / interview offers or recruiter pitches based on my "impressive Github profile".

Even projects that are not formally hosted on Github take care to maintain a mirror there, because Github is such a familiar, low-barrier way to examine their code and even fork it. It’s not uncommon for people to develop their changes in branches against the Github mirror, and send patches to the real source repository when done. I know this happens a lot in the LLVM & Clang world. Maintaining Github mirrors of projects is quite easy – for example, it took me just a few hours to set up this semi-official Github mirror for CPython, and as for maintaining it… well, there’s cron for that.

Github’s popularity also makes it the first target of other projects that aim to make developers’ lives easier. Travis CI is a blessing, and I use its Github integration extensively. All developers know how painful it is to set up and maintain continuous integration builds for their projects. Travis & Github make this ridiculously easy, at least for the limited set of platforms Travis currently supports. Automatically run tests for each proposed pull request and tell me if it breaks anything? Shut up and take my money! Oh, it’s free for open-source projects…

Using ASDL to describe ASTs in compilers

June 4th, 2014 at 6:25 am

ASTs (Abstract Syntax Trees) are an important data structure in compiler front-ends. If you’ve written a few parsers, you’ve almost certainly run into the need to describe the result of the parsing in terms of an AST. While the kinds of nodes such ASTs have and their structure are very specific to the source language, many commonalities come up. In other words, coding "yet another AST" gets really old after you’ve done it a few times.

Worry not: as you’d expect from the programmer crowd, this problem was "solved" by adding another level of abstraction. Yes, an abstraction over Abstract Syntax Trees, oh my! The abstraction here is a textual format (let’s call it a DSL to sound smart) that describes what the AST looks like, along with machinery to auto-generate the code that implements this AST.

Most solutions in this domain are ad-hoc, but one that I’ve seen used more than once is ASDL – Abstract Syntax Definition Language. The self-description from the website sounds about right:

The Zephyr Abstract Syntax Description Language (ASDL) is a language designed to describe the tree-like data structures in compilers. Its main goal is to provide a method for compiler components written in different languages to interoperate. ASDL makes it easier for applications written in a variety of programming languages to communicate complex recursive data structures.

To give an example, here’s a short snippet from an ASDL definition of a simple programming language:

program = Program(class* classes)

class = Class(identifier name, identifier? parent, feature* features)

[...]

expression = Assign(identifier name, expression expr)
           | StaticDispatch(expression expr, identifier type_name,
                            identifier name, expression* actual)
           | Dispatch(expression expr, identifier name, expression* actual)

[...]

The way to read this is: a program node consists of a sequence of classes. Each class has these children: a name, which is an identifier; an optional parent, which is also an identifier; and a (potentially empty) list of features, each of which is a feature node. And so on.

The full details are available in the paper "The Zephyr Abstract Syntax Definition Language" by Wang et al. Unfortunately, a link to this paper isn’t always trivial to find, so I keep a PDF copy in the docs directory of my asdl_parser project, which I’m going to discuss soon.

Type safety in ASDL

In addition to providing a concise description of nodes from which code (in many languages) can be generated automatically, I like ASDL for another reason. It provides some type safety when constructing the AST in the parser.

Take the snippet above, for example. A program has the classes attribute, which is a (potentially empty) sequence of class nodes. Each such node has to be a Class, whose structure is precisely defined – it can be nothing else. The expression definition below it shows a different case: an expression can be either an Assign, a StaticDispatch, etc.

The set of possibilities is statically defined. This makes it possible to insert some degree of static checking into the (auto-generated) AST construction code, so the constructed AST can’t be completely bogus even before semantic analysis is applied. Even though I love Python, I do appreciate a bit of static type checking in the right places; key data structures like ASTs are, I believe, one of the places where such type checking makes sense.
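To make this concrete, here’s a hypothetical sketch – not the output of any actual ASDL code generator – of what generated Python node classes with such checks could look like:

class expression:
    """Base class for all expression variants."""

class Assign(expression):
    def __init__(self, name, expr):
        # The generated constructor verifies the type of each child,
        # so a bogus tree is rejected at construction time.
        if not isinstance(name, str):
            raise TypeError('Assign.name must be an identifier (str)')
        if not isinstance(expr, expression):
            raise TypeError('Assign.expr must be an expression node')
        self.name = name
        self.expr = expr

Trying to build Assign('x', 42) fails immediately with a TypeError, instead of producing a malformed tree that blows up much later.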

ASDL in upstream CPython

Starting with Python 2.5, the CPython compiler (the part responsible for emitting bytecode from Python source) uses an ASDL description to create an AST for Python source. The AST is created by the parser (from the parse tree – more details in PEP 339), and is then used to create the control-flow graph, from which bytecode is emitted.

The ASDL description lives in Parser/Python.asdl in the CPython source tree. Parser/asdl_c.py is a script that runs whenever someone modifies this ASDL description. It uses the Parser/asdl.py module to parse the ASDL file into an internal form and then emits C code that describes the ASTs. This C code lives in Include/Python-ast.h and Python/Python-ast.c [1].

This may be more detail than you wanted to hear :-) The gist of it, however, is: CPython’s ASTs are described in ASDL, and if you want a quick glance at how these ASTs look, Parser/Python.asdl is the file to look at.
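You can also see this correspondence from the comfort of a Python prompt, since the ast module exposes exactly these ASDL-defined nodes. A quick example (the dump in the comment is what Python 3.4 prints; newer versions may format it slightly differently):

import ast

tree = ast.parse('x = 1')
# Every node name in the dump (Module, Assign, Name, Store, Num)
# comes straight from a definition in Parser/Python.asdl.
print(ast.dump(tree))
# Module(body=[Assign(targets=[Name(id='x', ctx=Store())], value=Num(n=1))])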

My rewrite of the ASDL parser

Up until very recently, the ASDL description of the CPython AST was parsed by a tool that relied on the SPARK parsing toolkit. In fact, Parser/spark.py was carried around in the distribution just for this purpose.

A few months ago I was looking for something to conveniently implement the AST for a toy compiler I was hacking on. Being a CPython developer, I immediately thought of ASDL, but I was reluctant to carry the SPARK dependency and/or learn how to use it. The ASDL language seemed simple enough not to require such machinery – surely a simple recursive-descent parser would do. So I implemented my own stand-alone parser for ASDL, using modern Python 3.x, and it’s available in a public Github repository right here. Feel free to use it, and let me know how it goes!
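As a quick taste of what using it looks like – a hedged sketch, assuming the repository’s asdl module exposes the parse() entry point and Module/dfns structure of the version that later landed in CPython’s Parser/asdl.py:

import asdl

# Parse an ASDL file into an in-memory description; mod.dfns is the
# list of type definitions (program, class, expression, ...).
mod = asdl.parse('Python.asdl')
for dfn in mod.dfns:
    print(dfn.name)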

Since my parser turned out to be much simpler and easier to grok than upstream CPython’s SPARK-based parser, I proposed replacing it in Issue 19655. After some delays (caused mainly by waiting for the 3.4 release and then getting distracted by other stuff), the change landed in the default branch (on its way to 3.5) about a month ago. The result is pleasing – the new parser is shorter, doesn’t require the SPARK dependency (which has now been dropped), has tests, and is much more maintainable.

In the interest of not changing too much at once, I left the interface to the C code generator (Parser/asdl_c.py) the same, so there is absolutely no difference in the produced C code. Some time in the future it may make sense to revise this decision. The C generator is also fairly old code that could use some modernization and tests.

Historical note – AST description in pycparser

I first ran into this problem (high-level description of ASTs) when I was working on pycparser (which is quite an old project by now).

Back at the time, I looked at the compiler module of Python 2.x and liked its approach of a simple textual description of the AST, which is parsed and from which the code for AST nodes is emitted. The compiler module was a maintenance headache (because it duplicated a lot of the AST logic from the actual compiler) and is gone in Python 3.x, replaced by the ast module, which provides access to the same C-based AST that the CPython compiler uses, generated by Parser/asdl_c.py [1].

pycparser’s AST description is a simple textual file that’s very similar in spirit to ASDL. If I were to do this today, I’d probably also pick ASDL since it’s more "standard", as well as for the extra type safety guarantees it provides.


[1] Even though these files are auto-generated, they are also checked into the CPython Mercurial repository. This is because we don’t want people building Python from source to depend on the tools required to generate such files. Only core CPython developers who want to play with the internals need them.