Welcome


... to my place on the web - a personal website where I keep articles, freeware programs and code snippets I've written, and a weblog where I share my ideas and insights on programming, technology, books, and life in general.

To browse the website, use the site map on the right-hand side of the screen. Below, you will find the most recent posts.

Payload server in Python 3 for Github webhooks

July 9th, 2014 at 5:50 am

The Github Webhooks API is powerful and flexible, making it simple to integrate services with your source repository. Lately I’ve been tinkering with it a bit, but all the examples Github provides are in Ruby. So I put together a simple demo server in Python 3. Though simple (it’s completely self-contained and only needs Python 3 to run), it’s complete, even covering webhook security by verifying the signature created with the API’s secret token.

Here it is:

import argparse
import hashlib
import hmac
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
import pprint
import os

# It's not recommended to store the key within the code. Following
# http://12factor.net/config, we'll store this in the environment.
# Note that the key must be a bytes object.
HOOK_SECRET_KEY = os.environb[b'HOOK_SECRET_KEY']


class GithubHookHandler(BaseHTTPRequestHandler):
    """Base class for webhook handlers.

    Subclass it and implement 'handle_payload'.
    """
    def _validate_signature(self, data):
        sha_name, signature = self.headers['X-Hub-Signature'].split('=')
        if sha_name != 'sha1':
            return False

        # HMAC requires both its key and message to be bytes; post_data is
        # already bytes, and HOOK_SECRET_KEY was read as bytes above.
        mac = hmac.new(HOOK_SECRET_KEY, msg=data, digestmod=hashlib.sha1)
        return hmac.compare_digest(mac.hexdigest(), signature)

    def do_POST(self):
        data_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(data_length)

        if not self._validate_signature(post_data):
            self.send_response(401)
            self.end_headers()
            return

        payload = json.loads(post_data.decode('utf-8'))
        self.handle_payload(payload)
        self.send_response(200)
        self.end_headers()


class MyHandler(GithubHookHandler):
    def handle_payload(self, json_payload):
        """Simple handler that pretty-prints the payload."""
        print('JSON payload')
        pprint.pprint(json_payload)


if __name__ == '__main__':
    argparser = argparse.ArgumentParser(description='Github hook handler')
    argparser.add_argument('port', type=int, help='TCP port to listen on')
    args = argparser.parse_args()

    server = HTTPServer(('', args.port), MyHandler)
    server.serve_forever()

Just run it at some port on your server and point the webhook you create to it. Currently it just serves the server’s root path (e.g. http://myserver.com:1234), but it should be trivial to modify it to handle any path.
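To exercise the server locally before wiring up a real webhook, you can send it a signed request yourself. Here’s a minimal sketch of such a test client (it assumes the server is listening on localhost port 1234 and that the same HOOK_SECRET_KEY is exported in the environment):

import hashlib
import hmac
import json
import os
import urllib.request

# Use the same secret the server was started with.
SECRET = os.environb[b'HOOK_SECRET_KEY']

# A dummy payload; a real Github webhook sends much richer JSON.
body = json.dumps({'zen': 'Keep it logically awesome.'}).encode('utf-8')
mac = hmac.new(SECRET, msg=body, digestmod=hashlib.sha1)

req = urllib.request.Request(
    'http://localhost:1234/',
    data=body,
    headers={'Content-Type': 'application/json',
             'X-Hub-Signature': 'sha1=' + mac.hexdigest()})
print(urllib.request.urlopen(req).status)

Flip a character in the signature and the server answers with 401 instead of 200 (which urlopen surfaces as an HTTPError).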

By the way, I found ngrok to be invaluable for testing this. It creates a tunnel from your localhost’s port to a unique URL you can set as the webhook destination on Github. This makes it possible to quickly iterate and test the server on your local machine.

Summary of reading: April – June 2014

June 29th, 2014 at 2:39 pm
  • "A Guide to the Good Life: The Ancient Art of Stoic Joy" by William B. Irvine – An overview of Stoicism, one of the ancient Greece’s schools of philosophy. It’s interesting to draw parallels between it and modern "self help" approaches. Even more interestingly, it’s an important lesson to keep in mind that people more than 2000 years ago were concerned about very similar things as today, and came up with very similar solutions. I can’t say I liked how the book itself was written, but Stoicism is definitely interesting.
  • "Mastery" by Robert Greene – The author explores what it takes to become a master in some profession/craft, and provides mini-biographies of a number of successful historical figures, such as Mozart or Benjamin Franklin. While the biographical parts are pretty interesting, I found the thesis overall not too insightful. Greene tries to demonstrate that mastery comes from practice rather than inborn talent, but his biographical examples mostly show the opposite, it seems to me. That said, the book isn’t bad all in all. A particularly positive mention goes to the chapter about social intelligence which is pretty good.
  • "The Five Major Pieces to the Life Puzzle" by Jim Rohn – not what I expected :-/ This book is a brief rehash of common self-help slogans, without a lot of substance.
  • "Got Fight?" by Forrest Griffin – extremely funny and extremely politically incorrect. Don’t expect too much real advice from this book – mainly jokes and anecdotes. I’ve actually liked the last part of the book, where Griffin shows "tips" for actual MMA moves and techniques the least useful. You can’t really learn martial arts from a book… If you’re up for a quick read and a good laugh, though, this book will certainly deliver.
  • "Dr. Atkins’ New Diet Revolution" by Robert Atkins – The new edition of the classic book that launched the low-carb diet revolution. My expectations weren’t high here, and I was mainly looking for a more practical complement to Gary Taubes’s books, which explain why refined carbs are bad very well, but don’t give much in terms of practical advice of how to eat. While Atkins easily hits all the check-points of a sleazy self-help book, in terms of practical advice and todo-s it’s not bad. It provides reasonably detailed paths to weight loss and maintenance using a ultra low-carb diet, as well as helpful advice in terms of what foods to use to achieve it. One thing that really bugs me about the book, though, is that its claims of "no caloric restrictions" are disingenuous. On one hand the author says you don’t have to count calories at all when limiting carbs; on the other hand, he uses every opportunity to mention that all meals should be small. The "recommended daily menus" at the end of the book are very ascetic indeed. I’d say that it you eat three small meals a day, and your only snack in between is cantaloupe, that’s a diet any way you want to call it, because it will be very low calorie too.
  • "Two Scoops of Django: Best Practices For Django 1.6" by Daniel Greenfeld and Audrey Roy – with the surging popularity of Python for web development, and Django being its most widely used framework, this book fills a needed niche. The Django documentation is known for its high quality, but after having gone through it and having built a few simple applications, one may wonder about some of the more advanced techniques used by experienced Django developers. "Two Scoops" aims to provide a wide overview of such techniques. It has a lot of useful information, which can only be fully appreciated when you intend to apply it to a real project. An interesting fact about this book is that it’s self published – while the efforts of the authors with this aspect are admirable, the quality leaves something to be desired (both proofreading and in general the way the content is laid out graphically). That said, I’ve seen lower-quality books from established publishers, so this may not mean much.
  • "The Invisible Man" by H.G. Wells (Audiobook) – A very light and entertaining read. The audio recording available for free on Librivox is outstanding.
  • "Winter of the World" by Ken Follett – Second part of the trilogy, and also very good. The only thing that really bothered me is how involved the main characters are in the key historic events around World War II. I think the author went a bit overboard on this one. I realize it would be difficult to describe these events with the same amount of intimacy, but he did succeed in one of the cases. For example, Greg was a secret service supervisor on the Manhattan project – he didn’t have to be one of the scientists in it. A similar role could be carved for the other characters, putting them a bit away from the lime-lights. In general though, it’s a very good book.

Moving to Github – one year recap

June 8th, 2014 at 8:50 am

It’s been a year since I switched all my personal projects from Bitbucket / Mercurial to Github / Git, so I think the time is right for a brief recap.

http://eli.thegreenplace.net/wp-content/uploads/2014/06/pythocat-e1401975025529.png

As I mentioned back then, the platform (Github vs. Bitbucket) played a larger role in the decision to switch than the source control system (Git vs. Mercurial). Bitbucket lets you use Git now if you want, so the latter difference is actually moot. It’s the platform that remains the difference, and if I felt that Github had "won" the platform war a year ago, I feel it even more so now. Github has even been called the new résumé for developers; it has definitely captured a very prominent place in the jargon and habits of our craft.

Amazingly, Bitbucket still hasn’t caught up in terms of front-page information. On Github it’s fairly easy, at a quick glance, to assess how active a developer is, which projects he contributes to, etc. On Bitbucket, all you see is a dull list of project names, not sorted in any useful way. Organizations, another cool Github concept, are also nowhere to be seen. Perhaps Bitbucket gave up on this fight, though, focusing on corporate users instead.

My main qualm with Bitbucket was that, due to its lack of popularity, I didn’t get as many contributions to my projects as I’d hoped for. Has this improved? Certainly! Instead of "why don’t you use Github" complaints, I now routinely get pull requests (or at least opened issues) for my projects. Several of my projects are starred by dozens of developers, and the most popular one – pycparser – now has over 40 forks!

Github pull requests are indeed a thing of beauty. Not that there’s anything technically complex about them – it’s a cultural thing. If you find a problem in a project, just send in a pull request. Chances are it will be accepted. Pull requests have become a ubiquitous contribution currency because of one very subtle, yet powerful factor: Github pull requests are part of your online identity. Once your pull request is merged into a project, it’s forever retained in that project’s log, with a direct link to your Github account / profile. For example, I committed a small fix to Django a few months ago – here it is, forever recorded and linked to my Github profile. The power of Github as the de-facto social network of developers cannot be ignored. It is way, way more significant than Linkedin.

Let’s put it this way – actual commits beat empty endorsements, any day.

Github indeed serves the résumé role well, at least for me. I routinely receive job / interview offers or recruiter pitches based on my "impressive Github profile".

Even projects that are not formally hosted on Github take care to maintain a mirror there, because Github is such a familiar, low-barrier way to examine their code and even fork it. It’s not uncommon for people to develop their changes in branches against the Github mirror and send patches to the real source repository when done. I know this happens a lot in the LLVM & Clang world. Maintaining Github mirrors of projects is quite easy – for example, it took me just a few hours to set up this semi-official Github mirror for CPython, and as for maintaining it… well, there’s cron for that.

Github’s popularity is also the reason it’s the first target of other projects that aim to make developers’ lives easier. Travis CI is a blessing, and I use its Github integration extensively. All developers know how painful it is to set up and maintain continuous integration builds for their projects. Travis & Github make this ridiculously easy, at least for the limited set of platforms Travis currently supports. Automatically run tests for each proposed pull request and tell me if it breaks anything? Shut up and take my money! Oh, it’s free for open-source projects…

Using ASDL to describe ASTs in compilers

June 4th, 2014 at 6:25 am

ASTs (Abstract Syntax Trees) are an important data structure in compiler front-ends. If you’ve written a few parsers, you’ve almost certainly run into the need to describe the result of the parsing in terms of an AST. While the kinds of nodes such ASTs have and their structure are very specific to the source language, many commonalities come up. In other words, coding "yet another AST" gets really old after you’ve done it a few times.

Worry not, as you’d expect from the programmer crowd, this problem was "solved" by adding another level of abstraction. Yes, an abstraction over Abstract Syntax Trees, oh my! The abstraction here is some textual format (let’s call it a DSL to sound smart) that describes what the AST looks like, along with machinery to auto-generate the code that implements this AST.

Most solutions in this domain are ad-hoc, but one that I’ve seen used more than once is ASDL – Abstract Syntax Definition Language. The self-description from the website sounds about right:

The Zephyr Abstract Syntax Description Language (ASDL) is a language designed to describe the tree-like data structures in compilers. Its main goal is to provide a method for compiler components written in different languages to interoperate. ASDL makes it easier for applications written in a variety of programming languages to communicate complex recursive data structures.

To give an example, here’s a short snippet from an ASDL definition of a simple programming language:

program = Program(class* classes)

class = Class(identifier name, identifier? parent, feature* features)

[...]

expression = Assign(identifier name, expression expr)
           | StaticDispatch(expression expr, identifier type_name,
                            identifier name, expression* actual)
           | Dispatch(expression expr, identifier name, expression* actual)

[...]

The way to read this is: a program node consists of a (possibly empty) sequence of classes. Each class has these children: a name, which is an identifier; an optional parent, also an identifier; and a (potentially empty) list of features, each of which is a feature node. And so on.

The full details are available in the paper "The Zephyr Abstract Syntax Definition Language" by Wang et al. Unfortunately, a link to this paper isn’t always trivial to find, so I have a PDF copy in the docs directory of my asdl_parser project, which I’m going to discuss soon.

Type safety in ASDL

In addition to providing a concise description of nodes from which code (in many languages) can be generated automatically, I like ASDL for another reason. It provides some type safety when constructing the AST in the parser.

Take the snippet above, for example. A program has the classes attribute, which is a (potentially empty) sequence of class nodes. Each such class has to be a Class, which is precisely defined – it can be nothing else. The expression definition below it shows another form – an expression can be either an Assign, a StaticDispatch, etc.

The set of possibilities is statically defined. This makes it possible to insert some degree of static checking into the (auto-generated) AST construction code, so the constructed AST can’t be completely bogus even before semantic analysis is applied. Even though I love Python, I do appreciate a bit of static type checking in the right places. Key data structures like ASTs are, I believe, one of the places where such type checking makes sense.
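To make this concrete, here is a hand-written sketch of the kind of type-checked constructors a generator could emit from the snippet above (illustration only – it’s not the output of any actual ASDL tool):

class Node:
    pass

class Feature(Node):
    """Placeholder for the 'feature' definitions elided from the snippet."""
    pass

class Class(Node):
    def __init__(self, name, parent, features):
        # Class(identifier name, identifier? parent, feature* features)
        assert isinstance(name, str)
        assert parent is None or isinstance(parent, str)
        assert all(isinstance(f, Feature) for f in features)
        self.name = name
        self.parent = parent
        self.features = list(features)

class Program(Node):
    def __init__(self, classes):
        # Program(class* classes) - only Class nodes are allowed here.
        assert all(isinstance(c, Class) for c in classes)
        self.classes = list(classes)

A bogus tree like Program(classes=['oops']) is then rejected at construction time, long before semantic analysis runs.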

ASDL in upstream CPython

Starting with Python 2.5, the CPython compiler (the part responsible for emitting bytecode from Python source) uses an ASDL description to create an AST for Python source. The AST is created by the parser (from the parse tree – more details in PEP 339), and is then used to create the control-flow graph, from which bytecode is emitted.

The ASDL description lives in Parser/Python.asdl in the CPython source tree. Parser/asdl_c.py is a script that runs whenever someone modifies this ASDL description. It uses the Parser/asdl.py module to parse the ASDL file into an internal form and then emits C code that describes the ASTs. This C code lives in Include/Python-ast.h and Python/Python-ast.c [1].

This may be more detail than you wanted to hear :-) The gist of it, however, is that CPython’s ASTs are described in ASDL, and if you want a quick glance at how these ASTs look, Parser/Python.asdl is the file to look at.
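Since the ast module exposes these same ASDL-described nodes to Python code, a quick way to see them in action is from the interactive prompt (the dump below is abbreviated and may differ slightly between Python versions):

import ast

# Module, Assign, Name and BinOp below are all defined in Parser/Python.asdl.
tree = ast.parse('x = 1 + 2')
print(ast.dump(tree))
# Roughly: Module(body=[Assign(targets=[Name(id='x', ctx=Store())],
#                              value=BinOp(left=Num(n=1), op=Add(),
#                                          right=Num(n=2)))])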

My rewrite of the ASDL parser

Up until very recently, the ASDL description of the CPython AST was parsed by a tool that relied on the SPARK parsing toolkit. In fact, Parser/spark.py was carried around in the distribution just for this purpose.

A few months ago I was looking for something to conveniently implement the AST for a toy compiler I was hacking on. Being a CPython developer, ASDL immediately sprang to mind, but I was reluctant to carry the SPARK dependency and/or learn how to use it. The ASDL language seemed simple enough to not require such machinery. Surely a simple recursive-descent parser would do. So I implemented my own stand-alone parser for ASDL, using modern Python 3.x – and it’s available in a public Github repository right here. Feel free to use it, and let me know how it goes!
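Here’s a minimal sketch of how I’d expect it to be used (assuming the module exposes the same parse() and check() entry points as the rewritten Parser/asdl.py in CPython; check the repository’s README for the actual interface):

import asdl

# Parse an ASDL file into its internal form and run the semantic checks.
mod = asdl.parse('Python.asdl')
if asdl.check(mod):
    for dfn in mod.dfns:      # one entry per 'type = ...' definition
        print(dfn.name)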

Since my parser turned out to be much simpler and easier to grok than upstream CPython’s SPARK-based parser, I proposed to replace it in Issue 19655. After some delays (caused mainly by waiting for 3.4 release and then getting distracted by other stuff), the change landed in the default branch (on its way to 3.5) about a month ago. The result is pleasing – the new parser is shorter, doesn’t require the SPARK dependency (which was now dropped), has tests and is much more maintainable.

In the interest of not changing too much at once, I left the interface to the C code generator (Parser/asdl_c.py) the same, so there is absolutely no difference in the produced C code. Some time in the future it may make sense to revise this decision. The C generator is also fairly old code that could use some modernization and tests.

Historical note – AST description in pycparser

I first ran into this problem (high-level description of ASTs) when I was working on pycparser (which is quite an old project by now).

Back at the time, I looked at the compiler module of Python 2.x and liked its approach of a simple textual description of the AST, which is then parsed and from which the code for the AST nodes is emitted. The compiler module was a maintenance headache (because it duplicated a lot of the AST logic from the actual compiler) and is gone in Python 3.x, replaced by the ast module, which provides access to the same C-based AST generated by Parser/asdl_c.py that the CPython compiler uses.

pycparser’s AST description is a simple textual file that’s very similar in spirit to ASDL. If I were to do this today, I’d probably also pick ASDL since it’s more "standard", as well as for the extra type safety guarantees it provides.


[1] Even though these files are auto-generated, they are also checked into the CPython Mercurial repository. This is because we don’t want people building Python from source to depend on the tools required to generate such files. Only core CPython developers who want to play with the internals need them.

Compilation databases for Clang-based tools

May 21st, 2014 at 10:09 pm

If you’re interested in writing analysis and source-rewriting tools with Clang’s libTooling, you may have run into the following ominous error while trying to invoke a tool on some code:

$ clang-check -analyze div0.c
LLVM ERROR: Could not auto-detect compilation database for file "div0.c"
No compilation database found in /tmp or any parent directory
json-compilation-database: Error while opening JSON database:
     No such file or directory

So what’s a compilation database, why do Clang tools need it, and how do you go about creating one?

Motivation – faithfully reproducing a compilation

Unlike many other source analysis tools (for example – syntax coloring in editors) which only provide approximate parsing of C++ source, Clang tools are the real thing. The same compiler frontend that’s used to actually parse and compile source is used to build the AST for analysis. This is great because it means you never get false positives; but it also means the analysis tools need the complete information available to the compiler when looking at source files.

When we compile code we pass all kinds of flags to the compiler: warning flags, language-version flags, etc. But most importantly – macro definitions (-D) and include directories (-I). Without the latter, it’s not even possible to parse the source code properly. Historically, a "classical" C compiler pipeline used to run the preprocessor (cpp) to take care of these before the compiler would even see the file. These days, modern compilers like Clang combine preprocessing with parsing, but the fundamentals remain in place.

OK then, we need to know which flags the code was compiled with. How do we pass this information to tools?

Fixed compilation database

This is where the concept of "compilation database" comes in. In simple terms, it’s a collection of exact compilation commands for a set of files. I’ll discuss it in more detail shortly, but first a brief detour into specifying the commands in a simple way that doesn’t require a special file.

A "fixed" compilation database allows us to pass the compilation flags to a tool on the command-line, following a special token --. Here’s a complete example that will demonstrate what I mean. Consider this code:

#define DODIV(a, b) ((a) / (b))

int test(int z) {
  if (z == 0) {
#ifdef FOO
    return DODIV(1, z);
#else
    return 1 - z;
#endif
  }
  return 1 + z;
}

Running clang-check simply as shown in the beginning of the post results in an error message. If we tack a -- to the end of the command-line, however, it works:

$ clang-check -analyze div0.c --

By "works" here I mean "does not die with an error". But it doesn’t report anything either, while I’d expect it to detect a division by zero in the if (z == 0) case [1].

This is because we didn’t provide any compiler flags. So the analysis assumed the file is compiled like so:

$ clang -c div0.c

Indeed, note that the "divide by 0" error happens only if the FOO macro is defined. It’s not defined here, so the analyzer is quiet [2]. Let’s define it then:

$ clang-check -analyze div0.c -- -DFOO
/tmp/div0.c:6:12: warning: Division by zero
    return DODIV(1, z);
           ^~~~~~~~~~~
/tmp/div0.c:1:26: note: expanded from macro 'DODIV'
#define DODIV(a, b) ((a) / (b))
                     ~~~~^~~~~
1 warning generated.

So providing compilation commands to tools on the command-line is easy. However, if you want to run analyses/transformations over larger projects for which some sort of build system already exists, you’ll probably find a real compilation database more useful.

JSON compilation database

When Clang tools complain they can’t find a compilation database, what they actually mean is a specially named JSON file in either the same directory as the file being processed or in one of its parent directories. The JSON compilation database is very simple. Here’s an example:

[
{
  "directory": "/tmp",
  "command": "gcc div0.c",
  "file": "/tmp/div0.c"
},
{
  "directory": "/tmp",
  "command": "gcc -DFOO div0.c",
  "file": "/tmp/div0.c"
}
]

It’s just a list of entries, each of which consists of these fields:

  • file: the file to which the compilation applies
  • command: the exact compilation command used
  • directory: the directory from which the compilation is executed [3]

As you can see above, there may be multiple entries for the same file. This is not a mistake – it’s entirely plausible that the same file gets compiled multiple times inside a project, each time with different options.

If you paste this into a file named compile_commands.json and place it in the same directory (or any of its parents) as the file you want to run the analysis on, the tool will work without requiring the -- part, because it can find the file in the compilation database and infer the compilation command on its own. If the tool finds more than one entry for a file, it just runs multiple times, once per entry. As far as the tool is concerned, two compilations of the same file can be entirely different due to differences in flags.
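Since the format is plain JSON, it’s also easy to inspect or post-process with a short script; for example (a small sketch, run from the directory holding the database):

import json

# Print every compilation recorded in the database; the same file may
# legitimately appear more than once, each time with different flags.
with open('compile_commands.json') as f:
    for entry in json.load(f):
        print('{file} (in {directory}):\n    {command}'.format(**entry))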

Compilation database for transformation tools

Source transformation tools use a compilation database similarly to analysis tools. Consider this contrived example:

#include <stdlib.h>

int* foo() {
#ifdef FOO
  return 0;
#else
  return NULL;
#endif
}

Let’s save this file as nullptr.cpp and run clang-modernize -use-nullptr on it to transform all "NULL-pointer like" constants to an actual nullptr:

$ $LLVMGH/clang-modernize -use-nullptr -summary nullptr.cpp --
Transform: UseNullptr - Accepted: 1
$ cat nullptr.cpp
#include <stdlib.h>

int* foo() {
#ifdef FOO
  return 0;
#else
  return nullptr;
#endif
}

As expected, clang-modernize only replaced within the #else clause because FOO is not defined. We already know how to define it on the command line. We also know that a hypothetical compilation database could provide two entries for nullptr.cpp – one with and one without -DFOO. In this case, clang-modernize would actually run twice over the same file and replace both the 0 and the NULL.

Creating a compilation database for your project

By now we have a good understanding of how to provide Clang tools with compilation flags for simple files. What about whole projects, however? Assume you have a large existing project and you want to run tools on its source code. You already have a build system of some sort that compiles all the files. How do you tell Clang tools which flags are suitable for any file in the project?

There are a few good options. A reasonably recent version of the CMake build tool supports emitting compilation databases [4]. All you need is to run the cmake step with the -DCMAKE_EXPORT_COMPILE_COMMANDS=ON flag, and CMake will take it from there, emitting a compile_commands.json file into your build directory.

If you’re not using CMake, there are other options. The Ninja build tool can also emit a compilation database since version 1.2, so a Gyp/Ninja combination should be good too.

If your project doesn’t use either, you should be able to roll your own without too much difficulty. Tools like Build EAR may be helpful here.
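Rolling your own can be as simple as dumping one JSON entry per source file. Here’s a sketch (the get_flags helper is hypothetical – it stands in for however your build system reports the flags used for each file):

import json
import os

def get_flags(path):
    # Hypothetical: ask your build system which flags 'path' is built with.
    return '-DFOO -Iinclude'

sources = ['src/div0.c', 'src/nullptr.cpp']   # or discover them with os.walk
db = [{'directory': os.getcwd(),
       'command': 'clang -c {} {}'.format(get_flags(src), src),
       'file': os.path.abspath(src)}
      for src in sources]

with open('compile_commands.json', 'w') as f:
    json.dump(db, f, indent=2)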

By the way, it should be clear that large projects are precisely the raison d’être of compilation databases. A single "database" file contains complete information about all the source files in the project, providing Clang tools with the compilation commands required to do their job.

A custom compilation database

It may be the case that you have a very specialized build system that already keeps some sort of record of the flags used to build each file. This is sometimes the case in large companies with monolithic code bases. For such scenarios, you’ll be happy to find out that this aspect of Clang tools is fully customizable, because compilation database readers are based on a plugin system. The CompilationDatabase interface (clang/include/clang/Tooling/CompilationDatabase.h) is something you can implement on your own. The same header file that defines the interface also defines CompilationDatabasePlugin, which can be used to link your own compilation database readers to Clang tools.

The existing JSON compilation database implementation (clang/lib/Tooling/JSONCompilationDatabase.cpp) is implemented as such a plugin, so there’s a handy in-tree example for rolling your own.

Final words

For most users of Clang tools and people interested in writing custom tools, this post contains way too much information. Chances are you won’t need all of it. But I felt it’s important, for the sake of completeness, to describe in full detail what compilation databases are and how they tie into the larger picture.

This will help me focus on more internals and examples of Clang tooling in future posts without worrying about compilation databases again.


[1] clang-check is the Clang static analysis tool; it performs control-flow based analysis that can detect cases like this.
[2] To motivate this – you wouldn’t want the analyzer to bug you about "errors" in code that’s #if 0-ed out, or hidden behind an #ifdef for a different compiler/platform, would you?
[3] Note that this is critical for things like relative paths to -I – the tool needs to know where the compiler was actually invoked from to find the directories.
[4] This is what the upstream LLVM project uses for its own needs.