Compilation databases for Clang-based tools

If you're interested in writing analysis and source-rewriting tools with Clang's libTooling, you may have run into the following ominous error while trying to invoke a tool on some code:

$ clang-check -analyze div0.c
LLVM ERROR: Could not auto-detect compilation database for file "div0.c"
No compilation database found in /tmp or any parent directory
json-compilation-database: Error while opening JSON database:
     No such file or directory

So what's a compilation database, why do Clang tools need it, and how do you go about creating one?

Motivation - faithfully reproducing a compilation

Unlike many other source analysis tools (for example - syntax coloring in editors) which only provide approximate parsing of C++ source, Clang tools are the real thing. The same compiler frontend that's used to actually parse and compile source is used to build the AST for analysis. This is great because the tool sees exactly what the compiler sees, so you don't get the false positives that approximate parsing produces; but it also means the analysis tools need the complete information available to the compiler when looking at source files.

When we compile code we pass all kinds of flags to the compiler: warning flags, language-version flags, and so on. Most important, though, are macro definitions (-D) and include directories (-I). Without these, it's not even possible to parse the source code properly. Historically, a "classical" C compiler pipeline used to run the preprocessor (cpp) to take care of these before the compiler would even see the file. These days modern compilers like Clang combine preprocessing with parsing, but the fundamentals remain in place.

OK then, we need to know which flags the code was compiled with. How do we pass this information to tools?

Fixed compilation database

This is where the concept of a "compilation database" comes in. In simple terms, it's a collection of exact compilation commands for a set of files. I'll discuss it in more detail shortly, but first a brief detour into specifying the commands in a simple way that doesn't require a special file.

A "fixed" compilation database allows us to pass the compilation flags to a tool on the command-line, following a special token --. Here's a complete example that will demonstrate what I mean. Consider this code:

#define DODIV(a, b) ((a) / (b))

int test(int z) {
  if (z == 0) {
#ifdef FOO
    return DODIV(1, z);
#else
    return 1 - z;
#endif
  }
  return 1 + z;
}

Running clang-check simply as shown in the beginning of the post results in an error message. If we tack a -- to the end of the command-line, however, it works:

$ clang-check -analyze div0.c --

By "works" here I mean "does not die with an error". But it doesn't report anything either, while I'd expect it to detect a division by zero in the if (z == 0) case [1].

This is because we didn't provide any compiler flags. So the analysis assumed the file is compiled like so:

$ clang -c div0.c

Indeed, note that the "divide by 0" error happens only if the FOO macro is defined. It's not defined here, so the analyzer is quiet [2]. Let's define it then:

$ clang-check -analyze div0.c -- -DFOO
/tmp/div0.c:6:12: warning: Division by zero
    return DODIV(1, z);
           ^~~~~~~~~~~
/tmp/div0.c:1:26: note: expanded from macro 'DODIV'
#define DODIV(a, b) ((a) / (b))
                     ~~~~^~~~~
1 warning generated.

So providing compilation commands to tools on the command-line is easy. However, if you want to run analyses/transformations over larger projects for which some sort of build system already exists, you'll probably find a real compilation database more useful.
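To make the role of the -- token concrete, here's a minimal sketch of the splitting step. This is not Clang's implementation: SplitArgs and splitAtDoubleDash are names I made up for illustration, and I'm assuming that only the first -- acts as the separator.

```cpp
#include <string>
#include <vector>

// Hypothetical helper types, not part of Clang.
struct SplitArgs {
  std::vector<std::string> toolArgs;       // arguments meant for the tool
  std::vector<std::string> compilerFlags;  // flags applied to every input file
  bool sawSeparator = false;               // false means "no fixed database"
};

// Splits an argument list at the first "--" token. Everything before it
// belongs to the tool; everything after it is treated as compiler flags.
SplitArgs splitAtDoubleDash(const std::vector<std::string>& args) {
  SplitArgs out;
  for (const std::string& arg : args) {
    if (!out.sawSeparator && arg == "--") {
      out.sawSeparator = true;  // any later "--" is kept as an ordinary flag
      continue;
    }
    (out.sawSeparator ? out.compilerFlags : out.toolArgs).push_back(arg);
  }
  return out;
}
```

Feeding it the arguments -analyze div0.c -- -DFOO leaves -analyze and div0.c for the tool and hands -DFOO to the compiler side. With a bare trailing --, the flag list is simply empty, which matches the "compiled with no flags" behavior seen above.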

JSON compilation database

When Clang tools complain they can't find a compilation database, what they actually mean is a specially named JSON file in either the same directory as the file being processed or in one of its parent directories. The JSON compilation database is very simple. Here's an example:

[
{
  "directory": "/tmp",
  "command": "gcc div0.c",
  "file": "/tmp/div0.c"
},
{
  "directory": "/tmp",
  "command": "gcc -DFOO div0.c",
  "file": "/tmp/div0.c"
}
]

It's just a list of entries, each of which consists of these fields:

file: the file to which the compilation applies
command: the exact compilation command used
directory: the directory from which the compilation is executed [3]
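The directory field earns its place once relative paths show up in the command. Here's a small sketch of how a relative -I argument could be resolved against it; resolveIncludeDir is a hypothetical helper of mine, and the resolution is purely lexical (it doesn't touch the filesystem).

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: resolves the argument of an -I flag against the
// "directory" field of a compilation database entry.
// operator/ keeps an absolute right-hand side unchanged, so absolute
// include directories pass through as they are.
fs::path resolveIncludeDir(const fs::path& directory,
                           const std::string& includeArg) {
  return (directory / fs::path(includeArg)).lexically_normal();
}
```

For an entry with directory /tmp/build and a flag -I../include, this yields /tmp/include. Run the same command from a different directory and the include path points somewhere else entirely, which is why the field has to be recorded.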

As you can see above, there may be multiple entries for the same file. This is not a mistake - it's entirely plausible that the same file gets compiled multiple times inside a project, each time with different options.
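In code terms, looking up a file in the database returns a list of commands rather than a single one. A minimal in-memory model, assuming the JSON has already been parsed (CompileEntry and commandsFor are illustrative names, not Clang types):

```cpp
#include <string>
#include <vector>

// Mirrors the three JSON fields of one database entry.
// Illustrative only; this is not Clang's own data structure.
struct CompileEntry {
  std::string directory;
  std::string command;
  std::string file;
};

// Returns every entry recorded for `file`, in database order.
// The result is empty when the file does not appear in the database.
std::vector<CompileEntry> commandsFor(const std::vector<CompileEntry>& db,
                                      const std::string& file) {
  std::vector<CompileEntry> matches;
  for (const CompileEntry& entry : db) {
    if (entry.file == file) matches.push_back(entry);
  }
  return matches;
}
```

With the two-entry example above, asking for /tmp/div0.c gives back both commands, one with -DFOO and one without.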

If you paste this into a file named compile_commands.json and place it in the same directory as the file you want to analyze (or in any of that directory's parents), the tool will work without requiring the -- part, because it can find the file in the compilation database and infer the compilation command on its own. If the tool finds more than one entry for a file, it just runs multiple times, once per entry. As far as the tool is concerned, two compilations of the same file can be entirely different due to differences in flags.
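The "same directory or any parent" search is easy to picture as a list of candidate locations, nearest first. The sketch below only computes that list; candidateDatabasePaths is a hypothetical name, and it assumes it's given an absolute path.

```cpp
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

// Lists the places where compile_commands.json would be looked for,
// starting in the source file's own directory and moving up one parent
// at a time until the root is reached. Purely lexical: nothing is
// checked against the actual filesystem here.
std::vector<fs::path> candidateDatabasePaths(const fs::path& sourceFile) {
  std::vector<fs::path> candidates;
  fs::path dir = sourceFile.parent_path();
  while (true) {
    candidates.push_back(dir / "compile_commands.json");
    fs::path parent = dir.parent_path();
    if (parent == dir) break;  // the root is its own parent, so stop
    dir = parent;
  }
  return candidates;
}
```

For /tmp/proj/src/a.c this gives /tmp/proj/src/compile_commands.json, then /tmp/proj/compile_commands.json, and so on up to /compile_commands.json. A real lookup would take the first candidate that exists, for example by probing each one with std::filesystem::exists.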

Compilation database for transformation tools

Source transformation tools use a compilation database similarly to analysis tools. Consider this contrived example:

#include <stdlib.h>

int* foo() {
#ifdef FOO
  return 0;
#else
  return NULL;
#endif
}

Let's save this file as nullptr.cpp and run clang-modernize -use-nullptr on it to transform all "NULL-pointer like" constants to an actual nullptr:

$ $LLVMGH/clang-modernize -use-nullptr -summary nullptr.cpp --
Transform: UseNullptr - Accepted: 1
$ cat nullptr.cpp
#include <stdlib.h>

int* foo() {
#ifdef FOO
  return 0;
#else
  return nullptr;
#endif
}

As expected, clang-modernize only replaced within the #else clause because FOO is not defined. We already know how to define it on the command line. We also know that a hypothetical compilation database could provide two entries for nullptr.cpp - one with and one without -DFOO. In this case, clang-modernize would actually run twice over the same file and replace both the 0 and the NULL.

Creating a compilation database for your project

By now we have a good understanding of how to provide Clang tools with compilation flags for simple files. What about whole projects, however? Assume you have a large existing project and you want to run tools on its source code. You already have a build system of some sort that compiles all the files. How do you tell Clang tools which flags are suitable for any file in the project?

There are a few good options. A reasonably recent version of the CMake build tool supports emitting compilation databases [4]. All you need is to run the cmake step with the -DCMAKE_EXPORT_COMPILE_COMMANDS=ON flag, and CMake will take it from there, emitting a compile_commands.json file into your build directory.

If you're not using CMake, there are other options. The Ninja build tool can also emit a compilation database since version 1.2 (through its compdb tool, ninja -t compdb), so a Gyp/Ninja combination should be good too.

If your project uses neither, you should be able to roll your own without too much difficulty. Tools like Bear (Build EAR) may be helpful here.
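Rolling your own mostly comes down to writing out the JSON format shown earlier, once per compiled file. Here's a rough sketch of such an emitter; CompileEntry, jsonEscape and toCompilationDatabaseJson are all hypothetical names, and the escaping covers only what plain ASCII commands need (quotes, backslashes and control characters).

```cpp
#include <cstddef>
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

// One database entry; illustrative struct, not a Clang type.
struct CompileEntry {
  std::string directory;
  std::string command;
  std::string file;
};

// Escapes a string for use inside a JSON string literal.
std::string jsonEscape(const std::string& s) {
  std::string out;
  for (unsigned char c : s) {
    switch (c) {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n";  break;
      case '\t': out += "\\t";  break;
      default:
        if (c < 0x20) {  // remaining control characters become \u00XX
          char buf[7];
          std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
          out += buf;
        } else {
          out += static_cast<char>(c);
        }
    }
  }
  return out;
}

// Serializes the entries as a JSON array of objects with the
// "directory", "command" and "file" keys.
std::string toCompilationDatabaseJson(const std::vector<CompileEntry>& entries) {
  std::ostringstream os;
  os << "[\n";
  for (std::size_t i = 0; i < entries.size(); ++i) {
    const CompileEntry& e = entries[i];
    os << "{\n"
       << "  \"directory\": \"" << jsonEscape(e.directory) << "\",\n"
       << "  \"command\": \"" << jsonEscape(e.command) << "\",\n"
       << "  \"file\": \"" << jsonEscape(e.file) << "\"\n"
       << "}" << (i + 1 < entries.size() ? "," : "") << "\n";
  }
  os << "]\n";
  return os.str();
}
```

The escaping matters more than it looks: a flag such as -DGREETING="hi" contains quotes that would otherwise break the JSON. In a real project, the entries would come from whatever record of compiler invocations your build system can produce, and the resulting string would be written to compile_commands.json.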

By the way, it should be clear that large projects are precisely the raison d'être of compilation databases. A single "database" file contains complete information about all the source files in the project, providing Clang tools with the compilation commands required to do their tasks.

A custom compilation database

It may be the case that you have a very specialized build system that already keeps some sort of record of the flags used to build each file. This is sometimes the case in large companies with monolithic code bases. For such scenarios, you'll be happy to find out that this aspect of Clang tools is fully customizable, because compilation database readers are based on a plugin system. The CompilationDatabase interface (clang/include/clang/Tooling/CompilationDatabase.h) is something you can implement on your own. The same header file that defines the interface also defines CompilationDatabasePlugin, which can be used to link your own compilation database readers to Clang tools.

The existing JSON compilation database implementation (clang/lib/Tooling/JSONCompilationDatabase.cpp) is implemented as such a plugin, so there's a handy in-tree example for rolling your own.
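To give a feel for the shape of such a reader without pulling in the Clang headers, here's a deliberately simplified, standalone model. None of these types are Clang's: ToyCompileCommand, ToyCompilationDatabase and BuildRecordDatabase are made up for this sketch, the real interface uses LLVM's own types and has more methods, and the clang++ compiler name is an assumption. The idea is the same though: given a file path, return the command lines it was compiled with.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for a compile command; NOT Clang's type.
struct ToyCompileCommand {
  std::string directory;
  std::vector<std::string> commandLine;
};

// Simplified stand-in for the interface a database reader implements.
class ToyCompilationDatabase {
 public:
  virtual ~ToyCompilationDatabase() = default;
  virtual std::vector<ToyCompileCommand> commandsFor(
      const std::string& file) const = 0;
};

// Hypothetical reader backed by per-file flag records, the kind of data
// an in-house build system might already keep.
class BuildRecordDatabase : public ToyCompilationDatabase {
 public:
  BuildRecordDatabase(std::string root,
                      std::map<std::string, std::vector<std::string>> flagsByFile)
      : root_(std::move(root)), flagsByFile_(std::move(flagsByFile)) {}

  std::vector<ToyCompileCommand> commandsFor(
      const std::string& file) const override {
    auto it = flagsByFile_.find(file);
    if (it == flagsByFile_.end()) return {};  // unknown file: no commands
    // Assemble "compiler, recorded flags, file" as the command line.
    std::vector<std::string> line = {"clang++"};  // assumed compiler name
    line.insert(line.end(), it->second.begin(), it->second.end());
    line.push_back(file);
    return {ToyCompileCommand{root_, line}};
  }

 private:
  std::string root_;
  std::map<std::string, std::vector<std::string>> flagsByFile_;
};
```

A real reader would implement Clang's CompilationDatabase interface instead and be registered through the plugin mechanism mentioned above; the in-tree JSON implementation is the place to check the exact signatures.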

Final words

For most users of Clang tools and people interested in writing custom tools, this post contains way too much information. Chances are you won't need all of this. But I felt it's important, for the sake of completeness, to describe in full detail what compilation databases are, and how they tie into the larger picture.

This will help me focus on more internals and examples of Clang tooling in future posts without worrying about compilation databases again.

[1]	`clang-check` is the Clang static analysis tool; it performs control-flow based analysis that can detect cases like this.

[2]	To motivate this - you wouldn't want the analyzer to bug you about "errors" in code that's `#if 0`-ed out, or hidden behind an `#ifdef` for a different compiler/platform, would you?

[3]	Note that this is critical for things like relative paths to `-I` - the tool needs to know where the compiler was actually invoked from to find the directories.

[4]	This is what the upstream LLVM project uses for its own needs.