The Normal Equation and matrix calculus



A few months ago I wrote a post on formulating the Normal Equation for linear regression. A crucial part of the formulation is using matrix calculus to compute a scalar-by-vector derivative. I didn't spend much time explaining how this step works, instead remarking:

Deriving by a vector may feel uncomfortable, but there's nothing to worry about. Recall that here we only use matrix notation to conveniently represent a system of linear formulae. So we derive by each component of the vector, and then combine the resulting derivatives into a vector again.

According to the comments received on the post, folks didn't find this convincing and asked for more details. One commenter even said that "matrix calculus feels handwavy", something which I fully agree with. The reason matrix calculus feels handwavy is that it's not as commonly encountered as "regular" calculus, and hence its identities and intuitions are not as familiar. However, there's really not that much to it, as I want to show here.

Let's get started with a simple example, which I'll use to demonstrate the principles. Say we have the function:

\[ f(v) = a^T v \]

where a and v are vectors with n components [1]. We want to compute its derivative by v. But wait, while a "regular" derivative by a scalar is clearly defined (using limits), what does deriving by a vector mean? It simply means that we derive by each component of the vector separately, and then combine the results into a new vector [2]. In other words:

\[ \frac{\partial f}{\partial v} = \begin{pmatrix} \frac{\partial f}{\partial v_1} \\ \frac{\partial f}{\partial v_2} \\ \vdots \\ \frac{\partial f}{\partial v_n} \end{pmatrix} \]

Let's see how this works out for our function f. It may be more convenient to rewrite it by using components rather than vector notation:

\[ f(v) = a_1 v_1 + a_2 v_2 + \cdots + a_n v_n \]

Computing the derivatives by each component, we'll get:

\[ \frac{\partial f}{\partial v_1} = a_1, \quad \frac{\partial f}{\partial v_2} = a_2, \quad \ldots, \quad \frac{\partial f}{\partial v_n} = a_n \]

So we have a sequence of partial derivatives, which we combine into a vector:

\[ \frac{\partial f}{\partial v} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} = a \]

Or, in other words, $\frac{\partial (a^T v)}{\partial v} = a$.

This example demonstrates the algorithm for computing scalar-by-vector derivatives (a quick numeric check follows the list):

  1. Figure out what the dimensions of all vectors and matrices are.
  2. Expand the vector equations into their full form (a multiplication of two vectors is either a scalar or a matrix, depending on their orientation, etc.). Note that the end result of this expansion is a scalar.
  3. Compute the derivative of the scalar by each component of the variable vector separately.
  4. Combine the derivatives into a vector.
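
Before moving on, here's a quick sanity check of the identity we just derived - a minimal NumPy sketch that approximates the derivative by each component with central differences (step 3 above), collects the results into a vector (step 4), and compares it with a:

import numpy as np

def numeric_gradient(f, v, eps=1e-6):
    # Approximate df/dv component by component with central differences.
    grad = np.zeros_like(v)
    for i in range(len(v)):
        step = np.zeros_like(v)
        step[i] = eps
        grad[i] = (f(v + step) - f(v - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(42)
a = rng.standard_normal(5)
v = rng.standard_normal(5)

f = lambda u: a.dot(u)                           # f(v) = a^T v
print(np.allclose(numeric_gradient(f, v), a))    # prints True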

Similarly to regular calculus, matrix and vector calculus rely on a set of identities to make computations more manageable. We can either go the hard way (computing the derivative of each function from basic principles using limits), or the easy way - applying the plethora of convenient identities that were developed to make this task simpler. The identity for computing the derivative of $a^T v$ shown above plays the role of $\frac{d(ax)}{dx} = a$ in regular calculus.

Now we have the tools to understand how the vector derivatives in the normal equation article were computed. As a reminder, this is the matrix form of the cost function J:

\[ J(\theta) = \theta^T X^T X \theta - 2(X\theta)^T y + y^T y \]

And we're interested in computing $\frac{\partial J}{\partial \theta}$. The equation for J consists of three terms added together. The last one, $y^T y$, doesn't contribute to the derivative because it doesn't depend on the variable. Let's start by looking at the second term (since it's simpler than the first) and give it a name, for convenience:

\[ P = -2(X\theta)^T y \]

We'll start by recalling what all the dimensions are: $\theta$ is a vector of n components, $y$ is a vector of m components, and $X$ is an m-by-n matrix.

Let's see what P expands to [3]:

\[ P = -2 \left( \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1n} \\ X_{21} & X_{22} & \cdots & X_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ X_{m1} & X_{m2} & \cdots & X_{mn} \end{pmatrix} \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{pmatrix} \right)^T \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} \]

Computing the matrix-by-vector multiplication inside the parens:

\[ P = -2 \begin{pmatrix} X_{11}\theta_1 + X_{12}\theta_2 + \cdots + X_{1n}\theta_n \\ X_{21}\theta_1 + X_{22}\theta_2 + \cdots + X_{2n}\theta_n \\ \vdots \\ X_{m1}\theta_1 + X_{m2}\theta_2 + \cdots + X_{mn}\theta_n \end{pmatrix}^T \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} \]

And finally, multiplying the two vectors together:

\[ P = -2 \left[ (X_{11}\theta_1 + \cdots + X_{1n}\theta_n) y_1 + (X_{21}\theta_1 + \cdots + X_{2n}\theta_n) y_2 + \cdots + (X_{m1}\theta_1 + \cdots + X_{mn}\theta_n) y_m \right] \]

Working with such formulae makes you appreciate why mathematicians long ago came up with shorthand notation like the "sigma" summation:

\[ P = -2 \sum_{i=1}^{m} \left( \sum_{j=1}^{n} X_{ij}\theta_j \right) y_i \]

OK, so we've finally completed step 2 of the algorithm - we have the scalar equation for P. Now it's time to compute its derivative by each $\theta_j$:

\[ \frac{\partial P}{\partial \theta_j} = -2 \sum_{i=1}^{m} X_{ij} y_i \qquad j = 1, \ldots, n \]

Now comes the most interesting part. If we treat $\frac{\partial P}{\partial \theta}$ as a vector of n components, we can rewrite this set of equations using a matrix-by-vector multiplication:

\[ \frac{\partial P}{\partial \theta} = -2 \begin{pmatrix} \sum_{i=1}^{m} X_{i1} y_i \\ \sum_{i=1}^{m} X_{i2} y_i \\ \vdots \\ \sum_{i=1}^{m} X_{in} y_i \end{pmatrix} = -2 X^T y \]

Take a moment to convince yourself this is true. It's just collecting the individual components of X into a matrix and the individual components of y into a vector. Since X is an m-by-n matrix and y is an m-by-1 column vector, the dimensions work out and the result is an n-by-1 column vector.

So we've just computed the second term of the vector derivative of J. In the process, we've discovered a useful vector derivative identity for a matrix X and vectors $\theta$ and y:

\[ \frac{\partial \left( (X\theta)^T y \right)}{\partial \theta} = X^T y \]

OK, now let's get back to the full definition of J and see how to compute the derivative of its first term. We'll give it the name Q:

\[ Q = \theta^T X^T X \theta \]

This derivation is somewhat more complex, since $\theta$ appears twice in the equation. Here's the equation again with all the matrices and vectors fully laid out (note that I've already done the transposes):

\[ Q = \begin{pmatrix} \theta_1 & \theta_2 & \cdots & \theta_n \end{pmatrix} \begin{pmatrix} X_{11} & X_{21} & \cdots & X_{m1} \\ X_{12} & X_{22} & \cdots & X_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ X_{1n} & X_{2n} & \cdots & X_{mn} \end{pmatrix} \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1n} \\ X_{21} & X_{22} & \cdots & X_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ X_{m1} & X_{m2} & \cdots & X_{mn} \end{pmatrix} \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{pmatrix} \]

I'll just multiply the two matrices in the middle together. The result is an "X squared" matrix ($X^T X$), which is n-by-n. The element in row r and column c of this square matrix is:

\[ (X^T X)_{rc} = \sum_{i=1}^{m} X_{ir} X_{ic} \]

Note that "X squared" is a symmetric matrix (this fact will be important later on). For simplicity of notation, we'll call its elements <math>. Multiplying by the <math> vector on the right we get:

\[ Q = \begin{pmatrix} \theta_1 & \theta_2 & \cdots & \theta_n \end{pmatrix} \begin{pmatrix} S_{11}\theta_1 + S_{12}\theta_2 + \cdots + S_{1n}\theta_n \\ S_{21}\theta_1 + S_{22}\theta_2 + \cdots + S_{2n}\theta_n \\ \vdots \\ S_{n1}\theta_1 + S_{n2}\theta_2 + \cdots + S_{nn}\theta_n \end{pmatrix} \]

And left-multiplying by $\theta^T$ to get the fully unwrapped formula for Q:

\[ Q = \sum_{i=1}^{n} \theta_i \sum_{j=1}^{n} S_{ij}\theta_j = \sum_{i=1}^{n} \sum_{j=1}^{n} S_{ij} \theta_i \theta_j \]

Once again, it's now time to compute the derivatives. Let's focus on $\frac{\partial Q}{\partial \theta_1}$, from which we can infer the rest:

\[ \frac{\partial Q}{\partial \theta_1} = 2 S_{11}\theta_1 + S_{12}\theta_2 + \cdots + S_{1n}\theta_n + S_{21}\theta_2 + S_{31}\theta_3 + \cdots + S_{n1}\theta_n \]

Using the fact that X squared is symmetric, we know that $S_{12} = S_{21}$, $S_{13} = S_{31}$, and so on. Therefore:

\[ \frac{\partial Q}{\partial \theta_1} = 2 S_{11}\theta_1 + 2 S_{12}\theta_2 + \cdots + 2 S_{1n}\theta_n = 2 \sum_{j=1}^{n} S_{1j}\theta_j \]

The partial derivatives by the other $\theta_j$ components are similar. Collecting the sequence of partial derivatives back into a vector equation, we get:

\[ \frac{\partial Q}{\partial \theta} = 2 \begin{pmatrix} \sum_{j=1}^{n} S_{1j}\theta_j \\ \vdots \\ \sum_{j=1}^{n} S_{nj}\theta_j \end{pmatrix} = 2 S \theta = 2 X^T X \theta \]

Now back to J. Recall that for convenience we broke J up into three parts: P, Q and $y^T y$; the latter doesn't depend on $\theta$, so it doesn't play a role in the derivative. Collecting our results from this post, we then get:

\[ \frac{\partial J}{\partial \theta} = 2 X^T X \theta - 2 X^T y \]

Which is exactly the equation we were expecting to see.
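
As a final sanity check, the same finite-difference trick from earlier confirms this result numerically. A minimal NumPy sketch, starting from J in its unexpanded form $(X\theta - y)^T (X\theta - y)$ (which expands to exactly the three terms above):

import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
theta = rng.standard_normal(n)

def J(t):
    # (X*theta - y)^T (X*theta - y); expanding it gives
    # theta^T X^T X theta - 2 (X theta)^T y + y^T y as in the text.
    r = X.dot(t) - y
    return r.dot(r)

# Central finite differences, component by component.
eps = 1e-6
numeric = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

analytic = 2 * X.T.dot(X).dot(theta) - 2 * X.T.dot(y)
print(np.allclose(numeric, analytic))   # prints True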

To conclude - if matrix calculus feels handwavy, it's because its identities are less familiar. In a sense, it's handwavy in the same way $\frac{d(x^2)}{dx} = 2x$ is handwavy. We remember the identity so we don't have to recalculate it every time from first principles. Once you get some experience with matrix calculus, parts of equations start looking familiar and you no longer need to engage in the long and tiresome computations demonstrated here. It's perfectly fine to just remember that the derivative of $\theta^T X \theta$ with a symmetric X is $2 X \theta$. See the "identities" section of the Wikipedia article on matrix calculus for many more examples.


[1]A few words on notation: by default, a vector v is a column vector. To get its row version, we transpose it. Moreover, in the vector derivative equations that follow I'm using denominator layout notation. This is not super-important though; as the Wikipedia article suggests, many mathematical papers and writings aren't consistent about this and it's perfectly possible to understand the derivations regardless.
[2]Yes, this is exactly like computing a gradient of a multivariate function.
[3]Take a minute to convince yourself that the dimensions of this equation work out and the result is a scalar.

Book review: "How To Prove It" by Daniel Velleman



Full book name: "How To Prove It - A Structured Approach, 2nd ed."

It's not often that a single book makes me write a full review (rather than just include it in a quarterly summary) these days. For this to happen, the book has to be truly remarkable, and I have to have so many impressions of it that they certainly won't fit into a summary.

"How To Prove It" is one such book. It takes upon itself the challenging task of teaching readers how to read, understand and come up with mathematical proofs. An audacious goal for sure, but the author succeeds beautifully.

The book starts with basic chapters on mathematical logic, as a prelude to proof-writing - translating simple word problems to logical equations and showing how to reason about them with tools like truth tables, operations on sets and logical quantifiers. In chapter 3, a structured approach to proof writing is presented. The approach is very mechanical - just the way it should be. You're taught to carefully consider and analyze each and every logical statement in your proof. Every statement comes from applying known rules to previously discovered statements. Careful proofs are beautiful things - they leave nothing unexplained. I recall from my math courses at university that whenever I lost points on proofs, it was because I failed to explain some crucial steps. The approach demonstrated in this book is a very nice way to keep track of everything and make sure nothing goes amiss. In fact, as a programmer it's immediately obvious how such proof techniques can be automated (and indeed the author has developed a proof-aid system called Proof Designer, which he talks about in an appendix).

After the reader has basic proof-writing tools in hand, subsequent chapters present more constructs from logic and set theory, with new proof techniques and examples relevant to them. There's a chapter on relations (including closures), another on functions. Finally, the book ends with a chapter on induction (including strong induction) and a chapter on infinite sets.

Throughout the book, there are numerous examples, carefully crafted to take the reader through the material in tiny steps and encourage experimentation. There's a huge number of exercises (only some of which have solutions, unfortunately). The exercises smartly progress from very simple to challenging, giving the reader an opportunity to practice proof techniques and really understand how to use the mathematical tools needed for the task.

I thoroughly enjoyed this book - it took me months to work through it, and it was an immensely rewarding experience. I worked through most exercises in the first chapters and at least some in the later chapters. I wish I had more patience to do even more exercises on my own - but it's quite a time-consuming endeavor. Overall, I think that if one is looking for a great book to explain what mathematics is all about, there's no need to look further.

I want to emphasize this last point. There are many books out there contending to be "the introduction to mathematics" for sophisticated readers. Almost always, these books turn out to be roller-coaster rides through dozens of mathematical topics, briefly touching on each. IMHO that's absolutely the worst possible approach to presenting mathematics, and this is where Velleman gets it right. In terms of total presented material, this book wouldn't even cover a single undergrad course in logic and set theory. But you see, there's no need to cover as many topics as possible. On the contrary, that would be detrimental. "How To Prove It" brilliantly chooses a small number of key topics and ideas and really focuses on teaching readers how to understand them and reason about proofs using them. Since these ideas are fundamental in mathematics, they permeate every topic studied and are universally useful. Not much different from teaching a man to fish as opposed to dumping a truckload of exotic fish on him, really.

To conclude, if you are looking for a book to understand what mathematics is about and get a great introduction to mathematical thinking - I highly recommend "How To Prove It".


On parsing C, type declarations and fake headers



pycparser has become fairly popular in the past couple of years (especially following its usage in cffi). This means I get more questions by email, and I'm growing tired of answering the same ones over and over :-)

So this blog post is a one-stop shop for the (by far) most frequently asked question about pycparser - how to handle headers that your code #includes.

I've certainly written about this before, and it's mentioned in the README, but I feel that additional details are needed to provide a more complete answer to the different variations of this question.

First, a disclaimer. This post assumes some level of familiarity with the C programming language and how it's compiled. You must know about the C preprocessor (the thing that handles directives like #include and #define), and have a general understanding of how multiple source files (most often a .c file and any number of .h files) get combined into a single translation unit for compilation. If you don't have a strong grasp of these concepts, I would hold off using pycparser until you learn more about them.

So what's the problem?

The problem arises when the code you want to analyze with pycparser #includes a header file:

#include <someheader.h>

int foo() {
    // my code
}

Since this is true of virtually all real-life code, it's a problem almost everyone faces.

How to handle headers with pycparser

In general, pycparser does not concern itself with headers, or C preprocessor directives in general. The CParser object expects preprocessed code in its parse method, period. So you have two choices:

  1. Provide preprocessed code to pycparser. This means you first preprocess the code by invoking, say, gcc -E (or clang -E, or cpp, or whatever way you have to preprocess code [1]).
  2. Use pycparser's parse_file convenience function; it will invoke the preprocessor for you, as shown in the sketch below.
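
Here's a minimal sketch of option 2 (myprog.c is a placeholder for your own file, and gcc is assumed to be available for preprocessing):

from pycparser import parse_file

# use_cpp=True makes parse_file run the preprocessor on the file first.
ast = parse_file('myprog.c', use_cpp=True,
                 cpp_path='gcc', cpp_args=['-E'])
ast.show()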

Great, so now you can handle headers. However, this is unlikely to solve all your problems, because pycparser will have trouble parsing some library headers; first and foremost, it will probably have trouble parsing the standard library headers.

Why? Because while pycparser fully supports C99, many library headers are full of compiler extensions and other clever tricks for compatibility across multiple platforms. While it's entirely possible to parse them with pycparser [2], this requires work. Work that you may not have the skills or the time to do. Work that, fortunately, is almost certainly unnecessary.

Why isn't it necessary? Because, in all likelihood, you don't really need pycparser to parse those headers at all.

What pycparser actually needs to parse headers for

To understand this bold claim, you must first understand why pycparser needs to parse headers. Let's start with a more basic question - why does the C compiler need to parse the headers your file includes?

For a number of reasons; some of them syntactic, but most of them semantic. Syntactic issues are those that may prevent the compiler from parsing the code. #defines are one, types are another.

For example, the C code:

{
    T * x;
}

Cannot be properly parsed unless we know whether:

  1. Either T or x is a macro #defined to something.
  2. T is a type that was previously created with a typedef.

For a thorough explanation of this issue, look at this article and other related postings on my website.
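
In fact, pycparser itself can demonstrate the ambiguity. In the following sketch (the snippets are made up, and x stays otherwise undeclared - pycparser doesn't mind), the very same statement parses as a declaration in one snippet and as a multiplication in the other:

from pycparser import c_parser

# With the typedef in scope, "T * x;" declares x as a pointer to T.
with_typedef = """
typedef int T;
void f(void) {
    T * x;
}
"""

# Without it, the same statement is an expression: T multiplied by x.
without_typedef = """
void f(void) {
    T * x;
}
"""

# Use a fresh parser per snippet so typedef state doesn't leak across parses.
c_parser.CParser().parse(with_typedef).show()
c_parser.CParser().parse(without_typedef).show()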

Semantic reasons are those that won't prevent the compiler from parsing the code, but will prevent it from properly understanding and verifying it. For example, declarations of functions being used. Full declarations of structs, and so on. These take up the vast majority of real-world header files. But as it turns out, since pycparser only cares about parsing the code into an AST, and doesn't do any semantic analysis or further processing, it doesn't care about these issues. In other words, given the code:

{
    foo(a.b);
}

pycparser can construct a proper AST (given that none of foo, a or b are type names). It doesn't care what the actual declaration of foo is, whether a is indeed a variable of struct type, or whether it has a field named b [3].
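
This is easy to see in action - a small sketch with made-up names, none of which is declared anywhere:

from pycparser import c_parser

code = """
void g(void) {
    foo(a.b);
}
"""
# Parses fine: foo, a and b are just names in the AST; nothing is
# checked semantically.
c_parser.CParser().parse(code).show()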

So pycparser requires very little from header files. This is how the idea of "fake headers" was born.

Fake headers

Let's get back to this simple code sample:

#include <someheader.h>

int foo() {
    // my code
}

So we've established two key ideas:

  1. pycparser needs to know what someheader.h contains so it can properly parse the code.
  2. pycparser needs only a very small subset of someheader.h to perform its task.

The idea of fake headers is simple. Instead of actually parsing someheader.h and all the other headers it transitively includes (probably lots of system and standard library headers too), why not create a "fake" someheader.h that contains only the parts of the original that are necessary for parsing - the #defines and the typedefs?

The cool part about typedefs is that pycparser doesn't actually care what a type is defined to be. T may be a pointer to a function accepting an array of struct types, but all pycparser needs to see is:

typedef int T;

So it knows that T is a type. It doesn't care what kind of type it is.

So what do you have to do to parse your program?

OK, so now you hopefully have a better understanding of what headers mean for pycparser, and how to work around having to parse tons of system headers. What does this actually mean for your program, though? Will you now have to scour through all your headers, "faking them out"? Unlikely. If your code is standards-compliant C, then most likely pycparser will have no issue parsing all your headers. But you probably don't want it to parse the system headers. In addition to being nonstandard, these headers are usually large, which means longer parsing time and larger ASTs.

So my suggestion would be: let pycparser parse your headers, but fake out the system headers, and possibly any other large library headers used by your code. As far as the standard headers, pycparser already provides you with nice fakes in its utils folder. All you need to do is provide this flag to the preprocessor [4]:

-I<PATH-TO-PYCPARSER>/utils/fake_libc_include

And it will be able to find header files like stdio.h and sys/types.h with the proper types defined.

I'll repeat: the flag shown above is almost certainly sufficient to parse a C99 program that only relies on the C runtime (i.e. has no other library dependencies).
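
In terms of the parse_file function shown earlier, this would look roughly like the following (again, myprog.c is a placeholder):

from pycparser import parse_file

# Replace <PATH-TO-PYCPARSER> with the location of your pycparser checkout.
ast = parse_file('myprog.c', use_cpp=True, cpp_path='gcc',
                 cpp_args=['-E', r'-I<PATH-TO-PYCPARSER>/utils/fake_libc_include'])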

Real-world example

OK, enough theory. Now I want to work through an example to help ground these suggestions in reality. I'll take a well-known open-source C project and use pycparser to parse one of its files, fully showing all the steps taken until a successful parse. I'll pick Redis.

Let's start at the beginning, by cloning the Redis git repo:

/tmp$ git clone git@github.com:antirez/redis.git

I'll be using the latest released pycparser (version 2.13 at the time of writing). I'll also clone its repository into /tmp so I can easily access the fake headers:

/tmp$ git clone git@github.com:eliben/pycparser.git

A word on methodology - when initially exploring how to parse a new project, I always preprocess separately. Once I figure out the flags/settings/extra faking required to successfully parse the code, it's all very easy to put in a script.

Let's take the main Redis file (redis/src/redis.c) and attempt to preprocess it. The first preprocessor invocation simply adds the include paths for Redis's own headers (they live in redis/src) and pycparser's fake libc headers:

/tmp$ gcc -E -Iredis/src -Ipycparser/utils/fake_libc_include redis/src/redis.c > redis_pp.c
In file included from redis/src/redis.c:30:0:
redis/src/redis.h:48:17: fatal error: lua.h: No such file or directory
 #include <lua.h>
             ^
compilation terminated.

Oops, no good. Redis is looking for Lua headers. Let's see if it carries this dependency along:

/tmp$ find redis -name lua
redis/deps/lua

Indeed! We should be able to add the Lua headers to the preprocessor path too:

/tmp$ gcc -E -Iredis/src -Ipycparser/utils/fake_libc_include \
             -Iredis/deps/lua/src redis/src/redis.c > redis_pp.c

Great, no more errors. Now let's try to parse it with pycparser. I'll load pycparser in an interactive terminal, but any other technique (such as running one of the example scripts) will work:

: import pycparser
: pycparser.parse_file('/tmp/redis_pp.c')
... backtrace
---> 55         raise ParseError("%s: %s" % (coord, msg))

ParseError: /usr/include/x86_64-linux-gnu/sys/types.h:194:20: before: __attribute__

This error is strange. Note where it occurs: in a system header included in the preprocessed file. But we should have no system headers there - we specified the fake headers path. What gives?

The reason this is happening is that gcc knows about some pre-set system header directories and will add them to its search path. We can block this, making sure it only looks in the directories we explicitly specify with -I, by providing it with the -nostdinc flag. Let's re-run the preprocessor:

/tmp$ gcc -nostdinc -E -Iredis/src -Ipycparser/utils/fake_libc_include \
                       -Iredis/deps/lua/src redis/src/redis.c > redis_pp.c

Now I'll try to parse the preprocessed code again:

: pycparser.parse_file('/tmp/redis_pp.c')
... backtrace
---> 55         raise ParseError("%s: %s" % (coord, msg))

ParseError: redis/src/sds.h:74:5: before: __attribute__

OK, progress! If we look in the code where this error occurs, we'll note a GNU-specific __attribute__ pycparser doesn't support. No problem, let's just #define it away:

$ gcc -nostdinc -E -D'__attribute__(x)=' -Iredis/src \
                   -Ipycparser/utils/fake_libc_include \
                   -Iredis/deps/lua/src redis/src/redis.c > redis_pp.c

If I try to parse again, it works:

: pycparser.parse_file('/tmp/redis_pp.c')
<pycparser.c_ast.FileAST at 0x7f15fc321cf8>

I can also run one of the example scripts now to show that we can do something more interesting with the AST:

/tmp$ python pycparser/examples/func_defs.py redis_pp.c
sdslen at redis/src/sds.h:47
sdsavail at redis/src/sds.h:52
rioWrite at redis/src/rio.h:93
rioRead at redis/src/rio.h:106
rioTell at redis/src/rio.h:119
rioFlush at redis/src/rio.h:123
redisLogRaw at redis/src/redis.c:299
redisLog at redis/src/redis.c:343
redisLogFromHandler at redis/src/redis.c:362
ustime at redis/src/redis.c:385
mstime at redis/src/redis.c:396
exitFromChild at redis/src/redis.c:404
dictVanillaFree at redis/src/redis.c:418
... many more lines
main at redis/src/redis.c:3733

This lets us see, using pycparser, all the functions defined in redis.c and in the headers it includes.

This was fairly straightforward - all I had to do was set the right preprocessor flags, really. In some cases, it may be a bit more difficult. The most obvious problem you may encounter is a new header you'll need to fake away. Luckily, that's very easy - just take a look at the existing ones (say, stdio.h). These headers can be copied to other names/directories, to make sure the preprocessor finds them properly. If you think there's a standard header I forgot to include in the fake headers, please open an issue and I'll add it.

Note that we didn't have to fake out the headers of Redis (or Lua for that matter). pycparser handled them just fine. The same has a high chance of being true for your C project as well.
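
Finally, as promised, once the right flags are known the whole exploration folds into a few lines of Python - a sketch assuming the /tmp checkouts used throughout this post:

from pycparser import parse_file

ast = parse_file('/tmp/redis/src/redis.c', use_cpp=True, cpp_path='gcc',
                 cpp_args=['-E',
                           '-nostdinc',
                           '-D__attribute__(x)=',
                           '-I/tmp/redis/src',
                           '-I/tmp/redis/deps/lua/src',
                           '-I/tmp/pycparser/utils/fake_libc_include'])
print(len(ast.ext))   # number of top-level declarations/definitions parsed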


[1]On Linux, at least gcc should be there on the command line. On OS X, you'll need to install "command-line developer tools" to get a command-line clang. If you're in Microsoft-land, I recommend downloading pre-built clang binaries for Windows.
[2]And this has been done by many folks. pycparser was made to parse the standard C library, windows.h, parts of the Linux kernel headers, and so on.
[3]Note that this describes the most common use of pycparser, which is to perform simple analyses on the source, or rewrite parts of existing source in some way. More complex uses may actually require full parsing of type definitions, structs and function declarations. In fact, you can certainly create a real C compiler using pycparser as the frontend. These uses will require full parsing of headers, so fake headers won't do. As I mentioned above, it's possible to make pycparser parse the actual headers of libraries and so on; it just takes more work.
[4]Depending on the exact preprocessor you're using, you may need to provide it with another flag telling it to ignore the system headers whose paths are hard-coded in it. Read on to the example for more details.