The Softmax function and its derivative

The softmax function takes an N-dimensional vector of arbitrary real values and produces another N-dimensional vector with real values in the range (0, 1) that add up to 1.0. It maps S(\mathbf{a}):\mathbb{R}^{N}\rightarrow \mathbb{R}^{N}:

\[S(\mathbf{a}):\begin{bmatrix} a_1\\ a_2\\ \vdots\\ a_N \end{bmatrix} \rightarrow \begin{bmatrix} S_1\\ S_2\\ \vdots\\ S_N \end{bmatrix}\]

And the actual per-element formula is:

\[S_j=\frac{e^{a_j}}{\sum_{k=1}^{N}e^{a_k}} \qquad \forall j \in 1..N\]

It's easy to see that S_j is always positive (because of the exponents); moreover, since the numerator appears in the denominator summed up with some other positive numbers, S_j<1. Therefore, it's in the range (0, 1).

For example, the 3-element vector [1.0, 2.0, 3.0] gets transformed into [0.09, 0.24, 0.67]. The order of elements by relative size is preserved, and they add up to 1.0. Let's tweak this vector slightly into: [1.0, 2.0, 5.0]. We get the output [0.02, 0.05, 0.93], which still preserves these properties. Note that as the last element is farther away from the first two, its softmax value dominates the overall slice of size 1.0 in the output. Intuitively, the softmax function is a "soft" version of the maximum function. Instead of just selecting one maximal element, softmax breaks the vector up into parts of a whole (1.0) with the maximal input element getting a proportionally larger chunk, but the other elements getting some of it as well [1].
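
These numbers are easy to reproduce. Here's a minimal sketch in NumPy (the function name `softmax` is my own choice; the body is just the per-element formula above):

```python
import numpy as np

def softmax(a):
    """Per-element softmax: exp(a_j) divided by the sum of all exp(a_k)."""
    exps = np.exp(a)
    return exps / np.sum(exps)

print(np.round(softmax(np.array([1.0, 2.0, 3.0])), 2))  # [0.09 0.24 0.67]
print(np.round(softmax(np.array([1.0, 2.0, 5.0])), 3))
```

Both outputs preserve the ordering of the inputs and sum to 1.0 (up to floating-point rounding).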

Probabilistic interpretation

The properties of softmax (all output values in the range (0, 1) and sum up to 1.0) make it suitable for a probabilistic interpretation that's very useful in machine learning. In particular, in multiclass classification tasks, we often want to assign probabilities that our input belongs to one of a set of output classes.

If we have N output classes, we're looking for an N-vector of probabilities that sum up to 1; sounds familiar?

We can interpret softmax as follows:

\[P(y=j \mid \mathbf{a})=S_j=\frac{e^{a_j}}{\sum_{k=1}^{N}e^{a_k}}\]

Where y is the output class numbered 1..N. a is any N-vector. The most basic example is multiclass logistic regression, where an input vector x is multiplied by a weight matrix W, and the result of this dot product is fed into a softmax function to produce probabilities. This architecture is explored in detail later in the post.

It turns out that - from a probabilistic point of view - softmax is optimal for maximum-likelihood estimation of the model's parameters. This is beyond the scope of this post, though. See chapter 5 of the "Deep Learning" book for more details.
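
To make the multiclass logistic regression setup concrete, here's a small sketch; the sizes N and T, the random x and W, and all names are illustrative assumptions, not from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 3                  # N input features, T output classes (illustrative)
x = rng.normal(size=N)       # an input vector
W = rng.normal(size=(T, N))  # weight matrix mapping features to T logits

logits = W @ x                                 # a = Wx
probs = np.exp(logits) / np.exp(logits).sum()  # S(a): class probabilities
print(probs, probs.sum())
```

Each entry of `probs` lies in (0, 1), and the entries sum to 1.0, so they can be read as class probabilities.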

Some preliminaries from vector calculus

Before diving into computing the derivative of softmax, let's start with some preliminaries from vector calculus.

Softmax is fundamentally a vector function. It takes a vector as input and produces a vector as output; in other words, it has multiple inputs and multiple outputs. Therefore, we cannot just ask for "the derivative of softmax"; we should instead specify:

  1. Which component (output element) of softmax we're seeking to find the derivative of.
  2. Since softmax has multiple inputs, with respect to which input element the partial derivative is computed.

If this sounds complicated, don't worry. This is exactly why the notation of vector calculus was developed. What we're looking for are the partial derivatives:

\[\frac{\partial S_i}{\partial a_j}\]

This is the partial derivative of the i-th output w.r.t. the j-th input. A shorter way to write it that we'll be using going forward is: D_{j}S_i.

Since softmax is a \mathbb{R}^{N}\rightarrow \mathbb{R}^{N} function, the most general derivative we compute for it is the Jacobian matrix:

\[DS=\begin{bmatrix} D_1 S_1 & \cdots & D_N S_1 \\ \vdots & \ddots & \vdots \\ D_1 S_N & \cdots & D_N S_N \end{bmatrix}\]

In ML literature, the term "gradient" is commonly used to stand in for the derivative. Strictly speaking, gradients are only defined for scalar functions (such as loss functions in ML); for vector functions like softmax it's imprecise to talk about a "gradient". The Jacobian is the fully general derivative of a vector function, but in most places I'll just say "derivative".

Derivative of softmax

Let's compute D_j S_i for arbitrary i and j:

\[D_j S_i=\frac{\partial S_i}{\partial a_j}= \frac{\partial \frac{e^{a_i}}{\sum_{k=1}^{N}e^{a_k}}}{\partial a_j}\]

We'll be using the quotient rule of derivatives. For f(x) = \frac{g(x)}{h(x)}:

\[f'(x) = \frac{g'(x)h(x) - h'(x)g(x)}{[h(x)]^2}\]

In our case, we have:

\[\begin{align*} g_i&=e^{a_i} \\ h_i&=\sum_{k=1}^{N}e^{a_k} \end{align*}\]

Note that no matter which a_j we compute the derivative of h_i for, the answer will always be e^{a_j}. This is not the case for g_i, however. The derivative of g_i w.r.t. a_j is e^{a_j} only if i=j, because only then g_i has a_j anywhere in it. Otherwise, the derivative is 0.

Going back to our D_j S_i; we'll start with the i=j case. Then, using the quotient rule we have:

\[\frac{\partial \frac{e^{a_i}}{\sum_{k=1}^{N}e^{a_k}}}{\partial a_j}= \frac{e^{a_i}\Sigma-e^{a_j}e^{a_i}}{\Sigma^2}\]

For simplicity \Sigma stands for \sum_{k=1}^{N}e^{a_k}. Reordering a bit:

\[\begin{align*} \frac{\partial \frac{e^{a_i}}{\sum_{k=1}^{N}e^{a_k}}}{\partial a_j}&= \frac{e^{a_i}\Sigma-e^{a_j}e^{a_i}}{\Sigma^2}\\ &=\frac{e^{a_i}}{\Sigma}\frac{\Sigma - e^{a_j}}{\Sigma}\\ &=S_i(1-S_j) \end{align*}\]

The final formula expresses the derivative in terms of S_i itself - a common trick when functions with exponents are involved.

Similarly, we can do the i\ne j case:

\[\begin{align*} \frac{\partial \frac{e^{a_i}}{\sum_{k=1}^{N}e^{a_k}}}{\partial a_j}&= \frac{0-e^{a_j}e^{a_i}}{\Sigma^2}\\ &=-\frac{e^{a_j}}{\Sigma}\frac{e^{a_i}}{\Sigma}\\ &=-S_j S_i \end{align*}\]

To summarize:

\[D_j S_i=\left\{\begin{matrix} S_i(1-S_j) & i=j\\ -S_j S_i & i\ne j \end{matrix}\right.\]

I like seeing this explicit breakdown by cases, but if anyone is taking more pride in being concise and clever than programmers, it's mathematicians. This is why you'll find various "condensed" formulations of the same equation in the literature. One of the most common ones is using the Kronecker delta function:

\[\delta_{ij}=\left\{\begin{matrix} 1 & i=j\\ 0 & i\ne j \end{matrix}\right.\]

To write:

\[D_j S_i = S_i (\delta_{ij}-S_j)\]

Which is, of course, the same thing. There are a couple of other formulations one sees in the literature:

  1. Using the matrix formulation of the Jacobian directly to replace \delta with I - the identity matrix, whose elements are expressing \delta in matrix form.
  2. Using "1" as the function name instead of the Kronecker delta, as follows: D_j S_i = S_i (1(i=j)-S_j). Here 1(i=j) means the value 1 when i=j and the value 0 otherwise.

The condensed notation comes in handy when we want to compute more complex derivatives that depend on the softmax derivative; otherwise we'd have to propagate the condition everywhere.
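
The condensed formula also makes the full Jacobian easy to compute in one shot: the \delta_{ij} term contributes a diagonal of S, and the -S_i S_j term contributes an outer product. Here's a sketch that checks this against numerical differentiation (the helper names are mine):

```python
import numpy as np

def softmax(a):
    exps = np.exp(a)
    return exps / np.sum(exps)

def softmax_jacobian(a):
    """D_j S_i = S_i (delta_ij - S_j), i.e. diag(S) - S S^T."""
    S = softmax(a)
    return np.diag(S) - np.outer(S, S)

a = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(a)

# Sanity check against central-difference numerical differentiation.
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    da = np.zeros(3)
    da[j] = eps
    J_num[:, j] = (softmax(a + da) - softmax(a - da)) / (2 * eps)
print(np.allclose(J, J_num, atol=1e-6))  # True
```

Note that each column of the Jacobian sums to zero, reflecting the fact that the outputs always sum to a constant 1.0.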

Computing softmax and numerical stability

A simple way of computing the softmax function on a given vector in Python is:

import numpy as np

def softmax(x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)
    return exps / np.sum(exps)

Let's try it with the sample 3-element vector we've used as an example earlier:

In [146]: softmax([1, 2, 3])
Out[146]: array([ 0.09003057,  0.24472847,  0.66524096])

However, if we run this function with larger numbers (or large negative numbers) we have a problem:

In [148]: softmax([1000, 2000, 3000])
Out[148]: array([ nan,  nan,  nan])

The numerical range of the floating-point numbers used by NumPy is limited. For float64, the maximal representable number is on the order of 10^{308}. Exponentiation in the softmax function makes it possible to easily overshoot this number, even for fairly modest-sized inputs.

A nice way to avoid this problem is by normalizing the inputs to be not too large or too small, by observing that we can use an arbitrary constant C as follows:

\[S_j=\frac{Ce^{a_j}}{\sum_{k=1}^{N}Ce^{a_k}}\]

And then pushing the constant into the exponent, we get:

\[S_j=\frac{e^{a_j+log(C)}}{\sum_{k=1}^{N}e^{a_k+log(C)}}\]

Since C is just an arbitrary constant, we can instead write:

\[S_j=\frac{e^{a_j+D}}{\sum_{k=1}^{N}e^{a_k+D}}\]

Where D is also an arbitrary constant. This formula is equivalent to the original S_j for any D, so we're free to choose a D that will make our computation better numerically. A good choice is the maximum between all inputs, negated:

\[D=-\max(a_1, a_2, \cdots, a_N)\]

This will shift the inputs to a range close to zero, assuming the inputs themselves are not too far from each other. Crucially, it shifts them all to be negative (except the maximal a_j, which turns into a zero). Exponentiating large negative numbers "saturates" to zero rather than overflowing to infinity, so we have a better chance of avoiding NaNs.

import numpy as np

def stablesoftmax(x):
    """Compute the softmax of vector x in a numerically stable way."""
    shiftx = x - np.max(x)
    exps = np.exp(shiftx)
    return exps / np.sum(exps)

And now:

In [150]: stablesoftmax([1000, 2000, 3000])
Out[150]: array([ 0.,  0.,  1.])

Note that this is still imperfect: mathematically, softmax never really produces a zero. But this is much better than NaNs, and since the distance between the inputs is very large, a result extremely close to zero is expected anyway.
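
As a quick sanity check, the shifted version agrees with the naive one wherever the naive one doesn't overflow (both definitions are repeated here so the sketch is self-contained):

```python
import numpy as np

def softmax(x):
    exps = np.exp(x)
    return exps / np.sum(exps)

def stablesoftmax(x):
    shiftx = x - np.max(x)
    exps = np.exp(shiftx)
    return exps / np.sum(exps)

x = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax(x), stablesoftmax(x)))          # True
print(stablesoftmax(np.array([1000.0, 2000.0, 3000.0])))  # no NaNs
```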

The softmax layer and its derivative

A common use of softmax appears in machine learning, in particular in logistic regression: the softmax "layer", wherein we apply softmax to the output of a fully-connected layer (matrix multiplication):

Generic softmax layer diagram

In this diagram, we have an input x with N features, and T possible output classes. The weight matrix W is used to transform x into a vector with T elements (called "logits" in ML folklore), and the softmax function is used to "collapse" the logits into a vector of probabilities denoting the probability of x belonging to each one of the T output classes.

How do we compute the derivative of this "softmax layer" (fully-connected matrix multiplication followed by softmax)? Using the chain rule, of course! You'll find any number of derivations of this derivative online, but I want to approach it from first principles, by carefully applying the multivariate chain rule to the Jacobians of the functions involved.

An important point before we get started: you may think that x is a natural variable to compute the derivative for. But it's not. In fact, in machine learning we usually want to find the best weight matrix W, and thus it is W we want to update with every step of gradient descent. Therefore, we'll be computing the derivative of this layer w.r.t. W.

Let's start by rewriting this diagram as a composition of vector functions. First, we have the matrix multiplication, which we denote g(W). It maps \mathbb{R}^{NT}\rightarrow \mathbb{R}^{T}, because the input (matrix W) has N times T elements, and the output has T elements.

Next we have the softmax. If we denote the vector of logits as \lambda, we have S(\lambda):\mathbb{R}^{T}\rightarrow \mathbb{R}^{T}. Overall, we have the function composition:

\[\begin{align*} P(W)&=S(g(W)) \\ &=(S\circ g)(W) \end{align*}\]

By applying the multivariate chain rule, the Jacobian of P(W) is:

\[DP(W)=D(S\circ g)(W)=DS(g(W))\cdot Dg(W)\]

We've computed the Jacobian of S(a) earlier in this post; what's remaining is the Jacobian of g(W). Since g is a very simple function, computing its Jacobian is easy; the only complication is dealing with the indices correctly. We have to keep track of which weight each derivative is for. Since g(W):\mathbb{R}^{NT}\rightarrow \mathbb{R}^{T}, its Jacobian has T rows and NT columns:

\[Dg=\begin{bmatrix} D_1 g_1 & \cdots & D_{NT} g_1 \\ \vdots & \ddots & \vdots \\ D_1 g_T & \cdots & D_{NT} g_T \end{bmatrix}\]

In a sense, the weight matrix W is "linearized" to a vector of length NT. If you're familiar with the memory layout of multi-dimensional arrays, it should be easy to understand how it's done. In our case, one simple thing we can do is linearize it in row-major order, where the first row is consecutive, followed by the second row, etc. Mathematically, W_{ij} will get column number (i-1)N+j in the Jacobian. To populate Dg, let's recall what g_1 is:

\[g_1=W_{11}x_1+W_{12}x_2+\cdots +W_{1N}x_N\]


\[\begin{align*} D_1g_1&=x_1 \\ D_2g_1&=x_2 \\ \cdots \\ D_Ng_1&=x_N \\ D_{N+1}g_1&=0 \\ \cdots \\ D_{NT}g_1&=0 \end{align*}\]

If we follow the same approach to compute g_2...g_T, we'll get the Jacobian matrix:

\[Dg=\begin{bmatrix} x_1 & x_2 & \cdots & x_N & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & x_1 & x_2 & \cdots & x_N \end{bmatrix}\]

Looking at it differently, if we split the index of W to i and j, we get:

\[\begin{align*} D_{ij}g_t&=\frac{\partial(W_{t1}x_1+W_{t2}x_2+\cdots+W_{tN}x_N)}{\partial W_{ij}} &= \left\{\begin{matrix} x_j & i = t\\ 0 & i \ne t \end{matrix}\right. \end{align*}\]

This goes into row t, column (i-1)N+j in the Jacobian matrix.

Finally, to compute the full Jacobian of the softmax layer, we just do a dot product between DS and Dg. Note that P(W):\mathbb{R}^{NT}\rightarrow \mathbb{R}^{T}, so the Jacobian dimensions work out. Since DS is TxT and Dg is TxNT, their dot product DP is TxNT.

In the literature you'll see a much shortened derivation of the derivative of the softmax layer. That's fine, since the two functions involved are simple and well known. If we carefully compute a dot product between a row in DS and a column in Dg:

\[D_{ij}P_t=\sum_{k=1}^{T}D_kS_t\cdot D_{ij}g_k\]

Dg is mostly zeros, so the end result is simpler. The only k for which D_{ij}g_k is nonzero is when i=k; then it's equal to x_j. Therefore:

\[\begin{align*} D_{ij}P_t&=D_iS_tx_j \\ &=S_t(\delta_{ti}-S_i)x_j \end{align*}\]

So it's entirely possible to compute the derivative of the softmax layer without actual Jacobian matrix multiplication; and that's good, because matrix multiplication is expensive! The reason we can avoid most computation is that the Jacobian of the fully-connected layer is sparse.

That said, I still felt it's important to show how this derivative comes to life from first principles based on the composition of Jacobians for the functions involved. The advantage of this approach is that it works exactly the same for more complex compositions of functions, where the "closed form" of the derivative for each element is much harder to compute otherwise.
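
To see the equivalence in action, here's a sketch comparing the explicit Jacobian product DS \cdot Dg with the closed form D_{ij}P_t=S_t(\delta_{ti}-S_i)x_j. The code uses 0-based indices, so row i of W lands in columns iN through iN+N-1; all sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 4, 3                      # illustrative sizes
x = rng.normal(size=N)
W = rng.normal(size=(T, N))

logits = W @ x
S = np.exp(logits) / np.exp(logits).sum()

# Jacobian of softmax w.r.t. the logits: DS is TxT.
DS = np.diag(S) - np.outer(S, S)

# Jacobian of g(W)=Wx w.r.t. W linearized row-major: Dg is TxNT and sparse.
Dg = np.zeros((T, N * T))
for t in range(T):
    Dg[t, t * N:(t + 1) * N] = x

DP = DS @ Dg                     # full chain-rule product, TxNT

# Closed form: D_ij P_t = S_t (delta_ti - S_i) x_j, no matrix product needed.
DP_closed = np.zeros((T, N * T))
for t in range(T):
    for i in range(T):
        for j in range(N):
            DP_closed[t, i * N + j] = S[t] * ((t == i) - S[i]) * x[j]

print(np.allclose(DP, DP_closed))  # True
```

The closed form touches each output element once, while the explicit product multiplies through a mostly-zero Dg; the results are identical.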

Softmax and cross-entropy loss

We've just seen how the softmax function is used as part of a machine learning network, and how to compute its derivative using the multivariate chain rule. While we're at it, it's worth taking a look at a loss function that's commonly used along with softmax for training a network: cross-entropy.

Cross-entropy has an interesting probabilistic and information-theoretic interpretation, but here I'll just focus on the mechanics. For two discrete probability distributions p and q, the cross-entropy function is defined as:

\[xent(p,q)=-\sum_{k}p(k)log(q(k))\]

Where k goes over all the possible values of the random variable the distributions are defined for. Specifically, in our case there are T output classes, so k would go from 1 to T.

The softmax output P is one such probability distribution [2]. The other probability distribution is the "correct" classification output, usually denoted by Y. This is a one-hot encoded vector of size T, where all elements except one are 0.0, and one element is 1.0 - this element marks the correct class for the data being classified. Let's rephrase the cross-entropy loss formula for our domain:

\[xent(Y, P)=-\sum_{k=1}^{T}Y(k)log(P(k))\]

k goes over all the output classes. P(k) is the probability of the class as predicted by the model. Y(k) is the "true" probability of the class as provided by the data. Let's mark the sole index where Y(k)=1.0 by y. Since for all k\ne y we have Y(k)=0, the cross-entropy formula can be simplified to:

\[xent(Y, P)=-log(P(y))\]

Actually, let's make it a function of just P, treating y as a constant. Moreover, since in our case P is a vector, we can express P(y) as the y-th element of P, or P_y:

\[xent(P)=-log(P_y)\]

The Jacobian of xent is a 1xT matrix (a row vector), since the output is a scalar and we have T inputs (the vector P has T elements):

\[Dxent=\begin{bmatrix} D_1xent & D_2xent & \cdots & D_Txent \end{bmatrix}\]

Now recall that P can be expressed as a function of input weights: P(W)=S(g(W)). So we have another function composition:

\[xent(W)=(xent\circ P)(W)=xent(P(W))\]

And we can, once again, use the multivariate chain rule to find the gradient of xent w.r.t. W:

\[Dxent(W)=D(xent\circ P)(W)=Dxent(P(W))\cdot DP(W)\]

Let's check that the dimensions of the Jacobian matrices work out. We already computed DP(W); it's TxNT. Dxent(P(W)) is 1xT, so the resulting Jacobian Dxent(W) is 1xNT, which makes sense because the whole network has one output (the cross-entropy loss - a scalar value) and NT inputs (the weights).

Here again, there's a straightforward way to find a simple formula for Dxent(W), since many elements in the matrix multiplication end up cancelling out. Note that xent(P) depends only on the y-th element of P. Therefore, only D_{y}xent is non-zero in the Jacobian:

\[Dxent=\begin{bmatrix} 0 & 0 & D_{y}xent & \cdots & 0 \end{bmatrix}\]

And D_{y}xent=-\frac{1}{P_y}. Going back to the full Jacobian Dxent(W), we multiply Dxent(P) by each column of D(P(W)) to get each element in the resulting row-vector. Recall that the row vector represents the whole weight matrix W "linearized" in row-major order. We'll index into it with i and j for clarity (D_{ij} points to element number (i-1)N+j in the row vector):

\[D_{ij}xent(W)=\sum_{k=1}^{T}D_{k}xent(P)\cdot D_{ij}P_k(W)\]

Since only the y-th element in D_{k}xent(P) is non-zero, we get the following, also substituting the derivative of the softmax layer from earlier in the post:

\[\begin{align*} D_{ij}xent(W)&=D_{y}xent(P)\cdot D_{ij}P_y(W) \\ &=-\frac{1}{P_y}\cdot S_y(\delta_{yi}-S_i)x_j \end{align*}\]

By our definition, P_y=S_y, so we get:

\[\begin{align*} D_{ij}xent(W)&=-\frac{1}{S_y}\cdot S_y(\delta_{yi}-S_i)x_j \\ &=-(\delta_{yi}-S_i)x_j \\ &=(S_i-\delta_{yi})x_j \end{align*}\]

Once again, even though in this case the end result is nice and clean, it didn't necessarily have to be so. The formula for D_{ij}xent(W) could end up being a fairly involved sum (or sum of sums). The technique of multiplying Jacobian matrices is oblivious to all this, as the computer can do all the sums for us. All we have to do is compute the individual Jacobians, which is usually easier because they are for simpler, non-composed functions. This is the beauty and utility of the multivariate chain rule.
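
The clean end result is easy to verify with a numerical gradient check. In this sketch, the sizes, the choice of y, and all helper names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 4, 3
x = rng.normal(size=N)
W = rng.normal(size=(T, N))
y = 1                                  # index of the "correct" class

def loss(W):
    """Cross-entropy of the softmax layer: -log(P_y) with P = S(Wx)."""
    logits = W @ x
    P = np.exp(logits) / np.exp(logits).sum()
    return -np.log(P[y])

# Closed-form gradient: D_ij xent = (S_i - delta_yi) x_j.
logits = W @ x
S = np.exp(logits) / np.exp(logits).sum()
grad = (S - np.eye(T)[y])[:, None] * x[None, :]   # shape (T, N), same layout as W

# Numerical check with central differences.
eps = 1e-6
grad_num = np.zeros_like(W)
for i in range(T):
    for j in range(N):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        grad_num[i, j] = (loss(W + dW) - loss(W - dW)) / (2 * eps)
print(np.allclose(grad, grad_num, atol=1e-5))  # True
```

This (S - one_hot) * x structure is exactly what gradient-descent code for softmax classifiers computes at each step.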

[1]To play more with sample inputs and Softmax outputs, Michael Nielsen's online book has a nice interactive Javascript visualization - check it out.
[2]Take a moment to recall that, by definition, the output of the softmax function is indeed a valid discrete probability distribution.

The Chain Rule of Calculus

The chain rule of derivatives is, in my opinion, the most important formula in differential calculus. In this post I want to explain how the chain rule works for single-variable and multivariate functions, with some interesting examples along the way.

Preliminaries: composition of functions and differentiability

We denote a function f that maps from the domain X to the codomain Y as f:X \rightarrow Y. With this f and given g:Y \rightarrow Z, we can define g \circ f:X \rightarrow Z as the composition of g and f. It's defined for \forall x \in X as:

\[(g \circ f)(x)=g(f(x))\]

In calculus we are usually concerned with the real number domain of some dimensionality. In the single-variable case, we can think of f and g as two regular real-valued functions: f:\mathbb{R} \rightarrow \mathbb{R} and g:\mathbb{R} \rightarrow \mathbb{R}.

As an example, say f(x)=x+1 and g(x)=x^2. Then:

\[(g \circ f)(x)=g(f(x))=g(x+1)=(x+1)^2\]

We can compose the functions the other way around as well:

\[(f \circ g)(x)=f(g(x))=f(x^2)=x^2+1\]

Obviously, we shouldn't expect composition to be commutative. It is, however, associative. h \circ (g \circ f) and (h \circ g) \circ f are equivalent, and both end up being h(g(f(x))) for \forall x \in X.

To better handle compositions in one's head it sometimes helps to denote the independent variable of the outer function (g in our case) by a different letter (such as g(a)). For simple cases it doesn't matter, but I'll be using this technique occasionally throughout the article. The important thing to remember here is that the name of the independent variable is completely arbitrary, and we should always be able to replace it by another name throughout the formula without any semantic change.

The other preliminary I want to mention is differentiability. The function f is differentiable at some point x_0 if the following limit exists:

\[\lim_{h \to 0}\frac{f(x_0+h)-f(x_0)}{h}\]

This limit is then the derivative of f at the point x_0, or {f}'(x_0). Another way to express this is \frac{d}{dx}f(x_0). Note that x_0 can be any arbitrary point on the real line. I sometimes say something like "f is differentiable at g(x_0)". Here too, g(x_0) is just a real value that happens to be the value of the function g at x_0.

The single-variable chain rule

The chain rule for single-variable functions states: if g is differentiable at x_0 and f is differentiable at g(x_0), then f \circ g is differentiable at x_0 and its derivative is:

\[(f \circ g)'(x_0)={f}'(g(x_0)){g}'(x_0)\]

The proof of the chain rule is a bit tricky - I left it for the appendix. However, we can get a better feel for it using some intuition and a couple of examples.

First, the intuition. By definition:

\[{g}'(x_0)=\lim_{h \to 0}\frac{g(x_0+h)-g(x_0)}{h}\]

Multiplying both sides by h we get [1]:

\[{g}'(x_0)h=\lim_{h \to 0}g(x_0+h)-g(x_0)\]

Therefore we can say that when x_0 changes by some very small amount, g(x_0) changes by {g}'(x_0) times that small amount.

Similarly {f}'(a_0) is the amount of change in the value of f for some very small change from a_0. However, since in our case we compose f \circ g, we can say that a_0=g(x_0), evaluating f(g(x_0)). Suppose we shift x_0 by a small amount h. This causes g(x_0) to shift by {g}'(x_0)h. So the input a_0 of f shifted by {g}'(x_0)h - this is still a small amount! Therefore, the total change in the value of f should be {f}'(g(x_0)){g}'(x_0)h [2].

Now, a couple of simple examples. Let's take the function f(x)=(x+1)^2. The idea is to think of this function as a composition of simpler functions. In this case, one option is: g(x)=x+1 and then w(g(x))=g(x)^2, so the original f is now the composition w \circ g.

The derivative of this composition is {w}'(g(x)){g}'(x), or 2(x+1) since {g}'(x)=1. Note that w is differentiable at any point, so this derivative always exists.

Another example will use a longer chain of composition. Let's differentiate f(x)=sin[(x+1)^2]. This is a composition of three functions:

\[\begin{align*} g(x)&=x+1\\ w(x)&=x^2\\ v(x)&=sin(x) \end{align*}\]

Function composition is associative, so f can be expressed as either v \circ (w \circ g) or (v \circ w) \circ g. Since we already know what the derivative of w \circ g is, let's use the former:

\[\begin{align*} \frac{df(x)}{dx}=\frac{d v(w(g(x)))}{dx}&={v}'(w(g(x)))(w \circ g)'(x)\\                                         &=cos(w(g(x)))2(x+1)\\                                         &=2cos[(x+1)^2](x+1) \end{align*}\]
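
A quick finite-difference check of this result (the evaluation point x0 and step h are arbitrary choices):

```python
import math

def f(x):
    return math.sin((x + 1) ** 2)

def df(x):
    """Chain-rule result: f'(x) = 2 cos[(x+1)^2] (x+1)."""
    return 2 * math.cos((x + 1) ** 2) * (x + 1)

x0, h = 0.3, 1e-6
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)
print(abs(numeric - df(x0)) < 1e-8)  # True
```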

The chain rule as a computational procedure

As the last example demonstrates, the chain rule can be applied multiple times in a single derivation. This makes the chain rule a powerful tool for computing derivatives of very complex functions, which can be broken up into compositions of simpler functions. I like to draw a parallel between this process and programming; a function in a programming language can be seen as a computational procedure - we have a set of input parameters and we produce outputs. On the way, several transformations happen that can be expressed mathematically. These transformations are composed, so their derivatives can be computed naturally with the chain rule.

This may be somewhat abstract, so let's use another example. We'll compute the derivative of the Sigmoid function - a very important function in machine learning:

\[S(x)=\frac{1}{1+e^{-x}}\]

To make the equivalence between functions and computational procedures clearer, let's think how we'd compute S in Python:

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

This doesn't look much different, but that's just because Python is a high level language with arbitrarily nested expressions. Its VM (or the CPU in general) would execute this computation step by step. Let's break it up to be clearer, assuming we can only apply a single operation at every step:

def sigmoid(x):
    f = -x
    g = math.exp(f)
    w = 1 + g
    v = 1 / w
    return v

I hope you're starting to see the resemblance to our chain rule examples at this point. Sacrificing some rigor in the notation for the sake of expressiveness, we can write:

\[S'(x)=v'(w)w'(g)g'(f)f'(x)\]

This is the chain rule applied to v \circ (w \circ (g \circ f)). Solving this is easy because every single derivative in the chain above is trivial:

\[\begin{align*} S'&=v'(w)w'(g)g'(f)(-1)\\   &=v'(w)w'(g)e^{-x}(-1)\\   &=v'(w)(1)e^{-x}(-1)\\   &=\frac{-1}{(1+e^{-x})^2}e^{-x}(-1)\\   &=\frac{e^{-x}}{(1+e^{-x})^2} \end{align*}\]
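
A quick numerical check of this final expression (the evaluation point x0 and step h are arbitrary choices):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def dsigmoid(x):
    """Chain-rule result: S'(x) = e^{-x} / (1 + e^{-x})^2."""
    return math.exp(-x) / (1 + math.exp(-x)) ** 2

x0, h = 0.5, 1e-6
numeric = (sigmoid(x0 + h) - sigmoid(x0 - h)) / (2 * h)
print(abs(numeric - dsigmoid(x0)) < 1e-8)  # True
```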

Now you may be thinking:

  1. Every function computable by a program can be broken down to trivial steps like our sigmoid above.
  2. Using the chain rule, we can easily find the derivative of such a sequence of steps... therefore:
  3. We can easily find the derivative of any function computable by a program!!

And you'd be right. This is precisely the basis for the technique known as automatic differentiation, which is widely used in scientific computing. The most notable use of automatic differentiation in recent times is the backpropagation algorithm - an essential backbone of modern machine learning. I personally find automatic differentiation fascinating, and will write a more dedicated article about it in the future.
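
To make the idea concrete, here's a minimal forward-mode automatic differentiation sketch: a dual number carries a (value, derivative) pair through the same elementary steps as the sigmoid code above. The `Dual` class and the `exp` helper are hypothetical, written just for this illustration:

```python
import math

class Dual:
    """Minimal forward-mode autodiff value: tracks (value, derivative)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __neg__(self):
        return Dual(-self.val, -self.dot)
    def __rtruediv__(self, c):           # c / x, with c a plain number
        return Dual(c / self.val, -c * self.dot / self.val ** 2)

def exp(x):
    # d/dx e^x = e^x, applied to the dual's derivative component.
    return Dual(math.exp(x.val), math.exp(x.val) * x.dot)

def sigmoid(x):
    return 1 / (1 + exp(-x))             # the same steps as before

x0 = 0.5
out = sigmoid(Dual(x0, 1.0))             # seed dx/dx = 1
analytic = math.exp(-x0) / (1 + math.exp(-x0)) ** 2
print(abs(out.dot - analytic) < 1e-12)   # True
```

Each overloaded operator applies one trivial derivative rule, and the chain rule emerges from simply running the program - which is exactly the point.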

Multivariate chain rule - general formulation

So far this article has been looking at functions with a single input and output: f:\mathbb{R} \to \mathbb{R}. In the most general case of multi-variate calculus, we're dealing with functions that map from n dimensions to m dimensions: f:\mathbb{R}^{n} \to \mathbb{R}^{m}. Because every one of the m outputs of f can be considered a separate function dependent on n variables, it's very natural to deal with such math using vectors and matrices.

First let's define some notation. We'll consider the outputs of f to be numbered from 1 to m as f_1,f_2 \dots f_m. For each such f_i we can compute its partial derivative by any of the n inputs as:

\[D_j f_i(a)=\frac{\partial f_i}{\partial a_j}(a)\]

Where j goes from 1 to n and a is a vector with n components. If f is differentiable at a [3] then the derivative of f at a is the Jacobian matrix:

\[Df(a)=\begin{bmatrix} D_1 f_1(a) & \cdots & D_n f_1(a) \\ \vdots & \ddots & \vdots \\ D_1 f_m(a) & \cdots & D_n f_m(a) \\ \end{bmatrix}\]

The multivariate chain rule states: given g:\mathbb{R}^n \to \mathbb{R}^m and f:\mathbb{R}^m \to \mathbb{R}^p and a point a \in \mathbb{R}^n, if g is differentiable at a and f is differentiable at g(a) then the composition f \circ g is differentiable at a and its derivative is:

\[D(f \circ g)(a)=Df(g(a)) \cdot Dg(a)\]

Which is the matrix multiplication of Df(g(a)) and Dg(a) [4]. Intuitively, the multivariate chain rule mirrors the single-variable one (and as we'll soon see, the latter is just a special case of the former) with derivatives replaced by derivative matrices. From linear algebra, we represent linear transformations by matrices, and the composition of two linear transformations is the product of their matrices. Therefore, since derivative matrices - like derivatives in one dimension - are a linear approximation to the function, the chain rule makes sense. This is a really nice connection between linear algebra and calculus, though a full proof of the multivariate rule is very technical and outside the scope of this article.

Multivariate chain rule - examples

Since the chain rule deals with compositions of functions, it's natural to present examples from the world of parametric curves and surfaces. For example, suppose we define f(x,y,z) as a scalar function \mathbb{R}^3 \to \mathbb{R} giving the temperature at some point in 3D. Now imagine that we're moving through this 3D space on a curve defined by a function g:\mathbb{R} \to \mathbb{R}^3 which takes time t and gives the coordinates x(t),y(t),z(t) at that time. We want to compute how the temperature changes as a function of time t - how do we do that? Recall that the temperature is not a direct function of time, but rather is a function of location, while location is a function of time. Therefore, we'll want to compose f \circ g. Here's a concrete example:

\[g(t)=\begin{pmatrix} t\\ t^2\\ t^3 \end{pmatrix}\]


\[f\begin{pmatrix} x \\ y \\ z \end{pmatrix}=x^2+xyz+5y\]

If we reformulate x, y and z as functions of t, we have:

\[\begin{align*} x(t)&=t\\ y(t)&=t^2\\ z(t)&=t^3 \end{align*}\]

Composing f \circ g, we get:

\[(f \circ g)(t)=f(g(t))=f(t,t^2,t^3)=t^2+t^6+5t^2=6t^2+t^6\]

Since this is a simple function, we can find its derivative directly:

\[(f \circ g)&#x27;(t)=12t+6t^5\]

Now let's repeat this exercise using the multivariate chain rule. To compute D(f \circ g)(t) we need Df(g(t)) and Dg(t). Let's start with Dg(t). g(t) maps \mathbb{R} \to \mathbb{R}^3, so its Jacobian is a 3-by-1 matrix, or column vector:

\[Dg(t)=\begin{bmatrix} 1 \\ 2t \\ 3t^2 \end{bmatrix}\]

To compute Df(g(t)) let's first find Df(x,y,z). Since f(x,y,z) maps \mathbb{R}^3 \to \mathbb{R}, its Jacobian is a 1-by-3 matrix, or row vector:

\[Df(x,y,z)=\begin{bmatrix} 2x+yz & xz+5 & xy \end{bmatrix}\]

To apply the chain rule, we need Df(g(t)):

\[Df(g(t))=\begin{bmatrix} 2t+t^5 & t^4+5 & t^3 \end{bmatrix}\]

Finally, multiplying Df(g(t)) by Dg(t), we get:

\[\begin{align*} D(f \circ g)(t)=Df(g(t)) \cdot Dg(t)&=\begin{bmatrix} 2t+t^5 & t^4+5 & t^3 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 2t \\ 3t^2 \end{bmatrix}\\ &=2t+t^5+2t^5+10t+3t^5\\ &=12t+6t^5 \end{align*}\]
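
The whole computation replays nicely in code; in this sketch the function names simply mirror the math above:

```python
import numpy as np

def g(t):
    return np.array([t, t**2, t**3])

def f(p):
    x, y, z = p
    return x**2 + x*y*z + 5*y

def Dg(t):
    return np.array([1.0, 2*t, 3*t**2])         # 3x1 Jacobian, as a vector

def Df(p):
    x, y, z = p
    return np.array([2*x + y*z, x*z + 5, x*y])  # 1x3 Jacobian, as a vector

t0 = 0.7
chain = Df(g(t0)) @ Dg(t0)         # multivariate chain rule
direct = 12*t0 + 6*t0**5           # derivative of 6t^2 + t^6
print(abs(chain - direct) < 1e-9)  # True
```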

Another interesting way to interpret this result for the case where f:\mathbb{R}^3 \to \mathbb{R} and g:\mathbb{R} \to \mathbb{R}^3 is to recall that the directional derivative of f in the direction of some vector \vec{v} is:

\[D_{\vec{v}}f=(\nabla f) \cdot \vec{v}\]

In our case (\nabla f) is the Jacobian of f (because of its dimensionality). So if we take \vec{v} to be the vector Dg(t), and evaluate the gradient at g(t) we get [5]:

\[D_{\vec{Dg(t)}}f(t)=(\nabla f(g(t))) \cdot Dg(t)\]

This gives us some additional intuition for the temperature change question. The change in temperature as a function of time is the directional derivative of f in the direction of the change in location (Dg(t)).

For additional examples of applying the chain rule, see my post about softmax.

Tricks with the multivariate chain rule - derivative of products

Earlier in the article we've seen how the chain rule helps find derivatives of complicated functions by decomposing them into simpler functions. The multivariate chain rule allows even more of that, as the following example demonstrates. Suppose h(x)=f(x)g(x). Then, the well-known product rule of derivatives states that:

\[h'(x)=f'(x)g(x)+f(x)g'(x)\]

Proving this from first principles (the definition of the derivative as a limit) isn't hard, but I want to show how it stems very easily from the multivariate chain rule.

Let's begin by re-formulating h(x) as a composition of two functions. The first takes a vector \vec{s} in \mathbb{R}^2 and maps it to \mathbb{R} by computing the product of its two components:

\[p(\vec{s})=s_1 s_2\]

The second is a vector-valued function that maps a number x \in \mathbb{R} to \mathbb{R}^2 :

\[s(x)=\begin{pmatrix} f(x)\\ g(x) \end{pmatrix}\]

We can compose p \circ s, producing a function that takes a scalar and returns a scalar: (p \circ s) : \mathbb{R} \to \mathbb{R}. We get:

\[h(x)=(p \circ s)(x) = f(x)g(x)\]

Since we're composing two multivariate functions, we can apply the multivariate chain rule here:

\[\begin{align*} D(p \circ s) &amp;= Dp(s(x)) \cdot Ds(x)\\              &amp;=\begin{bmatrix}                 \frac{\partial p}{\partial s_1}(x) &amp; \frac{\partial p}{\partial s_2}(x)             \end{bmatrix}\cdot                 \begin{bmatrix}                 {s_1}&#x27;(x)\\                 {s_2}&#x27;(x)               \end{bmatrix}\\              &amp;=\begin{bmatrix}                 s_2(x) &amp; s_1(x)                 \end{bmatrix}                 \cdot                 \begin{bmatrix}                 {s_1}&#x27;(x)\\                 {s_2}&#x27;(x)               \end{bmatrix}\\               &amp;={s_1}&#x27;(x)s_2(x)+{s_2}&#x27;(x)s_1(x) \end{align*}\]

Since s_1(x)=f(x) and s_2(x)=g(x), this is exactly the product rule.
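The derivation above can be checked numerically for concrete functions. Here is a minimal sketch with the illustrative choices f(x)=sin(x) and g(x)=x^2 (my example, not from the text):

```python
import math

# Numerical sanity check of the product rule derived via the multivariate
# chain rule, with f(x) = sin(x) and g(x) = x^2 as illustrative choices.

def h(x):
    # h(x) = f(x)g(x) = (p ∘ s)(x)
    return math.sin(x) * x**2

def h_prime(x):
    # s1'(x)s2(x) + s2'(x)s1(x), with s1 = f and s2 = g.
    return math.cos(x) * x**2 + 2*x * math.sin(x)

def numeric(func, x, eps=1e-6):
    # Central-difference approximation of func'(x).
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x = 0.7
assert abs(numeric(h, x) - h_prime(x)) < 1e-5
```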

Connecting the single-variable and multivariate chain rules

Given function f(x) : \mathbb{R} \to \mathbb{R}, its Jacobian matrix has a single entry:

\[Df(a)=\begin{bmatrix}D_{x}f(a)\end{bmatrix}=       \begin{bmatrix}\frac{df}{dx}(a)\end{bmatrix}\]

Therefore, given two functions mapping \mathbb{R} \to \mathbb{R}, the derivative of their composition using the multivariate chain rule is:

\[D(f \circ g)(a)=Df(g(a))\cdot Dg(a)=f&#x27;(g(a))g&#x27;(a)\]

Which is precisely the single-variable chain rule. This results from matrix multiplication between two 1x1 matrices, which ends up being just the product of their single entries.
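To make the 1x1-Jacobian case concrete, here is a small sketch with f(x)=e^x and g(x)=x^2 (my example): the "matrix product" of the two single-entry Jacobians is just f&#x27;(g(x))g&#x27;(x).

```python
import math

# The 1x1-Jacobian case: for f(x) = exp(x) and g(x) = x^2, the
# multivariate chain rule reduces to f'(g(x)) * g'(x).

def composed(x):
    # (f ∘ g)(x) = exp(x^2)
    return math.exp(x**2)

def chain_rule(x):
    # f'(g(x)) * g'(x) = exp(x^2) * 2x
    return math.exp(x**2) * 2*x

def numeric(func, x, eps=1e-6):
    # Central-difference approximation of func'(x).
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x = 0.5
assert abs(numeric(composed, x) - chain_rule(x)) < 1e-5
```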

Appendix: proving the single-variable chain rule

It turns out that many online resources (including Khan Academy) provide a flawed proof for the chain rule. It's flawed due to a careless division by a quantity that may be zero. This flaw can be corrected by making the proof somewhat more complicated; I won't take that road here - for details see Spivak's Calculus. Instead, I'll present a simpler proof inspired by the one I found at Casey Douglas's site.

We want to prove that:

\[(f \circ g)&#x27;(x)={f}&#x27;(g(x)){g}&#x27;(x)\]

Note that previously we defined derivatives at some concrete point x_0. Here for the sake of brevity I'll just use x as an arbitrary point, assuming the derivative exists.

Let's start with the definition of g&#x27;(x):

\[{g}&#x27;(x)=\lim_{h \to 0}\frac{g(x+h)-g(x)}{h}\]

We can reorder it as follows:

\[\lim_{h \to 0}\left [ \frac{g(x+h)-g(x)}{h} - g&#x27;(x) \right ] = 0\]

Let's give the part in the brackets the name \Delta g(x).

Similarly, if the function f is differentiable at the point a=g(x), we have:

\[f&#x27;(a)=\lim_{k \to 0}\frac{f(a+k)-f(a)}{k}\]

We reorder:

\[\lim_{k \to 0}\left [ \frac{f(a+k)-f(a)}{k} - f&#x27;(a) \right ] = 0\]

And call the part in the brackets \Delta f(a). The choice of k (rather than h) as the variable going to zero is arbitrary, and is useful to simplify the discussion that follows.

Let's reorder the definition of \Delta g(x) a bit:

\[g(x+h)=g(x)+[g&#x27;(x)+\Delta g(x)]h\]

We can apply f to both sides:

\[\begin{equation} f(g(x+h))=f(g(x)+[g&#x27;(x)+\Delta g(x)]h) \tag{1} \end{equation}\]

By reordering the definition of \Delta f(a) we get:

\[\begin{equation} f(a+k)=f(a)+[f&#x27;(a)+\Delta f(a)]k \tag{2} \end{equation}\]

Now taking the right-hand side of (1), we can look at it as f(a+k), since a=g(x) and we can define k=[g&#x27;(x)+\Delta g(x)]h. We still have k going to zero when h goes to zero. Substituting these a and k into (2) we get:

\[f(a+k)=f(g(x))+[f&#x27;(g(x))+\Delta f(g(x))][g&#x27;(x)+\Delta g(x)]h\]

So, starting from (1) again, we have:

\[\begin{align*} f(g(x+h))&amp;=f(a+k) \\          &amp;=f(g(x))+[f&#x27;(g(x))+\Delta f(g(x))][g&#x27;(x)+\Delta g(x)]h \end{align*}\]

Subtracting f(g(x)) from both sides and dividing by h (which is legal, since h is very small but not zero) we get:

\[\frac{f(g(x+h))-f(g(x))}{h}=[f&#x27;(g(x))+\Delta f(g(x))][g&#x27;(x)+\Delta g(x)]\]

Apply a limit to both sides:

\[\lim_{h \to 0} \frac{f(g(x+h))-f(g(x))}{h}= \lim_{h \to 0} [f&#x27;(g(x))+\Delta f(g(x))][g&#x27;(x)+\Delta g(x)]\]

Now recall that both \Delta f(g(x)) and \Delta g(x) go to 0 when h goes to 0. Taking this into account, we get:

\[\lim_{h \to 0} \frac{f(g(x+h))-f(g(x))}{h}= f&#x27;(g(x))g&#x27;(x)\]
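The limit just proved can be illustrated numerically: as h shrinks, the difference quotient of f∘g approaches f&#x27;(g(x))g&#x27;(x). The sketch below uses the example choices f(u)=u^3 and g(x)=sin(x) (mine, for illustration only):

```python
import math

# Informal numerical illustration of the chain-rule limit, with
# f(u) = u^3 and g(x) = sin(x) as example functions.

def quotient(x, h):
    # The difference quotient [f(g(x+h)) - f(g(x))] / h.
    return (math.sin(x + h)**3 - math.sin(x)**3) / h

x = 1.0
# f'(g(x)) * g'(x) = 3 sin(x)^2 * cos(x)
expected = 3 * math.sin(x)**2 * math.cos(x)

# The error should shrink as h goes to zero.
errors = [abs(quotient(x, 10.0**-k) - expected) for k in (1, 2, 3, 4)]
assert all(a > b for a, b in zip(errors, errors[1:]))
assert errors[-1] < 1e-3
```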


[1]Here, as in the rest of the post, I'm being careless with the usage of \lim, sometimes leaving its existence to be implicit. In general, wherever h appears in a formula we know there's a \lim_{h \to 0} there, whether explicitly or not.
[2]An alternative way to think about it is: suppose the functions f and g are linear: f(x)=ax+b and g(x)=cx+d. Then the chain rule is trivially true. But now recall what the derivative is. The derivative at some point x_0 is the best linear approximation for the function at that point. Therefore the chain rule is true for any pair of differentiable functions - even when the functions are not linear, we approximate their rate of change in an infinitesimal area around x_0 with a linear function.
[3]The condition for f being differentiable at a is stronger than simply saying that all partial derivatives exist at a, but I won't spend more time on this subtlety here.
[4]As an exercise, verify that the matrix dimensions of Df and Dg make this multiplication valid.
[5]It shouldn't be surprising we get here, since the definition of the directional derivative as the gradient was derived using the multivariate chain rule.

Summary of reading: July - September 2016

  • "Everydata" by John H. Johnson and Mike Gluck - in this book the authors set out to show how we consume data every day in suboptimal ways; specifically, how laymen unfamiliar with the basics of statistics and probability can often be fooled by the media and bogus research "conclusions". The idea of the book is great, but I found the execution mediocre. The authors have some good ideas to pass along and interesting examples and demonstrations, but unfortunately these drown in an incoherent, rambling style of writing, jumping from topic to topic haphazardly, with a tone reminiscent of tabloids more than of non-fiction books. I mean, it's OK to use some quips, jargon and personal tone from time to time, but this book overdoes it. It also comes across as a not-so-subtle advertisement of the authors' respective private consultancies for statistical analysis and training/coaching. On the brighter side, the book does appear well researched and there are plenty of references provided for every chapter if one is interested in deeper dives into the topics presented.
  • "JavaScript & jQuery" by Jon Duckett - a companion to the author's "HTML & CSS" book, focusing on the other side of modern web programming - JS. It uses the same colorful glossy format, though with somewhat more information density, and manages to cover the web-scriptability aspects of JS pretty well. What I mean to say is that the book doesn't spend too much time discussing programming paradigms in JS, but rather focuses on how to use it to manipulate the DOM, process events and interact with users. There's also quite a bit of focus on jQuery and modern HTML5 elements and DOM-stuff relevant to JS. The book tries to cater to beginner programmers, though I seriously doubt someone new to programming will really manage to learn JavaScript from it. However, if you already have some programming experience and want to learn how to make your web pages dance without heavy libraries, this book is a good fit (jQuery is so ubiquitous these days that it's considered as "vanilla" as JS itself). It does discuss Angular briefly in one chapter, but overall sticks to the basics.
  • "Basic Economics" by Thomas Sowell - Certainly one of the best books about economics I've read in the past few years, perhaps ever. This is maybe the second time I started re-reading the book (well, re-listening in this case) immediately after finishing it the first time - it's that good! The author sets out to explain basic micro and macro-economic phenomena in the world from first principles, based on such fundamental issues as pricing, supply and demand, and the allocation of scarce resources that have alternative uses - the latter being a central theme throughout the book. He touches on a huge spectrum of topics - from price controls, to differences in productivity between countries, to why payday loans charge so much interest, to the ways many countries shot themselves in the foot by imposing government control of the market. One sobering perspective that comes out from the book is that politicians are very often incentivised to make un-economical decisions, acting fully rationally from their point of view. The solution is, of course, restriction on the power the government has over the free market. Surprisingly, Sowell manages to bring his points across without going too much into polemics; instead, he sticks to facts, historical studies and economic first principles. I can't recommend this book highly enough - if you only ever read one book on economics in your lifetime, this should be it.
  • "Black Rednecks and White Liberals" by Thomas Sowell - A collection of several loosely-connected essays on history, sharing the common thread of examining ingrained historic perceptions of race and ethnicity. For example, one of the essays explains how what we now know as "Black culture" isn't African in origin, but rather is peculiar to the South of the US, and originated in certain parts of Northern Britain. Another essay compares Jews to other "middlemen minorities" (like Chinese in South-East Asia and Armenians in the Ottoman Empire), concluding that the history, and treatment by other nationalities, of such minorities has a lot in common. Very well written and insightful book.
  • "Python machine learning" by Sebastian Raschka - this would be an average plus book, had it not been published by Packt. Packt makes it a fairly low-quality work, as usual, with weird formatting and bad editing.
  • "The girl on the train" by Paula Hawkins - a light, engrossing mystery novel about a missing woman in London, told in first person from the points-of-view of three women protagonists. From about two-thirds into the book everything becomes painfully obvious, so it was a bit of a drag to read on until the main character realized what was going on. Also, the main character - Rachel - must be one of the least likable protagonists I've encountered in books lately. It's very hard not to feel repelled by her behavior and general state of mind.
  • "Applied Economics: Thinking beyond stage one" by Thomas Sowell - the central theme here is making decisions that make sense in the short term, but turn out badly in the long term. Sowell once again pounds politics as the obvious culprit here - passing laws and regulations that boost their status with the voters for the duration of their term in some political office, but leading to problems years later when no one remembers what the original cause was. Overall, much of this book is rehashed from Basic Economics, so I found it much less exciting. It's still a good read, but reading it shortly after Basic Economics feels a bit repetitive. There are some new topics, though, which are very interesting. One is the economics of discrimination. Sowell claims that it is government policies that enable discrimination, and that free markets are actually discouraging wide-spread discrimination. He makes a very well-argued case for this, with tons of examples from different countries and eras.
  • "Economics" by Timothy Taylor (audio course) - pretty good, long course. I found Sowell's "Basic Economics" better, though the course spends much more time on macro economics, which Sowell covers less (with the caveat that macro economics in general doesn't feel very scientific to me).
  • "Deep Work" by Cal Newport - a guideline to fighting procrastination and the modern distractions of the internet and social media on the way to greater productivity. I'd say this book has the same idea as Csíkszentmihályi's "Flow" but is much less thorough and much more handwavy. I did like it overall; in particular, the thesis of work quality often beating work quantity resonates strongly with my own experience. Still, "Flow" is significantly better IMHO.
  • "One Flew Over the Cuckoo's Nest" by Ken Kesey - an excellent book about life in a mental asylum. I can see why this book is controversial w.r.t. teaching it in school - it certainly glorifies opposition to authority and the feel-good of defying rules.


  • "War and Peace" by Leo Tolstoy