Minimal character-based LSTM implementation



Following up on the earlier post deciphering a minimal vanilla RNN implementation, here I'd like to extend the example to a simple LSTM model.

Once again, the idea is to combine a well-commented code sample (available here) with some high-level diagrams and math to enable someone to fully understand the code. The LSTM architecture presented herein is the standard one originating from Hochreiter and Schmidhuber's 1997 paper. It's described pretty much everywhere; Chris Olah's post has particularly nice diagrams and is worth reading.

LSTM cell structure

From 30,000 feet, LSTMs look just like regular RNNs; there's a "cell" that has a recurrent connection (output tied to input), and when trained this cell is usually unrolled to some fixed length.

So we can take the basic RNN structure from the previous post:

Basic RNN diagram

LSTMs are a bit trickier because there are two recurrent connections; these can be "packed" into a single vector h, so the above diagram still applies. Here's how an LSTM cell looks inside:

LSTM cell

x is the input; p is the vector of probabilities computed from the output y (these symbols are named consistently with my earlier RNN post); p exits the cell at the bottom purely for topological convenience. The two memory vectors are h and c - as mentioned earlier, they could be combined into a single vector, but are shown here separately for clarity.

The main idea of LSTMs is to enable training on longer sequences by providing a "fast path" along which information can be back-propagated farther back in time. Hence the c vector is not multiplied by any matrices on its path. The circle-in-circle block means element-wise multiplication of two vectors; plus-in-square is element-wise addition. The funny Greek letter is the sigmoid nonlinearity:

\[\sigma(x) =\frac{1}{1+e^{-x}}\]

The only other block we haven't seen in the vanilla RNN diagram is the colon-in-square in the bottom-left corner; this is simply the concatenation of h and x into a single column vector. In addition, I've combined the "multiply by matrix W, then add bias b" operation into a single rectangular box to save on precious diagram space.

Here are the equations computed by a cell:

\[\begin{align*} xh&=x^{[t]}:h^{[t-1]}\\ f&=\sigma(W_f\cdot xh+b_f)\\ i&=\sigma(W_i\cdot xh+b_i)\\ o&=\sigma(W_o\cdot xh+b_o)\\ cc&=tanh(W_{cc}\cdot xh+b_{cc})\\ c^{[t]}&=c^{[t-1]}\odot f +cc\odot i\\ h^{[t]}&=tanh(c^{[t]})\odot o\\ y^{[t]}&=W_{y}\cdot h^{[t]}+b_y\\ p^{[t]}&=softmax(y^{[t]})\\ \end{align*}\]
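To make these equations concrete, here's a minimal NumPy sketch of a single forward step through the cell. The names (sigmoid, lstm_step, Wf, bf and so on) are mine and simply mirror the symbols above; the actual code sample linked above may organize things somewhat differently.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, bf, Wi, bi, Wo, bo, Wcc, bcc, Wy, by):
    # Concatenate the input and the previous state into a single column vector.
    xh = np.vstack((x, h_prev))

    # The forget, input and output gates.
    f = sigmoid(np.dot(Wf, xh) + bf)
    i = sigmoid(np.dot(Wi, xh) + bi)
    o = sigmoid(np.dot(Wo, xh) + bo)

    # Candidate cell state, then the new cell and hidden states.
    cc = np.tanh(np.dot(Wcc, xh) + bcc)
    c = c_prev * f + cc * i
    h = np.tanh(c) * o

    # Output and softmax probabilities.
    y = np.dot(Wy, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    return p, h, c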

Backpropagating through an LSTM cell

This works exactly like backprop through a vanilla RNN; we have to carefully compute how the gradient flows through every node and make sure we properly combine gradients at fork points. Most of the elements in the LSTM diagram are familiar from the previous post. Let's briefly work through the new ones.

First, the Sigmoid function; it's an elementwise function, and computing its derivative is very similar to the tanh function discussed in the previous post. As usual, given f=\sigma(k), from the chain rule we have the following derivative w.r.t. some weight w:

\[\frac{\partial f}{\partial w}=\frac{\partial \sigma(k)}{\partial k}\frac{\partial k}{\partial w}\]

To compute the derivative \frac{\partial \sigma(k)}{\partial k}, we'll use the ratio-derivative formula:

\[(\frac{f}{g})'=\frac{f'g-g'f}{g^2}\]

So:

\[\sigma '(k)=\frac{e^{-k}}{(1+e^{-k})^2}\]

A clever way to express this is in terms of \sigma(k) itself; splitting the fraction into \frac{1}{1+e^{-k}}\cdot\frac{e^{-k}}{1+e^{-k}} and noting that the second factor is 1-\sigma(k), we get:

\[\sigma '(k)=\sigma(k)(1-\sigma(k))\]

Going back to the chain rule with f=\sigma(k), we get:

\[\frac{\partial f}{\partial w}=f(1-f)\frac{\partial k}{\partial w}\]
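As a quick numerical sanity check of this identity, we can compare the analytic derivative \sigma(k)(1-\sigma(k)) against a finite-difference approximation; the snippet below is just an illustration, not part of the linked code.

import numpy as np

def sigmoid(k):
    return 1 / (1 + np.exp(-k))

k = 0.7
eps = 1e-6
analytic = sigmoid(k) * (1 - sigmoid(k))
numeric = (sigmoid(k + eps) - sigmoid(k - eps)) / (2 * eps)
# The two values should agree to about 6 significant digits.
print(analytic, numeric)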

The other new operation we'll have to find the derivative of is element-wise multiplication. Let's say we have the column vectors x, y and z, each with m rows, and we have z(x)=x\odot y. Since z as a function of x has m inputs and m outputs, its Jacobian has dimensions [m,m].

D_{j}z_{i} is the derivative of the i-th element of z w.r.t. the j-th element of x. For z(x)=x\odot y this is non-zero only when i and j are equal, and in that case the derivative is y_i.

Therefore, Dz(x) is a square matrix with the elements of y on the diagonal and zeros elsewhere:

\[Dz=\begin{bmatrix} y_1 & 0 & \cdots & 0 \\ 0 & y_2 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & y_m \\ \end{bmatrix}\]

If we want to backprop some loss L through this function, we get:

\[\frac{\partial L}{\partial x}=\frac{\partial L}{\partial z}Dz\]

As x has m elements, the right-hand side of this equation multiplies a [1,m] vector by a [m,m] matrix which is diagonal, resulting in element-wise multiplication with the matrix's diagonal elements. In other words:

\[\frac{\partial L}{\partial x}=\frac{\partial L}{\partial z}\odot y\]

In code, it looks like this:

# Assuming dz is the gradient of loss w.r.t. z; dz, y and dx are all
# column vectors.
dx = dz * y

Model quality

In the post about min-char-rnn, we've seen that the vanilla RNN generates fairly low quality text:

one, my dred, roriny. qued bamp gond hilves non froange saws, to mold his a work, you shirs larcs anverver strepule thunboler muste, thum and cormed sightourd so was rewa her besee pilman

The LSTM's generated text quality is somewhat better when trained with roughly the same hyper-parameters:

the she, over is was besiving the fact to seramed for i said over he will round, such when a where, "i went of where stood it at eye heardul rrawed only coside the showed had off with the refaurtoned

I'm fairly sure that it can be made to perform even better with larger memory vectors and more training data. That said, an even more advanced architecture can be helpful here. Moreover, since this is a character-based model, to really capture dependencies between words that are a few words apart we'll need a much deeper LSTM (I'm unrolling to 16 characters, so we can only capture 2-3 words), and hence much more training data and time.

Once again, the goal here is not to develop a state-of-the-art language model, but to show a simple, comprehensible example of how an LSTM is implemented end-to-end in Python code. The full code is here - please let me know if you find any issues with it or something still remains unclear.


Understanding how to implement a character-based RNN language model



In a single gist, Andrej Karpathy did something truly impressive. In a little over 100 lines of Python - without relying on any heavy-weight machine learning frameworks - he presents a fairly complete implementation of training a character-based recurrent neural network (RNN) language model; this includes the full backpropagation learning with Adagrad optimization.

I love such minimal examples because they allow me to understand some topic in full depth, connecting the math to the code and having a complete picture of how everything works. In this post I want to present a companion explanation to Karpathy's gist, showing the diagrams and math that hide in its Python code.

My own fork of the code is here; it's semantically equivalent to Karpathy's gist, but includes many more comments and some debugging options. I won't reproduce the whole program here; instead, the idea is that you'd go through the code while reading this article. The diagrams, formulae and explanations here are complementary to the code comments.

What RNNs do

I expect readers to have a basic idea of what RNNs do and why they work well for some problems. RNNs are well-suited for problem domains where the input (and/or output) is some sort of a sequence - time-series financial data, words or sentences in natural language, speech, etc.

There is a lot of material about this online, and the basics are easy to understand for anyone with even a bit of machine learning background. However, there is not enough coherent material online about how RNNs are implemented and trained - this is the goal of this post.

Character-based RNN language model

The basic structure of min-char-rnn is represented by this recurrent diagram, where x is the input vector (at time step t), y is the output vector and h is the state vector kept inside the model.

Basic RNN diagram

The line leaving and returning to the cell represents that the state is retained between invocations of the network. When a new time step arrives, some things are still the same (the weights inherent to the network, as we shall soon see) but some things are different - h may have changed. Therefore, unlike stateless NNs, y is not simply a function of x; in RNNs, identical xs can produce different ys, because y is a function of x and h, and h can change between steps.

The character-based part of the model's name means that every input vector represents a single character (as opposed to, say, a word or part of an image). min-char-rnn uses one-hot vectors to represent different characters.

A language model is a particular kind of machine learning algorithm that learns the statistical structure of language by "reading" a large corpus of text. This model can then reproduce authentic language segments - by predicting the next character (or word, for word-based models) based on past characters.

Internal-structure of the RNN cell

Let's proceed by looking into the internal structure of the RNN cell in min-char-rnn:

RNN cell for min-char-rnn
  • Bold-faced symbols in reddish color are the model's parameters, weights for matrix multiplication and biases.
  • The state vector h is shown twice - once for its past value, and once for its currently computed value. Whenever the RNN cell is invoked in sequence, the last computed state h is passed in from the left.
  • In this diagram y is not the final answer of the cell - we compute a softmax function on it to obtain p - the probabilities for output characters [1]. I'm using these symbols for consistency with the code of min-char-rnn, though it would probably be more readable to flip the uses of p and y (making y the actual output of the cell).

Mathematically, this cell computes:

\[\begin{align*} h^{[t]}&=tanh(W_{hh}\cdot h^{[t-1]}+W_{xh}\cdot x^{[t]}+b_h)\\ y^{[t]}&=W_{hy}\cdot h^{[t]}+b_y\\ p^{[t]}&=softmax(y^{[t]}) \end{align*}\]
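In NumPy terms, a single forward step through this cell looks roughly like the sketch below. The function name and argument layout are mine for illustration; min-char-rnn performs essentially the same computation inside its training loop, keeping per-step values in dictionaries.

import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why, bh, by):
    # New state: tanh applied to the affine combination of input and old state.
    h = np.tanh(np.dot(Whh, h_prev) + np.dot(Wxh, x) + bh)
    # Output, then softmax probabilities over the next character.
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    return p, h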

Learning model parameters with backpropagation

This section will examine how we can learn the parameters W and b for this model. Mostly it's standard neural-network fare; we'll compute the derivatives of all the steps involved and will then employ backpropagation to find a parameter update based on some computed loss.

There's one serious issue we'll have to address first. Backpropagation is usually defined on acyclic graphs, so it's not entirely clear how to apply it to our RNN. Is h an input? An output? Both? In the original high-level diagram of the RNN cell, h is both an input and an output - how can we compute the gradient for it when we don't know its value yet? [2]

The way out of this conundrum is to unroll the RNN for a few steps. Note that we're already doing this in the detailed diagram by distinguishing between h^{[t]} and h^{[t-1]}. This makes every RNN cell locally acyclic, which makes it possible to use backpropagation on it. This approach has a cool-sounding name - Backpropagation Through Time (BPTT) - although it's really the same as regular backpropagation.

Note that the architecture used here is called "synced many-to-many" in Karpathy's Unreasonable Effectiveness of RNNs post, and it's useful for training a simple char-based language model - we immediately observe the output sequence produced by the model while reading the input. Similar unrolling can be applied to other architectures, like encoder-decoder.

Here's our RNN again, unrolled for 3 steps:

Unrolled RNN diagram

Now the same diagram, with the gradient flows depicted with orange-ish arrows:

Unrolled RNN diagram with gradient flow arrows shown

With this unrolling, we have everything we need to compute the actual weight updates during learning, because when we want to compute the gradients through step 2, we already have the incoming gradient \Delta h[2], and so on.

You may now wonder what \Delta h[t] is for the final step at time t.

In some models, sequence lengths are fairly limited. For example, when we translate a single sentence, the sequence length is rarely over a couple dozen words; for such models we can fully unroll the RNN. The h state output of the final step doesn't really "go anywhere", and we assume its gradient is zero. Similarly, the incoming state h for the first step is zero.

Other models work on potentially infinite sequence lengths, or sequences much too long for unrolling. The language model in min-char-rnn is a good example, because it can theoretically ingest and emit text of any length. For these models we'll perform truncated BPTT, by just assuming that the influence of the current state extends only N steps into the future. We'll then unroll the model N times and assume that \Delta h[N] is zero. Although it really isn't, for a large enough N this is a fairly safe assumption. RNNs are hard to train on very long sequences for other reasons, anyway (we'll touch upon this point again towards the end of the post).

Finally, it's important to remember that although we unroll the RNN cells, all parameters (weights, biases) are shared. This plays an important part in ensuring translation invariance for the models - patterns learned in one place apply to another place [3]. It leaves the question of how to update the weights, since we compute gradients for them separately in each step. The answer is very simple - just add them up. This is similar to other cases where the output of a cell branches off in two directions - when gradients are computed, their values are added up along the branches - this is just the basic chain rule in action.
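To illustrate the "just add them up" rule in code, here's a small sketch (with illustrative names, not taken verbatim from min-char-rnn) showing how per-step contributions to the shared output parameters W_{hy} and b_y would be accumulated while walking backwards over the unrolled steps:

import numpy as np

def accumulate_output_grads(hs, dys, Why, by):
    # hs[t] is the state column vector at step t; dys[t] is the loss gradient
    # w.r.t. y at step t. The shared parameters get the sum of all per-step
    # contributions.
    dWhy, dby = np.zeros_like(Why), np.zeros_like(by)
    for t in reversed(range(len(dys))):
        dWhy += np.dot(dys[t], hs[t].T)  # outer product, derived later
        dby += dys[t]
    return dWhy, dby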

We now have all the necessary background to understand how an RNN learns. What remains before looking at the code is figuring out how the gradients propagate inside the cell; in other words, the derivatives of each operation comprising the cell.

Flowing the gradient inside an RNN cell

As we saw above, the formulae for computing the cell's output from its inputs are:

\[\begin{align*} h^{[t]}&=tanh(W_{hh}\cdot h^{[t-1]}+W_{xh}\cdot x^{[t]}+b_h)\\ y^{[t]}&=W_{hy}\cdot h^{[t]}+b_y\\ p^{[t]}&=softmax(y^{[t]}) \end{align*}\]

To be able to learn weights, we have to find the derivatives of the cell's output w.r.t. the weights. The full backpropagation process was explained in this post, so here is only a brief refresher.

Recall that p^{[t]} is the predicted output; we compare it with the "real" output (r^{[t]}) during learning, to find the loss (error):

\[L=L(p^{[t]}, r^{[t]})\]

To perform a gradient descent update, we'll need to find \frac{\partial L}{\partial w}, for every weight value w. To do this, we'll have to:

  1. Find the "local" gradients for every mathematical operation leading from w to L.
  2. Use the chain rule to propagate the error backwards through these local gradients until we find \frac{\partial L}{\partial w}.

We start by formulating the chain rule to compute \frac{\partial L}{\partial w}:

\[\frac{\partial L}{\partial w}=\frac{\partial L}{\partial p^{[t]}}\frac{\partial p^{[t]}}{\partial w}\]

Next comes:

\[\frac{\partial p^{[t]}}{\partial w}=\frac{\partial softmax}{\partial y^{[t]}}\frac{\partial y^{[t]}}{\partial w}\]

Let's say the weight w we're interested in is part of W_{hh}, so we have to propagate some more:

\[\frac{\partial y^{[t]}}{\partial w}=\frac{\partial y^{[t]}}{\partial h^{[t]}}\frac{\partial h^{[t]}}{\partial w}\]

We'll then proceed to propagate through the tanh function, bias addition and finally the multiplication by W_{hh}, for which the derivative by w is computed directly without further chaining.

Let's now see how to compute all the relevant local gradients.

Cross-entropy loss gradient

We'll start with the derivative of the loss function, which is cross-entropy in the min-char-rnn model. I went through a detailed derivation of the gradient of softmax followed by cross-entropy in this post; here is only a brief recap:

\[xent(p,q)=-\sum_{k}p(k)log(q(k))\]

Re-formulating this for our specific case, the loss is a function of p^{[t]}, assuming the "real" class r is constant for every training example:

\[L(p^{[t]})=-\sum_{k}r(k)log(p^{[t]}(k))\]

Since inputs and outputs to the cell are 1-hot encoded, let's just use r to denote the index where r(k) is non-zero. Then the Jacobian of L is only non-zero at index r, and its value there is -\frac{1}{p^{[t]}(r)}.

Softmax gradient

A detailed computation of the gradient for the softmax function was also presented in this post. For S(y) being the softmax of y, the Jacobian is:

\[D_{j}S_{i}=\frac{\partial S_i}{\partial y_j}=S_{i}(\delta_{ij}-S_j)\]
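Multiplying the cross-entropy Jacobian from the previous section by this softmax Jacobian (via the chain rule) collapses into a famously simple expression: the gradient w.r.t. y is just the probability vector with 1 subtracted at the index of the correct class. A hedged sketch of this in code (the function name is mine):

import numpy as np

def softmax_xent_grad(p, r_index):
    # p is the column vector of computed probabilities; r_index is the index
    # of the correct class. Returns dL/dy.
    dy = np.copy(p)
    dy[r_index] -= 1
    return dy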

Fully-connected layer gradient

Next on our path backwards is:

\[y^{[t]}=W_{hy}\cdot h^{[t]}+b_y\]

From my earlier post on backpropagating through a fully-connected layer, we know that \frac{\partial y^{[t]}}{\partial h^{[t]}}=W_{hy}. But that's not all; note that on the forward pass h^{[t]} splits in two - one edge goes into the fully-connected layer, another goes to the next RNN cell as the state. When we backpropagate the loss gradient to h^{[t]}, we have to take both edges into account; more specifically, we have to add the gradients along the two edges. This leads to the following backpropagation equation:

\[\frac{\partial L}{\partial h^{[t]}} = \frac{\partial y^{[t]}}{\partial h^{[t]}}\frac{\partial L}{\partial y^{[t]}}+\frac{\partial L}{\partial h^{[t+1]}}=W_{hy}\cdot \frac{\partial L}{\partial y^{[t]}}+\frac{\partial L}{\partial h^{[t+1]}}\]

In addition, note that this layer already has model parameters that need to be learned - W_{hy} and b_y - a "final" destination for backpropagation. Please refer to my fully-connected layer backpropagation post to see how the gradients for these are computed.
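In code, with the column-vector convention used throughout min-char-rnn, the matrix shows up transposed (see the fully-connected layer post for why); here's a sketch of the h-gradient computation, with dy standing for \frac{\partial L}{\partial y^{[t]}} and dhnext for the gradient arriving from the next unrolled cell:

import numpy as np

def backprop_into_h(dy, dhnext, Why):
    # Fully-connected layer's contribution plus the gradient flowing back
    # from the next time step.
    return np.dot(Why.T, dy) + dhnext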

Gradient of tanh

The vector h^{[t]} is produced by applying a hyperbolic tangent nonlinearity to another fully connected layer.

\[h^{[t]}=tanh(W_{hh}\cdot h^{[t-1]}+W_{xh}\cdot x^{[t]}+b_h)\]

To get to the model parameters W_{hh}, W_{xh} and b_h, we have to first backpropagate the loss gradient through tanh. tanh is a scalar function; when it's applied to a vector we apply it in element-wise fashion to every element in the vector independently, and collect the results in a similarly-shaped result vector.

Its mathematical definition is:

\[tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}\]

To find the derivative of this function, we'll use the formula for deriving a ratio:

\[(\frac{f}{g})'=\frac{f'g-g'f}{g^2}\]

So:

\[tanh'(x)=\frac{(e^x+e^{-x})(e^x+e^{-x})-(e^x-e^{-x})(e^x-e^{-x})}{(e^x+e^{-x})^2}=1-(tanh(x))^2\]

Just like for softmax, it turns out that there's a convenient way to express the derivative of tanh in terms of tanh itself. This comes in handy when applying the chain rule; for example, with h=tanh(k), where k is a function of w, we get:

\[\frac{\partial h}{\partial w}=\frac{\partial tanh(k)}{\partial k}\frac{\partial k}{\partial w}=(1-h^2)\frac{\partial k}{\partial w}\]

In our case k(w) is a fully-connected layer; to find its derivatives w.r.t. the weight matrices and bias, please refer to the backpropagation through a fully-connected layer post.
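Expressed in code, backpropagating through the tanh nonlinearity is a one-liner; here dh is assumed to be the loss gradient that has already reached h^{[t]} (including the contribution from the next step, as discussed above):

import numpy as np

def backprop_through_tanh(dh, h):
    # h = tanh(k), so the local derivative is 1 - h^2, applied element-wise.
    return (1 - h * h) * dh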

Learning model parameters with Adagrad

We've just gone through all the major parts of the RNN cell and computed local gradients. Armed with these formulae and the chain rule, it should be possible to understand how the min-char-rnn code flows the loss gradient backwards. But that's not the end of the story; once we have the loss derivatives w.r.t. some model parameter, how do we update this parameter?

The most straightforward way to do this would be using the gradient descent algorithm, with some constant learning rate. I've written about gradient descent in the past - please take a look for a refresher.

Most real-world learning is done with more advanced algorithms these days, however. One such algorithm is called Adagrad, proposed in 2011 by some experts in mathematical optimization. min-char-rnn happens to use Adagrad, so here is a simplified explanation of how it works.

The main idea is to adjust the learning rate separately per parameter, because in practice some parameters change much more often than others. This could be due to rare examples in the training data set that affect a parameter that's not often affected; we'd like to amplify these changes because they are rare, and dampen changes to parameters that change often.

Therefore the Adagrad algorithm works as follows:

import numpy as np

# The memory has the same shape as the parameter array x.
memory = np.zeros_like(x)
while True:
  dx = compute_grad(x)

  # Elementwise: each memory element gets the corresponding dx^2 added to it.
  memory += dx * dx

  # The actual parameter update for this step. Note how the learning rate is
  # modified by the memory. epsilon is some very small number to avoid dividing
  # by 0.
  x -= learning_rate * dx / (np.sqrt(memory) + epsilon)

If a given element in dx was updated significantly in the past, its corresponding memory element will grow and thus the learning rate is effectively decreased.

Gradient clipping

If we unroll the RNN cell 10 times, the gradient will be multiplied by W_{hh} ten times on its way from the last cell to the first. For some structures of W_{hh}, this may lead to an "exploding gradient" effect where the value keeps growing [4].

To mitigate this, min-char-rnn uses the gradient clipping trick. Whenever the gradients are updated, they are "clipped" to some reasonable range (like -5 to 5) so they will never get out of this range. This method is crude, but it works reasonably well for training RNNs.
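Here's a sketch of what the clipping step looks like in NumPy, applied to each parameter gradient just before the parameter update (the helper function is mine; min-char-rnn does this inline with np.clip):

import numpy as np

def clip_gradients(grads, clip_value=5.0):
    # Clip each gradient array element-wise into [-clip_value, clip_value],
    # modifying it in place.
    for dparam in grads:
        np.clip(dparam, -clip_value, clip_value, out=dparam)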

The flip side problem of vanishing gradient (wherein the gradients keep getting smaller with each step) is much harder to solve, and usually requires more advanced recurrent NN architectures.

min-char-rnn model quality

While min-char-rnn is a complete RNN implementation that manages to learn, it's not really good enough for learning a reasonable model for the English language. The model is too simple for this, and suffers seriously from the vanishing gradient problem.

For example, when training a 16-step unrolled model on a corpus of Sherlock Holmes books, it produces the following text after 60,000 iterations (learning on about a MiB of text):

one, my dred, roriny. qued bamp gond hilves non froange saws, to mold his a work, you shirs larcs anverver strepule thunboler muste, thum and cormed sightourd so was rewa her besee pilman

It's not complete gibberish, but not really English either. Just for fun, I wrote a simple Markov chain generator and trained it on the same text with a 4-character state. Here's a sample of its output:

though throughted with to taken as when it diabolice, and intered the stairhead, the stood initions of indeed, as burst, his mr. holmes' room, and now i fellows. the stable. he retails arm

Which, you'll admit, is quite a bit better than our "fancy" deep learning approach! And it was much faster to train too...

To have a better chance of learning a good model, we'll need a more advanced architecture like LSTM. LSTMs employ a bunch of tricks to preserve long-term dependencies through the cells and can learn much better language models. For example, Andrej Karpathy's char-rnn model from the Unreasonable Effectiveness of RNNs post is a multi-layer LSTM, and it can learn fairly nice models for a varied set of domains, ranging from Shakespeare sonnets to C code snippets in the Linux kernel.

Conclusion

The goal of this post wasn't to develop a very good RNN model; rather, it was to explain in detail the math behind training a simple RNN. More advanced RNN architectures like LSTM are somewhat more complicated, but all the core ideas are very similar and this post should be helpful in nailing the basics.

Update: An extension of this post to LSTMs.


[1]

Computing a softmax makes sense because x is encoded with one-hot over a vocabulary-sized vector, meaning there's a 1 in the position of the letter it represents with 0s in all other positions. For example, if we only care about the 26 lower-case alphabet letters, x could be a 26-element vector. To represent 'a' it would have 1 in position 0 and zeros elsewhere; to represent 'd' it would have 1 in position 3 and zeros elsewhere.

The output p here models what the RNN cell thinks the next generated character should be. Using softmax, it would have probabilities for each character in the corresponding position, all of them properly summing up to 1.

[2]

A slightly more technical explanation: to compute the gradient for the error w.r.t. weights in the typical backpropagation flow, we'll need input gradients for p[t] and h[t]. Then, when learning happens we use the measured error and propagate it backwards. But what is the measured error for h[t]? We don't know it before we compute the error of the next iteration, and so on - a bit of a chicken-egg problem.

Unrolling/BPTT helps approximate a solution for this issue. An alternative solution is to use forward-mode gradient propagation instead, with an algorithm called RTRL (Real Time Recurrent Learning). This algorithm works well but has a high computational cost compared to BPTT. I'd love to explore this topic in more depth, as it ties into the difference between forward-mode and reverse-mode auto differentiation. But that would be a topic for another post.

[3]This is similar to convolutional networks, where the convolution filter weights are reused many times when processing a much larger input. In such models the invariance is spatial; in sequence models the invariance is temporal. In fact, space vs. time in models is just a matter of convention, and it turns out that 1D convolutional models perform very well on some sequence tasks!
[4]

An easy way to think about it is to imagine some initial value v, multiplied by another value c many times. We get vc^N for N multiplications. If c is larger than 1, it means the result will keep growing with each multiplication. How quickly will depend on the actual value of c, but this is basically exponential runaway. We actually care about the absolute value of c, of course, since the runaway is equally bad in the positive or negative direction.

Similarly with the absolute value of c smaller than 1, we'll get a "vanishing" effect since the result will keep getting smaller with each iteration.


Backpropagation through a fully-connected layer



The goal of this post is to show the math of backpropagating a derivative for a fully-connected (FC) neural network layer consisting of matrix multiplication and bias addition. I have briefly mentioned this in an earlier post dedicated to Softmax, but here I want to give some more attention to FC layers specifically.

Here is a fully-connected layer for input vectors with N elements, producing output vectors with T elements:

Diagram of a fully connected layer

As a formula, we can write:

\[y=Wx+b\]

Presumably, this layer is part of a network that ends up computing some loss L. We'll assume we already have the derivative of the loss w.r.t. the output of the layer \frac{\partial{L}}{\partial{y}}.

We'll be interested in two other derivatives: \frac{\partial{L}}{\partial{W}} and \frac{\partial{L}}{\partial{b}}.

Jacobians and the chain rule

As a reminder from The Chain Rule of Calculus, we're dealing with functions that map from n dimensions to m dimensions: f:\mathbb{R}^{n} \to \mathbb{R}^{m}. We'll consider the outputs of f to be numbered from 1 to m as f_1,f_2 \dots f_m. For each such f_i we can compute its partial derivative by any of the n inputs as:

\[D_j f_i(a)=\frac{\partial f_i}{\partial a_j}(a)\]

Where j goes from 1 to n and a is a vector with n components. If f is differentiable at a then the derivative of f at a is the Jacobian matrix:

\[Df(a)=\begin{bmatrix} D_1 f_1(a) & \cdots & D_n f_1(a) \\ \vdots &  & \vdots \\ D_1 f_m(a) & \cdots & D_n f_m(a) \\ \end{bmatrix}\]

The multivariate chain rule states: given g:\mathbb{R}^n \to \mathbb{R}^m and f:\mathbb{R}^m \to \mathbb{R}^p and a point a \in \mathbb{R}^n, if g is differentiable at a and f is differentiable at g(a) then the composition f \circ g is differentiable at a and its derivative is:

\[D(f \circ g)(a)=Df(g(a)) \cdot Dg(a)\]

Which is the matrix multiplication of Df(g(a)) and Dg(a).

Back to the fully-connected layer

Circling back to our fully-connected layer, we have the loss L(y) - a scalar function L:\mathbb{R}^{T} \to \mathbb{R}. We also have the function y=Wx+b. If we're interested in the derivative w.r.t the weights, what are the dimensions of this function? Our "variable part" is then W, which has NT elements overall, and the output has T elements, so y:\mathbb{R}^{NT} \to \mathbb{R}^{T} [1].

The chain rule tells us how to compute the derivative of L w.r.t. W:

\[\frac{\partial{L}}{\partial{W}}=D(L \circ y)(W)=DL(y(W)) \cdot Dy(W)\]

Since we're backpropagating, we already know DL(y(W)); because of the dimensionality of the L function, the dimensions of DL(y(W)) are [1,T] (one row, T columns). y(W) has NT inputs and T outputs, so the dimensions of Dy(W) are [T,NT]. Overall, the dimensions of D(L \circ y)(W) are then [1,NT]. This makes sense if you think about it, because as a function of W, the loss has NT inputs and a single scalar output.

What remains is to compute Dy(W), the Jacobian of y w.r.t. W. As mentioned above, it has T rows - one for each output element of y, and NT columns - one for each element in the weight matrix W. Computing such a large Jacobian may seem daunting, but we'll soon see that it's very easy to generalize from a simple example. Let's start with y_1:

\[y_1=\sum_{j=1}^{N}W_{1,j}x_{j}+b_1\]

What's the derivative of this result element w.r.t. each element in W? When the element is in row 1, the derivative is x_j (j being the column of W); when the element is in any other row, the derivative is 0.

Similarly for y_2, we'll have non-zero derivatives only for the second row of W (with the same result of x_j being the derivative for the j-th column), and so on.

Generalizing from the example, if we split the index of W to i and j, we get:

\[D_{ij}y_t=\frac{\partial(\sum_{k=1}^{N}W_{t,k}x_{k}+b_t)}{\partial W_{ij}} = \left\{\begin{matrix} x_j & i = t\\ 0 & i \ne t \end{matrix}\right.\]

This goes into row t, column (i-1)N+j in the Jacobian matrix. Overall, we get the following Jacobian matrix with shape [T,NT]:

\[Dy=\begin{bmatrix} x_1 & x_2 & \cdots & x_N & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & x_1 & x_2 & \cdots & x_N \end{bmatrix}\]

Now we're ready to finally multiply the Jacobians together to complete the chain rule:

\[D(L \circ y)(W)=DL(y(W)) \cdot Dy(W)\]

The left-hand side is this row vector:

\[DL(y)=(\frac{\partial L}{\partial y_1}, \frac{\partial L}{\partial y_2},\cdots,\frac{\partial L}{\partial y_T})\]

And we're multiplying it by the matrix Dy shown above. Each item in the result vector will be a dot product between DL(y) and the corresponding column in the matrix Dy. Since Dy has a single non-zero element in each column, the result is fairly trivial. The first N entries are:

\[\frac{\partial L}{\partial y_1}x_1, \frac{\partial L}{\partial y_1}x_2, \cdots, \frac{\partial L}{\partial y_1}x_N\]

The next N entries are:

\[\frac{\partial L}{\partial y_2}x_1, \frac{\partial L}{\partial y_2}x_2, \cdots, \frac{\partial L}{\partial y_2}x_N\]

And so on, until the last (T-th) set of N entries is all x-es multiplied by \frac{\partial L}{\partial y_T}.

To better see how to apply each derivative to a corresponding element in W, we can "re-roll" this result back into a matrix of shape [T,N]:

\[\frac{\partial{L}}{\partial{W}}=D(L\circ y)(W)=\begin{bmatrix} \frac{\partial L}{\partial y_1}x_1 & \frac{\partial L}{\partial y_1}x_2 & \cdots & \frac{\partial L}{\partial y_1}x_N \\ \\ \frac{\partial L}{\partial y_2}x_1 & \frac{\partial L}{\partial y_2}x_2 & \cdots & \frac{\partial L}{\partial y_2}x_N \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial y_T}x_1 & \frac{\partial L}{\partial y_T}x_2 & \cdots & \frac{\partial L}{\partial y_T}x_N \end{bmatrix}\]

Computational cost and shortcut

While the derivation shown above is complete and mathematically correct, it can also be computationally intensive; in realistic scenarios, the full Jacobian matrix can be really large. For example, let's say our input is a (modestly sized) 128x128 image, so N=16,384. Let's also say that T=100. The weight matrix then has NT=1,638,400 elements; respectably big, but nothing out of the ordinary.

Now consider the size of the full Jacobian matrix: it's T by NT, or over 160 million elements. At 4 bytes per element that's more than half a GiB!

Moreover, to compute every backpropagation we'd be forced to multiply this full Jacobian matrix by a 100-dimensional vector, performing 160 million multiply-and-add operations for the dot products. That's a lot of compute.

But the final result D(L\circ y)(W) is the size of W - 1.6 million elements. Do we really need 160 million computations to get to it? No, because the Jacobian is very sparse - most of it is zeros. And in fact, when we look at the D(L\circ y)(W) found above - it's fairly straightforward to compute using a single multiplication per element.

Moreover, if we stare at the \frac{\partial{L}}{\partial{W}} matrix a bit, we'll notice it has a familiar pattern: this is just the outer product between the vectors \frac{\partial{L}}{\partial{y}} and x:

\[\frac{\partial L}{\partial W}=\frac{\partial L}{\partial y}\otimes x\]

If we have to compute this backpropagation in Python/Numpy, we'll likely write code similar to:

# Assuming dy (gradient of loss w.r.t. y) and x are column vectors, by
# performing a dot product between dy (column) and x.T (row) we get the
# outer product.
dW = np.dot(dy, x.T)

Bias gradient

We've just seen how to compute weight gradients for a fully-connected layer. Computing the gradients for the bias vector is very similar, and a bit simpler.

This is the chain rule equation applied to the bias vector:

\[\frac{\partial{L}}{\partial{b}}=D(L \circ y)(b)=DL(y(b)) \cdot Dy(b)\]

The shapes involved here are: DL(y(b)) is still [1,T], because the number of elements in y remains T. Dy(b) has T inputs (bias elements) and T outputs (y elements), so its shape is [T,T]. Therefore, the shape of the gradient D(L \circ y)(b) is [1,T].

To see how we'd fill the Jacobian matrix Dy(b), let's go back to the formula for y:

\[y_1=\sum_{j=1}^{N}W_{1,j}x_{j}+b_1\]

When derived by anything other than b_1, this would be 0; when derived by b_1 the result is 1. The same applies to every other element of y:

\[\frac{\partial y_i}{\partial b_j}=\left\{\begin{matrix} 1 & i=j \\ 0 & i\neq j \end{matrix}\right.\]

In matrix form, this is just an identity matrix with dimensions [T,T]. Therefore:

\[\frac{\partial{L}}{\partial{b}}=D(L \circ y)(b)=DL(y(b)) \cdot I=DL(y(b))\]

For a given element of b, its gradient is just the corresponding element in \frac{\partial L}{\partial y}.
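In code, mirroring the dW snippet above, the bias gradient is simply a copy of the incoming gradient (assuming dy is the gradient of the loss w.r.t. y as a column vector):

# dy is the gradient of loss w.r.t. y, as a column vector; the bias gradient
# has exactly the same values.
db = np.copy(dy)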

Fully-connected layer for a batch of inputs

The derivation shown above applies to a FC layer with a single input vector x and a single output vector y. When we train models, we almost always try to do so in batches (or mini-batches) to better leverage the parallelism of modern hardware. So a more typical layer computation would be:

\[Y=WX+b\]

Where the shape of X is [N,B]; B is the batch size, typically a not-too-large power of 2, like 32. W and b still have the same shapes, so the shape of Y is [T,B]. Each column in X is a new input vector (for a total of B vectors in a batch); a corresponding column in Y is the output.

As before, given \frac{\partial{L}}{\partial{Y}}, our goal is to find \frac{\partial{L}}{\partial{W}} and \frac{\partial{L}}{\partial{b}}. While the end results are fairly simple and pretty much what you'd expect, I still want to go through the full Jacobian computation to show how to find the gradients in a rigorous way.

Starting with the weights, the chain rule is:

\[\frac{\partial{L}}{\partial{W}}=D(L \circ Y)(W)=DL(Y(W)) \cdot DY(W)\]

The dimensions are:

  • DL(Y(W)): [1,TB] because Y has T outputs for each input vector in the batch.
  • DY(W): [TB,TN] since Y(W) has TB outputs and TN inputs overall.
  • D(L\circ Y)(W): [1,TN] same as in the batch-1 case, because the same weight matrix is used for all inputs in the batch.

Also, we'll use the notation x_{i}^{[b]} to talk about the i-th element in the b-th input vector x (out of a total of B such input vectors).

With this in hand, let's see how the Jacobians look; starting with DL(Y(W)), it's the same as before except that we have to take the batch into account. Each batch element is independent of the others in loss computations, so we'll have:

\[\frac{\partial L}{\partial y_{i}^{[b]}}\]

As the Jacobian element; how do we arrange them in a 1-dimensional vector with shape [1,TB]? We'll just have to agree on a linearization here - same as we did with W before. We'll go for row-major again, so in 1-D the array Y would be:

\[Y=(y_{1}^{[1]},y_{1}^{[2]},\cdots,y_{1}^{[B]}, y_{2}^{[1]},y_{2}^{[2]},\cdots,y_{2}^{[B]},\cdots)\]

And so on for T elements. Therefore, the Jacobian of L w.r.t Y is:

\[\frac{\partial L}{\partial Y}=( \frac{\partial L}{\partial y_{1}^{[1]}}, \frac{\partial L}{\partial y_{1}^{[2]}},\cdots, \frac{\partial L}{\partial y_{1}^{[B]}}, \frac{\partial L}{\partial y_{2}^{[1]}}, \frac{\partial L}{\partial y_{2}^{[2]}},\cdots, \frac{\partial L}{\partial y_{2}^{[B]}},\cdots)\]

To find DY(W), let's first see how to compute Y. The i-th element of Y for batch b is:

\[y_{i}^{[b]}=\sum_{j=1}^{N}W_{i,j}x_{j}^{[b]}+b_i\]

Recall that the Jacobian DY(W) now has shape [TB,TN]. Previously we had to unroll the [T,N] of the weight matrix into the columns. Now we'll also have to unroll the [T,B] of the output into the rows: as before, first all b-s for t=1, then all b-s for t=2, etc. If we carefully compute the derivative, we'll see that the Jacobian matrix has a structure similar to the single-input case, just with each row expanded into B rows, one per batch element:

\[DY(W)=\begin{bmatrix} x_{1}^{[1]} & x_{2}^{[1]} & \cdots & x_{N}^{[1]} & \cdots & 0 & 0 & \cdots & 0 \\ \\ x_{1}^{[2]} & x_{2}^{[2]} & \cdots & x_{N}^{[2]} & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\ x_{1}^{[B]} & x_{2}^{[B]} & \cdots & x_{N}^{[B]} & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & x_{1}^{[1]} & x_{2}^{[1]} & \cdots & x_{N}^{[1]} \\ \\ 0 & 0 & \cdots & 0 & \cdots & x_{1}^{[2]} & x_{2}^{[2]} & \cdots & x_{N}^{[2]} \\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & x_{1}^{[B]} & x_{2}^{[B]} & \cdots & x_{N}^{[B]} \\ \end{bmatrix}\]

Multiplying the two Jacobians together we get the full gradient of L w.r.t. each element in the weight matrix. Where previously (in the non-batch case) we had:

\[\frac{\partial L}{\partial W_{ij}}=\frac{\partial L}{\partial y_i}x_j\]

Now, instead, we'll have:

\[\frac{\partial L}{\partial W_{ij}}=\sum_{b=1}^{B}\frac{\partial L}{\partial y_{i}^{[b]}}x_{j}^{[b]}\]

Which makes total sense, since it's simply taking the loss gradient computed for each batch element separately and adding them up. This aligns with our intuition of how the gradient for a whole batch is computed - compute the gradient for each batch element separately and add up all the gradients [2].

As before, there's a clever way to express the final gradient using matrix operations. Note the sum across all batch elements when computing \frac{\partial L}{\partial W_{ij}}. We can express this as the matrix multiplication:

\[\frac{\partial L}{\partial W}=\frac{\partial L}{\partial Y}\cdot X^T\]
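Assuming \frac{\partial L}{\partial Y} is kept as a [T,B] matrix dY (matching the shape of Y) and X is the [N,B] input batch, this becomes a single matrix multiplication in NumPy, analogous to the outer product in the single-input case:

# dY has shape [T,B] and X has shape [N,B]; the product sums the per-element
# outer products over the batch, giving a [T,N] result - the shape of W.
dW = np.dot(dY, X.T)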

This is a good place to recall the computation cost again. Previously we've seen that for a single-input case, the Jacobian can be extremely large ([T,NT] having about 160 million elements). In the batch case, the Jacobian would be even larger since its shape is [TB,NT]; with a reasonable batch of 32, it's something like 5-billion elements strong. It's good that we don't actually have to hold the full Jacobian in memory and have a shortcut way of computing the gradient.

Bias gradient for a batch

For the bias, we have:

\[\frac{\partial{L}}{\partial{b}}=D(L \circ Y)(b)=DL(Y(b)) \cdot DY(b)\]

DL(Y(b)) here has the shape [1,TB]; DY(b) has the shape [TB,T]. Therefore, the shape of \frac{\partial{L}}{\partial{b}} is [1,T], as before.

From the formula for computing Y:

\[y_1=\sum_{j=1}^{N}W_{1,j}x_{j}+b_1\]

We get, for any batch b:

\[\frac{\partial y_{i}^{[b]}}{\partial b_j}=\left\{\begin{matrix} 1 & i=j \\ 0 & i\neq j \end{matrix}\right.\]

So, whereas DY(b) was an identity matrix in the no-batch case, here it looks like this:

\[DY(b)=\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ 0 & 0 & 0 & \cdots & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ \end{bmatrix}\]

With B identical rows at a time, for a total of TB rows. Since \frac{\partial L}{\partial Y} is the same as before, their matrix multiplication result has this in column j:

\[\frac{\partial{L}}{\partial{b_j}}=\sum_{b=1}^{B}\frac{\partial L}{\partial y_{j}^{[b]}}\]

Which just means adding up the gradient effects from every batch element independently.
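With the same [T,B]-shaped dY as above, this per-element sum over the batch is a one-liner in NumPy:

# Sum the gradient contributions of all batch elements for each bias element;
# keepdims retains the [T,1] column shape of b.
db = np.sum(dY, axis=1, keepdims=True)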

Addendum - gradient w.r.t. x

This post started by explaining that the parameters of a fully-connected layer we're usually looking to optimize are the weight matrix and bias. In most cases this is true; however, in some other cases we're actually interested in propagating a gradient through x - often when there are more layers before the fully-connected layer in question.

Let's find the derivative \frac{\partial{L}}{\partial{x}}. The chain rule here is:

\[\frac{\partial{L}}{\partial{x}}=D(L \circ y)(x)=DL(y(x)) \cdot Dy(x)\]

Dimensions: DL(y(x)) is [1, T] as before; Dy(x) has T outputs (elements of y) and N inputs (elements of x), so its dimensions are [T, N]. Therefore, the dimensions of \frac{\partial{L}}{\partial{x}} are [1, N].

From:

\[y_1=\sum_{j=1}^{N}W_{1,j}x_{j}+b_1\]

We know that \frac{\partial y_1}{\partial x_j}=W_{1,j}. Generalizing this, we get \frac{\partial y_i}{\partial x_j}=W_{i,j}; in other words, Dy(x) is just the weight matrix W. So \frac{\partial{L}}{\partial{x_i}} is the dot product of DL(y(x)) with the i-th column of W.

Computationally, we can express this as follows:

dx = np.dot(dy.T, W).T

Again, recall that our vectors are column vectors. Therefore, to multiply dy from the left by W we have to transpose it to a row vector first. The result of this matrix multiplication is a [1, N] row-vector, so we transpose it again to get a column.

An alternative method to compute this would transpose W rather than dy and then swap the order:

dx = np.dot(W.T, dy)

These two methods produce exactly the same dx; it's important to be familiar with these tricks, because otherwise it may be confusing to see a transposed W when we expect the actual W from gradient computations.


[1]

As explained in the softmax post, we linearize the 2D matrix W into a single vector with NT elements using some approach like row-major, where the N elements of the first row go first, then the N elements of the second row, and so on until we have NT elements for all the rows.

This is a fully general approach as we can linearize any-dimensional arrays. To work with Jacobians, we're interested in K inputs, no matter where they came from - they could be a linearization of a 4D array. As long as we remember which element out of the K corresponds to which original element, we'll be fine.

[2]In some cases you may hear about averaging the gradients across the batch. Averaging just means dividing the sum by B; it's a constant factor that can be consolidated into the learning rate.