wc in x64 assembly

In the past few weeks there has been a resurgence of interest in assembly programming; most likely, this is due to the release of the source code for the Apollo 11 guidance computer on GitHub - a truly awesome repository, if you haven't seen it yet.

This inspired me to dig up a project I did ~3 years ago - a reimplementation of the wc command-line tool in pure x64 assembly. It's been open on GitHub from the start, but I never really mentioned it anywhere. Summoning my best powers of imagination, I named the project... wait for it... wcx64. It's ~400 lines of assembly code (most of which are comments) in the default gas syntax.

As a compiler hacker, and an embedded programmer back in the day, I've done my share of writing and digging in assembly code; I wrote my own assembler, and worked on LLVM's assembler. There's an Assembly tag on this blog with a bunch of assembly-related posts. I wrote "production" assembly for many architectures - from small 8-bit controllers, to x64, to obscure DSPs. However, almost all of this code was self-contained for very specific tasks.

The idea of wcx64 was to understand how realistic programs could be written from start to end, including dealing with the OS, the file system, input-output and so on. It's a nice "code kata" exercise I find useful when exploring new programming languages, because you get to do a lot of things "real" programs do, just confined to a very simple task. Here are some of the interesting things you'll find in the code:

  • Processing command-line arguments (argc and argv).
  • Reading from files and from the standard input and writing to standard output, using system calls.
  • Writing functions that adhere to the x64 ABI calling convention, including passing and returning arguments and callee-saved registers.
  • Fundamentals of string processing: very simple parsing of text using a two-state state machine, converting numbers to strings, etc.
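The two-state word-counting logic from the last bullet is easy to illustrate. Here's a minimal Python sketch of the same idea - wcx64 itself does this in assembly, so this is just an illustration of the logic, not the actual code:

```python
def count_words(data):
    """Count lines, words and bytes in the bytes object data, like wc.

    Uses a two-state machine: we're either inside a word or between
    words, and a new word is counted on each transition from "between"
    to "inside".
    """
    WHITESPACE = b' \t\n\r\v\f'
    lines = words = 0
    in_word = False
    for b in data:
        if b == ord('\n'):
            lines += 1
        if b in WHITESPACE:
            in_word = False
        elif not in_word:
            in_word = True
            words += 1
    return lines, words, len(data)
```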

And some facts about the outcome:

  • wcx64 doesn't need any C runtime/support code to work. It invokes Linux system calls directly, and is completely self-contained.
  • When assembled and linked, the binary size is 6.5 KiB.
  • It's fast! On a couple of samples I tried, it's between 6 and 13 times faster than the command-line wc tool for processing 1 GiB files.

The performance is surprising to me. I didn't expect the difference to be this great. True, the inner loop of wcx64 is tight assembly, but I really didn't spend any time optimizing it, opting for clarity instead. My guess is that the real wc supports more features, like multi-byte characters.

Linear regression

Linear regression is one of the most basic, and yet most useful approaches for predicting a single quantitative (real-valued) variable given any number of real-valued predictors. This article presents the basics of linear regression for the "simple" (single-variable) case, as well as for the more general multivariate case. Companion code in Python implements the techniques described in the article on simulated and realistic data sets. The code is self-contained, using only Numpy as a dependency.

Simple linear regression

The most basic kind of regression problem has a single predictor (the input) and a single outcome. Given a list of input values x_i and corresponding output values y_i, we have to find parameters m and b such that the linear function:

\[\hat{y}(x) = mx + b\]

Is "as close as possible" to the observed outcome y. More concretely, suppose we get this data [1]:

Linear regression input data

We have to find a slope m and intercept b for a line that approximates this data as well as possible. We evaluate how well some pair of m and b approximates the data by defining a "cost function". For linear regression, a good cost function to use is the Mean Square Error (MSE) [2]:

\[\operatorname{MSE}(m, b)=\frac{1}{n}\sum_{i=1}^n(\hat{y_i} - y_i)^2\]

Expanding \hat{y_i}=m{x_i}+b, we get:

\[\operatorname{MSE}(m, b)=\frac{1}{n}\sum_{i=1}^n(m{x_i} + b - y_i)^2\]

Let's turn this into Python code (link to the full code sample):

import numpy as np

def compute_cost(x, y, m, b):
    """Compute the MSE cost of a prediction based on m, b.

    x: inputs vector
    y: observed outputs vector
    m, b: regression parameters

    Returns: a scalar cost.
    """
    yhat = m * x + b
    diff = yhat - y
    # Vectorized computation using a dot product to compute sum of squares.
    cost = np.dot(diff.T, diff) / float(x.shape[0])
    # Cost is a 1x1 matrix, we need a scalar.
    return cost.flat[0]

Now we're faced with a classical optimization problem: we have some parameters (m and b) we can tweak, and some cost function J(m, b) we want to minimize. The topic of mathematical optimization is vast, but what ends up working very well for machine learning is a fairly simple algorithm called gradient descent.

Imagine plotting J(m, b) as a 3-dimensional surface, and picking some random point on it. Our goal is to find the lowest point on the surface, but we have no idea where that is. A reasonable guess is to move a bit "downwards" from our current location, and then repeat.

"Downwards" is exactly what "gradient descent" means. We make a small change to our location (defined by m and b) in the direction in which J(m, b) decreases most - the gradient [3]. We then repeat this process until we reach a minimum, hopefully global. In fact, since the linear regression cost function is convex we will find the global minimum this way. But in the general case this is not guaranteed, and many sophisticated extensions of gradient descent exist that try to avoid local minima and maximize the chance of finding a global one.

Back to our function, \operatorname{MSE}(m, b). The gradient is defined as the vector:

\[\nabla \operatorname{MSE}=\left \langle \frac{\partial \operatorname{MSE}}{\partial m}, \frac{\partial \operatorname{MSE}}{\partial b} \right \rangle\]

To find it, we have to compute the partial derivatives of MSE w.r.t. the learning parameters m and b:

\[\begin{align*} \frac{\partial \operatorname{MSE}}{\partial m}&=\frac{2}{n}\sum_{i=1}^n(m{x_i}+b-y_i)x_i\\ \frac{\partial \operatorname{MSE}}{\partial b}&=\frac{2}{n}\sum_{i=1}^n(m{x_i}+b-y_i) \end{align*}\]

And then update m and b in each step of the learning with:

\[\begin{align*} m &= m-\eta \frac{\partial \operatorname{MSE}}{\partial m} \\ b &= b-\eta \frac{\partial \operatorname{MSE}}{\partial b} \\ \end{align*}\]

Where \eta is a customizable "learning rate", a hyperparameter. Here is the gradient descent loop in Python. Note that we examine the whole data set in every step; for much larger data sets, SGD (Stochastic Gradient Descent) with some reasonable mini-batch would make more sense, but for simple linear regression problems the data size is rarely very big.

def gradient_descent(x, y, nsteps, learning_rate=0.1):
    """Runs gradient descent optimization to fit a line y^ = x * m + b.

    x, y: input data and observed outputs.
    nsteps: how many steps to run the optimization for.
    learning_rate: learning rate of gradient descent.

    Yields 'nsteps + 1' triplets of (m, b, cost) where m, b are the fit
    parameters for the given step, and cost is their cost vs the real y.
    """
    n = x.shape[0]
    # Start with m and b initialized to 0s for the first try.
    m, b = 0, 0
    yield m, b, compute_cost(x, y, m, b)

    for step in range(nsteps):
        yhat = m * x + b
        diff = yhat - y
        dm = learning_rate * (diff * x).sum() * 2 / n
        db = learning_rate * diff.sum() * 2 / n
        m -= dm
        b -= db
        yield m, b, compute_cost(x, y, m, b)

After running this for 30 steps, the optimization converges and the parameters barely change. Here's a 3D plot of the cost as a function of the regression parameters, along with a contour plot of the same function. It's easy to see this function is convex, as expected. This makes finding the global minimum simple, since no matter where we start, the gradient will lead us directly to it.

To help visualize this, I marked the cost for each successive training step on the contour plot - you can see how the algorithm relentlessly converges to the minimum.

Linear regression cost and contour

The final parameters learned by the regression are 2.2775 for m and 6.0028 for b, which is very close to the actual parameters I used to generate this fake data with.

Here's a visualization that shows how the regression line improves progressively during learning:

Regression fit visualization

Evaluating how good the fit is

In statistics, there are many ways to evaluate how good a "fit" some model is on the given data. One of the most popular ones is the r-squared test ("coefficient of determination"). It measures the proportion of the total variance in the output (y) that can be explained by the variation in x:

\[R^2 = 1 - \frac{\sum_{i=1}^n (y_i - (m{x_i} + b))^2}{n\cdot var(y)}\]

This is trivial to translate to code:

def compute_rsquared(x, y, m, b):
    """Compute the coefficient of determination (R^2) of the line
    y = m * x + b with respect to the data in x, y."""
    yhat = m * x + b
    diff = yhat - y
    SE_line = np.dot(diff.T, diff)
    SE_y = len(y) * y.var()
    return 1 - SE_line / SE_y

For our regression results, I get r-squared of 0.76, which isn't too bad. Note that the data is very jittery, so it's natural the regression cannot explain all the variance. As an interesting exercise, try to modify the code that generates the data with different standard deviations for the random noise and see the effect on r-squared.
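For experimenting, data like the sample set described in footnote [1] can be generated along these lines. This is a hypothetical sketch (the actual generation code is in the linked sample); the uniform range for x is my assumption:

```python
import numpy as np

def generate_data(n=100, m=2.25, b=6.0, noise_std=1.5, seed=42):
    """Generate n noisy samples of the line y = m*x + b.

    Gaussian noise with standard deviation noise_std is added to the
    outputs; the x range [0, 10) is an arbitrary choice.
    """
    rng = np.random.RandomState(seed)
    x = rng.uniform(0.0, 10.0, size=n)
    y = m * x + b + rng.normal(scale=noise_std, size=n)
    return x, y
```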

An analytical solution to simple linear regression

Using the equations for the partial derivatives of MSE (shown above) it's possible to find the minimum analytically, without having to resort to a computational procedure (gradient descent). We compare the derivatives to zero:

\[\begin{align*} \frac{\partial \operatorname{MSE}}{\partial m}&=\frac{2}{n}\sum_{i=1}^n(m{x_i}+b-y_i)x_i = 0\\ \frac{\partial \operatorname{MSE}}{\partial b}&=\frac{2}{n}\sum_{i=1}^n(m{x_i}+b-y_i) = 0 \end{align*}\]

And solve for m and b. To make the equations easier to follow, let's introduce a bit of notation. \bar{x} is the mean value of x across all samples. Similarly \bar{y} is the mean value of y. So the sum \sum_{i=1}^n x_i is actually n\bar{x}. Now let's take the second equation from above and see how to simplify it:

\[\begin{align*} \frac{\partial \operatorname{MSE}}{\partial b} &= \frac{2}{n}\sum_{i=1}^n(m{x_i}+b-y_i) \\                                 &= \frac{2}{n}(mn\bar{x}+nb-n\bar{y}) \\                                 &= 2m\bar{x} + 2b - 2\bar{y} = 0 \end{align*}\]

Similarly, for the partial derivative w.r.t. m we can reach:

\[\frac{\partial \operatorname{MSE}}{\partial m}= 2m\overline{x^2} + 2b\bar{x} - 2\overline{xy} = 0\]

In these equations, all quantities except m and b are constant. Solving them for the unknowns m and b, we get [4]:

\[m = \frac{\bar{x}\bar{y} - \overline{xy}}{\bar{x}^2 - \overline{x^2}} \qquad b = \bar{y} - \bar{x}\frac{\bar{x}\bar{y} - \overline{xy}}{\bar{x}^2 - \overline{x^2}}\]

If we plug the data values we have for x and y in these equations, we get 2.2777 for m and 6.0103 for b - almost exactly the values we obtained with regression [5].
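These closed-form expressions translate directly to NumPy. A minimal sketch, using the sample means exactly as in the equations above:

```python
import numpy as np

def analytic_fit(x, y):
    """Compute the least-squares m and b directly from sample means.

    Implements m = (mean(x)*mean(y) - mean(x*y)) /
                   (mean(x)**2 - mean(x**2))
    and        b = mean(y) - m * mean(x).
    """
    xbar, ybar = x.mean(), y.mean()
    m = (xbar * ybar - (x * y).mean()) / (xbar ** 2 - (x ** 2).mean())
    b = ybar - m * xbar
    return m, b
```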

Remember that by comparing the partial derivatives to zero we find a critical point, which is not necessarily a minimum. We can use the second derivative test to find what kind of critical point that is, by computing the Hessian of the cost:

\[H(m, b) = \begin{pmatrix} \operatorname{MSE}_{mm}(x, y) & \operatorname{MSE}_{mb}(x, y) \\ \operatorname{MSE}_{bm}(x, y) & \operatorname{MSE}_{bb}(x, y) \end{pmatrix}\]
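For the MSE cost these second partials are easy to compute from the first-order partials shown earlier: \operatorname{MSE}_{mm}=2\overline{x^2}, \operatorname{MSE}_{mb}=\operatorname{MSE}_{bm}=2\bar{x} and \operatorname{MSE}_{bb}=2, so:

\[\det H = 2\overline{x^2}\cdot 2 - (2\bar{x})^2 = 4(\overline{x^2} - \bar{x}^2) = 4\,var(x)\]

Since the variance is positive for any non-constant x, and \operatorname{MSE}_{mm}=2\overline{x^2}>0, the test classifies the critical point as a minimum.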

Plugging the numbers and running the test, we can indeed verify that the critical point is a minimum.

Multiple linear regression

The good thing about simple regression is that it's easy to visualize. The model is trained using just two parameters, and visualizing the cost as a function of these two parameters is possible since we get a 3D plot. Anything beyond that becomes increasingly more difficult to visualize.

In simple linear regression, every x is just a number; so is every y. In multiple linear regression this is no longer so, and each data point x is a vector. The model parameters can also be represented by the vector \theta. To avoid confusion of indices and subscripts, let's agree that we use subscripts to denote components of vectors, while parenthesized superscripts are used to denote different samples. So x_1^{(6)} is the second component of sample 6.

Our goal is to find the vector \theta such that the linear function:

\[\hat{y}(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n\]

Is as close as possible to the actual y across all samples. Since working with vectors is easier for this problem, we define x_0 to always be equal to 1, so that the first term in the equation above denotes the intercept. Expressing the regression coefficients as a vector:

\[\begin{pmatrix} \theta_0\\ \theta_1\\ \vdots\\ \theta_n \end{pmatrix}\in\mathbb{R}^{n+1}\]

We can now rewrite \hat{y}(x) as:

\[\hat{y}(x) = \theta^T x\]

Where both \theta and x are column vectors with n+1 elements, as shown above. The mean square error (over k samples) now becomes:

\[\operatorname{MSE}=\frac{1}{k}\sum_{i=1}^k(\hat{y}(x^{(i)}) - y^{(i)})^2\]

Now we have to find the partial derivative of this cost by each \theta. Using the chain rule, it's easy to see that:

\[\frac{\partial \operatorname{MSE}}{\partial \theta_j} = \frac{2}{k}\sum_{i=1}^k(\hat{y}(x^{(i)}) - y^{(i)})x_j^{(i)}\]

And use this to update the parameters in every training step. The code is actually not much different from the simple regression case; here is a well documented, completely worked out example. The code takes a realistic dataset from the UCI machine learning repository with 4 predictors and a single outcome and builds a regression model. 4 predictors plus one intercept give us a 5-dimensional \theta, which is utterly impossible to visualize, so we have to stick to math in order to analyze it.
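Vectorized with NumPy, a training step that updates all the \theta components at once might look like this sketch (the variable names are mine, not the linked example's; X is the k-by-(n+1) data matrix with a leading column of 1s):

```python
import numpy as np

def gradient_descent_multi(X, y, nsteps, learning_rate=0.1):
    """Gradient descent for multiple linear regression (a sketch).

    X: (k, n+1) data matrix whose first column is all 1s, so that
       theta[0] acts as the intercept.
    y: (k,) vector of observed outputs.
    Returns the (n+1,) parameter vector theta after nsteps updates.
    """
    k = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(nsteps):
        diff = X.dot(theta) - y              # (k,) residuals
        gradient = 2.0 / k * X.T.dot(diff)   # all partials at once
        theta -= learning_rate * gradient
    return theta
```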

An analytical solution to multiple linear regression

Multiple linear regression also has an analytical solution. If we compute the derivative of the cost by each \theta, we'll end up with n+1 equations with the same number of variables, which we can solve analytically.

An elegant matrix formula that computes \theta from X and y is called the Normal Equation:

\[\theta = (X^TX)^{-1}X^Ty\]
I've written about deriving the normal equation previously, so I won't spend more time on it. The accompanying code computes \theta using the normal equation and compares the result with the \theta obtained from gradient descent.
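In NumPy, the normal equation can be sketched as follows; note that solving the linear system is preferable to explicitly computing the inverse:

```python
import numpy as np

def normal_equation(X, y):
    """Find theta analytically: theta = (X^T X)^{-1} X^T y.

    Uses np.linalg.solve on the system (X^T X) theta = X^T y instead
    of inverting X^T X explicitly, which is faster and numerically
    more stable.
    """
    return np.linalg.solve(X.T.dot(X), X.T.dot(y))
```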

As an exercise, you can double-check that the analytical solution for simple linear regression (the formulae for m and b) is just a special case of applying the normal equation in two dimensions.

You may wonder: when should we use the analytical solution, and when is gradient descent better? In general, whenever we can use the analytical solution - we should. But it's not always computationally feasible.

Consider a data set with k samples and n features. Then X is a k x n matrix, and hence X^TX is an n x n matrix. Inverting a matrix is an O(n^3) operation, so for large n, finding (X^TX)^{-1} can take quite a bit of time. Moreover, keeping X^TX in memory can be infeasible if X is huge and sparse while X^TX is dense. In all these cases, iterative gradient descent is a more feasible approach.

In addition, the moment we deviate from the linear regression a bit, such as adding nonlinear terms, regularization, or some other model enhancement, the analytical solutions no longer apply. Gradient descent keeps working just the same, however, as long as we know how to compute the gradient of the new cost function.

[1]This data was generated using a slope of 2.25, an intercept of 6, and added Gaussian noise with a standard deviation of 1.5.
[2]Some resources use SSE - the Squared Sum Error, which is just the MSE without the averaging. Yet others have 2n in the denominator to make the gradient derivation cleaner. None of this really matters in practice. When finding the minimum analytically, we compare derivatives to zero so constant factors cancel out. When running gradient descent, all constant factors are subsumed into the learning rate which is arbitrary.
[3]For a mathematical justification for why the gradient leads us in the direction of most change, see this post.
[4]An alternative way I've seen this equation written is to express m as:
\[\begin{align*} m &= \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2} \\   &= \frac{cov(x, y)}{var(x)} \end{align*}\]
[5]Can you figure out why even the analytical solution is a little off from the actual parameters used to generate this data?

Understanding gradient descent

Gradient descent is a standard tool for optimizing complex functions iteratively within a computer program. Its goal is: given some arbitrary function, find a minimum. For some small subset of functions - those that are convex - there's just a single minimum which also happens to be global. For most realistic functions, there may be many minima, so most minima are local. Making sure the optimization finds the "best" minimum and doesn't get stuck in sub-optimal minima is out of the scope of this article. Here we'll just be dealing with the core gradient descent algorithm for finding some minimum from a given starting point.

The main premise of gradient descent is: given some current location x in the search space (the domain of the optimized function) we ought to update x for the next step in the direction opposite to the gradient of the function computed at x. But why is this the case? The aim of this article is to explain why, mathematically.

This is also the place for a disclaimer: the examples used throughout the article are trivial, low-dimensional, convex functions. We don't really need an algorithmic procedure to find their global minimum - a quick computation would do, or really just eyeballing the function's plot. In reality we will be dealing with non-linear, 1000-dimensional functions where it's utterly impossible to visualize anything, or solve anything analytically. The approach works just the same there, however.

Building intuition with single-variable functions

The gradient is formally defined for multivariate functions. However, to start building intuition, it's useful to begin with the two-dimensional case, a single-variable function f(x).

In single-variable functions, the simple derivative plays the role of a gradient. So "gradient descent" would really be "derivative descent"; let's see what that means.

As an example, let's take the function f(x)=(x-1)^2. Here's its plot, in red:

Plot of parabola with tangent lines

I marked a couple of points on the plot, in blue, and drew the tangents to the function at these points. Remember, our goal is to find the minimum of the function. To do that, we'll start with a guess for an x, and continuously update it to improve our guess based on some computation. How do we know how to update x? The update has only two possible directions: increase x or decrease x. We have to decide which of the two directions to take.

We do that based on the derivative of f(x). The derivative at some point x_0 is defined as the limit [1]:

\[\frac{d}{dx}f(x_0)=\lim_{h \to 0}\frac{f(x_0+h)-f(x_0)}{h}\]

Intuitively, this tells us what happens to f(x) when we add a very small value to x. For example in the plot above, at x_0=3 we have:

\[\begin{align*} \frac{d}{dx}f(3)&=\lim_{h \to 0}\frac{f(3+h)-f(3)}{h} \\                 &=\lim_{h \to 0}\frac{(3+h-1)^2-(3-1)^2}{h} \\                 &=\lim_{h \to 0}\frac{h^2+4h}{h} \\                 &=\lim_{h \to 0}h+4=4 \end{align*}\]

This means that the slope of f(x) at x_0=3 is 4; for a very small positive change h to x at that point, the value of f(x) will increase by approximately 4h. Therefore, to get closer to the minimum of f(x) we should rather decrease x_0 a bit.

Let's take another example point, x_0=-1. At that point, if we add a little bit to x_0, f(x) will decrease by about 4 times that little bit, since the derivative there is -4. So that's exactly what we should do to get closer to the minimum.

It turns out that in both cases, we should nudge x_0 in the direction opposite to the derivative at x_0. That's the most basic idea behind gradient descent - the derivative shows us the way to the minimum; or rather, it shows us the way to the maximum and we then go in the opposite direction. Given some initial guess x_0, the next guess will be:

\[x_1 = x_0 - \eta \frac{d}{dx}f(x_0)\]
Where \eta is what we call a "learning rate", and is constant for each given update. It's the reason why we don't care much about the magnitude of the derivative at x_0, only its direction. In general, it makes sense to keep the learning rate fairly small so we only make a tiny step at a time. This makes sense mathematically, because the derivative at a point is defined as the rate of change of f(x) assuming an infinitesimal change in x. For a large change in x, who knows where we'll end up. It's easy to imagine cases where we'll entirely overshoot the minimum by making too large a step [2].
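The whole single-variable procedure fits in a few lines of Python. A sketch, using our example f(x)=(x-1)^2 whose derivative is 2(x-1):

```python
def gradient_descent_1d(df, x0, learning_rate=0.1, nsteps=100):
    """Minimize a single-variable function given its derivative df.

    Repeatedly steps against the derivative, scaled by the learning
    rate, starting from the initial guess x0.
    """
    x = x0
    for _ in range(nsteps):
        x -= learning_rate * df(x)
    return x

# For f(x) = (x - 1)^2, starting from x0 = 3 the iterates converge
# to the minimum at x = 1.
xmin = gradient_descent_1d(lambda x: 2 * (x - 1), x0=3.0)
```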

Multivariate functions and directional derivatives

With functions of multiple variables, derivatives become more interesting. We can't just say "the derivative points to where the function is increasing", because... which derivative?

Recall the formal definition of the derivative as the limit for a small step h. When our function has many variables, which one should have the step added? One at a time? All at once? In multivariate calculus, we use partial derivatives as building blocks. Let's use a function of two variables - f(x,y) as an example throughout this section, and define the partial derivatives w.r.t. x and y at some point (x_0,y_0):

\[\begin{align*} \frac{\partial }{\partial x}f(x_0,y_0)&=\lim_{h \to 0}\frac{f(x_0+h,y_0)-f(x_0,y_0)}{h} \\ \frac{\partial }{\partial y}f(x_0,y_0)&=\lim_{h \to 0}\frac{f(x_0,y_0+h)-f(x_0,y_0)}{h} \end{align*}\]

When we have a single-variable function f(x), there are really only two directions in which we can move from a given point x_0 - left (decrease x) or right (increase x). With two variables, the number of possible directions is infinite, because we pick a direction to move on a 2D plane. Hopefully this immediately makes "vectors" pop up in your head, since vectors are the perfect tool to deal with such problems. We can represent the change from the point (x_0,y_0) as the vector \vec{v}=\langle a,b \rangle [3]. The directional derivative of f(x,y) along \vec{v} at (x_0,y_0) is defined as its rate of change in the direction of the vector at that point. Mathematically, it's defined as:

\[\begin{equation} D_{\vec{v}}f(x_0,y_0)=\lim_{h \to 0}\frac{f(x_0+ah,y_0+bh)-f(x_0,y_0)}{h} \tag{1} \end{equation}\]

The partial derivatives w.r.t. x and y can be seen as special cases of this definition. The partial derivative \frac{\partial f}{\partial x} is just the directional derivative in the direction of the x axis. In vector-speak, this is the directional derivative for \vec{v}=\langle a,b \rangle=\widehat{e_x}=\langle 1,0 \rangle, the standard basis vector for x. Just plug a=1,b=0 into (1) to see why. Similarly, the partial derivative \frac{\partial f}{\partial y} is the directional derivative in the direction of the standard basis vector \widehat{e_y}=\langle 0,1 \rangle.

A visual interlude

Functions of two variables f(x,y) are the last frontier for meaningful visualizations, for which we need 3D to plot the value of f for each given x and y. Even in 3D, visualizing gradients is significantly harder than in 2D, and yet we have to try since for anything above two variables all hopes of visualization are lost.

Here's the function f(x,y)=x^2+y^2 plotted in a small range around zero. I drew the standard basis vectors \widehat{x}=\widehat{e_x} and \widehat{y}=\widehat{e_y} [4] and some combination of them \vec{v}.

3D parabola with direction vector markers

I also marked the point on f(x,y) where the vectors are based. The goal is to help us keep in mind how the independent variables x and y change, and how that affects f(x,y). We change x and y by adding some small vector \vec{v} to their current value. The result is "nudging" f(x,y) in the direction of \vec{v}. Remember our goal for this article - find \vec{v} such that this "nudge" gets us closer to a minimum.

Finding directional derivatives using the gradient

As we've seen, the derivative of f(x,y) in the direction of \vec{v} is defined as:

\[D_{\vec{v}}f(x_0,y_0)=\lim_{h \to 0}\frac{f(x_0+ah,y_0+bh)-f(x_0,y_0)}{h}\]

Looking at the 3D plot above, this is how much the value of f(x,y) changes when we add \vec{v} to the vector \langle x_0,y_0 \rangle. But how do we compute it? This limit definition doesn't lend itself to easy analytical manipulation for arbitrary functions. Sure, we could plug (x_0,y_0) and \vec{v} in there and do the computation, but it would be nice to have an easier-to-use formula. Luckily, with the help of the gradient of f(x,y) it becomes much easier.

The gradient is a vector value we compute from a scalar function. It's defined as:

\[\nabla f=\left \langle \frac{\partial f}{\partial x},\frac{\partial f}{\partial y} \right \rangle\]

It turns out that given a vector \vec{v}, the directional derivative D_{\vec{v}}f can be expressed as the following dot product:

\[D_{\vec{v}}f=(\nabla f) \cdot \vec{v}\]

If this looks like a mental leap too big to trust, please read the Appendix section at the bottom. Otherwise, feel free to verify that the two are equivalent with a couple of examples. For instance, try to find the derivative in the direction of \vec{v}=\langle \frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}} \rangle at (x_0,y_0)=(-1.5,0.25). You should get \frac{-2.5}{\sqrt{2}} using both methods.
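The suggested verification is easy to do numerically as well. A sketch that compares the dot-product formula with the limit definition for f(x,y)=x^2+y^2, whose gradient is \langle 2x,2y \rangle:

```python
import numpy as np

def directional_derivative_limit(f, p, v, h=1e-6):
    """Approximate the directional derivative of f at point p along v
    using the limit definition with a small finite step h."""
    p = np.asarray(p, dtype=float)
    v = np.asarray(v, dtype=float)
    return (f(p + h * v) - f(p)) / h

f = lambda p: p[0] ** 2 + p[1] ** 2
p = np.array([-1.5, 0.25])
v = np.array([1.0, 1.0]) / np.sqrt(2)   # normalized direction

via_gradient = np.array([2 * p[0], 2 * p[1]]).dot(v)   # (grad f) . v
via_limit = directional_derivative_limit(f, p, v)
# Both agree with -2.5 / sqrt(2), up to the finite-difference error.
```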

Direction of maximal change

We're almost there! Now that we have a relatively simple way of computing any directional derivative from the partial derivatives of a function, we can figure out which direction to take to get the maximal change in the value of f(x,y).

We can rewrite:

\[D_{\vec{v}}f=(\nabla f) \cdot \vec{v}\]

Using the geometric interpretation of the dot product, as:

\[D_{\vec{v}}f=\left \| \nabla f \right \| \left \| \vec{v} \right \| \cos(\theta)\]

Where \theta is the angle between the two vectors. Now, recall that \vec{v} is normalized so its magnitude is 1. Therefore, we only care about the direction of \vec{v} w.r.t. the gradient. When is this equation maximized? When \theta=0, because then cos(\theta)=1. Since a cosine can never be larger than 1, that's the best we can have.

So \theta=0 gives us the largest positive change in f(x,y). To get \theta=0, \vec{v} has to point in the same direction as the gradient. Similarly, for \theta=180^{\circ} we get cos(\theta)=-1 and therefore the largest negative change in f(x,y). So if we want to decrease f(x,y) the most, \vec{v} has to point in the opposite direction of the gradient.

Gradient descent update for multivariate functions

To sum up, given some starting point (x_0,y_0), to nudge it in the direction of the minimum of f(x,y), we first compute the gradient of f(x,y) at (x_0,y_0). Then, we update (using vector notation):

\[\langle x_1,y_1 \rangle=\langle x_0,y_0 \rangle-\eta \nabla{f(x_0,y_0)}\]

Generalizing to multiple dimensions, let's say we have the function f:\mathbb{R}^n\rightarrow \mathbb{R} taking the n-dimensional vector \vec{x}=(x_1,x_2 \dots ,x_n). We define the gradient update at step k to be:

\[\vec{x}_{(k)}=\vec{x}_{(k-1)} - \eta \nabla{f(\vec{x}_{(k-1)})}\]
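This update rule is all we need for a generic implementation. A sketch that takes the gradient as a function, demonstrated on f(x,y)=x^2+y^2 from the plots above:

```python
import numpy as np

def gradient_descent_nd(grad_f, x0, learning_rate=0.1, nsteps=200):
    """Generic n-dimensional gradient descent.

    grad_f: function mapping an n-vector to its gradient n-vector.
    Returns the point reached after nsteps updates from x0.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(nsteps):
        x = x - learning_rate * grad_f(x)
    return x

# f(x, y) = x^2 + y^2 has gradient <2x, 2y> and its minimum at the
# origin; starting from (3, -2) the iterates converge there.
xmin = gradient_descent_nd(lambda p: 2 * p, x0=[3.0, -2.0])
```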

Previously, for the single-variable case we said that the derivative points the way to the minimum. Now we can say that while there are many ways to get to the minimum (eventually), the gradient points us to the fastest way down from any given point.

Appendix: directional derivative definition and gradient

This is an optional section for those who don't like taking mathematical statements for granted. Now it's time to prove the equation shown earlier in the article, and on which its main result is based:

\[D_{\vec{v}}f=(\nabla f) \cdot \vec{v}\]

As usual with proofs, it really helps to start by working through an example or two to build up some intuition into why the equation works. Feel free to do that if you'd like, using any function, starting point and direction vector \vec{v}.

Suppose we define a function w(t) as follows:

\[w(t)=f(x(t),y(t))\]

Where x=x(t) and y=y(t) are defined as:

\[\begin{align*} x(t)&=x_0+at \\ y(t)&=y_0+bt \end{align*}\]

In these definitions, x_0, y_0, a and b are constants, so both x(t) and y(t) are truly functions of a single variable. Using the chain rule, we know that:

\[\frac{dw}{dt}=\frac{\partial f}{\partial x}\frac{dx}{dt}+\frac{\partial f}{\partial y}\frac{dy}{dt}\]

Substituting the derivatives of x(t) and y(t), we get:

\[\frac{dw}{dt}=a\frac{\partial f}{\partial x}+b\frac{\partial f}{\partial y}\]

One more step, the significance of which will become clear shortly. Specifically, the derivative of w(t) at t=0 is:

\[\begin{equation} \frac{d}{dt}w(0)=a\frac{\partial}{\partial x}f(x_0,y_0)+b\frac{\partial}{\partial y}f(x_0,y_0) \tag{2} \end{equation}\]

Now let's see how to compute the derivative of w(t) at t=0 using the formal limit definition:

\[\begin{align*} \frac{d}{dt}w(0)&=\lim_{h \to 0}\frac{w(h)-w(0)}{h} \\                 &=\lim_{h \to 0}\frac{f(x_0+ah,y_0+bh)-f(x_0,y_0)}{h} \end{align*}\]

But the latter is precisely the definition of the directional derivative in equation (1). Therefore, we can say that:

\[\frac{d}{dt}w(0)=D_{\vec{v}}f(x_0,y_0)\]

From this and (2), we get:

\[\frac{d}{dt}w(0)=D_{\vec{v}}f(x_0,y_0)=a\frac{\partial}{\partial x}f(x_0,y_0)+b\frac{\partial}{\partial y}f(x_0,y_0)\]

This derivation is not special to the point (x_0,y_0) - it works just as well for any point where f(x,y) has partial derivatives w.r.t. x and y; therefore, for any point (x,y) where f(x,y) is differentiable:

\[\begin{align*} D_{\vec{v}}f(x,y)&=a\frac{\partial}{\partial x}f(x,y)+b\frac{\partial}{\partial y}f(x,y) \\                      &=\left \langle \frac{\partial f}{\partial x},\frac{\partial f}{\partial y} \right \rangle \cdot \langle a,b \rangle \\                      &=(\nabla f) \cdot \vec{v} \qedhere \end{align*}\]
[1]The notation \frac{d}{dx}f(x_0) means: the value of the derivative of f w.r.t. x, evaluated at x_0. Another way to say the same would be f{}'(x_0).
[2]That said, in some advanced variations of gradient descent we actually want to probe different areas of the function early on in the process, so a larger step makes sense (remember, realistic functions have many local minima and we want to find the best one). Further along in the optimization process, when we've settled on a general area of the function we want the learning rate to be small so we actually get to the minimum. This approach is called annealing and I'll leave it for some future article.
[3]To avoid tracking vector magnitudes, from now on in the article we'll be dealing with normalized direction vectors. That is, we always assume that \left \| \vec{v}  \right \|=1.
[4]Yes, \widehat{y} is actually going in the opposite direction so it's -\widehat{e_y}, but that really doesn't change anything. It was easier to draw :)