<h1>Diffie-Hellman Key Exchange</h1> <p>Eli Bendersky's website - Math; published 2019-10-21 by Eli Bendersky</p> <p>This post presents the Diffie-Hellman Key Exchange (DHKE) - an important part of today's practical cryptography. Whenever you're accessing an HTTPS website, it's very likely that your browser and the server negotiated a shared secret key using the DHKE under the hood.</p> <div class="section" id="mathematical-prerequisites"> <h2>Mathematical prerequisites</h2> <p>To understand the math behind DHKE, you should be familiar with basic <em>group theory</em>. 
A group is a set with a binary operation such that combining any two elements of the set produces another element of the set (closure); the operation is associative; the set has an identity element w.r.t. the operation; and each element has an inverse.</p> <p>The group we're most interested in for the sake of understanding Diffie-Hellman is <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/a81e9fa41d32a290c7b1cbd52a99fc08c82f2f7d.svg" style="height: 20px;" type="image/svg+xml">\mathbb{Z}_{p}^{*}</object> - the positive integers smaller than <em>p</em> that are relatively prime to <em>p</em>, with the &quot;multiplication modulo <em>p</em>&quot; operation (another common notation for this group is <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/b920f08f0a72a94bef16c469929c229b6c28e0dc.svg" style="height: 19px;" type="image/svg+xml">(\mathbb{Z}/p\mathbb{Z})^*</object>). This is a finite group. By definition, its cardinality is <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/3cc2f030099018851b9b711164207baa2252eda4.svg" style="height: 18px;" type="image/svg+xml">\phi(p)</object>, where <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/411e715f9ab9075b0a30b4117d209921f0bc2389.svg" style="height: 16px;" type="image/svg+xml">\phi</object> is <a class="reference external" href="https://en.wikipedia.org/wiki/Euler%27s_totient_function">Euler's totient function</a>.</p> <p>As an example, <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/3a731309e0a13f0d6fbef6f970c497bdd912fce0.svg" style="height: 18px;" type="image/svg+xml">\mathbb{Z}_{9}^{*}=\{1,2,4,5,7,8\}</object>. The cardinality of this group is <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/881fe91696b6af703fd166b2d6d78340b7a8bd1b.svg" style="height: 18px;" type="image/svg+xml">\phi(9)=6</object>. 
We can multiply members of the group modulo 9 to get other elements of the group: <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/813ae271ff0b69949e40bc3ce03e564c1be0ba99.svg" style="height: 18px;" type="image/svg+xml">2*5\equiv 1\pmod 9</object>, <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/876cfeae5153e11446596a97418dc9a18a4d942c.svg" style="height: 18px;" type="image/svg+xml">8*4\equiv 5\pmod 9</object> etc.</p> <p>For a prime <em>p</em>, the group contains all the integers from 1 to <em>p-1</em> and its cardinality is <em>p-1</em>.</p> <div class="section" id="cyclic-groups"> <h3>Cyclic groups</h3> <p>Given a group <em>G</em> with the operator <object class="valign-m2" data="https://eli.thegreenplace.net/images/math/c8e2d1a0bf50a27d43ade30cfb048d99feb31ad1.svg" style="height: 13px;" type="image/svg+xml">\odot</object>, we define the <strong>order</strong> of an element <em>g</em> in the group - <em>ord(g)</em> - as the smallest positive integer <em>k</em> such that:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/341f8d88e088dd35dda6183228c753ad3fc18e76.svg" style="height: 44px;" type="image/svg+xml"> $g^k=\underbrace{g\odot g\odot\cdots\odot g}_{k \ times}=1$</object> <p>Where 1 is the identity element of <em>G</em>. Note that we use the exponent notation <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/babd5875bef8d7f6c6ca6b17c626406191ddffdc.svg" style="height: 19px;" type="image/svg+xml">g^k</object> for convenience, even though <object class="valign-m2" data="https://eli.thegreenplace.net/images/math/c8e2d1a0bf50a27d43ade30cfb048d99feb31ad1.svg" style="height: 13px;" type="image/svg+xml">\odot</object> is not necessarily a multiplication - this would work for any group. 
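</p> <p>In <em>Z<sub>n</sub><sup>*</sup></em> the order of an element can be found by direct iteration - apply the operation until we hit the identity. A small Go sketch (the <tt>ord</tt> helper is mine):</p>

```go
package main

import "fmt"

// ord computes the order of g in the multiplicative group modulo n, by
// repeatedly multiplying by g (mod n) until reaching the identity element 1.
// It assumes g is a member of the group (relatively prime to n).
func ord(g, n int) int {
	k, x := 1, g%n
	for x != 1 {
		x = (x * g) % n
		k++
	}
	return k
}

func main() {
	fmt.Println(ord(8, 9), ord(2, 9)) // 2 6
}
```

<p>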
For example, in the group <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/010274226559b48ad047330d3647fbb26e0775ff.svg" style="height: 18px;" type="image/svg+xml">\mathbb{Z}_{9}^{*}</object> shown above, <em>ord(8)</em> is 2, and <em>ord(2)</em> is 6.</p> <p>A group <em>G</em> which contains an element <em>a</em> with the maximal order <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/12fcff68777f4c028a8f53980840e3cc807823f8.svg" style="height: 18px;" type="image/svg+xml">ord(a)=\left|G\right|</object> is called a <strong>cyclic group</strong>. Elements in a cyclic group that have maximal orders are called <em>generators</em> or <em>primitive elements</em>.</p> <p>These elements can generate all the other elements of the group by repeated application of the group operation. In other words, given a generator <em>g</em>, every <object class="valign-m1" data="https://eli.thegreenplace.net/images/math/4cbe37e25ff6e34b50a2ef01190bc26af1cc355e.svg" style="height: 13px;" type="image/svg+xml">a\in G</object> can be expressed as <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/babd5875bef8d7f6c6ca6b17c626406191ddffdc.svg" style="height: 19px;" type="image/svg+xml">g^k</object> for some <em>k</em>.</p> <p>For example, <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/010274226559b48ad047330d3647fbb26e0775ff.svg" style="height: 18px;" type="image/svg+xml">\mathbb{Z}_{9}^{*}</object> is cyclic and its primitive elements are 2, 5 and 8.</p> <p>It can be shown that for a prime <em>p</em>, the group <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/a81e9fa41d32a290c7b1cbd52a99fc08c82f2f7d.svg" style="height: 20px;" type="image/svg+xml">\mathbb{Z}_{p}^{*}</object> is always cyclic and has <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f126122f73357701354be89d15c15b25b1b7138b.svg" style="height: 18px;" 
type="image/svg+xml">\phi(p-1)</object> primitive elements, though there's no easy way to find them - we just have to test them one by one. The proof of this theorem is quite technical, so I'll leave it for another time.</p> </div> </div> <div class="section" id="the-discrete-logarithm-problem"> <h2>The Discrete Logarithm Problem</h2> <p>The mathematical problem at the heart of the DHKE is the Discrete Logarithm Problem (DLP). In this discussion I'm going to focus on the DLP in the multiplicative group of integers modulo a prime - <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/995f98c769b056e41fda04bafc1efc23710e5494.svg" style="height: 20px;" type="image/svg+xml">\mathbb{Z}^{*}_{p}</object>, and will mention the general DLP later on.</p> <p>Given a finite cyclic group <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/995f98c769b056e41fda04bafc1efc23710e5494.svg" style="height: 20px;" type="image/svg+xml">\mathbb{Z}^{*}_{p}</object> with a prime <em>p</em>, a primitive element <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/0b593ff95f677d3978c68181cc89fa85ea8a335f.svg" style="height: 20px;" type="image/svg+xml">g \in \mathbb{Z}^{*}_{p}</object> and another element <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/f84aeb9463cdde34dc0820f13d960c491a9580b2.svg" style="height: 20px;" type="image/svg+xml">b \in \mathbb{Z}^{*}_{p}</object>, the DLP is to find an integer <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/8e8de33869b2c4646c4854a54f69ad6252ff2ce5.svg" style="height: 16px;" type="image/svg+xml">1\le x\le p-1</object> such that:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/4f0b4b76b5a484549776dcba485bc0eb22fd7df6.svg" style="height: 18px;" type="image/svg+xml"> $g^x\equiv b\pmod{p}$</object> <p>We've seen earlier that such an integer must exist because <em>g</em> is a primitive element of the 
group.</p> <p>The DLP is hard - no one knows how to solve it efficiently. This doesn't mean that an efficient solution doesn't exist - no one has proven that one can't exist. In this, DLP is similar to factoring, which is essential for the security of <a class="reference external" href="http://eli.thegreenplace.net/2019/rsa-theory-and-implementation/">RSA</a>.</p> </div> <div class="section" id="diffie-hellman-key-exchange-dhke"> <h2>Diffie-Hellman Key Exchange (DHKE)</h2> <p>The protocol starts with a <em>setup stage</em>, where the two parties agree on the parameters <em>p</em> and <em>g</em> to be used in the rest of the protocol. These parameters can be entirely public, and are specified in RFCs, such as <a class="reference external" href="https://tools.ietf.org/html/rfc7919">RFC 7919</a>.</p> <p>For the main key exchange protocol, let's assume that Alice and Bob want to compute a shared secret they could later use to send encrypted messages to one another. They know <em>p</em> and <em>g</em> already.</p> <p><strong>Stage 1</strong></p> <p>Alice does:</p> <ul class="simple"> <li>Choose a random <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/880a56359b587e5fddfd05454524bbf400890014.svg" style="height: 18px;" type="image/svg+xml">b_{alice}\in\{{2,\dots,p-2}\}</object></li> <li>Compute <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/65855958b897b94d96926cccf6952ccb10fba4d5.svg" style="height: 19px;" type="image/svg+xml">B_{alice}\equiv g^{b_{alice}} \mod p</object></li> </ul> <p>Bob does:</p> <ul class="simple"> <li>Choose a random <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/36b27e8cd01cfb0b07d3c99ce54b97243b4efe64.svg" style="height: 18px;" type="image/svg+xml">b_{bob}\in\{{2,\dots,p-2}\}</object></li> <li>Compute <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f4abf9a31bcb3552e85963ca5e1a337b84021e4d.svg" style="height: 19px;" type="image/svg+xml">B_{bob}\equiv 
g^{b_{bob}} \mod p</object></li> </ul> <p>These <em>B</em>s are Alice's and Bob's public keys, while <em>b</em>s are their private keys. Note that due to the DLP, it's hard to compute <em>b</em> from <em>B</em>.</p> <p><strong>Stage 2</strong></p> <p>Alice sends <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/c9c2ee1b6300cddd985e5f9035d40a2f631e436d.svg" style="height: 15px;" type="image/svg+xml">B_{alice}</object> to Bob, while Bob sends <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/246a2e8c5c2d22c9dcbe80604458c2d1b5bcce67.svg" style="height: 15px;" type="image/svg+xml">B_{bob}</object> to Alice.</p> <p><strong>Stage 3</strong></p> <p>Now, Bob can compute <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/e3c06d2a162b0932932d387c54d467c973e20176.svg" style="height: 23px;" type="image/svg+xml">B_{alice}^{b_{bob}}\equiv (g^{b_{alice}})^{b_{bob}}\equiv g^{b_{alice}b_{bob}}\mod p</object>.</p> <p>Alice can compute <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/f87a735ee6b10b0ebbe5d09f09d388d3c0b854bf.svg" style="height: 23px;" type="image/svg+xml">B_{bob}^{b_{alice}}\equiv (g^{b_{bob}})^{b_{alice}}\equiv g^{b_{bob}b_{alice}}\mod p</object>.</p> <p>These are equal, and serve as a shared key between Alice and Bob. They can now use it to encrypt a strong symmetric cipher key (say, AES-256) and use that to communicate in complete privacy.</p> </div> <div class="section" id="authenticated-dhke"> <h2>Authenticated DHKE</h2> <p>The basic DHKE protocol, as described above, is easily vulnerable to a man-in-the-middle (MITM) attack. When Alice and Bob exchange their public keys in stage 2, nothing guarantees to Alice that the key she received comes from Bob. Eve could place herself between Alice and Bob and set up an exchange with each one of them separately, while making them believe they are talking to each other. 
Then she could read all the traffic, while Alice and Bob suspect nothing.</p> <p>The solution to this problem is to use <em>authenticated DHKE</em> instead. The core protocol remains the same, but when Alice and Bob exchange messages, these are signed with a strong signature algorithm. For example, Alice and Bob can use their RSA private keys to sign these messages. Then the MITM attack is impossible because Eve can't send a message to Bob pretending she's Alice without access to Alice's private RSA key.</p> </div> <div class="section" id="forward-secrecy"> <h2>Forward secrecy</h2> <p>In the <a class="reference external" href="http://eli.thegreenplace.net/2019/rsa-theory-and-implementation/">RSA post</a> we've seen how the RSA algorithm can be used to create a shared secret between two parties and thus for secret communication. RSA has a serious flaw when used like that, though. There's a lot of traffic using a single key, which may help an attacker break it. Once broken, this key can be used to read <em>all past</em> communications that used the same key.</p> <p>DHKE, on the other hand, has <em>forward secrecy</em>. A new DHKE shared secret is generated for every session. Breaking this key will expose the secrets of this session, but won't enable the attacker to read all past correspondence. Such keys are called <em>ephemeral</em>.</p> <p>You may ask - can't RSA be made ephemeral? Can't we use a &quot;master&quot; RSA key to authenticate the key exchange, and generate a fresh public/private key pair for each communication? Yes, that's absolutely possible, but DHKE is still preferred because it's more efficient. 
While generating an RSA key pair requires finding two large primes with certain characteristics, generating a new DHKE public key is simply choosing a random integer and computing a single modular exponentiation - this is much faster.</p> </div> <div class="section" id="choosing-safe-primes"> <h2>Choosing &quot;safe&quot; primes</h2> <p>We've seen before that the <em>p</em> and <em>g</em> parameters for DHKE are public. How are these chosen? Can we choose any <em>p</em> and <em>g</em> and have strong security?</p> <p>It turns out that the answer is no, due to some interesting math. Algorithms like <a class="reference external" href="https://en.wikipedia.org/wiki/Index_calculus_algorithm">Index Calculus</a> can be used to crack the DLP in sub-exponential time. They're so powerful that we need primes of 1024 bits to have 80-bit security (meaning the equivalent of brute-forcing an 80-bit symmetric key).</p> <p>When coupled with the <a class="reference external" href="https://en.wikipedia.org/wiki/Pohlig–Hellman_algorithm">Pohlig-Hellman</a> attack, we may get in trouble. This attack uses the <a class="reference external" href="http://eli.thegreenplace.net/2019/the-chinese-remainder-theorem/">CRT</a> to break the DLP in time proportional to the <em>factors</em> of <em>|G|</em> <a class="footnote-reference" href="#id3" id="id1"></a>. Note that when <em>p</em> is a prime, <em>p-1</em> is composite, so it will end up having some factors. Which factors? Hard to say, but we want to maximize them. The best way to maximize them is to pick primes of the form <em>2q+1</em>, where <em>q</em> is a prime. Then <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/4ba6046eb4a34a7a7821084f4fb4a78d0fefe875.svg" style="height: 18px;" type="image/svg+xml">|G|=p-1=2q</object>, and its factors are 2 and <em>q</em>. 
<em>g</em> is chosen such that it generates a sub-group of size <em>q</em>, which ensures we have a large prime <em>|G|</em>.</p> <p>Primes of the form <em>2q+1</em> are called <em>safe primes</em>.</p> <p>For example, <a class="reference external" href="https://tools.ietf.org/html/rfc7919">RFC 7919</a> recommends several parameters, presenting them thus:</p> <div class="highlight"><pre><span></span>The hexadecimal representation of p is: FFFFFFFF FFFFFFFF ADF85458 A2BB4A9A AFDC5620 273D3CF1 D8B9C583 CE2D3695 A9E13641 146433FB CC939DCE 249B3EF9 7D2FE363 630C75D8 F681B202 AEC4617A D3DF1ED5 D5FD6561 2433F51F 5F066ED0 85636555 3DED1AF3 B557135E 7F57C935 984F0C70 E0E68B77 E2A689DA F3EFE872 1DF158A1 36ADE735 30ACCA4F 483A797A BC0AB182 B324FB61 D108A94B B2C8E3FB B96ADAB7 60D7F468 1D4F42A3 DE394DF4 AE56EDE7 6372BB19 0B07A7C8 EE0A6D70 9E02FCE1 CDF7E2EC C03404CD 28342F61 9172FE9C E98583FF 8E4F1232 EEF28183 C3FE3B1B 4C6FAD73 3BB5FCBC 2EC22005 C58EF183 7D1683B2 C6F34A26 C1B2EFFA 886B4238 611FCFDC DE355B3B 6519035B BC34F4DE F99C0238 61B46FC9 D6E6C907 7AD91D26 91F7F7EE 598CB0FA C186D91C AEFE1309 85139270 B4130C93 BC437944 F4FD4452 E2D74DD3 64F2E21E 71F54BFF 5CAE82AB 9C9DF69E E86D2BC5 22363A0D ABC52197 9B0DEADA 1DBF9A42 D5C4484E 0ABCD06B FA53DDEF 3C1B20EE 3FD59D7C 25E41D2B 66C62E37 FFFFFFFF FFFFFFFF The generator is: g = 2 The group size is: q = (p-1)/2 </pre></div> <p>The parameters in this RFC are the only ones approved for the newest TLS standard - version 1.3, which also removes the support for custom groups.</p> <p>The safety of the primes used for DHKE is not a purely theoretical concern! Real attacks have been (and are probably still being) mounted against unsafe choices. 
See <a class="reference external" href="https://nvd.nist.gov/vuln/detail/CVE-2016-0701">CVE-2016-0701</a> for example, and the paper <a class="reference external" href="https://jhalderm.com/pub/papers/subgroup-ndss16.pdf">Measuring small subgroup attacks against Diffie-Hellman</a> for more technical details.</p> </div> <div class="section" id="a-word-on-elliptic-curves"> <h2>A word on elliptic curves</h2> <p>Elliptic curves have been all the rage in cryptography for the <a class="reference external" href="https://tools.ietf.org/html/rfc4492">past couple of decades</a>, and for a good reason. They provide similar security to the &quot;classical&quot; multiplicative modular groups with much smaller keys. If you're using TLS 1.3, the key exchange protocol will most likely be ECDHE - Elliptic Curve Diffie-Hellman Ephemeral.</p> <p>Explaining elliptic curves is a huge topic of its own, so I'll just briefly mention them w.r.t. the material presented in this post.</p> <p>The beauty of abstract algebra is that you can develop mathematics that will apply in the same way to very different groups. We've seen the DLP defined for multiplicative modular groups, but it can also be defined for different groups.</p> <p>Elliptic curves are sets of points (x, y) which fulfill certain polynomial equations <a class="footnote-reference" href="#id4" id="id2"></a>, and when set up properly these points can form cyclic groups under certain operations. A DLP can be defined for these groups, and it's as hard to solve as the classical DLP. Much of the math remains the same - generators, subgroups, and so on. DHKE looks the same as well - Alice and Bob both pick a random group member, and compute an &quot;exponent&quot; (repeated application of the group operation), sending it on the wire. 
They combine their exponents to get a shared secret key, while Eve cannot reconstruct their private exponents from the transmitted information because of the infeasibility of the DLP.</p> <p>Elliptic curve groups are great because - compared to classical multiplicative modular groups - they are less susceptible to sub-exponential attacks. Therefore, to gain ~128 bits of security (i.e. make attacks equivalent to brute-forcing 128-bit values) we can use a key of size 256 bits (as opposed to 3072 bits for classical DH). This makes cryptographic protocols much faster.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id3" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>Specifically, it's proportional to the size of the <em>subgroup</em> that the generator generates. The sizes of subgroups are related to the factors of <em>|G|</em>, per <a class="reference external" href="https://en.wikipedia.org/wiki/Lagrange%27s_theorem_(group_theory)">Lagrange's Theorem</a>.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id4" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td><object class="valign-m4" data="https://eli.thegreenplace.net/images/math/47b3fd1d9acbd8a3feb58d52f6c73c7dba87ffec.svg" style="height: 19px;" type="image/svg+xml">y^2=x^3+ax+b</object>, which should look familiar from analytic geometry in middle school.</td></tr> </tbody> </table> </div> <h1>RSA - theory and implementation</h1> <p>Eli Bendersky's website - Math; published 2019-09-03 by Eli Bendersky</p> <p>RSA has been a staple of public key cryptography for over 40 years, and is still being used today for some tasks in the newest TLS 1.3 standard. This post describes the theory behind RSA - the math that makes it work, as well as some practical considerations; it also presents a complete implementation of RSA key generation, encryption and decryption in Go.</p> <div class="section" id="the-rsa-algorithm"> <h2>The RSA algorithm</h2> <p>The beauty of the RSA algorithm is its simplicity. You don't need much more than some familiarity with elementary number theory to understand it, and the prerequisites can be grokked in a few hours.</p> <p>In this presentation <em>M</em> is the message we want to encrypt, resulting in the ciphertext <em>C</em>. Both <em>M</em> and <em>C</em> are large integers. Refer to the Practical Considerations section for representing arbitrary data with such integers.</p> <p>The RSA algorithm consists of three main phases: key generation, encryption and decryption.</p> <div class="section" id="key-generation"> <h3>Key generation</h3> <p>The first phase in using RSA is generating the public/private keys. This is accomplished in several steps.</p> <p><strong>Step 1</strong>: find two random, very large prime numbers <em>p</em> and <em>q</em> and calculate <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2df650fff78b85cfb0330f2a2e65e4ac0e1e1ca1.svg" style="height: 12px;" type="image/svg+xml">n=pq</object>. How large should these primes be? The current recommendation is for <em>n</em> to be at least 2048 bits, or over 600 decimal digits. 
We'll assume that the message <em>M</em> - represented as a number - is smaller than <em>n</em> (see Practical Considerations for details on what to do if it's not).</p> <p><strong>Step 2</strong>: select a small odd integer <em>e</em> that is relatively prime to <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1c7f9bc7f04407dd7fee51ec2ec4df99f20355ee.svg" style="height: 18px;" type="image/svg+xml">\phi(n)</object>, which is <a class="reference external" href="https://en.wikipedia.org/wiki/Euler%27s_totient_function">Euler's totient function</a>. <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1c7f9bc7f04407dd7fee51ec2ec4df99f20355ee.svg" style="height: 18px;" type="image/svg+xml">\phi(n)</object> is calculated directly from Euler's formula (its proof is on Wikipedia):</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/660f0ef1ba862cad10df79d9274e30ed265331c0.svg" style="height: 51px;" type="image/svg+xml"> $\phi(n) =n \prod_{p\mid n} \left(1-\frac{1}{p}\right)$</object> <p>For <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2df650fff78b85cfb0330f2a2e65e4ac0e1e1ca1.svg" style="height: 12px;" type="image/svg+xml">n=pq</object> where <em>p</em> and <em>q</em> are primes, we get</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/c4e921f87628b962ed3f77e50dfd51d92a924041.svg" style="height: 40px;" type="image/svg+xml"> $\phi(n)=n\frac{p-1}{p}\frac{q-1}{q}=(p-1)(q-1)$</object> <p>In practice, it's recommended to pick <em>e</em> as one of a set of known prime values, most notably <a class="reference external" href="https://tools.ietf.org/html/rfc2313">65537</a>. 
Picking this known number does not diminish the security of RSA, and has some advantages such as efficiency <a class="footnote-reference" href="#id7" id="id2"></a>.</p> <p><strong>Step 3</strong>: compute <em>d</em> as the multiplicative inverse of <em>e</em> modulo <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1c7f9bc7f04407dd7fee51ec2ec4df99f20355ee.svg" style="height: 18px;" type="image/svg+xml">\phi(n)</object>. Lemma 3 in <a class="reference external" href="http://eli.thegreenplace.net/2019/the-chinese-remainder-theorem/">this post</a> guarantees that <em>d</em> exists and is unique (and also explains what a modular multiplicative inverse is).</p> <p>At this point we have all we need for the public/private keys. The public key is the pair <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/e97a2ea99cfffbb197c3a2ea0c0e8d6962422e84.svg" style="height: 18px;" type="image/svg+xml">[e,n]</object> and the private key is the pair <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/30c8e363b6a1070055dd59a89f457dd42dbad6a5.svg" style="height: 18px;" type="image/svg+xml">[d,n]</object>. 
In practice, when doing decryption we have access to <em>n</em> already (from the public key), so <em>d</em> is really the only unknown.</p> </div> <div class="section" id="encryption-and-decryption"> <h3>Encryption and decryption</h3> <p>Encryption and decryption are both accomplished with the same <a class="reference external" href="http://eli.thegreenplace.net/2009/03/28/efficient-modular-exponentiation-algorithms">modular exponentiation</a> formula, substituting different values for <em>x</em> and <em>y</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7ba9e4575b2f901ac6ab1301c9260a0ebb8c4ddb.svg" style="height: 18px;" type="image/svg+xml"> $f(x)=x^y\pmod{n}$</object> <p>For encryption, the input is <em>M</em> and the exponent is <em>e</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/16100b92251780f65ec193d9e8f0fd7b3df7f55e.svg" style="height: 18px;" type="image/svg+xml"> $Enc(M)=M^e\pmod{n}$</object> <p>For decryption, the input is the ciphertext <em>C</em> and the exponent is <em>d</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/e501f753509274a8d9c1792563b70c7afb04b7cb.svg" style="height: 21px;" type="image/svg+xml"> $Dec(C)=C^d\pmod{n}$</object> </div> </div> <div class="section" id="why-does-it-work"> <h2>Why does it work?</h2> <p>Given <em>M</em>, we encrypt it by raising to the power of <em>e</em> modulo <em>n</em>. Apparently, this process is reversible by raising the result to the power of <em>d</em> modulo <em>n</em>, getting <em>M</em> back. 
Why does this work?</p> <p><strong>Proof</strong>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/b1865666d04fc18858aee6e6bb0e79b861822cc8.svg" style="height: 21px;" type="image/svg+xml"> $Dec(Enc(M))=M^{ed}\pmod{n}$</object> <p>Recall that <em>e</em> and <em>d</em> are multiplicative inverses modulo <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1c7f9bc7f04407dd7fee51ec2ec4df99f20355ee.svg" style="height: 18px;" type="image/svg+xml">\phi(n)</object>. That is, <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ff52bee7e9ab7e6ba6c4eaec88d621a058253f8b.svg" style="height: 18px;" type="image/svg+xml">ed\equiv 1\pmod{\phi(n)}</object>. This means that for some integer <em>k</em> we have <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/d9312685b11b6605d92b1cb3f528e78bfdae9ce0.svg" style="height: 18px;" type="image/svg+xml">ed=1+k\phi(n)</object> or <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/068d54df52635d19705da7af64959f05f5415dc0.svg" style="height: 18px;" type="image/svg+xml">ed=1+k(p-1)(q-1)</object>.</p> <p>Let's see what <object class="valign-0" data="https://eli.thegreenplace.net/images/math/0851d104a1204f3680dc479111e1c56b15d50924.svg" style="height: 15px;" type="image/svg+xml">M^{ed}</object> is modulo <em>p</em>. 
Substituting in the formula for <em>ed</em> we get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/2a143c253f8d87a633a3d784919995f4849e2820.svg" style="height: 23px;" type="image/svg+xml"> $M^{ed}\equiv M(M^{p-1})^{k(q-1)}\pmod{p}$</object> <p>Now we can use <a class="reference external" href="https://en.wikipedia.org/wiki/Fermat%27s_little_theorem">Fermat's little theorem</a>, which states that if <em>M</em> is not divisible by <em>p</em>, we have <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/e59079c61e78d1fa10e39d2394416b925e961e50.svg" style="height: 19px;" type="image/svg+xml">M^{p-1}\equiv 1\pmod{p}</object>. This theorem is a special case of Euler's theorem, the proof of which <a class="reference external" href="http://eli.thegreenplace.net/2009/08/01/a-group-theoretic-proof-of-eulers-theorem">I wrote about here</a>.</p> <p>So we can substitute 1 for <object class="valign-0" data="https://eli.thegreenplace.net/images/math/ce56881e232caecbb33c9e0c42f73da4568bc43e.svg" style="height: 15px;" type="image/svg+xml">M^{p-1}</object> in the latest equation, and raising 1 to any power is still 1:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/992600d7cffd6118684c9fb7bd2884eddbd28c1b.svg" style="height: 21px;" type="image/svg+xml"> $M^{ed}\equiv M\pmod{p}$</object> <p>Note that Fermat's little theorem requires that <em>M</em> is not divisible by <em>p</em>. 
We can safely assume that, because if <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/edc02af17cdf5fefdcee2d00c213bfc9deed163b.svg" style="height: 18px;" type="image/svg+xml">M\equiv 0\pmod{p}</object>, then trivially <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/e3faa78a6c339536793bb1521021b33a8d0e7a01.svg" style="height: 19px;" type="image/svg+xml">M^{ed}\equiv 0\pmod{p}</object> and again <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/4bdc1b0fbf669d81ffa7e4f726380a7090fda112.svg" style="height: 19px;" type="image/svg+xml">M^{ed}\equiv M\pmod{p}</object>.</p> <p>We can similarly show that:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/c924efc575ead500eb25859e78fcd2aa4b166166.svg" style="height: 21px;" type="image/svg+xml"> $M^{ed}\equiv M\pmod{q}$</object> <p>So we have <object class="valign-0" data="https://eli.thegreenplace.net/images/math/01e9660f252f2e39af7563cd3464c24f770bc7db.svg" style="height: 15px;" type="image/svg+xml">M^{ed}\equiv M</object> for the prime factors of <em>n</em>. 
Using a <a class="reference external" href="http://eli.thegreenplace.net/2019/the-chinese-remainder-theorem/">corollary to the Chinese Remainder Theorem</a>, they are then equivalent modulo <em>n</em> itself:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/59989cc9b589764c6339babf180807ad78c02721.svg" style="height: 21px;" type="image/svg+xml"> $M^{ed}\equiv M\pmod{n}$</object> <p>Since we've defined <em>M</em> to be smaller than <em>n</em>, we've shown that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/933aa016970d0adaae6a5832eafe9f4f73750317.svg" style="height: 18px;" type="image/svg+xml">Dec(Enc(M))=M</object> ∎</p> </div> <div class="section" id="why-is-it-secure"> <h2>Why is it secure?</h2> <p>Without the private key in hand, attackers only have the result of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/00149eb8468e1ff6a5afe1ac4edc10a3426e6a18.svg" style="height: 18px;" type="image/svg+xml">M^e\pmod {n}</object>, as well as <em>n</em> and <em>e</em> (as they're part of the public key). Could they infer <em>M</em> from these numbers?</p> <p>There is no <em>known</em> general way of doing this without factoring <em>n</em> (see the <a class="reference external" href="http://people.csail.mit.edu/rivest/Rsapaper.pdf">original RSA paper</a>, section IX), and factoring is known to be a difficult problem. 
Specifically, here we assume that <em>M</em> and <em>e</em> are sufficiently large that <object class="valign-0" data="https://eli.thegreenplace.net/images/math/08c3c067bdffe6aa41c60dada94a96fa79a030b9.svg" style="height: 12px;" type="image/svg+xml">M^e&gt;n</object> (otherwise decrypting would be trivial).</p> <p>If factoring were easy, we could factor <em>n</em> into <em>p</em> and <em>q</em>, then compute <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1c7f9bc7f04407dd7fee51ec2ec4df99f20355ee.svg" style="height: 18px;" type="image/svg+xml">\phi(n)</object> and finally find <em>d</em> from <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ff52bee7e9ab7e6ba6c4eaec88d621a058253f8b.svg" style="height: 18px;" type="image/svg+xml">ed\equiv 1\pmod{\phi(n)}</object> using the extended Euclidean algorithm.</p> </div> <div class="section" id="practical-considerations"> <h2>Practical considerations</h2> <p>The algorithm described so far is sometimes called <em>textbook RSA</em> (or <em>schoolbook RSA</em>). That's because it deals entirely in numbers, ignoring all kinds of practical matters. In fact, textbook RSA is susceptible to <a class="reference external" href="https://crypto.stackexchange.com/questions/20085/which-attacks-are-possible-against-raw-textbook-rsa">several clever attacks</a> and has to be enhanced with random padding schemes for practical use.</p> <p>A simple padding scheme called PKCS #1 v1.5 has been used for many years and is defined in <a class="reference external" href="https://tools.ietf.org/html/rfc2313">RFC 2313</a>. These days more advanced schemes like <a class="reference external" href="https://tools.ietf.org/html/rfc2437">OAEP</a> are recommended instead, but PKCS #1 v1.5 is very easy to explain and therefore I'll use it for didactic purposes.</p> <p>Suppose we have some binary data <em>D</em> to encrypt.
The approach works for data of any size, but we will focus on just encrypting small pieces of data. In practice this is sufficient because RSA is commonly used only to encrypt a symmetric encryption key, which is much smaller than the RSA key size <a class="footnote-reference" href="#id8" id="id3"></a>. The scheme can work well enough for arbitrary-sized messages though - we'll just split the message into multiple blocks with some pre-determined block size.</p> <p>From <em>D</em> we create a block for encryption - the block has the same length as our RSA key:</p> <img alt="PKCS #1 v1.5 encryption padding scheme" class="align-center" src="https://eli.thegreenplace.net/images/2019/pkcs-15-rsa.png" /> <p>Here <em>PS</em> is the padding, which should occupy all the bytes not taken by the header and <em>D</em> in the block, and should be at least 8 bytes long (if it's shorter, the data may be broken into two separate blocks). It's a sequence of random non-zero bytes generated separately for each encryption. Once we have this full block of data, we convert it to a number, treating the bytes as a big-endian encoding <a class="footnote-reference" href="#id9" id="id4"></a>. We end up with a large number <em>x</em>, on which we then perform the RSA encryption step with <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/76a0d009913eb990fab8299e3574b743b0bed303.svg" style="height: 18px;" type="image/svg+xml">Enc(x)=x^e\pmod{n}</object>. The result is then encoded in binary and sent over the wire.</p> <p>Decryption is done in reverse.
We turn the received byte stream into a number, perform <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/67a657cde6a7d17787a986e10bf64e55a83c65ae.svg" style="height: 19px;" type="image/svg+xml">Dec(C)=C^d\pmod{n}</object>, then strip off the padding (note that the padding has no 0 bytes and is terminated with a 0, so this is easy) and get our original message back.</p> <p>The random padding here makes attacks on textbook RSA impractical, but the scheme as a whole may still be vulnerable to <a class="reference external" href="https://crypto.stackexchange.com/questions/12688/can-you-explain-bleichenbachers-cca-attack-on-pkcs1-v1-5">more sophisticated attacks</a> in some cases. Therefore, more modern schemes like OAEP should be used in practice.</p> </div> <div class="section" id="implementing-rsa-in-go"> <h2>Implementing RSA in Go</h2> <p>I've implemented a simple variant of RSA encryption and decryption as described in this post, in Go. Go makes it particularly easy to implement cryptographic algorithms because of its great support for arbitrary-precision integers with the stdlib <tt class="docutils literal">big</tt> package. Not only does this package support the basics of manipulating numbers, it also supports several primitives specifically for cryptography - for example the <tt class="docutils literal">Exp</tt> method supports efficient modular exponentiation, and the <tt class="docutils literal">ModInverse</tt> method supports finding modular multiplicative inverses. In addition, the <tt class="docutils literal">crypto/rand</tt> package contains randomness primitives specifically designed for cryptographic uses.</p> <p>Go has a production-grade crypto implementation in the standard library. RSA is in <tt class="docutils literal">crypto/rsa</tt>, so for anything real <em>please</em> use that <a class="footnote-reference" href="#id10" id="id5"></a>.
The code shown and linked here is just for educational purposes.</p> <p>The full code, with some tests, is <a class="reference external" href="https://github.com/eliben/code-for-blog/tree/master/2019/rsa">available on GitHub</a>. We'll start by defining the types to hold public and private keys:</p> <div class="highlight"><pre><span></span><span class="kd">type</span> <span class="nx">PublicKey</span> <span class="kd">struct</span> <span class="p">{</span> <span class="nx">N</span> <span class="o">*</span><span class="nx">big</span><span class="p">.</span><span class="nx">Int</span> <span class="nx">E</span> <span class="o">*</span><span class="nx">big</span><span class="p">.</span><span class="nx">Int</span> <span class="p">}</span> <span class="kd">type</span> <span class="nx">PrivateKey</span> <span class="kd">struct</span> <span class="p">{</span> <span class="nx">N</span> <span class="o">*</span><span class="nx">big</span><span class="p">.</span><span class="nx">Int</span> <span class="nx">D</span> <span class="o">*</span><span class="nx">big</span><span class="p">.</span><span class="nx">Int</span> <span class="p">}</span> </pre></div> <p>The code also contains a <tt class="docutils literal">GenerateKeys</tt> function that will randomly generate these keys with an appropriate bit length. 
Given a public key, textbook encryption is simply:</p> <div class="highlight"><pre><span></span><span class="kd">func</span> <span class="nx">encrypt</span><span class="p">(</span><span class="nx">pub</span> <span class="o">*</span><span class="nx">PublicKey</span><span class="p">,</span> <span class="nx">m</span> <span class="o">*</span><span class="nx">big</span><span class="p">.</span><span class="nx">Int</span><span class="p">)</span> <span class="o">*</span><span class="nx">big</span><span class="p">.</span><span class="nx">Int</span> <span class="p">{</span> <span class="nx">c</span> <span class="o">:=</span> <span class="nb">new</span><span class="p">(</span><span class="nx">big</span><span class="p">.</span><span class="nx">Int</span><span class="p">)</span> <span class="nx">c</span><span class="p">.</span><span class="nx">Exp</span><span class="p">(</span><span class="nx">m</span><span class="p">,</span> <span class="nx">pub</span><span class="p">.</span><span class="nx">E</span><span class="p">,</span> <span class="nx">pub</span><span class="p">.</span><span class="nx">N</span><span class="p">)</span> <span class="k">return</span> <span class="nx">c</span> <span class="p">}</span> </pre></div> <p>And decryption is:</p> <div class="highlight"><pre><span></span><span class="kd">func</span> <span class="nx">decrypt</span><span class="p">(</span><span class="nx">priv</span> <span class="o">*</span><span class="nx">PrivateKey</span><span class="p">,</span> <span class="nx">c</span> <span class="o">*</span><span class="nx">big</span><span class="p">.</span><span class="nx">Int</span><span class="p">)</span> <span class="o">*</span><span class="nx">big</span><span class="p">.</span><span class="nx">Int</span> <span class="p">{</span> <span class="nx">m</span> <span class="o">:=</span> <span class="nb">new</span><span class="p">(</span><span class="nx">big</span><span class="p">.</span><span class="nx">Int</span><span class="p">)</span> <span 
class="nx">m</span><span class="p">.</span><span class="nx">Exp</span><span class="p">(</span><span class="nx">c</span><span class="p">,</span> <span class="nx">priv</span><span class="p">.</span><span class="nx">D</span><span class="p">,</span> <span class="nx">priv</span><span class="p">.</span><span class="nx">N</span><span class="p">)</span> <span class="k">return</span> <span class="nx">m</span> <span class="p">}</span> </pre></div> <p>You'll notice that the bodies of these two functions are pretty much the same, except for which exponent they use. Indeed, they are just typed wrappers around the <tt class="docutils literal">Exp</tt> method.</p> <p>Finally, here's the full PKCS #1 v1.5 encryption procedure, as described above:</p> <div class="highlight"><pre><span></span><span class="c1">// EncryptRSA encrypts the message m using public key pub and returns the</span> <span class="c1">// encrypted bytes. The length of m must be &lt;= size_in_bytes(pub.N) - 11,</span> <span class="c1">// otherwise an error is returned. 
The encryption block format is based on</span> <span class="c1">// PKCS #1 v1.5 (RFC 2313).</span> <span class="kd">func</span> <span class="nx">EncryptRSA</span><span class="p">(</span><span class="nx">pub</span> <span class="o">*</span><span class="nx">PublicKey</span><span class="p">,</span> <span class="nx">m</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">)</span> <span class="p">([]</span><span class="kt">byte</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Compute length of key in bytes, rounding up.</span> <span class="nx">keyLen</span> <span class="o">:=</span> <span class="p">(</span><span class="nx">pub</span><span class="p">.</span><span class="nx">N</span><span class="p">.</span><span class="nx">BitLen</span><span class="p">()</span> <span class="o">+</span> <span class="mi">7</span><span class="p">)</span> <span class="o">/</span> <span class="mi">8</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="nx">m</span><span class="p">)</span> <span class="p">&gt;</span> <span class="nx">keyLen</span><span class="o">-</span><span class="mi">11</span> <span class="p">{</span> <span class="k">return</span> <span class="kc">nil</span><span class="p">,</span> <span class="nx">fmt</span><span class="p">.</span><span class="nx">Errorf</span><span class="p">(</span><span class="s">&quot;len(m)=%v, too long&quot;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="nx">m</span><span class="p">))</span> <span class="p">}</span> <span class="c1">// Following RFC 2313, using block type 02 as recommended for encryption:</span> <span class="c1">// EB = 00 || 02 || PS || 00 || D</span> <span class="nx">psLen</span> <span class="o">:=</span> <span class="nx">keyLen</span> <span class="o">-</span> <span class="nb">len</span><span class="p">(</span><span class="nx">m</span><span 
class="p">)</span> <span class="o">-</span> <span class="mi">3</span> <span class="nx">eb</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="kt">byte</span><span class="p">,</span> <span class="nx">keyLen</span><span class="p">)</span> <span class="nx">eb</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="p">=</span> <span class="mh">0x00</span> <span class="nx">eb</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="p">=</span> <span class="mh">0x02</span> <span class="c1">// Fill PS with random non-zero bytes.</span> <span class="k">for</span> <span class="nx">i</span> <span class="o">:=</span> <span class="mi">2</span><span class="p">;</span> <span class="nx">i</span> <span class="p">&lt;</span> <span class="mi">2</span><span class="o">+</span><span class="nx">psLen</span><span class="p">;</span> <span class="p">{</span> <span class="nx">_</span><span class="p">,</span> <span class="nx">err</span> <span class="o">:=</span> <span class="nx">rand</span><span class="p">.</span><span class="nx">Read</span><span class="p">(</span><span class="nx">eb</span><span class="p">[</span><span class="nx">i</span> <span class="p">:</span> <span class="nx">i</span><span class="o">+</span><span class="mi">1</span><span class="p">])</span> <span class="k">if</span> <span class="nx">err</span> <span class="o">!=</span> <span class="kc">nil</span> <span class="p">{</span> <span class="k">return</span> <span class="kc">nil</span><span class="p">,</span> <span class="nx">err</span> <span class="p">}</span> <span class="k">if</span> <span class="nx">eb</span><span class="p">[</span><span class="nx">i</span><span class="p">]</span> <span class="o">!=</span> <span class="mh">0x00</span> <span class="p">{</span> <span class="nx">i</span><span class="o">++</span> <span class="p">}</span> <span class="p">}</span> <span class="nx">eb</span><span 
class="p">[</span><span class="mi">2</span><span class="o">+</span><span class="nx">psLen</span><span class="p">]</span> <span class="p">=</span> <span class="mh">0x00</span> <span class="c1">// Copy the message m into the rest of the encryption block.</span> <span class="nb">copy</span><span class="p">(</span><span class="nx">eb</span><span class="p">[</span><span class="mi">3</span><span class="o">+</span><span class="nx">psLen</span><span class="p">:],</span> <span class="nx">m</span><span class="p">)</span> <span class="c1">// Now the encryption block is complete; we take it as a keyLen-byte big.Int and</span> <span class="c1">// RSA-encrypt it with the public key.</span> <span class="nx">mnum</span> <span class="o">:=</span> <span class="nb">new</span><span class="p">(</span><span class="nx">big</span><span class="p">.</span><span class="nx">Int</span><span class="p">).</span><span class="nx">SetBytes</span><span class="p">(</span><span class="nx">eb</span><span class="p">)</span> <span class="nx">c</span> <span class="o">:=</span> <span class="nx">encrypt</span><span class="p">(</span><span class="nx">pub</span><span class="p">,</span> <span class="nx">mnum</span><span class="p">)</span> <span class="c1">// The result is a big.Int, which we want to convert to a byte slice of</span> <span class="c1">// length keyLen.
It&#39;s highly likely that the size of c in bytes is keyLen,</span> <span class="c1">// but in rare cases we may need to pad it on the left with zeros (this only</span> <span class="c1">// happens if the most significant byte of c is zero, meaning that it&#39;s more</span> <span class="c1">// than 256 times smaller than the modulus).</span> <span class="nx">padLen</span> <span class="o">:=</span> <span class="nx">keyLen</span> <span class="o">-</span> <span class="nb">len</span><span class="p">(</span><span class="nx">c</span><span class="p">.</span><span class="nx">Bytes</span><span class="p">())</span> <span class="k">for</span> <span class="nx">i</span> <span class="o">:=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">i</span> <span class="p">&lt;</span> <span class="nx">padLen</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span> <span class="p">{</span> <span class="nx">eb</span><span class="p">[</span><span class="nx">i</span><span class="p">]</span> <span class="p">=</span> <span class="mh">0x00</span> <span class="p">}</span> <span class="nb">copy</span><span class="p">(</span><span class="nx">eb</span><span class="p">[</span><span class="nx">padLen</span><span class="p">:],</span> <span class="nx">c</span><span class="p">.</span><span class="nx">Bytes</span><span class="p">())</span> <span class="k">return</span> <span class="nx">eb</span><span class="p">,</span> <span class="kc">nil</span> <span class="p">}</span> </pre></div> <p>There&#39;s also <tt class="docutils literal">DecryptRSA</tt>, which unwraps this:</p> <div class="highlight"><pre><span></span><span class="c1">// DecryptRSA decrypts the message c using private key priv and returns the</span> <span class="c1">// decrypted bytes, based on block 02 from PKCS #1 v1.5 (RFC 2313).</span> <span class="c1">// It expects len(c) to equal the length in bytes of the private key&#39;s modulus.</span> <span class="c1">// Important: this is a simple implementation not designed
to be resilient to</span> <span class="c1">// timing attacks.</span> <span class="kd">func</span> <span class="nx">DecryptRSA</span><span class="p">(</span><span class="nx">priv</span> <span class="o">*</span><span class="nx">PrivateKey</span><span class="p">,</span> <span class="nx">c</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">)</span> <span class="p">([]</span><span class="kt">byte</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span> <span class="nx">keyLen</span> <span class="o">:=</span> <span class="p">(</span><span class="nx">priv</span><span class="p">.</span><span class="nx">N</span><span class="p">.</span><span class="nx">BitLen</span><span class="p">()</span> <span class="o">+</span> <span class="mi">7</span><span class="p">)</span> <span class="o">/</span> <span class="mi">8</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="nx">c</span><span class="p">)</span> <span class="o">!=</span> <span class="nx">keyLen</span> <span class="p">{</span> <span class="k">return</span> <span class="kc">nil</span><span class="p">,</span> <span class="nx">fmt</span><span class="p">.</span><span class="nx">Errorf</span><span class="p">(</span><span class="s">&quot;len(c)=%v, want keyLen=%v&quot;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="nx">c</span><span class="p">),</span> <span class="nx">keyLen</span><span class="p">)</span> <span class="p">}</span> <span class="c1">// Convert c into a big.Int and decrypt it using the private key.</span> <span class="nx">cnum</span> <span class="o">:=</span> <span class="nb">new</span><span class="p">(</span><span class="nx">big</span><span class="p">.</span><span class="nx">Int</span><span class="p">).</span><span class="nx">SetBytes</span><span class="p">(</span><span class="nx">c</span><span class="p">)</span> <span class="nx">mnum</span> <span
class="o">:=</span> <span class="nx">decrypt</span><span class="p">(</span><span class="nx">priv</span><span class="p">,</span> <span class="nx">cnum</span><span class="p">)</span> <span class="c1">// Write the bytes of mnum into m, left-padding if needed.</span> <span class="nx">m</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="kt">byte</span><span class="p">,</span> <span class="nx">keyLen</span><span class="p">)</span> <span class="nb">copy</span><span class="p">(</span><span class="nx">m</span><span class="p">[</span><span class="nx">keyLen</span><span class="o">-</span><span class="nb">len</span><span class="p">(</span><span class="nx">mnum</span><span class="p">.</span><span class="nx">Bytes</span><span class="p">()):],</span> <span class="nx">mnum</span><span class="p">.</span><span class="nx">Bytes</span><span class="p">())</span> <span class="c1">// Expect proper block 02 beginning.</span> <span class="k">if</span> <span class="nx">m</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">!=</span> <span class="mh">0x00</span> <span class="p">{</span> <span class="k">return</span> <span class="kc">nil</span><span class="p">,</span> <span class="nx">fmt</span><span class="p">.</span><span class="nx">Errorf</span><span class="p">(</span><span class="s">&quot;m=%v, want 0x00&quot;</span><span class="p">,</span> <span class="nx">m</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="p">}</span> <span class="k">if</span> <span class="nx">m</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="mh">0x02</span> <span class="p">{</span> <span class="k">return</span> <span class="kc">nil</span><span class="p">,</span> <span class="nx">fmt</span><span class="p">.</span><span class="nx">Errorf</span><span class="p">(</span><span class="s">&quot;m=%v, want 
0x02&quot;</span><span class="p">,</span> <span class="nx">m</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="p">}</span> <span class="c1">// Skip over random padding until a 0x00 byte is reached. +2 adjusts the index</span> <span class="c1">// back to the full slice.</span> <span class="nx">endPad</span> <span class="o">:=</span> <span class="nx">bytes</span><span class="p">.</span><span class="nx">IndexByte</span><span class="p">(</span><span class="nx">m</span><span class="p">[</span><span class="mi">2</span><span class="p">:],</span> <span class="mh">0x00</span><span class="p">)</span> <span class="o">+</span> <span class="mi">2</span> <span class="k">if</span> <span class="nx">endPad</span> <span class="p">&lt;</span> <span class="mi">2</span> <span class="p">{</span> <span class="k">return</span> <span class="kc">nil</span><span class="p">,</span> <span class="nx">fmt</span><span class="p">.</span><span class="nx">Errorf</span><span class="p">(</span><span class="s">&quot;end of padding not found&quot;</span><span class="p">)</span> <span class="p">}</span> <span class="k">return</span> <span class="nx">m</span><span class="p">[</span><span class="nx">endPad</span><span class="o">+</span><span class="mi">1</span><span class="p">:],</span> <span class="kc">nil</span> <span class="p">}</span> </pre></div> </div> <div class="section" id="digital-signatures-with-rsa"> <h2>Digital signatures with RSA</h2> <p>RSA can also be used to perform <em>digital signatures</em>. Here's how it works:</p> <ol class="arabic simple"> <li>Key generation and distribution remain the same. Alice has a public key and a private key.
She publishes her public key online.</li> <li>When Alice wants to send Bob a message and have Bob be sure that only she could have sent it, she will <em>encrypt</em> the message with her <em>private</em> key, that is <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ad1c0c30bf900657c2a36a6361873f2e8801873f.svg" style="height: 19px;" type="image/svg+xml">S=Sign(M)=M^d\pmod{n}</object>. The signature is attached to the message.</li> <li>When Bob receives a message, he can <em>decrypt</em> the signature with Alice's public key: <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/15a4fcf51b7c6984cc437be976d1e1d52e5f749c.svg" style="height: 18px;" type="image/svg+xml">Check(S)=S^e\pmod{n}</object> and if he gets the original message back, the signature was correct.</li> </ol> <p>The correctness proof would be exactly the same as for encryption. No one else could have signed the message, because proper signing would require having the private key of Alice, which only she possesses.</p> <p>This is the textbook signature algorithm. One difference between the practical implementation of signing and encryption is in the padding protocol used. While OAEP is recommended for encryption, <a class="reference external" href="https://en.wikipedia.org/wiki/Probabilistic_signature_scheme">PSS</a> is recommended for signing <a class="footnote-reference" href="#id11" id="id6"></a>. 
I'm not going to implement signing for this post, but the Go standard library has great code for this - for example <tt class="docutils literal">rsa.SignPKCS1v15</tt> and <tt class="docutils literal">rsa.SignPSS</tt>.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id7" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>For two reasons: one is that we don't have to randomly find another large number - this operation takes time; another is that 65537 has only two bits &quot;on&quot; in its binary representation, which makes <a class="reference external" href="http://eli.thegreenplace.net/2009/03/28/efficient-modular-exponentiation-algorithms">modular exponentiation algorithms faster</a>.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>A strong AES key is 256 bits, while RSA is commonly 2048 or more. The reason RSA encrypts a symmetric key is efficiency - RSA encryption is much slower than block ciphers, to the extent that it's often impractical to encrypt large streams of data with it. A hybrid scheme - wherein a strong AES key is first encrypted with RSA, and then AES is used to encrypt large data - is very common. 
This is the general idea behind what TLS and similar secure protocols use.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id9" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4"></a></td><td>Note that the first 8 bits of the data block are 0, which makes it easy to ensure that the number we encrypt is smaller than <em>n</em>.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id10" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id5"></a></td><td>The stdlib implementation is resilient to common kinds of side-channel attacks; for example, it uses algorithms whose run time is independent of certain characteristics of the input, which makes timing attacks less feasible.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id11" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id6"></a></td><td>The reason for a different protocol is that the attacks on encrypted messages and on signatures tend to be different. For example, while for encrypted messages it's unthinkable to let attackers know any characteristics of the original message (the <em>base</em> in the exponentiation), in signing it's usually plainly available.</td></tr> </tbody> </table> </div> The Chinese Remainder Theorem2019-08-28T06:00:00-07:002019-08-28T06:00:00-07:00Eli Benderskytag:eli.thegreenplace.net,2019-08-28:/2019/the-chinese-remainder-theorem/
<p>The Chinese Remainder Theorem (CRT) is very useful in cryptography and other domains. According <a class="reference external" href="https://en.wikipedia.org/wiki/Chinese_remainder_theorem">to Wikipedia</a>, its origin and name come from this riddle in a 3rd century book by a Chinese mathematician:</p> <blockquote> There are certain things whose number is unknown. If we count them by threes, we have two left over; by fives, we have three left over; and by sevens, two are left over. How many things are there?</blockquote> <p>Mathematically, this is a system of linear congruences. In this post we'll go through a simple proof of the <em>existence</em> of a solution. It also demonstrates how to find such a solution, though check the Wikipedia link for a discussion of different methods and their relative efficiency.</p> <p>We'll start with a few prerequisite lemmas needed to prove the CRT.
You may want to skip them on first reading and refer back when going through the CRT proof.</p> <div class="section" id="prerequisites"> <h2>Prerequisites</h2> <p><strong>Lemma 1</strong>: if <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1a4cf51ff825f4ff0afd531c6a8c9860d6d51896.svg" style="height: 18px;" type="image/svg+xml">d|ab</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/05f83d8097b6d8ae83319bf25a53212cc97d48c2.svg" style="height: 18px;" type="image/svg+xml">(d,a)=1</object>, then <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/7e30fae9eb28807eec8a567db43b8396a79d881a.svg" style="height: 18px;" type="image/svg+xml">d|b</object>.</p> <p><strong>Proof</strong>: Since <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/05f83d8097b6d8ae83319bf25a53212cc97d48c2.svg" style="height: 18px;" type="image/svg+xml">(d,a)=1</object> we know from <a class="reference external" href="http://eli.thegreenplace.net/2009/07/10/the-gcd-and-linear-combinations">Bézout's identity</a> that there exist integers <em>x</em> and <em>y</em> such that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1f8e45ed33065085f62b878d9fdf9151d0f757e0.svg" style="height: 17px;" type="image/svg+xml">dx+ay=1</object>. Multiplying both sides by <em>b</em>, we get: <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/8141e2352af9dea6ef77b76a12fde7161a8e070e.svg" style="height: 17px;" type="image/svg+xml">bdx+bay=b</object>. <em>bdx</em> is divisible by <em>d</em>, and so is <em>bay</em> because <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1a4cf51ff825f4ff0afd531c6a8c9860d6d51896.svg" style="height: 18px;" type="image/svg+xml">d|ab</object>. 
Therefore <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/7e30fae9eb28807eec8a567db43b8396a79d881a.svg" style="height: 18px;" type="image/svg+xml">d|b</object> ∎</p> <p><strong>Lemma 2</strong>: if <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/146c2b5df7fcf337ad748394da7a872aac087af3.svg" style="height: 18px;" type="image/svg+xml">ac\equiv bc \pmod{m}</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/8f4b6819405398557257b9fb77c9ad496616c65b.svg" style="height: 18px;" type="image/svg+xml">(c,m)=1</object>, then <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2daa6d2924d8b164c04f2f9f0723cf966dfab7f8.svg" style="height: 18px;" type="image/svg+xml">a\equiv b \pmod{m}</object>.</p> <p><strong>Proof</strong>: Since <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/146c2b5df7fcf337ad748394da7a872aac087af3.svg" style="height: 18px;" type="image/svg+xml">ac\equiv bc \pmod{m}</object>, we know that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/68c33cb136d7bb3306ffb221c197c2094a4cec62.svg" style="height: 18px;" type="image/svg+xml">m|(ac-bc)</object>, or <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/95d7155cdbe323b74a4aeeb21c788b04d1039388.svg" style="height: 18px;" type="image/svg+xml">m|c(a-b)</object>. Since <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1fcb1e450165a208f3b6af15614e23408bb752de.svg" style="height: 18px;" type="image/svg+xml">(m,c)=1</object> we can use Lemma 1 to conclude that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/b3a7cab922796acb6514c2ed260af9948d515ca1.svg" style="height: 18px;" type="image/svg+xml">m|(a-b)</object>. 
In other words, <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2daa6d2924d8b164c04f2f9f0723cf966dfab7f8.svg" style="height: 18px;" type="image/svg+xml">a\equiv b \pmod{m}</object> ∎</p> <div class="section" id="modular-multiplicative-inverse"> <h3>Modular multiplicative inverse</h3> <p>A <em>modular multiplicative inverse</em> of an integer <em>a</em> w.r.t. the modulus <em>m</em> is the solution of the linear congruence:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/c7d32701fe73c767ddf59237fe114f1e078e5340.svg" style="height: 18px;" type="image/svg+xml"> $ax\equiv1 \pmod{m}$</object> <p><strong>Lemma 3</strong>: if <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/99dabb93e5e7794513a1a19f669b0084a89ac5d1.svg" style="height: 18px;" type="image/svg+xml">(a,m)=1</object> then <em>a</em> has a unique modular multiplicative inverse modulo <em>m</em>.</p> <p><strong>Proof</strong>: Once again using Bézout's identity, we know from <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/99dabb93e5e7794513a1a19f669b0084a89ac5d1.svg" style="height: 18px;" type="image/svg+xml">(a,m)=1</object> that there exist integers <em>r</em> and <em>s</em> such that <object class="valign-m2" data="https://eli.thegreenplace.net/images/math/f311ad65574a8e08313b2fc884d8d8e196cb3f7e.svg" style="height: 14px;" type="image/svg+xml">ar+ms=1</object>. Therefore <object class="valign-m1" data="https://eli.thegreenplace.net/images/math/93cb307eb473ec8df78477dd1a6221646ed3015d.svg" style="height: 13px;" type="image/svg+xml">ar-1</object> is a multiple of <em>m</em>, or <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/eb337aabcef266f520fd462aecb901dfb0ebd7fb.svg" style="height: 18px;" type="image/svg+xml">ar\equiv 1\pmod{m}</object>. So <em>r</em> is a multiplicative inverse of <em>a</em>.</p> <p>Now let's see why this inverse is unique. 
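Before moving on, an aside: the existence argument is constructive, since the extended Euclidean algorithm computes the Bézout coefficients, and therefore the inverse itself. Here is a minimal Python sketch (the function names are my own, for illustration):

```python
def extended_gcd(a, b):
    """Returns (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    # From b*x + (a % b)*y == g it follows that
    # a*y + b*(x - (a // b)*y) == g.
    return g, y, x - (a // b) * y

def modular_inverse(a, m):
    """The multiplicative inverse of a modulo m; requires (a, m) = 1."""
    g, r, _ = extended_gcd(a, m)
    assert g == 1, 'a and m must be coprime'
    return r % m

print(modular_inverse(3, 7))  # 5, since 3*5 = 15 ≡ 1 (mod 7)
```

Back to uniqueness.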
Let's assume there are two inverses, <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/83b3fdda5b127e3a4f9bcb7b45d2fa7ef3659493.svg" style="height: 12px;" type="image/svg+xml">r_1</object> and <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/1f7e755308eb8efb09a75b7dbdc677c0b60074bd.svg" style="height: 11px;" type="image/svg+xml">r_2</object>, so <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/275749cbe3134f270d053cdddab253d7a64c940a.svg" style="height: 18px;" type="image/svg+xml">ar_1\equiv 1\pmod{m}</object> and also <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c7e4092c117d6ae06f011e7361eb69f02e41c8b8.svg" style="height: 18px;" type="image/svg+xml">ar_2\equiv 1\pmod{m}</object>, which means that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/35145e09c1438e9b639b99a751186ed4cc9f4bbb.svg" style="height: 18px;" type="image/svg+xml">ar_1\equiv ar_2\pmod{m}</object>.</p> <p>Since <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/99dabb93e5e7794513a1a19f669b0084a89ac5d1.svg" style="height: 18px;" type="image/svg+xml">(a,m)=1</object> we can apply Lemma 2 to conclude that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/5d5f4c8f44d461ef1b62559ddd71fb5061c3e8d1.svg" style="height: 18px;" type="image/svg+xml">r_1\equiv r_2\pmod{m}</object> ∎</p> </div> <div class="section" id="factorization-and-multiplying-moduli"> <h3>Factorization and multiplying moduli</h3> <p><strong>Lemma 4</strong>: if <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/5ad940bf1ab4dd63281bb98110d931c7f009f95f.svg" style="height: 18px;" type="image/svg+xml">a|n</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/6ab771b7f5fd62895faddd69134e55b5c58ac511.svg" style="height: 18px;" type="image/svg+xml">b|n</object> and <object class="valign-m4" 
data="https://eli.thegreenplace.net/images/math/3c4a3e293dc3b869630ed4dc0a5c7e5acba8d35e.svg" style="height: 18px;" type="image/svg+xml">(a,b)=1</object> then also <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/6d405fb50602a9bb0c6960c894ca148ab210f222.svg" style="height: 18px;" type="image/svg+xml">ab|n</object>.</p> <p><strong>Proof</strong>: Consider the prime factorization of <em>n</em>. <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/5ad940bf1ab4dd63281bb98110d931c7f009f95f.svg" style="height: 18px;" type="image/svg+xml">a|n</object> so <em>a</em> is a product of some subset of these prime factors. The same can be said about <em>b</em>. But <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/3c4a3e293dc3b869630ed4dc0a5c7e5acba8d35e.svg" style="height: 18px;" type="image/svg+xml">(a,b)=1</object>, so <em>a</em> and <em>b</em> don't have any prime factors in common. Therefore the prime factors of <object class="valign-0" data="https://eli.thegreenplace.net/images/math/da23614e02469a0d7c7bd1bdab5c9c474b1904dc.svg" style="height: 13px;" type="image/svg+xml">ab</object> are also a subset of the prime factors of <em>n</em>, and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/6d405fb50602a9bb0c6960c894ca148ab210f222.svg" style="height: 18px;" type="image/svg+xml">ab|n</object> ∎</p> </div> </div> <div class="section" id="id1"> <h2>The Chinese Remainder Theorem</h2> <p>Assume <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/3095109bb55e0b34ecac71d33040fa004bfdfc7d.svg" style="height: 12px;" type="image/svg+xml">n_1,\dots,n_k</object> are positive integers, pairwise coprime; that is, for any <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/eebecf421c4d33eeab4a0c4da6c20ed8d49e6c6c.svg" style="height: 17px;" type="image/svg+xml">i\neq j</object>, <object class="valign-m6"
data="https://eli.thegreenplace.net/images/math/b2820249e9aae164709f509d84fa260d142ee148.svg" style="height: 20px;" type="image/svg+xml">(n_i,n_j)=1</object>. Let <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1399d8e50dc6eadcb3e40b623a13734e492a60f7.svg" style="height: 12px;" type="image/svg+xml">a_1,\dots,a_k</object> be arbitrary integers. The system of congruences with an unknown <em>x</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/93757b2beff989fc02e09cd045a9b4ff5959c8c6.svg" style="height: 82px;" type="image/svg+xml"> \begin{align*} x &amp;\equiv a_1 \pmod{n_1} \\ &amp;\vdots \\ x &amp;\equiv a_k \pmod{n_k} \end{align*}</object> <p>has a single solution modulo the product <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9ee0288589a80853a3cb1ede6482968e0d126e93.svg" style="height: 16px;" type="image/svg+xml">N=n_1\times n_2\times \cdots \times n_k</object>.</p> <p><strong>Proof</strong>: Let <object class="valign-m8" data="https://eli.thegreenplace.net/images/math/962577ced41b97a773cca5462d4d68b22375bd8d.svg" style="height: 24px;" type="image/svg+xml">N_k=\frac{N}{n_k}</object>. 
Then <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/8878d4a84555010f6794370163c9f5c2a1865a93.svg" style="height: 18px;" type="image/svg+xml">(N_k,n_k)=1</object>, so each <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/8f94afd90555960e1ac40d2908475e16922594bc.svg" style="height: 15px;" type="image/svg+xml">N_k</object> has a unique multiplicative inverse modulo <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/b1d70855b10553d5c5a4d03b4018211bcf0114c8.svg" style="height: 11px;" type="image/svg+xml">n_k</object> per Lemma 3 above; let's call this inverse <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/bf042bd7a5b6ab321af6ac1dbba45dd3cba86d40.svg" style="height: 19px;" type="image/svg+xml">N&#x27;_k</object>. Now consider:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/97aca0d663e04fe8ebfadcd87053758dad9b08af.svg" style="height: 21px;" type="image/svg+xml"> $x=a_1 N_1 N&#x27;_1+a_2 N_2 N&#x27;_2+\cdots +a_k N_k N&#x27;_k$</object> <p><object class="valign-m3" data="https://eli.thegreenplace.net/images/math/8f94afd90555960e1ac40d2908475e16922594bc.svg" style="height: 15px;" type="image/svg+xml">N_k</object> is a multiple of every <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/5b05dd3722f57cd7ac250228f9a1aaf3af86311d.svg" style="height: 11px;" type="image/svg+xml">n_i</object> except for <object class="valign-0" data="https://eli.thegreenplace.net/images/math/f4b7e42a4b8c52f40eb9458e68e81c74d70c1c61.svg" style="height: 13px;" type="image/svg+xml">i=k</object>. 
In other words, for <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/b778002ef6ea962f1ebd964f044ff0bb2f7b5503.svg" style="height: 17px;" type="image/svg+xml">i\neq k</object> we have <em>N<sub>i</sub></em> ≡ 0 (mod <em>n<sub>k</sub></em>). On the other hand, for <object class="valign-0" data="https://eli.thegreenplace.net/images/math/f4b7e42a4b8c52f40eb9458e68e81c74d70c1c61.svg" style="height: 13px;" type="image/svg+xml">i=k</object> we have, by construction, <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/71448798bc2f36eadf4c3d0a7c123d77fff9c828.svg" style="height: 19px;" type="image/svg+xml">N_i N&#x27;_i\equiv 1\pmod{n_i}</object>. So for each <em>k</em> we have:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/be7613a4bea5400e35bbe2ea728c447b42f0a8b5.svg" style="height: 20px;" type="image/svg+xml"> $x\equiv a_k N_k N&#x27;_k \equiv a_k \pmod{n_k}$</object> <p>because all the other terms in the sum vanish modulo <em>n<sub>k</sub></em>. Hence <em>x</em> satisfies every congruence in the system.</p> <p>To prove that <em>x</em> is unique modulo <em>N</em>, let's assume there are two solutions: <em>x</em> and <em>y</em>. Both solutions to the CRT should satisfy <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/47cfde2f44ca022ba766fcc25301922bbdadd91b.svg" style="height: 18px;" type="image/svg+xml">x\equiv y\equiv a_k\pmod{n_k}</object>. Therefore <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/aace079d3f57b531e8cd699df5595629fbd6cd72.svg" style="height: 12px;" type="image/svg+xml">x-y</object> is a multiple of <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/b1d70855b10553d5c5a4d03b4018211bcf0114c8.svg" style="height: 11px;" type="image/svg+xml">n_k</object> for each <em>k</em>. 
Since these <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/b1d70855b10553d5c5a4d03b4018211bcf0114c8.svg" style="height: 11px;" type="image/svg+xml">n_k</object> are pairwise coprime, from Lemma 4 we know that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/aace079d3f57b531e8cd699df5595629fbd6cd72.svg" style="height: 12px;" type="image/svg+xml">x-y</object> is also a multiple of N, or <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/beac92d520cda137c1af24245b956a17792abdc0.svg" style="height: 18px;" type="image/svg+xml">x\equiv y\pmod{N}</object> ∎</p> <div class="section" id="corollary"> <h3>Corollary</h3> <p>If <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/3095109bb55e0b34ecac71d33040fa004bfdfc7d.svg" style="height: 12px;" type="image/svg+xml">n_1,\dots,n_k</object> are pairwise coprime and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9ee0288589a80853a3cb1ede6482968e0d126e93.svg" style="height: 16px;" type="image/svg+xml">N=n_1\times n_2\times \cdots \times n_k</object>, then for all integers <em>x</em> and <em>a</em>, <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/824b662cf71dd9ea955a952ddc6d7ed9131d12b3.svg" style="height: 18px;" type="image/svg+xml">x\equiv a\pmod{n_i}</object> for <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/b1fc066bab1cdf156a4603ce180645c09bc992f5.svg" style="height: 17px;" type="image/svg+xml">i=1,2,\dots,k</object> if and only if <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/85363e9928d3b9de1d3d4a7a182a3b899ffe60fa.svg" style="height: 18px;" type="image/svg+xml">x\equiv a\pmod{N}</object>.</p> <p><strong>Proof</strong>: we'll start with the <em>if</em> direction. 
If <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/85363e9928d3b9de1d3d4a7a182a3b899ffe60fa.svg" style="height: 18px;" type="image/svg+xml">x\equiv a\pmod{N}</object> this means <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c68b089bbf91d70e490912848f13893de8b23a59.svg" style="height: 18px;" type="image/svg+xml">N|(x-a)</object>. But that immediately means that for each <em>i</em>, <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/3cdb745a255300e442dcef60a12cf40caa411571.svg" style="height: 18px;" type="image/svg+xml">n_i|(x-a)</object> as well, or <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/824b662cf71dd9ea955a952ddc6d7ed9131d12b3.svg" style="height: 18px;" type="image/svg+xml">x\equiv a\pmod{n_i}</object>.</p> <p>Now the <em>only if</em> direction. Given <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/824b662cf71dd9ea955a952ddc6d7ed9131d12b3.svg" style="height: 18px;" type="image/svg+xml">x\equiv a\pmod{n_i}</object> for <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/b1fc066bab1cdf156a4603ce180645c09bc992f5.svg" style="height: 17px;" type="image/svg+xml">i=1,2,\dots,k</object>, we can invoke the CRT using <em>a</em> in all congruences. The CRT tells us this system has a single solution modulo <object class="valign-0" data="https://eli.thegreenplace.net/images/math/b51a60734da64be0e618bacbea2865a8a7dcd669.svg" style="height: 12px;" type="image/svg+xml">N</object>. But we know that <em>a</em> is a solution, so it has to be the only one ∎</p> </div> </div> Unification2018-11-12T05:49:00-08:002018-11-12T05:49:00-08:00Eli Benderskytag:eli.thegreenplace.net,2018-11-12:/2018/unification/<p>In logic and computer science, unification is a process of automatically solving equations between symbolic terms. Unification has several interesting applications, notably in logic programming and <a class="reference external" href="https://eli.thegreenplace.net/2018/type-inference/">type inference</a>. In this post I want to present the basic unification algorithm with a complete implementation.</p> <p>Let's start with some terminology. We'll be using <em>terms</em> built from constants, variables and function applications:</p> <ul class="simple"> <li>A lowercase letter represents a constant (could be any kind of constant, like an integer or a string)</li> <li>An uppercase letter represents a variable</li> <li><tt class="docutils literal"><span class="pre">f(...)</span></tt> is an application of function <tt class="docutils literal">f</tt> to some parameters, which are <em>terms</em> themselves</li> </ul> <p>This representation is borrowed from <a class="reference external" href="https://en.wikipedia.org/wiki/First-order_logic">first-order logic</a> and is also used in the Prolog programming language. 
Some examples:</p> <ul class="simple"> <li><tt class="docutils literal">V</tt>: a single variable term</li> <li><tt class="docutils literal">foo(V, k)</tt>: function <tt class="docutils literal">foo</tt> applied to variable V and constant k</li> <li><tt class="docutils literal">foo(bar(k), baz(V))</tt>: a nested function application</li> </ul> <div class="section" id="pattern-matching"> <h2>Pattern matching</h2> <p>Unification can be seen as a generalization of <em>pattern matching</em>, so let's start with that.</p> <p>We're given a constant term and a pattern term. The pattern term has variables. Pattern matching is the problem of finding a variable assignment that will make the two terms match. For example:</p> <ul class="simple"> <li>Constant term: <tt class="docutils literal">f(a, b, bar(t))</tt></li> <li>Pattern term: <tt class="docutils literal">f(a, V, X)</tt></li> </ul> <p>Trivially, the assignment <tt class="docutils literal">V=b</tt> and <tt class="docutils literal">X=bar(t)</tt> works here. Such an assignment is also called a <em>substitution</em>: it maps variables to their assigned values. In a less trivial case, variables can appear multiple times in a pattern:</p> <ul class="simple"> <li>Constant term: <tt class="docutils literal">f(top(a), a, <span class="pre">g(top(a)),</span> t)</tt></li> <li>Pattern term: <tt class="docutils literal">f(V, a, g(V), t)</tt></li> </ul> <p>Here the right substitution is <tt class="docutils literal">V=top(a)</tt>.</p> <p>Sometimes, no valid substitutions exist. 
If we change the constant term in the last example to <tt class="docutils literal">f(top(b), a, <span class="pre">g(top(a)),</span> t)</tt>, then there is no valid substitution because V would have to match <tt class="docutils literal">top(b)</tt> and <tt class="docutils literal">top(a)</tt> simultaneously, which is not possible.</p> </div> <div class="section" id="id1"> <h2>Unification</h2> <p>Unification is just like pattern matching, except that both terms can contain variables. So we can no longer say one is the pattern term and the other the constant term. For example:</p> <ul class="simple"> <li>First term: <tt class="docutils literal">f(a, V, bar(D))</tt></li> <li>Second term: <tt class="docutils literal">f(D, k, bar(a))</tt></li> </ul> <p>Given two such terms, finding a variable substitution that will make them equivalent is called <em>unification</em>. In this case the substitution is <tt class="docutils literal">{D=a, V=k}</tt>.</p> <p>Note that there is an infinite number of possible unifiers for some solvable unification problem. For example, given:</p> <ul class="simple"> <li>First term: <tt class="docutils literal">f(X, Y)</tt></li> <li>Second term: <tt class="docutils literal">f(Z, g(X))</tt></li> </ul> <p>We have the substitution <tt class="docutils literal">{X=Z, Y=g(X)}</tt> but also something like <tt class="docutils literal">{X=K, Z=K, Y=g(K)}</tt> and <tt class="docutils literal">{X=j(K), Z=j(K), <span class="pre">Y=g(j(K))}</span></tt> and so on. The first substitution is the simplest one, and also the most general. It's called the <em>most general unifier</em> or <em>mgu</em>. Intuitively, the <em>mgu</em> can be turned into any other unifier by performing another substitution. For example <tt class="docutils literal">{X=Z, Y=g(X)}</tt> can be turned into <tt class="docutils literal">{X=j(K), Z=j(K), <span class="pre">Y=g(j(K))}</span></tt> by applying the substitution <tt class="docutils literal">{Z=j(K)}</tt> to it. 
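This specialization is easy to check mechanically. The snippet below is my own toy illustration, not part of the post's implementation: it represents terms as plain strings with single uppercase letters as variables, and applies a substitution by repeated string replacement.

```python
def resolve(term, subst):
    # Toy illustration: terms are plain strings and variables are single
    # uppercase letters, so string replacement stands in for substitution.
    # Repeatedly apply all bindings until nothing changes (a fixpoint).
    # Note: a self-referential binding like X=f(X) would loop forever here;
    # ruling that out is the job of the occurs check discussed later.
    prev = None
    while term != prev:
        prev = term
        for var, repl in subst.items():
            term = term.replace(var, repl)
    return term

mgu = {'X': 'Z', 'Y': 'g(X)'}          # the most general unifier
combined = {**mgu, 'Z': 'j(K)'}        # mgu specialized by {Z=j(K)}

print(resolve('f(X, Y)', mgu))         # f(Z, g(Z))
print(resolve('f(X, Y)', combined))    # f(j(K), g(j(K)))
print(resolve('f(Z, g(X))', combined)) # f(j(K), g(j(K)))
```

Under each substitution the two original terms resolve to the same term, which is exactly what makes them unifiers.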
Note that the reverse doesn't work, as we can't turn the second into the first by using a substitution. So we say that <tt class="docutils literal">{X=Z, Y=g(X)}</tt> is the most general unifier for the two given terms, and it's the <em>mgu</em> we want to find.</p> </div> <div class="section" id="an-algorithm-for-unification"> <h2>An algorithm for unification</h2> <p>Solving unification problems may seem simple, but there are a number of subtle corner cases to be aware of. In his 1991 paper <a class="reference external" href="https://www.semanticscholar.org/paper/Correcting-a-Widespread-Error-in-Unification-Norvig/95af3dc93c2e69b2c739a9098c3428a49e54e1b6">Correcting a Widespread Error in Unification Algorithms</a>, Peter Norvig noted a common error that exists in many books presenting the algorithm, including SICP.</p> <p>The correct algorithm is based on J.A. Robinson's 1965 paper &quot;A machine-oriented logic based on the resolution principle&quot;. More efficient algorithms have been developed over time since it was first published, but our focus here will be on correctness and simplicity rather than performance.</p> <p>The following implementation is based on Norvig's, and the full code (with tests) is <a class="reference external" href="https://github.com/eliben/code-for-blog/blob/master/2018/unif/unifier.py">available on Github</a>. This implementation uses Python 3, while Norvig's original is in Common Lisp. There's a slight difference in representations too, as Norvig uses the Lisp-y <tt class="docutils literal">(f X Y)</tt> syntax to denote an application of function <tt class="docutils literal">f</tt>. The two representations are isomorphic, and I'm picking the more classical one which is used in most papers on the subject. 
In any case, if you're interested in the more Lisp-y version, I have some Clojure <a class="reference external" href="https://github.com/eliben/paip-in-clojure/tree/master/src/paip/11_logic">code online</a> that ports Norvig's implementation more directly.</p> <p>We'll start by defining the data structure for terms:</p> <div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">Term</span><span class="p">:</span> <span class="k">pass</span> <span class="k">class</span> <span class="nc">App</span><span class="p">(</span><span class="n">Term</span><span class="p">):</span> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">fname</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">()):</span> <span class="bp">self</span><span class="o">.</span><span class="n">fname</span> <span class="o">=</span> <span class="n">fname</span> <span class="bp">self</span><span class="o">.</span><span class="n">args</span> <span class="o">=</span> <span class="n">args</span> <span class="c1"># Not shown here: __str__ and __eq__, see full code for the details...</span> <span class="k">class</span> <span class="nc">Var</span><span class="p">(</span><span class="n">Term</span><span class="p">):</span> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">):</span> <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span> <span class="k">class</span> <span class="nc">Const</span><span class="p">(</span><span class="n">Term</span><span class="p">):</span> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">value</span><span 
class="p">):</span> <span class="bp">self</span><span class="o">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">value</span> </pre></div> <p>An <tt class="docutils literal">App</tt> represents the application of function <tt class="docutils literal">fname</tt> to a sequence of arguments.</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">unify</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">subst</span><span class="p">):</span> <span class="sd">&quot;&quot;&quot;Unifies term x and y with initial subst.</span> <span class="sd"> Returns a subst (map of name-&gt;term) that unifies x and y, or None if</span> <span class="sd"> they can&#39;t be unified. Pass subst={} if no subst are initially</span> <span class="sd"> known. Note that {} means valid (but empty) subst.</span> <span class="sd"> &quot;&quot;&quot;</span> <span class="k">if</span> <span class="n">subst</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span> <span class="k">return</span> <span class="bp">None</span> <span class="k">elif</span> <span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">:</span> <span class="k">return</span> <span class="n">subst</span> <span class="k">elif</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">Var</span><span class="p">):</span> <span class="k">return</span> <span class="n">unify_variable</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">subst</span><span class="p">)</span> <span class="k">elif</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">Var</span><span class="p">):</span> <span 
class="k">return</span> <span class="n">unify_variable</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">subst</span><span class="p">)</span> <span class="k">elif</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">App</span><span class="p">)</span> <span class="ow">and</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">App</span><span class="p">):</span> <span class="k">if</span> <span class="n">x</span><span class="o">.</span><span class="n">fname</span> <span class="o">!=</span> <span class="n">y</span><span class="o">.</span><span class="n">fname</span> <span class="ow">or</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">args</span><span class="p">)</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="o">.</span><span class="n">args</span><span class="p">):</span> <span class="k">return</span> <span class="bp">None</span> <span class="k">else</span><span class="p">:</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">args</span><span class="p">)):</span> <span class="n">subst</span> <span class="o">=</span> <span class="n">unify</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">args</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">y</span><span class="o">.</span><span class="n">args</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span 
class="n">subst</span><span class="p">)</span> <span class="k">return</span> <span class="n">subst</span> <span class="k">else</span><span class="p">:</span> <span class="k">return</span> <span class="bp">None</span> </pre></div> <p><tt class="docutils literal">unify</tt> is the main function driving the algorithm. It looks for a <em>substitution</em>, which is a Python dict mapping variable names to terms. When either side is a variable, it calls <tt class="docutils literal">unify_variable</tt> which is shown next. Otherwise, if both sides are function applications, it ensures they apply the same function (otherwise there's no match) and then unifies their arguments one by one, carefully carrying the updated substitution throughout the process.</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">unify_variable</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">subst</span><span class="p">):</span> <span class="sd">&quot;&quot;&quot;Unifies variable v with term x, using subst.</span> <span class="sd"> Returns updated subst or None on failure.</span> <span class="sd"> &quot;&quot;&quot;</span> <span class="k">assert</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">Var</span><span class="p">)</span> <span class="k">if</span> <span class="n">v</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="n">subst</span><span class="p">:</span> <span class="k">return</span> <span class="n">unify</span><span class="p">(</span><span class="n">subst</span><span class="p">[</span><span class="n">v</span><span class="o">.</span><span class="n">name</span><span class="p">],</span> <span class="n">x</span><span class="p">,</span> <span class="n">subst</span><span class="p">)</span> <span class="k">elif</span> <span 
class="nb">isinstance</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">Var</span><span class="p">)</span> <span class="ow">and</span> <span class="n">x</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="n">subst</span><span class="p">:</span> <span class="k">return</span> <span class="n">unify</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">subst</span><span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">name</span><span class="p">],</span> <span class="n">subst</span><span class="p">)</span> <span class="k">elif</span> <span class="n">occurs_check</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">subst</span><span class="p">):</span> <span class="k">return</span> <span class="bp">None</span> <span class="k">else</span><span class="p">:</span> <span class="c1"># v is not yet in subst and can&#39;t simplify x. Extend subst.</span> <span class="k">return</span> <span class="p">{</span><span class="o">**</span><span class="n">subst</span><span class="p">,</span> <span class="n">v</span><span class="o">.</span><span class="n">name</span><span class="p">:</span> <span class="n">x</span><span class="p">}</span> </pre></div> <p>The key idea here is recursive unification. If <tt class="docutils literal">v</tt> is bound in the substitution, we try to unify its definition with <tt class="docutils literal">x</tt> to guarantee consistency throughout the unification process (and vice versa when <tt class="docutils literal">x</tt> is a variable). There's another function being used here - <tt class="docutils literal">occurs_check</tt>; I'm retaining its classical name from early presentations of unification. 
Its goal is to guarantee that we don't have self-referential variable bindings like <tt class="docutils literal">X=f(X)</tt> that would lead to potentially infinite unifiers.</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">occurs_check</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">term</span><span class="p">,</span> <span class="n">subst</span><span class="p">):</span> <span class="sd">&quot;&quot;&quot;Does the variable v occur anywhere inside term?</span> <span class="sd"> Variables in term are looked up in subst and the check is applied</span> <span class="sd"> recursively.</span> <span class="sd"> &quot;&quot;&quot;</span> <span class="k">assert</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">Var</span><span class="p">)</span> <span class="k">if</span> <span class="n">v</span> <span class="o">==</span> <span class="n">term</span><span class="p">:</span> <span class="k">return</span> <span class="bp">True</span> <span class="k">elif</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">term</span><span class="p">,</span> <span class="n">Var</span><span class="p">)</span> <span class="ow">and</span> <span class="n">term</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="n">subst</span><span class="p">:</span> <span class="k">return</span> <span class="n">occurs_check</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">subst</span><span class="p">[</span><span class="n">term</span><span class="o">.</span><span class="n">name</span><span class="p">],</span> <span class="n">subst</span><span class="p">)</span> <span class="k">elif</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">term</span><span class="p">,</span> <span class="n">App</span><span 
class="p">):</span> <span class="k">return</span> <span class="nb">any</span><span class="p">(</span><span class="n">occurs_check</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">arg</span><span class="p">,</span> <span class="n">subst</span><span class="p">)</span> <span class="k">for</span> <span class="n">arg</span> <span class="ow">in</span> <span class="n">term</span><span class="o">.</span><span class="n">args</span><span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="k">return</span> <span class="bp">False</span> </pre></div> <p>Let's see how this code handles some of the unification examples discussed earlier in the post. Starting with the pattern matching example, where variables are just on one side:</p> <div class="highlight"><pre><span></span>&gt;&gt;&gt; unify(parse_term(&#39;f(a, b, bar(t))&#39;), parse_term(&#39;f(a, V, X)&#39;), {}) {&#39;V&#39;: b, &#39;X&#39;: bar(t)} </pre></div> <p>Now the examples from the <em>Unification</em> section:</p> <div class="highlight"><pre><span></span>&gt;&gt;&gt; unify(parse_term(&#39;f(a, V, bar(D))&#39;), parse_term(&#39;f(D, k, bar(a))&#39;), {}) {&#39;D&#39;: a, &#39;V&#39;: k} &gt;&gt;&gt; unify(parse_term(&#39;f(X, Y)&#39;), parse_term(&#39;f(Z, g(X))&#39;), {}) {&#39;X&#39;: Z, &#39;Y&#39;: g(X)} </pre></div> <p>Next, let's try one where unification fails due to two conflicting definitions of the variable X.</p> <div class="highlight"><pre><span></span>&gt;&gt;&gt; unify(parse_term(&#39;f(X, Y, X)&#39;), parse_term(&#39;f(r, g(X), p)&#39;), {}) None </pre></div> <p>Lastly, it's instructive to trace through the execution of the algorithm for a non-trivial unification to see how it works.
Let's unify the terms <tt class="docutils literal"><span class="pre">f(X,h(X),Y,g(Y))</span></tt> and <tt class="docutils literal"><span class="pre">f(g(Z),W,Z,X)</span></tt>:</p> <ul class="simple"> <li><tt class="docutils literal">unify</tt> is called, sees the root is an <tt class="docutils literal">App</tt> of function <tt class="docutils literal">f</tt> and loops over the arguments.<ul> <li><tt class="docutils literal">unify(X, g(Z))</tt> invokes <tt class="docutils literal">unify_variable</tt> because <tt class="docutils literal">X</tt> is a variable, and the result is augmenting subst with <tt class="docutils literal">X=g(Z)</tt></li> <li><tt class="docutils literal">unify(h(X), W)</tt> invokes <tt class="docutils literal">unify_variable</tt> because <tt class="docutils literal">W</tt> is a variable, so the subst grows to <tt class="docutils literal">{X=g(Z), W=h(X)}</tt></li> <li><tt class="docutils literal">unify(Y, Z)</tt> invokes <tt class="docutils literal">unify_variable</tt>; since neither <tt class="docutils literal">Y</tt> nor <tt class="docutils literal">Z</tt> are in subst yet, the subst grows to <tt class="docutils literal">{X=g(Z), W=h(X), Y=Z}</tt> (note that the binding between two variables is arbitrary; <tt class="docutils literal">Z=Y</tt> would be equivalent)</li> <li><tt class="docutils literal">unify(g(Y), X)</tt> invokes <tt class="docutils literal">unify_variable</tt>; here things get more interesting, because <tt class="docutils literal">X</tt> is already in the subst, so now we call <tt class="docutils literal">unify</tt> on <tt class="docutils literal">g(Y)</tt> and <tt class="docutils literal">g(Z)</tt> (what <tt class="docutils literal">X</tt> is bound to)<ul> <li>The functions match for both terms (<tt class="docutils literal">g</tt>), so there's another loop over arguments, this time only for unifying <tt class="docutils literal">Y</tt> and <tt class="docutils literal">Z</tt></li> <li><tt class="docutils 
literal">unify_variable</tt> for <tt class="docutils literal">Y</tt> and <tt class="docutils literal">Z</tt> leads to lookup of <tt class="docutils literal">Y</tt> in the subst and then <tt class="docutils literal">unify(Z, Z)</tt>, which returns the unmodified subst; the result is that nothing new is added to the subst, but the unification of <tt class="docutils literal">g(Y)</tt> and <tt class="docutils literal">g(Z)</tt> succeeds, because it agrees with the existing bindings in subst</li> </ul> </li> </ul> </li> <li>The final result is <tt class="docutils literal">{X=g(Z), W=h(X), Y=Z}</tt></li> </ul> </div> <div class="section" id="efficiency"> <h2>Efficiency</h2> <p>The algorithm presented here is not particularly efficient, and when dealing with large unification problems it's wise to consider more advanced options. It copies subst around too much, and repeats work needlessly because it makes no attempt to cache terms that have already been unified.</p> <p>For a good overview of the efficiency of unification algorithms, I recommend checking out two papers:</p> <ul class="simple"> <li>&quot;An Efficient Unification Algorithm&quot; by Martelli and Montanari</li> <li>&quot;Unification: A Multidisciplinary Survey&quot; by Kevin Knight</li> </ul> </div> Partial and Total Orders2018-10-01T06:01:00-07:002018-10-01T06:01:00-07:00Eli Benderskytag:eli.thegreenplace.net,2018-10-01:/2018/partial-and-total-orders/<p>Imagine a set of 2D rectangles of different sizes; let's assume for the sake of simplicity that no two rectangles in this set have <em>exactly</em> the same size.
Here is a sample set:</p> <img alt="Five boxes of different sizes" class="align-center" src="https://eli.thegreenplace.net/images/2018/boxes-order.png" /> <p>We'll say that box X <strong>fits</strong> inside box Y if we could physically enclose X inside Y …</p><p>Imagine a set of 2D rectangles of different sizes; let's assume for the sake of simplicity that no two rectangles in this set have <em>exactly</em> the same size. Here is a sample set:</p> <img alt="Five boxes of different sizes" class="align-center" src="https://eli.thegreenplace.net/images/2018/boxes-order.png" /> <p>We'll say that box X <strong>fits</strong> inside box Y if we could physically enclose X inside Y; in other words, if Y's dimensions are larger than X's. In this example:</p> <ul class="simple"> <li>Box A can fit inside box B, but not the other way around</li> <li>E can fit inside all other boxes, but no other box can fit inside it</li> <li>A, B, D, E can fit inside C, which itself cannot fit in any of the other boxes</li> <li>D cannot fit inside A or B; neither can A or B fit inside D</li> </ul> <p>As we're going to see soon, in this case &quot;fits&quot; is a <em>partial order</em> on a set of 2D rectangular boxes, because even though we can order some of the boxes relative to each other, some other pairs of boxes have no relative order among themselves (for example A and D).</p> <p>If all pairs of boxes in this set had relative ordering - for example, consider the set without box D - we could define a <em>total order</em> on the set. 
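The &quot;fits&quot; relation is easy to express as a predicate in code. A minimal Python sketch follows; the box dimensions are invented to mirror the relationships in the drawing, not the actual sizes:

```python
# A box is a (width, height) pair. X "fits" inside Y iff both of X's
# dimensions are strictly smaller than Y's. Sizes below are illustrative
# guesses chosen to reproduce the relationships described above.
def fits(x, y):
    return x[0] < y[0] and x[1] < y[1]

A, B, C, D, E = (3, 5), (4, 6), (9, 8), (6, 2), (1, 1)

print(fits(A, B), fits(B, A))  # True False: A fits in B, not vice versa
print(fits(A, D), fits(D, A))  # False False: A and D are incomparable
print(all(fits(box, C) for box in (A, B, D, E)))  # True: all fit inside C
```

Pairs like A and D, for which the predicate is false in both directions, are exactly what makes this ordering partial rather than total.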
Another example for this is a set of 2D <em>squares</em> (rather than rectangles); as long as all the squares in the set have unique sizes <a class="footnote-reference" href="#id3" id="id1"></a>, we can always define a total order on them because for any pair of squares either the first can fit in the second, or vice versa.</p> <div class="section" id="mathematical-definition-of-relations"> <h2>Mathematical definition of relations</h2> <p>To develop a mathematically sound approach to ordering, we'll have to dip our feet into set theory and <em>relations</em>. We'll only be talking about binary relations here.</p> <p>Given a set A, a <em>relation on A</em> is a set of pairs with elements taken from A. A bit more rigorously, given that <object class="valign-0" data="https://eli.thegreenplace.net/images/math/bc659bc638626217264a2aa7a0cca55c0cc40ddc.svg" style="height: 12px;" type="image/svg+xml">A\times A</object> is the set containing all possible ordered pairs taken from A (a.k.a. the <em>Cartesian</em> product of A), then R is a relation on A if it's a subset of <object class="valign-0" data="https://eli.thegreenplace.net/images/math/bc659bc638626217264a2aa7a0cca55c0cc40ddc.svg" style="height: 12px;" type="image/svg+xml">A\times A</object>, or <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/6e8b00b1044916a1bd2a6a15ca276c60b4687b15.svg" style="height: 15px;" type="image/svg+xml">R\subseteq A\times A</object>.</p> <p>For example, given the set <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/45bcabcc09a2d7523e72b58250c14b1d1038c22a.svg" style="height: 18px;" type="image/svg+xml">A=\{1,2,3\}</object>, then:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/212e13c924f24ce8e2115145e3169ac4ffbd0a4a.svg" style="height: 19px;" type="image/svg+xml"> $A\times 
A=\{\left(1,1\right),\left(1,2\right),\left(1,3\right),\left(2,1\right),\left(2,2\right),\left(2,3\right),\left(3,1\right),\left(3,2\right),\left(3,3\right)\}$</object> <p>Note that we explicitly defined the pairs to be <em>ordered</em>, meaning that (1,2) and (2,1) are two distinct elements in this set.</p> <p>By definition, any subset of <object class="valign-0" data="https://eli.thegreenplace.net/images/math/bc659bc638626217264a2aa7a0cca55c0cc40ddc.svg" style="height: 12px;" type="image/svg+xml">A\times A</object> is a relation on A. For example <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/fe939ac9c1402f39b62f0176b30bbda8cb269aa9.svg" style="height: 19px;" type="image/svg+xml">R=\{\left(1,1\right),\left(2,2\right),\left(3,3\right)\}</object>. In programming, we often use the term <em>predicate</em> to express a similar idea. A predicate is a function with a binary outcome, and the correspondence to relations is trivial - we just say that all pairs belonging to the relation satisfy the predicate, and vice versa. If we defined a predicate <tt class="docutils literal">R(x,y)</tt> to be true if and only if <tt class="docutils literal"><span class="pre">x==y</span></tt>, we'd get the relation above.</p> <p>A shortcut notation that will make definitions cleaner: we say <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2a3189528858697b5fc631eaf27ea7cc5a0a0c00.svg" style="height: 16px;" type="image/svg+xml">xRy</object> when <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/3e97fb0a8391d7acb982ec45e84661d02fcb09dd.svg" style="height: 18px;" type="image/svg+xml">\left(x,y\right)\in R</object>. In our example set 1R1, 2R2 and 3R3. This notation is a bit awkward, but it's the accepted standard in math; therefore I'm using it for consistency with other sources.</p> <p>Besides, it becomes nicer when R is an operator. 
If we redefine R as <tt class="docutils literal">==</tt>, it becomes more natural: <tt class="docutils literal"><span class="pre">1==1</span></tt>, <tt class="docutils literal"><span class="pre">2==2</span></tt>, <tt class="docutils literal"><span class="pre">3==3</span></tt>. The equality relation is a perfectly valid relation on a set - its elements are all the pairs where both members are the same value.</p> </div> <div class="section" id="properties-of-relations"> <h2>Properties of relations</h2> <p>There are a number of useful properties relations could have. Here are just a few that we'll need for the rest of the article; for a longer list, see the <a class="reference external" href="https://en.wikipedia.org/wiki/Binary_relation">Wikipedia page</a>.</p> <p><strong>Reflexive</strong>: every element in the set is related to itself, or <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/962581a9498b2d84cd3587621102a61e32e70f77.svg" style="height: 17px;" type="image/svg+xml">\forall x\in A, xRx</object>. The <tt class="docutils literal">==</tt> relation shown above is reflexive.</p> <p><strong>Irreflexive</strong>: no element in the set is related to itself, or <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/796bac36bef12c06c76254d8eec1a27251a1cb47.svg" style="height: 17px;" type="image/svg+xml">\neg\exists x\in A, xRx</object>. For example if we define the <tt class="docutils literal">&lt;</tt> less than relation on numbers, it's irreflexive since no number is less than itself. In our boxes example, the &quot;fits in&quot; relation is irreflexive because no box can fit inside itself.</p> <p><strong>Transitive</strong>: intuitively, &quot;if x fits inside y, and y fits inside z, then x fits inside z&quot;. 
Mathematically <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/50b9110f993774dd55916f5caf85f15c01fa100e.svg" style="height: 18px;" type="image/svg+xml">\forall x,y,z \in A, \left(xRy \wedge yRz \right )\rightarrow xRz</object>. The <tt class="docutils literal">&lt;</tt> relation on numbers is obviously transitive.</p> <p><strong>Symmetric</strong>: if x is related to y, then y is related to x. This might sound obvious with the colloquial meaning of &quot;related&quot;, but not in the mathematical sense. Most relations we deal with aren't symmetric. The definition is <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/210765ef171f3c2c0765b6ac8c0f9321776afc0b.svg" style="height: 17px;" type="image/svg+xml">\forall x,y \in A, xRy \rightarrow yRx</object>. For example, the relation <tt class="docutils literal">==</tt> is symmetric, but <tt class="docutils literal">&lt;</tt> is not symmetric.</p> <p><strong>Antisymmetric</strong>: if x is related to y, then y is <em>not</em> related to x unless x and y are the same element; mathematically <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/128db3f8c70b2d33ff35ced497a688f9d9db8706.svg" style="height: 18px;" type="image/svg+xml">\forall x,y \in A, \left(xRy \wedge yRx \right ) \rightarrow x=y</object>. For example, the relation <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/60fd4c42f3956e697cf94397160a51086fbb6f5b.svg" style="height: 15px;" type="image/svg+xml">\le</object> (less than or equal) is antisymmetric; if <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9671bd6c8271173673b6deb89be8ab5c4fb98511.svg" style="height: 16px;" type="image/svg+xml">x \le y</object> and also <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1f6f56272ca54db1a574f2402e651dae340f18a5.svg" style="height: 16px;" type="image/svg+xml">y \le x</object> then it must be that x and y are the same number. 
The relation <tt class="docutils literal">&lt;</tt> is also antisymmetric, though only in an empty sense: no pair x and y satisfies the left side of the definition, so the implication holds; in logic such a statement is said to be <em>vacuously true</em>.</p> </div> <div class="section" id="partial-order"> <h2>Partial order</h2> <p>There are two kinds of partial orders we can define - <em>weak</em> and <em>strong</em>. The <em>weak</em> partial order is the more common one, so let's start with that. Whenever I'm saying just &quot;partial order&quot;, I'll mean a weak partial order.</p> <p>A <em>weak partial order</em> (a.k.a. <em>non-strict</em>) is a relation on a set A that is reflexive, transitive and antisymmetric. The <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/60fd4c42f3956e697cf94397160a51086fbb6f5b.svg" style="height: 15px;" type="image/svg+xml">\le</object> relation on numbers is a classical example:</p> <ul class="simple"> <li>It is reflexive because for any number x we have <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/3f43bb5a3905f424fae578127d51f13208cd264a.svg" style="height: 15px;" type="image/svg+xml">x\le x</object></li> <li>It is transitive because given <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/a4e6f960762caa31e09625ed52234681f6abad1e.svg" style="height: 16px;" type="image/svg+xml">x\le y</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/bdee2aefcb32e8811b5950ad3fb6410888d2e955.svg" style="height: 16px;" type="image/svg+xml">y\le z</object>, we know that <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/8a8f22b672c4ddf78e9a95b4bad6927d0037e980.svg" style="height: 15px;" type="image/svg+xml">x\le z</object></li> <li>It is antisymmetric because given <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/a4e6f960762caa31e09625ed52234681f6abad1e.svg" style="height: 16px;" type="image/svg+xml">x\le
y</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1f6f56272ca54db1a574f2402e651dae340f18a5.svg" style="height: 16px;" type="image/svg+xml">y \le x</object>, we know that x and y are the same number</li> </ul> <p>A <em>strong partial order</em> (a.k.a. <em>strict</em>) is a relation on a set A that is irreflexive, transitive and antisymmetric. The difference between weak and strong partial orders is reflexivity. In weak partial orders, every element is related to itself; in strong partial orders, no element is related to itself. The operator &lt; on numbers is an example of a strict partial order, since it satisfies all the properties; while <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/60fd4c42f3956e697cf94397160a51086fbb6f5b.svg" style="height: 15px;" type="image/svg+xml">\le</object> is reflexive, &lt; is irreflexive.</p> <p>Our rectangular boxes with the &quot;fits&quot; relation are a good example for distinguishing between the two. We can only define a <em>strong</em> partial order on them, because a box cannot fit inside itself.</p> <p>Another good example is a morning dressing routine. The set of clothes to wear is {underwear, pants, jacket, shirt, left sock, right sock, left shoe, right shoe}, and the relation is &quot;has to be worn before&quot;. The following drawing encodes the relation:</p> <img alt="Partial ordering of dressing different clothes; what comes before what" class="align-center" src="https://eli.thegreenplace.net/images/2018/dressing-partial-order.png" /> <p>This kind of drawing is called a <a class="reference external" href="https://en.wikipedia.org/wiki/Hasse_diagram">Hasse diagram</a>, which is useful to graphically represent partially ordered sets <a class="footnote-reference" href="#id4" id="id2"></a>; the arrow represents the relation.
For example, the arrow from &quot;pants&quot; to &quot;left shoe&quot; encodes that pants have to be worn before the left shoe.</p> <p>Note that this relation is irreflexive, because it's meaningless to say that &quot;pants have to be worn before wearing pants&quot;. Therefore, the relation defines a <em>strong</em> partial order on the set.</p> <p>Similarly to the rectangular boxes example, the partial order here lets us order only some of the elements in the set w.r.t. each other. Some elements like socks and a shirt don't have an order defined.</p> </div> <div class="section" id="total-order"> <h2>Total order</h2> <p>A total order is a partial order that has one additional property - any two elements in the set should be related. Mathematically:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/33471be0a98a586cd93417bb3c47c3c8e210c01a.svg" style="height: 18px;" type="image/svg+xml"> $\forall x\in A\forall y\in A, \left(xRy \vee yRx \right )$</object> <p>While a partial order lets us order <em>some</em> elements in a set w.r.t. each other, a total order requires us to be able to order <em>all</em> elements in a set. In the boxes example, we can't define a total order for rectangular boxes (there is no &quot;fits in&quot; relation between boxes A and D, no matter which way we try). We <em>can</em> define a total order between square boxes, however, as long as their sizes are unique.</p> <p>Neither can we define a total order for the dressing diagram shown above, because we can't say either &quot;left socks have to be worn before shirts&quot; or &quot;shirts have to be worn before left socks&quot;.</p> </div> <div class="section" id="examples-from-programming"> <h2>Examples from programming</h2> <p>Partial and total orders frequently come up in programming, especially when thinking about sorts. Sorting an array usually implies finding some <em>total order</em> on its elements. Tie breaking is important, but not always possible.
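A quick Python sketch of sorting without tie breaking (the sample records are invented for illustration):

```python
# Sorting records by a single key yields only a weak ordering: records that
# share a key are indistinguishable to the comparison. Python's built-in
# sort is stable, so such ties keep their original relative order.
records = [('banana', 3), ('apple', 2), ('cherry', 3), ('date', 2)]

# Sort by the numeric field only; 'apple'/'date' tie on 2,
# 'banana'/'cherry' tie on 3.
by_count = sorted(records, key=lambda r: r[1])

print(by_count)
# [('apple', 2), ('date', 2), ('banana', 3), ('cherry', 3)]
```

Under this key, a non-stable sort would be free to emit the tied records in either order.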
If there is no way to tell two elements apart, we cannot mathematically come up with a total order, but we can still sort (and we do have a weak partial order). This is where the distinction between regular and <a class="reference external" href="https://en.cppreference.com/w/cpp/algorithm/stable_sort">stable sorts</a> comes in.</p> <p>Sometimes we're sorting non-linear structures, like dependency graphs in the dressing example from above. In these cases a total order is impossible, but we do have a partial order, which can be used to find a &quot;valid&quot; dressing order - a linear sequence of dressing steps that wouldn't violate any constraints. This can be done with <a class="reference external" href="http://eli.thegreenplace.net/2015/directed-graph-traversal-orderings-and-applications-to-data-flow-analysis/">topological sorting</a>, which finds a valid &quot;linearization&quot; of the dependency graph.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id3" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>You may notice that saying &quot;unique&quot; when talking about sets can sound superfluous; after all, sets are defined to have distinct elements. That said, it's not clear what &quot;distinct&quot; means. In our case, distinct can refer to the complete identities of the boxes; for example, two boxes can have the exact same dimensions but different colors - so they are not the same as far as the set is concerned. Moreover, in programming, identity is even more nuanced and can be defined for specific types in specific ways.
For these reasons I'm going to call out uniqueness explicitly to avoid confusion.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id4" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>A <em>partially ordered set with R</em> (or <em>poset with R</em>) is a set with a relation R that is a partial order on it.</td></tr> </tbody> </table> </div> Minimal character-based LSTM implementation2018-06-07T05:34:00-07:002018-06-07T05:34:00-07:00Eli Benderskytag:eli.thegreenplace.net,2018-06-07:/2018/minimal-character-based-lstm-implementation/<p>Following up on <a class="reference external" href="https://eli.thegreenplace.net/2018/understanding-how-to-implement-a-character-based-rnn-language-model/">the earlier post</a> deciphering a minimal vanilla RNN implementation, here I'd like to extend the example to a simple LSTM model.</p> <p>Once again, the idea is to combine a well-commented code sample (<a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/master/min-char-rnn/min-char-lstm.py">available here</a>) with some high-level diagrams and math to enable someone to fully understand the …</p><p>Following up on <a class="reference external" href="https://eli.thegreenplace.net/2018/understanding-how-to-implement-a-character-based-rnn-language-model/">the earlier post</a> deciphering a minimal vanilla RNN implementation, here I'd like to extend the example to a simple LSTM model.</p> <p>Once again, the idea is to combine a well-commented code sample (<a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/master/min-char-rnn/min-char-lstm.py">available here</a>) with some high-level diagrams and math to enable someone to fully understand the code. 
The LSTM architecture presented herein is the standard one originating from Hochreiter and Schmidhuber's <a class="reference external" href="https://www.google.com/search?q=lstm+hochreiter">1997 paper</a>. It's described pretty much everywhere; <a class="reference external" href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Chris Olah's post</a> has particularly nice diagrams and is worth reading.</p> <div class="section" id="lstm-cell-structure"> <h2>LSTM cell structure</h2> <p>From 30,000 feet, LSTMs look just like regular RNNs; there's a &quot;cell&quot; that has a recurrent connection (output tied to input), and when trained this cell is usually unrolled to some fixed length.</p> <p>So we can take the basic RNN structure from the <a class="reference external" href="https://eli.thegreenplace.net/2018/understanding-how-to-implement-a-character-based-rnn-language-model">previous post</a>:</p> <img alt="Basic RNN diagram" class="align-center" src="https://eli.thegreenplace.net/images/2018/rnnbasic.png" /> <p>LSTMs are a bit trickier because there are two recurrent connections; these can be &quot;packed&quot; into a single vector <em>h</em>, so the above diagram still applies. Here's how an LSTM cell looks inside:</p> <img alt="LSTM cell" class="align-center" src="https://eli.thegreenplace.net/images/2018/lstm-cell.png" /> <p><em>x</em> is the input; <em>p</em> holds the probabilities computed from the output <em>y</em> (these symbols are named consistently with my earlier RNN post) and exits the cell at the bottom purely for topological convenience. The two memory vectors are <em>h</em> and <em>c</em> - as mentioned earlier, they could be combined into a single vector, but are shown here separately for clarity.</p> <p>The main idea of LSTMs is to enable training of longer sequences by providing a &quot;fast-path&quot; to back-propagate information farther down in memory. Hence the <em>c</em> vector is not multiplied by any matrices on its path.
The circle-in-circle block means element-wise multiplication of two vectors; plus-in-square is element-wise addition. The funny Greek letter is the Sigmoid non-linearity:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/8b0db8368e8a617143fa6566f42c1e47cd833c9c.svg" style="height: 38px;" type="image/svg+xml"> $\sigma(x) =\frac{1}{1+e^{-x}}$</object> <p>The only other block we haven't seen in the vanilla RNN diagram is the colon-in-square in the bottom-left corner; this is simply the concatenation of <em>h</em> and <em>x</em> into a single column vector. In addition, I've combined the &quot;multiply by matrix <em>W</em>, then add bias <em>b</em>&quot; operation into a single rectangular box to save on precious diagram space.</p> <p>Here are the equations computed by a cell:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/c2cc966ba7ce8075317b87885bc9c432aafe2dba.svg" style="height: 249px;" type="image/svg+xml"> \begin{align*} xh&amp;=x^{[t]}:h^{[t-1]}\\ f&amp;=\sigma(W_f\cdot xh+b_f)\\ i&amp;=\sigma(W_i\cdot xh+b_i)\\ o&amp;=\sigma(W_o\cdot xh+b_o)\\ cc&amp;=tanh(W_{cc}\cdot xh+b_{cc})\\ c^{[t]}&amp;=c^{[t-1]}\odot f +cc\odot i\\ h^{[t]}&amp;=tanh(c^{[t]})\odot o\\ y^{[t]}&amp;=W_{y}\cdot h^{[t]}+b_y\\ p^{[t]}&amp;=softmax(y^{[t]})\\ \end{align*}</object> </div> <div class="section" id="backpropagating-through-an-lstm-cell"> <h2>Backpropagating through an LSTM cell</h2> <p>This works <em>exactly</em> like backprop through a vanilla RNN; we have to carefully compute how the gradient flows through every node and make sure we properly combine gradients at fork points. Most of the elements in the LSTM diagram are familiar from the <a class="reference external" href="https://eli.thegreenplace.net/2018/understanding-how-to-implement-a-character-based-rnn-language-model">previous post</a>.
Let's briefly work through the new ones.</p> <p>First, the Sigmoid function; it's an elementwise function, and computing its derivative is very similar to the <em>tanh</em> function discussed in the previous post. As usual, given <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/e9ef6bd037537d5fe08743736acadccc09e70b06.svg" style="height: 18px;" type="image/svg+xml">f=\sigma(k)</object>, from the chain rule we have the following derivative w.r.t. some weight <em>w</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/57e3f2cab3c9b46a03d763a2f73b83963a1cd500.svg" style="height: 39px;" type="image/svg+xml"> $\frac{\partial f}{\partial w}=\frac{\partial \sigma(k)}{\partial k}\frac{\partial k}{\partial w}$</object> <p>To compute the derivative <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/8aa59f2f536b727cf97239b345ddcc98e41c2c91.svg" style="height: 26px;" type="image/svg+xml">\frac{\partial \sigma(k)}{\partial k}</object>, we'll use the ratio-derivative formula:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/9e006cf5e9f1f8ccac82ba1f2bcdabd710731756.svg" style="height: 42px;" type="image/svg+xml"> $(\frac{f}{g})&#x27;=\frac{f&#x27;g-g&#x27;f}{g^2}$</object> <p>So:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/e3f7af782f52215e8326b389271709a440993984.svg" style="height: 44px;" type="image/svg+xml"> $\sigma &#x27;(k)=\frac{e^{-k}}{(1+e^{-k})^2}$</object> <p>A clever way to express this is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/eb1953be928287ff01ae23dfb4ff1cb2290854c9.svg" style="height: 20px;" type="image/svg+xml"> $\sigma &#x27;(k)=\sigma(k)(1-\sigma(k))$</object> <p>Going back to the chain rule with <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/e9ef6bd037537d5fe08743736acadccc09e70b06.svg" style="height: 18px;"
type="image/svg+xml">f=\sigma(k)</object>, we get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/885829ecab969c96daed7f0df6e5864339ad9d8b.svg" style="height: 38px;" type="image/svg+xml"> $\frac{\partial f}{\partial w}=f(1-f)\frac{\partial k}{\partial w}$</object> <p>The other new operation we'll have to find the derivative of is element-wise multiplication. Let's say we have the column vectors <em>x</em>, <em>y</em> and <em>z</em>, each with <em>m</em> rows, and we have <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/660b1e0dacc15aa3737b8170c3ecfdcbc6e77db4.svg" style="height: 18px;" type="image/svg+xml">z(x)=x\odot y</object>. Since <em>z</em> as a function of <em>x</em> has <em>m</em> inputs and <em>m</em> outputs, its Jacobian has dimensions [m,m].</p> <p><object class="valign-m6" data="https://eli.thegreenplace.net/images/math/0ab96cb4e5d8c6ba3ac8038fda07d518bbe1f388.svg" style="height: 18px;" type="image/svg+xml">D_{j}z_{i}</object> is the derivative of the i-th element of <em>z</em> w.r.t. the j-th element of <em>x</em>. 
For <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/660b1e0dacc15aa3737b8170c3ecfdcbc6e77db4.svg" style="height: 18px;" type="image/svg+xml">z(x)=x\odot y</object> this is non-zero only when <em>i</em> and <em>j</em> are equal, and in that case the derivative is <img alt="y_i" class="valign-m4" src="https://eli.thegreenplace.net/images/math/35c2ac2f82d0ff8f9011b596ed7e54bfcc55f471.png" style="height: 12px;" />.</p> <p>Therefore, <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/e6631f3b13f877a8bb7b3b6a0c0d2ca110ecce23.svg" style="height: 18px;" type="image/svg+xml">Dz(x)</object> is a square matrix with the elements of <em>y</em> on the diagonal and zeros elsewhere:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/2450b2e2a827054f5d292822ff292eaa63c77d1b.svg" style="height: 97px;" type="image/svg+xml"> $Dz=\begin{bmatrix} y_1 &amp; 0 &amp; \cdots &amp; 0 \\ 0 &amp; y_2 &amp; \cdots &amp; 0 \\ \vdots &amp; \ddots &amp; \ddots &amp; \vdots \\ 0 &amp; 0 &amp; \cdots &amp; y_m \\ \end{bmatrix}$</object> <p>If we want to backprop some loss <em>L</em> through this function, we get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/48b17da284ae52bc4b9fdeb7b98b73f398bd4458.svg" style="height: 38px;" type="image/svg+xml"> $\frac{\partial L}{\partial x}=\frac{\partial L}{\partial z}Dz$</object> <p>As <em>x</em> has <em>m</em> elements, the right-hand side of this equation multiplies a [1,m] vector by a [m,m] matrix which is diagonal, resulting in element-wise multiplication with the matrix's diagonal elements. 
In other words:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/e2a6c0742fb006e35e3001d3b3d33f78316fb1e8.svg" style="height: 38px;" type="image/svg+xml"> $\frac{\partial L}{\partial x}=\frac{\partial L}{\partial z}\odot y$</object> <p>In code, it looks like this:</p> <div class="highlight"><pre><span></span><span class="c1"># Assuming dz is the gradient of loss w.r.t. z; dz, y and dx are all</span> <span class="c1"># column vectors.</span> <span class="n">dx</span> <span class="o">=</span> <span class="n">dz</span> <span class="o">*</span> <span class="n">y</span> </pre></div> </div> <div class="section" id="model-quality"> <h2>Model quality</h2> <p>In the <a class="reference external" href="https://eli.thegreenplace.net/2018/understanding-how-to-implement-a-character-based-rnn-language-model/">post about min-char-rnn</a>, we've seen that the vanilla RNN generates fairly low-quality text:</p> <blockquote> one, my dred, roriny. qued bamp gond hilves non froange saws, to mold his a work, you shirs larcs anverver strepule thunboler muste, thum and cormed sightourd so was rewa her besee pilman</blockquote> <p>The LSTM's generated text quality is somewhat better when trained with roughly the same hyper-parameters:</p> <blockquote> the she, over is was besiving the fact to seramed for i said over he will round, such when a where, &quot;i went of where stood it at eye heardul rrawed only coside the showed had off with the refaurtoned</blockquote> <p>I'm fairly sure that it can be made to perform even better with larger memory vectors and more training data. That said, an even more advanced architecture can be helpful here.
Moreover, since this is a <em>character</em>-based model, to really capture effects between words a few words apart we'll need a much deeper LSTM (since I'm unrolling to 16 characters, we can only capture 2-3 words), and hence much more training data and time.</p> <p>Once again, the goal here is not to develop a state-of-the-art language model, but to show a simple, comprehensible example of how an LSTM is implemented end-to-end in Python code. <a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/master/min-char-rnn/min-char-lstm.py">The full code is here</a> - please let me know if you find any issues with it or something still remains unclear.</p> </div> Understanding how to implement a character-based RNN language model2018-05-25T05:20:00-07:002018-05-25T05:20:00-07:00Eli Benderskytag:eli.thegreenplace.net,2018-05-25:/2018/understanding-how-to-implement-a-character-based-rnn-language-model/<p>In <a class="reference external" href="https://gist.github.com/karpathy/d4dee566867f8291f086">a single gist</a>, <a class="reference external" href="https://cs.stanford.edu/people/karpathy/">Andrej Karpathy</a> did something truly impressive. In a little over 100 lines of Python - without relying on any heavy-weight machine learning frameworks - he presents a fairly complete implementation of training a character-based recurrent neural network (RNN) language model; this includes the full backpropagation learning with Adagrad …</p><p>In <a class="reference external" href="https://gist.github.com/karpathy/d4dee566867f8291f086">a single gist</a>, <a class="reference external" href="https://cs.stanford.edu/people/karpathy/">Andrej Karpathy</a> did something truly impressive.
In a little over 100 lines of Python - without relying on any heavy-weight machine learning frameworks - he presents a fairly complete implementation of training a character-based recurrent neural network (RNN) language model; this includes the full backpropagation learning with Adagrad optimization.</p> <p>I love such minimal examples because they allow me to understand some topic in full depth, connecting the math to the code and having a complete picture of how everything works. In this post I want to present a companion explanation to Karpathy's gist, showing the diagrams and math that hide in its Python code.</p> <p>My own fork of the code <a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/master/min-char-rnn/min-char-rnn.py">is here</a>; it's semantically equivalent to Karpathy's gist, but includes many more comments and some debugging options. I won't reproduce the whole program here; instead, the idea is that you'd go through the code while reading this article. The diagrams, formulae and explanations here are complementary to the code comments.</p> <div class="section" id="what-rnns-do"> <h2>What RNNs do</h2> <p>I expect readers to have a basic idea of what RNNs do and why they work well for some problems. RNNs are well-suited for problem domains where the input (and/or output) is some sort of a sequence - time-series financial data, words or sentences in natural language, speech, etc.</p> <p>There is <em>a lot</em> of material about this online, and the basics are easy to understand for anyone with even a bit of machine learning background.
However, there is not enough coherent material online about how RNNs are implemented and trained - this is the goal of this post.</p> </div> <div class="section" id="character-based-rnn-language-model"> <h2>Character-based RNN language model</h2> <p>The basic structure of <tt class="docutils literal"><span class="pre">min-char-rnn</span></tt> is represented by this recurrent diagram, where <em>x</em> is the input vector (at time step <em>t</em>), <em>y</em> is the output vector and <em>h</em> is the <em>state vector</em> kept inside the model.</p> <img alt="Basic RNN diagram" class="align-center" src="https://eli.thegreenplace.net/images/2018/rnnbasic.png" /> <p>The line leaving and returning to the cell represents that the state is retained between invocations of the network. When a new time step arrives, some things are still the same (the weights inherent to the network, as we shall soon see) but some things are different - <em>h</em> may have changed. Therefore, unlike stateless NNs, <em>y</em> is not simply a function of <em>x</em>; in RNNs, identical <em>x</em>s can produce different <em>y</em>s, because <em>y</em> is a function of <em>x</em> and <em>h</em>, and <em>h</em> can change between steps.</p> <p>The <em>character-based</em> part of the model's name means that every input vector represents a single character (as opposed to, say, a word or part of an image). <tt class="docutils literal"><span class="pre">min-char-rnn</span></tt> uses one-hot vectors to represent different characters.</p> <p>A <em>language model</em> is a particular kind of machine learning algorithm that learns the statistical structure of language by &quot;reading&quot; a large corpus of text. 
This model can then reproduce authentic language segments - by predicting the next character (or word, for word-based models) based on past characters.</p> </div> <div class="section" id="internal-structure-of-the-rnn-cell"> <h2>Internal structure of the RNN cell</h2> <p>Let's proceed by looking into the internal structure of the RNN cell in <tt class="docutils literal"><span class="pre">min-char-rnn</span></tt>:</p> <img alt="RNN cell for min-char-rnn" class="align-center" src="https://eli.thegreenplace.net/images/2018/min-char-rnn-cell.png" /> <ul class="simple"> <li>Bold-faced symbols in reddish color are the model's parameters, weights for matrix multiplication and biases.</li> <li>The state vector <em>h</em> is shown twice - once for its past value, and once for its currently computed value. Whenever the RNN cell is invoked in sequence, the last computed state <em>h</em> is passed in from the left.</li> <li>In this diagram <em>y</em> is not the final answer of the cell - we compute a softmax function on it to obtain <em>p</em> - the probabilities for output characters <a class="footnote-reference" href="#id7" id="id1"></a>.
I'm using these symbols for consistency with the code of <tt class="docutils literal"><span class="pre">min-char-rnn</span></tt>, though it would probably be more readable to flip the uses of <em>p</em> and <em>y</em> (making <em>y</em> the actual output of the cell).</li> </ul> <p>Mathematically, this cell computes:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/886e94d526c2e538f1ba4414696ae9bf6618f0ff.svg" style="height: 82px;" type="image/svg+xml"> \begin{align*} h^{[t]}&amp;=tanh(W_{hh}\cdot h^{[t-1]}+W_{xh}\cdot x^{[t]}+b_h)\\ y^{[t]}&amp;=W_{hy}\cdot h^{[t]}+b_y\\ p^{[t]}&amp;=softmax(y^{[t]}) \end{align*}</object> </div> <div class="section" id="learning-model-parameters-with-backpropagation"> <h2>Learning model parameters with backpropagation</h2> <p>This section will examine how we can <em>learn</em> the parameters <em>W</em> and <em>b</em> for this model. Mostly it's standard neural-network fare; we'll compute the derivatives of all the steps involved and will then employ backpropagation to find a parameter update based on some computed loss.</p> <p>There's one serious issue we'll have to address first. Backpropagation is usually defined on <em>acyclic</em> graphs, so it's not entirely clear how to apply it to our RNN. Is <em>h</em> an input? An output? Both? In the original high-level diagram of the RNN cell, <em>h</em> is both an input and an output - how can we compute the gradient for it when we don't know its value yet? <a class="footnote-reference" href="#id8" id="id2"></a></p> <p>The way out of this conundrum is to <em>unroll</em> the RNN for a few steps. 
Note that we're already doing this in the detailed diagram by distinguishing between <object class="valign-0" data="https://eli.thegreenplace.net/images/math/057276c060e575533321773afb483e778e6a03f1.svg" style="height: 16px;" type="image/svg+xml">h^{[t]}</object> and <object class="valign-0" data="https://eli.thegreenplace.net/images/math/e4bc0503e20a8e6b82d9c86e10eb2c8e1dfe3471.svg" style="height: 16px;" type="image/svg+xml">h^{[t-1]}</object>. This makes every RNN cell <em>locally acyclic</em>, which makes it possible to use backpropagation on it. This approach has a cool-sounding name - <em>Backpropagation Through Time</em> (BPTT) - although it's really the same as regular backpropagation.</p> <p>Note that the architecture used here is called &quot;synced many-to-many&quot; in Karpathy's <a class="reference external" href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">Unreasonable Effectiveness of RNNs post</a>, and it's useful for training a simple char-based language model - we immediately observe the output sequence produced by the model while reading the input. 
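As a concrete reminder of what each unrolled step computes, here's the cell's forward pass sketched in numpy (a sketch with variable names of my own choosing, mirroring the math above rather than the exact code of <tt class="docutils literal"><span class="pre">min-char-rnn</span></tt>):

```python
import numpy as np

def rnn_step_forward(x, hprev, Wxh, Whh, Why, bh, by):
    # x is a one-hot column vector; hprev is the state from the previous step.
    h = np.tanh(np.dot(Whh, hprev) + np.dot(Wxh, x) + bh)
    y = np.dot(Why, h) + by
    # Softmax turns the raw scores y into the probability vector p.
    e = np.exp(y - np.max(y))  # shift by max(y) for numerical stability
    p = e / np.sum(e)
    return h, y, p
```

During training, this function would be called once per unrolled step, threading each step's <tt class="docutils literal"><span class="pre">h</span></tt> into the next call.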
Similar unrolling can be applied to other architectures, like encoder-decoder.</p> <p>Here's our RNN again, unrolled for 3 steps:</p> <img alt="Unrolled RNN diagram" class="align-center" src="https://eli.thegreenplace.net/images/2018/rnnunroll.png" /> <p>Now the same diagram, with the gradient flows depicted with orange-ish arrows:</p> <img alt="Unrolled RNN diagram with gradient flow arrows shown" class="align-center" src="https://eli.thegreenplace.net/images/2018/rnnunrollgrad.png" /> <p>With this unrolling, we have everything we need to compute the actual weight updates during learning, because when we want to compute the gradients through step 2, we already have the incoming gradient <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/41bad72882e3d266373df060e8ab3ce36a819679.svg" style="height: 18px;" type="image/svg+xml">\Delta h</object>, and so on.</p> <p>You may now wonder: what is <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/0bdf36986644d54bc1bccf1410d2b9f0f86cf697.svg" style="height: 18px;" type="image/svg+xml">\Delta h[t]</object> for the final step at time <em>t</em>?</p> <p>In some models, sequence lengths are fairly limited. For example, when we translate a single sentence, the sequence length is rarely over a couple dozen words; for such models we can fully unroll the RNN. The <em>h</em> state output of the final step doesn't really &quot;go anywhere&quot;, and we assume its gradient is zero. Similarly, the incoming state <em>h</em> for the first step is zero.</p> <p>Other models work on potentially infinite sequence lengths, or sequences much too long for unrolling. The language model in <tt class="docutils literal"><span class="pre">min-char-rnn</span></tt> is a good example, because it can theoretically ingest and emit text of any length. For these models we'll perform <em>truncated</em> BPTT, by just assuming that the influence of the current state extends only <em>N</em> steps into the future.
We'll then unroll the model <em>N</em> times and assume that <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/9851a3637afe3f6d70466ac3a1d1c104935647fd.svg" style="height: 18px;" type="image/svg+xml">\Delta h[N]</object> is zero. Although it really isn't, for a large enough <em>N</em> this is a fairly safe assumption. RNNs are hard to train on very long sequences for other reasons, anyway (we'll touch upon this point again towards the end of the post).</p> <p>Finally, it's important to remember that although we unroll the RNN cells, all parameters (weights, biases) are <em>shared</em>. This plays an important part in ensuring <em>translation invariance</em> for the models - patterns learned in one place apply to another place <a class="footnote-reference" href="#id9" id="id3"></a>. It leaves the question of how to update the weights, since we compute gradients for them separately in each step. The answer is very simple - just add them up. This is similar to other cases where the output of a cell branches off in two directions - when gradients are computed, their values are added up along the branches - this is just the basic chain rule in action.</p> <p>We now have all the necessary background to understand how an RNN learns. 
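Schematically, truncated BPTT with shared parameters can be sketched like this (a simplified sketch of my own; the per-step forward/backward computations are abstracted behind hypothetical callbacks rather than written out as in the real code):

```python
import numpy as np

def truncated_bptt_sketch(xs, h0, params, step_forward, step_backward):
    """Unroll over the sequence xs; per-step parameter gradients are summed."""
    h, caches = h0, []
    for x in xs:                              # forward through the unrolled steps
        h, cache = step_forward(x, h, params)
        caches.append(cache)
    grads = {k: np.zeros_like(v) for k, v in params.items()}
    dh = np.zeros_like(h0)                    # assume the grad of the final h is zero
    for cache in reversed(caches):            # backward pass, last step first
        dh, step_grads = step_backward(cache, dh, params)
        for k, g in step_grads.items():
            grads[k] += g                     # shared parameters: just add them up
    return grads
```

The actual code in <tt class="docutils literal"><span class="pre">min-char-rnn</span></tt> follows this overall shape, with the per-step computations spelled out explicitly.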
What remains before looking at the code is figuring out how the gradients propagate <em>inside</em> the cell; in other words, the derivatives of each operation comprising the cell.</p> </div> <div class="section" id="flowing-the-gradient-inside-an-rnn-cell"> <h2>Flowing the gradient inside an RNN cell</h2> <p>As we saw above, the formulae for computing the cell's output from its inputs are:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/886e94d526c2e538f1ba4414696ae9bf6618f0ff.svg" style="height: 82px;" type="image/svg+xml"> \begin{align*} h^{[t]}&amp;=tanh(W_{hh}\cdot h^{[t-1]}+W_{xh}\cdot x^{[t]}+b_h)\\ y^{[t]}&amp;=W_{hy}\cdot h^{[t]}+b_y\\ p^{[t]}&amp;=softmax(y^{[t]}) \end{align*}</object> <p>To be able to learn weights, we have to find the derivatives of the cell's output w.r.t. the weights. The full backpropagation process was explained <a class="reference external" href="http://eli.thegreenplace.net/2016/the-chain-rule-of-calculus/">in this post</a>, so here is only a brief refresher.</p> <p>Recall that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f75a9c33c546d725557a4d452769bfd8fbb6cc22.svg" style="height: 20px;" type="image/svg+xml">p^{[t]}</object> is the predicted output; we compare it with the &quot;real&quot; output (<object class="valign-0" data="https://eli.thegreenplace.net/images/math/e44181afdf5e5f0f8ad4379f7d5f3ff924379c82.svg" style="height: 16px;" type="image/svg+xml">r^{[t]}</object>) during learning, to find the loss (error):</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/788a210d4bab7831a28e0ae7713ff9c1cd5aef12.svg" style="height: 22px;" type="image/svg+xml"> $L=L(p^{[t]}, r^{[t]})$</object> <p>To perform a gradient descent update, we'll need to find <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/c9e2c4ffca9564929c45a5244c7fb064465ab005.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial L}{\partial 
w}</object>, for every weight value <em>w</em>. To do this, we'll have to:</p> <ol class="arabic simple"> <li>Find the &quot;local&quot; gradients for every mathematical operation leading from <em>w</em> to <em>L</em>.</li> <li>Use the chain rule to propagate the error backwards through these local gradients until we find <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/c9e2c4ffca9564929c45a5244c7fb064465ab005.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial L}{\partial w}</object>.</li> </ol> <p>We start by formulating the chain rule to compute <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/c9e2c4ffca9564929c45a5244c7fb064465ab005.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial L}{\partial w}</object>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/45ad1052f6c6b78265143f4d41f2f12f1714ebfb.svg" style="height: 45px;" type="image/svg+xml"> $\frac{\partial L}{\partial w}=\frac{\partial L}{\partial p^{[t]}}\frac{\partial p^{[t]}}{\partial w}$</object> <p>Next comes:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/f7a68f105f4e7483f2781a7bebeaad0ce659bf06.svg" style="height: 45px;" type="image/svg+xml"> $\frac{\partial p^{[t]}}{\partial w}=\frac{\partial softmax}{\partial y^{[t]}}\frac{\partial y^{[t]}}{\partial w}$</object> <p>Let's say the weight <em>w</em> we're interested in is part of <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/5b9174fc1cf8afbecdab52326985d41be6fbc2c8.svg" style="height: 15px;" type="image/svg+xml">W_{hh}</object>, so we have to propagate some more:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/53bcf70971d45064242463ddfad70e3ba6fb0ec9.svg" style="height: 42px;" type="image/svg+xml"> $\frac{\partial y^{[t]}}{\partial w}=\frac{\partial y^{[t]}}{\partial h^{[t]}}\frac{\partial h^{[t]}}{\partial w}$</object> <p>We'll then proceed to propagate 
through the <em>tanh</em> function, bias addition and finally the multiplication by <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/5b9174fc1cf8afbecdab52326985d41be6fbc2c8.svg" style="height: 15px;" type="image/svg+xml">W_{hh}</object>, for which the derivative by <em>w</em> is computed directly without further chaining.</p> <p>Let's now see how to compute all the relevant local gradients.</p> </div> <div class="section" id="cross-entropy-loss-gradient"> <h2>Cross-entropy loss gradient</h2> <p>We'll start with the derivative of the loss function, which is cross-entropy in the <tt class="docutils literal"><span class="pre">min-char-rnn</span></tt> model. I went through a detailed derivation of the gradient of softmax followed by cross-entropy in <a class="reference external" href="http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative">this post</a>; here is only a brief recap:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/b26f68a12667ba254facf9815252f52ebf2238d9.svg" style="height: 38px;" type="image/svg+xml"> $xent(p,q)=-\sum_{k}p(k)log(q(k))$</object> <p>Re-formulating this for our specific case, the loss is a function of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f75a9c33c546d725557a4d452769bfd8fbb6cc22.svg" style="height: 20px;" type="image/svg+xml">p^{[t]}</object>, assuming the &quot;real&quot; class <em>r</em> is constant for every training example:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/9ff2ef0e3dbe188129b93dddeb12759fdf909bcb.svg" style="height: 39px;" type="image/svg+xml"> $L(p^{[t]})=-\sum_{k}r(k)log(p^{[t]}(k))$</object> <p>Since inputs and outputs to the cell are 1-hot encoded, let's just use <em>r</em> to denote the index where <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/20aea9dc9718c5f2b3e11b3ebec11518202f0af1.svg" style="height: 18px;" 
type="image/svg+xml">r(k)</object> is non-zero. Then the Jacobian of <em>L</em> is only non-zero at index <em>r</em> and its value there is <object class="valign-m11" data="https://eli.thegreenplace.net/images/math/c4efb22a708d798abd641a16679976b8829f500d.svg" style="height: 27px;" type="image/svg+xml">-\frac{1}{p^{[t]}}(r)</object>.</p> </div> <div class="section" id="softmax-gradient"> <h2>Softmax gradient</h2> <p>A detailed computation of the gradient for the softmax function was also presented in <a class="reference external" href="http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative">this post</a>. For <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/a197fbc6c2f0e9e1d6b4c51c6fca2756927a3055.svg" style="height: 18px;" type="image/svg+xml">S(y)</object> being the softmax of <em>y</em>, the Jacobian is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/87fbe94e6b409d31b512cb7a4581c24907d4dd4a.svg" style="height: 42px;" type="image/svg+xml"> $D_{j}S_{i}=\frac{\partial S_i}{\partial y_j}=S_{i}(\delta_{ij}-S_j)$</object> </div> <div class="section" id="fully-connected-layer-gradient"> <h2>Fully-connected layer gradient</h2> <p>Next on our path backwards is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/ead3bdf11cf41b04164a83008db4a7dd0db5a074.svg" style="height: 24px;" type="image/svg+xml"> $y^{[t]}&amp;=W_{hy}\cdot h^{[t]}+b_y$</object> <p>From my earlier <a class="reference external" href="http://eli.thegreenplace.net/2018/backpropagation-through-a-fully-connected-layer/">post on backpropagating through a fully-connected layer</a>, we know that <object class="valign-m9" data="https://eli.thegreenplace.net/images/math/413d530fbd3e019cc3f49aec6e8f7cb7a8f0c622.svg" style="height: 29px;" type="image/svg+xml">\frac{\partial y^{[t]}}{\partial h^{[t]}}=W_{hy}</object>. 
But that's not all; note that on the forward pass <object class="valign-0" data="https://eli.thegreenplace.net/images/math/057276c060e575533321773afb483e778e6a03f1.svg" style="height: 16px;" type="image/svg+xml">h^{[t]}</object> splits in two - one edge goes into the fully-connected layer, another goes to the next RNN cell as the state. When we backpropagate the loss gradient to <object class="valign-0" data="https://eli.thegreenplace.net/images/math/057276c060e575533321773afb483e778e6a03f1.svg" style="height: 16px;" type="image/svg+xml">h^{[t]}</object>, we have to take both edges into account; more specifically, we have to <em>add</em> the gradients along the two edges. This leads to the following backpropagation equation:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/e7d5afe050e8e2f3f4b867ae4d9eb510fbe2e583.svg" style="height: 45px;" type="image/svg+xml"> $\frac{\partial L}{\partial h^{[t]}} = \frac{\partial y^{[t]}}{\partial h^{[t]}}\frac{\partial L}{\partial y^{[t]}}+\frac{\partial L}{\partial h^{[t+1]}}\frac{\partial h^{[t+1]}}{\partial h^{[t]}} =W_{hy}\cdot \frac{\partial L}{\partial y^{[t]}}+\frac{\partial L}{\partial h^{[t+1]}}\frac{\partial h^{[t+1]}}{\partial h^{[t]}}$</object> <p>In addition, note that this layer already has model parameters that need to be learned - <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/20c1b19b6e71072b92080b2eb00b5b99123cf057.svg" style="height: 18px;" type="image/svg+xml">W_{hy}</object> and <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/9bd872acdafb9ea752b3ba10b2670499cb65469f.svg" style="height: 19px;" type="image/svg+xml">b_y</object> - a &quot;final&quot; destination for backpropagation. 
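In numpy-style code, this backward step could look roughly as follows (a sketch; <tt class="docutils literal"><span class="pre">dy</span></tt> is the loss gradient w.r.t. <em>y</em> at this step and <tt class="docutils literal"><span class="pre">dhnext</span></tt> is the gradient arriving from the next time step - the names are mine, chosen to resemble the code):

```python
import numpy as np

def fc_backward_sketch(dy, h, Why, dhnext):
    # Gradients for the layer's own parameters (dy and h are column vectors).
    dWhy = np.dot(dy, h.T)
    dby = dy
    # h[t] feeds both this layer and the next cell's state input, so the
    # gradients arriving along the two edges are added.
    dh = np.dot(Why.T, dy) + dhnext
    return dWhy, dby, dh
```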
Please refer to my fully-connected layer backpropagation post to see how the gradients for these are computed.</p> </div> <div class="section" id="gradient-of-tanh"> <h2>Gradient of tanh</h2> <p>The vector <object class="valign-0" data="https://eli.thegreenplace.net/images/math/057276c060e575533321773afb483e778e6a03f1.svg" style="height: 16px;" type="image/svg+xml">h^{[t]}</object> is produced by applying a hyperbolic tangent nonlinearity to another fully connected layer.</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/4ce55619f9cd4d96083ec3dadf303cc83a426543.svg" style="height: 22px;" type="image/svg+xml"> $h^{[t]}&amp;=tanh(W_{hh}\cdot h^{[t-1]}+W_{xh}\cdot x^{[t]}+b_h)$</object> <p>To get to the model parameters <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/5b9174fc1cf8afbecdab52326985d41be6fbc2c8.svg" style="height: 15px;" type="image/svg+xml">W_{hh}</object>, <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/4ee22236c608ad6f49adc4807465b6e6896092ec.svg" style="height: 15px;" type="image/svg+xml">W_{xh}</object> and <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/14e5d8599f43750d0cf9dda2d90b085c69079049.svg" style="height: 16px;" type="image/svg+xml">b_h</object>, we have to first backpropagate the loss gradient through <em>tanh</em>. 
<em>tanh</em> is a scalar function; when it's applied to a vector we apply it in <em>element-wise</em> fashion to every element in the vector independently, and collect the results in a similarly-shaped result vector.</p> <p>Its mathematical definition is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/326a49518bfe326c6be2de37838971407fa5175d.svg" style="height: 39px;" type="image/svg+xml"> $tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$</object> <p>To find the derivative of this function, we'll use the quotient rule for derivatives:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/9e006cf5e9f1f8ccac82ba1f2bcdabd710731756.svg" style="height: 42px;" type="image/svg+xml"> $(\frac{f}{g})&#x27;=\frac{f&#x27;g-g&#x27;f}{g^2}$</object> <p>So:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/dc73b540394a92b823ec3eaabe4d02a7735f146f.svg" style="height: 43px;" type="image/svg+xml"> $tanh&#x27;(x)=\frac{(e^x+e^{-x})(e^x+e^{-x})-(e^x-e^{-x})(e^x-e^{-x})}{(e^x+e^{-x})^2}=1-(tanh(x))^2$</object> <p>Just like for softmax, it turns out that there's a convenient way to express the derivative of <em>tanh</em> in terms of <em>tanh</em> itself. When we apply the chain rule to derivatives of <em>tanh</em> - for example, to <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/7f748d5017b1817f6d3912d339e85871b81d93b4.svg" style="height: 18px;" type="image/svg+xml">h=tanh(k)</object> where <em>k</em> is a function of <em>w</em> - we get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/9137818273fdbac7d5dc0e05df4fcf3f8cb7ea9d.svg" style="height: 39px;" type="image/svg+xml"> $\frac{\partial h}{\partial w}=\frac{\partial tanh(k)}{\partial k}\frac{\partial k}{\partial w}=(1-h^2)\frac{\partial k}{\partial w}$</object> <p>In our case <em>k(w)</em> is a fully-connected layer; to find its derivatives w.r.t.
the weight matrices and bias, please refer to the <a class="reference external" href="http://eli.thegreenplace.net/2018/backpropagation-through-a-fully-connected-layer/">backpropagation through a fully-connected layer post</a>.</p> </div> <div class="section" id="learning-model-parameters-with-adagrad"> <h2>Learning model parameters with Adagrad</h2> <p>We've just gone through all the major parts of the RNN cell and computed local gradients. Armed with these formulae and the chain rule, it should be possible to understand how the <tt class="docutils literal"><span class="pre">min-char-rnn</span></tt> code flows the loss gradient backwards. But that's not the end of the story; once we have the loss derivatives w.r.t. some model parameter, how do we update this parameter?</p> <p>The most straightforward way to do this would be using the gradient descent algorithm, with some constant learning rate. <a class="reference external" href="http://eli.thegreenplace.net/2016/understanding-gradient-descent/">I've written about gradient descent</a> in the past - please take a look for a refresher.</p> <p>Most real-world learning is done with more advanced algorithms these days, however. One such algorithm is called Adagrad, <a class="reference external" href="http://jmlr.org/papers/v12/duchi11a.html">proposed in 2011</a> by some experts in mathematical optimization. <tt class="docutils literal"><span class="pre">min-char-rnn</span></tt> happens to use Adagrad, so here is a simplified explanation of how it works.</p> <p>The main idea is to adjust the learning rate separately per parameter, because in practice some parameters change much more often than others.
This could be due to rare examples in the training data set that affect a parameter that's not often affected; we'd like to amplify these changes because they are rare, and dampen changes to parameters that change often.</p> <p>Therefore the Adagrad algorithm works as follows:</p> <div class="highlight"><pre><span></span><span class="c1"># Same shape as the parameter array x</span> <span class="n">memory</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">while</span> <span class="bp">True</span><span class="p">:</span> <span class="n">dx</span> <span class="o">=</span> <span class="n">compute_grad</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># Elementwise: each memory element gets the corresponding dx^2 added to it.</span> <span class="n">memory</span> <span class="o">+=</span> <span class="n">dx</span> <span class="o">*</span> <span class="n">dx</span> <span class="c1"># The actual parameter update for this step. Note how the learning rate is</span> <span class="c1"># modified by the memory. 
epsilon is some very small number to avoid dividing</span> <span class="c1"># by 0.</span> <span class="n">x</span> <span class="o">-=</span> <span class="n">learning_rate</span> <span class="o">*</span> <span class="n">dx</span> <span class="o">/</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">memory</span><span class="p">)</span> <span class="o">+</span> <span class="n">epsilon</span><span class="p">)</span> </pre></div> <p>If a given element in <tt class="docutils literal">dx</tt> was updated significantly in the past, its corresponding <tt class="docutils literal">memory</tt> element will grow and thus the learning rate is effectively decreased.</p> </div> <div class="section" id="gradient-clipping"> <h2>Gradient clipping</h2> <p>If we unroll the RNN cell 10 times, the gradient will be multiplied by <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/5b9174fc1cf8afbecdab52326985d41be6fbc2c8.svg" style="height: 15px;" type="image/svg+xml">W_{hh}</object> ten times on its way from the last cell to the first. For some structures of <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/5b9174fc1cf8afbecdab52326985d41be6fbc2c8.svg" style="height: 15px;" type="image/svg+xml">W_{hh}</object>, this may lead to an &quot;exploding gradient&quot; effect where the value keeps growing <a class="footnote-reference" href="#id10" id="id5"></a>.</p> <p>To mitigate this, <tt class="docutils literal"><span class="pre">min-char-rnn</span></tt> uses the <em>gradient clipping</em> trick. Whenever the gradients are updated, they are &quot;clipped&quot; to some reasonable range (like -5 to 5) so they will never get out of this range. 
This method is crude, but it works reasonably well for training RNNs.</p> <p>The flip side problem of <em>vanishing gradient</em> (wherein the gradients keep getting smaller with each step) is much harder to solve, and usually requires more advanced recurrent NN architectures.</p> </div> <div class="section" id="min-char-rnn-model-quality"> <h2>min-char-rnn model quality</h2> <p>While <tt class="docutils literal"><span class="pre">min-char-rnn</span></tt> is a complete RNN implementation that manages to learn, it's not really good enough for learning a reasonable model for the English language. The model is too simple for this, and suffers seriously from the vanishing gradient problem.</p> <p>For example, when training a 16-step unrolled model on a corpus of Sherlock Holmes books, it produces the following text after 60,000 iterations (learning on about a MiB of text):</p> <blockquote> one, my dred, roriny. qued bamp gond hilves non froange saws, to mold his a work, you shirs larcs anverver strepule thunboler muste, thum and cormed sightourd so was rewa her besee pilman</blockquote> <p>It's not complete gibberish, but not really English either. Just for fun, I wrote a simple <a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/master/min-char-rnn/markov-model.py">Markov chain generator</a> and trained it on the same text with a 4-character state. Here's a sample of its output:</p> <blockquote> though throughted with to taken as when it diabolice, and intered the stairhead, the stood initions of indeed, as burst, his mr. holmes' room, and now i fellows. the stable. he retails arm</blockquote> <p>Which, you'll admit, is quite a bit better than our &quot;fancy&quot; deep learning approach! And it was much faster to train too...</p> <p>To have a better chance of learning a good model, we'll need a more advanced architecture like LSTM. 
LSTMs employ a bunch of tricks to preserve long-term dependencies through the cells and can learn much better language models. For example, Andrej Karpathy's char-rnn model from the <a class="reference external" href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">Unreasonable Effectiveness of RNNs post</a> is a multi-layer LSTM, and it can learn fairly nice models for a varied set of domains, ranging from Shakespeare sonnets to C code snippets in the Linux kernel.</p> </div> <div class="section" id="conclusion"> <h2>Conclusion</h2> <p>The goal of this post wasn't to develop a very good RNN model; rather, it was to explain in detail the math behind training a simple RNN. More advanced RNN architectures like LSTM are somewhat more complicated, but all the core ideas are very similar and this post should be helpful in nailing the basics.</p> <p><em>Update:</em> <a class="reference external" href="https://eli.thegreenplace.net/2018/minimal-character-based-lstm-implementation/">An extension of this post to LSTMs</a>.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id7" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td><p class="first">Computing a softmax makes sense because <em>x</em> is encoded with one-hot over a vocabulary-sized vector, meaning there's a 1 in the position of the letter it represents with 0s in all other positions. For example, if we only care about the 26 lower-case alphabet letters, <em>x</em> could be a 26-element vector. To represent 'a' it would have 1 in position 0 and zeros elsewhere; to represent 'd' it would have 1 in position 3 and zeros elsewhere.</p> <p class="last">The output <em>p</em> here models what the RNN cell thinks the next generated character should be.
Using softmax, it would have probabilities for each character in the corresponding position, all of them properly summing up to 1.</p> </td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td><p class="first">A slightly more technical explanation: to compute the gradient for the error w.r.t. weights in the typical backpropagation flow, we'll need input gradients for <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/0ede500c5edc819b5f962923f98724936ef9d593.svg" style="height: 18px;" type="image/svg+xml">p[t]</object> and <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/897ad2ab624c79d6dcb687ad28f7a3767a76712c.svg" style="height: 18px;" type="image/svg+xml">h[t]</object>. Then, when learning happens we use the measured error and propagate it backwards. But what is the measured error for <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/897ad2ab624c79d6dcb687ad28f7a3767a76712c.svg" style="height: 18px;" type="image/svg+xml">h[t]</object>? We don't know it before we compute the error of the next iteration, and so on - a bit of a chicken-egg problem.</p> <p class="last">Unrolling/BPTT helps approximate a solution for this issue. An alternative solution is to use <em>forward-mode</em> gradient propagation instead, with an algorithm called RTRL (Real Time Recurrent Learning). This algorithm works well but has a high computational cost compared to BPTT. I'd love to explore this topic in more depth, as it ties into the difference between forward-mode and reverse-mode auto differentiation. 
But that would be a topic for another post.</p> </td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id9" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>This is similar to convolutional networks, where the convolution filter weights are reused many times when processing a much larger input. In such models the invariance is <em>spatial</em>; in sequence models the invariance is <em>temporal</em>. In fact, space vs. time in models is just a matter of convention, and it turns out that 1D convolutional models perform very well on some sequence tasks!</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id10" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id5"></a></td><td><p class="first">An easy way to think about it is to imagine some initial value <em>v</em>, multiplied by another value <em>c</em> many times. We get <object class="valign-0" data="https://eli.thegreenplace.net/images/math/f24f59611d0e1f3043785fe772138687cfd6da97.svg" style="height: 15px;" type="image/svg+xml">vc^N</object> for <em>N</em> multiplications. If <em>c</em> is larger than 1, it means the result will keep growing with each multiplication. How quickly will depend on the actual value of <em>c</em>, but this is basically an exponential runoff. 
We actually care about the absolute value of <em>c</em>, of course, since runoff is equally bad in the positive or negative direction.</p> <p class="last">Similarly with the absolute value of <em>c</em> smaller than 1, we'll get a &quot;vanishing&quot; effect since the result will keep getting smaller with each iteration.</p> </td></tr> </tbody> </table> </div> Backpropagation through a fully-connected layer2018-05-22T05:47:00-07:002018-05-22T05:47:00-07:00Eli Benderskytag:eli.thegreenplace.net,2018-05-22:/2018/backpropagation-through-a-fully-connected-layer/<p>The goal of this post is to show the math of backpropagating a derivative for a fully-connected (FC) neural network layer consisting of matrix multiplication and bias addition. I have briefly mentioned this in an <a class="reference external" href="http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative">earlier post dedicated to Softmax</a>, but here I want to give some more attention to …</p><p>The goal of this post is to show the math of backpropagating a derivative for a fully-connected (FC) neural network layer consisting of matrix multiplication and bias addition. I have briefly mentioned this in an <a class="reference external" href="http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative">earlier post dedicated to Softmax</a>, but here I want to give some more attention to FC layers specifically.</p> <p>Here is a fully-connected layer for input vectors with <em>N</em> elements, producing output vectors with <em>T</em> elements:</p> <img alt="Diagram of a fully connected layer" class="align-center" src="https://eli.thegreenplace.net/images/2018/fclayer.png" /> <p>As a formula, we can write:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/0f980ab4c97ad86b4d0a15ede6e9c05901323702.svg" style="height: 17px;" type="image/svg+xml"> $y=Wx+b$</object> <p>Presumably, this layer is part of a network that ends up computing some loss <em>L</em>. 
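</p>

<p>As a concrete reference for the shapes involved, here is a minimal NumPy sketch of the forward computation (variable names and sizes are illustrative):</p>

```python
import numpy as np

N, T = 4, 3                  # input and output sizes
x = np.random.randn(N, 1)    # input column vector, shape [N, 1]
W = np.random.randn(T, N)    # weight matrix, shape [T, N]
b = np.random.randn(T, 1)    # bias column vector, shape [T, 1]

y = W @ x + b                # output column vector, shape [T, 1]
```

<p>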
We'll assume we already have the derivative of the loss w.r.t. the output of the layer <object class="valign-m9" data="https://eli.thegreenplace.net/images/math/9a5154f5e8d64cc77db745d8d3baa723bc6df829.svg" style="height: 26px;" type="image/svg+xml">\frac{\partial{L}}{\partial{y}}</object>.</p> <p>We'll be interested in two other derivatives: <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/33d2709b664fdd69317758b433b61b13c1cdc62f.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial{L}}{\partial{W}}</object> and <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/5f12a50803653cf2ee02135944343ec70506d31c.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial{L}}{\partial{b}}</object>.</p> <div class="section" id="jacobians-and-the-chain-rule"> <h2>Jacobians and the chain rule</h2> <p>As a reminder from <a class="reference external" href="http://eli.thegreenplace.net/2016/the-chain-rule-of-calculus">The Chain Rule of Calculus</a>, we're dealing with functions that map from <em>n</em> dimensions to <em>m</em> dimensions: <img alt="f:\mathbb{R}^{n} \to \mathbb{R}^{m}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/13f219789047343729036279bb11630db317d98d.png" style="height: 16px;" />. We'll consider the outputs of <em>f</em> to be numbered from 1 to <em>m</em> as <img alt="f_1,f_2 \dots f_m" class="valign-m4" src="https://eli.thegreenplace.net/images/math/93b446c5209263534d09d617bbede21101d6536e.png" style="height: 16px;" />. 
For each such <img alt="f_i" class="valign-m4" src="https://eli.thegreenplace.net/images/math/68bd0dc647944d362ec8df628a22967b91d82c80.png" style="height: 16px;" /> we can compute its partial derivative by any of the <em>n</em> inputs as:</p> <img alt="$D_j f_i(a)=\frac{\partial f_i}{\partial a_j}(a)$" class="align-center" src="https://eli.thegreenplace.net/images/math/30881b5a92e45259714ba01c7a12fbf8f6c56109.png" style="height: 42px;" /> <p>Where <em>j</em> goes from 1 to <em>n</em> and <em>a</em> is a vector with <em>n</em> components. If <em>f</em> is differentiable at <em>a</em> then the derivative of <em>f</em> at <em>a</em> is the <em>Jacobian matrix</em>:</p> <img alt="$Df(a)=\begin{bmatrix} D_1 f_1(a) &amp;amp; \cdots &amp;amp; D_n f_1(a) \\ \vdots &amp;amp; &amp;amp; \vdots \\ D_1 f_m(a) &amp;amp; \cdots &amp;amp; D_n f_m(a) \\ \end{bmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/ab09367d48e9ef4d8bc2314a60313dec700193af.png" style="height: 76px;" /> <p>The multivariate chain rule states: given <img alt="g:\mathbb{R}^n \to \mathbb{R}^m" class="valign-m4" src="https://eli.thegreenplace.net/images/math/b4b7d25491897b053abf7e48688fada4a85368bd.png" style="height: 16px;" /> and <img alt="f:\mathbb{R}^m \to \mathbb{R}^p" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ac8a6cea4e02e885538fc3ef969c5733e84712f9.png" style="height: 16px;" /> and a point <img alt="a \in \mathbb{R}^n" class="valign-m1" src="https://eli.thegreenplace.net/images/math/43a85f2c59f396fe5c4e2c403a0453c463fcfb0d.png" style="height: 13px;" />, if <em>g</em> is differentiable at <em>a</em> and <em>f</em> is differentiable at <img alt="g(a)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e7373233d49e18a0882e0dce41d9d6aa26964d6b.png" style="height: 18px;" /> then the composition <img alt="f \circ g" class="valign-m4" src="https://eli.thegreenplace.net/images/math/1247a6ac0bc07bfdbd790831aa70b0b000bad2e4.png" style="height: 16px;" 
/> is differentiable at <em>a</em> and its derivative is:</p> <img alt="$D(f \circ g)(a)=Df(g(a)) \cdot Dg(a)$" class="align-center" src="https://eli.thegreenplace.net/images/math/00bdefa904bd34df2dfb50cc385e6497c4e5096e.png" style="height: 18px;" /> <p>Which is the matrix multiplication of <img alt="Df(g(a))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e567730c48bb2f95c258b630b4d6e997043e09ab.png" style="height: 18px;" /> and <img alt="Dg(a)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2575fc98e794a733a7aa6237fe67246a41e6c8c5.png" style="height: 18px;" />.</p> </div> <div class="section" id="back-to-the-fully-connected-layer"> <h2>Back to the fully-connected layer</h2> <p>Circling back to our fully-connected layer, we have the loss <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/abf7408ae6d9fb4683480735dc1ebc8555b8fef8.svg" style="height: 18px;" type="image/svg+xml">L(y)</object> - a scalar function <object class="valign-m1" data="https://eli.thegreenplace.net/images/math/ddef8b9ca23fb246b2a984c719d812f37a41a406.svg" style="height: 16px;" type="image/svg+xml">L:\mathbb{R}^{T} \to \mathbb{R}</object>. We also have the function <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f09c295439296549a068b64ffe69a48dd77d1078.svg" style="height: 17px;" type="image/svg+xml">y=Wx+b</object>. If we're interested in the derivative w.r.t the weights, what are the dimensions of this function? Our &quot;variable part&quot; is then <em>W</em>, which has <em>NT</em> elements overall, and the output has <em>T</em> elements, so <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/06178f1d07375b8286afcd48f02bcd34d71537f0.svg" style="height: 19px;" type="image/svg+xml">y:\mathbb{R}^{NT} \to \mathbb{R}^{T}</object> <a class="footnote-reference" href="#id3" id="id1"></a>.</p> <p>The chain rule tells us how to compute the derivative of <em>L</em> w.r.t. 
<em>W</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/ee6bc25a34980031f93f0c7eefccc40663b05c76.svg" style="height: 38px;" type="image/svg+xml"> $\frac{\partial{L}}{\partial{W}}=D(L \circ y)(W)=DL(y(W)) \cdot Dy(W)$</object> <p>Since we're backpropagating, we already know <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/dcb2eda345045dac22c425a1ee19113e047126cf.svg" style="height: 18px;" type="image/svg+xml">DL(y(W))</object>; because of the dimensionality of the <em>L</em> function, the dimensions of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/dcb2eda345045dac22c425a1ee19113e047126cf.svg" style="height: 18px;" type="image/svg+xml">DL(y(W))</object> are [1,T] (one row, <em>T</em> columns). <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/d0064a180ddb231bb6868ce25c68ef3ec1c2a464.svg" style="height: 18px;" type="image/svg+xml">y(W)</object> has <em>NT</em> inputs and <em>T</em> outputs, so the dimensions of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/b22fe7345e02ae50c68605696f3a447435cd1f9d.svg" style="height: 18px;" type="image/svg+xml">Dy(W)</object> are [T,NT]. Overall, the dimensions of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2f6b6eded4ba20b3eeb59b2b687f84de1e91c04c.svg" style="height: 18px;" type="image/svg+xml">D(L \circ y)(W)</object> are then [1,NT]. This makes sense if you think about it, because as a function of <em>W</em>, the loss has <em>NT</em> inputs and a single scalar output.</p> <p>What remains is to compute <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/b22fe7345e02ae50c68605696f3a447435cd1f9d.svg" style="height: 18px;" type="image/svg+xml">Dy(W)</object>, the Jacobian of <em>y</em> w.r.t. <em>W</em>. 
As mentioned above, it has <em>T</em> rows - one for each output element of <em>y</em>, and <em>NT</em> columns - one for each element in the weight matrix <em>W</em>. Computing such a large Jacobian may seem daunting, but we'll soon see that it's very easy to generalize from a simple example. Let's start with <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/6a53741f2a8810da3cae4efadde63c8e7ee2662f.svg" style="height: 12px;" type="image/svg+xml">y_1</object>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7190e002ac69968b674aecacfd5a8531ad9cd208.svg" style="height: 55px;" type="image/svg+xml"> $y_1=\sum_{j=1}^{N}W_{1,j}x_{j}+b_1$</object> <p>What's the derivative of this result element w.r.t. each element in <em>W</em>? When the element is in row 1, the derivative is <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/73058e43db0f4edc791b10f27f913cbc5d361ab6.svg" style="height: 14px;" type="image/svg+xml">x_j</object> (<em>j</em> being the column of <em>W</em>); when the element is in any other row, the derivative is 0.</p> <p>Similarly for <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/b9f59182e34baa532fa4e27471acc714f3105d16.svg" style="height: 12px;" type="image/svg+xml">y_2</object>, we'll have non-zero derivatives only for the second row of <em>W</em> (with the same result of <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/73058e43db0f4edc791b10f27f913cbc5d361ab6.svg" style="height: 14px;" type="image/svg+xml">x_j</object> being the derivative for the <em>j</em>-th column), and so on.</p> <p>Generalizing from the example, if we split the index of <em>W</em> to <em>i</em> and <em>j</em>, we get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/e28e0d3b44645eb299cceae8dde2319244e86373.svg" style="height: 50px;" type="image/svg+xml"> \begin{align} 
D_{ij}y_t&amp;=\frac{\partial(\sum_{j=1}^{N}W_{t,j}x_{j}+b_t)}{\partial W_{ij}} = \left\{\begin{matrix} x_j &amp; i = t\\ 0 &amp; i \ne t \end{matrix}\right. \end{align}</object> <p>This goes into row <em>t</em>, column <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ef7b2d987af3c0ceb75381d096c35e8c19085642.svg" style="height: 18px;" type="image/svg+xml">(i-1)N+j</object> in the Jacobian matrix. Overall, we get the following Jacobian matrix with shape [T,NT]:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/8a59a6251d12196f12eaadb6537289e3a6368d53.svg" style="height: 76px;" type="image/svg+xml"> $Dy=\begin{bmatrix} x_1 &amp; x_2 &amp; \cdots &amp; x_N &amp; \cdots &amp; 0 &amp; 0 &amp; \cdots &amp; 0 \\ \vdots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \vdots \\ 0 &amp; 0 &amp; \cdots &amp; 0 &amp; \cdots &amp; x_1 &amp; x_2 &amp; \cdots &amp; x_N \end{bmatrix}$</object> <p>Now we're ready to finally multiply the Jacobians together to complete the chain rule:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/2e9823350972d4874d201a0f8232d89fea710c6f.svg" style="height: 18px;" type="image/svg+xml"> $D(L \circ y)(W)=DL(y(W)) \cdot Dy(W)$</object> <p>The leftmost factor on the right-hand side is this row vector:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/af6d7af820f16b7493d378e8a40daa87031591f4.svg" style="height: 41px;" type="image/svg+xml"> $DL(y)=(\frac{\partial L}{\partial y_1}, \frac{\partial L}{\partial y_2},\cdots,\frac{\partial L}{\partial y_T})$</object> <p>And we're multiplying it by the matrix <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/baf7d8b700759b28ece347bd62793400ef52a8e0.svg" style="height: 16px;" type="image/svg+xml">Dy</object> shown above.
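</p>

<p>The structure of this Jacobian is easy to verify numerically. The following sketch (toy sizes, illustrative names) builds it per the formula above - note that with 0-based indices the column for <em>W[i,j]</em> is <em>i*N+j</em> - and checks each column against a finite-difference derivative:</p>

```python
import numpy as np

np.random.seed(0)
N, T = 3, 2
W = np.random.randn(T, N)
x = np.random.randn(N, 1)

# Jacobian of y = Wx + b w.r.t. W, with W unrolled row-major:
# row t, column (t*N + j) holds x_j; all other entries are 0.
Dy = np.zeros((T, N * T))
for t in range(T):
    for j in range(N):
        Dy[t, t * N + j] = x[j, 0]

# Check each column against a finite-difference derivative.
# The bias cancels in the difference, so it's omitted here.
eps = 1e-6
for i in range(T):
    for j in range(N):
        Wp = W.copy()
        Wp[i, j] += eps
        dy = ((Wp @ x) - (W @ x)) / eps
        assert np.allclose(Dy[:, i * N + j], dy[:, 0], atol=1e-4)
```

<p>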
Each item in the result vector will be a dot product between <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/992673c682a388ea8231ebbd8ea28c9cecae874d.svg" style="height: 18px;" type="image/svg+xml">DL(y)</object> and the corresponding column in the matrix <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/baf7d8b700759b28ece347bd62793400ef52a8e0.svg" style="height: 16px;" type="image/svg+xml">Dy</object>. Since <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/baf7d8b700759b28ece347bd62793400ef52a8e0.svg" style="height: 16px;" type="image/svg+xml">Dy</object> has a single non-zero element in each column, the result is fairly trivial. The first <em>N</em> entries are:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/d0e87e4d5cd9feb9d93d153733a182426d175d7e.svg" style="height: 41px;" type="image/svg+xml"> $\frac{\partial L}{\partial y_1}x_1, \frac{\partial L}{\partial y_1}x_2, \cdots, \frac{\partial L}{\partial y_1}x_N$</object> <p>The next <em>N</em> entries are:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/5bd4d8b6b4eb817b071dbf4ddda71680f4bf0392.svg" style="height: 41px;" type="image/svg+xml"> $\frac{\partial L}{\partial y_2}x_1, \frac{\partial L}{\partial y_2}x_2, \cdots, \frac{\partial L}{\partial y_2}x_N$</object> <p>And so on, until the last (<em>T</em>-th) set of <em>N</em> entries is all <em>x</em>-es multiplied by <object class="valign-m9" data="https://eli.thegreenplace.net/images/math/b44681f2ca721dae2b24a49d88f01463e3a88e50.svg" style="height: 26px;" type="image/svg+xml">\frac{\partial L}{\partial y_T}</object>.</p> <p>To better see how to apply each derivative to a corresponding element in <em>W</em>, we can &quot;re-roll&quot; this result back into a matrix of shape [T,N]:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7cfccbaaa844f8ae994f8e012f12557919927e31.svg" style="height: 
129px;" type="image/svg+xml"> $\frac{\partial{L}}{\partial{W}}=D(L\circ y)(W)=\begin{bmatrix} \frac{\partial L}{\partial y_1}x_1 &amp; \frac{\partial L}{\partial y_1}x_2 &amp; \cdots &amp; \frac{\partial L}{\partial y_1}x_N \\ \\ \frac{\partial L}{\partial y_2}x_1 &amp; \frac{\partial L}{\partial y_2}x_2 &amp; \cdots &amp; \frac{\partial L}{\partial y_2}x_N \\ \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\ \frac{\partial L}{\partial y_T}x_1 &amp; \frac{\partial L}{\partial y_T}x_2 &amp; \cdots &amp; \frac{\partial L}{\partial y_T}x_N \end{bmatrix}$</object> </div> <div class="section" id="computational-cost-and-shortcut"> <h2>Computational cost and shortcut</h2> <p>While the derivation shown above is complete and mathematically correct, it can also be computationally intensive; in realistic scenarios, the full Jacobian matrix can be <em>really</em> large. For example, let's say our input is a (modestly sized) 128x128 image, so <em>N=16,384</em>. Let's also say that <em>T=100</em>. The weight matrix then has <em>NT=1,638,400</em> elements; respectably big, but nothing out of the ordinary.</p> <p>Now consider the size of the full Jacobian matrix: it's <em>T</em> by <em>NT</em>, or over 160 million elements. At 4 bytes per element that's more than half a GiB!</p> <p>Moreover, to compute every backpropagation we'd be forced to multiply this full Jacobian matrix by a 100-dimensional vector, performing 160 million multiply-and-add operations for the dot products. That's a lot of compute.</p> <p>But the final result <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/89da23923a43fcd95b185bebb6fd362b6d1ac695.svg" style="height: 18px;" type="image/svg+xml">D(L\circ y)(W)</object> is the size of <em>W</em> - 1.6 million elements. Do we really need 160 million computations to get to it? No, because the Jacobian is very <em>sparse</em> - most of it is zeros. 
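</p>

<p>The sparsity, and the resulting cheap product, can be checked directly with a toy-sized sketch (in real code we'd never materialize this Jacobian; the names here are illustrative):</p>

```python
import numpy as np

np.random.seed(1)
N, T = 3, 2
x = np.random.randn(N, 1)
dLdy = np.random.randn(1, T)      # known gradient of the loss w.r.t. y

# Materialize the sparse [T, N*T] Jacobian of y w.r.t. W.
Dy = np.zeros((T, N * T))
for t in range(T):
    Dy[t, t * N : (t + 1) * N] = x[:, 0]

full = dLdy @ Dy                  # [1, N*T] chain-rule product
# One multiplication per element: entry (t*N + j) is dLdy[t] * x[j].
shortcut = (dLdy.T * x.T).reshape(1, N * T)
assert np.allclose(full, shortcut)
```

<p>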
And in fact, when we look at the <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/89da23923a43fcd95b185bebb6fd362b6d1ac695.svg" style="height: 18px;" type="image/svg+xml">D(L\circ y)(W)</object> found above - it's fairly straightforward to compute using a single multiplication per element.</p> <p>Moreover, if we stare at the <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/33d2709b664fdd69317758b433b61b13c1cdc62f.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial{L}}{\partial{W}}</object> matrix a bit, we'll notice it has a familiar pattern: this is just the <a class="reference external" href="https://en.wikipedia.org/wiki/Outer_product">outer product</a> between the vectors <object class="valign-m9" data="https://eli.thegreenplace.net/images/math/9a5154f5e8d64cc77db745d8d3baa723bc6df829.svg" style="height: 26px;" type="image/svg+xml">\frac{\partial{L}}{\partial{y}}</object> and <em>x</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/89bdcdf27feb489a5e3cb1bb8adc7faffcf0207d.svg" style="height: 41px;" type="image/svg+xml"> $\frac{\partial L}{\partial W}=\frac{\partial L}{\partial y}\otimes x$</object> <p>If we have to compute this backpropagation in Python/Numpy, we'll likely write code similar to:</p> <div class="highlight"><pre><span></span><span class="c1"># Assuming dy (gradient of loss w.r.t. 
y) and x are column vectors, by</span> <span class="c1"># performing a dot product between dy (column) and x.T (row) we get the</span> <span class="c1"># outer product.</span> <span class="n">dW</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">dy</span><span class="p">,</span> <span class="n">x</span><span class="o">.</span><span class="n">T</span><span class="p">)</span> </pre></div> </div> <div class="section" id="bias-gradient"> <h2>Bias gradient</h2> <p>We've just seen how to compute weight gradients for a fully-connected layer. Computing the gradients for the bias vector is very similar, and a bit simpler.</p> <p>This is the chain rule equation applied to the bias vector:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/3aee48692aeafe9ffab07037ad374f4c803787a7.svg" style="height: 38px;" type="image/svg+xml"> $\frac{\partial{L}}{\partial{b}}=D(L \circ y)(b)=DL(y(b)) \cdot Dy(b)$</object> <p>The shapes involved here are: <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ef9baa4141fed9b40c4f1b0ebf189e4d8d28badc.svg" style="height: 18px;" type="image/svg+xml">DL(y(b))</object> is still [1,T], because the number of elements in <em>y</em> remains <em>T</em>. <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/966d0c5b07f027b02b0ca9eb418ed9ac12f63386.svg" style="height: 18px;" type="image/svg+xml">Dy(b)</object> has <em>T</em> inputs (bias elements) and <em>T</em> outputs (<em>y</em> elements), so its shape is [T,T]. 
Therefore, the shape of the gradient <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/7640378fb78362268ffe48bf5d68a266211673e4.svg" style="height: 18px;" type="image/svg+xml">D(L \circ y)(b)</object> is [1,T].</p> <p>To see how we'd fill the Jacobian matrix <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/966d0c5b07f027b02b0ca9eb418ed9ac12f63386.svg" style="height: 18px;" type="image/svg+xml">Dy(b)</object>, let's go back to the formula for <em>y</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7190e002ac69968b674aecacfd5a8531ad9cd208.svg" style="height: 55px;" type="image/svg+xml"> $y_1=\sum_{j=1}^{N}W_{1,j}x_{j}+b_1$</object> <p>When derived by anything other than <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c7cd24d955e66b8fe5ce45ded69fd98da5c68ba8.svg" style="height: 17px;" type="image/svg+xml">b_1</object>, this would be 0; when derived by <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c7cd24d955e66b8fe5ce45ded69fd98da5c68ba8.svg" style="height: 17px;" type="image/svg+xml">b_1</object> the result is 1. The same applies to every other element of <em>y</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/0511363d44324e04aee704c9cc3094a4e8c8c108.svg" style="height: 44px;" type="image/svg+xml"> $\frac{\partial y_i}{\partial b_j}=\left\{\begin{matrix} 1 &amp; i=j \\ 0 &amp; i\neq j \end{matrix}\right.$</object> <p>In matrix form, this is just an identity matrix with dimensions [T,T].
Therefore:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/a64bd506727621a3e78444f7e158769dae30f93b.svg" style="height: 38px;" type="image/svg+xml"> $\frac{\partial{L}}{\partial{b}}=D(L \circ y)(b)=DL(y(b)) \cdot I =DL(y(b))$</object> <p>For a given element of <em>b</em>, its gradient is just the corresponding element in <object class="valign-m9" data="https://eli.thegreenplace.net/images/math/f004c6bbe71887354e0aad67dd7cbe6650eb58e9.svg" style="height: 26px;" type="image/svg+xml">\frac{\partial L}{\partial y}</object>.</p> </div> <div class="section" id="fully-connected-layer-for-a-batch-of-inputs"> <h2>Fully-connected layer for a batch of inputs</h2> <p>The derivation shown above applies to a FC layer with a single input vector <em>x</em> and a single output vector <em>y</em>. When we train models, we almost always try to do so in <em>batches</em> (or <em>mini-batches</em>) to better leverage the parallelism of modern hardware. So a more typical layer computation would be:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/c26a36b850dd7b2e4288e475a590f343ec3a18a3.svg" style="height: 15px;" type="image/svg+xml"> $Y=WX+b$</object> <p>Where the shape of <em>X</em> is [N,B]; <em>B</em> is the batch size, typically a not-too-large power of 2, like 32. <em>W</em> and <em>b</em> still have the same shapes, so the shape of <em>Y</em> is [T,B]. 
Each column in <em>X</em> is a new input vector (for a total of <em>B</em> vectors in a batch); a corresponding column in <em>Y</em> is the output.</p> <p>As before, given <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/d5de3c7d9e0e1bcb4f6c00ea06b4ad808d2ea998.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial{L}}{\partial{Y}}</object>, our goal is to find <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/33d2709b664fdd69317758b433b61b13c1cdc62f.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial{L}}{\partial{W}}</object> and <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/5f12a50803653cf2ee02135944343ec70506d31c.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial{L}}{\partial{b}}</object>. While the end results are fairly simple and pretty much what you'd expect, I still want to go through the full Jacobian computation to show how to find the gradients in a rigorous way.</p> <p>Starting with the weights, the chain rule is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/0ffed6ae9645ea6bd0d02932e2f0ca20fb8e7bc6.svg" style="height: 38px;" type="image/svg+xml"> $\frac{\partial{L}}{\partial{W}}=D(L \circ Y)(W)=DL(Y(W)) \cdot DY(W)$</object> <p>The dimensions are:</p> <ul class="simple"> <li><object class="valign-m4" data="https://eli.thegreenplace.net/images/math/86485acc2c4461f7817626204bf6c9148dad9d87.svg" style="height: 18px;" type="image/svg+xml">DL(Y(W))</object>: [1,TB] because <em>Y</em> has <em>T</em> outputs for each input vector in the batch.</li> <li><object class="valign-m4" data="https://eli.thegreenplace.net/images/math/573b889d69d85759886840570c6970345209b332.svg" style="height: 18px;" type="image/svg+xml">DY(W)</object>: [TB,TN] since <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/fe5551953a6c071c738578f2ebc316864078cc81.svg" style="height: 18px;" type="image/svg+xml">Y(W)</object> has
<em>TB</em> outputs and <em>TN</em> inputs overall.</li> <li><object class="valign-m4" data="https://eli.thegreenplace.net/images/math/b214435777e236879c609900ba7a118e9f0da022.svg" style="height: 18px;" type="image/svg+xml">D(L\circ Y)(W)</object>: [1,TN] same as in the batch-1 case, because the same weight matrix is used for all inputs in the batch.</li> </ul> <p>Also, we'll use the notation <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/5f40e2ad50a0eb5c2f5019c48563f9c6605f84b6.svg" style="height: 24px;" type="image/svg+xml">x_{i}^{[b]}</object> to talk about the <em>i</em>-th element in the <em>b</em>-th input vector <em>x</em> (out of a total of <em>B</em> such input vectors).</p> <p>With this in hand, let's see how the Jacobians look; starting with <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/86485acc2c4461f7817626204bf6c9148dad9d87.svg" style="height: 18px;" type="image/svg+xml">DL(Y(W))</object>, it's the same as before except that we have to take the batch into account. Each batch element is independent of the others in loss computations, so we'll have:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/943d68b7dbda5009cbfd597b4e0fcc46748204a5.svg" style="height: 47px;" type="image/svg+xml"> $\frac{\partial L}{\partial y_{i}^{[b]}}$</object> <p>As the Jacobian element; how do we arrange them in a 1-dimensional vector with shape [1,TB]? We'll just have to agree on a linearization here - same as we did with <em>W</em> before. We'll go for row-major again, so in 1-D the array <em>Y</em> would be:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/13122f18953df14ebb5ff74f59441194d3adb445.svg" style="height: 26px;" type="image/svg+xml"> $Y=(y_{1}^{[1]},y_{1}^{[2]},\cdots,y_{1}^{[B]}, y_{2}^{[1]},y_{2}^{[2]},\cdots,y_{2}^{[B]},\cdots)$</object> <p>And so on for <em>T</em> elements.
Therefore, the Jacobian of <em>L</em> w.r.t <em>Y</em> is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/4e79e866dbfbe0d6b6de5f6617762bce00d5f61f.svg" style="height: 48px;" type="image/svg+xml"> $\frac{\partial L}{\partial Y}=( \frac{\partial L}{\partial y_{1}^{[1]}}, \frac{\partial L}{\partial y_{1}^{[2]}},\cdots, \frac{\partial L}{\partial y_{1}^{[B]}}, \frac{\partial L}{\partial y_{2}^{[1]}}, \frac{\partial L}{\partial y_{2}^{[2]}},\cdots, \frac{\partial L}{\partial y_{2}^{[B]}},\cdots)$</object> <p>To find <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/573b889d69d85759886840570c6970345209b332.svg" style="height: 18px;" type="image/svg+xml">DY(W)</object>, let's first see how to compute <em>Y</em>. The <em>i</em>-th element of <em>Y</em> for batch <em>b</em> is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/2b9c88f44b9cbec2343ce49418ca3e17dd2e0946.svg" style="height: 55px;" type="image/svg+xml"> $y_{i}^{[b]}=\sum_{j=1}^{N}W_{i,j}x_{j}^{[b]}+b_i$</object> <p>Recall that the Jacobian <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/573b889d69d85759886840570c6970345209b332.svg" style="height: 18px;" type="image/svg+xml">DY(W)</object> now has shape [TB,TN]. Previously we had to unroll the [T,N] of the weight matrix into the rows. Now we'll also have to unroll the [T,B] of the output into the columns. As before, first all <em>b</em>-s for <em>t=1</em>, then all <em>b</em>-s for <em>t=2</em>, etc.
If we carefully compute the derivative, we'll see that the Jacobian matrix has similar structure to the single-batch case, just with each line repeated <em>B</em> times for each of the batch elements:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/4fe86285c11962226758ecfad2839b2ce6520d2d.svg" style="height: 291px;" type="image/svg+xml"> $DY(W)=\begin{bmatrix} x_{1}^{[1]} &amp; x_{2}^{[1]} &amp; \cdots &amp; x_{N}^{[1]} &amp; \cdots &amp; 0 &amp; 0 &amp; \cdots &amp; 0 \\ x_{1}^{[2]} &amp; x_{2}^{[2]} &amp; \cdots &amp; x_{N}^{[2]} &amp; \cdots &amp; 0 &amp; 0 &amp; \cdots &amp; 0 \\ \vdots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \vdots \\ x_{1}^{[B]} &amp; x_{2}^{[B]} &amp; \cdots &amp; x_{N}^{[B]} &amp; \cdots &amp; 0 &amp; 0 &amp; \cdots &amp; 0 \\ \vdots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \vdots \\ 0 &amp; 0 &amp; \cdots &amp; 0 &amp; \cdots &amp; x_{1}^{[1]} &amp; x_{2}^{[1]} &amp; \cdots &amp; x_{N}^{[1]} \\ 0 &amp; 0 &amp; \cdots &amp; 0 &amp; \cdots &amp; x_{1}^{[2]} &amp; x_{2}^{[2]} &amp; \cdots &amp; x_{N}^{[2]} \\ \vdots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \vdots \\ 0 &amp; 0 &amp; \cdots &amp; 0 &amp; \cdots &amp; x_{1}^{[B]} &amp; x_{2}^{[B]} &amp; \cdots &amp; x_{N}^{[B]} \\ \end{bmatrix}$</object> <p>Multiplying the two Jacobians together we get the full gradient of <em>L</em> w.r.t. each element in the weight matrix.
Where previously (in the non-batch case) we had:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/1c762a65e0003e82f7dc5108f23126989b64112b.svg" style="height: 42px;" type="image/svg+xml"> $\frac{\partial L}{\partial W_{ij}}=\frac{\partial L}{\partial y_i}x_j$</object> <p>Now, instead, we'll have:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/14fb5d857d1e43a9f692ad436c448a43c0ea041f.svg" style="height: 54px;" type="image/svg+xml"> $\frac{\partial L}{\partial W_{ij}}=\sum_{b=1}^{B}\frac{\partial L}{\partial y_{i}^{[b]}}x_{j}^{[b]}$</object> <p>Which makes total sense: it simply takes the loss gradient computed for each batch element separately and adds them up. This aligns with our intuition of how the gradient for a whole batch is computed - compute the gradient for each batch element separately and add up all the gradients <a class="footnote-reference" href="#id4" id="id2"></a>.</p> <p>As before, there's a clever way to express the final gradient using matrix operations. Note the sum across all batch elements when computing <object class="valign-m10" data="https://eli.thegreenplace.net/images/math/2d41c4c820515c93e916d32532b9bdc7012e8121.svg" style="height: 27px;" type="image/svg+xml">\frac{\partial L}{\partial W_{ij}}</object>. We can express this as the matrix multiplication:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7f6c176f28de451b2d67fcf7ebf238122de9a970.svg" style="height: 38px;" type="image/svg+xml"> $\frac{\partial L}{\partial W}=\frac{\partial L}{\partial Y}\cdot X^T$</object> <p>This is a good place to recall the computation cost again. Previously we've seen that for a single-input case, the Jacobian can be extremely large ([T,NT] having about 160 million elements). In the batch case, the Jacobian would be even larger since its shape is [TB,NT]; with a reasonable batch of 32, it's something like 5-billion elements strong.
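</p> <p>As a quick numerical check (my own sketch, not part of the original derivation - the array names are arbitrary), the matrix shortcut can be verified against the per-element batch sum:</p>

```python
import numpy as np

# Small shapes following the post's notation: N inputs, T outputs, batch of B.
N, T, B = 4, 3, 5
rng = np.random.default_rng(0)
dY = rng.standard_normal((T, B))   # upstream gradient dL/dY, one column per sample
X = rng.standard_normal((N, B))    # input batch, one column per sample

# Shortcut: dL/dW = dL/dY . X^T, with shape [T, N].
dW = np.dot(dY, X.T)

# Element-by-element computation using the summed-over-batch formula.
dW_loop = np.zeros((T, N))
for i in range(T):
    for j in range(N):
        for b in range(B):
            dW_loop[i, j] += dY[i, b] * X[j, b]

print(np.allclose(dW, dW_loop))  # True
```

<p>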
It's good that we don't actually have to hold the full Jacobian in memory, and that we have a shortcut way of computing the gradient.</p> </div> <div class="section" id="bias-gradient-for-a-batch"> <h2>Bias gradient for a batch</h2> <p>For the bias, we have:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/1228909398e9618223e551b8b1d394ac20d697f1.svg" style="height: 38px;" type="image/svg+xml"> $\frac{\partial{L}}{\partial{b}}=D(L \circ Y)(b)=DL(Y(b)) \cdot DY(b)$</object> <p><object class="valign-m4" data="https://eli.thegreenplace.net/images/math/40699cf4e67bde5205359e04102f7b0011dac800.svg" style="height: 18px;" type="image/svg+xml">DL(Y(b))</object> here has the shape [1,TB]; <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f260ada7edc55af13f145e5786803198a3452f1e.svg" style="height: 18px;" type="image/svg+xml">DY(b)</object> has the shape [TB,T]. Therefore, the shape of <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/5f12a50803653cf2ee02135944343ec70506d31c.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial{L}}{\partial{b}}</object> is [1,T], as before.</p> <p>From the formula for computing <em>Y</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7190e002ac69968b674aecacfd5a8531ad9cd208.svg" style="height: 55px;" type="image/svg+xml"> $y_1=\sum_{j=1}^{N}W_{1,j}x_{j}+b_1$</object> <p>We get, for any batch <em>b</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7ad81febc17ec33d21c8fba2a2e6956a8b43e1ad.svg" style="height: 49px;" type="image/svg+xml"> $\frac{\partial y_{i}^{[b]}}{\partial b_j}=\left\{\begin{matrix} 1 &amp; i=j \\ 0 &amp; i\neq j \end{matrix}\right.$</object> <p>So, whereas <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f260ada7edc55af13f145e5786803198a3452f1e.svg" style="height: 18px;" type="image/svg+xml">DY(b)</object> was an identity matrix in the no-batch case,
here it looks like this:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/4b8148fb1343ab283fa0f8b0cdb6f3723201df15.svg" style="height: 267px;" type="image/svg+xml"> $DY(b)=\begin{bmatrix} 1 &amp; 0 &amp; 0 &amp; \cdots &amp; 0 \\ 1 &amp; 0 &amp; 0 &amp; \cdots &amp; 0 \\ \vdots &amp; \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\ 1 &amp; 0 &amp; 0 &amp; \cdots &amp; 0 \\ 0 &amp; 1 &amp; 0 &amp; \cdots &amp; 0 \\ 0 &amp; 1 &amp; 0 &amp; \cdots &amp; 0 \\ \vdots &amp; \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\ 0 &amp; 0 &amp; 0 &amp; \cdots &amp; 1 \\ 0 &amp; 0 &amp; 0 &amp; \cdots &amp; 1 \\ \vdots &amp; \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\ 0 &amp; 0 &amp; 0 &amp; \cdots &amp; 1 \\ \end{bmatrix}$</object> <p>With <em>B</em> identical rows at a time, for a total of <em>TB</em> rows. Since <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/c7d9499ae5d7e1fc81bc540909deac668210911d.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial L}{\partial Y}</object> is the same as before, their matrix multiplication result has this in column <em>j</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/910c6d77b9356303c0a01896a09d62fe2963d8ac.svg" style="height: 57px;" type="image/svg+xml"> $\frac{\partial{L}}{\partial{b_j}}=\sum_{b=1}^{B}\frac{\partial L}{\partial y_{j}^{[b]}}$</object> <p>Which just means adding up the gradient effects from every batch element independently.</p> </div> <div class="section" id="addendum-gradient-w-r-t-x"> <h2>Addendum - gradient w.r.t. x</h2> <p>This post started by explaining that the parameters of a fully-connected layer we're usually looking to optimize are the weight matrix and bias. 
In most cases this is true; however, in some other cases we're actually interested in propagating a gradient through <em>x</em> - often when there are more layers before the fully-connected layer in question.</p> <p>Let's find the derivative <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/54869ab2743febebc22269d12572c77e057c816e.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial{L}}{\partial{x}}</object>. The chain rule here is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/a52a21efbf7f0b992db127d495abddd618677709.svg" style="height: 38px;" type="image/svg+xml"> $\frac{\partial{L}}{\partial{x}}=D(L \circ y)(x)=DL(y(x)) \cdot Dy(x)$</object> <p>Dimensions: <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2e9597ce1ffd09d94be733216c3f1c1b2ab5f33c.svg" style="height: 18px;" type="image/svg+xml">DL(y(x))</object> is [1, T] as before; <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/a2bef37f23427154c47e53945043549039e36bcf.svg" style="height: 18px;" type="image/svg+xml">Dy(x)</object> has T outputs (elements of <em>y</em>) and N inputs (elements of <em>x</em>), so its dimensions are [T, N]. Therefore, the dimensions of <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/54869ab2743febebc22269d12572c77e057c816e.svg" style="height: 24px;" type="image/svg+xml">\frac{\partial{L}}{\partial{x}}</object> are [1, N].</p> <p>From:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7190e002ac69968b674aecacfd5a8531ad9cd208.svg" style="height: 55px;" type="image/svg+xml"> $y_1=\sum_{j=1}^{N}W_{1,j}x_{j}+b_1$</object> <p>We know that <object class="valign-m10" data="https://eli.thegreenplace.net/images/math/c209b0f19299fee08359f73898212bb0d0df8c30.svg" style="height: 28px;" type="image/svg+xml">\frac{\partial y_1}{\partial x_j}=W_{1,j}</object>. 
Generalizing this, we get <object class="valign-m10" data="https://eli.thegreenplace.net/images/math/6abadcb3365f09141a9cac088fdbb17418e75171.svg" style="height: 28px;" type="image/svg+xml">\frac{\partial y_i}{\partial x_j}=W_{i,j}</object>; in other words, <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/a2bef37f23427154c47e53945043549039e36bcf.svg" style="height: 18px;" type="image/svg+xml">Dy(x)</object> is just the weight matrix <em>W</em>. So <object class="valign-m8" data="https://eli.thegreenplace.net/images/math/e6c2d66d989f1abdb9e8b492e45f00be1ab2a21b.svg" style="height: 25px;" type="image/svg+xml">\frac{\partial{L}}{\partial{x_i}}</object> is the dot product of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2e9597ce1ffd09d94be733216c3f1c1b2ab5f33c.svg" style="height: 18px;" type="image/svg+xml">DL(y(x))</object> with the <em>i</em>-th column of <em>W</em>.</p> <p>Computationally, we can express this as follows:</p> <div class="highlight"><pre><span></span><span class="n">dx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">dy</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">W</span><span class="p">)</span><span class="o">.</span><span class="n">T</span> </pre></div> <p>Again, recall that our vectors are <em>column</em> vectors. Therefore, to multiply <em>dy</em> from the left by <em>W</em> we have to transpose it to a row vector first. 
The result of this matrix multiplication is a [1, N] row-vector, so we transpose it again to get a column.</p> <p>An alternative method to compute this would transpose <em>W</em> rather than <em>dy</em> and then swap the order:</p> <div class="highlight"><pre><span></span><span class="n">dx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">W</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">dy</span><span class="p">)</span> </pre></div> <p>These two methods produce exactly the same <em>dx</em>; it's important to be familiar with these tricks, because otherwise it may be confusing to see a transposed <em>W</em> when we expect the actual <em>W</em> from gradient computations.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id3" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td><p class="first">As explained in the <a class="reference external" href="http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative">softmax post</a>, we <em>linearize</em> the 2D matrix <em>W</em> into a single vector with <em>NT</em> elements using some approach like row-major, where the <em>N</em> elements of the first row go first, then the <em>N</em> elements of the second row, and so on until we have <em>NT</em> elements for all the rows.</p> <p class="last">This is a fully general approach as we can linearize any-dimensional arrays. To work with Jacobians, we're interested in <em>K</em> inputs, no matter where they came from - they could be a linearization of a 4D array. 
As long as we remember which element out of the <em>K</em> corresponds to which original element, we'll be fine.</p> </td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id4" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>In some cases you may hear about <em>averaging</em> the gradients across the batch. Averaging just means dividing the sum by <em>B</em>; it's a constant factor that can be consolidated into the learning rate.</td></tr> </tbody> </table> </div> Depthwise separable convolutions for machine learning2018-04-04T06:21:00-07:002018-04-04T06:21:00-07:00Eli Benderskytag:eli.thegreenplace.net,2018-04-04:/2018/depthwise-separable-convolutions-for-machine-learning/<p>Convolutions are an important tool in modern deep neural networks (DNNs). This post is going to discuss some common types of convolutions, specifically regular and depthwise separable convolutions. My focus will be on the implementation of these operations, showing from-scratch Numpy-based code to compute them and diagrams that explain how …</p><p>Convolutions are an important tool in modern deep neural networks (DNNs). This post is going to discuss some common types of convolutions, specifically regular and depthwise separable convolutions. My focus will be on the implementation of these operations, showing from-scratch Numpy-based code to compute them and diagrams that explain how things work.</p> <p>Note that my main goal here is to explain how depthwise separable convolutions differ from regular ones; if you're completely new to convolutions I suggest reading some more introductory resources first.</p> <p>The code here is compatible with TensorFlow's definition of convolutions in the <a class="reference external" href="https://www.tensorflow.org/api_docs/python/tf/nn">tf.nn</a> module.
After reading this post, the documentation of TensorFlow's convolution ops should be easy to decipher.</p> <div class="section" id="basic-2d-convolution"> <h2>Basic 2D convolution</h2> <p>The basic idea behind a 2D convolution is sliding a small window (usually called a &quot;filter&quot;) over a larger 2D array, and performing a dot product between the filter elements and the corresponding input array elements at every position.</p> <p>Here's a diagram demonstrating the application of a 3x3 convolution filter to a 6x6 array, in 3 different positions. <tt class="docutils literal">W</tt> is the filter, and the yellow-ish array on the right is the result; the red square shows which element in the result array is being computed.</p> <object class="align-center" data="https://eli.thegreenplace.net/images/2018/conv2d-single-block.svg" style="width: 400px;" type="image/svg+xml"> Single-channel 2D convolution</object> <p>The topmost diagram shows the important concept of <em>padding</em>: what should we do when the window goes &quot;out of bounds&quot; on the input array. There are several options, with the following two being most common in DNNs:</p> <ul class="simple"> <li><em>Valid</em> padding: in which only valid, in-bounds windows are considered. This also makes the output smaller than the input, because border elements can't be in the center of a filter (unless the filter is 1x1).</li> <li><em>Same</em> padding: in which we assume there's some constant value outside the bounds of the input (usually 0) and the filter is applied to every element. In this case the output array has the same size as the input array. The diagrams above depict same padding, which I'll keep using throughout the post.</li> </ul> <p>There are other options for the basic 2D convolution case. For example, the filter can be moving over the input in jumps of more than 1, thus not centering on all elements. This is called <em>stride</em>, and in this post I'm always using stride of 1. 
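</p> <p>As an aside (this helper is my own sketch, not from the original post), the effect of the padding mode and stride on the output's spatial size follows the standard formula <tt class="docutils literal">(n + 2p - f) // s + 1</tt>:</p>

```python
def conv2d_output_size(n, f, padding='same', stride=1):
    """Spatial output size along one dimension.

    n is the input size and f the (odd) filter size along that dimension.
    """
    assert padding in ('same', 'valid')
    pad = f // 2 if padding == 'same' else 0  # zeros added on each side
    return (n + 2 * pad - f) // stride + 1

# The 6x6 input with a 3x3 filter from the diagrams above:
print(conv2d_output_size(6, 3, 'same'))    # 6 - output has the input's size
print(conv2d_output_size(6, 3, 'valid'))   # 4 - border positions are dropped
```

<p>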
Convolutions can also be dilated (or <em>atrous</em>), wherein the filter is expanded with gaps between every element. In this post I'm not going to discuss dilated convolutions and other options - there are plenty of resources on these topics online.</p> </div> <div class="section" id="implementing-the-2d-convolution"> <h2>Implementing the 2D convolution</h2> <p>Here is a full Python implementation of the simple 2D convolution. It's called &quot;single channel&quot; to distinguish it from the more general case in which the input has more than two dimensions; we'll get to that shortly.</p> <p>This implementation is fully self-contained, and only needs Numpy to work. All the loops are fully explicit - I specifically avoided vectorizing them for efficiency to maintain clarity:</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">conv2d_single_channel</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">w</span><span class="p">):</span> <span class="sd">&quot;&quot;&quot;Two-dimensional convolution of a single channel.</span> <span class="sd"> Uses SAME padding with 0s, a stride of 1 and no dilation.</span> <span class="sd"> input: input array with shape (height, width)</span> <span class="sd"> w: filter array with shape (fd, fd) with odd fd.</span> <span class="sd"> Returns a result with the same shape as input.</span> <span class="sd"> &quot;&quot;&quot;</span> <span class="k">assert</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="ow">and</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> 
<span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">1</span> <span class="c1"># SAME padding with zeros: creating a new padded array to simplify index</span> <span class="c1"># calculations and to avoid checking boundary conditions in the inner loop.</span> <span class="c1"># padded_input is like input, but padded on all sides with</span> <span class="c1"># half-the-filter-width of zeros.</span> <span class="n">padded_input</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">pad</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">pad_width</span><span class="o">=</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">//</span> <span class="mi">2</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s1">&#39;constant&#39;</span><span class="p">,</span> <span class="n">constant_values</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="n">output</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="nb">input</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">output</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">output</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]):</span> <span class="c1"># This inner double 
loop computes every output element, by</span> <span class="c1"># multiplying the corresponding window into the input with the</span> <span class="c1"># filter.</span> <span class="k">for</span> <span class="n">fi</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span> <span class="k">for</span> <span class="n">fj</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]):</span> <span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">padded_input</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">fi</span><span class="p">,</span> <span class="n">j</span> <span class="o">+</span> <span class="n">fj</span><span class="p">]</span> <span class="o">*</span> <span class="n">w</span><span class="p">[</span><span class="n">fi</span><span class="p">,</span> <span class="n">fj</span><span class="p">]</span> <span class="k">return</span> <span class="n">output</span> </pre></div> </div> <div class="section" id="convolutions-in-3-and-4-dimensions"> <h2>Convolutions in 3 and 4 dimensions</h2> <p>The convolution computed above works in two dimensions; yet, most convolutions used in DNNs are 4-dimensional. For example, TensorFlow's <tt class="docutils literal">tf.nn.conv2d</tt> op takes a 4D input tensor and a 4D filter tensor. How come?</p> <p>The two additional dimensions in the input tensor are <em>channel</em> and <em>batch</em>. A canonical example of channels is color images in RGB format. 
Each pixel has a value for red, green and blue - three channels overall. So instead of seeing it as a matrix of triples, we can see it as a 3D tensor where one dimension is height, another width and another channel (also called the <em>depth</em> dimension).</p> <p>Batch is somewhat different. ML training - with stochastic gradient descent - is often done in batches for performance; we train the model not on a single sample at a time, but a &quot;batch&quot; of samples, usually some power of two. Performing all the operations in tandem on a batch of data makes it easier to leverage the SIMD capabilities of modern processors. So it doesn't have any mathematical significance here - it can be seen as an outer loop over all operations, performing them for a set of inputs and producing a corresponding set of outputs.</p> <p>For filters, the 4 dimensions are height, width, input channel and output channel. Input channel is the same as the input tensor's; output channel collects multiple filters, each of which can be different.</p> <p>This can be slightly difficult to grasp from text, so here's a diagram:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/2018/conv2d-3d.svg" style="width: 300px;" type="image/svg+xml"> Multi-channel 2D convolution</object> <p>In the diagram and the implementation I'm going to ignore the batch dimension, since it's not really mathematically interesting. So the input image has three dimensions - in this diagram height and width are 8 and depth is 3. The filter is 3x3 with depth 3. In each step, the filter is slid over the input <em>in two dimensions</em>, and all of its elements are multiplied with the corresponding elements in the input. 
That's 3x3x3=27 multiplications added into the output element.</p> <p>Note that this is different from a 3D convolution, where a filter is moved across the input in all 3 dimensions; true 3D convolutions are not widely used in DNNs at this time.</p> <p>So, to reiterate, to compute the multi-channel convolution as shown in the diagram above, we compute each of the 64 output elements by a dot-product of the filter with the relevant parts of the input tensor. This produces a single output channel. To produce additional output channels, we perform the convolution with additional filters. So if our filter has dimensions (3, 3, 3, 4) this means 4 different 3x3x3 filters. The output will thus have dimensions 8x8 for the spatials and 4 for depth.</p> <p>Here's the Numpy implementation of this algorithm:</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">conv2d_multi_channel</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">w</span><span class="p">):</span> <span class="sd">&quot;&quot;&quot;Two-dimensional convolution with multiple channels.</span> <span class="sd"> Uses SAME padding with 0s, a stride of 1 and no dilation.</span> <span class="sd"> input: input array with shape (height, width, in_depth)</span> <span class="sd"> w: filter array with shape (fd, fd, in_depth, out_depth) with odd fd.</span> <span class="sd"> in_depth is the number of input channels, and has to be the same as</span> <span class="sd"> input&#39;s in_depth; out_depth is the number of output channels.</span> <span class="sd"> Returns a result with shape (height, width, out_depth).</span> <span class="sd"> &quot;&quot;&quot;</span> <span class="k">assert</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">w</span><span class="o">.</span><span
class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="ow">and</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">1</span> <span class="n">padw</span> <span class="o">=</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">//</span> <span class="mi">2</span> <span class="n">padded_input</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">pad</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">pad_width</span><span class="o">=</span><span class="p">((</span><span class="n">padw</span><span class="p">,</span> <span class="n">padw</span><span class="p">),</span> <span class="p">(</span><span class="n">padw</span><span class="p">,</span> <span class="n">padw</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)),</span> <span class="n">mode</span><span class="o">=</span><span class="s1">&#39;constant&#39;</span><span class="p">,</span> <span class="n">constant_values</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="n">height</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">in_depth</span> <span class="o">=</span> <span class="nb">input</span><span class="o">.</span><span class="n">shape</span> <span class="k">assert</span> <span class="n">in_depth</span> <span class="o">==</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span 
class="p">]</span> <span class="n">out_depth</span> <span class="o">=</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="n">output</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">height</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">out_depth</span><span class="p">))</span> <span class="k">for</span> <span class="n">out_c</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">out_depth</span><span class="p">):</span> <span class="c1"># For each output channel, perform 2d convolution summed across all</span> <span class="c1"># input channels.</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">height</span><span class="p">):</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">width</span><span class="p">):</span> <span class="c1"># Now the inner loop also works across all input channels.</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">in_depth</span><span class="p">):</span> <span class="k">for</span> <span class="n">fi</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span> <span class="k">for</span> <span class="n">fj</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span 
class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]):</span> <span class="n">w_element</span> <span class="o">=</span> <span class="n">w</span><span class="p">[</span><span class="n">fi</span><span class="p">,</span> <span class="n">fj</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">out_c</span><span class="p">]</span> <span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">,</span> <span class="n">out_c</span><span class="p">]</span> <span class="o">+=</span> <span class="p">(</span> <span class="n">padded_input</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">fi</span><span class="p">,</span> <span class="n">j</span> <span class="o">+</span> <span class="n">fj</span><span class="p">,</span> <span class="n">c</span><span class="p">]</span> <span class="o">*</span> <span class="n">w_element</span><span class="p">)</span> <span class="k">return</span> <span class="n">output</span> </pre></div> <p>An interesting point to note here w.r.t. TensorFlow's <tt class="docutils literal">tf.nn.conv2d</tt> op. If you read its semantics you'll see discussion of <em>layout</em> or <em>data format</em>, which is <tt class="docutils literal">NHWC</tt> by default. NHWC simply means the order of dimensions in a 4D tensor is:</p> <ul class="simple"> <li><strong>N</strong>: batch</li> <li><strong>H</strong>: height (spatial dimension)</li> <li><strong>W</strong>: width (spatial dimension)</li> <li><strong>C</strong>: channel (depth)</li> </ul> <p><tt class="docutils literal">NHWC</tt> is the default layout for TensorFlow; another commonly used layout is <tt class="docutils literal">NCHW</tt>, because it's the format preferred by NVIDIA's DNN libraries. 
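</p> <p>Converting between the two layouts is just an axis permutation; here's a small sketch (my addition, not from the original post) using Numpy:</p>

```python
import numpy as np

# A batch of 10 images, 8x8 pixels, 3 channels, in NHWC layout.
x_nhwc = np.zeros((10, 8, 8, 3))

# Permute the axes to get NCHW: batch, channel, height, width.
x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))
print(x_nchw.shape)  # (10, 3, 8, 8)
```

<p>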
The code samples here follow the default.</p> </div> <div class="section" id="depthwise-convolution"> <h2>Depthwise convolution</h2> <p>Depthwise convolutions are a variation on the operation discussed so far. In the regular 2D convolution performed over multiple input channels, the filter is as deep as the input and lets us freely mix channels to generate each element in the output. Depthwise convolutions don't do that - each channel is kept separate - hence the name <em>depthwise</em>. Here's a diagram to help explain how that works:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/2018/conv2d-depthwise.svg" style="width: 500px;" type="image/svg+xml"> Depthwise 2D convolution</object> <p>There are three conceptual stages here:</p> <ol class="arabic simple"> <li>Split the input into channels, and split the filter into channels (the number of channels between input and filter must match).</li> <li>For each of the channels, convolve the input with the corresponding filter, producing an output tensor (2D).</li> <li>Stack the output tensors back together.</li> </ol> <p>Here's the code implementing it:</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">depthwise_conv2d</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">w</span><span class="p">):</span> <span class="sd">&quot;&quot;&quot;Two-dimensional depthwise convolution.</span> <span class="sd"> Uses SAME padding with 0s, a stride of 1 and no dilation. 
A single output</span> <span class="sd"> channel is used per input channel (channel_multiplier=1).</span> <span class="sd"> input: input array with shape (height, width, in_depth)</span> <span class="sd"> w: filter array with shape (fd, fd, in_depth)</span> <span class="sd"> Returns a result with shape (height, width, in_depth).</span> <span class="sd"> &quot;&quot;&quot;</span> <span class="k">assert</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="ow">and</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">1</span> <span class="n">padw</span> <span class="o">=</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">//</span> <span class="mi">2</span> <span class="n">padded_input</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">pad</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">pad_width</span><span class="o">=</span><span class="p">((</span><span class="n">padw</span><span class="p">,</span> <span class="n">padw</span><span class="p">),</span> <span class="p">(</span><span class="n">padw</span><span class="p">,</span> <span class="n">padw</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)),</span> <span class="n">mode</span><span class="o">=</span><span 
class="s1">&#39;constant&#39;</span><span class="p">,</span> <span class="n">constant_values</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="n">height</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">in_depth</span> <span class="o">=</span> <span class="nb">input</span><span class="o">.</span><span class="n">shape</span> <span class="k">assert</span> <span class="n">in_depth</span> <span class="o">==</span> <span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="n">output</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">height</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">in_depth</span><span class="p">))</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">in_depth</span><span class="p">):</span> <span class="c1"># For each input channel separately, apply its corresponsing filter</span> <span class="c1"># to the input.</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">height</span><span class="p">):</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">width</span><span class="p">):</span> <span class="k">for</span> <span class="n">fi</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span> <span class="k">for</span> <span 
class="n">fj</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]):</span> <span class="n">w_element</span> <span class="o">=</span> <span class="n">w</span><span class="p">[</span><span class="n">fi</span><span class="p">,</span> <span class="n">fj</span><span class="p">,</span> <span class="n">c</span><span class="p">]</span> <span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">,</span> <span class="n">c</span><span class="p">]</span> <span class="o">+=</span> <span class="p">(</span> <span class="n">padded_input</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">fi</span><span class="p">,</span> <span class="n">j</span> <span class="o">+</span> <span class="n">fj</span><span class="p">,</span> <span class="n">c</span><span class="p">]</span> <span class="o">*</span> <span class="n">w_element</span><span class="p">)</span> <span class="k">return</span> <span class="n">output</span> </pre></div> <p>In TensorFlow, the corresponding op is <tt class="docutils literal">tf.nn.depthwise_conv2d</tt>; this op has the notion of <em>channel multiplier</em> which lets us compute multiple outputs for each input channel (somewhat like the number of output channels concept in <tt class="docutils literal">conv2d</tt>).</p> </div> <div class="section" id="depthwise-separable-convolution"> <h2>Depthwise separable convolution</h2> <p>The depthwise convolution shown above is more commonly used in combination with an additional step to mix in the channels - <em>depthwise separable convolution</em> <a class="footnote-reference" href="#id2" id="id1"></a>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/2018/conv2d-depthwise-separable.svg" style="width: 
500px;" type="image/svg+xml"> Depthwise separable convolution</object> <p>After completing the depthwise convolution, an additional step is performed: a 1x1 convolution across channels. This is exactly the same operation as the &quot;convolution in 3 dimensions&quot; discussed earlier - just with a 1x1 spatial filter. This step can be repeated multiple times for different output channels. The output channels all take the output of the depthwise step and mix it up with different 1x1 convolutions. Here's the implementation:</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">separable_conv2d</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">w_depth</span><span class="p">,</span> <span class="n">w_pointwise</span><span class="p">):</span> <span class="sd">&quot;&quot;&quot;Depthwise separable convolution.</span> <span class="sd"> Performs 2d depthwise convolution with w_depth, and then applies a pointwise</span> <span class="sd"> 1x1 convolution with w_pointwise on the result.</span> <span class="sd"> Uses SAME padding with 0s, a stride of 1 and no dilation. A single output</span> <span class="sd"> channel is used per input channel (channel_multiplier=1) in w_depth.</span> <span class="sd"> input: input array with shape (height, width, in_depth)</span> <span class="sd"> w_depth: depthwise filter array with shape (fd, fd, in_depth)</span> <span class="sd"> w_pointwise: pointwise filter array with shape (in_depth, out_depth)</span> <span class="sd"> Returns a result with shape (height, width, out_depth).</span> <span class="sd"> &quot;&quot;&quot;</span> <span class="c1"># First run the depthwise convolution. 
Its result has the same shape as</span> <span class="c1"># input.</span> <span class="n">depthwise_result</span> <span class="o">=</span> <span class="n">depthwise_conv2d</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">w_depth</span><span class="p">)</span> <span class="n">height</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">in_depth</span> <span class="o">=</span> <span class="n">depthwise_result</span><span class="o">.</span><span class="n">shape</span> <span class="k">assert</span> <span class="n">in_depth</span> <span class="o">==</span> <span class="n">w_pointwise</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">out_depth</span> <span class="o">=</span> <span class="n">w_pointwise</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="n">output</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">height</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">out_depth</span><span class="p">))</span> <span class="k">for</span> <span class="n">out_c</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">out_depth</span><span class="p">):</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">height</span><span class="p">):</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">width</span><span class="p">):</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> 
<span class="nb">range</span><span class="p">(</span><span class="n">in_depth</span><span class="p">):</span> <span class="n">w_element</span> <span class="o">=</span> <span class="n">w_pointwise</span><span class="p">[</span><span class="n">c</span><span class="p">,</span> <span class="n">out_c</span><span class="p">]</span> <span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">,</span> <span class="n">out_c</span><span class="p">]</span> <span class="o">+=</span> <span class="n">depthwise_result</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">,</span> <span class="n">c</span><span class="p">]</span> <span class="o">*</span> <span class="n">w_element</span> <span class="k">return</span> <span class="n">output</span> </pre></div> <p>In TensorFlow, this op is called <tt class="docutils literal">tf.nn.separable_conv2d</tt>. Similarly to our implementation, it takes two different filter parameters: <tt class="docutils literal">depthwise_filter</tt> for the depthwise step and <tt class="docutils literal">pointwise_filter</tt> for the mixing step.</p> <p>Depthwise separable convolutions have become popular in DNN models recently, for two reasons:</p> <ol class="arabic simple"> <li>They have fewer parameters than &quot;regular&quot; convolutional layers, and thus are less prone to overfitting.</li> <li>With fewer parameters, they also require fewer operations to compute, and thus are cheaper and faster.</li> </ol> <p>Let's examine the difference between the number of parameters first. 
We'll start with some definitions:</p> <ul class="simple"> <li><tt class="docutils literal">S</tt>: spatial dimension - width and height, assuming square inputs.</li> <li><tt class="docutils literal">F</tt>: filter width and height, assuming square filter.</li> <li><tt class="docutils literal">inC</tt>: number of input channels.</li> <li><tt class="docutils literal">outC</tt>: number of output channels.</li> </ul> <p>We also assume <tt class="docutils literal">SAME</tt> padding as discussed above, so that the spatial size of the output matches the input.</p> <p>In a regular convolution there are <tt class="docutils literal">F*F*inC*outC</tt> parameters, because every filter is 3D and there's one such filter per output channel.</p> <p>In depthwise separable convolutions there are <tt class="docutils literal">F*F*inC</tt> parameters for the depthwise part, and then <tt class="docutils literal">inC*outC</tt> parameters for the mixing part. It should be obvious that for a non-trivial <tt class="docutils literal">outC</tt>, the sum of these two is significantly smaller than <tt class="docutils literal">F*F*inC*outC</tt>.</p> <p>Now on to computational cost. For a regular convolution, we perform <tt class="docutils literal">F*F*inC</tt> operations at each position of the input (to compute the 2D convolution over 3 dimensions). For the whole input, the number of computations is thus <tt class="docutils literal">F*F*inC*S*S</tt> and taking all the output channels we get <tt class="docutils literal">F*F*inC*S*S*outC</tt>.</p> <p>For depthwise separable convolutions we need <tt class="docutils literal">F*F*inC*S*S</tt> operations for the depthwise part; then we need <tt class="docutils literal">S*S*inC*outC</tt> operations for the mixing part. 
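These formulas are easy to wrap in a small helper for experimenting with the counts (a sketch; the function name is ours):

```python
def conv_costs(S, F, inC, outC):
    """Parameter and operation counts for regular vs. depthwise separable
    convolution, following the definitions above (square input of side S,
    square filter of side F, SAME padding, stride 1)."""
    regular_params = F * F * inC * outC
    regular_ops = F * F * inC * S * S * outC
    separable_params = F * F * inC + inC * outC
    separable_ops = F * F * inC * S * S + S * S * inC * outC
    return regular_params, regular_ops, separable_params, separable_ops

print(conv_costs(128, 3, 3, 16))  # (432, 7077888, 75, 1228800)
```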
Let's use some real numbers to get a feel for the difference:</p> <p>We'll assume <tt class="docutils literal">S=128</tt>, <tt class="docutils literal">F=3</tt>, <tt class="docutils literal">inC=3</tt>, <tt class="docutils literal">outC=16</tt>. For regular convolution:</p> <ul class="simple"> <li>Parameters: <tt class="docutils literal">3*3*3*16 = 432</tt></li> <li>Computation cost: <tt class="docutils literal">3*3*3*128*128*16 = ~7e6</tt></li> </ul> <p>For depthwise separable convolution:</p> <ul class="simple"> <li>Parameters: <tt class="docutils literal">3*3*3+3*16 = 75</tt></li> <li>Computation cost: <tt class="docutils literal">3*3*3*128*128+128*128*3*16 = ~1.2e6</tt></li> </ul> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id2" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>The term <em>separable</em> comes from image processing, where <em>spatially separable convolutions</em> are sometimes used to save on computation resources. A spatial convolution is separable when the 2D convolution filter can be expressed as an outer product of two vectors. This lets us compute some 2D convolutions more cheaply. In the case of DNNs, the spatial filter is not necessarily separable but the channel dimension is separable from the spatial dimensions.</td></tr> </tbody> </table> </div> The Confusion Matrix in statistical tests2018-03-26T05:47:00-07:002018-03-26T05:47:00-07:00Eli Benderskytag:eli.thegreenplace.net,2018-03-26:/2018/the-confusion-matrix-in-statistical-tests/<p>This winter was one of the worst flu seasons in recent years, so I found myself curious to learn more about the diagnostic flu tests available to doctors in addition to the usual &quot;looks like bad cold but no signs of bacteria&quot; strategy. 
There's a wide array of RIDTs (Rapid …</p><p>This winter was one of the worst flu seasons in recent years, so I found myself curious to learn more about the diagnostic flu tests available to doctors in addition to the usual &quot;looks like bad cold but no signs of bacteria&quot; strategy. There's a wide array of RIDTs (Rapid Influenza Diagnostic Tests) available to doctors today <a class="footnote-reference" href="#id3" id="id1"></a>, and reading through the literature quickly has you deciphering statements like:</p> <blockquote> Overall, RIDTs had a modest <em>sensitivity</em> of 62.3% and a high <em>specificity</em> of 98.2%, corresponding to a <em>positive likelihood ratio</em> of 34.5 and a <em>negative likelihood ratio of 0.38</em>. For the clinician, this means that although <em>false-negatives</em> are frequent (occurring in nearly four out of ten negative RIDTs), a positive test is unlikely to be a <em>false-positive</em> result. A diagnosis of influenza can thus confidently be made in the presence of a positive RIDT. However, a negative RIDT result is unreliable and should be confirmed by traditional diagnostic tests if the result is likely to affect patient management.</blockquote> <p>While I had heard about statistical test quality measures like <em>sensitivity</em> before, there are too many terms here to remember for someone not dealing with these things routinely; this post is my attempt at documenting this understanding for future use.</p> <div class="section" id="a-table-of-test-outcomes"> <h2>A table of test outcomes</h2> <p>Let's say there is a condition with a binary outcome (&quot;yes&quot; vs. &quot;no&quot;, 1 vs 0, or whatever you want to call it). Suppose we conduct a test that is designed to detect this condition; the test also has a binary outcome. 
The totality of outcomes can thus be represented with a 2-by-2 table, which is also called the <a class="reference external" href="https://en.wikipedia.org/wiki/Confusion_matrix">Confusion Matrix</a>.</p> <p>Suppose 10000 patients get tested for flu; out of them, 9000 are actually healthy and 1000 are actually sick. For the sick people, a test was positive for 620 and negative for 380. For the healthy people, the same test was positive for 180 and negative for 8820. Let's summarize these results in a table:</p> <img alt="Confusion matrix with numbers only" class="align-center" src="https://eli.thegreenplace.net/images/2018/confusionmatrix.png" /> <p>Now comes our first batch of definitions.</p> <ul class="simple"> <li><strong>True Positive (TP)</strong>: positive test result matches reality - person is actually sick and tested positive.</li> <li><strong>False Positive (FP)</strong>: positive test result doesn't match reality - test is positive but the person is not actually sick.</li> <li><strong>True Negative (TN)</strong>: negative test result matches reality - person is not sick and tested negative.</li> <li><strong>False Negative (FN)</strong>: negative test result doesn't match reality - test is negative but the person is actually sick.</li> </ul> <p>Folks get confused with these often, so here's a useful heuristic: positive vs. negative reflects the test outcome; true vs. 
false reflects whether the test got it right or got it wrong.</p> <p>Since the rest of the definitions build upon these, here's the confusion matrix again now with them embedded:</p> <img alt="Confusion matrix with TP, FP, TN, FN marked" class="align-center" src="https://eli.thegreenplace.net/images/2018/confusionmatrix-tptnfpfn.png" /> </div> <div class="section" id="definition-soup"> <h2>Definition soup</h2> <p>Armed with these and <strong>N</strong> for the <em>total population</em> (10000 in our case), we are now ready to tackle the multitude of definitions statisticians have produced over the years to describe the performance of tests:</p> <ul class="simple"> <li><strong>Prevalence</strong>: how common is the actual disease in the population<ul> <li>(FN+TP)/N</li> <li>In the example: (380+620)/10000=0.1</li> </ul> </li> <li><strong>Accuracy</strong>: how often is the test correct<ul> <li>(TP+TN)/N</li> <li>In the example: (620+8820)/10000=0.944</li> </ul> </li> <li><strong>Misclassification rate</strong>: how often the test is wrong<ul> <li>1 - Accuracy = (FP+FN)/N</li> <li>In the example: (180+380)/10000=0.056</li> </ul> </li> <li><strong>Sensitivity</strong> or <strong>True Positive Rate (TPR)</strong> or <strong>Recall</strong>: when the patient is sick, how often does the test actually predict it correctly<ul> <li>TP/(TP+FN)</li> <li>In the example: 620/(620+380)=0.62</li> </ul> </li> <li><strong>Specificity</strong> or <strong>True Negative Rate (TNR)</strong>: when the patient is not sick, how often does the test actually predict it correctly<ul> <li>TN/(TN+FP)</li> <li>In the example: 8820/(8820+180)=0.98</li> </ul> </li> <li><strong>False Positive Rate (FPR)</strong>: probability of false alarm<ul> <li>1 - Specificity = FP/(TN+FP)</li> <li>In the example: 180/(8820+180)=0.02</li> </ul> </li> <li><strong>False Negative Rate (FNR)</strong>: miss rate, the probability that the test misses a sickness<ul> <li>1 - Sensitivity = FN/(TP+FN)</li> <li>In the 
example: 380/(620+380)=0.38</li> </ul> </li> <li><strong>Precision</strong> or <strong>Positive Predictive Value (PPV)</strong>: when the prediction is positive, how often is it correct<ul> <li>TP/(TP+FP)</li> <li>In the example: 620/(620+180)=0.775</li> </ul> </li> <li><strong>Negative Predictive Value (NPV)</strong>: when the prediction is negative, how often is it correct<ul> <li>TN/(TN+FN)</li> <li>In the example: 8820/(8820+380)=0.959</li> </ul> </li> <li><strong>Positive Likelihood Ratio</strong>: the ratio of the probability of a positive result for a sick person to that for a healthy person (used with odds formulations of probability)<ul> <li>TPR/FPR</li> <li>In the example: 0.62/0.02=31</li> </ul> </li> <li><strong>Negative Likelihood Ratio</strong>: the ratio of the probability of a negative result for a sick person to that for a healthy person<ul> <li>FNR/TNR</li> <li>In the example: 0.38/0.98=0.388</li> </ul> </li> </ul> <p><a class="reference external" href="https://en.wikipedia.org/wiki/Confusion_matrix">The wikipedia page</a> has even more.</p> </div> <div class="section" id="deciphering-our-example"> <h2>Deciphering our example</h2> <p>Now back to the flu test example this post began with. RIDTs are said to have sensitivity of 62.3%; this is just a clever way of saying that for a person with flu, the test will be positive 62.3% of the time. 
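The definition soup above can be double-checked mechanically from the four counts in the example's confusion matrix; here's a plain-Python sanity check (variable names are ours):

```python
# Counts from the example's confusion matrix (10000-patient sample).
TP, FP, FN, TN = 620, 180, 380, 8820
N = TP + FP + FN + TN

prevalence  = (TP + FN) / N        # 0.1
accuracy    = (TP + TN) / N        # 0.944
sensitivity = TP / (TP + FN)       # 0.62  - TPR, recall
specificity = TN / (TN + FP)       # 0.98  - TNR
fpr         = FP / (TN + FP)       # 0.02
fnr         = FN / (TP + FN)       # 0.38
precision   = TP / (TP + FP)       # 0.775 - PPV
npv         = TN / (TN + FN)       # ~0.959
plr         = sensitivity / fpr    # ~31
nlr         = fnr / specificity    # ~0.388
```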
For people who do not have the flu, the test is more accurate since its specificity is 98.2% - only 1.8% of healthy people will be flagged positive.</p> <p>The positive likelihood ratio is said to be 34.5; let's see how it was computed:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/85bd9dc7996bdc9198da6226a73cfb1e900734ea.svg" style="height: 41px;" type="image/svg+xml"> $PLR=\frac{TPR}{FPR}=\frac{Sensitivity}{1-Specificity}=\frac{0.623}{1-0.982}=35$</object> <p>This is to say - a positive result is about 35 times more likely for a sick person than for a healthy one.</p> <p>And the negative likelihood ratio is said to be 0.38:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/895921ee70c59a531ceaff02c94404e5d5316697.svg" style="height: 41px;" type="image/svg+xml"> $NLR=\frac{FNR}{TNR}=\frac{1-Sensitivity}{Specificity}=\frac{1-0.623}{0.982}=0.38$</object> <p>This is to say - a negative result is only about 0.38 times as likely for a sick person as for a healthy one.</p> <p>In other words, a positive result from these flu tests is quite trustworthy, while a negative result is much less reliable - which is exactly what the quoted paragraph at the top of the post ends up saying.</p> </div> <div class="section" id="back-to-bayes"> <h2>Back to Bayes</h2> <p>An astute reader will notice that the previous sections talk about the probability of test outcomes given sickness, when we're usually interested in the opposite - given a positive test, how likely is it that the person is actually sick.</p> <p><a class="reference external" href="https://eli.thegreenplace.net/2018/conditional-probability-and-bayes-theorem/">My previous post on the Bayes theorem</a> covered this issue in depth <a class="footnote-reference" href="#id4" id="id2"></a>. Let's recap, using the actual numbers from our example. 
The events are:</p> <ul class="simple"> <li><img alt="T" class="valign-0" src="https://eli.thegreenplace.net/images/math/c2c53d66948214258a26ca9ca845d7ac0c17f8e7.png" style="height: 12px;" />: test is positive</li> <li><object class="valign-0" data="https://eli.thegreenplace.net/images/math/0e4c77261e251cb98e8cedc2b74772ae6f14318d.svg" style="height: 15px;" type="image/svg+xml">T^C</object>: test is negative</li> <li><object class="valign-0" data="https://eli.thegreenplace.net/images/math/e69f20e9f683920d3fb4329abd951e878b1f9372.svg" style="height: 12px;" type="image/svg+xml">F</object>: person actually sick with flu</li> <li><object class="valign-0" data="https://eli.thegreenplace.net/images/math/9f37c126895eff99088c5545433d3e33692aa267.svg" style="height: 15px;" type="image/svg+xml">F^C</object>: person doesn't have flu</li> </ul> <p>Sensitivity of 0.623 means <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/27fd26004577f906dcaaefbf0553b7232c432d0a.svg" style="height: 18px;" type="image/svg+xml">P(T|F)=0.623</object>; similarly, specificity is <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/72e4f07dc70098feeb50032960ebe0478451e087.svg" style="height: 19px;" type="image/svg+xml">P(T^C|F^C)=0.982</object>. 
We're interested in finding <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/b8a3711b3dc7ec3b1a73a82447c88ce067999fed.svg" style="height: 18px;" type="image/svg+xml">P(F|T)</object>, and we can use the Bayes theorem for that:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/2d2d2b1a53f1b7a76155a9abe9b7b5d35198e669.svg" style="height: 42px;" type="image/svg+xml"> $P(F|T)=\frac{P(T|F)P(F)}{P(T)}$</object> <p>Recall that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ce2db3a1be8bbfefc6a7bdee94dcaca4f5426799.svg" style="height: 18px;" type="image/svg+xml">P(F)</object> is the <em>prevalence</em> of flu in the general population; for the sake of this example let's assume it's 0.1; we'll then compute <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9d3d6e3b10a97d19adafbea8cc72b8e3619a1d27.svg" style="height: 18px;" type="image/svg+xml">P(T)</object> by using the law of total probability as follows:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7b37bf70ff6ae37f99ca38df902a48adbf14a542.svg" style="height: 21px;" type="image/svg+xml"> $P(T)=P(T|F)P(F)+P(T|F^C)P(F^C)$</object> <p>Obviously, <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/a7828371e81491815c4498bc5dcb7407b63beeed.svg" style="height: 19px;" type="image/svg+xml">P(T|F^C)=1-P(T^C|F^C)=0.018</object>, so:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/9f16b641c00488170b04f5bb409a5d0d21c958c8.svg" style="height: 18px;" type="image/svg+xml"> $P(T)=0.623\ast0.1 + 0.018\ast0.9=0.0785$</object> <p>And then:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/41b1c0bec6fa21e5322dd66a756d5e9d9e7253f3.svg" style="height: 36px;" type="image/svg+xml"> $P(F|T)=\frac{0.623\ast 0.1}{0.0785}=0.79$</object> <p>So the probability of having flu given a positive test and a 10% flu prevalence is 79%. 
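The whole chain - Bayes' theorem plus the law of total probability - fits in a tiny function, which makes it easy to replay with other priors (a sketch; the function name and the RIDT-based defaults are ours):

```python
def p_flu_given_positive(prevalence, sensitivity=0.623, specificity=0.982):
    """P(F|T): probability of flu given a positive test.

    P(T) is expanded with the law of total probability, then Bayes'
    theorem gives P(F|T) = P(T|F)P(F) / P(T).
    """
    p_t = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / p_t

print(round(p_flu_given_positive(0.1), 2))  # 0.79
```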
The prevalence strongly affects the outcome! Let's plot <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/b8a3711b3dc7ec3b1a73a82447c88ce067999fed.svg" style="height: 18px;" type="image/svg+xml">P(F|T)</object> as a function of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ce2db3a1be8bbfefc6a7bdee94dcaca4f5426799.svg" style="height: 18px;" type="image/svg+xml">P(F)</object> for some reasonable range of values:</p> <img alt="P(F|T) as function of prevalence" class="align-center" src="https://eli.thegreenplace.net/images/2018/pft-prevalence-plot.png" /> <p>Note how low the value of the test becomes with low disease prevalence - we've also observed this phenomenon in <a class="reference external" href="https://eli.thegreenplace.net/2018/conditional-probability-and-bayes-theorem/">the previous post</a>; there's a &quot;tug of war&quot; between the prevalence and the test's sensitivity and specificity. In fact, <a class="reference external" href="https://www.cdc.gov/flu/professionals/diagnosis/rapidlab.htm">the official CDC guidelines page</a> for interpreting RIDT results discusses this:</p> <blockquote> When influenza prevalence is relatively low, the positive predictive value (PPV) is low and false-positive test results are more likely. By contrast, when influenza prevalence is low, the negative predictive value (NPV) is high, and negative results are more likely to be true.</blockquote> <p>And then goes on to present a handy table for estimating PPV based on prevalence and specificity.</p> <p>Naturally, the rapid test is not the only tool in the doctor's toolbox. Flu has other symptoms, and by observing them on the patient the doctor can increase their confidence in the diagnosis. 
For example, if the probability <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/b8a3711b3dc7ec3b1a73a82447c88ce067999fed.svg" style="height: 18px;" type="image/svg+xml">P(F|T)</object> given 10% prevalence is 0.79 (as computed above), the doctor may be significantly less sure of the results if flu symptoms like cough and fever are not present. The CDC discusses this in more detail with an <a class="reference external" href="https://www.cdc.gov/flu/professionals/diagnosis/algorithm-results-not-circulating.htm">algorithm for interpreting flu results</a>.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id3" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>Slower tests like full viral cultures are also available, and they are very accurate. The problem is that these tests take a long time to complete - days - so they're usually not very useful in treating the disease. Anti-viral medication is only useful in the first 48 hours after disease onset. RIDTs provide results within hours, or even minutes.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id4" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>In that post we didn't distinguish between sensitivity and specificity, but assumed they're equal at 90%. 
It's much more common for these measures to be different, but it doesn't actually complicate the computations.</td></tr> </tbody> </table> </div> Conditional probability and Bayes' theorem2018-03-13T05:32:00-07:002018-03-13T05:32:00-07:00Eli Benderskytag:eli.thegreenplace.net,2018-03-13:/2018/conditional-probability-and-bayes-theorem/<p>One morning, while seeing a mention of a disease on Hacker News, Bob decides on a whim to get tested for it; there are no other symptoms, he's just curious. He convinces his doctor to order a blood test, which is known to be 90% accurate. For 9 out of …</p><p>One morning, while seeing a mention of a disease on Hacker News, Bob decides on a whim to get tested for it; there are no other symptoms, he's just curious. He convinces his doctor to order a blood test, which is known to be 90% accurate. For 9 out of 10 sick people it will detect the disease (but for 1 out of 10 it won't); similarly, for 9 out of 10 healthy people it will report no disease (but for 1 out of 10 it will).</p> <p>Unfortunately for Bob, his test is positive; what's the probability that Bob actually has the disease?</p> <p>You might be tempted to say 90%, but this is wrong. One of the most common fallacies made in probability and statistics is mixing up conditional probabilities. Given event D - &quot;Bob has disease&quot; and event T - &quot;test was positive&quot;, we want to know what is <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/28f810c7c292ba7faa5b47a0b4e0d470f79b19d8.svg" style="height: 18px;" type="image/svg+xml">P(D|T)</object> - the conditional probability of D given T. 
But the test result is actually giving us <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/7455e3a5dfb3c45f4a3b110a7906341362a50e53.svg" style="height: 18px;" type="image/svg+xml">P(T|D)</object> - which is distinct from <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/28f810c7c292ba7faa5b47a0b4e0d470f79b19d8.svg" style="height: 18px;" type="image/svg+xml">P(D|T)</object>.</p> <p>In fact, the problem doesn't provide enough details to answer the question. An important detail that's missing is the <em>prevalence</em> of the disease in the population; that is, the value of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f85da048bad405b6f81778501db99b331b689d46.svg" style="height: 18px;" type="image/svg+xml">P(D)</object> without being conditioned on anything. Let's say that it's a moderately common disease with 2% prevalence.</p> <p>To solve this without any clever probability formulae, we can resort to the basic technique of counting by cases. Let's assume there is a sample of 10,000 people <a class="footnote-reference" href="#id4" id="id1"></a>; test aside, how many of them have the disease? 2%, so 200.</p> <img alt="Bayes counting disease calculation prevalence" class="align-center" src="https://eli.thegreenplace.net/images/2018/bayes-count-disease-1.png" /> <p>Of the people who have the disease, 90% will test positive and 10% will test negative. Similarly, of the people with no disease, 90% will test negative and 10% will test positive. Graphically:</p> <img alt="Bayes counting disease calculation prevalence and test" class="align-center" src="https://eli.thegreenplace.net/images/2018/bayes-count-disease-2.png" /> <p>Now we just have to count. There are 980 + 180 = 1160 people who tested positive in the sample population. Of these people, 180 have the disease. In other words, given that Bob is in the &quot;tested positive&quot; population, his chance of having the disease is 180/1160 = 15.5%.
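The counting argument translates directly into a few lines of code; here's a sketch using the numbers from the example (10,000 people, 2% prevalence, 90% accuracy):

```python
# Counting by cases: 10,000 people, 2% prevalence, 90% accurate test.
population = 10_000
sick = int(population * 0.02)         # 200 people have the disease
healthy = population - sick           # 9800 don't

true_positives = int(sick * 0.9)      # 180 sick people test positive
false_positives = int(healthy * 0.1)  # 980 healthy people test positive

# Of everyone who tested positive, what fraction is actually sick?
p_sick_given_positive = true_positives / (true_positives + false_positives)
print(f"{p_sick_given_positive:.1%}")  # 15.5%
```
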
This is <em>far</em> lower than the 90% test accuracy; conditional probability often produces surprising results. To motivate this, consider that the number of <em>true positives</em> (people with the disease that tested positive) is 180, while the number of <em>false positives</em> (people w/o the disease that tested positive) is 980. So the chance of being in the second group is larger.</p> <div class="section" id="conditional-probability"> <h2>Conditional probability</h2> <p>As the examples shown above demonstrate, conditional probabilities involve questions like &quot;what's the chance of A happening, given that B happened&quot;, and they are far from being intuitive. Luckily, the mathematical theory of probability gives us the precise and rigorous tools necessary to reason about such problems with relative elegance.</p> <p>The conditional probability <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c8937022ac4e55a642f3bd850e6e9b17dd8fc8d3.svg" style="height: 18px;" type="image/svg+xml">P(A|B)</object> means &quot;what is the probability of event A given that we know event B occurred&quot;. Its mathematical definition is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/990efadb78ac3b3145842995f70010710da00dc2.svg" style="height: 42px;" type="image/svg+xml"> $P(A|B)=\frac{P(A\cap B)}{P(B)}$</object> <p>Notes:</p> <ul class="simple"> <li>Obviously, this is only defined when <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/89a84f08607a9c7fe798b67b1d6f7778c6b2e366.svg" style="height: 18px;" type="image/svg+xml">P(B)&gt;0</object>.</li> <li>Here <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/766349d42cbd0bbefc0311ba2e67a3c1da93625f.svg" style="height: 18px;" type="image/svg+xml">P(A\cap B)</object> is the probability that both A and B occurred.</li> </ul> <p>The first time you look at it, the definition of conditional probability looks somewhat unintuitive. 
Why is the connection made this way? Here's a visualization that I found useful:</p> <img alt="Sample space dots visualization for conditional probability" class="align-center" src="https://eli.thegreenplace.net/images/2018/samplespace.png" /> <p>The dots in the black square represent the &quot;universe&quot;, our whole sampling space (let's call it S, and then <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2db3db98100616af986bd71ff1b3b779df968b9f.svg" style="height: 18px;" type="image/svg+xml">P(S)=1</object>). A and B are events. Here <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/3a0b4d21054323cff53c2226dfc210377dbf4588.svg" style="height: 22px;" type="image/svg+xml">P(A)=\frac{30}{64}</object> and <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/424554d20e0a324069ca970290cb1f937eacb2f5.svg" style="height: 22px;" type="image/svg+xml">P(B)=\frac{18}{64}</object>. But what is <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c8937022ac4e55a642f3bd850e6e9b17dd8fc8d3.svg" style="height: 18px;" type="image/svg+xml">P(A|B)</object>? Let's figure it out graphically. We know that the outcome is one of the dots encircled in red. What is the chance we got a dot also encircled in blue? It's the number of dots that are both red and blue, divided by the total number of dots in red. 
Probabilities are calculated as these counts normalized by the size of the whole sample space; all the numbers are divided by 64, so these denominators cancel out; we'll have:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/4d783b6592e32a7119193d0bad39843877f398e9.svg" style="height: 42px;" type="image/svg+xml"> $P(A|B)=\frac{P(A\cap B)}{P(B)} = \frac{9}{18} = \frac{1}{2}$</object> <p>In words - the probability that A happened, given that B happened, is 1/2, which makes sense when you eyeball the diagram, and assuming events are uniformly distributed (that is, no dot is inherently more likely to be the outcome than any other dot).</p> <p>Another explanation that always made sense to me was to multiply both sides of the definition of conditional probability by the denominator, to get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/942d1e523c4b088a86138644ceef8512fd97e877.svg" style="height: 18px;" type="image/svg+xml"> $P(A|B)P(B)=P(A\cap B)$</object> <p>In words: we know the chance that A happens given B; if we multiply this by the chance that B happens, we get the chance both A and B happened.</p> <p>Finally, since <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/526645ebc23e442dd422c3ee7956b3503578d512.svg" style="height: 18px;" type="image/svg+xml">P(A\cap B)=P(B\cap A)</object>, we can freely exchange A and B in these definitions (they're arbitrary labels, after all), to get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/d748581a273f112cf0af1add7fc7320283f0b226.svg" style="height: 18px;" type="image/svg+xml"> $\begin{equation} P(A\cap B)=P(A|B)P(B)=P(B|A)P(A) \tag{1} \end{equation}$</object> <p>This is an important equation we'll use later on.</p> </div> <div class="section" id="independence-of-events"> <h2>Independence of events</h2> <p>By definition, two events A and B are <em>independent</em> if:</p> <object class="align-center" 
data="https://eli.thegreenplace.net/images/math/39d4d03d68cb696d28d659e5ba0d3c7c1b474a44.svg" style="height: 18px;" type="image/svg+xml"> $P(A\cap B)=P(A)P(B)$</object> <p>Using conditional probability, we can provide a slightly different definition. Since:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/990efadb78ac3b3145842995f70010710da00dc2.svg" style="height: 42px;" type="image/svg+xml"> $P(A|B)=\frac{P(A\cap B)}{P(B)}$</object> <p>And <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/79ab089f069fc76a74136c5331d4655c256b1129.svg" style="height: 18px;" type="image/svg+xml">P(A\cap B)=P(A)P(B)</object>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/14ce0bcf839194d2662ca7b52f59521a39c8cabc.svg" style="height: 42px;" type="image/svg+xml"> $P(A|B)=\frac{P(A)P(B)}{P(B)}=P(A)$</object> <p>As long as <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/89a84f08607a9c7fe798b67b1d6f7778c6b2e366.svg" style="height: 18px;" type="image/svg+xml">P(B)&gt;0</object>, for independent A and B we have <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c399868e8ef09ea7c4e694fce59aad863775bdb0.svg" style="height: 18px;" type="image/svg+xml">P(A|B)=P(A)</object>; in words - B doesn't affect the probability of A in any way. Similarly we can show that for <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ee817c4528261210d55c260023c288b451672b8c.svg" style="height: 18px;" type="image/svg+xml">P(A)&gt;0</object> we have <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/e86a7ac5a9f6584ffbc2ede7f26664b7bc9aeead.svg" style="height: 18px;" type="image/svg+xml">P(B|A)=P(B)</object>.</p> <p>Independence also extends to the complements of events. 
Recall that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/d90bd1cf9cdf80fe34935388ba98fc83fced8881.svg" style="height: 19px;" type="image/svg+xml">P(B^C)</object> is the probability that B <em>did not</em> occur, or <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ace42eb9c0d264d434a445f8efa82a392665ed7d.svg" style="height: 18px;" type="image/svg+xml">1-P(B)</object>; since conditional probabilities obey the usual probability axioms, we have: <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/58060444dee52056dd38f0178bc90d973cb4de6e.svg" style="height: 19px;" type="image/svg+xml">P(B^C|A)=1-P(B|A)</object>. Then, if A and B are independent:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/2fd0e5ba3a9f72858b5289c505ee4ffa1a433047.svg" style="height: 21px;" type="image/svg+xml"> $P(B^C|A)=1-P(B)=P(B^C)$</object> <p>Therefore, <object class="valign-0" data="https://eli.thegreenplace.net/images/math/5ff671c7bd1273544cca53c173582f98ff8a099d.svg" style="height: 15px;" type="image/svg+xml">B^C</object> is independent of A. 
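As a quick numeric sanity check of this result, here's a sketch with two hypothetical independent events (the probabilities 0.5 and 0.4 are arbitrary values picked for illustration):

```python
# Two hypothetical independent events: P(A) = 0.5, P(B) = 0.4.
p_a, p_b = 0.5, 0.4
p_a_and_b = p_a * p_b  # independence: P(A ∩ B) = P(A)P(B)

# P(B^C | A) = P(A ∩ B^C) / P(A) = (P(A) - P(A ∩ B)) / P(A)
p_not_b_given_a = (p_a - p_a_and_b) / p_a

# This equals P(B^C) = 1 - P(B), so B's complement is independent of A.
print(p_not_b_given_a, 1 - p_b)  # both are 0.6
```
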
Similarly the complement of A is independent of B, and the two complements are independent of each other.</p> </div> <div class="section" id="bayes-theorem"> <h2>Bayes' theorem</h2> <p>Starting with equation (1) from above:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/4682360529fab726af4caf0827dbb6edd5f9d902.svg" style="height: 18px;" type="image/svg+xml"> $P(A\cap B)=P(A|B)P(B)=P(B|A)P(A)$</object> <p>And taking the right-hand-side equality and dividing it by <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/d3ecb6c1c04b6ea74af4cacdcb3f1e1bead3b66e.svg" style="height: 18px;" type="image/svg+xml">P(B)</object> (which is positive, per definition), we get Bayes' theorem:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/2e91b38c9f5e26500ede19152eb669f561d9c870.svg" style="height: 42px;" type="image/svg+xml"> $P(A|B)=\frac{P(B|A)P(A)}{P(B)}$</object> <p>This is an extremely useful result, because it links <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9aa002063e1297678f0283b9e7339a73d8a7f6f6.svg" style="height: 18px;" type="image/svg+xml">P(B|A)</object> with <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c8937022ac4e55a642f3bd850e6e9b17dd8fc8d3.svg" style="height: 18px;" type="image/svg+xml">P(A|B)</object>. Recall the disease test example, where we're looking for <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/28f810c7c292ba7faa5b47a0b4e0d470f79b19d8.svg" style="height: 18px;" type="image/svg+xml">P(D|T)</object>.
We can use Bayes' theorem:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7821df244a23990c03bab3a5ad8266797d2b7c4a.svg" style="height: 42px;" type="image/svg+xml"> $P(D|T)=\frac{P(T|D)P(D)}{P(T)}$</object> <p>We know <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/7455e3a5dfb3c45f4a3b110a7906341362a50e53.svg" style="height: 18px;" type="image/svg+xml">P(T|D)</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f85da048bad405b6f81778501db99b331b689d46.svg" style="height: 18px;" type="image/svg+xml">P(D)</object>, but what is <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9d3d6e3b10a97d19adafbea8cc72b8e3619a1d27.svg" style="height: 18px;" type="image/svg+xml">P(T)</object>? You may be tempted to say it's 1 because &quot;well, <em>we know the test is positive</em>&quot; but that would be a mistake. To understand why, we have to dig a bit deeper into the meanings of conditional vs. unconditional probabilities.</p> </div> <div class="section" id="prior-and-posterior-probabilities"> <h2>Prior and posterior probabilities</h2> <p>Fundamentally, conditional probability helps us address the following question:</p> <blockquote> How do we update our beliefs in light of new data?</blockquote> <p><em>Prior</em> probability is our beliefs (probabilities assigned to events) before we see the new data. <em>Posterior</em> probability is our beliefs after we see the new data. In the Bayes equation, prior probabilities are simply the un-conditioned ones, while posterior probabilities are conditional.
This leads to a key distinction:</p> <ul class="simple"> <li><object class="valign-m4" data="https://eli.thegreenplace.net/images/math/7455e3a5dfb3c45f4a3b110a7906341362a50e53.svg" style="height: 18px;" type="image/svg+xml">P(T|D)</object>: posterior probability of the test being positive when we have new data about the person - they have the disease.</li> <li><object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9d3d6e3b10a97d19adafbea8cc72b8e3619a1d27.svg" style="height: 18px;" type="image/svg+xml">P(T)</object>: prior probability of the test being positive before we know anything about the person.</li> </ul> <p>This should make it clearer why we can't just assign <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/00c3bafd4576764728524129480194f205402208.svg" style="height: 18px;" type="image/svg+xml">P(T)=1</object>. Instead, recall the &quot;counting by cases&quot; exercise we did in the first example, where we produced a tree of all possibilities; let's formalize it.</p> </div> <div class="section" id="law-of-total-probability"> <h2>Law of Total Probability</h2> <p>Suppose we have the sample space S and some event B. 
Sometimes it's easier to find the probability of B by first partitioning the space into disjoint pieces:</p> <img alt="Sample space dots visualization for conditional probability" class="align-center" src="https://eli.thegreenplace.net/images/2018/spacepartition.png" /> <p>Then, because the events <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/5aa3f2ac5ea9b6b96e13e2bd945ab77b2cce164a.svg" style="height: 15px;" type="image/svg+xml">A_n</object> are disjoint, we get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/059517fb7fbb92cfe88968642a0ef5cebcf57f70.svg" style="height: 18px;" type="image/svg+xml"> $P(B)=P(B\cap A_1)+P(B\cap A_2)+P(B\cap A_3)+P(B\cap A_4)$</object> <p>Or, using equation (1):</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/615b1c4ed0b1b0e18f48953ed54e1026aaf8337a.svg" style="height: 18px;" type="image/svg+xml"> $P(B)=P(B|A_1)P(A_1)+P(B|A_2)P(A_2)+P(B|A_3)P(A_3)+P(B|A_4)P(A_4)$</object> </div> <div class="section" id="bayesian-solution-to-the-disease-test-example"> <h2>Bayesian solution to the disease test example</h2> <p>Now we have everything we need to provide a Bayesian solution to the disease test example. Recall that we already know:</p> <ul class="simple"> <li><object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c772c4b7636ec5638c2c0058c93526d105f8b659.svg" style="height: 18px;" type="image/svg+xml">P(T|D)=0.9</object>: test accuracy</li> <li><object class="valign-m4" data="https://eli.thegreenplace.net/images/math/6445317e3d0d9a55e4669fd041aa16e8d203ec32.svg" style="height: 18px;" type="image/svg+xml">P(D)=0.02</object>: disease prevalence in the population</li> </ul> <p>Now we want to compute <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9d3d6e3b10a97d19adafbea8cc72b8e3619a1d27.svg" style="height: 18px;" type="image/svg+xml">P(T)</object>.
We'll use the law of total probability, with the space partitioning of &quot;has disease&quot; / &quot;does not have disease&quot;:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/5474419ab67a489dc7192dce2940d5f09f25352e.svg" style="height: 21px;" type="image/svg+xml"> $P(T)=P(T|D)P(D)+P(T|D^C)P(D^C)=0.9\ast 0.02+0.1\ast 0.98=0.116$</object> <p>Finally, plugging everything into Bayes' theorem:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/05d2f7a8e457de18576450dd426ee0e3d54e2e28.svg" style="height: 206px;" type="image/svg+xml"> \begin{align*} P(D|T)&amp;=\frac{P(T|D)P(D)}{P(T)}\\ &amp;=\frac{P(T|D)P(D)}{0.116}\\ &amp;=\frac{0.9\ast 0.02}{0.116}=0.155 \end{align*}</object> <p>Which is the same result we got while working through possibilities in the example.</p> </div> <div class="section" id="conditioning-on-multiple-events"> <h2>Conditioning on multiple events</h2> <p>We've just computed <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/28f810c7c292ba7faa5b47a0b4e0d470f79b19d8.svg" style="height: 18px;" type="image/svg+xml">P(D|T)</object> - the conditional probability of event D (patient has disease) on event T (patient tested positive). An important extension of this technique is being able to reason about multiple tests, and how they affect the conditional probability.
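Before tackling multiple tests, the single-test computation above is worth sketching in code (the probabilities are the example's numbers: 90% sensitivity and specificity, 2% prevalence):

```python
# Single-test Bayes computation, with the example's numbers.
p_t_given_d = 0.9      # sensitivity: P(T|D)
p_t_given_not_d = 0.1  # false positive rate: P(T|D^C)
p_d = 0.02             # prevalence: P(D)

# Law of total probability: P(T) = P(T|D)P(D) + P(T|D^C)P(D^C)
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

# Bayes' theorem: P(D|T) = P(T|D)P(D) / P(T)
p_d_given_t = p_t_given_d * p_d / p_t
print(round(p_t, 3), round(p_d_given_t, 3))  # 0.116 0.155
```
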
We'll want to compute <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/6caceb5e79d179c9bc5ee1f30bc94a2cc5a43f09.svg" style="height: 18px;" type="image/svg+xml">P(D|T_1\cap T_2)</object> where <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2885fa41d340ab94bb0451308cf01996f1916011.svg" style="height: 16px;" type="image/svg+xml">T_1</object> and <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/f725afdfa00dd57660feb233ef8547c9985c924e.svg" style="height: 15px;" type="image/svg+xml">T_2</object> are two events for different tests.</p> <p>Let's assume <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2885fa41d340ab94bb0451308cf01996f1916011.svg" style="height: 16px;" type="image/svg+xml">T_1</object> is our original test. <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/f725afdfa00dd57660feb233ef8547c9985c924e.svg" style="height: 15px;" type="image/svg+xml">T_2</object> is a slightly different test that's only 80% accurate. Importantly, the tests are <em>independent</em> (they test completely different things) <a class="footnote-reference" href="#id5" id="id2"></a>.</p> <p>We'll start with a naive approach that seems reasonable. For <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2885fa41d340ab94bb0451308cf01996f1916011.svg" style="height: 16px;" type="image/svg+xml">T_1</object>, we already know that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/bf10cd55e38ac2b90c6a58763b5c7207e21112ac.svg" style="height: 18px;" type="image/svg+xml">P(D|T_1)=0.155</object>. 
For <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/f725afdfa00dd57660feb233ef8547c9985c924e.svg" style="height: 15px;" type="image/svg+xml">T_2</object>, it's similarly simple to compute:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/0adb7aa21363b77105139ba87a7a6a3b1bb18c1f.svg" style="height: 42px;" type="image/svg+xml"> $P(D|T_2)=\frac{P(T_2|D)P(D)}{P(T_2)}$</object> <p>The disease prevalence is still 2%, and using the law of total probability we get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/499e223575d2ca5776c3d7493ec72a7034b68ef8.svg" style="height: 21px;" type="image/svg+xml"> $P(T_2)=P(T_2|D)P(D)+P(T_2|D^C)P(D^C)=0.8\ast 0.02+0.2\ast 0.98=0.212$</object> <p>Therefore:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/5d76ab3755acc699caa4dd080dad975b020aeab9.svg" style="height: 42px;" type="image/svg+xml"> $P(D|T_2)=\frac{P(T_2|D)P(D)}{P(T_2)}=\frac{0.8\ast 0.02}{0.212}=0.075$</object> <p>In other words, if a person tests positive with the second test, the chance of being sick is only 7.5%. But what if they tested positive for both tests?</p> <p>Well, since the tests are independent we can do the usual probability trick of combining the complements. We'll compute the probability the person is <em>not</em> sick given positive tests, and then compute the complement of that. <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/34f3f1a1d7cf4f9bb761f5d30e30c73ae74a0d88.svg" style="height: 19px;" type="image/svg+xml">P(D^C|T_1)=1-0.155=0.845</object>, and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/bc939034f01f15365f691b44910bc5671eb66f17.svg" style="height: 19px;" type="image/svg+xml">P(D^C|T_2)=1-0.075=0.925</object>. 
Therefore:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/4f6a0d5b400338982d6fccd353263ad6bf1129c6.svg" style="height: 21px;" type="image/svg+xml"> $P(D^C|T_1\cap T_2)=P(D^C|T_1)P(D^C|T_2)=0.845\ast 0.925=0.782$</object> <p>And complementing again, we get <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/fc78afd8f0ce0dc8a278906a52710401645f0f55.svg" style="height: 18px;" type="image/svg+xml">P(D|T_1\cap T_2)=1-0.782=0.218</object>. The chance of being sick, having tested positive both times is 21.8%.</p> <p>Unfortunately, this computation is wrong, <em>very</em> wrong. Can you spot why before reading on?</p> <p>We've committed a fairly common blunder in conditional probabilities. Given the independence of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/078820d4711ac1ab1b075b3b3e452a97424174c8.svg" style="height: 18px;" type="image/svg+xml">P(T_1|D)</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/8934e9ba84564c2e86c8223fa4edf8ecf142349a.svg" style="height: 18px;" type="image/svg+xml">P(T_2|D)</object>, we've assumed the independence of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9e7f4ee884ceb05aa1e7b2dd2453698971ca6689.svg" style="height: 18px;" type="image/svg+xml">P(D|T_1)</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/7e3df62da0b4a6aaf0fb78afd834a8d9842e463a.svg" style="height: 18px;" type="image/svg+xml">P(D|T_2)</object>, but this is wrong! It's even easy to see why, given our concrete example. Both of them have <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f85da048bad405b6f81778501db99b331b689d46.svg" style="height: 18px;" type="image/svg+xml">P(D)</object> - the disease prevalence - in the numerator. 
Changing the prevalence will change both <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9e7f4ee884ceb05aa1e7b2dd2453698971ca6689.svg" style="height: 18px;" type="image/svg+xml">P(D|T_1)</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/7e3df62da0b4a6aaf0fb78afd834a8d9842e463a.svg" style="height: 18px;" type="image/svg+xml">P(D|T_2)</object> in exactly the same proportion; say, increasing the prevalence 2x will increase both probabilities 2x. They're pretty strongly dependent!</p> <p>The right way of finding <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/6caceb5e79d179c9bc5ee1f30bc94a2cc5a43f09.svg" style="height: 18px;" type="image/svg+xml">P(D|T_1\cap T_2)</object> is working from first principles. <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/7bf9b64bece7c786508608f42ccb59b54c058a2c.svg" style="height: 16px;" type="image/svg+xml">T_1\cap T_2</object> is just another event, so treating it as such and using Bayes theorem we get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/490d067f11ad71b9824cac46864577eaeaed6e22.svg" style="height: 42px;" type="image/svg+xml"> $P(D|T_1\cap T_2)=\frac{P(T_1\cap T_2|D)P(D)}{P(T_1\cap T_2)}$</object> <p>Here <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f85da048bad405b6f81778501db99b331b689d46.svg" style="height: 18px;" type="image/svg+xml">P(D)</object> is still 0.02; <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/234e83d1410ef9137bc8dc052b7ea2e038a81337.svg" style="height: 18px;" type="image/svg+xml">P(T_1\cap T_2|D)=0.9\ast0.8=0.72</object>. 
To compute the denominator we'll use the law of total probability again:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/97bb82b17c9ba327a72f472114e7ffe0aa8ed6a3.svg" style="height: 21px;" type="image/svg+xml"> $P(T_1\cap T_2)=P(T_1\cap T_2|D)P(D)+P(T_1\cap T_2|D^C)P(D^C)=0.72\ast 0.02+0.1\ast 0.2\ast 0.98=0.034$</object> <p>Combining them all together we'll get <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f109539a4f087faa32c6c9bf4006ce9bf5318979.svg" style="height: 18px;" type="image/svg+xml">P(D|T_1\cap T_2)=0.42</object>; the chance of being sick, given two positive tests, is 42%, which is twice as high as our erroneous estimate <a class="footnote-reference" href="#id6" id="id3"></a>.</p> </div> <div class="section" id="bayes-theorem-with-conditioning"> <h2>Bayes' theorem with conditioning</h2> <p>Since conditional probabilities satisfy all probability axioms, many theorems remain true when adding a condition. Here's Bayes' theorem with extra conditioning on event C:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/66add8a0c5c81776826cb9a58662e30e118c71b9.svg" style="height: 42px;" type="image/svg+xml"> $P(A|B\cap C)=\frac{P(B|A\cap C)P(A|C)}{P(B|C)}$</object> <p>In other words, the connection between <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c8937022ac4e55a642f3bd850e6e9b17dd8fc8d3.svg" style="height: 18px;" type="image/svg+xml">P(A|B)</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9aa002063e1297678f0283b9e7339a73d8a7f6f6.svg" style="height: 18px;" type="image/svg+xml">P(B|A)</object> is true even when everything is conditioned on some event C.
To prove it, we can take both sides and expand the definitions of conditional probability until we reach something trivially true:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/28d63e34558ac9a513785738a4eb2c0ce82668c2.svg" style="height: 138px;" type="image/svg+xml"> \begin{align*} P(A|B\cap C)&amp;=\frac{P(B|A\cap C)P(A|C)}{P(B|C)}\\ \frac{P(A\cap B\cap C)}{P(B\cap C)}&amp;=\frac{P(A\cap B\cap C)P(A|C)}{P(A\cap C)P(B|C)}\\ \frac{P(A\cap B\cap C)}{P(B\cap C)}&amp;=\frac{P(A\cap B\cap C)P(A\cap C)}{P(A\cap C)P(B|C)P(C)}\\ \end{align*}</object> <p>Assuming that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/7784945e846c26d7c35f5a323dfd18c8d359a001.svg" style="height: 18px;" type="image/svg+xml">P(A\cap C)&gt;0</object>, it cancels out (similarly for <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/fd0e822b920e5f9312922a7136ac99a01db7d44e.svg" style="height: 18px;" type="image/svg+xml">P(C)&gt;0</object> in a later step):</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/4d007d49999b1286d02810151a6ca8cf8e945bdf.svg" style="height: 138px;" type="image/svg+xml"> \begin{align*} \frac{P(A\cap B\cap C)}{P(B\cap C)}&amp;=\frac{P(A\cap B\cap C)}{P(B|C)P(C)}\\ \frac{P(A\cap B\cap C)}{P(B\cap C)}&amp;=\frac{P(A\cap B\cap C)P(C)}{P(B\cap C)P(C)}\\ \frac{P(A\cap B\cap C)}{P(B\cap C)}&amp;=\frac{P(A\cap B\cap C)}{P(B\cap C)} \end{align*}</object> <p><em>Q.E.D.</em></p> <p>Using this new result, we can compute our two-test disease exercise in another way. Let's say that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2885fa41d340ab94bb0451308cf01996f1916011.svg" style="height: 16px;" type="image/svg+xml">T_1</object> happens first, and we've already computed <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9e7f4ee884ceb05aa1e7b2dd2453698971ca6689.svg" style="height: 18px;" type="image/svg+xml">P(D|T_1)</object>. 
We can now treat this as the new <em>prior</em> data, and find <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/6caceb5e79d179c9bc5ee1f30bc94a2cc5a43f09.svg" style="height: 18px;" type="image/svg+xml">P(D|T_1\cap T_2)</object> based on the new evidence that <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/f725afdfa00dd57660feb233ef8547c9985c924e.svg" style="height: 15px;" type="image/svg+xml">T_2</object> happened. We'll use the conditioned Bayes formulation with <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2885fa41d340ab94bb0451308cf01996f1916011.svg" style="height: 16px;" type="image/svg+xml">T_1</object> being C.</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/a06ff1feb6ddc4f774eeab3db36f24bc6f8ebfb1.svg" style="height: 42px;" type="image/svg+xml"> $P(D|T_2\cap T_1)=\frac{P(T_2|D\cap T_1)P(D|T_1)}{P(T_2|T_1)}$</object> <p>We already know that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9e7f4ee884ceb05aa1e7b2dd2453698971ca6689.svg" style="height: 18px;" type="image/svg+xml">P(D|T_1)</object> is 0.155; What about <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/55700383bdf6e67d7fc2a7e98d9615982e36e546.svg" style="height: 18px;" type="image/svg+xml">P(T_2|D\cap T_1)</object>? Since the tests are independent, this is actually equivalent to <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/8934e9ba84564c2e86c8223fa4edf8ecf142349a.svg" style="height: 18px;" type="image/svg+xml">P(T_2|D)</object>, which is 0.8. 
The denominator requires a bit more careful computation:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/34ba42765c1ab9f7ffd49867707bd9cc3091f1cd.svg" style="height: 42px;" type="image/svg+xml"> $P(T_2|T_1)=\frac{P(T_1\cap T_2)}{P(T_1)}$</object> <p>We've already found <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/a0674ed54736ae0ea920485197320e0e23abd85a.svg" style="height: 18px;" type="image/svg+xml">P(T_1)=0.116</object> previously, using the law of total probability. Using the same law:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/8347d5e40753c64fbc86668b7b53321fa00bab65.svg" style="height: 21px;" type="image/svg+xml"> $P(T_1\cap T_2)=P(T_1\cap T_2|D)P(D)+P(T_1\cap T_2|D^C)P(D^C)=0.9\ast 0.8\ast 0.02+0.1\ast 0.2\ast 0.98=0.034$</object> <p>Therefore, <object class="valign-m7" data="https://eli.thegreenplace.net/images/math/8853f1766ae90cf8da81a56ab3fcc173baa21878.svg" style="height: 23px;" type="image/svg+xml">P(T_2|T_1)=\frac{0.034}{0.116}=0.293</object> and we now have all the ingredients:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7bd0854b518fca8ea6c42551abd0b032ce5e1708.svg" style="height: 37px;" type="image/svg+xml"> $P(D|T_2\cap T_1)=\frac{0.8\ast 0.155}{0.293}=0.42$</object> <p>We've reached the same result using two different approaches, which is reassuring. Computing with both tests taken together is a bit quicker, but taking one test at a time is also useful because it lets us <em>update our beliefs</em> over time, given new data.</p> <p>Computing conditional probabilities w.r.t.
multiple parameters is very useful in machine learning - this would be a good topic for a separate article.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id4" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>The actual number of people is arbitrary - it could be anything else; in the formulae it cancels out anyway. I picked 10,000 because it's a nice number ending with a bunch of zeros and won't produce fractional people for this particular example.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id5" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td><p class="first">You may be suspicious of this assumption - how can two tests for the same disease be independent? Being suspicious about probability independence assumptions is a good idea in general, but here the assumption is reasonable.</p> <p class="last">Note that we assume independence given D; in other words, that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/078820d4711ac1ab1b075b3b3e452a97424174c8.svg" style="height: 18px;" type="image/svg+xml">P(T_1|D)</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/8934e9ba84564c2e86c8223fa4edf8ecf142349a.svg" style="height: 18px;" type="image/svg+xml">P(T_2|D)</object> are independent. We know the person is sick, and we know that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2885fa41d340ab94bb0451308cf01996f1916011.svg" style="height: 16px;" type="image/svg+xml">T_1</object> turned positive - does this affect <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/f725afdfa00dd57660feb233ef8547c9985c924e.svg" style="height: 15px;" type="image/svg+xml">T_2</object>?
That depends on the test; some tests definitely test related things, but some may test unrelated things (say the first looks for a particular by-product of sick cells while the second looks for a gene that is known to be correlated with disease prevalence). It's possible to find plausible connections between almost anything though, so all independence assumptions are &quot;best-effort&quot;.</p> </td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id6" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>My intuition for understanding why it's higher is that there's a tug of war between the test accuracy and the prevalence (the lower the prevalence, the higher the test accuracy has to be to produce reasonable predictive value). But when we recompute with two tests, we still use prevalence just once in the formula, so the two tests combine forces against it.</td></tr> </tbody> </table> </div> Computing remainders by doubling2018-02-12T05:40:00-08:002018-02-12T05:40:00-08:00Eli Benderskytag:eli.thegreenplace.net,2018-02-12:/2018/computing-remainders-by-doubling/<p>I'm going through Stepanov and Rose's <em>From Mathematics to Generic Programming</em>, and on page 48 they present a fast algorithm for computing remainders without using either division or multiplication. Unfortunately, there's not much in terms of proof <a class="footnote-reference" href="#id3" id="id1"></a>, so this post is to document my understanding of the algorithm.</p> <p>The …</p><p>I'm going through Stepanov and Rose's <em>From Mathematics to Generic Programming</em>, and on page 48 they present a fast algorithm for computing remainders without using either division or multiplication.
Unfortunately, there's not much in terms of proof <a class="footnote-reference" href="#id3" id="id1"></a>, so this post is to document my understanding of the algorithm.</p> <p>The algorithm relies on the following lemma: For <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/359f68d4e18136986adfdd925d71beba6906d6f2.svg" style="height: 17px;" type="image/svg+xml">a,b\in\mathbb{N}</object>, given <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/cea4102e00c379c659bd1ac0ec828b107ef8e191.svg" style="height: 18px;" type="image/svg+xml">t=remainder(a,2b)</object>, we have:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/cb7a84e3fdaea31591f54385a69365a074dbc16e.svg" style="height: 43px;" type="image/svg+xml"> $remainder(a,b)=\left\{\begin{matrix} t &amp; t &lt; b\\ t-b &amp; t \geq b \end{matrix}\right.$</object> <p>To prove this, consider the standard quotient-and-remainder representation relating <em>a</em> and <em>b</em>: <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/e467ce1c3e2bed6e3c59d96fb444ec0f18012b2b.svg" style="height: 17px;" type="image/svg+xml">a=qb+r</object>, with <em>q</em> the quotient and <em>r</em> the remainder. <em>q</em> can be either even or odd.
If it's even, we can say that there exists <object class="valign-m1" data="https://eli.thegreenplace.net/images/math/fcd7fa67ec4934ffd698ba002f505cf4cb93cb4f.svg" style="height: 14px;" type="image/svg+xml">k\in\mathbb{N}</object> such that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/70e471804d82f9d0c2b34ebae8386daaf4b7c163.svg" style="height: 17px;" type="image/svg+xml">q=2k</object>, so:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7b35830ec6cf002f68272e4c031182d2b69d98e9.svg" style="height: 15px;" type="image/svg+xml"> $a=2kb+r$</object> <p>In this case, the remainder of <em>a</em> divided by <em>2b</em> is trivially <em>r</em> (the same as the remainder of dividing by <em>b</em>). If <em>q</em> is odd, we can say that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/b832c21a1f7af5db87bdd07c9881f58078a6ca77.svg" style="height: 17px;" type="image/svg+xml">q=2k+1</object>, so:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/d4d371e1aa19719ea652108e2443ce6b4684aa2a.svg" style="height: 18px;" type="image/svg+xml"> $a=(2k+1)b+r=2kb+b+r$</object> <p>In this case, the remainder of <em>a</em> divided by <em>2b</em> is <em>b+r</em>. Now it's obvious why the lemma is true. Without explicitly distinguishing <em>q</em> as even or odd, it just examines the remainder of <em>a</em> divided by <em>2b</em>. If this remainder is smaller than <em>b</em>, then that's also the remainder of dividing by <em>b</em> because <em>q</em> must be even.
On the other hand, if the remainder is larger than <em>b</em>, <em>q</em> must be odd and we have <em>b+r</em> as the remainder, in which case we subtract <em>b</em> to get to <em>r</em>.</p> <p>Now, the algorithm itself, as Python code <a class="footnote-reference" href="#id4" id="id2"></a>:</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">fast_remainder</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span> <span class="k">if</span> <span class="n">a</span> <span class="o">&lt;</span> <span class="n">b</span><span class="p">:</span> <span class="k">return</span> <span class="n">a</span> <span class="k">if</span> <span class="n">a</span> <span class="o">-</span> <span class="n">b</span> <span class="o">&lt;</span> <span class="n">b</span><span class="p">:</span> <span class="k">return</span> <span class="n">a</span> <span class="o">-</span> <span class="n">b</span> <span class="n">r</span> <span class="o">=</span> <span class="n">fast_remainder</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">+</span> <span class="n">b</span><span class="p">)</span> <span class="k">if</span> <span class="n">r</span> <span class="o">&lt;</span> <span class="n">b</span><span class="p">:</span> <span class="k">return</span> <span class="n">r</span> <span class="k">return</span> <span class="n">r</span> <span class="o">-</span> <span class="n">b</span> </pre></div> <p>It starts by covering base cases of <em>a</em> being up to <em>2b</em>. Then it recurses to find the remainder of <em>a</em> divided by <em>2b</em>. This is a curious recursive pattern, as the parameters grow rather than shrink! 
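</p>

<p>Before proving termination, here's a quick empirical check that the function indeed computes remainders, comparing it against Python's built-in <tt class="docutils literal">%</tt> operator (the definition from above is repeated so the snippet runs standalone):</p>

```python
# fast_remainder as defined above, checked against Python's % operator.
def fast_remainder(a, b):
    if a < b:
        return a
    if a - b < b:
        return a - b
    # Recurse with a doubled divisor, then apply the lemma to the result.
    r = fast_remainder(a, b + b)
    if r < b:
        return r
    return r - b

for a in range(500):
    for b in range(1, 40):
        assert fast_remainder(a, b) == a % b
print("all remainders match")
```

<p>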
Therefore, it's important to prove that this recursion terminates (if it does, its correctness stems from the lemma).</p> <p>We keep doubling <em>b</em> in every recursive invocation, and the base cases break the recursive cycle once <em>b</em> outgrows <em>a</em>. It will take at most <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/a65d0750d3247b90609ed8fdb790dcf3ac93a463.svg" style="height: 19px;" type="image/svg+xml">\left \lceil log_{2}a\right \rceil</object> steps to reach that point. Therefore, the recursion terminates.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id3" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>Which is a bit disappointing for a book that was written to show the beauty of math to programmers and is full of proofs for other stuff. For this algorithm the authors just mention &quot;It's not obvious where the work is done, but it works&quot; and then provide a single extended example.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id4" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>This is a slightly adapted version of the algorithm, which also works when <em>a</em> is a multiple of <em>b</em>, such that the remainder is 0.</td></tr> </tbody> </table> Affine transformations2018-01-09T05:15:00-08:002018-01-09T05:15:00-08:00Eli Benderskytag:eli.thegreenplace.net,2018-01-09:/2018/affine-transformations/<p>This is a brief article on affine mappings and their relation to linear mappings, with some applications.</p> <div class="section" id="linear-vs-affine"> <h2>Linear vs. 
Affine</h2> <p>To start discussing affine mappings, we have to first address a common confusion around what it means for a function to be linear.</p> <p>According to <a class="reference external" href="https://en.wikipedia.org/wiki/Linear_function">Wikipedia</a> the term <em>linear function …</em></p></div><p>This is a brief article on affine mappings and their relation to linear mappings, with some applications.</p> <div class="section" id="linear-vs-affine"> <h2>Linear vs. Affine</h2> <p>To start discussing affine mappings, we have to first address a common confusion around what it means for a function to be linear.</p> <p>According to <a class="reference external" href="https://en.wikipedia.org/wiki/Linear_function">Wikipedia</a> the term <em>linear function</em> can refer to two distinct concepts, based on the context:</p> <ol class="arabic simple"> <li>In Calculus, a linear function is a polynomial function of degree zero or one; in other words, a function of the form <img alt="f(x)=ax+b" class="valign-m4" src="https://eli.thegreenplace.net/images/math/a85393d5068f5c4bc36ff7efed535a8f1a686848.png" style="height: 18px;" /> for some constants <tt class="docutils literal">a</tt> and <tt class="docutils literal">b</tt>.</li> <li>In Linear Algebra, a linear function is a linear mapping, or linear <em>transformation</em>.</li> </ol> <p>In this article we're going to be using (2) as the definition of <em>linear</em>, and it will soon become obvious why (1) is confusing when talking about transformations. 
To avoid some of the jumble going forward, I'm going to be using the term <em>mapping</em> instead of <em>function</em>, but in linear algebra the two are interchangeable (<em>transformation</em> is another synonym, which I'm going to be making less effort to avoid since it's not as overloaded <a class="footnote-reference" href="#id8" id="id1"></a>).</p> </div> <div class="section" id="linear-transformations"> <h2>Linear transformations</h2> <p>Since we're talking about linear algebra, let's use the domain of vector spaces for the definitions. A transformation (or mapping) <tt class="docutils literal">f</tt> is linear when for any two vectors <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> and <object class="valign-0" data="https://eli.thegreenplace.net/images/math/d45128696127d3ae74860c6f8b14ce6ca20d15e7.svg" style="height: 13px;" type="image/svg+xml">\vec{w}</object> (assuming the vectors are in the same vector space, say <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" />):</p> <ul class="simple"> <li><object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c77fa5b7073e6b81e5b431b6e383a7414858cea0.svg" style="height: 18px;" type="image/svg+xml">f(\vec{v}+\vec{w})=f(\vec{v})+f(\vec{w})</object></li> <li><object class="valign-m4" data="https://eli.thegreenplace.net/images/math/d48c4c3abf0c65851d92030c7f40d799156f5871.svg" style="height: 18px;" type="image/svg+xml">f(k\vec{v})=kf(\vec{v})</object> for any scalar <tt class="docutils literal">k</tt></li> </ul> <p>For example, the mapping <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/6ebc8ee559ec27b734f8f10214bd0a5fd6fc6c54.svg" style="height: 19px;" type="image/svg+xml">f(\vec{v})=\langle 3v_1-4v_2,v_2 \rangle</object> - where <object class="valign-m4"
data="https://eli.thegreenplace.net/images/math/9b12bbf79036cb3e904f971fd86838db1dade1aa.svg" style="height: 12px;" type="image/svg+xml">v_1</object> and <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/2e84f52c0f54659a1f533b25591adb924f2a4131.svg" style="height: 11px;" type="image/svg+xml">v_2</object> are the components of <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> - is linear. The mapping <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/f86fed5746a1646abc0377fbbf9002231177b0fa.svg" style="height: 19px;" type="image/svg+xml">g(\vec{v})=\langle v_2,2v_{1}v_{2} \rangle</object> is <em>not</em> linear.</p> <p>In fact, it can be shown that for the kind of vector spaces we're mostly interested in <a class="footnote-reference" href="#id9" id="id2"></a>, any linear mapping can be represented by a matrix that is multiplied by the input vector. This is because we can represent any vector in terms of the standard basis vectors: <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/dce46a4dda3d1b14590f131161880969b7998cce.svg" style="height: 17px;" type="image/svg+xml">\vec{v}=v_1\vec{e}_1+...+v_n\vec{e}_n</object>. 
Then, since <tt class="docutils literal">f</tt> is linear:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/cac8ee862d974540e4c17cb4f2c4309db0f00193.svg" style="height: 50px;" type="image/svg+xml"> $f(\vec{v})=f(\sum_{i=1}^{n}v_i\vec{e}_i)=\sum_{i=1}^{n}v_if(\vec{e}_i)$</object> <p>If we think of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c37ade9231729dad728ad612e88916fc118f8f24.svg" style="height: 18px;" type="image/svg+xml">f(\vec{e}_i)</object> as column vectors, this is precisely the multiplication of a matrix by <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" />:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/cd0deb0c6714d83a6fca594b4c755d789d34d8a9.svg" style="height: 86px;" type="image/svg+xml"> $f(\vec{v}) = \begin{pmatrix} \mid &amp; \mid &amp; &amp; \mid \\ f(\vec{e}_1) &amp; f(\vec{e}_2) &amp; \cdots &amp; f(\vec{e}_n) \\ \mid &amp; \mid &amp; &amp; \mid \\ \end{pmatrix}\begin{pmatrix} v_1 \\ v_2 \\ ... \\ v_n \end{pmatrix}$</object> <p>This multiplication by a matrix can also be seen as a <em>change of basis</em> for <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> from the standard base to a base defined by <tt class="docutils literal">f</tt>. If you want a refresher on how changes of basis work, take a look at my <a class="reference external" href="http://eli.thegreenplace.net/2015/change-of-basis-in-linear-algebra/">older post on this topic</a>.</p> <p>Let's get back to our earlier example of the mapping <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/6ebc8ee559ec27b734f8f10214bd0a5fd6fc6c54.svg" style="height: 19px;" type="image/svg+xml">f(\vec{v})=\langle 3v_1-4v_2,v_2 \rangle</object>. 
We can represent this mapping with the following matrix:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/f54bea9f647bcffea53eaa9de5831b086cea8987.svg" style="height: 43px;" type="image/svg+xml"> $\begin{pmatrix} 3 &amp; -4 \\ 0 &amp; 1 \end{pmatrix}$</object> <p>Meaning that:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/d5f35dc6342d25f9c4bf13e84986f90a870f3728.svg" style="height: 43px;" type="image/svg+xml"> $f(\vec{v})=\begin{pmatrix} 3 &amp; -4 \\ 0 &amp; 1 \end{pmatrix}\begin{pmatrix} v_1 \\ v_2 \end{pmatrix}$</object> <p>Representing linear mappings this way gives us a number of interesting tools for working with them. For example, the associativity of matrix multiplication means that we can represent compositions of mappings by simply multiplying the mapping matrices together.</p> <p>Consider the following mapping:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/1677a0280bf92fc8725c0a71e5a2705eaabebde8.svg" style="height: 43px;" type="image/svg+xml"> $S=\begin{pmatrix} 2 &amp; 0\\ 0 &amp; 2 \end{pmatrix}$</object> <p>In equational form: <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/3f53317f06530c0cd6af66868f708b09cc719eaa.svg" style="height: 19px;" type="image/svg+xml">S(\vec{v})=\langle 2v_1,2v_2 \rangle</object>. This mapping <em>stretches</em> the input vector 2x in both dimensions. To visualize a mapping, it's useful to examine its effects on some standard vectors. Let's use the vectors <tt class="docutils literal">(0,0)</tt>, <tt class="docutils literal">(0,1)</tt>, <tt class="docutils literal">(1,0)</tt>, <tt class="docutils literal">(1,1)</tt> (the &quot;unit square&quot;). 
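</p>

<p>These matrix representations are easy to experiment with in code. Here's a sketch applying the matrix of <tt class="docutils literal">f</tt> and the stretch <tt class="docutils literal">S</tt> to sample vectors; the <tt class="docutils literal">apply</tt> helper is a hand-rolled 2x2 matrix-by-vector multiplication, and its name is made up for illustration:</p>

```python
# Sketch: apply a 2x2 mapping matrix to a 2-vector by plain multiplication.
def apply(m, v):
    # m is a 2x2 matrix as nested lists; v is a 2-vector (tuple).
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])

F = [[3, -4], [0, 1]]  # the matrix of f(v) = <3*v1 - 4*v2, v2>
S = [[2, 0], [0, 2]]   # the 2x stretch

print(apply(F, (1, 1)))  # (-1, 1): 3*1 - 4*1 = -1, and v2 stays 1
unit_square = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([apply(S, p) for p in unit_square])  # every point doubled
```

<p>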
In <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" /> they represent four points that can be connected together as follows <a class="footnote-reference" href="#id10" id="id3"></a>:</p> <img alt="Unit vectors as points on the plane" class="align-center" src="https://eli.thegreenplace.net/images/2018/points-unit-vectors.png" /> <p>It's easy to see that when transformed with <object class="valign-0" data="https://eli.thegreenplace.net/images/math/02aa629c8b16cd17a44f3a0efec2feed43937642.svg" style="height: 12px;" type="image/svg+xml">S</object>, we'll get:</p> <img alt="Unit vectors transformed with 2x stretch" class="align-center" src="https://eli.thegreenplace.net/images/2018/points-stretch.png" /> <p>It's also well known that rotation (relative to the origin) can be modeled with the following mapping, where <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> is in radians:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/5783a1d0f95db66e5dc6e8499cbb5853d43a2a60.svg" style="height: 43px;" type="image/svg+xml"> $R=\begin{pmatrix} cos\theta &amp; sin\theta \\ -sin\theta &amp; cos\theta \end{pmatrix}$</object> <p>Transforming our unit square with this matrix we get:</p> <img alt="Unit vectors transformed with rotation by one radian" class="align-center" src="https://eli.thegreenplace.net/images/2018/points-rotate.png" /> <p>Finally, let's say we want to combine these transformations. To stretch and then rotate a vector, we would do: <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/40412ae6b5ed1bafeb85baada5ab732975419037.svg" style="height: 18px;" type="image/svg+xml">f(\vec{v})=R(Sv)</object>.
Since matrix multiplication is associative, this can also be rewritten as: <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/fd0a61053256cb3481d5ce29828291712b9b5a8e.svg" style="height: 18px;" type="image/svg+xml">f(\vec{v})=(RS)v</object>. In other words, we can find a matrix <object class="valign-0" data="https://eli.thegreenplace.net/images/math/7b0ecef9a260b7e055cb6c5ab4d53ca3b236a621.svg" style="height: 12px;" type="image/svg+xml">A=RS</object> which represents the combined transformation, and we &quot;find&quot; it by simply multiplying <tt class="docutils literal">R</tt> and <tt class="docutils literal">S</tt> together <a class="footnote-reference" href="#id11" id="id4"></a>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7c7b6d5faacf808b4a6551e3bfbee40bdf0dd158.svg" style="height: 43px;" type="image/svg+xml"> $A=\begin{pmatrix} cos\theta &amp; sin\theta \\ -sin\theta &amp; cos\theta \end{pmatrix}\begin{pmatrix} 2 &amp; 0 \\ 0 &amp; 2 \end{pmatrix}=\begin{pmatrix} 2cos\theta &amp; 2sin\theta \\ -2sin\theta &amp; 2cos\theta \end{pmatrix}$</object> <p>And when we multiply our unit square by this matrix we get:</p> <img alt="Unit vectors transformed with rotation and stretch" class="align-center" src="https://eli.thegreenplace.net/images/2018/points-rotate-and-stretch.png" /> </div> <div class="section" id="id5"> <h2>Affine transformations</h2> <p>Now that we have some good context on linear transformations, it's time to get to the main topic of this post - affine transformations.</p> <p>For an affine space (we'll talk about what this is exactly in a later section), every affine transformation is of the form <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9de6170fb616cd269a8c53f567e94d08b9b813a0.svg" style="height: 18px;" type="image/svg+xml">g(\vec{v})=Av+b</object> where <img alt="A" class="valign-0"
src="https://eli.thegreenplace.net/images/math/6dcd4ce23d88e2ee9568ba546c007c63d9131c1b.png" style="height: 12px;" /> is a matrix representing a linear transformation and <object class="valign-0" data="https://eli.thegreenplace.net/images/math/e9d71f5ee7c92d6dc9e92ffdad17b8bd49418f98.svg" style="height: 13px;" type="image/svg+xml">b</object> is a vector. In other words, an affine transformation combines a linear transformation with a <em>translation</em>.</p> <p>Quite obviously, every linear transformation is affine (just set <object class="valign-0" data="https://eli.thegreenplace.net/images/math/e9d71f5ee7c92d6dc9e92ffdad17b8bd49418f98.svg" style="height: 13px;" type="image/svg+xml">b</object> to the zero vector). However, not every affine transformation is linear. For a non-zero <object class="valign-0" data="https://eli.thegreenplace.net/images/math/e9d71f5ee7c92d6dc9e92ffdad17b8bd49418f98.svg" style="height: 13px;" type="image/svg+xml">b</object>, the linearity rules don't check out. 
Let's say that:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/8aa781f9271cdf38440d17a01899275112fe143a.svg" style="height: 52px;" type="image/svg+xml"> \begin{align*} f(\vec{v})&amp;=A\vec{v}+\vec{b} \\ f(\vec{w})&amp;=A\vec{w}+\vec{b} \end{align*}</object> <p>Then if we try to add these together, we get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7550749852d8b7f5ad72f26dd850cc94e16135dd.svg" style="height: 22px;" type="image/svg+xml"> $f(\vec{v}+\vec{w})=A(\vec{v}+\vec{w})+\vec{b}$</object> <p>Whereas:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/594f1da3f413dafcad4090111f4e7886df1b1afe.svg" style="height: 18px;" type="image/svg+xml"> $f(\vec{v})+f(\vec{w})=A\vec{v}+b+A\vec{w}+b=A(\vec{v}+\vec{w})+2b$</object> <p>The violation of the scalar multiplication rule can be checked similarly.</p> <p>Let's examine the affine transformation that stretches a vector by a factor of two (similarly to the <tt class="docutils literal">S</tt> transformation we've discussed before) and translates it by 0.5 for both dimensions:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/b73e9c882baeb6d742f05acc82bf6130520a634c.svg" style="height: 43px;" type="image/svg+xml"> $f(\vec{v})=\begin{pmatrix} 2 &amp; 0 \\ 0 &amp; 2 \end{pmatrix}\vec{v}+\begin{pmatrix} 0.5 \\ 0.5\end{pmatrix}$</object> <p>Here is this transformation visualized:</p> <img alt="Unit vectors translated and stretched" class="align-center" src="https://eli.thegreenplace.net/images/2018/points-translate.png" /> <p>With some clever augmentation, we can represent affine transformations as a multiplication by a single matrix, if we add another dimension to the vectors <a class="footnote-reference" href="#id12" id="id6"></a>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/5dda35f4a28bb65c45da251170ae5cd17138b8ec.svg" style="height: 65px;" 
type="image/svg+xml"> $f(\vec{v})=T\vec{v}=\begin{pmatrix} 2 &amp; 0 &amp; 0.5 \\ 0 &amp; 2 &amp; 0.5 \\ 0 &amp; 0 &amp; 1 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ 1 \end{pmatrix}$</object> <p>The translation vector is tacked on the right-hand side of the transform matrix, with a 1 for the extra dimension (the matrix gets 0s in that dimension). The result will always have a 1 in the final dimension, which we can ignore.</p> <p>Affine transforms can be composed similarly to linear transforms, using matrix multiplication. This also makes them associative. As an example, let's compose the scaling+translation transform discussed most recently with the rotation transform mentioned earlier. This is the augmented matrix for the rotation:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/dc7c578ad6e6d7d50aa9ce8ebd8024829d072695.svg" style="height: 65px;" type="image/svg+xml"> $R=\begin{pmatrix} cos\theta &amp; sin\theta &amp; 0 \\ -sin\theta &amp; cos\theta &amp; 0 \\ 0 &amp; 0 &amp; 1 \end{pmatrix}$</object> <p>The composed transform will be <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/0b6f50c9b5f2a9431f500810ae71a40ec939d943.svg" style="height: 18px;" type="image/svg+xml">f(\vec{v})=T(R(\vec{v}))=(TR)\vec{v}</object>. 
Its matrix is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/1ad583a63663ce317112e10234e7ef065ef97252.svg" style="height: 65px;" type="image/svg+xml"> $TR=\begin{pmatrix} 2 &amp; 0 &amp; 0.5 \\ 0 &amp; 2 &amp; 0.5 \\ 0 &amp; 0 &amp; 1 \end{pmatrix}\begin{pmatrix} cos\theta &amp; sin\theta &amp; 0 \\ -sin\theta &amp; cos\theta &amp; 0 \\ 0 &amp; 0 &amp; 1 \end{pmatrix}=\begin{pmatrix} 2cos\theta &amp; 2sin\theta &amp; 0.5 \\ -2sin\theta &amp; 2cos\theta &amp; 0.5 \\ 0 &amp; 0 &amp; 1 \end{pmatrix}$</object> <p>The visualization is:</p> <img alt="Translation after rotation" class="align-center" src="https://eli.thegreenplace.net/images/2018/points-rotate-translate.png" /> </div> <div class="section" id="affine-subspaces"> <h2>Affine subspaces</h2> <p>The previous section defined affine transformation w.r.t. the concept of <em>affine space</em>, and now it's time to pay the rigor debt. According <a class="reference external" href="https://en.wikipedia.org/wiki/Affine_space">to Wikipedia</a>, an affine space:</p> <blockquote> ... is a geometric structure that generalizes the properties of Euclidean spaces in such a way that these are independent of the concepts of distance and measure of angles, keeping only the properties related to parallelism and ratio of lengths for parallel line segments.</blockquote> <p>Since we've been using vectors and vector spaces so far in the article, let's see the relation between vector spaces and affine spaces. 
The best explanation I found online is the following.</p> <p>Consider the vector space <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" />, with two lines:</p> <img alt="Lines for subspace and affine space of R2" class="align-center" src="https://eli.thegreenplace.net/images/2018/subspace-lines.png" /> <p>The blue line can be seen as a vector subspace (also known as <em>linear subspace</em>) of <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" />. On the other hand, the green line is not a vector subspace because it doesn't contain the zero vector. The green line is an <em>affine subspace</em>. This leads us to a definition:</p> <blockquote> A subset <object class="valign-m1" data="https://eli.thegreenplace.net/images/math/deec434246ee4364d506b710d495a68faae6cb99.svg" style="height: 13px;" type="image/svg+xml">U \subset V</object> of a vector space <img alt="V" class="valign-0" src="https://eli.thegreenplace.net/images/math/c9ee5681d3c59f7541c27a38b67edf46259e187b.png" style="height: 12px;" /> is an affine space if there exists a <object class="valign-m1" data="https://eli.thegreenplace.net/images/math/66d9cae10caefdd28dcb23fed51b0bb194c40cff.svg" style="height: 13px;" type="image/svg+xml">u \in U</object> such that <object class="valign-m5" data="https://eli.thegreenplace.net/images/math/93f362965ba8f75b9f3cc491918201ef91811888.svg" style="height: 19px;" type="image/svg+xml">U - u = \{x-u \mid x \in U\}</object> is a vector subspace of <img alt="V" class="valign-0" src="https://eli.thegreenplace.net/images/math/c9ee5681d3c59f7541c27a38b67edf46259e187b.png" style="height: 12px;" />.</blockquote> <p>If you recall the definition of affine transformations from earlier on, this should seem familiar - linear and affine subspaces are related by using a 
translation vector. It can also be said that an affine space is a generalization of a linear space, in that it doesn't require a specific origin point. From Wikipedia, again:</p> <blockquote> Any vector space may be considered as an affine space, and this amounts to forgetting the special role played by the zero vector. In this case, the elements of the vector space may be viewed either as points of the affine space or as displacement vectors or translations. When considered as a point, the zero vector is called the origin. Adding a fixed vector to the elements of a linear subspace of a vector space produces an affine subspace. One commonly says that this affine subspace has been obtained by translating (away from the origin) the linear subspace by the translation vector.</blockquote> <p>When mathematicians define new algebraic structures, they don't do it just for fun (well, sometimes they do) but because such structures have some properties which can lead to useful generalizations. Affine spaces and transformations also have interesting properties, which make them useful. For example, an affine transformation always maps a line to a line (and not to, say, a parabola). Any two triangles can be converted one to the other using an affine transform, and so on. This leads to interesting applications in computational geometry and 3D graphics.</p> </div> <div class="section" id="affine-functions-in-linear-regression-and-neural-networks"> <h2>Affine functions in linear regression and neural networks</h2> <p>Here I want to touch upon the linear vs. affine confusion again, in the context of machine learning. 
Recall that <a class="reference external" href="http://eli.thegreenplace.net/2016/linear-regression/">Linear Regression</a> attempts to fit a line onto data in an optimal way, the line being defined as the function:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/0e60e25963ba73aa9e55f1ebb41a3bf2460b7f28.svg" style="height: 18px;" type="image/svg+xml"> $y(x) = mx + b$</object> <p>But as this article explained, <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/370b21bb4fe6d65ddec7d4c585f09a5e49b55652.svg" style="height: 18px;" type="image/svg+xml">y(x)</object> is not actually a linear function; it's an affine function (because of the constant term <object class="valign-0" data="https://eli.thegreenplace.net/images/math/e9d71f5ee7c92d6dc9e92ffdad17b8bd49418f98.svg" style="height: 13px;" type="image/svg+xml">b</object>). Should linear regression be renamed to <em>affine regression</em>? It's probably too late for that :-), but it's good to get the terminology right.</p> <p>Similarly, a single fully connected layer in a neural network is often expressed mathematically as:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/9f626c0ce605723e39bd0dae81451b0cddee09b0.svg" style="height: 22px;" type="image/svg+xml"> $y(\vec{x})=W\vec{x}+\vec{b}$</object> <p>Where <object class="valign-0" data="https://eli.thegreenplace.net/images/math/f8914399eadbd8be3c3196100658870e03c61fee.svg" style="height: 13px;" type="image/svg+xml">\vec{x}</object> is the input vector, <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" /> is the weight matrix and <object class="valign-0" data="https://eli.thegreenplace.net/images/math/71fa108edb785ca9f729fa3cd5ad18556dd682e4.svg" style="height: 18px;" type="image/svg+xml">\vec{b}</object> is the bias vector.
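</p>

<p>The affine nature of such a layer is easy to demonstrate numerically. The sketch below uses made-up weights and bias, and shows the additivity failure derived earlier: the two results differ by exactly one extra copy of the bias.</p>

```python
# Sketch: a "linear" (actually affine) layer y = Wx + b, with made-up
# numbers, demonstrating that additivity fails when b is non-zero.
def layer(x, W=((1.0, 2.0), (3.0, 4.0)), b=(0.5, 0.5)):
    # Matrix-vector product Wx, plus the bias b, done by hand.
    return tuple(sum(wi * xi for wi, xi in zip(row, x)) + bi
                 for row, bi in zip(W, b))

def add(u, v):
    return tuple(ui + vi for ui, vi in zip(u, v))

v, w = (1.0, 0.0), (0.0, 1.0)
print(layer(add(v, w)))         # f(v + w)
print(add(layer(v), layer(w)))  # f(v) + f(w): larger by exactly b
```

<p>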
This function is also usually referred to as <em>linear</em> although it's actually <em>affine</em>.</p> </div> <div class="section" id="affine-expressions-and-array-accesses"> <h2>Affine expressions and array accesses</h2> <p>Pivoting from algebra to programming, affine functions have a use when discussing one of the most fundamental building blocks of computer science: accessing arrays.</p> <p>Let's start by defining an <em>affine expression</em>:</p> <blockquote> An expression is affine w.r.t. variables <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/3bca803fa0f8dd4ba421a15cbf1a2547ae0285b7.svg" style="height: 12px;" type="image/svg+xml">v_1,v_2,...,v_n</object> if it can be expressed as <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/258cc23dfefcbb9c4cf7ffbe169028181113b5a2.svg" style="height: 15px;" type="image/svg+xml">c_0+c_{1}v_1+...+c_{n}v_n</object> where <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9fa86460c3375a0934ab62697483f4692cdfb0a2.svg" style="height: 12px;" type="image/svg+xml">c_0,c_1,...,c_n</object> are constants.</blockquote> <p>Affine expressions are interesting because they are often used to index arrays in loops. 
Consider the following loop in C that copies all elements in an MxN matrix &quot;one to the left&quot;:</p> <div class="highlight"><pre><span></span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">M</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="o">++</span><span class="n">j</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">arr</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">arr</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">];</span>
  <span class="p">}</span>
<span class="p">}</span>
</pre></div> <p>Since C's memory layout for multi-dimensional arrays is <a class="reference external" href="http://eli.thegreenplace.net/2015/memory-layout-of-multi-dimensional-arrays">row-major</a>, the statement in the loop assigns a value to <tt class="docutils literal">arr[i*N + j - 1]</tt> at every iteration. <tt class="docutils literal">i*N + j - 1</tt> is an <em>affine expression</em> w.r.t.
variables <tt class="docutils literal">i</tt> and <tt class="docutils literal">j</tt> <a class="footnote-reference" href="#id13" id="id7"></a>.</p> <p>When all expressions in a loop are affine, the loop is amenable to some advanced analyses and optimizations, but this is a topic for another post.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>Though it's also not entirely precise. Generally speaking, transformations are more limited than functions. A transformation is defined on a set as a bijection of the set to itself, whereas functions are more general (they can map between different sets, for example).</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id9" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>Finite-dimensional vector spaces with a defined basis.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id10" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>Tossing a bit of rigor aside, we can imagine points and vectors to be isomorphic since both are represented by pairs of numbers on the <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" /> plane.
Some resources will mention the <em>Euclidean plane</em> - <object class="valign-0" data="https://eli.thegreenplace.net/images/math/49853b597499c984c2d89848a19153d282da202c.svg" style="height: 15px;" type="image/svg+xml">\mathbb{E}^2</object> when talking about points and lines, but the Euclidean plane can be modeled by a same-dimensional real plane so I'll just be using <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" />.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id11" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4"></a></td><td>I'll admit this result looks fairly obvious. But longer chains of transforms work in exactly the same way, and the fact that we can represent such chains with a single matrix is very useful.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id12" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id6"></a></td><td>This trick has a geometrical explanation: translation in 2D can be modeled as adding a dimension and performing a 3D <em>shear</em> operation, then projecting the resulting object onto a 2D plane again. 
The object will appear shifted.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id13" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id7"></a></td><td>It's actually only affine if <tt class="docutils literal">N</tt> is a compile-time constant or can be proven to be constant throughout the loop.</td></tr> </tbody> </table> </div> Logistic regression2016-11-02T05:45:00-07:002016-11-02T05:45:00-07:00Eli Benderskytag:eli.thegreenplace.net,2016-11-02:/2016/logistic-regression/<p>This article covers logistic regression - arguably the simplest classification model in machine learning; it starts with basic binary classification, and ends up with some techniques for multinomial classification (selecting between multiple possibilities). The final examples using the softmax function can also be viewed as an example of a single-layer fully …</p><p>This article covers logistic regression - arguably the simplest classification model in machine learning; it starts with basic binary classification, and ends up with some techniques for multinomial classification (selecting between multiple possibilities). The final examples using the softmax function can also be viewed as an example of a single-layer fully connected neural network.</p> <p>This article is the theoretical part; in addition, there's quite a bit of accompanying code <a class="reference external" href="https://github.com/eliben/deep-learning-samples/tree/master/logistic-regression">here</a>. 
All the models discussed in the article are implemented from scratch in Python using only Numpy.</p> <div class="section" id="linear-model-for-binary-classification"> <h2>Linear model for binary classification</h2> <p>Using a linear model for binary classification is very similar to <a class="reference external" href="http://eli.thegreenplace.net/2016/linear-regression/">linear regression</a>, except that we expect a binary (yes/no) answer rather than a numeric answer.</p> <p>We want to come up with a parameter vector <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" />, such that for every data vector <strong>x</strong> we can compute <a class="footnote-reference" href="#id10" id="id1"></a>:</p> <img alt="$\hat{y}(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$" class="align-center" src="https://eli.thegreenplace.net/images/math/ae682f9fda97c28c8e100c87aecad635c7c1d96c.png" style="height: 18px;" /> <p>And then make a binary decision based on the value of <img alt="\hat{y}(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/11533fb1b0218620907f5859e6e22aeb65c12cd8.png" style="height: 18px;" />. A simple way to make a decision is to say &quot;yes&quot; if <img alt="\hat{y}(x)\geq 0" class="valign-m4" src="https://eli.thegreenplace.net/images/math/c30aad52f5af131a89f1a8805e25aa8e354795dc.png" style="height: 18px;" /> and &quot;no&quot; otherwise. Note that this is arbitrary, as we could flip the condition for &quot;yes&quot; and for &quot;no&quot;. 
We could also compare <img alt="\hat{y}(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/11533fb1b0218620907f5859e6e22aeb65c12cd8.png" style="height: 18px;" /> to some value other than zero, and the model would learn equally well <a class="footnote-reference" href="#id12" id="id2"></a>.</p> <p>Let's make this more concrete, also assigning numeric values to &quot;yes&quot; and &quot;no&quot;, which will make some computations simpler later on. For &quot;yes&quot; we'll (again, arbitrarily) select +1, and for &quot;no&quot; we'll go with -1. So, a linear model for binary classification is parameterized by some <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" />, such that:</p> <img alt="$\hat{y}(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$" class="align-center" src="https://eli.thegreenplace.net/images/math/ae682f9fda97c28c8e100c87aecad635c7c1d96c.png" style="height: 18px;" /> <p>And:</p> <img alt="$class(x)=\left\{\begin{matrix} +1 &amp;amp; \operatorname{if}\ \hat{y}(x)\geq 0\\ -1 &amp;amp; \operatorname{if}\ \hat{y}(x)&amp;lt; 0 \end{matrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/092debeba72a26bd76603bd3ce140fc798e5f692.png" style="height: 43px;" /> <p>It helps to see a graphical example of how this looks in practice. As usual, we'll have to stick to low dimensionality if we want to visualize things, so let's use 2D data points.</p> <p>Since our data is in 2D, we need a 3D <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> (<img alt="\theta_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/ba6201ddbe2fd0bb66e0704ad8b3c6bdb36f37aa.png" style="height: 15px;" /> for the bias).
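In code, this decision rule is just a dot product followed by a sign check. A quick NumPy sketch (the parameters and sample points here are made up for illustration, with the convention that x[0] = 1 is the bias coordinate):

```python
import numpy as np

def classify(theta, x):
    """Classify a data vector x (with x[0] == 1 for the bias) as +1 or -1."""
    y_hat = np.dot(theta, x)
    return 1 if y_hat >= 0 else -1

theta = np.array([1.0, 2.0, -1.0])                  # arbitrary parameters
print(classify(theta, np.array([1.0, 0.2, 0.3])))   # 1 + 0.4 - 0.3 >= 0 -> 1
print(classify(theta, np.array([1.0, -2.0, 0.5])))  # 1 - 4 - 0.5 < 0   -> -1
```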
Let's pick <img alt="\theta=(4,-0.5, -1)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/6cb259a86870d3bd0a5ad2f839d0515bfc70f0d7.png" style="height: 18px;" />. Plotting <img alt="\hat{y}(x)=\theta \cdot x" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e0a45fd444b0526e19a0f22fb3c264b026fb3bcf.png" style="height: 18px;" /> will give us a plane in 3D, but what we're really interested in is just to know whether <img alt="\hat{y}(x) \geq 0" class="valign-m4" src="https://eli.thegreenplace.net/images/math/39ad82f3252b80454caa343952948440827f2961.png" style="height: 18px;" />. So we can draw this plane's intersection with the x/y axis:</p> <img alt="Line for binary classification" class="align-center" src="https://eli.thegreenplace.net/images/2016/binary-classification-line.png" /> <p>We can play with some sample points to see that everything &quot;to the right&quot; of the line gives us <img alt="\hat{y}(x) &amp;gt; 0" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d686dc49d4c08e21f67c22cbb42aab2a1f3d3875.png" style="height: 18px;" />, and everything &quot;to the left&quot; of it gives us <img alt="\hat{y}(x) &amp;lt; 0" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d8a7e77c45cecd8e4ba7c8f7d1f02944e9b55ecf.png" style="height: 18px;" /> <a class="footnote-reference" href="#id13" id="id3"></a>.</p> </div> <div class="section" id="loss-functions-for-binary-classification"> <h2>Loss functions for binary classification</h2> <p>How do we find the right <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> for a classification problem? Similarly to linear regression, we're going to define a &quot;loss function&quot; and then train a classifier by minimizing this loss with gradient descent. 
However, here picking a good loss function is not as simple - it turns out square loss doesn't work very well, as we'll see soon.</p> <p>Let's start by considering the most logical loss function to use for classification - the number of misclassified data samples. This is called the 0/1 loss, and it's the true measure of how well a classifier works. Say we have 1000 samples, our classifier placed 960 of them in the right category, and got the wrong answer for the other 40 samples. So the loss would be 40. A better classifier may get it wrong only 35 times, so its loss would be smaller.</p> <p>It will be helpful to plot loss functions, so let's add another definition we're going to be using a lot here: the <em>margin</em>. For a given sample <strong>x</strong>, and its correct classification <em>y</em>, the margin of classification is <img alt="m=\hat{y}(x)y" class="valign-m4" src="https://eli.thegreenplace.net/images/math/fc8c312b137c8aafaaebd881836e4332cc14e61f.png" style="height: 18px;" />. Recall that <em>y</em> is either +1 or -1, so the margin is either <img alt="\hat{y}(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/11533fb1b0218620907f5859e6e22aeb65c12cd8.png" style="height: 18px;" /> or its negation, depending on the correct answer. Note that the margin is positive when our guess is correct (both <img alt="\hat{y}(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/11533fb1b0218620907f5859e6e22aeb65c12cd8.png" style="height: 18px;" /> and y have the same sign) and negative when our guess is wrong. 
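The margin is a one-liner to compute; here's a small sketch with made-up predictions and labels:

```python
import numpy as np

# Raw linear outputs y_hat and the corresponding correct labels y (+1/-1).
y_hat = np.array([2.3, -0.7, 0.4, -1.5])
y = np.array([1, -1, -1, 1])

margin = y_hat * y   # elementwise: 2.3, 0.7, -0.4, -1.5
# The first two samples were classified correctly (positive margin),
# the last two incorrectly (negative margin).
print(margin > 0)
```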
With this in hand, we define 0/1 loss as:</p> <img alt="$L_{01}(m) = \mathbb{I}(m \leq 0)$" class="align-center" src="https://eli.thegreenplace.net/images/math/e9731883ade0db9b166741b2ff53a8167a8e3ffd.png" style="height: 18px;" /> <p>Where <img alt="\mathbb{I}" class="valign-0" src="https://eli.thegreenplace.net/images/math/3dcdffb11a6b55b62a0c9e29d85dd9120f5945f4.png" style="height: 12px;" /> is an <em>indicator function</em> taking the value 1 when its condition is true and the value 0 otherwise. Here is the plot of <img alt="L_{01}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3ed6799c7063de4663bdeab8fa126196f41bcd0f.png" style="height: 16px;" /> as a function of margin:</p> <img alt="0/1 loss for binary classification" class="align-center" src="https://eli.thegreenplace.net/images/2016/binary-01-loss.png" /> <p>Unfortunately, the 0/1 loss is fairly hostile to gradient descent optimization, since it's not convex. This is easy to see intuitively. Suppose we have some <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> that gives us a margin of -1.5. The 0/1 loss for this margin is 1, but how can we improve it? Small nudges to <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> will still give us a margin very close to -1.5, which results in exactly the same loss. We don't know which way to nudge <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> since either way we get the same outcome. In other words, there's no slope to follow here.</p> <p>That's not to say all is lost. Some work is being done with optimizing 0/1 losses for classification, but this is a bit outside the mainstream of machine learning. 
Here's an <a class="reference external" href="http://jmlr.org/proceedings/papers/v28/nguyen13a.pdf">interesting paper</a> that discusses some approaches. It's fascinating for computer science geeks since it uses combinatorial search techniques. The rest of this post, however, will use 0/1 loss only as an idealized limit, trying other kinds of loss we can actually run gradient descent with.</p> <p>The first such loss that comes to mind is square loss, the same one we use in linear regression. We'll define the square loss as a function of margin:</p> <img alt="$L_2(m) = (m - 1)^2$" class="align-center" src="https://eli.thegreenplace.net/images/math/ea06356db44999485977e3a7e6ff5e97e617b1bb.png" style="height: 21px;" /> <p>The reason we do this is to get two desired outcomes at important points: at <img alt="m=1" class="valign-m1" src="https://eli.thegreenplace.net/images/math/002d212eace214d48ccf82c7bc33021b1d9cdb91.png" style="height: 13px;" /> we want the loss to be 0, since this is actually the correct classification: we only get <img alt="m=1" class="valign-m1" src="https://eli.thegreenplace.net/images/math/002d212eace214d48ccf82c7bc33021b1d9cdb91.png" style="height: 13px;" /> when either both <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/45f0241f56d9823eb2d24a228d7ffe62c5fdcdc2.svg" style="height: 16px;" type="image/svg+xml">y=1</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/c5f34fb4e66b84bde15d596cf76efd468983c4d5.svg" style="height: 17px;" type="image/svg+xml">\hat{y}=1</object> or when both <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ad8ddd3de86ba8af8476af79d20b151a251ec117.svg" style="height: 16px;" type="image/svg+xml">y=-1</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/4ae2d248963bac1702c9e5e1f1d0769126f0c479.svg" style="height: 17px;" type="image/svg+xml">\hat{y}=-1</object>.</p> <p>Furthermore, to approximate the 0/1 
loss, we want our loss at <img alt="m=0" class="valign-0" src="https://eli.thegreenplace.net/images/math/5e49227d625a223efeaa8d7bc48bb0b87f878bff.png" style="height: 12px;" /> to be 1. Here's a plot of the square loss together with 0/1 loss:</p> <img alt="0/1 loss and square loss for binary classification" class="align-center" src="https://eli.thegreenplace.net/images/2016/binary-01-with-square-loss.png" /> <p>A couple of problems are immediately apparent with the square loss:</p> <ol class="arabic simple"> <li>It penalizes correct classification as well, in case the margin is very positive. This is not something we want! Ideally, we want the loss to be 0 starting with <img alt="m=1" class="valign-m1" src="https://eli.thegreenplace.net/images/math/002d212eace214d48ccf82c7bc33021b1d9cdb91.png" style="height: 13px;" /> and for all subsequent values of <em>m</em>.</li> <li>It very strongly penalizes outliers. One sample that we misclassified badly can shift the training too much.</li> </ol> <p>We could try to fix these problems by using clamping of some sort, but there is another loss function which serves as a much better approximation to 0/1 loss. 
It's called &quot;hinge loss&quot;:</p> <img alt="$L_h(m) = max(0, 1-m)$" class="align-center" src="https://eli.thegreenplace.net/images/math/dd883f12c7f609fe9256e0e6bb4cfdf319d07844.png" style="height: 18px;" /> <p>And its plot, along with the previously shown losses:</p> <img alt="0/1 loss, square loss and hinge loss for binary classification" class="align-center" src="https://eli.thegreenplace.net/images/2016/binary-01-with-square-and-hinge-loss.png" /> <p>Note that the hinge loss also matches 0/1 loss on the two important points: <img alt="m=0" class="valign-0" src="https://eli.thegreenplace.net/images/math/5e49227d625a223efeaa8d7bc48bb0b87f878bff.png" style="height: 12px;" /> and <img alt="m=1" class="valign-m1" src="https://eli.thegreenplace.net/images/math/002d212eace214d48ccf82c7bc33021b1d9cdb91.png" style="height: 13px;" />. It also has some nice properties:</p> <ol class="arabic simple"> <li>It doesn't penalize correct classification after <img alt="m=1" class="valign-m1" src="https://eli.thegreenplace.net/images/math/002d212eace214d48ccf82c7bc33021b1d9cdb91.png" style="height: 13px;" />.</li> <li>It penalizes incorrect classifications, but not as much as square loss.</li> <li>It's convex (at least where it matters - where the loss is nonzero)! If we get <object class="valign-m1" data="https://eli.thegreenplace.net/images/math/5abbd129a48c53a04b0caa6eef4d760329f02149.svg" style="height: 14px;" type="image/svg+xml">m=-1.5</object> we can actually examine the loss in its very close vicinity and find a slope we can use to improve the loss. 
So, unlike 0/1 loss, it's amenable to gradient descent optimization.</li> </ol> <p>There are other loss functions used to train binary classifiers, such as log loss, but I will leave them out of this post.</p> <p>This is a good place to mention that hinge loss leads naturally to <a class="reference external" href="https://en.wikipedia.org/wiki/Support_vector_machine#SVM_and_the_hinge_loss">SVMs</a> (support vector machines), an interesting technique I'll leave for some other time.</p> </div> <div class="section" id="finding-a-classifier-with-gradient-descent"> <h2>Finding a classifier with gradient descent</h2> <p>With a loss function in hand, we can use <a class="reference external" href="http://eli.thegreenplace.net/2016/understanding-gradient-descent/">gradient descent</a> to find a good classifier for some data. The procedure is very similar to what we've been doing for linear regression:</p> <p>Given a loss function, we compute the loss gradient with respect to each <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> and update <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> for the next step:</p> <img alt="$\theta_{j}=\theta_{j}-\eta\frac{\partial L}{\partial \theta_{j}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/561a940034503fe1bb00e86c90ac130cb351d73b.png" style="height: 42px;" /> <p>Where <img alt="\eta" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2899aeb886ad0fa72652bffd5511e452aaf084ab.png" style="height: 12px;" /> is the learning rate.</p> </div> <div class="section" id="computing-gradients-for-our-loss-functions-with-regularization"> <h2>Computing gradients for our loss functions, with regularization</h2> <p>The only remaining part is computing the gradients for the square and hinge loss
functions we've defined. In addition, I'm going to add &quot;<img alt="L_2" class="valign-m3" src="https://eli.thegreenplace.net/images/math/0d2398f5890edff3f40f1686fc3b51528209bf9b.png" style="height: 15px;" /> regularization&quot; to the loss as a means to prevent overfitting for the training data. <a class="reference external" href="https://en.wikipedia.org/wiki/Regularization_(mathematics)">Regularization</a> is an important component of the learning algorithm. <img alt="L_2" class="valign-m3" src="https://eli.thegreenplace.net/images/math/0d2398f5890edff3f40f1686fc3b51528209bf9b.png" style="height: 15px;" /> regularization adds the sum of the squares of all parameters to the loss, and thus &quot;tries&quot; to keep parameters low. This way, we don't end up over-emphasizing one or a group of parameters over the others.</p> <p>Here is square loss with regularization <a class="footnote-reference" href="#id14" id="id4"></a>:</p> <img alt="$L_2=\frac{1}{k}\sum_{i=1}^{k}(m^{(i)}-1)^2+\frac{\beta}{2}\sum_{j=0}^{n}\theta_{j}^2$" class="align-center" src="https://eli.thegreenplace.net/images/math/a9735ff6606b3ad3454c3dfefc541c21b926d541.png" style="height: 56px;" /> <p>This is assuming we have <em>k</em> data points (<em>n+1</em> dimensional) and <em>n+1</em> parameters (including the special 0th parameter representing the bias). The total loss is the square loss averaged over all data points, plus the regularization loss. <img alt="\beta" class="valign-m4" src="https://eli.thegreenplace.net/images/math/6499d503bfc00cadae1440b191c52a8632e2f8c4.png" style="height: 16px;" /> is the regularization &quot;strength&quot; (another hyper-parameter in the learning algorithm).</p> <p>Let's start by computing the derivative of the margin. 
Using superscripts for indexing data items, recall that:</p> <img alt="$m^{(i)}=\hat{y}^{(i)}y^{(i)}=(\theta_0 x_0^{(i)}+\cdots + \theta_n x_n^{(i)})y^{(i)}$" class="align-center" src="https://eli.thegreenplace.net/images/math/bce48f26ac61cbfd37c8bfbaad0004e5c30ccbbc.png" style="height: 26px;" /> <p>Therefore:</p> <img alt="$\frac{\partial m^{(i)}}{\partial \theta_j}=x_j^{(i)}y^{(i)}$" class="align-center" src="https://eli.thegreenplace.net/images/math/fd79e2321a3ee607dbf3840535d1a8a2327e2117.png" style="height: 47px;" /> <p>With this in hand, it's easy to compute the gradient of <img alt="L_2" class="valign-m3" src="https://eli.thegreenplace.net/images/math/0d2398f5890edff3f40f1686fc3b51528209bf9b.png" style="height: 15px;" /> loss.</p> <img alt="$\frac{\partial L_2}{\partial \theta_j}=\frac{2}{k}\sum_{i=1}^{k}(m^{(i)}-1)x_{j}^{(i)}y^{(i)}+\beta\theta_j$" class="align-center" src="https://eli.thegreenplace.net/images/math/2340ff828a85ab17aa5067b4985cf9da4fd5fae7.png" style="height: 54px;" /> <p>Now let's turn to hinge loss. The total loss for the data set with regularization is:</p> <img alt="$L_h=\frac{1}{k}\sum_{i=1}^{k}max(0, 1-m^{(i)})+\frac{\beta}{2}\sum_{j=0}^{n}\theta_{j}^2$" class="align-center" src="https://eli.thegreenplace.net/images/math/2ce4a6debf2650ea4c8a1ff24ce8e42f3d370a6e.png" style="height: 56px;" /> <p>The tricky part here is finding the derivative of the <img alt="max" class="valign-0" src="https://eli.thegreenplace.net/images/math/0706025b2bbcec1ed8d64822f4eccd96314938d0.png" style="height: 8px;" /> function with respect to <img alt="\theta_j" class="valign-m6" src="https://eli.thegreenplace.net/images/math/56adcea6f10a3cd4a439536412c7fb690f803bc9.png" style="height: 18px;" />. 
I find it easier to reason about functions like <img alt="max" class="valign-0" src="https://eli.thegreenplace.net/images/math/0706025b2bbcec1ed8d64822f4eccd96314938d0.png" style="height: 8px;" /> when the different cases are cleanly separated:</p> <img alt="$max(0,1-m^{(i)})=\left\{\begin{matrix} 1-m^{(i)} &amp;amp; \operatorname{if}\ m^{(i)}&amp;lt; 1\\ 0 &amp;amp; \operatorname{if}\ m^{(i)}\geq 1 \end{matrix}\right.$" class="align-center" src="https://eli.thegreenplace.net/images/math/884d533e1ff8dd51ae43a229bc2f86bc72e82c2a.png" style="height: 46px;" /> <p>We already know the derivative of <img alt="m^{(i)}" class="valign-0" src="https://eli.thegreenplace.net/images/math/0971cbdfca7ab3d5c094d8a8e75c77ccf66e4715.png" style="height: 17px;" /> with respect to <img alt="\theta_j" class="valign-m6" src="https://eli.thegreenplace.net/images/math/56adcea6f10a3cd4a439536412c7fb690f803bc9.png" style="height: 18px;" />. So it's easy to derive this expression case-by-case:</p> <img alt="$\frac{\partial max(0,1-m^{(i)})}{\partial \theta_j}=\left\{\begin{matrix} -x_j^{(i)}y^{(i)} &amp;amp; \operatorname{if}\ m^{(i)}&amp;lt; 1\\ 0 &amp;amp; \operatorname{if}\ m^{(i)}\geq 1 \end{matrix}\right.$" class="align-center" src="https://eli.thegreenplace.net/images/math/4feb3f18ab008352c513de8508c4e8f877510167.png" style="height: 54px;" /> <p>And the overall gradient of the hinge loss is:</p> <img alt="$\frac{\partial L_h}{\partial \theta_j}=\frac{1}{k}\sum_{i=1}^{k}\frac{\partial max(0,1-m^{(i)})}{\partial \theta_j}+\beta\theta_j$" class="align-center" src="https://eli.thegreenplace.net/images/math/d3113e543be93630457f9501379fe0b6956d9342.png" style="height: 54px;" /> </div> <div class="section" id="experiments-with-synthetic-data"> <h2>Experiments with synthetic data</h2> <p>Let's see an example of learning a binary classifier in action.
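Before turning to the real code, here's a compact, self-contained sketch of this training loop - gradient descent with the hinge loss gradient derived above - on a tiny made-up data set (the data and hyper-parameters are illustrative only, not from the linked sample):

```python
import numpy as np

def train_hinge(X, y, eta=0.1, beta=0.0, steps=2000):
    """Gradient descent on hinge loss with L2 regularization strength beta.

    Rows of X are data items with a leading 1 for the bias; y holds +1/-1.
    """
    k, n = X.shape
    theta = np.zeros(n)
    for _ in range(steps):
        m = (X @ theta) * y                    # margin of every data item
        active = (m < 1).astype(float)         # items where hinge is nonzero
        # d max(0, 1-m_i)/d theta_j is -x_j*y_i for active items, 0 otherwise
        grad = -(X * (y * active)[:, None]).sum(axis=0) / k + beta * theta
        theta -= eta * grad
    return theta

# Tiny separable data set: the label is +1 when x1 + x2 > 1.
X = np.array([[1, 0.2, 0.1], [1, 0.9, 0.8], [1, 0.1, 0.4], [1, 0.7, 0.9]])
y = np.array([-1, 1, -1, 1])
theta = train_hinge(X, y)
print(np.sign(X @ theta) == y)   # all items should end up classified correctly
```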
<a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/master/logistic-regression/simple_binary_classifier.py">This code sample</a> generates some synthetic data in two dimensions and then uses the approach described so far in the post to train a binary classifier. Here's a sample data set:</p> <img alt="Synthetic data for binary classification" class="align-center" src="https://eli.thegreenplace.net/images/2016/synthetic-data.png" /> <p>The data points for which the correct answer is positive (<em>y=1</em>) are the green crosses; the ones for which the correct answer is negative (<em>y=-1</em>) are the red dots. Note that I include a small number of negative outliers (red dots where we'd expect only green crosses to be) to test the classifier on realistic, imperfect data.</p> <p>The sample code can use combinatorial search to find a &quot;best&quot; set of parameters that results in the lowest 0/1 loss - the lowest number of misclassified data items. Note that misclassifying some items in this data set is inevitable (with a linear classifier), because of the outliers. Here is the contour line showing how the classification decision is made with parameters found by doing the combinatorial search:</p> <img alt="Synthetic data for binary classification with only 0/1 loss" class="align-center" src="https://eli.thegreenplace.net/images/2016/synthetic-data-only-01-loss.png" /> <p>The 0/1 loss - number of misclassified data items - for this set of parameters is 20 out of 400 data items (95% correct prediction rate).</p> <p>Next, the code trains a classifier using square loss, and another using hinge loss. 
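Counting misclassified items - the 0/1 loss used as the yardstick above - takes only a couple of lines. A sketch with hypothetical data, reusing the earlier example's parameters:

```python
import numpy as np

def zero_one_loss(theta, X, y):
    """Number of misclassified items; rows of X carry a leading 1 for bias."""
    predictions = np.where(X @ theta >= 0, 1, -1)
    return int((predictions != y).sum())

# Hypothetical 2D data (plus the bias column of ones) and labels.
X = np.array([[1, 0.5, 2.0], [1, 3.0, 1.0], [1, 6.0, 0.5], [1, 7.0, 4.0]])
y = np.array([1, 1, -1, -1])
theta = np.array([4.0, -0.5, -1.0])
print(zero_one_loss(theta, X, y))   # 1: only the third item is misclassified
```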
I'm not using regularization for this data set, since with only 3 parameters there can't be too much selective bias between them; in other words, <img alt="\beta=0" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3bb1ac87ba8d8d0c95fd43b91640c0b96f8e72d9.png" style="height: 16px;" />.</p> <p>A classifier trained with square loss misclassifies 32 items (92% success rate). A classifier trained with hinge loss misclassifies 26 items (93.5% success rate, much closer to the &quot;perfect&quot; rate). This is to be expected from the earlier discussion - square loss very strongly penalizes outliers, which makes it more skewed on this data <a class="footnote-reference" href="#id15" id="id5"></a>. Here are the contour plots for all losses that demonstrate this graphically:</p> <img alt="Synthetic data for binary classification with all losses" class="align-center" src="https://eli.thegreenplace.net/images/2016/synthetic-data-all-losses.png" /> </div> <div class="section" id="binary-classification-of-mnist-digits"> <h2>Binary classification of MNIST digits</h2> <p>The <a class="reference external" href="https://en.wikipedia.org/wiki/MNIST_database">MNIST dataset</a> is the &quot;hello world&quot; of machine learning these days. It's a database of grayscale images representing handwritten digits, with a correct label for each of these images.</p> <p>MNIST is usually employed for the more general multinomial classification problem - classifying a given data item into one of multiple classes (0 to 9 in the case of MNIST). We'll address this in a later section.</p> <p>Here, however, we can experiment with training a binary classifier on MNIST. The idea is to train a classifier that recognizes some single label. For example, a classifier answering the question &quot;is this an image of the digit 4&quot;. 
This is a binary classification problem, since there are only two answers - &quot;yes&quot; and &quot;no&quot;.</p> <p><a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/master/logistic-regression/mnist_binary_classifier.py">Here's a code sample</a> that trains such a classifier, using the hinge loss function (since we've already determined it gives better results than square loss for classification problems).</p> <p>It starts by converting the correct labels of MNIST from the numeric range 0-9 to +1 or -1 based on whether the label is 4:</p> <div class="highlight"><pre><span></span>     0        -1
     1        -1
     4         1
     9        -1
y =  3  ==&gt;  -1
     8        -1
     5        -1
    ...
     4         1
</pre></div> <p>Then all we have is a binary classification problem, albeit one that is 785-dimensional (784 dimensions for each of the 28x28 pixels in the input images, plus one for bias). Visualizing the separating contours would be quite challenging here, but we can now trust the math to know what's going on. Other than this, the code for gradient descent is <em>exactly the same</em> as for the simple 2D synthetic data shown earlier.</p> <p>My goal here is not to design a state-of-the-art machine learning architecture, but to explain how the main parts work. So I didn't tune the model too much, but it's possible to get 98% accuracy on this binary formulation of MNIST by tuning the code a bit. While 98% sounds great, recall that we could get 90% just by saying &quot;no&quot; to every digit :-) Feel free to play with the code to see if you can get even higher numbers; I don't really expect record-beating numbers from this model, though, since it's so simple.</p>
For example, &quot;what is the chance of rain tomorrow?&quot; rather than &quot;will there be rain, yes or no?&quot;. A probability carries extra information: &quot;90% chance of rain&quot; and &quot;56% chance of rain&quot; both map to the same binary &quot;yes&quot; (assuming a 50% cutoff), yet they tell us very different things.</p> <p>Moreover, note that the linear model we've trained actually provides more information already, giving a numerical answer. We choose to cut it off at 0, saying yes for positive and no for negative numbers. But some numbers are more positive (or negative) than others!</p> <p>Quick thought experiment: can we somehow interpret the response before the cutoff as a probability? The main problem here is that probabilities must be in the range [0, 1], while the linear model gives us an arbitrary real number. We may end up with negative probabilities or probabilities over 1, neither of which makes much sense. So we'll want to find some mathematical way to &quot;squish&quot; the result into the valid [0, 1] range.
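As a quick numeric check that such a &quot;squishing&quot; function exists, here's a small sketch applying one common choice (defined formally in a moment) to some made-up model outputs:

```python
import math

def sigmoid(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Raw linear-model outputs: arbitrary reals, useless as probabilities...
raw_outputs = [-7.2, -0.5, 0.0, 3.1, 8.0]

# ...but after squashing, every value lands strictly inside (0, 1),
# and the ordering of the outputs is preserved.
probs = [sigmoid(z) for z in raw_outputs]
```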
A common way to do this is to use the logistic function:</p> <img alt="$S(z) = \frac{1}{1 + e^{-z}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/62429be191903e2433ba80f92aaf1044568b831d.png" style="height: 38px;" /> <p>It's also known as the &quot;sigmoid&quot; function because of its S-like shape:</p> <img alt="Sigmoid function" class="align-center" src="https://eli.thegreenplace.net/images/2016/sigmoid.png" /> <p>We're going to assign <img alt="\hat{y}(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/11533fb1b0218620907f5859e6e22aeb65c12cd8.png" style="height: 18px;" /> into the <em>z</em> variable of the sigmoid, to get the function:</p> <img alt="$S(x) = \frac{1}{1 + e^{-(\theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n)}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/2b9f6770ff23ed08c38a9ab5c3b5972f5d002ddb.png" style="height: 39px;" /> <p>And now, the answer we get can be interpreted as a probability between 0 and 1 (without actually touching either asymptote) <a class="footnote-reference" href="#id16" id="id6"></a>. We can train a model to get as close to 1 as possible for training samples where the true answer is &quot;yes&quot; and as close to 0 as possible for training samples where the true answer is &quot;no&quot;. This is called &quot;logistic regression&quot; due to the use of the logistic function.</p> </div> <div class="section" id="training-logistic-regression-with-the-cross-entropy-loss"> <h2>Training logistic regression with the cross-entropy loss</h2> <p>Earlier in this post, we've seen how a number of loss functions fare for the binary classifier problem. It turns out that for logistic regression, a very natural loss function exists that's called <a class="reference external" href="https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_error_function_and_logistic_regression">cross-entropy</a> (also sometimes &quot;logistic loss&quot; or &quot;log loss&quot;). 
This loss function is derived from probability and information theory, and its derivation is outside the scope of this post (check out <a class="reference external" href="http://neuralnetworksanddeeplearning.com/chap3.html">Chapter 3 of Michael Nielsen's online book</a> for a nice intuitive explanation for why this loss function makes sense).</p> <p>The formulation of cross-entropy we're going to use here starts from the most general:</p> <img alt="$C(x^{(i)})=-\sum_{t} p^{(i)}_t log(p(y^{(i)}=t|\theta))$" class="align-center" src="https://eli.thegreenplace.net/images/math/a689c6537836933fae93c80a71cd52ff88703a78.png" style="height: 41px;" /> <p>Let's unravel this definition, step by step. The parenthesized superscript <img alt="x^{(i)}" class="valign-0" src="https://eli.thegreenplace.net/images/math/233014006c0adbee71ec71ba3a70f22ad1b906a1.png" style="height: 17px;" /> denotes, as usual, the <em>ith</em> input sample. <em>t</em> runs over all the possible outcomes; <img alt="p_t" class="valign-m4" src="https://eli.thegreenplace.net/images/math/aaf082725869f54161f39f7d9c39fff25c52ac94.png" style="height: 12px;" /> is the actual probability of outcome <em>t</em> and inside the <em>log</em> we have the conditional probability of this outcome given the regression parameters - in other words, this is the model's prediction <a class="footnote-reference" href="#id17" id="id7"></a>.</p> <p>To make this more concrete, in our case we have two possible outcomes in the training data: either <img alt="y^{(i)}=+1" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3e3495884df85359610f062a6a6428fba7891bb8.png" style="height: 21px;" /> or <img alt="y^{(i)}=-1" class="valign-m4" src="https://eli.thegreenplace.net/images/math/8465f16030efd8eab0982f1e60b8ff292317cdbe.png" style="height: 21px;" />. Given any such outcome, its &quot;actual&quot; probability is either 1 (when we get this outcome in the training data) or 0 (when we don't). 
So for any given sample, one of the two possible values of <em>t</em> has <img alt="p^{(i)}_t=0" class="valign-m5" src="https://eli.thegreenplace.net/images/math/eedcbf364060646a9b6abfccb8e9dda67a645ff0.png" style="height: 25px;" /> and the other has <img alt="p^{(i)}=1" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e44b2858aeb1c845d09a851cbea5fdc9c465199e.png" style="height: 21px;" />. Therefore, we get <a class="footnote-reference" href="#id18" id="id8"></a>:</p> <img alt="$C(x^{(i)})=\left\{ \begin{matrix} -log(S(x^{(i)}) &amp;amp; \operatorname{if}\ y^{(i)}=+1 \\ -log(1-S(x^{(i)})) &amp;amp; \operatorname{if}\ y^{(i)}=-1 \end{matrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/97e3fd44d870673c7a74047b82e30c993a9bec59.png" style="height: 46px;" /> <p>The second possibility has <img alt="-log(1-S(x^{(i)}))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/0b34c17378147a8a82db655998c07649ca71ed39.png" style="height: 21px;" /> because we define <img alt="S(z)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/61bc9efb9d2c99669df519617ee7daee7670e156.png" style="height: 18px;" /> to predict the probability of the answer being +1; therefore, the probability of the answer being -1 is <img alt="1-S(z)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d006e787dd01f802c9c5cb570e39a44cb133b2ce.png" style="height: 18px;" />.</p> <p>This is the cross-entropy loss for a single sample <img alt="x^{(i)}" class="valign-0" src="https://eli.thegreenplace.net/images/math/233014006c0adbee71ec71ba3a70f22ad1b906a1.png" style="height: 17px;" />. 
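The piecewise per-sample loss translates directly into code. A minimal sketch, where <tt class="docutils literal">s</tt> stands for S(x^(i)), the model's predicted probability of the +1 outcome:

```python
import math

def cross_entropy_loss(s, y):
    # s: predicted probability that the answer is +1, strictly in (0, 1).
    # y: the true label, +1 or -1.
    if y == 1:
        return -math.log(s)        # low cost when s is close to 1
    else:
        return -math.log(1.0 - s)  # low cost when s is close to 0
```

A confident correct prediction (s = 0.99, y = +1) costs about 0.01, while the same prediction with y = -1 costs about 4.6; the loss grows without bound as the model gets more certain and more wrong.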
To get the total loss over a data set, we take the average sample loss, as usual:</p> <img alt="$C = \frac{1}{k}\sum_{i=1}^{k} C(x^{(i)})$" class="align-center" src="https://eli.thegreenplace.net/images/math/642351dc03ee1f11eca503f558971282d5c700e7.png" style="height: 54px;" /> <p>Now let's compute the gradient of this loss function, so we can use it to train a model. Starting with the +1 case, we have:</p> <img alt="$C_{+1} = -log(S(x^{(i)}))$" class="align-center" src="https://eli.thegreenplace.net/images/math/986418ed0bf4c05742c9a412a0918ed00108d93d.png" style="height: 23px;" /> <p>Then:</p> <img alt="$\frac{\partial C_{+1}}{\partial \theta_j} = \frac{-1}{S(x^{(i)})}\frac{\partial S(x^{(i)})}{\partial \theta_j}$" class="align-center" src="https://eli.thegreenplace.net/images/math/c91fbcf0bb4630112d1efa5adbb8756c25512c68.png" style="height: 47px;" /> <p>Here it will be helpful to use the following identity, which can be easily verified by going through the math <a class="footnote-reference" href="#id19" id="id9"></a>:</p> <img alt="$S&amp;#x27;(z)=S(z)(1-S(z))$" class="align-center" src="https://eli.thegreenplace.net/images/math/3d880e07d60096518b916e877cd6a8496c39bc37.png" style="height: 20px;" /> <p>Since in our case <img alt="S(x^{(i)})" class="valign-m4" src="https://eli.thegreenplace.net/images/math/8a85ab5b49ac41fe751ac8b29e2f2e76f34650bb.png" style="height: 21px;" /> is actually <img alt="S(\hat{y}(x^{(i})))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/915144b3b0a3b41ff5d71f88e798c702386cfea8.png" style="height: 21px;" /> where <img alt="\hat{y}(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n" class="valign-m4" src="https://eli.thegreenplace.net/images/math/7ad144258d3d91e1ada8fd7f94a7d0b0538faa2d.png" style="height: 18px;" />, we can apply the chain rule:</p> <img alt="$\frac{\partial S(x^{(i)})}{\partial \theta_j}=S(x^{(i)})(1-S(x^{(i)}))x^{(i)}_j$" class="align-center" 
src="https://eli.thegreenplace.net/images/math/6bb3f809b570699a74428c137ea715d97b08b58d.png" style="height: 47px;" /> <p>Substituting back into <img alt="\frac{\partial C_{+1}}{\partial \theta_j}" class="valign-m10" src="https://eli.thegreenplace.net/images/math/cc23ed0ff22b532e2ab3fec04117c8c968318629.png" style="height: 29px;" />, we get:</p> <img alt="\begin{align*} \frac{\partial C_{+1}}{\partial \theta_j} &amp;amp;= \frac{-1}{S(x^{(i)})}S(x^{(i)})(1-S(x^{(i)}))x^{(i)}_j \\ &amp;amp;= (S(x^{(i)})-1)x^{(i)}_j \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/dc2a24be9f61f7066fbaeb48805bb59c51e445c0.png" style="height: 76px;" /> <p>Similarly, for <img alt="C_{-1}=-log(1-S(x^{(i)}))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/91460ff9e118d56fbbce2f0557bb9208a4d438a4.png" style="height: 21px;" /> we can compute:</p> <img alt="$\frac{\partial C_{-1}}{\partial \theta_j} = S(x^{(i)})x^{(i)}_j$" class="align-center" src="https://eli.thegreenplace.net/images/math/5e5fd85290c89289aacb1486d9f706bd9fca8fdc.png" style="height: 42px;" /> <p>Putting it all together, we find that the contribution of <img alt="x^{(i)}" class="valign-0" src="https://eli.thegreenplace.net/images/math/233014006c0adbee71ec71ba3a70f22ad1b906a1.png" style="height: 17px;" /> to the gradient of <img alt="\theta_j" class="valign-m6" src="https://eli.thegreenplace.net/images/math/56adcea6f10a3cd4a439536412c7fb690f803bc9.png" style="height: 18px;" /> is:</p> <img alt="$\frac{\partial C(x^{(i)})}{\partial \theta_j}=\left\{ \begin{matrix} (S(x^{(i)})-1)x^{(i)}_j &amp;amp; \operatorname{if}\ y^{(i)}=+1 \\ S(x^{(i)})x^{(i)}_j &amp;amp; \operatorname{if}\ y^{(i)}=-1 \end{matrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/21cf7ba3c128242b99272a3e47b5ab5c09cb24bf.png" style="height: 56px;" /> <p>Using these formulae, we can train a binary logistic classifier for MNIST that gives us a probability of some input image being a 4, rather 
than a yes/no answer. The <a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/master/logistic-regression/mnist_binary_classifier.py">binary MNIST code sample</a> trains either a binary or a logistic classifier using a lot of shared infrastructure.</p> <p>The probability gives us more information than just a yes/no answer. Consider, for example, the following image from the MNIST database. When I trained a binary classifier with hinge loss to recognize the digit 4 for 1200 steps, it wrongly predicted that this image (actually a 9) is a 4:</p> <img alt="Image of a 9 from MNIST" class="align-center" src="https://eli.thegreenplace.net/images/2016/mnist-test-9740.png" /> <p>The model clearly made a mistake here, but can we know <em>how</em> wrong it was? It would be hard to know with a binary classifier that only gives us a yes/no answer. However, when I run a logistic regression model on the same image, it tells me it is 53% confident this is a 4. Since our cutoff for yes/no is 50%, this is quite close to the threshold, so I'd say the model didn't make a huge mistake here.
There are many ways to do this; here I'll focus on two: one-vs-all classification and softmax.</p> </div> <div class="section" id="one-vs-all-classification"> <h2>One-vs-all classification</h2> <p>The One-vs-all (OvA), also known as one-vs-rest (OvR) approach is a natural extension of binary classification:</p> <ol class="arabic simple"> <li>For each class <img alt="t\in[0..T-1]" class="valign-m5" src="https://eli.thegreenplace.net/images/math/6eda0dcb5f9805e0e0e4c3d0af82aacdf1295efd.png" style="height: 18px;" /> we train a logistic classifier where we set <em>t</em> as the &quot;correct&quot; answer, and the other classes as the &quot;incorrect&quot; answers (+1 and -1 respectively).</li> <li>The result of each such classifier is the probability that an input sample belongs to class <em>t</em>.</li> <li>Given a new input, we run all <em>T</em> classifiers on it and the one that gives us the highest probability is chosen as the true class of the input.</li> </ol> <p>As a completely synthetic example to make this clearer, suppose that <em>T=3</em>. We take the training data and train 3 logistic regressions. In the first - <img alt="C_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/33e4cbb170d6026eb67de894c0d01e8702fb065d.png" style="height: 15px;" />, we set 0 as the right answer, 1 and 2 as the wrong answers. In the second - <img alt="C_1" class="valign-m4" src="https://eli.thegreenplace.net/images/math/c538a6221da718dd38230dcbb6e1a8fb40561f7a.png" style="height: 16px;" /> we set 1 as the right answer, 0 and 2 as the wrong answers. 
Finally in the third - <img alt="C_2" class="valign-m3" src="https://eli.thegreenplace.net/images/math/e65b6ebf7cbd7ef19069cc4837331af9d119cfe6.png" style="height: 15px;" /> we set 2 as the right answer, 0 and 1 the wrong answers.</p> <p>Now, given a new input vector <strong>x</strong> we run <img alt="C_0(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/5ed83eb3961cbf4855ce46814719658cdc79e5f2.png" style="height: 18px;" />, <img alt="C_1(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/47b77ff17810fb0a0d4f6b86f50d403e8a59a7a7.png" style="height: 18px;" /> and <img alt="C_2(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/c063fddc4bcdd77e1131dc70ec5b578b5ec887ef.png" style="height: 18px;" />. Each of these gives us the probability of <strong>x</strong> belonging to the respective class. If we put all the classifiers in a vector, we get:</p> <img alt="$C(x)=[C_0(x), C_1(x), C_2(x)]$" class="align-center" src="https://eli.thegreenplace.net/images/math/9e4c42a11867dda976b1f7b1ac6aaa46b6625ee9.png" style="height: 19px;" /> <p>We pick the class where the probability is highest. Mathematically, we can use the <a class="reference external" href="https://en.wikipedia.org/wiki/Arg_max">argmax function</a> for this purpose. <em>argmax</em> returns the index of the maximal element in the given vector. For example, given:</p> <img alt="$C(x)=[0.45, 0.42, 0.09]$" class="align-center" src="https://eli.thegreenplace.net/images/math/983ab6a6770f41c06b3eb32f811678aab7f6fb5b.png" style="height: 19px;" /> <p>We get:</p> <img alt="$\underset{t \in [0..2]}{argmax}(C(x))=0$" class="align-center" src="https://eli.thegreenplace.net/images/math/1fa779c771c2d0abcaca9a759ab2e99608842f82.png" style="height: 34px;" /> <p>Therefore, the chosen class is 0. These class/index numbers are just labels of course. 
They can stand for anything depending on the problem domain: medical condition names, digits and so on.</p> <p>This approach doesn't require any additional math over what we've already covered in this post. <a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/master/logistic-regression/mnist_multinomial_classifier.py">This multinomial MNIST classifier code sample</a> implements it. The error rate it achieves is ~11%, similar to what <a class="reference external" href="http://yann.lecun.com/exdb/publis/index.html#lecun-98">LeCun's 1998 paper</a> achieved with a simple linear classifier. Much better than 11% can be done for MNIST, even with a single-layer linear model. However, my model is very far from the state of the art - there's no preprocessing, no artificially-enlarged training set, no adaptive learning rate; I didn't even spend time tuning the hyperparameters (regularization type and constants, learning rate, batch size, etc.). The goal here was just to demonstrate the basics of logistic regression, not to compete for the state of the art in MNIST.</p> </div> <div class="section" id="softmax"> <h2>Softmax</h2> <p>An alternative to OvA is to use the softmax function.
I covered softmax <a class="reference external" href="http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/">in some detail</a> previously; just briefly, softmax is a function <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1dd52a52398e38c9549b289449de49ba5fbb98b7.svg" style="height: 19px;" type="image/svg+xml">S(\mathbf{a}):\mathbb{R}^{N}\rightarrow \mathbb{R}^{N}</object> such that:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/5470218612381816a8c9a897d43201757560e646.svg" style="height: 46px;" type="image/svg+xml"> $S_j=\frac{e^{a_j}}{\sum_{k=1}^{N}e^{a_k}} \qquad \forall j \in 1..N$</object> <p>It is very useful for multiclass classification, since it lets us generate probabilities of the input belonging to one of <em>N</em> classes. Similarly to the OvA case, here we have to train 10 different parameter vectors, one for each digit. However, unlike OvA, this training doesn't happen separately but occurs at the same time. Instead of training a model to find a single parameter vector each time, we train a parameter <em>matrix</em> once.</p> <p>The model structure is as follows:</p> <img alt="Model of softmax logistic regression" class="align-center" src="https://eli.thegreenplace.net/images/2016/softmax-logistic-model.png" /> <p>I've chosen the number of classes to be 10 to reflect MNIST where we have 10 possible digits to assign to every input. In MNIST <em>N</em> is 785 (784 for each of 28x28 pixels in the image, plus one for bias). &quot;Logits&quot; is a common name to assign to the output of a fully connected layer (which is what we have with the matrix-vector multiplication in the first stage); the logits are arbitrary real numbers. The softmax function is responsible for squeezing them into the range of probabilities (0, 1) and making sure they all add up to 1.</p> <p>This diagram shows what happens to a single input as it goes through the model.
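The single-input flow in the diagram can be sketched in a few lines of NumPy. This is a toy version: the parameter matrix below is random, standing in for a trained one, and only the shapes and the softmax stage match the diagram.

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick;
    # it doesn't change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
N, T = 785, 10                           # 784 pixels + bias input; 10 classes
W = rng.normal(scale=0.01, size=(T, N))  # stand-in for a trained matrix
x = rng.normal(size=N)                   # one (fake) input vector

logits = W.dot(x)          # fully connected layer: arbitrary reals
probs = softmax(logits)    # probabilities in (0, 1), summing to 1
predicted_class = int(np.argmax(probs))
```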
In a realistic program, there will be another dimension - the batch dimension, used to vectorize the computation over a whole batch of inputs.</p> <p>For training this model, we need a loss function. It turns out cross-entropy is a very popular loss function to use for softmax. In the <a class="reference external" href="http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/">softmax post</a> I also covered how to compute the gradient of cross-entropy on a softmax, so we're all set to write some code: the <a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/master/logistic-regression/mnist_softmax_classifier.py">full sample is here</a>. Running it on MNIST for a couple of minutes produces a 9.5% error rate - slightly better than the OvA approach, but very close. This is to be expected, since OvA and softmax compute very similar results (finding the maximal probability from a set of probabilities), just in a different way. Softmax regression is much faster, however, since we can vectorize the training for all 10 digits in the same run.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id10" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>In this post I'm following many of the conventions established in my post on <a class="reference external" href="http://eli.thegreenplace.net/2016/linear-regression/">linear regression</a>. 
In particular, by construction <img alt="x_0=1" class="valign-m3" src="https://eli.thegreenplace.net/images/math/0c1d7f319728a07a57d000f2379b5215e4130147.png" style="height: 15px;" /> so that <img alt="\theta_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/ba6201ddbe2fd0bb66e0704ad8b3c6bdb36f37aa.png" style="height: 15px;" /> is the bias.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id12" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>Why? Because we have the bias as part of the model, so any constant offset can be absorbed into the learned bias.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id13" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>Note that this outcome is, once again, somewhat arbitrary. We could find another plane that intersects the x/y axis on the same line, and get a different classification. For example, if we flip the sign of all the elements of <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" />, we get the same intersection line. In that case, however, values &quot;to the right&quot; of the line give us <img alt="\hat{y}(x) &amp;lt; 0" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d8a7e77c45cecd8e4ba7c8f7d1f02944e9b55ecf.png" style="height: 18px;" />. Since the labels we attach are arbitrary, this really makes no difference. 
The only important thing is that we find a line that separates &quot;true&quot; from &quot;false&quot; samples, and that we remain consistent with our signs and labels throughout the process.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id14" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4"></a></td><td>Note that both the loss and the regularization are called <img alt="L_2" class="valign-m3" src="https://eli.thegreenplace.net/images/math/0d2398f5890edff3f40f1686fc3b51528209bf9b.png" style="height: 15px;" />. This is a bit confusing, but both are essentially 2-norms. It's best to ignore the name of the regularization factor and just refer to it as &quot;regularization&quot;. I thought it important to mention up front, though, because other kinds of regularization are used for machine-learning algorithms and I wanted to make it clear which one is being used here.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id15" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id5"></a></td><td>As an exercise, play with the code to increase or decrease the number of outliers (the code makes it easily controllable), and observe the effects on the misclassification rates of the different loss functions.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id16" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id6"></a></td><td>Note that using the logistic function on the model's output is strictly a generalization of the binary classifier.
We can still make a binary interpretation of the result if we're so inclined, interpreting <img alt="S(z) \geq 0.5" class="valign-m4" src="https://eli.thegreenplace.net/images/math/763035b41ff594d664c57d9fcc03c85808d0ccce.png" style="height: 18px;" /> as &quot;yes&quot; and otherwise as &quot;no&quot;. In terms of the input to <img alt="S(z)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/61bc9efb9d2c99669df519617ee7daee7670e156.png" style="height: 18px;" />, this means &quot;yes&quot; for <img alt="z=\hat{y}(x) \geq 0" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2599fe02308a2d43e5b29b2f9387ee45c5c67a1b.png" style="height: 18px;" /> which is exactly the formulation we've been using for the binary classifier.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id17" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id7"></a></td><td>In essence, cross entropy is computed between two probability distributions. Here, one of them is the &quot;real&quot; distribution observed in the <em>y</em> data. The other is what we predict given <em>X</em> data and our regression parameters <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" />. The observed real probability is either 0 or 1 for any given data item, and the corresponding predicted probability is our model's output. 
I also discussed cross-entropy in the <a class="reference external" href="http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/">post about softmax</a>.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id18" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id8"></a></td><td>Many resources online condense this formula to a single line without the condition: <img alt="C(x)=-ylog(S(x))-(1-y)log(1-S(x))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/0f9309397d7ef59c72cdb2d861e5532292978ca6.png" style="height: 18px;" />. I'm avoiding this formulation on purpose, because it requires the possible values of <em>y</em> to be 0 and 1, not -1 and +1. Although it's possible to play with constants a bit to reformulate the -1/+1 case in a similarly condensed fashion, I find the version with the condition more explicit and thus easier to follow, even if it requires a bit more typing.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id19" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id9"></a></td><td>See also <a class="reference external" href="http://eli.thegreenplace.net/2016/the-chain-rule-of-calculus/">my post</a> about the chain rule, where this derivation is shown.</td></tr> </tbody> </table> </div> The Softmax function and its derivative (2016-10-18, by Eli Bendersky)<p>The softmax function takes an N-dimensional vector of arbitrary real values and produces another N-dimensional vector with real values in the range (0, 1) that add up to 1.0. It maps <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1dd52a52398e38c9549b289449de49ba5fbb98b7.svg" style="height: 19px;" type="image/svg+xml">S(\mathbf{a}):\mathbb{R}^{N}\rightarrow \mathbb{R}^{N}</object>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/cd593d87595e496072aebf5100dd87c37c889f25.svg" style="height: 86px;" type="image/svg+xml"> \[S(\mathbf{a}):\begin{bmatrix} a_1\\ a_2\\ \cdots\\ a_N \end{bmatrix} \rightarrow \begin{bmatrix} S_1\\ S_2\\ \cdots\\ S_N \end{bmatrix}</object> <p>And the actual per-element formula is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/5470218612381816a8c9a897d43201757560e646.svg" style="height: 46px;" type="image/svg+xml"> $S_j=\frac{e^{a_j}}{\sum_{k=1}^{N}e^{a_k}} \qquad \forall j \in 1..N$</object> <p>It's easy to see that <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/cb8b5683be866b4c177c0c319e14085f25bec523.svg" style="height: 18px;" type="image/svg+xml">S_j</object> is always positive (because of the exponents); moreover, since the numerator appears in the denominator summed up with some other positive numbers, <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/5a34de9dd188a5a6f758bb0f7daabb58e03045ec.svg" style="height: 18px;"
type="image/svg+xml">S_j&lt;1</object>. Therefore, it's in the range (0, 1).</p> <p>For example, the 3-element vector <tt class="docutils literal">[1.0, 2.0, 3.0]</tt> gets transformed into <tt class="docutils literal">[0.09, 0.24, 0.67]</tt>. The order of elements by relative size is preserved, and they add up to 1.0. Let's tweak this vector slightly into <tt class="docutils literal">[1.0, 2.0, 5.0]</tt>. We get the output <tt class="docutils literal">[0.02, 0.05, 0.93]</tt>, which still preserves these properties. Note that as the last element is farther away from the first two, its softmax value dominates the overall slice of size 1.0 in the output. Intuitively, the softmax function is a &quot;soft&quot; version of the maximum function. Instead of just selecting one maximal element, softmax breaks the vector up into parts of a whole (1.0), with the maximal input element getting a proportionally larger chunk, but the other elements getting some of it as well <a class="footnote-reference" href="#id3" id="id1"></a>.</p> <div class="section" id="probabilistic-interpretation"> <h2>Probabilistic interpretation</h2> <p>The properties of softmax (all output values are in the range (0, 1) and sum up to 1.0) make it suitable for a probabilistic interpretation that's very useful in machine learning.
In particular, in multiclass classification tasks, we often want to assign probabilities that our input belongs to one of a set of output classes.</p> <p>If we have N output classes, we're looking for an N-vector of probabilities that sum up to 1; sounds familiar?</p> <p>We can interpret softmax as follows:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/4510f717b770547b90526c714355f4c81d1b4a50.svg" style="height: 19px;" type="image/svg+xml"> $S_j=P(y=j|a)$</object> <p>Where <em>y</em> is the output class numbered <object class="valign-m1" data="https://eli.thegreenplace.net/images/math/310debdf2f7fe03ad7888e95000c78a0efae5500.svg" style="height: 13px;" type="image/svg+xml">1..N</object>. <em>a</em> is any N-vector. The most basic example is <a class="reference external" href="http://eli.thegreenplace.net/2016/logistic-regression/">multiclass logistic regression</a>, where an input vector <em>x</em> is multiplied by a weight matrix <em>W</em>, and the result of this dot product is fed into a softmax function to produce probabilities. This architecture is explored in detail later in the post.</p> <p>It turns out that - from a probabilistic point of view - softmax is optimal for <a class="reference external" href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum-likelihood estimation</a> of the model's parameters. This is beyond the scope of this post, though. See chapter 5 of the <a class="reference external" href="http://www.deeplearningbook.org/">&quot;Deep Learning&quot; book</a> for more details.</p> </div> <div class="section" id="some-preliminaries-from-vector-calculus"> <h2>Some preliminaries from vector calculus</h2> <p>Before diving into computing the derivative of softmax, let's start with some preliminaries from vector calculus.</p> <p>Softmax is fundamentally a vector function. It takes a vector as input and produces a vector as output; in other words, it has multiple inputs and multiple outputs. 
Therefore, we cannot just ask for &quot;the derivative of softmax&quot;; we should instead specify:</p> <ol class="arabic simple"> <li>Which component (output element) of softmax we're seeking to find the derivative of.</li> <li>Since softmax has multiple inputs, with respect to which input element the partial derivative is computed.</li> </ol> <p>If this sounds complicated, don't worry. This is exactly why the notation of vector calculus was developed. What we're looking for are the partial derivatives:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/2eae0a040f9eb82a2cf0a596c926aca49a3cdb66.svg" style="height: 42px;" type="image/svg+xml"> $\frac{\partial S_i}{\partial a_j}$</object> <p>This is the partial derivative of the i-th output w.r.t. the j-th input. A shorter way to write it that we'll be using going forward is: <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/ca95d97dc85a733a280ccaab680d01727376e383.svg" style="height: 18px;" type="image/svg+xml">D_{j}S_i</object>.</p> <p>Since softmax is a <object class="valign-m1" data="https://eli.thegreenplace.net/images/math/91b745aec8f7c3a5501975b040a4aef477c31412.svg" style="height: 16px;" type="image/svg+xml">\mathbb{R}^{N}\rightarrow \mathbb{R}^{N}</object> function, the most general derivative we compute for it is the Jacobian matrix:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7af5ba48ed18f62f0fa31b60ba35e8e94054931c.svg" style="height: 76px;" type="image/svg+xml"> $DS=\begin{bmatrix} D_1 S_1 &amp; \cdots &amp; D_N S_1 \\ \vdots &amp; \ddots &amp; \vdots \\ D_1 S_N &amp; \cdots &amp; D_N S_N \end{bmatrix}$</object> <p>In ML literature, the term &quot;gradient&quot; is commonly used to stand in for the derivative.
Strictly speaking, gradients are only defined for scalar functions (such as loss functions in ML); for vector functions like softmax it's imprecise to talk about a &quot;gradient&quot;. The Jacobian is the fully general derivative of a vector function, but in most places I'll just be saying &quot;derivative&quot;.</p> </div> <div class="section" id="derivative-of-softmax"> <h2>Derivative of softmax</h2> <p>Let's compute <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/166e309484516e7fea86d27f36f42639ab73b471.svg" style="height: 18px;" type="image/svg+xml">D_j S_i</object> for arbitrary <em>i</em> and <em>j</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/dbee1c4ac839a1eef7202f447f754341eec98904.svg" style="height: 53px;" type="image/svg+xml"> $D_j S_i=\frac{\partial S_i}{\partial a_j}= \frac{\partial \frac{e^{a_i}}{\sum_{k=1}^{N}e^{a_k}}}{\partial a_j}$</object> <p>We'll be using the quotient rule of derivatives. For <object class="valign-m9" data="https://eli.thegreenplace.net/images/math/25ee22368ab19a6e8608ac7417cf62e235794e54.svg" style="height: 29px;" type="image/svg+xml">f(x) = \frac{g(x)}{h(x)}</object>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/c0fd805caf8b7d8336e8c52f2759b3ce73295315.svg" style="height: 43px;" type="image/svg+xml"> $f&#x27;(x) = \frac{g&#x27;(x)h(x) - h&#x27;(x)g(x)}{[h(x)]^2}$</object> <p>In our case, we have:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/167b7392a9d51fbc4016901d48995f091f627e3a.svg" style="height: 82px;" type="image/svg+xml"> \begin{align*} g_i&amp;=e^{a_i} \\ h_i&amp;=\sum_{k=1}^{N}e^{a_k} \end{align*}</object> <p>Note that no matter which <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/c2d2e987a5cb0df2f497d2dba0da0960fb6fbcc0.svg" style="height: 14px;" type="image/svg+xml">a_j</object> we compute the derivative of <object class="valign-m3"
data="https://eli.thegreenplace.net/images/math/969951984c96d748d949ee5e5322f4c2dbb75087.svg" style="height: 16px;" type="image/svg+xml">h_i</object> for, the answer will always be <object class="valign-0" data="https://eli.thegreenplace.net/images/math/a4c5fca09246e4e7c55473070976f788e032c514.svg" style="height: 12px;" type="image/svg+xml">e^{a_j}</object>. This is not the case for <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/d141c63d6e5b4ff91ec2936c9b320454461258a0.svg" style="height: 12px;" type="image/svg+xml">g_i</object>, however. The derivative of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/d141c63d6e5b4ff91ec2936c9b320454461258a0.svg" style="height: 12px;" type="image/svg+xml">g_i</object> w.r.t. <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/c2d2e987a5cb0df2f497d2dba0da0960fb6fbcc0.svg" style="height: 14px;" type="image/svg+xml">a_j</object> is <object class="valign-0" data="https://eli.thegreenplace.net/images/math/a4c5fca09246e4e7c55473070976f788e032c514.svg" style="height: 12px;" type="image/svg+xml">e^{a_j}</object> only if <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/8e4587fc82ce6377530643c5622b41e53cdf3dd3.svg" style="height: 16px;" type="image/svg+xml">i=j</object>, because only then <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/d141c63d6e5b4ff91ec2936c9b320454461258a0.svg" style="height: 12px;" type="image/svg+xml">g_i</object> has <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/c2d2e987a5cb0df2f497d2dba0da0960fb6fbcc0.svg" style="height: 14px;" type="image/svg+xml">a_j</object> anywhere in it.
Otherwise, the derivative is 0.</p> <p>Going back to our <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/166e309484516e7fea86d27f36f42639ab73b471.svg" style="height: 18px;" type="image/svg+xml">D_j S_i</object>; we'll start with the <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/8e4587fc82ce6377530643c5622b41e53cdf3dd3.svg" style="height: 16px;" type="image/svg+xml">i=j</object> case. Then, using the quotient rule we have:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/d7489693552878c00ad6788a0c8987416cbb0796.svg" style="height: 53px;" type="image/svg+xml"> $\frac{\partial \frac{e^{a_i}}{\sum_{k=1}^{N}e^{a_k}}}{\partial a_j}= \frac{{}e^{a_i}\Sigma-e^{a_j}e^{a_i}}{\Sigma^2}$</object> <p>For simplicity <object class="valign-0" data="https://eli.thegreenplace.net/images/math/cb5615b3fcee824f137c372e351ccca3ff3a3292.svg" style="height: 12px;" type="image/svg+xml">\Sigma</object> stands for <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/2c3662fbb97e3b5c528e8b1cdf89e108bfeed206.svg" style="height: 23px;" type="image/svg+xml">\sum_{k=1}^{N}e^{a_k}</object>. 
Reordering a bit:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/2634d0ab6532983a88a1f55a33cf6a6719a291ee.svg" style="height: 123px;" type="image/svg+xml"> \begin{align*} \frac{\partial \frac{e^{a_i}}{\sum_{k=1}^{N}e^{a_k}}}{\partial a_j}&amp;= \frac{e^{a_i}\Sigma-e^{a_j}e^{a_i}}{\Sigma^2}\\ &amp;=\frac{e^{a_i}}{\Sigma}\frac{\Sigma - e^{a_j}}{\Sigma}\\ &amp;=S_i(1-S_j) \end{align*}</object> <p>The final formula expresses the derivative in terms of <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/3e218c43050832e5df45f69fb2c8b8a01f7f5a52.svg" style="height: 15px;" type="image/svg+xml">S_i</object> itself - a common trick when functions with exponents are involved.</p> <p>Similarly, we can do the <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/09eca402f8bc6311cca3a98625e29e75cc336d31.svg" style="height: 17px;" type="image/svg+xml">i\ne j</object> case:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/d788a4ff0e07827862aaf0ded5befbf1665d90cc.svg" style="height: 123px;" type="image/svg+xml"> \begin{align*} \frac{\partial \frac{e^{a_i}}{\sum_{k=1}^{N}e^{a_k}}}{\partial a_j}&amp;= \frac{0-e^{a_j}e^{a_i}}{\Sigma^2}\\ &amp;=-\frac{e^{a_j}}{\Sigma}\frac{e^{a_i}}{\Sigma}\\ &amp;=-S_j S_i \end{align*}</object> <p>To summarize:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/f776365373202f727625c0be825d55a2fde47882.svg" style="height: 43px;" type="image/svg+xml"> $D_j S_i=\left\{\begin{matrix} S_i(1-S_j) &amp; i=j\\ -S_j S_i &amp; i\ne j \end{matrix}\right$</object> <p>I like seeing this explicit breakdown by cases, but if anyone is taking more pride in being concise and clever than programmers, it's mathematicians. This is why you'll find various &quot;condensed&quot; formulations of the same equation in the literature. 
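</p>

<p>Before looking at those condensed forms, here's a quick numerical sanity check of the two-case formula - a short NumPy sketch (the helper name <tt class="docutils literal">softmax_jacobian</tt> is mine) that builds the full Jacobian and compares every entry against a central finite difference:</p>

```python
import numpy as np

def softmax(a):
    exps = np.exp(a - np.max(a))
    return exps / np.sum(exps)

def softmax_jacobian(a):
    """DS with entries S_i * (1 - S_j) on the diagonal (i = j)
    and -S_j * S_i off the diagonal (i != j)."""
    S = softmax(a)
    J = -np.outer(S, S)   # the i != j case, everywhere...
    J += np.diag(S)       # ...then the diagonal becomes S_i * (1 - S_i)
    return J

# Check every entry against a central finite difference.
a = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(a)
eps = 1e-6
for i in range(3):
    for j in range(3):
        da = np.zeros_like(a)
        da[j] = eps
        numeric = (softmax(a + da)[i] - softmax(a - da)[i]) / (2 * eps)
        assert abs(J[i, j] - numeric) < 1e-8
```

<p>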
One of the most common ones is using the Kronecker delta function:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/ff38cb90472289e31bd7f79c1c85c455d7962cbb.svg" style="height: 43px;" type="image/svg+xml"> $\delta_{ij}=\left\{\begin{matrix} 1 &amp; i=j\\ 0 &amp; i\ne j \end{matrix}\right$</object> <p>To write:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/6e4b626a68faabba991f9d1e83a12c74fcec0e63.svg" style="height: 19px;" type="image/svg+xml"> $D_j S_i = S_i (\delta_{ij}-S_j)$</object> <p>Which is, of course, the same thing. There are a couple of other formulations one sees in the literature:</p> <ol class="arabic simple"> <li>Using the matrix formulation of the Jacobian directly to replace <object class="valign-0" data="https://eli.thegreenplace.net/images/math/3a6a16552e246af497720ffdfe6091b42d2f8938.svg" style="height: 12px;" type="image/svg+xml">\delta</object> with <object class="valign-0" data="https://eli.thegreenplace.net/images/math/ca73ab65568cd125c2d27a22bbd9e863c10b675d.svg" style="height: 12px;" type="image/svg+xml">I</object> - the identity matrix, whose elements express <object class="valign-0" data="https://eli.thegreenplace.net/images/math/3a6a16552e246af497720ffdfe6091b42d2f8938.svg" style="height: 12px;" type="image/svg+xml">\delta</object> in matrix form.</li> <li>Using &quot;1&quot; as the function name instead of the Kronecker delta, as follows: <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/a4fa3293a004c9dc1f5171ddb590ac9cb7178102.svg" style="height: 20px;" type="image/svg+xml">D_j S_i = S_i (1(i=j)-S_j)</object>.
Here <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/d9e260212cd116b69ffa42e9c9f824b2bcf6a217.svg" style="height: 18px;" type="image/svg+xml">1(i=j)</object> means the value 1 when <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/8e4587fc82ce6377530643c5622b41e53cdf3dd3.svg" style="height: 16px;" type="image/svg+xml">i=j</object> and the value 0 otherwise.</li> </ol> <p>The condensed notation comes in handy when we want to compute more complex derivatives that depend on the softmax derivative; otherwise we'd have to propagate the condition everywhere.</p> </div> <div class="section" id="computing-softmax-and-numerical-stability"> <h2>Computing softmax and numerical stability</h2> <p>A simple way of computing the softmax function on a given vector in Python is:</p> <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Compute the softmax of vector x.&quot;&quot;&quot;</span>
    <span class="n">exps</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">exps</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">exps</span><span class="p">)</span>
</pre></div> <p>Let's try it with the sample 3-element vector we've used as an example earlier:</p> <div class="highlight"><pre><span></span>In : softmax([1, 2, 3])
Out: array([ 0.09003057, 0.24472847, 0.66524096])
</pre></div> <p>However, if we run this function with larger numbers (or large negative numbers) we have a problem:</p> <div class="highlight"><pre><span></span>In : softmax([1000, 2000, 3000])
Out: array([ nan, nan, nan])
</pre></div> <p>The numerical range
of the floating-point numbers used by Numpy is limited. For <tt class="docutils literal">float64</tt>, the maximal representable number is on the order of <object class="valign-m1" data="https://eli.thegreenplace.net/images/math/91d9772e2d01d53580c14ba9801ea3303f45cac7.svg" style="height: 16px;" type="image/svg+xml">10^{308}</object>. Exponentiation in the softmax function makes it possible to easily overshoot this number, even for fairly modest-sized inputs.</p> <p>A nice way to avoid this problem is by normalizing the inputs to be not too large or too small, by observing that we can use an arbitrary constant <em>C</em> as follows:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/21c627f153906b6de2c2723f4a20629a610945ba.svg" style="height: 46px;" type="image/svg+xml"> $S_j=\frac{e^{a_j}}{\sum_{k=1}^{N}e^{a_k}}=\frac{Ce^{a_j}}{\sum_{k=1}^{N}Ce^{a_k}}$</object> <p>And then pushing the constant into the exponent, we get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/c5b631159b49e84338269e0943e00da2fb7f5d21.svg" style="height: 51px;" type="image/svg+xml"> $S_j=\frac{e^{a_j+log(C)}}{\sum_{k=1}^{N}e^{a_k+log(C)}}$</object> <p>Since <em>C</em> is just an arbitrary constant, we can instead write:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7ae51c811f1348f4762e3eee1a3cc9e8aad1890c.svg" style="height: 49px;" type="image/svg+xml"> $S_j=\frac{e^{a_j+D}}{\sum_{k=1}^{N}e^{a_k+D}}$</object> <p>Where <em>D</em> is also an arbitrary constant. This formula is equivalent to the original <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/cb8b5683be866b4c177c0c319e14085f25bec523.svg" style="height: 18px;" type="image/svg+xml">S_j</object> for any <em>D</em>, so we're free to choose a <em>D</em> that will make our computation better numerically. 
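</p>

<p>Before picking a particular <em>D</em>, it's easy to confirm this shift-invariance numerically with the naive formula, for shifts small enough to avoid overflow:</p>

```python
import numpy as np

def softmax(x):
    exps = np.exp(x)
    return exps / np.sum(exps)

# Adding any constant D to all inputs leaves the output unchanged,
# as long as we stay far from overflow.
a = np.array([1.0, 2.0, 3.0])
for D in (-3.0, 0.5, 10.0):
    assert np.allclose(softmax(a), softmax(a + D))
```

<p>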
A good choice is the maximum between all inputs, negated:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/0433b741304b0b54a6e11be1602b63d4b6326e98.svg" style="height: 18px;" type="image/svg+xml"> $D=-max(a_1, a_2, \cdots, a_N)$</object> <p>This will shift the inputs to a range close to zero, assuming the inputs themselves are not too far from each other. Crucially, it shifts them all to be negative (except the maximal <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/c2d2e987a5cb0df2f497d2dba0da0960fb6fbcc0.svg" style="height: 14px;" type="image/svg+xml">a_j</object> which turns into a zero). Negatives with large exponents &quot;saturate&quot; to zero rather than infinity, so we have a better chance of avoiding NaNs.</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">stablesoftmax</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Compute the softmax of vector x in a numerically stable way.&quot;&quot;&quot;</span>
    <span class="n">shiftx</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">exps</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">shiftx</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">exps</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">exps</span><span class="p">)</span>
</pre></div> <p>And now:</p> <div class="highlight"><pre><span></span>In : stablesoftmax([1000, 2000, 3000])
Out: array([ 0., 0., 1.])
</pre></div> <p>Note that this is still imperfect, since mathematically softmax
would never really produce a zero, but this is much better than NaNs, and since the distance between the inputs is very large it's expected to get a result extremely close to zero anyway.</p> </div> <div class="section" id="the-softmax-layer-and-its-derivative"> <h2>The softmax layer and its derivative</h2> <p>A common use of softmax appears in machine learning, in particular in logistic regression: the softmax &quot;layer&quot;, wherein we apply softmax to the output of a fully-connected layer (matrix multiplication):</p> <img alt="Generic softmax layer diagram" class="align-center" src="https://eli.thegreenplace.net/images/2016/softmax-layer-generic.png" /> <p>In this diagram, we have an input <em>x</em> with N features, and T possible output classes. The weight matrix <em>W</em> is used to transform <em>x</em> into a vector with T elements (called &quot;logits&quot; in ML folklore), and the softmax function is used to &quot;collapse&quot; the logits into a vector of probabilities denoting the probability of <em>x</em> belonging to each one of the T output classes.</p> <p>How do we compute the derivative of this &quot;softmax layer&quot; (fully-connected matrix multiplication followed by softmax)? Using the chain rule, of course! You'll find any number of derivations of this derivative online, but I want to approach it from first principles, by carefully applying the <a class="reference external" href="http://eli.thegreenplace.net/2016/the-chain-rule-of-calculus/">multivariate chain rule</a> to the Jacobians of the functions involved.</p> <p>An important point before we get started: you may think that <em>x</em> is a natural variable to compute the derivative for. But it's not. In fact, in machine learning we usually want to find the best weight matrix <em>W</em>, and thus it is <em>W</em> we want to update with every step of <a class="reference external" href="http://eli.thegreenplace.net/2016/understanding-gradient-descent">gradient descent</a>. 
Therefore, we'll be computing the derivative of this layer w.r.t. <em>W</em>.</p> <p>Let's start by rewriting this diagram as a composition of vector functions. First, we have the matrix multiplication, which we denote <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/a0e38e0d2b015bcbf88c39139b08982ae8b9529d.svg" style="height: 18px;" type="image/svg+xml">g(W)</object>. It maps <object class="valign-m1" data="https://eli.thegreenplace.net/images/math/41cbe7438e5529bcab383579b09d611cd97f0444.svg" style="height: 16px;" type="image/svg+xml">\mathbb{R}^{NT}\rightarrow \mathbb{R}^{T}</object>, because the input (matrix <em>W</em>) has <em>N times T</em> elements, and the output has T elements.</p> <p>Next we have the softmax. If we denote the vector of logits as <object class="valign-0" data="https://eli.thegreenplace.net/images/math/b3931f1ce298c536432fd324b3a1ab4337120689.svg" style="height: 12px;" type="image/svg+xml">\lambda</object>, we have <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9af2279e6f8c350d3e301ff7ed97ff2d23d2b478.svg" style="height: 19px;" type="image/svg+xml">S(\lambda):\mathbb{R}^{T}\rightarrow \mathbb{R}^{T}</object>. 
Overall, we have the function composition:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/10e8a3123f66fe60ae76a3fe83b2a9b73ea3fa57.svg" style="height: 45px;" type="image/svg+xml"> \begin{align*} P(W)&amp;=S(g(W)) \\ &amp;=(S\circ g)(W) \end{align*}</object> <p>By applying the multivariate chain rule, the Jacobian of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/f6dd867bfc20ac609f598f54ed834172e0985b0b.svg" style="height: 18px;" type="image/svg+xml">P(W)</object> is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/80f6a3c715eb405a68968e4c579d3a2b562cfab0.svg" style="height: 18px;" type="image/svg+xml"> $DP(W)=D(S\circ g)(W)=DS(g(W))\cdot Dg(W)$</object> <p>We've computed the Jacobian of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/7f3a73c41d966d0cade30c5b1fadd35290358a15.svg" style="height: 18px;" type="image/svg+xml">S(a)</object> earlier in this post; what's remaining is the Jacobian of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/a0e38e0d2b015bcbf88c39139b08982ae8b9529d.svg" style="height: 18px;" type="image/svg+xml">g(W)</object>. Since <em>g</em> is a very simple function, computing its Jacobian is easy; the only complication is dealing with the indices correctly. We have to keep track of which weight each derivative is for. 
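</p>

<p>Before deriving <em>Dg</em>, here is the forward computation of the whole layer in code - a minimal sketch with assumed sizes (T = 2 classes, N = 3 features; the names mirror the diagram, not code from elsewhere in the post):</p>

```python
import numpy as np

def softmax(a):
    exps = np.exp(a - np.max(a))  # stable variant
    return exps / np.sum(exps)

def softmax_layer(W, x):
    """Fully-connected layer followed by softmax: P(W) = S(g(W))."""
    logits = W.dot(x)        # g(W): a vector with T elements
    return softmax(logits)   # probabilities over the T classes

# Hypothetical sizes: T = 2 classes, N = 3 input features.
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])
x = np.array([1.0, -1.0, 2.0])
P = softmax_layer(W, x)
assert np.isclose(P.sum(), 1.0)
```

<p>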
Since <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/50dd3f482e6e8490b6b54b110c2b8e9018c6a607.svg" style="height: 19px;" type="image/svg+xml">g(W):\mathbb{R}^{NT}\rightarrow \mathbb{R}^{T}</object>, its Jacobian has <em>T</em> rows and <em>NT</em> columns:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/0d59698eb2307932fdb5a94b7f089da40688f368.svg" style="height: 76px;" type="image/svg+xml"> $Dg=\begin{bmatrix} D_1 g_1 &amp; \cdots &amp; D_{NT} g_1 \\ \vdots &amp; \ddots &amp; \vdots \\ D_1 g_T &amp; \cdots &amp; D_{NT} g_T \end{bmatrix}$</object> <p>In a sense, the weight matrix <em>W</em> is &quot;linearized&quot; to a vector of length <em>NT</em>. If you're familiar with the <a class="reference external" href="http://eli.thegreenplace.net/2015/memory-layout-of-multi-dimensional-arrays">memory layout of multi-dimensional arrays</a>, it should be easy to understand how it's done. In our case, one simple thing we can do is linearize it in row-major order, where the first row is consecutive, followed by the second row, etc. Mathematically, <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/14147644eaa95a20bf61a81af56045475f386a83.svg" style="height: 18px;" type="image/svg+xml">W_{ij}</object> will get column number <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ef7b2d987af3c0ceb75381d096c35e8c19085642.svg" style="height: 18px;" type="image/svg+xml">(i-1)N+j</object> in the Jacobian. 
To populate <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/38b655437da0880bd70168fcbadb50ebdbf46ca5.svg" style="height: 16px;" type="image/svg+xml">Dg</object>, let's recall what <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/434575851c19a9826fb6be1ca130ffa3243a2a34.svg" style="height: 12px;" type="image/svg+xml">g_1</object> is:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/64a7924d431e1a8e82f753f1f04943ddd619fedb.svg" style="height: 16px;" type="image/svg+xml"> $g_1=W_{11}x_1+W_{12}x_2+\cdots +W_{1N}x_N$</object> <p>Therefore:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/2a64f2f7fdb74ca1e0e3bf86da7e9874e8855928.svg" style="height: 177px;" type="image/svg+xml"> \begin{align*} D_1g_1&amp;=x_1 \\ D_2g_1&amp;=x_2 \\ \cdots \\ D_Ng_1&amp;=x_N \\ D_{N+1}g_1&amp;=0 \\ \cdots \\ D_{NT}g_1&amp;=0 \end{align*}</object> <p>If we follow the same approach to compute <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/eeb76bb8cb07245435e01abcd03dec71f9c051df.svg" style="height: 12px;" type="image/svg+xml">g_2...g_T</object>, we'll get the Jacobian matrix:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/5b0d880f118ea950dd4c676a9aad2e481d83b0bf.svg" style="height: 76px;" type="image/svg+xml"> $Dg=\begin{bmatrix} x_1 &amp; x_2 &amp; \cdots &amp; x_N &amp; \cdots &amp; 0 &amp; 0 &amp; \cdots &amp; 0 \\ \vdots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \vdots \\ 0 &amp; 0 &amp; \cdots &amp; 0 &amp; \cdots &amp; x_1 &amp; x_2 &amp; \cdots &amp; x_N \end{bmatrix}$</object> <p>Looking at it differently, if we split the index of <em>W</em> to <em>i</em> and <em>j</em>, we get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/3ca9791a8734377178476d2069bbb072b7e345ac.svg" style="height: 44px;" type="image/svg+xml"> 
\begin{align*} D_{ij}g_t&amp;=\frac{\partial(W_{t1}x_1+W_{t2}x_2+\cdots+W_{tN}x_N)}{\partial W_{ij}} &amp;= \left\{\begin{matrix} x_j &amp; i = t\\ 0 &amp; i \ne t \end{matrix}\right. \end{align*}</object> <p>This goes into row <em>t</em>, column <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ef7b2d987af3c0ceb75381d096c35e8c19085642.svg" style="height: 18px;" type="image/svg+xml">(i-1)N+j</object> in the Jacobian matrix.</p> <p>Finally, to compute the full Jacobian of the softmax layer, we just do a dot product between <object class="valign-0" data="https://eli.thegreenplace.net/images/math/2ee0d2dca289c3eb54f4cc5e98db8d63e9b0794b.svg" style="height: 12px;" type="image/svg+xml">DS</object> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/38b655437da0880bd70168fcbadb50ebdbf46ca5.svg" style="height: 16px;" type="image/svg+xml">Dg</object>. Note that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/be12618361f03651d2f459ce0fa3ac82aad3b766.svg" style="height: 19px;" type="image/svg+xml">P(W):\mathbb{R}^{NT}\rightarrow \mathbb{R}^{T}</object>, so the Jacobian dimensions work out. Since <object class="valign-0" data="https://eli.thegreenplace.net/images/math/2ee0d2dca289c3eb54f4cc5e98db8d63e9b0794b.svg" style="height: 12px;" type="image/svg+xml">DS</object> is <em>TxT</em> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/38b655437da0880bd70168fcbadb50ebdbf46ca5.svg" style="height: 16px;" type="image/svg+xml">Dg</object> is <em>TxNT</em>, their dot product <object class="valign-0" data="https://eli.thegreenplace.net/images/math/9f2059fa4172536236c9acfa22a911f918547e55.svg" style="height: 12px;" type="image/svg+xml">DP</object> is <em>TxNT</em>.</p> <p>In the literature you'll see a much shortened derivation of the derivative of the softmax layer. That's fine, since the two functions involved are simple and well known.
If we carefully compute a dot product between a row in <object class="valign-0" data="https://eli.thegreenplace.net/images/math/2ee0d2dca289c3eb54f4cc5e98db8d63e9b0794b.svg" style="height: 12px;" type="image/svg+xml">DS</object> and a column in <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/38b655437da0880bd70168fcbadb50ebdbf46ca5.svg" style="height: 16px;" type="image/svg+xml">Dg</object>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/699151b941880c8adf5d363048a97c6731482ed6.svg" style="height: 54px;" type="image/svg+xml"> $D_{ij}P_t=\sum_{k=1}^{T}D_kS_t\cdot D_{ij}g_k$</object> <p><object class="valign-m4" data="https://eli.thegreenplace.net/images/math/38b655437da0880bd70168fcbadb50ebdbf46ca5.svg" style="height: 16px;" type="image/svg+xml">Dg</object> is mostly zeros, so the end result is simpler. The only <em>k</em> for which <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/fca24bbbbf8cac80ccc0253802b13d2749770585.svg" style="height: 18px;" type="image/svg+xml">D_{ij}g_k</object> is nonzero is when <object class="valign-0" data="https://eli.thegreenplace.net/images/math/f4b7e42a4b8c52f40eb9458e68e81c74d70c1c61.svg" style="height: 13px;" type="image/svg+xml">i=k</object>; then it's equal to <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/73058e43db0f4edc791b10f27f913cbc5d361ab6.svg" style="height: 14px;" type="image/svg+xml">x_j</object>. Therefore:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/7f5cbb15243987230b4fa5741769938a78c9c2f2.svg" style="height: 44px;" type="image/svg+xml"> \begin{align*} D_{ij}P_t&amp;=D_iS_tx_j \\ &amp;=S_t(\delta_{ti}-S_i)x_j \end{align*}</object> <p>So it's entirely possible to compute the derivative of the softmax layer without actual Jacobian matrix multiplication; and that's good, because matrix multiplication is expensive! 
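</p>

<p>Here's a sketch (with small assumed sizes: T = 3 classes, N = 2 features) checking this closed form entry by entry against a finite-difference derivative of the whole layer w.r.t. each weight:</p>

```python
import numpy as np

def softmax(a):
    exps = np.exp(a - np.max(a))
    return exps / np.sum(exps)

# Hypothetical sizes: T = 3 classes, N = 2 features.
W = np.array([[0.1, -0.3],
              [0.8,  0.2],
              [-0.5, 0.7]])
x = np.array([1.5, -2.0])
S = softmax(W.dot(x))

eps = 1e-6
for t in range(3):          # output index
    for i in range(3):      # row of W
        for j in range(2):  # column of W
            # The closed form: S_t * (delta_ti - S_i) * x_j.
            closed = S[t] * ((1.0 if t == i else 0.0) - S[i]) * x[j]
            Wp = W.copy(); Wp[i, j] += eps
            Wm = W.copy(); Wm[i, j] -= eps
            numeric = (softmax(Wp.dot(x))[t] - softmax(Wm.dot(x))[t]) / (2 * eps)
            assert abs(closed - numeric) < 1e-7
```

<p>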
The reason we can avoid most computation is that the Jacobian of the fully-connected layer is <em>sparse</em>.</p> <p>That said, I still felt it's important to show how this derivative comes to life from first principles based on the composition of Jacobians for the functions involved. The advantage of this approach is that it works exactly the same for more complex compositions of functions, where the &quot;closed form&quot; of the derivative for each element is much harder to compute otherwise.</p> </div> <div class="section" id="softmax-and-cross-entropy-loss"> <h2>Softmax and cross-entropy loss</h2> <p>We've just seen how the softmax function is used as part of a machine learning network, and how to compute its derivative using the multivariate chain rule. While we're at it, it's worth taking a look at a loss function that's commonly used along with softmax for training a network: cross-entropy.</p> <p><a class="reference external" href="https://en.wikipedia.org/wiki/Cross_entropy">Cross-entropy</a> has an interesting probabilistic and information-theoretic interpretation, but here I'll just focus on the mechanics. For two discrete probability distributions <em>p</em> and <em>q</em>, the cross-entropy function is defined as:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/b26f68a12667ba254facf9815252f52ebf2238d9.svg" style="height: 38px;" type="image/svg+xml"> $xent(p,q)=-\sum_{k}p(k)log(q(k))$</object> <p>Where <em>k</em> goes over all the possible values of the random variable the distributions are defined for. Specifically, in our case there are <em>T</em> output classes, so <em>k</em> would go from 1 to <em>T</em>.</p> <p>We start from the softmax output <em>P</em> - this is one probability distribution <a class="footnote-reference" href="#id4" id="id2"></a>. The other probability distribution is the &quot;correct&quot; classification output, usually denoted by <em>Y</em>.
This is a one-hot encoded vector of size <em>T</em>, where all elements except one are 0.0, and one element is 1.0 - this element marks the correct class for the data being classified. Let's rephrase the cross-entropy loss formula for our domain:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/b02b400caa1de3f720f3c51b4891204a85a0d482.svg" style="height: 54px;" type="image/svg+xml"> $xent(Y, P)=-\sum_{k=1}^{T}Y(k)log(P(k))$</object> <p><em>k</em> goes over all the output classes. <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/1801d6549d7f256091d8d687062875facf870a80.svg" style="height: 18px;" type="image/svg+xml">P(k)</object> is the probability of the class as predicted by the model. <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/369b88be91e9aecb20f084f95946d171096ec2ad.svg" style="height: 18px;" type="image/svg+xml">Y(k)</object> is the &quot;true&quot; probability of the class as provided by the data. Let's mark the sole index where <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/bf2a1a90dbf5ee8f3e1240a2aff2b64220f3e876.svg" style="height: 18px;" type="image/svg+xml">Y(k)=1.0</object> by <em>y</em>. Since for all <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/e0e4ad3507e9dde8cc37658b436305ef9eb14ca0.svg" style="height: 17px;" type="image/svg+xml">k\ne y</object> we have <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/9d1a77958eb2fd853cb41001e41efcfa46a099d3.svg" style="height: 18px;" type="image/svg+xml">Y(k)=0</object>, the cross-entropy formula can be simplified to:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/ca79e575abc3ff07571f9b7bd9ee477c4cac1b7a.svg" style="height: 18px;" type="image/svg+xml"> $xent(Y, P)=-log(P(y))$</object> <p>Actually, let's make it a function of just <em>P</em>, treating <em>y</em> as a constant. 
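In code, the simplification is a one-liner; the probabilities and class index below are made up for illustration:

```python
import numpy as np

P = np.array([0.1, 0.7, 0.2])  # softmax output: a probability distribution
y = 1                          # index of the correct class
Y = np.zeros_like(P)
Y[y] = 1.0                     # one-hot "true" distribution

full = -np.sum(Y * np.log(P))  # -sum_k Y(k) * log(P(k))
simplified = -np.log(P[y])     # -log(P_y): all other terms vanish
assert np.isclose(full, simplified)
```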
Moreover, since in our case <em>P</em> is a vector, we can express <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/033e08901a43a52bb55ac6d36bcb0cebb8781a4e.svg" style="height: 18px;" type="image/svg+xml">P(y)</object> as the <em>y</em>-th element of <em>P</em>, or <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/12b5ad2733328bc7191f23d13e05c4e246bb8e26.svg" style="height: 18px;" type="image/svg+xml">P_y</object>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/e659a9fdd830a347c3aae214b31013eb52c59dc7.svg" style="height: 19px;" type="image/svg+xml"> $xent(P)=-log(P_y)$</object> <p>The Jacobian of <em>xent</em> is a <em>1xT</em> matrix (a row vector), since the output is a scalar and we have <em>T</em> inputs (the vector <em>P</em> has <em>T</em> elements):</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/2e515cd8235b0385a95e5cbfff5fbcca9a78c631.svg" style="height: 22px;" type="image/svg+xml"> $Dxent=\begin{bmatrix} D_1xent &amp; D_2xent &amp; \cdots &amp; D_Txent \end{bmatrix}$</object> <p>Now recall that <em>P</em> can be expressed as a function of input weights: <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ad179bfd313d392ad156b509370b8f407e7bd20a.svg" style="height: 18px;" type="image/svg+xml">P(W)=S(g(W))</object>. So we have another function composition:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/ab2f487f02c386d5f532900ffd0927c28ed23b7c.svg" style="height: 18px;" type="image/svg+xml"> $xent(W)=(xent\circ P)(W)=xent(P(W))$</object> <p>And we can, once again, use the multivariate chain rule to find the gradient of <em>xent</em> w.r.t. 
<em>W</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/ef938751c387283a7be6461ab0c244ac09db85be.svg" style="height: 18px;" type="image/svg+xml"> $Dxent(W)=D(xent\circ P)(W)=Dxent(P(W))\cdot DP(W)$</object> <p>Let's check that the dimensions of the Jacobian matrices work out. We already computed <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/3f90f5becd4cc377e50cd6885718feb039eabcc9.svg" style="height: 18px;" type="image/svg+xml">DP(W)</object>; it's <em>TxNT</em>. <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/676107d1b425649d04d82c75a37b391aa99edcf1.svg" style="height: 18px;" type="image/svg+xml">Dxent(P(W))</object> is <em>1xT</em>, so the resulting Jacobian <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/86f6a5ad8eb3128d2d86c826df3d8831403e64ac.svg" style="height: 18px;" type="image/svg+xml">Dxent(W)</object> is <em>1xNT</em>, which makes sense because the whole network has one output (the cross-entropy loss - a scalar value) and <em>NT</em> inputs (the weights).</p> <p>Here again, there's a straightforward way to find a simple formula for <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/86f6a5ad8eb3128d2d86c826df3d8831403e64ac.svg" style="height: 18px;" type="image/svg+xml">Dxent(W)</object>, since many elements in the matrix multiplication end up cancelling out. Note that <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/bb805dc98dfe8b48ded94e4f27a90e74b64371e4.svg" style="height: 18px;" type="image/svg+xml">xent(P)</object> depends only on the <em>y</em>-th element of <em>P</em>. 
Therefore, only <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/fb396ced0aaf5ee006e13bb7b0925ba833e01a12.svg" style="height: 18px;" type="image/svg+xml">D_{y}xent</object> is non-zero in the Jacobian:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/c2b2a7aa200023fd2988991212edc5053a85731e.svg" style="height: 22px;" type="image/svg+xml"> $Dxent=\begin{bmatrix} 0 &amp; 0 &amp; D_{y}xent &amp; \cdots &amp; 0 \end{bmatrix}$</object> <p>And <object class="valign-m10" data="https://eli.thegreenplace.net/images/math/ba6bd8869680cb3dab4a5138b909d4f4155ae6a8.svg" style="height: 26px;" type="image/svg+xml">D_{y}xent=-\frac{1}{P_y}</object>. Going back to the full Jacobian <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/86f6a5ad8eb3128d2d86c826df3d8831403e64ac.svg" style="height: 18px;" type="image/svg+xml">Dxent(W)</object>, we multiply <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/3845e2788792dc92a7072833fa019ce1182f4dbc.svg" style="height: 18px;" type="image/svg+xml">Dxent(P)</object> by each column of <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/31bc3dde97a870d7b85f78efe4d178d38eae0fdb.svg" style="height: 18px;" type="image/svg+xml">D(P(W))</object> to get each element in the resulting row-vector. Recall that the row vector represents the whole weight matrix <em>W</em> &quot;linearized&quot; in row-major order. 
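Before doing this multiplication symbolically, here's a numerical sketch of it (the sizes, seed, and correct-class index are invented for the example): build the sparse <em>1xT</em> row vector Dxent(P), multiply it by the <em>TxNT</em> Jacobian DP(W), and compare one entry of the result against a finite difference.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    e = np.exp(z - np.max(z))
    return e / e.sum()

T, N = 3, 4
rng = np.random.default_rng(42)
W = rng.standard_normal((T, N))
x = rng.standard_normal(N)
y = 2  # index of the correct class

S = softmax(W @ x)

# Dxent(P): 1xT row vector, nonzero only at position y (value -1/P_y).
Dxent_P = np.zeros((1, T))
Dxent_P[0, y] = -1.0 / S[y]

# DP(W): TxNT Jacobian with D_ij P_t = S_t * (delta(t, i) - S_i) * x_j,
# weights linearized in row-major order.
DP = np.zeros((T, T * N))
for t in range(T):
    for i in range(T):
        DP[t, i * N : (i + 1) * N] = S[t] * ((t == i) - S[i]) * x

# Chain rule: Dxent(W) = Dxent(P) . DP(W), a 1xNT row; reshape like W.
Dxent_W = (Dxent_P @ DP).reshape(T, N)

# Finite-difference check of one weight's effect on -log(P_y).
eps = 1e-6
i, j = 0, 3
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
numeric = (-np.log(softmax(Wp @ x)[y]) + np.log(softmax(Wm @ x)[y])) / (2 * eps)
assert abs(Dxent_W[i, j] - numeric) < 1e-6
```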
We'll index into it with <em>i</em> and <em>j</em> for clarity (<object class="valign-m6" data="https://eli.thegreenplace.net/images/math/d82e04a1bce5f5f685c8b6ac356997c847fa95a5.svg" style="height: 18px;" type="image/svg+xml">D_{ij}</object> points to element number <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/ef7b2d987af3c0ceb75381d096c35e8c19085642.svg" style="height: 18px;" type="image/svg+xml">(i-1)N+j</object> in the row vector):</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/b61c7d91efebf65b53f3dada643d86b63d06b6b5.svg" style="height: 54px;" type="image/svg+xml"> $D_{ij}xent(W)=\sum_{k=1}^{T}D_{k}xent(P)\cdot D_{ij}P_k(W)$</object> <p>Since only the <em>y</em>-th element in <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/2b703a6ad534070bbe698f8d8a3a1261b5bb4549.svg" style="height: 18px;" type="image/svg+xml">D_{k}xent(P)</object> is non-zero, we get the following, also substituting the derivative of the softmax layer from earlier in the post:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/d7823846aecfb3673906d65e8da6b290b7b2f608.svg" style="height: 68px;" type="image/svg+xml"> \begin{align*} D_{ij}xent(W)&amp;=D_{y}xent(P)\cdot D_{ij}P_y(W) \\ &amp;=-\frac{1}{P_y}\cdot S_y(\delta_{yi}-S_i)x_j \end{align*}</object> <p>By our definition, <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/2ec0ba51607b94096ad077ab55cc181698494e1a.svg" style="height: 18px;" type="image/svg+xml">P_y=S_y</object>, so we get:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/e417398a544821300668c777d55ad489934d744c.svg" style="height: 96px;" type="image/svg+xml"> \begin{align*} D_{ij}xent(W)&amp;=-\frac{1}{S_y}\cdot S_y(\delta_{yi}-S_i)x_j \\ &amp;=-(\delta_{yi}-S_i)x_j \\ &amp;=(S_i-\delta_{yi})x_j \end{align*}</object> <p>Once again, even though in this case the end result is nice and clean, it didn't 
necessarily have to be so. The formula for <object class="valign-m6" data="https://eli.thegreenplace.net/images/math/b0cfb602e63642cc6146ca57731821d6a9866a1e.svg" style="height: 20px;" type="image/svg+xml">D_{ij}xent(W)</object> could end up being a fairly involved sum (or sum of sums). The technique of multiplying Jacobian matrices is oblivious to all this, as the computer can do all the sums for us. All we have to do is compute the individual Jacobians, which is usually easier because they are for simpler, non-composed functions. This is the beauty and utility of the multivariate chain rule.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id3" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>To play more with sample inputs and Softmax outputs, Michael Nielsen's online book has a <a class="reference external" href="http://neuralnetworksanddeeplearning.com/chap3.html#softmax">nice interactive Javascript visualization</a> - check it out.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id4" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>Take a moment to recall that, by definition, the output of the softmax function is indeed a valid discrete probability distribution.</td></tr> </tbody> </table> </div> The Chain Rule of Calculus2016-10-10T06:24:00-07:002016-10-10T06:24:00-07:00Eli Benderskytag:eli.thegreenplace.net,2016-10-10:/2016/the-chain-rule-of-calculus/<p>The chain rule of derivatives is, in my opinion, the most important formula in differential calculus.
In this post I want to explain how the chain rule works for single-variable and multivariate functions, with some interesting examples along the way.</p> <div class="section" id="preliminaries-composition-of-functions-and-differentiability"> <h2>Preliminaries: composition of functions and differentiability</h2> <p>We denote a function …</p></div><p>The chain rule of derivatives is, in my opinion, the most important formula in differential calculus. In this post I want to explain how the chain rule works for single-variable and multivariate functions, with some interesting examples along the way.</p> <div class="section" id="preliminaries-composition-of-functions-and-differentiability"> <h2>Preliminaries: composition of functions and differentiability</h2> <p>We denote a function <em>f</em> that maps from the domain <em>X</em> to the codomain <em>Y</em> as <img alt="f:X \rightarrow Y" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e2f7fcdddf5b36735350a805eeb7cae36895ab1e.png" style="height: 16px;" />. With this <em>f</em> and given <img alt="g:Y \rightarrow Z" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e664f090a7bb62573ae65c910ef7c81e5f086cf6.png" style="height: 16px;" />, we can define <img alt="g \circ f:X \rightarrow Z" class="valign-m4" src="https://eli.thegreenplace.net/images/math/7f049e7749d289236edefaeb6399795a11afeb44.png" style="height: 16px;" /> as the composition of <em>g</em> and <em>f</em>. It's defined for <img alt="\forall x \in X" class="valign-m1" src="https://eli.thegreenplace.net/images/math/76545a2a780098fe8c8d581192fa77deccae0848.png" style="height: 14px;" /> as:</p> <img alt="$(g \circ f)(x)=g(f(x))$" class="align-center" src="https://eli.thegreenplace.net/images/math/8b9c8e67c9d2ec7fd3eefce043f380512f1230d3.png" style="height: 18px;" /> <p>In calculus we are usually concerned with the real number domain of some dimensionality. 
In the single-variable case, we can think of <img alt="f" class="valign-m4" src="https://eli.thegreenplace.net/images/math/4a0a19218e082a343a1b17e5333409af9d98f0f5.png" style="height: 16px;" /> and <img alt="g" class="valign-m4" src="https://eli.thegreenplace.net/images/math/54fd1711209fb1c0781092374132c66e79e2241b.png" style="height: 12px;" /> as two regular real-valued functions: <img alt="f:\mathbb{R} \rightarrow \mathbb{R}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/62ac71ec4fa066b12854a09cddef9ba062924d68.png" style="height: 16px;" /> and <img alt="g:\mathbb{R} \rightarrow \mathbb{R}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/974c68b8e4454d31c7a2eb389c94bbbfd11ac9da.png" style="height: 16px;" />.</p> <p>As an example, say <img alt="f(x)=x+1" class="valign-m4" src="https://eli.thegreenplace.net/images/math/027c36c348c172740dd168c66fbfe75d8a8da0c3.png" style="height: 18px;" /> and <img alt="g(x)=x^2" class="valign-m4" src="https://eli.thegreenplace.net/images/math/9b74ba3074b06d93dacb65e40b0082897aa85b3d.png" style="height: 19px;" />. Then:</p> <img alt="$(g \circ f)(x)=g(f(x))=g(x+1)=(x+1)^2$" class="align-center" src="https://eli.thegreenplace.net/images/math/f80635cd447f9f82452529c9289d16811394ea6c.png" style="height: 21px;" /> <p>We can compose the functions the other way around as well:</p> <img alt="$(f \circ g)(x)=f(g(x))=f(x^2)=x^2+1$" class="align-center" src="https://eli.thegreenplace.net/images/math/13c07f9e990c72b1edaf651fccec5c4ad7c0f155.png" style="height: 21px;" /> <p>Obviously, we shouldn't expect composition to be commutative. It is, however, associative. 
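In Python, composition and its non-commutativity are easy to demonstrate with these same two functions:

```python
def f(x):
    return x + 1

def g(x):
    return x * x

def compose(outer, inner):
    # Returns the function outer ∘ inner.
    return lambda x: outer(inner(x))

g_after_f = compose(g, f)  # (g ∘ f)(x) = (x + 1)^2
f_after_g = compose(f, g)  # (f ∘ g)(x) = x^2 + 1

assert g_after_f(3) == 16
assert f_after_g(3) == 10  # composition is not commutative
```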
<img alt="h \circ (g \circ f)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ef6897e4aad0050d8f69248de3ecd8aaa3ad51de.png" style="height: 18px;" /> and <img alt="(h \circ g) \circ f" class="valign-m4" src="https://eli.thegreenplace.net/images/math/03ac0c8bb4a409ff1ec1badfee9693280bb2f241.png" style="height: 18px;" /> are equivalent, and both end up being <img alt="h(g(f(x)))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/bc1a23574da8c77a4fc40d5cbbad2c5e1e95da86.png" style="height: 18px;" /> for <img alt="\forall x \in X" class="valign-m1" src="https://eli.thegreenplace.net/images/math/76545a2a780098fe8c8d581192fa77deccae0848.png" style="height: 14px;" />.</p> <p>To better handle compositions in one's head it sometimes helps to denote the independent variable of the outer function (<em>g</em> in our case) by a different letter (such as <img alt="g(a)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e7373233d49e18a0882e0dce41d9d6aa26964d6b.png" style="height: 18px;" />). For simple cases it doesn't matter, but I'll be using this technique occasionally throughout the article. The important thing to remember here is that the name of the independent variable is completely arbitrary, and we should always be able to replace it by another name throughout the formula without any semantic change.</p> <p>The other preliminary I want to mention is <em>differentiability</em>. 
The function <em>f</em> is differentiable at some point <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> if the following limit exists:</p> <img alt="$\lim_{h \to 0}\frac{f(x_0+h)-f(x_0)}{h}$" class="align-center" src="https://eli.thegreenplace.net/images/math/34b3ce83a20775cf99b8d204d2b845dfde5727cc.png" style="height: 39px;" /> <p>This limit is then the derivative of <em>f</em> at the point <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" />, or <img alt="{f}&amp;#x27;(x_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e9b2a1134fcdc276843ee4b522039359117026ee.png" style="height: 18px;" />. Another way to express this is <img alt="\frac{d}{dx}f(x_0)" class="valign-m6" src="https://eli.thegreenplace.net/images/math/b0d6f765abf215972d5dbb982f77f1a83c233066.png" style="height: 22px;" />. Note that <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> can be any arbitrary point on the real line. I sometimes say something like &quot;<em>f</em> is differentiable at <img alt="g(x_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/9d8c0deeca951fab05e474395fbb9fab226cf1f2.png" style="height: 18px;" />&quot;. 
Here too, <img alt="g(x_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/9d8c0deeca951fab05e474395fbb9fab226cf1f2.png" style="height: 18px;" /> is just a real value that happens to be the value of the function <em>g</em> at <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" />.</p> </div> <div class="section" id="the-single-variable-chain-rule"> <h2>The single-variable chain rule</h2> <p>The chain rule for single-variable functions states: if <em>g</em> is differentiable at <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> and <em>f</em> is differentiable at <img alt="g(x_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/9d8c0deeca951fab05e474395fbb9fab226cf1f2.png" style="height: 18px;" />, then <img alt="f \circ g" class="valign-m4" src="https://eli.thegreenplace.net/images/math/1247a6ac0bc07bfdbd790831aa70b0b000bad2e4.png" style="height: 16px;" /> is differentiable at <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> and its derivative is:</p> <img alt="$(f \circ g)&amp;#x27;(x_0)={f}&amp;#x27;(g(x_0)){g}&amp;#x27;(x_0)$" class="align-center" src="https://eli.thegreenplace.net/images/math/77fb8b77b35d687c20379179b0178ebdd9b2cee1.png" style="height: 20px;" /> <p>The proof of the chain rule is a bit tricky - I left it for the appendix. However, we can get a better feel for it using some intuition and a couple of examples.</p> <p>First, the intuition.
By definition:</p> <img alt="${g}&amp;#x27;(x_0)=\lim_{h \to 0}\frac{g(x_0+h)-g(x_0)}{h}$" class="align-center" src="https://eli.thegreenplace.net/images/math/cdc3e4a3bced3a7527a15cd76a688d5cc1c06aab.png" style="height: 39px;" /> <p>Multiplying both sides by <em>h</em> we get <a class="footnote-reference" href="#id6" id="id1"></a>:</p> <img alt="${g}&amp;#x27;(x_0)h=\lim_{h \to 0}g(x_0+h)-g(x_0)$" class="align-center" src="https://eli.thegreenplace.net/images/math/daf52cabed3806986d4c8c29dd60e4ce4fa9247d.png" style="height: 29px;" /> <p>Therefore we can say that when <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> changes by some very small amount, <img alt="g(x_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/9d8c0deeca951fab05e474395fbb9fab226cf1f2.png" style="height: 18px;" /> changes by <img alt="{g}&amp;#x27;(x_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/fbb4d7279a750f6d80eebeff2e2c25765b304f16.png" style="height: 18px;" /> times that small amount.</p> <p>Similarly <img alt="{f}&amp;#x27;(a_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d04139c8f65536c3042f975a6966ed49f5f15832.png" style="height: 18px;" /> is the amount of change in the value of <em>f</em> for some very small change from <img alt="a_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/4a5997da73aadd118038761e69d01e24586bf958.png" style="height: 11px;" />. 
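Numerically, this "small shift" intuition reads: g(x0 + h) - g(x0) is approximately g'(x0) * h for small h. A tiny sketch, with an arbitrary example function and step size:

```python
import math

g, g_prime = math.sin, math.cos  # example function and its derivative
x0, h = 1.0, 1e-4                # a point and a small shift

actual_change = g(x0 + h) - g(x0)
predicted_change = g_prime(x0) * h
# The two agree up to a second-order (h^2) error term.
assert abs(actual_change - predicted_change) < 1e-7
```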
However, since in our case we compose <img alt="f \circ g" class="valign-m4" src="https://eli.thegreenplace.net/images/math/1247a6ac0bc07bfdbd790831aa70b0b000bad2e4.png" style="height: 16px;" />, we can say that <img alt="a_0=g(x_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e198d0bc24284bd638c564e0b46edf975d5831d4.png" style="height: 18px;" />, evaluating <img alt="f(g(x_0))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/04a8bf9b7bd565f95f2cb3e0fe6de123b247e3be.png" style="height: 18px;" />. Suppose we shift <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> by a small amount <em>h</em>. This causes <img alt="g(x_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/9d8c0deeca951fab05e474395fbb9fab226cf1f2.png" style="height: 18px;" /> to shift by <img alt="{g}&amp;#x27;(x_0)h" class="valign-m4" src="https://eli.thegreenplace.net/images/math/0e1e11ca765684cf07722c40de2bd86b208ca7c1.png" style="height: 18px;" />. So the input <img alt="a_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/4a5997da73aadd118038761e69d01e24586bf958.png" style="height: 11px;" /> of <em>f</em> shifted by <img alt="{g}&amp;#x27;(x_0)h" class="valign-m4" src="https://eli.thegreenplace.net/images/math/0e1e11ca765684cf07722c40de2bd86b208ca7c1.png" style="height: 18px;" /> - this is still a small amount! Therefore, the total change in the value of <em>f</em> should be <img alt="{f}&amp;#x27;(g(x_0)){g}&amp;#x27;(x_0)h" class="valign-m4" src="https://eli.thegreenplace.net/images/math/b761eb11c7502754575d0413e7ba040f4a106d0d.png" style="height: 18px;" /> <a class="footnote-reference" href="#id7" id="id2"></a>.</p> <p>Now, a couple of simple examples. 
Let's take the function <img alt="f(x)=(x+1)^2" class="valign-m4" src="https://eli.thegreenplace.net/images/math/8db433d3f263ad489e31931ef4a3ddccbd7bece0.png" style="height: 19px;" />. The idea is to think of this function as a composition of simpler functions. In this case, one option is: <img alt="g(x)=x+1" class="valign-m4" src="https://eli.thegreenplace.net/images/math/8b2ec3a2221203b211c8a0ed975841cb508b193c.png" style="height: 18px;" /> and then <img alt="w(g(x))=g(x)^2" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2ed44118a1efadf34f5bf169d2ca450246519d1d.png" style="height: 19px;" />, so the original <em>f</em> is now the composition <img alt="w \circ g" class="valign-m4" src="https://eli.thegreenplace.net/images/math/4edc28332d30c68727a56fbd473126441850c4f0.png" style="height: 12px;" />.</p> <p>The derivative of this composition is <img alt="{w}&amp;#x27;(g(x)){g}&amp;#x27;(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/05261e48f79f6e8b129bb26dee7fa8a07bcbf876.png" style="height: 18px;" />, or <img alt="2(x+1)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/39b598cf32e125c7ae18b7623043d5f8133eba78.png" style="height: 18px;" /> since <img alt="{g}&amp;#x27;(x)=1" class="valign-m4" src="https://eli.thegreenplace.net/images/math/36cc7eeced1b708dcf6166dcaae955f733f93ded.png" style="height: 18px;" />. Note that <em>w</em> is differentiable at any point, so this derivative always exists.</p> <p>Another example will use a longer chain of composition. Let's differentiate <img alt="f(x)=sin[(x+1)^2]" class="valign-m5" src="https://eli.thegreenplace.net/images/math/3e3a23e0dd5d4ee105bcca545bddb058917e2c9c.png" style="height: 20px;" />. 
This is a composition of three functions:</p> <img alt="\begin{align*} g(x)&amp;amp;=x+1\\ w(x)&amp;amp;=x^2\\ v(x)&amp;amp;=sin(x) \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/6981c04536025d8e43d07bf9b067252c2028feab.png" style="height: 73px;" /> <p>Function composition is associative, so <em>f</em> can be expressed as either <img alt="v \circ (w \circ g)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/1c2a8a63ec4fb6e489b0896b544e277823228906.png" style="height: 18px;" /> or <img alt="(v \circ w) \circ g" class="valign-m4" src="https://eli.thegreenplace.net/images/math/be47fface92aa5db8bade3049da31d065ef8244b.png" style="height: 18px;" />. Since we already know what the derivative of <img alt="w \circ g" class="valign-m4" src="https://eli.thegreenplace.net/images/math/4edc28332d30c68727a56fbd473126441850c4f0.png" style="height: 12px;" /> is, let's use the former:</p> <img alt="\begin{align*} \frac{df(x)}{dx}=\frac{d v(w(g(x)))}{dx}&amp;amp;={v}&amp;#x27;(w(g(x))){w(g(x))}&amp;#x27;(x)\\ &amp;amp;=cos(w(g(x)))2(x+1)\\ &amp;amp;=2cos[(x+1)^2](x+1) \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/f63f9a07295583911873238c3ee6e84e8c3722ca.png" style="height: 93px;" /> </div> <div class="section" id="the-chain-rule-as-a-computational-procedure"> <h2>The chain rule as a computational procedure</h2> <p>As the last example demonstrates, the chain rule can be applied multiple times in a single derivation. This makes the chain rule a powerful tool for computing derivatives of very complex functions, which can be broken up into compositions of simpler functions. I like to draw a parallel between this process and programming; a function in a programming language can be seen as a computational procedure - we have a set of input parameters and we produce outputs. On the way, several transformations happen that can be expressed mathematically. 
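Both closed-form derivatives computed in the previous section can be checked against central differences; the test point and tolerance below are arbitrary:

```python
import math

def numeric_deriv(fn, x, h=1e-6):
    # Central-difference approximation of the derivative at x.
    return (fn(x + h) - fn(x - h)) / (2 * h)

f1 = lambda x: (x + 1) ** 2             # derivative: 2(x + 1)
f2 = lambda x: math.sin((x + 1) ** 2)   # derivative: 2cos[(x + 1)^2](x + 1)

x0 = 0.7
assert abs(numeric_deriv(f1, x0) - 2 * (x0 + 1)) < 1e-6
assert abs(numeric_deriv(f2, x0) - 2 * math.cos((x0 + 1) ** 2) * (x0 + 1)) < 1e-6
```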
These transformations are composed, so their derivatives can be computed naturally with the chain rule.</p> <p>This may be somewhat abstract, so let's use another example. We'll compute the derivative of the Sigmoid function - a very important function in machine learning:</p> <img alt="$S(x)=\frac{1}{1+e^{-x}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/9a39d0495ce32da5840b76adaf508a0349394c49.png" style="height: 38px;" /> <p>To make the equivalence between functions and computational procedures clearer, let's think how we'd compute <em>S</em> in Python:</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">math</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">))</span> </pre></div> <p>This doesn't look much different, but that's just because Python is a high level language with arbitrarily nested expressions. Its VM (or the CPU in general) would execute this computation step by step. 
Let's break it up to be clearer, assuming we can only apply a single operation at every step:</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">x</span><span class="p">):</span> <span class="n">f</span> <span class="o">=</span> <span class="o">-</span><span class="n">x</span> <span class="n">g</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> <span class="n">w</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">g</span> <span class="n">v</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">/</span> <span class="n">w</span> <span class="k">return</span> <span class="n">v</span> </pre></div> <p>I hope you're starting to see the resemblance to our chain rule examples at this point. Sacrificing some rigor in the notation for the sake of expressiveness, we can write:</p> <img alt="$S&amp;#x27;=v&amp;#x27;(w)w&amp;#x27;(g)g&amp;#x27;(f)f&amp;#x27;(x)$" class="align-center" src="https://eli.thegreenplace.net/images/math/b3029d842b915e7bf0ea1aa91372ab071dd8b80e.png" style="height: 20px;" /> <p>This is the chain rule applied to <img alt="v \circ (w \circ (g \circ f))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/caaf8ea9ee60bb84d61d422c6dee5d6cd173f0ab.png" style="height: 18px;" />. 
Solving this is easy because every single derivative in the chain above is trivial:</p> <img alt="\begin{align*} S&amp;#x27;&amp;amp;=v&amp;#x27;(w)w&amp;#x27;(g)g&amp;#x27;(f)(-1)\\ &amp;amp;=v&amp;#x27;(w)w&amp;#x27;(g)e^{-x}(-1)\\ &amp;amp;=v&amp;#x27;(w)(1)e^{-x}(-1)\\ &amp;amp;=\frac{-1}{(1+e^{-x})^2}e^{-x}(-1)\\ &amp;amp;=\frac{e^{-x}}{(1+e^{-x})^2} \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/b987461a2551ca622908f40f791519f3afe3b452.png" style="height: 171px;" /> <p>Now you may be thinking:</p> <ol class="arabic simple"> <li>Every function computable by a program can be broken down to trivial steps like our <tt class="docutils literal">sigmoid</tt> above.</li> <li>Using the chain rule, we can easily find the derivative of such a sequence of steps... therefore:</li> <li>We can easily find the derivative of any function computable by a program!!</li> </ol> <p>And you'll be right. This is precisely the basis for the technique known as <a class="reference external" href="https://en.wikipedia.org/wiki/Automatic_differentiation">automatic differentiation</a>, which is widely used in scientific computing. The most notable use of automatic differentiation in recent times is the backpropagation algorithm - an essential backbone of modern machine learning. I personally find automatic differentiation fascinating, and will write a dedicated article about it in the future.</p> </div> <div class="section" id="multivariate-chain-rule-general-formulation"> <h2>Multivariate chain rule - general formulation</h2> <p>So far this article has been looking at functions with a single input and output: <img alt="f:\mathbb{R} \to \mathbb{R}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2e28467c90f978580e43c376716981ec5906a01d.png" style="height: 16px;" />.
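As an aside, the sigmoid derivative computed in the previous section is easy to verify numerically; it also equals the form S(x)(1 - S(x)) that is common in machine-learning texts (a fact not shown above, but a one-line algebraic rearrangement):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def dsigmoid(x):
    # e^{-x} / (1 + e^{-x})^2, as derived via the chain rule.
    return math.exp(-x) / (1 + math.exp(-x)) ** 2

x0, h = 0.5, 1e-6
numeric = (sigmoid(x0 + h) - sigmoid(x0 - h)) / (2 * h)
assert abs(dsigmoid(x0) - numeric) < 1e-8
# Equivalent, more common form of the same derivative:
assert abs(dsigmoid(x0) - sigmoid(x0) * (1 - sigmoid(x0))) < 1e-12
```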
In the most general case of multi-variate calculus, we're dealing with functions that map from <em>n</em> dimensions to <em>m</em> dimensions: <img alt="f:\mathbb{R}^{n} \to \mathbb{R}^{m}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/13f219789047343729036279bb11630db317d98d.png" style="height: 16px;" />. Because every one of the <em>m</em> outputs of <em>f</em> can be considered a separate function dependent on <em>n</em> variables, it's very natural to deal with such math using vectors and matrices.</p> <p>First let's define some notation. We'll consider the outputs of <em>f</em> to be numbered from 1 to <em>m</em> as <img alt="f_1,f_2 \dots f_m" class="valign-m4" src="https://eli.thegreenplace.net/images/math/93b446c5209263534d09d617bbede21101d6536e.png" style="height: 16px;" />. For each such <img alt="f_i" class="valign-m4" src="https://eli.thegreenplace.net/images/math/68bd0dc647944d362ec8df628a22967b91d82c80.png" style="height: 16px;" /> we can compute its partial derivative by any of the <em>n</em> inputs as:</p> <img alt="$D_j f_i(a)=\frac{\partial f_i}{\partial a_j}(a)$" class="align-center" src="https://eli.thegreenplace.net/images/math/30881b5a92e45259714ba01c7a12fbf8f6c56109.png" style="height: 42px;" /> <p>Where <em>j</em> goes from 1 to <em>n</em> and <em>a</em> is a vector with <em>n</em> components. 
If <em>f</em> is differentiable at <em>a</em> <a class="footnote-reference" href="#id8" id="id3"></a> then the derivative of <em>f</em> at <em>a</em> is the <em>Jacobian matrix</em>:</p> <img alt="$Df(a)=\begin{bmatrix} D_1 f_1(a) &amp;amp; \cdots &amp;amp; D_n f_1(a) \\ \vdots &amp;amp; &amp;amp; \vdots \\ D_1 f_m(a) &amp;amp; \cdots &amp;amp; D_n f_m(a) \\ \end{bmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/ab09367d48e9ef4d8bc2314a60313dec700193af.png" style="height: 76px;" /> <p>The multivariate chain rule states: given <img alt="g:\mathbb{R}^n \to \mathbb{R}^m" class="valign-m4" src="https://eli.thegreenplace.net/images/math/b4b7d25491897b053abf7e48688fada4a85368bd.png" style="height: 16px;" /> and <img alt="f:\mathbb{R}^m \to \mathbb{R}^p" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ac8a6cea4e02e885538fc3ef969c5733e84712f9.png" style="height: 16px;" /> and a point <img alt="a \in \mathbb{R}^n" class="valign-m1" src="https://eli.thegreenplace.net/images/math/43a85f2c59f396fe5c4e2c403a0453c463fcfb0d.png" style="height: 13px;" />, if <em>g</em> is differentiable at <em>a</em> and <em>f</em> is differentiable at <img alt="g(a)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e7373233d49e18a0882e0dce41d9d6aa26964d6b.png" style="height: 18px;" /> then the composition <img alt="f \circ g" class="valign-m4" src="https://eli.thegreenplace.net/images/math/1247a6ac0bc07bfdbd790831aa70b0b000bad2e4.png" style="height: 16px;" /> is differentiable at <em>a</em> and its derivative is:</p> <img alt="$D(f \circ g)(a)=Df(g(a)) \cdot Dg(a)$" class="align-center" src="https://eli.thegreenplace.net/images/math/00bdefa904bd34df2dfb50cc385e6497c4e5096e.png" style="height: 18px;" /> <p>Which is the matrix multiplication of <img alt="Df(g(a))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e567730c48bb2f95c258b630b4d6e997043e09ab.png" style="height: 18px;" /> and <img alt="Dg(a)" class="valign-m4" 
src="https://eli.thegreenplace.net/images/math/2575fc98e794a733a7aa6237fe67246a41e6c8c5.png" style="height: 18px;" /> <a class="footnote-reference" href="#id9" id="id4"></a>. Intuitively, the multivariate chain rule mirrors the single-variable one (and as we'll soon see, the latter is just a special case of the former) with derivatives replaced by derivative matrices. From linear algebra, we represent linear transformations by matrices, and the composition of two linear transformations is the product of their matrices. Therefore, since derivative matrices - like derivatives in one dimension - are a linear approximation to the function, the chain rule makes sense. This is a really nice connection between linear algebra and calculus, though a full proof of the multivariate rule is very technical and outside the scope of this article.</p> </div> <div class="section" id="multivariate-chain-rule-examples"> <h2>Multivariate chain rule - examples</h2> <p>Since the chain rule deals with compositions of functions, it's natural to present examples from the world of parametric curves and surfaces. For example, suppose we define <img alt="f(x,y,z)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/c5d72ae6186c76bde08c693d4bfdb85e3201125d.png" style="height: 18px;" /> as a scalar function <img alt="\mathbb{R}^3 \to \mathbb{R}" class="valign-m1" src="https://eli.thegreenplace.net/images/math/1862a20e93e78e42aafd20106ceabe142def19f1.png" style="height: 16px;" /> giving the temperature at some point in 3D. 
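</p>

<p>Before diving into the example, the Jacobian is easy to make concrete in code. The following sketch (mine, not from the article; the helper name <em>numerical_jacobian</em> is an assumption) approximates the Jacobian of an arbitrary function mapping <em>n</em> inputs to <em>m</em> outputs using central finite differences with NumPy:</p>

```python
import numpy as np

def numerical_jacobian(f, a, eps=1e-6):
    """Approximate the Jacobian of f at point a by central differences.

    f: maps a length-n vector to a length-m vector.
    a: the length-n point at which to evaluate the Jacobian.
    Returns an (m, n) matrix whose (i, j) entry approximates D_j f_i(a).
    """
    a = np.asarray(a, dtype=float)
    n = a.shape[0]
    m = np.atleast_1d(f(a)).shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        # Central difference in the j-th coordinate direction.
        J[:, j] = (np.atleast_1d(f(a + step)) -
                   np.atleast_1d(f(a - step))) / (2 * eps)
    return J
```

<p>For instance, for f(v) = (v[0]v[1], v[0]+v[1]) the analytic Jacobian at (2, 3) is [[3, 2], [1, 1]], and the approximation recovers it to within the step size.</p>

<p>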
Now imagine that we're moving through this 3D space on a curve defined by a function <img alt="g:\mathbb{R} \to \mathbb{R}^3" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e97099dd54f45a2a71a33d305c517ec97565909d.png" style="height: 19px;" /> which takes time <em>t</em> and gives the coordinates <img alt="x(t),y(t),z(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/4e2bdd3060e49f3494f68f99cb6d2204b2a19e1c.png" style="height: 18px;" /> at that time. We want to compute how the temperature changes as a function of time <em>t</em> - how do we do that? Recall that the temperature is not a direct function of time, but rather is a function of location, while location <em>is</em> a function of time. Therefore, we'll want to compose <img alt="f \circ g" class="valign-m4" src="https://eli.thegreenplace.net/images/math/1247a6ac0bc07bfdbd790831aa70b0b000bad2e4.png" style="height: 16px;" />. Here's a concrete example:</p> <img alt="$g(t)=\begin{pmatrix} t\\ t^2\\ t^3 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/cdaff94ebfb318ec24f472be470497e28a091c42.png" style="height: 65px;" /> <p>And:</p> <img alt="$f\begin{pmatrix} x \\ y \\ z \end{pmatrix}=x^2+xyz+5y$" class="align-center" src="https://eli.thegreenplace.net/images/math/0a2fc40b06886d3b54628680192d71a3186d9fc7.png" style="height: 65px;" /> <p>If we reformulate <em>x</em>, <em>y</em> and <em>z</em> as functions of <em>t</em>:</p> <object class="align-center" data="https://eli.thegreenplace.net/images/math/36f726e2fe10b99ab5d216310d2fe91d61f24c46.svg" style="height: 21px;" type="image/svg+xml"> $f(x(t),y(t),z(t))=x(t)^2+x(t)y(t)z(t)+5y(t)$</object> <p>Composing <img alt="f \circ g" class="valign-m4" src="https://eli.thegreenplace.net/images/math/1247a6ac0bc07bfdbd790831aa70b0b000bad2e4.png" style="height: 16px;" />, we get:</p> <img alt="$(f \circ g)(t)=f(g(t))=f(t,t^2,t^3)=t^2+t^6+5t^2=6t^2+t^6$" class="align-center"
src="https://eli.thegreenplace.net/images/math/63ad25f62a0e93b1f8175a627aac0a29a88a3cca.png" style="height: 21px;" /> <p>Since this is a simple function, we can find its derivative directly:</p> <img alt="$(f \circ g)&amp;#x27;(t)=12t+6t^5$" class="align-center" src="https://eli.thegreenplace.net/images/math/d1025880b042d304efe08de37eeafde5a8d9231c.png" style="height: 21px;" /> <p>Now let's repeat this exercise using the multivariate chain rule. To compute <img alt="D(f \circ g)(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/c20cc5474ef67f0ec35bddccdc59b72742a864e1.png" style="height: 18px;" /> we need <img alt="Df(g(t))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ded52fd957c2b251c84052c335523b80a4e3c945.png" style="height: 18px;" /> and <img alt="Dg(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ec8c49e88582659c617e6563375355ede5fe1090.png" style="height: 18px;" />. Let's start with <img alt="Dg(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ec8c49e88582659c617e6563375355ede5fe1090.png" style="height: 18px;" />. 
<img alt="g(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/851fb8b00904a32dff1c79d40158c7ec9d3d5254.png" style="height: 18px;" /> maps <img alt="\mathbb{R} \to \mathbb{R}^3" class="valign-m1" src="https://eli.thegreenplace.net/images/math/0354b4368db3496b963c21b446ad726b65a0ab90.png" style="height: 16px;" />, so its Jacobian is a 3-by-1 matrix, or column vector:</p> <img alt="$Dg(t)=\begin{bmatrix} 1 \\ 2t \\ 3t^2 \end{bmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/492d3e9013352e0cd44e3c5721cd0535174fb318.png" style="height: 65px;" /> <p>To compute <img alt="Df(g(t))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ded52fd957c2b251c84052c335523b80a4e3c945.png" style="height: 18px;" /> let's first find <img alt="Df(x,y,z)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/dab2e6dc478f82ef76bff84080623a27fe214dec.png" style="height: 18px;" />. Since <img alt="f(x,y,z)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/c5d72ae6186c76bde08c693d4bfdb85e3201125d.png" style="height: 18px;" /> maps <img alt="\mathbb{R}^3 \to \mathbb{R}" class="valign-m1" src="https://eli.thegreenplace.net/images/math/1862a20e93e78e42aafd20106ceabe142def19f1.png" style="height: 16px;" />, its Jacobian is a 1-by-3 matrix, or row vector:</p> <img alt="$Df(x,y,z)=\begin{bmatrix} 2x+yz &amp;amp; xz+5 &amp;amp; xy \end{bmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/e8d650cac68d341d2c99c2641be3d238e516e51c.png" style="height: 22px;" /> <p>To apply the chain rule, we need <img alt="Df(g(t))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ded52fd957c2b251c84052c335523b80a4e3c945.png" style="height: 18px;" />:</p> <img alt="$Df(g(t))=\begin{bmatrix} 2t+t^5 &amp;amp; t^4+5 &amp;amp; t^3 \end{bmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/b061977c12dcc918a96473939f6dc01eb7ea7847.png" style="height: 22px;" /> <p>Finally, 
multiplying <img alt="Df(g(t))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ded52fd957c2b251c84052c335523b80a4e3c945.png" style="height: 18px;" /> by <img alt="Dg(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ec8c49e88582659c617e6563375355ede5fe1090.png" style="height: 18px;" />, we get:</p> <img alt="\begin{align*} D(f \circ g)(t)=Df(g(t)) \cdot Dg(t)&amp;amp;=\begin{bmatrix} 2t+t^5 &amp;amp; t^4+5 &amp;amp; t^3 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 2t \\ 3t^2 \end{bmatrix}\\ &amp;amp;=2t+t^5+2t^6+10t+3t^5\\ &amp;amp;=12t+6t^5 \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/9c5a5fc3e8024f6d1f2364ad5d0433bb530d4987.png" style="height: 118px;" /> <p>Another interesting way to interpret this result for the case where <img alt="f:\mathbb{R}^3 \to \mathbb{R}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d307aff95a39ad62cc090e4d6e3bd73b1ffc2b14.png" style="height: 19px;" /> and <img alt="g:\mathbb{R} \to \mathbb{R}^3" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e97099dd54f45a2a71a33d305c517ec97565909d.png" style="height: 19px;" /> is to <a class="reference external" href="http://eli.thegreenplace.net/2016/understanding-gradient-descent">recall that</a> the directional derivative of <em>f</em> in the direction of some vector <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> is:</p> <img alt="$D_{\vec{v}}f=(\nabla f) \cdot \vec{v}$" class="align-center" src="https://eli.thegreenplace.net/images/math/49933775272512c4c8686d9f9692c8ea01e1c97d.png" style="height: 18px;" /> <p>In our case <img alt="(\nabla f)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/cf1f51ce22cf132c44f5cd65c1c6ada1cce0347f.png" style="height: 18px;" /> is the Jacobian of <em>f</em> (because of its dimensionality). 
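</p>

<p>The worked example is easy to check numerically. Here's a short sketch (mine, not part of the original article) that evaluates the right-hand side of the chain rule - the product of the two Jacobians computed above - and compares it with the directly-derived derivative 12t + 6t^5:</p>

```python
import numpy as np

# The article's example: g(t) = (t, t^2, t^3) and f(x, y, z) = x^2 + xyz + 5y.
def Dg(t):
    # Jacobian of g: a 3-by-1 column vector.
    return np.array([[1.0], [2 * t], [3 * t ** 2]])

def Df(x, y, z):
    # Jacobian of f: a 1-by-3 row vector of partial derivatives.
    return np.array([[2 * x + y * z, x * z + 5, x * y]])

def chain_rule_derivative(t):
    # D(f∘g)(t) = Df(g(t)) · Dg(t) is a 1x1 matrix; extract the scalar.
    x, y, z = t, t ** 2, t ** 3
    return (Df(x, y, z) @ Dg(t))[0, 0]

# The Jacobian product agrees with the direct derivative 12t + 6t^5.
for t in [0.0, 0.5, 1.0, 2.0]:
    assert np.isclose(chain_rule_derivative(t), 12 * t + 6 * t ** 5)
```

<p>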
So if we take <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> to be the vector <img alt="Dg(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ec8c49e88582659c617e6563375355ede5fe1090.png" style="height: 18px;" />, and evaluate the gradient at <img alt="g(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/851fb8b00904a32dff1c79d40158c7ec9d3d5254.png" style="height: 18px;" /> we get <a class="footnote-reference" href="#id10" id="id5"></a>:</p> <img alt="$D_{\vec{Dg(t)}}f(t)=(\nabla f(g(t))) \cdot Dg(t)$" class="align-center" src="https://eli.thegreenplace.net/images/math/dc8e045fe902682ada36e08fa0099f95632b7ced.png" style="height: 24px;" /> <p>This gives us some additional intuition for the temperature change question. The change in temperature as a function of time is the directional derivative of <em>f</em> in the direction of the change in location (<img alt="Dg(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ec8c49e88582659c617e6563375355ede5fe1090.png" style="height: 18px;" />).</p> <p>For additional examples of applying the chain rule, see <a class="reference external" href="http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/">my post about softmax</a>.</p> </div> <div class="section" id="tricks-with-the-multivariate-chain-rule-derivative-of-products"> <h2>Tricks with the multivariate chain rule - derivative of products</h2> <p>Earlier in the article we've seen how the chain rule helps find derivatives of complicated functions by decomposing them into simpler functions. The multivariate chain rule allows even more of that, as the following example demonstrates. Suppose <img alt="h(x)=f(x)g(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/6b584f7e739b604fe2d90a983216090d25643ad1.png" style="height: 18px;" />. 
Then, the well-known <a class="reference external" href="https://en.wikipedia.org/wiki/Product_rule">product rule</a> of derivatives states that:</p> <img alt="$h&amp;#x27;(x)=f&amp;#x27;(x)g(x)+f(x)g&amp;#x27;(x)$" class="align-center" src="https://eli.thegreenplace.net/images/math/6c77a942dbee351e8229ce7771680b6a2f55c4aa.png" style="height: 20px;" /> <p>Proving this from first principles (the definition of the derivative as a limit) isn't hard, but I want to show how it stems very easily from the multivariate chain rule.</p> <p>Let's begin by re-formulating <img alt="h(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2a1862ca703d9d9a76538d74b8f4b71df93bafab.png" style="height: 18px;" /> as a composition of two functions. The first takes a vector <img alt="\vec{s}" class="valign-0" src="https://eli.thegreenplace.net/images/math/6a16290a6fe4bd5b30bf2cf959214e8fa4924959.png" style="height: 13px;" /> in <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" /> and maps it to <img alt="\mathbb{R}" class="valign-0" src="https://eli.thegreenplace.net/images/math/0ed839b111fe0e3ca2b2f618b940893eaea88a57.png" style="height: 12px;" /> by computing the product of its two components:</p> <img alt="$p(\vec{s})=s_1 s_2$" class="align-center" src="https://eli.thegreenplace.net/images/math/955d480267a38ec452bcdf2774dadc7652a757fa.png" style="height: 18px;" /> <p>The second is a vector-valued function that maps a number <img alt="x \in \mathbb{R}" class="valign-m1" src="https://eli.thegreenplace.net/images/math/ec7e4961c34351c48080f6190b6ec363af9adf25.png" style="height: 13px;" /> to <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" /> :</p> <img alt="$s(x)=\begin{pmatrix} f(x)\\ g(x) \end{pmatrix}$" class="align-center" 
src="https://eli.thegreenplace.net/images/math/f5c473fb1fb5ee47e59414a91dc484e182bc6210.png" style="height: 43px;" /> <p>We can compose <img alt="p \circ s" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3f1c954d3481a1a167ae311bc3c3980aaf1ee3a1.png" style="height: 12px;" />, producing a function that takes a scalar and returns a scalar: <img alt="(p \circ s) : \mathbb{R} \to \mathbb{R}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/c827ac5b7598c117c157d0377b8c30a0f9a72b81.png" style="height: 18px;" />. We get:</p> <img alt="$h(x)=(p \circ s)(x) = f(x)g(x)$" class="align-center" src="https://eli.thegreenplace.net/images/math/3cbae5f44d32653bd6bbc66e6ee8bb5e1a4dfe40.png" style="height: 18px;" /> <p>Since we're composing two multivariate functions, we can apply the multivariate chain rule here:</p> <img alt="\begin{align*} D(p \circ s) &amp;amp;= Dp(s(x)) \cdot Ds(x)\\ &amp;amp;=\begin{bmatrix} \frac{\partial p}{\partial s_1}(x) &amp;amp; \frac{\partial p}{\partial s_2}(x) \end{bmatrix}\cdot \begin{bmatrix} {s_1}&amp;#x27;(x)\\ {s_2}&amp;#x27;(x) \end{bmatrix}\\ &amp;amp;=\begin{bmatrix} s_2(x) &amp;amp; s_1(x) \end{bmatrix} \cdot \begin{bmatrix} {s_1}&amp;#x27;(x)\\ {s_2}&amp;#x27;(x) \end{bmatrix}\\ &amp;amp;={s_1}&amp;#x27;(x)s_2(x)+{s_2}&amp;#x27;(x)s_1(x) \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/ee8bd27a8257039f72c8751eb78626521f12a5fa.png" style="height: 147px;" /> <p>Since <img alt="s_1(x)=f(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/dc67440057222e8222ae08269e4ba2a1e58acbb4.png" style="height: 18px;" /> and <img alt="s_2(x)=g(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/89d252d983d49126f2f4a34fcf01fd6c882e4792.png" style="height: 18px;" />, this is exactly the product rule.</p> </div> <div class="section" id="connecting-the-single-variable-and-multivariate-chain-rules"> <h2>Connecting the single-variable and multivariate chain
rules</h2> <p>Given function <img alt="f(x) : \mathbb{R} \to \mathbb{R}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/bd80387ffb4e5cd8702c12837a57f1806ea1d02b.png" style="height: 18px;" />, its Jacobian matrix has a single entry:</p> <img alt="$Df(a)=\begin{bmatrix}D_{x}f(a)\end{bmatrix}= \begin{bmatrix}\frac{df}{dx}(a)\end{bmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/cc95d53415b32e6610c1a45bededb4fb584f0c64.png" style="height: 24px;" /> <p>Therefore, given two functions mapping <img alt="\mathbb{R} \to \mathbb{R}" class="valign-m1" src="https://eli.thegreenplace.net/images/math/4aaeb3aafa05a9ad54c8d7da4e4aecad4dfac1cd.png" style="height: 13px;" />, the derivative of their composition using the multivariate chain rule is:</p> <img alt="$D(f \circ g)(a)=Df(g(a))\cdot Dg(a)=f&amp;#x27;(g(a))g&amp;#x27;(a)$" class="align-center" src="https://eli.thegreenplace.net/images/math/98e554584c9d2d967b9a6759a64126093ef704ce.png" style="height: 20px;" /> <p>Which is precisely the single-variable chain rule. This results from matrix multiplication between two 1x1 matrices, which ends up being just the product of their single entries.</p> </div> <div class="section" id="appendix-proving-the-single-variable-chain-rule"> <h2>Appendix: proving the single-variable chain rule</h2> <p>It turns out that many online resources (including Khan Academy) provide a flawed proof for the chain rule. It's flawed due to a careless division by a quantity that may be zero. This flaw can be corrected by making the proof somewhat more complicated; I won't take that road here - for details see Spivak's <em>Calculus</em>. 
Instead, I'll present a simpler proof inspired by the one I found at <a class="reference external" href="http://math.rice.edu/~cjd/">Casey Douglas's site</a>.</p> <p>We want to prove that:</p> <img alt="$(f \circ g)&amp;#x27;(x)={f}&amp;#x27;(g(x)){g}&amp;#x27;(x)$" class="align-center" src="https://eli.thegreenplace.net/images/math/29f4194c9af3777ae55a15dad972a145eb7797be.png" style="height: 20px;" /> <p>Note that previously we defined derivatives at some concrete point <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" />. Here for the sake of brevity I'll just use <img alt="x" class="valign-0" src="https://eli.thegreenplace.net/images/math/11f6ad8ec52a2984abaafd7c3b516503785c2072.png" style="height: 8px;" /> as an arbitrary point, assuming the derivative exists.</p> <p>Let's start with the definition of <img alt="g&amp;#x27;(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/5fba18e1364e151399f95daac5ed63f09feba9b7.png" style="height: 18px;" />:</p> <img alt="${g}&amp;#x27;(x)=\lim_{h \to 0}\frac{g(x+h)-g(x)}{h}$" class="align-center" src="https://eli.thegreenplace.net/images/math/c19f7ddc43c3046489d7e012c3f213403edf7e8a.png" style="height: 39px;" /> <p>We can reorder it as follows:</p> <img alt="$\lim_{h \to 0}\left [ \frac{g(x+h)-g(x)}{h} - g&amp;#x27;(x) \right ] = 0$" class="align-center" src="https://eli.thegreenplace.net/images/math/74a651394036af8aeaba69650dba26ccb4f90ae7.png" style="height: 43px;" /> <p>Let's give the part in the brackets the name <img alt="\Delta g(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ba109afb4d6264ec2d39fe025fcc5a1dbc58637f.png" style="height: 18px;" />.</p> <p>Similarly, if the function <em>f</em> is differentiable at the point <img alt="a=g(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/01bcc664dda02e9d98a5f37104ff028cf8fd0d62.png" style="height: 18px;" />, we 
have:</p> <img alt="$f&amp;#x27;(a)=\lim_{k \to 0}\frac{f(a+k)-f(a)}{k}$" class="align-center" src="https://eli.thegreenplace.net/images/math/59daea2a46cd244229625131297a773820501571.png" style="height: 39px;" /> <p>We reorder:</p> <img alt="$\lim_{k \to 0}\left [ \frac{f(a+k)-f(a)}{k} - f&amp;#x27;(a) \right ] = 0$" class="align-center" src="https://eli.thegreenplace.net/images/math/4600064fad365f360bd73063324a935a8b73266f.png" style="height: 43px;" /> <p>And call the part in the brackets <img alt="\Delta f(a)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/a85d05996c54eec8a9bef9a60f8e7e4f3231aa51.png" style="height: 18px;" />. The choice of <em>k</em> instead of <em>h</em> as the variable going to zero is arbitrary, and helps simplify the discussion that follows.</p> <p>Let's reorder the definition of <img alt="\Delta g(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ba109afb4d6264ec2d39fe025fcc5a1dbc58637f.png" style="height: 18px;" /> a bit:</p> <img alt="$g(x+h)=g(x)+[g&amp;#x27;(x)+\Delta g(x)]h$" class="align-center" src="https://eli.thegreenplace.net/images/math/59e0263f8a2ebfc0fac9a2b51f42c651b359fe31.png" style="height: 21px;" /> <p>We can apply <em>f</em> to both sides:</p> <img alt="$\begin{equation} f(g(x+h))=f(g(x)+[g&amp;#x27;(x)+\Delta g(x)]h) \tag{1} \end{equation}$" class="align-center" src="https://eli.thegreenplace.net/images/math/3b82da9d6cad509490e687b9e86093791545ea81.png" style="height: 21px;" /> <p>By reordering the definition of <img alt="\Delta f(a)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/a85d05996c54eec8a9bef9a60f8e7e4f3231aa51.png" style="height: 18px;" /> we get:</p> <img alt="$\begin{equation} f(a+k)=f(a)+[f&amp;#x27;(a)+\Delta f(a)]k \tag{2} \end{equation}$" class="align-center" src="https://eli.thegreenplace.net/images/math/88c5b43f3ba89da3853be9342381aa8dd60e024f.png" style="height: 21px;" /> <p>Now taking the right-hand side of (1), we can look at it as <img
alt="f(a+k)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/831a544de22e2a6a8997413b576a67391ba31f53.png" style="height: 18px;" /> since <img alt="a=g(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/01bcc664dda02e9d98a5f37104ff028cf8fd0d62.png" style="height: 18px;" /> and we can define <img alt="k=[g&amp;#x27;(x)+\Delta g(x)]h" class="valign-m5" src="https://eli.thegreenplace.net/images/math/ed49f313283b8f266ffd1e9b4194c36f456a950d.png" style="height: 19px;" />. We still have <em>k</em> going to zero when <em>h</em> goes to zero. Substituting these <em>a</em> and <em>k</em> into (2), we get:</p> <img alt="$f(a+k)=f(g(x))+[f&amp;#x27;(g(x))+\Delta f(g(x))][g&amp;#x27;(x)+\Delta g(x)]h$" class="align-center" src="https://eli.thegreenplace.net/images/math/275b3323c68b711b2458e4c748a887a368e32a40.png" style="height: 21px;" /> <p>So, starting from (1) again, we have:</p> <img alt="\begin{align*} f(g(x+h))&amp;amp;=f(a+k) \\ &amp;amp;=f(g(x))+[f&amp;#x27;(g(x))+\Delta f(g(x))][g&amp;#x27;(x)+\Delta g(x)]h \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/82e67cf24d9eb3dad58e7d30cd89ba1c19e367fb.png" style="height: 45px;" /> <p>Subtracting <img alt="f(g(x))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/92cb7139e348ea05a69782b2cf7221bae86a2b03.png" style="height: 18px;" /> from both sides and dividing by <em>h</em> (which is legal since <em>h</em> is not zero - just very small) we get:</p> <img alt="$\frac{f(g(x+h))-f(g(x))}{h}=[f&amp;#x27;(g(x))+\Delta f(g(x))][g&amp;#x27;(x)+\Delta g(x)]$" class="align-center" src="https://eli.thegreenplace.net/images/math/bfdfef3d46b471aa5d6803c3c5a6b5e26ffe3b37.png" style="height: 39px;" /> <p>Apply a limit to both sides:</p> <img alt="$\lim_{h \to 0} \frac{f(g(x+h))-f(g(x))}{h}= \lim_{h \to 0} [f&amp;#x27;(g(x))+\Delta f(g(x))][g&amp;#x27;(x)+\Delta g(x)]$" class="align-center"
src="https://eli.thegreenplace.net/images/math/0f5c316fcc2877f78b8a739898a31120471dd401.png" style="height: 39px;" /> <p>Now recall that both <img alt="\Delta f(g(x))" class="valign-m4" src="https://eli.thegreenplace.net/images/math/6752886c71e1a95dc360dd4e5ea10dd0b6f76e84.png" style="height: 18px;" /> and <img alt="\Delta g(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ba109afb4d6264ec2d39fe025fcc5a1dbc58637f.png" style="height: 18px;" /> go to 0 when <em>h</em> goes to 0. Taking this into account, we get:</p> <img alt="$\lim_{h \to 0} \frac{f(g(x+h))-f(g(x))}{h}= f&amp;#x27;(g(x))g&amp;#x27;(x)$" class="align-center" src="https://eli.thegreenplace.net/images/math/3954d8d23c8fb53d4cd1732d19939d650ef830ae.png" style="height: 39px;" /> <p><em>Q.E.D.</em></p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id6" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>Here, as in the rest of the post, I'm being careless with the usage of <img alt="\lim" class="valign-0" src="https://eli.thegreenplace.net/images/math/6f5c7776306147fe3be3e4b8547a23c62eafddf4.png" style="height: 13px;" />, sometimes leaving its existence to be implicit. 
In general, wherever <em>h</em> appears in a formula we know there's a <img alt="\lim_{h \to 0}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/0f10af054e5ddc3b9603098fec294e0247190efa.png" style="height: 17px;" /> there, whether explicitly or not.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id7" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>An alternative way to think about it is: suppose the functions <em>f</em> and <em>g</em> are linear: <img alt="f(x)=ax+b" class="valign-m4" src="https://eli.thegreenplace.net/images/math/a85393d5068f5c4bc36ff7efed535a8f1a686848.png" style="height: 18px;" /> and <img alt="g(x)=cx+d" class="valign-m4" src="https://eli.thegreenplace.net/images/math/6d712cb582caa0e48a2b029ea4ae29a3e5e40f27.png" style="height: 18px;" />. Then the chain rule is trivially true. But now recall what the derivative is. The derivative at some point <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> is the best linear approximation for the function at that point. 
Therefore the chain rule is true for any pair of differentiable functions - even when the functions are not linear, we approximate their rate of change in an infinitesimal area around <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> with a linear function.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>The condition for <em>f</em> being differentiable at <em>a</em> is stronger than simply saying that all partial derivatives exist at <em>a</em>, but I won't spend more time on this subtlety here.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id9" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4"></a></td><td>As an exercise, verify that the matrix dimensions of <img alt="Df" class="valign-m4" src="https://eli.thegreenplace.net/images/math/5c6bf530660cba6530e83a86f0ed49fe0821d179.png" style="height: 16px;" /> and <object class="valign-m4" data="https://eli.thegreenplace.net/images/math/38b655437da0880bd70168fcbadb50ebdbf46ca5.svg" style="height: 16px;" type="image/svg+xml">Dg</object> make this multiplication valid.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id10" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id5"></a></td><td>It shouldn't be surprising that we get here, since the definition of the directional derivative as the gradient <a class="reference external" href="http://eli.thegreenplace.net/2016/understanding-gradient-descent">was derived</a> using the multivariate chain rule.</td></tr> </tbody> </table> </div> Linear
regression2016-08-06T05:28:00-07:002016-08-06T05:28:00-07:00Eli Benderskytag:eli.thegreenplace.net,2016-08-06:/2016/linear-regression/<p>Linear regression is one of the most basic, and yet most useful approaches for predicting a single quantitative (real-valued) variable given any number of real-valued predictors. This article presents the basics of linear regression for the &quot;simple&quot; (single-variable) case, as well as for the more general multivariate case. <a class="reference external" href="https://github.com/eliben/deep-learning-samples/tree/master/linear-regression">Companion code …</a></p><p>Linear regression is one of the most basic, and yet most useful approaches for predicting a single quantitative (real-valued) variable given any number of real-valued predictors. This article presents the basics of linear regression for the &quot;simple&quot; (single-variable) case, as well as for the more general multivariate case. <a class="reference external" href="https://github.com/eliben/deep-learning-samples/tree/master/linear-regression">Companion code in Python</a> implements the techniques described in the article on simulated and realistic data sets. The code is self-contained, using only Numpy as a dependency.</p> <div class="section" id="simple-linear-regression"> <h2>Simple linear regression</h2> <p>The most basic kind of regression problem has a single <em>predictor</em> (the input) and a single outcome. 
Given a list of input values <img alt="x_i" class="valign-m3" src="https://eli.thegreenplace.net/images/math/34e03e6559b14df9fe5a97bbd2ed10109dfebbd3.png" style="height: 11px;" /> and corresponding output values <img alt="y_i" class="valign-m4" src="https://eli.thegreenplace.net/images/math/35c2ac2f82d0ff8f9011b596ed7e54bfcc55f471.png" style="height: 12px;" />, we have to find parameters <em>m</em> and <em>b</em> such that the linear function:</p> <img alt="$\hat{y}(x) = mx + b$" class="align-center" src="https://eli.thegreenplace.net/images/math/2dabbcda3b1953b08211f7e334698366d647d697.png" style="height: 18px;" /> <p>Is &quot;as close as possible&quot; to the observed outcome <em>y</em>. More concretely, suppose we get this data <a class="footnote-reference" href="#id6" id="id1"></a>:</p> <img alt="Linear regression input data" class="align-center" src="https://eli.thegreenplace.net/images/2016/linreg-data.png" /> <p>We have to find a slope <em>m</em> and intercept <em>b</em> for a line that approximates this data as well as possible. We evaluate how well some pair of <em>m</em> and <em>b</em> approximates the data by defining a &quot;cost function&quot;. 
For linear regression, a good cost function to use is the <a class="reference external" href="https://en.wikipedia.org/wiki/Mean_squared_error">Mean Square Error (MSE)</a> <a class="footnote-reference" href="#id7" id="id2"></a>:</p> <img alt="$\operatorname{MSE}(m, b)=\frac{1}{n}\sum_{i=1}^n(\hat{y_i} - y_i)^2$" class="align-center" src="https://eli.thegreenplace.net/images/math/e4b7b4ce3abd90f20144e6ab468b7870cedf3b07.png" style="height: 50px;" /> <p>Expanding <img alt="\hat{y_i}=m{x_i}+b" class="valign-m4" src="https://eli.thegreenplace.net/images/math/daecd48b7bb0a06ddd4326da5b87ee14fddaeb8e.png" style="height: 17px;" />, we get:</p> <img alt="$\operatorname{MSE}(m, b)=\frac{1}{n}\sum_{i=1}^n(m{x_i} + b - y_i)^2$" class="align-center" src="https://eli.thegreenplace.net/images/math/3de1df776434b29620488aef327a9204757bc493.png" style="height: 50px;" /> <p>Let's turn this into Python code (<a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/master/linear-regression/simple_linear_regression.py">link to the full code sample</a>):</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">compute_cost</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Compute the MSE cost of a prediction based on m, b.</span>
<span class="sd">    x: inputs vector</span>
<span class="sd">    y: observed outputs vector</span>
<span class="sd">    m, b: regression parameters</span>
<span class="sd">    Returns: a scalar cost.</span>
<span class="sd">    &quot;&quot;&quot;</span>
    <span class="n">yhat</span> <span class="o">=</span> <span class="n">m</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">b</span>
    <span class="n">diff</span> <span class="o">=</span> <span class="n">yhat</span> <span class="o">-</span> <span class="n">y</span>
    <span class="c1"># Vectorized computation using a dot product to compute sum of squares.</span>
    <span class="n">cost</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">diff</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">diff</span><span class="p">)</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="c1"># Cost is a 1x1 matrix, we need a scalar.</span>
    <span class="k">return</span> <span class="n">cost</span><span class="o">.</span><span class="n">flat</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</pre></div> <p>Now we're faced with a classical optimization problem: we have some parameters (<em>m</em> and <em>b</em>) we can tweak, and some cost function <img alt="J(m, b)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d61807c64b6ab8087a11167224df4b5f818aeae3.png" style="height: 18px;" /> we want to minimize. The topic of mathematical optimization is vast, but what ends up working very well for machine learning is a fairly simple algorithm called <em>gradient descent</em>.</p> <p>Imagine plotting <img alt="J(m, b)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d61807c64b6ab8087a11167224df4b5f818aeae3.png" style="height: 18px;" /> as a 3-dimensional surface, and picking some random point on it. Our goal is to find the lowest point on the surface, but we have no idea where that is. A reasonable guess is to move a bit &quot;downwards&quot; from our current location, and then repeat.</p> <p>&quot;Downwards&quot; is exactly what &quot;gradient descent&quot; means.
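</p>

<p>To build some intuition for this surface, here is a quick sanity check (a sketch of mine, not from the article; it restates the cost for plain 1-D NumPy arrays, where <em>np.dot</em> already yields a scalar): for data generated exactly on a line, the cost shrinks to zero as <em>m</em> approaches the true slope:</p>

```python
import numpy as np

def compute_cost(x, y, m, b):
    # MSE cost as above, restated for 1-D arrays (np.dot then yields a scalar).
    diff = m * x + b - y
    return np.dot(diff, diff) / float(x.shape[0])

x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0  # points lying exactly on the line y = 2.5x + 1

# Walking m toward the true slope strictly decreases the cost...
costs = [compute_cost(x, y, m, 1.0) for m in (0.0, 1.0, 2.0, 2.5)]
assert costs[0] > costs[1] > costs[2] > costs[3]
# ...and at the true parameters the fit is perfect.
assert np.isclose(costs[-1], 0.0)
```

<p>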
We make a small change to our location (defined by <em>m</em> and <em>b</em>) in the direction in which <img alt="J(m, b)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d61807c64b6ab8087a11167224df4b5f818aeae3.png" style="height: 18px;" /> decreases most - the direction opposite to the gradient <a class="footnote-reference" href="#id8" id="id3"></a>. We then repeat this process until we reach a minimum, hopefully global. In fact, since the linear regression cost function is <em>convex</em>, we will find the global minimum this way. But in the general case this is not guaranteed, and many sophisticated extensions of gradient descent exist that try to avoid local minima and maximize the chance of finding a global one.</p> <p>Back to our function, <img alt="\operatorname{MSE}(m, b)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e42329899ba53adedf0a7884b1844dba4f01bdee.png" style="height: 18px;" />. The gradient is defined as the vector:</p> <img alt="$\nabla \operatorname{MSE}=\left \langle \frac{\partial \operatorname{MSE}}{\partial m}, \frac{\partial \operatorname{MSE}}{\partial b} \right \rangle$" class="align-center" src="https://eli.thegreenplace.net/images/math/50b0404ea5a8f76da73caae5b8109dd384dbd18e.png" style="height: 43px;" /> <p>To find it, we have to compute the partial derivatives of MSE w.r.t.
the learning parameters <em>m</em> and <em>b</em>:</p> <img alt="\begin{align*} \frac{\partial \operatorname{MSE}}{\partial m}&amp;amp;=\frac{2}{n}\sum_{i=i}^n(m{x_i}+b-y_i)x_i\\ \frac{\partial \operatorname{MSE}}{\partial b}&amp;amp;=\frac{2}{n}\sum_{i=i}^n(m{x_i}+b-y_i) \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/dbd383b0d7ee194a417b88ad117b451531758fe7.png" style="height: 108px;" /> <p>And then update <em>m</em> and <em>b</em> in each step of the learning with:</p> <img alt="\begin{align*} m &amp;amp;= m-\eta \frac{\partial \operatorname{MSE}}{\partial m} \\ b &amp;amp;= b-\eta \frac{\partial \operatorname{MSE}}{\partial b} \\ \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/b0c7ff699fc61836051968db56224e6470b56d3c.png" style="height: 81px;" /> <p>Where <img alt="\eta" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2899aeb886ad0fa72652bffd5511e452aaf084ab.png" style="height: 12px;" /> is a customizable &quot;learning rate&quot;, a hyperparameter. Here is the gradient descent loop in Python. 
Note that we examine the whole data set in every step; for much larger data sets, SGD (Stochastic Gradient Descent) with some reasonable mini-batch would make more sense, but for simple linear regression problems the data size is rarely very big.</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">gradient_descent</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">nsteps</span><span class="p">,</span> <span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span> <span class="sd">&quot;&quot;&quot;Runs gradient descent optimization to fit a line y^ = x * m + b.</span> <span class="sd"> x, y: input data and observed outputs.</span> <span class="sd"> nsteps: how many steps to run the optimization for.</span> <span class="sd"> learning_rate: learning rate of gradient descent.</span> <span class="sd"> Yields &#39;nsteps + 1&#39; triplets of (m, b, cost) where m, b are the fit</span> <span class="sd"> parameters for the given step, and cost is their cost vs the real y.</span> <span class="sd"> &quot;&quot;&quot;</span> <span class="n">n</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># Start with m and b initialized to 0s for the first try.</span> <span class="n">m</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span> <span class="k">yield</span> <span class="n">m</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">compute_cost</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> 
<span class="n">b</span><span class="p">)</span> <span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nsteps</span><span class="p">):</span> <span class="n">yhat</span> <span class="o">=</span> <span class="n">m</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">b</span> <span class="n">diff</span> <span class="o">=</span> <span class="n">yhat</span> <span class="o">-</span> <span class="n">y</span> <span class="n">dm</span> <span class="o">=</span> <span class="n">learning_rate</span> <span class="o">*</span> <span class="p">(</span><span class="n">diff</span> <span class="o">*</span> <span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">/</span> <span class="n">n</span> <span class="n">db</span> <span class="o">=</span> <span class="n">learning_rate</span> <span class="o">*</span> <span class="n">diff</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">/</span> <span class="n">n</span> <span class="n">m</span> <span class="o">-=</span> <span class="n">dm</span> <span class="n">b</span> <span class="o">-=</span> <span class="n">db</span> <span class="k">yield</span> <span class="n">m</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">compute_cost</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span> </pre></div> <p>After running this for 30 steps, the gradient converges and the parameters barely change. 
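</p> <p>To tie the pieces together, here is a minimal end-to-end sketch (an addition, not from the original post) that generates fake data with the parameters mentioned in the footnote - slope 2.25, intercept 6, Gaussian noise with standard deviation 1.5 - and runs the same update rule in compact form. The data-generation details and the learning rate are my assumptions; the rate is smaller than 0.1 because this hypothetical <em>x</em> spans [0, 10]:</p>

```python
import numpy as np

# Hypothetical data matching the footnote: slope 2.25, intercept 6,
# Gaussian noise with standard deviation 1.5.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2.25 * x + 6 + rng.normal(0, 1.5, 200)

n = x.shape[0]
m, b = 0.0, 0.0
learning_rate = 0.01  # tuned for this x range; 0.1 would diverge here
for step in range(5000):
    diff = m * x + b - y
    # The partial derivatives of MSE w.r.t. m and b, as derived above.
    m -= learning_rate * 2 * (diff * x).sum() / n
    b -= learning_rate * 2 * diff.sum() / n
```

<p>Running this yields <em>m</em> and <em>b</em> close to the true 2.25 and 6, up to the noise in the data.</p> <p>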
Here's a 3D plot of the cost as a function of the regression parameters, along with a contour plot of the same function. It's easy to see this function is convex, as expected. This makes finding the global minimum simple, since no matter where we start, the gradient will lead us directly to it.</p> <p>To help visualize this, I marked the cost for each successive training step on the contour plot - you can see how the algorithm relentlessly converges to the minimum.</p> <img alt="Linear regression cost and contour" class="align-center" src="https://eli.thegreenplace.net/images/2016/linreg-cost-contour.png" /> <p>The final parameters learned by the regression are 2.2775 for <em>m</em> and 6.0028 for <em>b</em>, which are very close to the actual parameters I used to generate this fake data.</p> <p>Here's a visualization that shows how the regression line improves progressively during learning:</p> <img alt="Regression fit visualization" class="align-center" src="https://eli.thegreenplace.net/images/2016/regressionfit.gif" /> </div> <div class="section" id="evaluating-how-good-the-fit-is"> <h2>Evaluating how good the fit is</h2> <p>In statistics, there are many ways to evaluate how good a &quot;fit&quot; some model is on the given data. One of the most popular ones is the <em>r-squared</em> test (&quot;coefficient of determination&quot;).
It measures the proportion of the total variance in the output (<em>y</em>) that can be explained by the variation in <em>x</em>:</p> <img alt="$R^2 = 1 - \frac{\sum_{i=1}^n (y_i - (m{x_i} + b))^2}{n\cdot var(y)}$" class="align-center" src="https://eli.thegreenplace.net/images/math/2c989c7345d6901a0cf7c17f9b08762ef27c5148.png" style="height: 43px;" /> <p>This is trivial to translate to code:</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">compute_rsquared</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span> <span class="n">yhat</span> <span class="o">=</span> <span class="n">m</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">b</span> <span class="n">diff</span> <span class="o">=</span> <span class="n">yhat</span> <span class="o">-</span> <span class="n">y</span> <span class="n">SE_line</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">diff</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">diff</span><span class="p">)</span> <span class="n">SE_y</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">y</span><span class="o">.</span><span class="n">var</span><span class="p">()</span> <span class="k">return</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">SE_line</span> <span class="o">/</span> <span class="n">SE_y</span> </pre></div> <p>For our regression results, I get <em>r-squared</em> of 0.76, which isn't too bad. Note that the data is very jittery, so it's natural the regression cannot explain all the variance. 
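</p> <p>A handy sanity check (an addition, not from the original post): for the least-squares fit of a simple linear regression, <em>r-squared</em> equals the squared Pearson correlation coefficient between <em>x</em> and <em>y</em>. A sketch assuming numpy and hypothetical data generated like the post's fake data set:</p>

```python
import numpy as np

def compute_rsquared(x, y, m, b):
    # Same computation as above: 1 - SE_line / SE_y, with y.var() using
    # the population (ddof=0) variance.
    diff = m * x + b - y
    return 1 - np.dot(diff, diff) / (len(y) * y.var())

# Hypothetical data: slope 2.25, intercept 6, noise std 1.5.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 2.25 * x + 6 + rng.normal(0, 1.5, 500)

m, b = np.polyfit(x, y, 1)            # least-squares fit
r2 = compute_rsquared(x, y, m, b)
corr2 = np.corrcoef(x, y)[0, 1] ** 2  # equals r2 for the least-squares fit
```

<p>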
As an interesting exercise, try to modify the code that generates the data with different standard deviations for the random noise and see the effect on <em>r-squared</em>.</p> </div> <div class="section" id="an-analytical-solution-to-simple-linear-regression"> <h2>An analytical solution to simple linear regression</h2> <p>Using the equations for the partial derivatives of MSE (shown above) it's possible to find the minimum analytically, without having to resort to a computational procedure (gradient descent). We compare the derivatives to zero:</p> <img alt="\begin{align*} \frac{\partial \operatorname{MSE}}{\partial m}&amp;amp;=\frac{2}{n}\sum_{i=i}^n(m{x_i}+b-y_i)x_i = 0\\ \frac{\partial \operatorname{MSE}}{\partial b}&amp;amp;=\frac{2}{n}\sum_{i=i}^n(m{x_i}+b-y_i) = 0 \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/aef02f077919896478d0456619f934dcc5809142.png" style="height: 108px;" /> <p>And solve for <em>m</em> and <em>b</em>. To make the equations easier to follow, let's introduce a bit of notation. <img alt="\bar{x}" class="valign-0" src="https://eli.thegreenplace.net/images/math/8eebe76c6f552df3f8b9480d5544fe47b1028322.png" style="height: 11px;" /> is the mean value of <em>x</em> across all samples. Similarly <img alt="\bar{y}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/1e3bffc7f71c01acbc2c12e015be3086a06f824d.png" style="height: 15px;" /> is the mean value of <em>y</em>. So the sum <img alt="\sum_{i=1}^n x_i" class="valign-m6" src="https://eli.thegreenplace.net/images/math/c42eb1b96dfa184fee1bc0f3a4b713b9c38b2a1a.png" style="height: 20px;" /> is actually <img alt="n\bar{x}" class="valign-0" src="https://eli.thegreenplace.net/images/math/ea6008aefff0c7d79044287c44e890b1fba97c22.png" style="height: 11px;" />. 
Now let's take the second equation from above and see how to simplify it:</p> <img alt="\begin{align*} \frac{\partial \operatorname{MSE}}{\partial b} &amp;amp;= \frac{2}{n}\sum_{i=i}^n(m{x_i}+b-y_i) \\ &amp;amp;= \frac{2}{n}(mn\bar{x}+nb-n\bar{y}) \\ &amp;amp;= 2m\bar{x} + 2b - 2\bar{y} = 0 \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/c97c0c9ca8a66d54974fc914fcf929085dc63879.png" style="height: 119px;" /> <p>Similarly, for the partial derivative by <em>m</em> we can reach:</p> <img alt="$\frac{\partial \operatorname{MSE}}{\partial m}= 2m\overline{x^2} + 2b\bar{x} - 2\overline{xy} = 0$" class="align-center" src="https://eli.thegreenplace.net/images/math/d9545273e11c9e179794f943e2c972bf62c38113.png" style="height: 38px;" /> <p>In these equations, all quantities except <em>m</em> and <em>b</em> are constant. Solving them for the unknowns <em>m</em> and <em>b</em>, we get <a class="footnote-reference" href="#id9" id="id4"></a>:</p> <img alt="$m = \frac{\bar{x}\bar{y} - \overline{xy}}{\bar{x}^2 - \overline{x^2}} \qquad b = \bar{y} - \bar{x}\frac{\bar{x}\bar{y} - \overline{xy}}{\bar{x}^2 - \overline{x^2}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/becd671e8c032d0568e33b986033c181ac5c133b.png" style="height: 38px;" /> <p>If we plug the data values we have for <em>x</em> and <em>y</em> in these equations, we get 2.2777 for <em>m</em> and 6.0103 for <em>b</em> - almost exactly the values we obtained with regression <a class="footnote-reference" href="#id10" id="id5"></a>.</p> <p>Remember that by comparing the partial derivatives to zero we find a <em>critical point</em>, which is not necessarily a minimum. 
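</p> <p>As a quick aside, the closed-form formulas for <em>m</em> and <em>b</em> above are easy to verify in code. A sketch assuming numpy and hypothetical data (the bars in the formulas are sample means):</p>

```python
import numpy as np

# Hypothetical data with known parameters (slope 2.25, intercept 6).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 1000)
y = 2.25 * x + 6 + rng.normal(0, 1.5, 1000)

# Sample means corresponding to the barred quantities in the formulas.
xbar, ybar = x.mean(), y.mean()
xybar, x2bar = (x * y).mean(), (x * x).mean()

m = (xbar * ybar - xybar) / (xbar ** 2 - x2bar)
b = ybar - xbar * m
```

<p>The result agrees with numpy's own least-squares fit (<tt>np.polyfit(x, y, 1)</tt>) to within floating-point precision.</p> <p>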
We can use the <a class="reference external" href="https://en.wikipedia.org/wiki/Second_partial_derivative_test">second derivative test</a> to find what kind of critical point that is, by computing the Hessian of the cost:</p> <img alt="$H(m, b) = \begin{pmatrix} \operatorname{MSE}_{mm}(x, y) &amp;amp; \operatorname{MSE}_{mb}(x, y) \\ \operatorname{MSE}_{bm}(x, y) &amp;amp; \operatorname{MSE}_{bb}(x, y) \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/39c2e86ae1437d3b19bc8e77b66501486550d3bc.png" style="height: 43px;" /> <p>Plugging the numbers and running the test, we can indeed verify that the critical point is a minimum.</p> </div> <div class="section" id="multiple-linear-regression"> <h2>Multiple linear regression</h2> <p>The good thing about simple regression is that it's easy to visualize. The model is trained using just two parameters, and visualizing the cost as a function of these two parameters is possible since we get a 3D plot. Anything beyond that becomes increasingly more difficult to visualize.</p> <p>In simple linear regression, every <em>x</em> is just a number; so is every <em>y</em>. In multiple linear regression this is no longer so, and each data point <em>x</em> is a vector. The model parameters can also be represented by the vector <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" />. To avoid confusion of indices and subscripts, let's agree that we use subscripts to denote components of vectors, while parenthesized superscripts are used to denote different samples. 
So <img alt="x_1^{(6)}" class="valign-m6" src="https://eli.thegreenplace.net/images/math/d01999f5014c6aea058368231c0d2b958fa8a89e.png" style="height: 26px;" /> is the second component of sample 6.</p> <p>Our goal is to find the vector <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> such that the linear function:</p> <img alt="$\hat{y}(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$" class="align-center" src="https://eli.thegreenplace.net/images/math/ae682f9fda97c28c8e100c87aecad635c7c1d96c.png" style="height: 18px;" /> <p>Is as close as possible to the actual <em>y</em> across all samples. Since working with vectors is easier for this problem, we define <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> to always be equal to 1, so that the first term in the equation above denotes the intercept. Expressing the regression coefficients as a vector:</p> <img alt="$\begin{pmatrix} \theta_0\\ \theta_1\\ ...\\ \theta_n \end{pmatrix}\in\mathbb{R}^{n+1}$" class="align-center" src="https://eli.thegreenplace.net/images/math/b16fd3d2b3041f13cb70199837a7c02c756078c7.png" style="height: 86px;" /> <p>We can now rewrite <img alt="\hat{y}(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/11533fb1b0218620907f5859e6e22aeb65c12cd8.png" style="height: 18px;" /> as:</p> <img alt="$\hat{y}(x) = \theta^T x$" class="align-center" src="https://eli.thegreenplace.net/images/math/8156e2dc4e654f77a8664180c168829f6b4cdb0b.png" style="height: 21px;" /> <p>Where both <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> and <em>x</em> are column vectors with <em>n+1</em> elements, as shown above. 
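</p> <p>In numpy this convention is straightforward to express: stack the samples as rows of a matrix <em>X</em> whose first column is all 1s, and all predictions are computed at once. A small sketch with made-up numbers (the data and parameters here are hypothetical):</p>

```python
import numpy as np

# Hypothetical data: 3 samples with 2 features each.
samples = np.array([[0.5, 1.2],
                    [1.0, 0.3],
                    [2.0, 2.0]])
k = samples.shape[0]

# Prepend x_0 = 1 to every sample so that theta[0] acts as the intercept.
X = np.column_stack([np.ones(k), samples])

theta = np.array([6.0, 2.25, -1.0])  # made-up regression parameters
yhat = X @ theta                     # computes theta^T x for every sample
```

<p>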
The mean square error (over <em>k</em> samples) now becomes:</p> <img alt="$\operatorname{MSE}=\frac{1}{k}\sum_{i=1}^k(\hat{y}(x^{(i)}) - y^{(i)})^2$" class="align-center" src="https://eli.thegreenplace.net/images/math/1e0a7c0c85c1827b992671b88e89ba052d37a204.png" style="height: 54px;" /> <p>Now we have to find the partial derivative of this cost by each <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" />. Using the chain rule, it's easy to see that:</p> <img alt="$\frac{\partial \operatorname{MSE}}{\partial \theta_j} = \frac{2}{k}\sum_{i=1}^k(\hat{y}(x^{(i)}) - y^{(i)})x_j^{(i)}$" class="align-center" src="https://eli.thegreenplace.net/images/math/4c2fcfed81c294ef7313198debe3801f50bea92a.png" style="height: 54px;" /> <p>And use this to update the parameters in every training step. The code is actually not much different from the simple regression case; here is a <a class="reference external" href="https://github.com/eliben/deep-learning-samples/blob/master/linear-regression/multiple_linear_regression.py">well documented, completely worked out example</a>. The code takes a realistic dataset from the <a class="reference external" href="http://archive.ics.uci.edu/ml/">UCI machine learning repository</a> with 4 predictors and a single outcome and builds a regression model. 4 predictors plus one intercept give us a 5-dimensional <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" />, which is utterly impossible to visualize, so we have to stick to math in order to analyze it.</p> </div> <div class="section" id="an-analytical-solution-to-multiple-linear-regression"> <h2>An analytical solution to multiple linear regression</h2> <p>Multiple linear regression also has an analytical solution. 
If we compute the derivative of the cost by each <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" />, we'll end up with <em>n+1</em> equations with the same number of variables, which we can solve analytically.</p> <p>An elegant matrix formula that computes <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> from <em>X</em> and <em>y</em> is called the Normal Equation:</p> <img alt="$\theta=(X^TX)^{-1}X^Ty$" class="align-center" src="https://eli.thegreenplace.net/images/math/20baabd9d33dcd26003bc44c7d81ba39e1ad4caa.png" style="height: 21px;" /> <p>I've written about <a class="reference external" href="http://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression">deriving the normal equation</a> previously, so I won't spend more time on it. The accompanying code computes <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> using the normal equation and compares the result with the <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> obtained from gradient descent.</p> <p>As an exercise, you can double-check that the analytical solution for simple linear regression (formulae for <em>m</em> and <em>b</em>) is just a special case of applying the normal equation in two dimensions.</p> <p>You may wonder: when should we use the analytical solution, and when is gradient descent better? In general, whenever we can use the analytical solution - we should. But it's not always feasible, computationally.</p> <p>Consider a data set with <em>k</em> samples and <em>n</em> features.
Then <em>X</em> is a <em>k x n</em> matrix, and hence <img alt="X^TX" class="valign-0" src="https://eli.thegreenplace.net/images/math/5c817c84ec1f83b23494df6125edd091a7c413dd.png" style="height: 15px;" /> is an <em>n x n</em> matrix. Inverting a matrix is an <img alt="O(n^3)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/62a87bfd600dc05059675e34b881c78648f53401.png" style="height: 19px;" /> operation, so for large <em>n</em>, finding <img alt="(X^TX)^{-1}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/57f592cee6ceac659262d97e61c64f9ca405d7f1.png" style="height: 19px;" /> can take quite a bit of time. Moreover, keeping <img alt="X^TX" class="valign-0" src="https://eli.thegreenplace.net/images/math/5c817c84ec1f83b23494df6125edd091a7c413dd.png" style="height: 15px;" /> in memory can be computationally infeasible if <img alt="X" class="valign-0" src="https://eli.thegreenplace.net/images/math/c032adc1ff629c9b66f22749ad667e6beadf144b.png" style="height: 12px;" /> is huge and sparse, but <img alt="X^TX" class="valign-0" src="https://eli.thegreenplace.net/images/math/5c817c84ec1f83b23494df6125edd091a7c413dd.png" style="height: 15px;" /> is dense. In all these cases, iterative gradient descent is a more practical approach.</p> <p>In addition, the moment we deviate from the linear regression a bit, such as adding nonlinear terms, regularization, or some other model enhancement, the analytical solutions no longer apply.
Gradient descent keeps working just the same, however, as long as we know how to compute the gradient of the new cost function.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id6" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>This data was generated by using a slope of 2.25, intercept of 6 and added Gaussian noise with a standard deviation of 1.5</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id7" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>Some resources use SSE - the Squared Sum Error, which is just the MSE without the averaging. Yet others have <em>2n</em> in the denominator to make the gradient derivation cleaner. None of this really matters in practice. When finding the minimum analytically, we compare derivatives to zero so constant factors cancel out. 
When running gradient descent, all constant factors are subsumed into the learning rate, which is arbitrary.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>For a mathematical justification for <em>why</em> the gradient leads us in the direction of most change, see <a class="reference external" href="http://eli.thegreenplace.net/2016/understanding-gradient-descent">this post</a>.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id9" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4"></a></td><td>An alternative way I've seen this equation written is to express <em>m</em> as:</td></tr> </tbody> </table> <img alt="\begin{align*} m &amp;amp;= \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2} \\ &amp;amp;= \frac{cov(x, y)}{var(x)} \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/53639f1f77080dbe8a6d3a8cd06e08a90de69a8e.png" style="height: 92px;" /> <table class="docutils footnote" frame="void" id="id10" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id5"></a></td><td>Can you figure out why even the analytical solution is a little off from the actual parameters used to generate this data?</td></tr> </tbody> </table> </div> Understanding gradient descent2016-08-05T05:38:00-07:002016-08-05T05:38:00-07:00Eli Benderskytag:eli.thegreenplace.net,2016-08-05:/2016/understanding-gradient-descent/<p>Gradient descent is a standard tool for optimizing complex functions iteratively within a computer program. Its goal is: given some arbitrary function, find a minimum.
For some small subset of functions - those that are <em>convex</em> - there's just a single minimum, which also happens to be global. For most realistic functions …</p><p>Gradient descent is a standard tool for optimizing complex functions iteratively within a computer program. Its goal is: given some arbitrary function, find a minimum. For some small subset of functions - those that are <em>convex</em> - there's just a single minimum, which also happens to be global. For most realistic functions, there may be many minima, and most of them are local. Making sure the optimization finds the &quot;best&quot; minimum and doesn't get stuck in sub-optimal minima is outside the scope of this article. Here we'll just be dealing with the core gradient descent algorithm for finding <em>some</em> minimum from a given starting point.</p> <p>The main premise of gradient descent is: given some current location <em>x</em> in the search space (the domain of the optimized function) we ought to update <em>x</em> for the next step in the direction opposite to the gradient of the function computed at <em>x</em>. But <em>why</em> is this the case? The aim of this article is to explain why, mathematically.</p> <p>This is also the place for a disclaimer: the examples used throughout the article are trivial, low-dimensional, convex functions. We don't really <em>need</em> an algorithmic procedure to find their global minimum - a quick computation would do, or really just eyeballing the function's plot. In reality we will be dealing with non-linear, 1000-dimensional functions where it's utterly impossible to visualize anything, or solve anything analytically. The approach works just the same there, however.</p> <div class="section" id="building-intuition-with-single-variable-functions"> <h2>Building intuition with single-variable functions</h2> <p>The gradient is formally defined for <em>multivariate</em> functions.
However, to start building intuition, it's useful to begin with the two-dimensional case, a single-variable function <img alt="f(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3e03f4706048fbc6c5a252a85d066adf107fcc1f.png" style="height: 18px;" />.</p> <p>In single-variable functions, the simple derivative plays the role of a gradient. So &quot;gradient descent&quot; would really be &quot;derivative descent&quot;; let's see what that means.</p> <p>As an example, let's take the function <img alt="f(x)=(x-1)^2" class="valign-m4" src="https://eli.thegreenplace.net/images/math/b898d66867ea1e832ab5cda94453ab3a69bae865.png" style="height: 19px;" />. Here's its plot, in red:</p> <img alt="Plot of parabola with tangent lines" class="align-center" src="https://eli.thegreenplace.net/images/2016/plot-parabola-with-tangents.png" /> <p>I marked a couple of points on the plot, in blue, and drew the tangents to the function at these points. Remember, our goal is to find the minimum of the function. To do that, we'll start with a guess for an <em>x</em>, and continuously update it to improve our guess based on some computation. How do we know how to update <em>x</em>? The update has only two possible directions: increase <em>x</em> or decrease <em>x</em>. We have to decide which of the two directions to take.</p> <p>We do that based on the derivative of <img alt="f(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3e03f4706048fbc6c5a252a85d066adf107fcc1f.png" style="height: 18px;" />.
The derivative at some point <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> is defined as the limit <a class="footnote-reference" href="#id5" id="id1"></a>:</p> <img alt="$\frac{d}{dx}f(x_0)=\lim_{h \to 0}\frac{f(x_0+h)-f(x_0)}{h}$" class="align-center" src="https://eli.thegreenplace.net/images/math/bfd7f38f59e2ff0d548c19f8f780605b099ecaf7.png" style="height: 39px;" /> <p>Intuitively, this tells us what happens to <img alt="f(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3e03f4706048fbc6c5a252a85d066adf107fcc1f.png" style="height: 18px;" /> when we add a very small value to <em>x</em>. For example, in the plot above, at <img alt="x_0=3" class="valign-m3" src="https://eli.thegreenplace.net/images/math/5fa44ff4e2c914452bf56041b4ef99ceb61592f9.png" style="height: 15px;" /> we have:</p> <img alt="\begin{align*} \frac{d}{dx}f(3)&amp;amp;=\lim_{h \to 0}\frac{f(3+h)-f(3)}{h} \\ &amp;amp;=\lim_{h \to 0}\frac{(3+h-1)^2-(3-1)^2}{h} \\ &amp;amp;=\lim_{h \to 0}\frac{h^2+4h}{h} \\ &amp;amp;=\lim_{h \to 0}h+4=4 \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/e572beffc8415b4ba4c8c9419105863e3ce2082f.png" style="height: 168px;" /> <p>This means that the value of <img alt="\frac{df}{dx}" class="valign-m6" src="https://eli.thegreenplace.net/images/math/45e7d07281bf1883224069f5b8d98a4bd6b21693.png" style="height: 23px;" /> at <img alt="x_0=3" class="valign-m3" src="https://eli.thegreenplace.net/images/math/5fa44ff4e2c914452bf56041b4ef99ceb61592f9.png" style="height: 15px;" /> - the <em>slope</em> of the tangent to <em>f</em> there - is 4; for a very small positive change <em>h</em> to <em>x</em> at that point, the value of <img alt="f(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3e03f4706048fbc6c5a252a85d066adf107fcc1f.png" style="height: 18px;" /> will increase by <em>4h</em>.
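</p> <p>As a quick sanity check (an addition, not from the original post), the limit definition can be approximated numerically with a small finite difference; the step size <tt>h</tt> here is an arbitrary choice:</p>

```python
def f(x):
    return (x - 1) ** 2

def numeric_derivative(f, x0, h=1e-6):
    # Central-difference approximation of the limit definition of the
    # derivative, evaluated at x0.
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

d3 = numeric_derivative(f, 3.0)  # very close to 4, matching the limit above
```

<p>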
Therefore, to get closer to the minimum of <img alt="f(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3e03f4706048fbc6c5a252a85d066adf107fcc1f.png" style="height: 18px;" /> we should rather <em>decrease</em> <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> a bit.</p> <p>Let's take another example point, <img alt="x_0=-1" class="valign-m3" src="https://eli.thegreenplace.net/images/math/c84eef20ea61cf41b13fd1a157968eba20823c8e.png" style="height: 15px;" />. At that point, if we add a little bit to <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" />, <img alt="f(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3e03f4706048fbc6c5a252a85d066adf107fcc1f.png" style="height: 18px;" /> will <em>decrease</em> by 4x that little bit. So that's exactly what we should do to get closer to the minimum.</p> <p>It turns out that in both cases, we should nudge <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> in the direction opposite to the derivative at <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" />. That's the most basic idea behind gradient descent - the derivative shows us the way to the minimum; or rather, it shows us the way to the maximum and we then go in the opposite direction. 
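</p> <p>In code, this nudging loop is just a few lines (a sketch; the starting point, the number of iterations, and the size of each nudge are arbitrary choices), minimizing f(x) = (x - 1)^2 from the plot above:</p>

```python
def df(x):
    # Derivative of f(x) = (x - 1)^2.
    return 2 * (x - 1)

x0, eta = 3.0, 0.1  # arbitrary starting guess and step-size factor
for _ in range(100):
    x0 -= eta * df(x0)  # nudge x0 opposite the derivative
# x0 is now very close to the true minimum at x = 1
```

<p>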
Given some initial guess <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" />, the next guess will be:</p> <img alt="$x_1=x_0-\eta\frac{d}{dx}f(x_0)$" class="align-center" src="https://eli.thegreenplace.net/images/math/d8666c1e2cf8740af45a228730f7c632fc00ed14.png" style="height: 37px;" /> <p>Where <img alt="\eta" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2899aeb886ad0fa72652bffd5511e452aaf084ab.png" style="height: 12px;" /> is what we call a &quot;learning rate&quot;, and is constant for each given update. It's the reason why we don't care much about the magnitude of the derivative at <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" />, only its direction. In general, it makes sense to keep the learning rate fairly small so we only make a tiny step at a time. This makes sense mathematically, because the derivative at a point is defined as the rate of change of <img alt="f(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3e03f4706048fbc6c5a252a85d066adf107fcc1f.png" style="height: 18px;" /> assuming an infinitesimal change in <em>x</em>. For a large change in <em>x</em>, who knows where we'd end up. It's easy to imagine cases where we'll entirely overshoot the minimum by making too large a step <a class="footnote-reference" href="#id6" id="id2"></a>.</p> </div> <div class="section" id="multivariate-functions-and-directional-derivatives"> <h2>Multivariate functions and directional derivatives</h2> <p>With functions of multiple variables, derivatives become more interesting. We can't just say &quot;the derivative points to where the function is increasing&quot;, because... which derivative?</p> <p>Recall the formal definition of the derivative as the limit for a small step <em>h</em>. 
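As a brief detour before generalizing, the single-variable update rule above is easy to try in code. A minimal sketch of mine (the starting point, learning rate and iteration count are arbitrary choices) for f(x) = (x - 1)^2, whose minimum is at x = 1:

```python
# Repeatedly apply the update x <- x - eta * f'(x) to f(x) = (x - 1)**2.
def df(x):
    return 2 * (x - 1)  # derivative of (x - 1)**2

x = 3.0     # initial guess x0
eta = 0.1   # learning rate; an arbitrary small value
for _ in range(100):
    x = x - eta * df(x)

print(x)  # converges toward 1.0, the minimizer
```

Each step multiplies the distance to the minimum by (1 - 2η), so with η = 0.1 the guess closes in on 1 geometrically.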
When our function has many variables, which one should have the step added? One at a time? All at once? In multivariate calculus, we use partial derivatives as building blocks. Let's use a function of two variables - <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" /> as an example throughout this section, and define the partial derivatives w.r.t. <em>x</em> and <em>y</em> at some point <img alt="(x_0,y_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/f8b63792829adeff8314a72fa87be1a770dfca85.png" style="height: 18px;" />:</p> <img alt="\begin{align*} \frac{\partial }{\partial x}f(x_0,y_0)&amp;amp;=\lim_{h \to 0}\frac{f(x_0+h,y_0)-f(x_0,y_0)}{h} \\ \frac{\partial }{\partial y}f(x_0,y_0)&amp;amp;=\lim_{h \to 0}\frac{f(x_0,y_0+h)-f(x_0,y_0)}{h} \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/b58dd3cada7292828cf79f3ca8653a99fd94c1f9.png" style="height: 87px;" /> <p>When we have a single-variable function <img alt="f(x)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/3e03f4706048fbc6c5a252a85d066adf107fcc1f.png" style="height: 18px;" />, there are really only two directions in which we can move from a given point <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> - left (decrease <em>x</em>) or right (increase <em>x</em>). With two variables, the number of possible directions is <em>infinite</em>, because we pick a direction to move on a 2D plane. Hopefully this immediately brings &quot;vectors&quot; to mind, since vectors are the perfect tool to deal with such problems. 
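As an aside, the two partial-derivative limits defined above are easy to approximate numerically. A small sketch of mine (with an arbitrary step h), using f(x, y) = x^2 + y^2, the example function plotted later in the post:

```python
# Approximate the two partial derivatives of f(x, y) = x**2 + y**2
# by stepping along one axis at a time, as in the limit definitions.
def f(x, y):
    return x ** 2 + y ** 2

def partial_x(f, x0, y0, h=1e-6):
    return (f(x0 + h, y0) - f(x0, y0)) / h

def partial_y(f, x0, y0, h=1e-6):
    return (f(x0, y0 + h) - f(x0, y0)) / h

print(partial_x(f, 1.0, 2.0), partial_y(f, 1.0, 2.0))  # close to 2 and 4
```

The exact partials are 2x and 2y, so at (1, 2) the approximations land near 2 and 4.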
We can represent the change from the point <img alt="(x_0,y_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/f8b63792829adeff8314a72fa87be1a770dfca85.png" style="height: 18px;" /> as the vector <img alt="\vec{v}=\langle a,b \rangle" class="valign-m5" src="https://eli.thegreenplace.net/images/math/4ef7c8a835491ba5ec6dc5f2b94ebff879938a21.png" style="height: 19px;" /> <a class="footnote-reference" href="#id7" id="id3"></a>. The <em>directional derivative</em> of <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" /> along <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> at <img alt="(x_0,y_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/f8b63792829adeff8314a72fa87be1a770dfca85.png" style="height: 18px;" /> is defined as its rate of change in the direction of the vector at that point. Mathematically, it's defined as:</p> <img alt="$\begin{equation} D_{\vec{v}}f(x_0,y_0)=\lim_{h \to 0}\frac{f(x_0+ah,y_0+bh)-f(x_0,y_0)}{h} \tag{1} \end{equation}$" class="align-center" src="https://eli.thegreenplace.net/images/math/1af5afd7427f744daa0c75b05697b32f21b2f40c.png" style="height: 39px;" /> <p>The partial derivatives w.r.t. <em>x</em> and <em>y</em> can be seen as special cases of this definition. The partial derivative <img alt="\frac{\partial f}{\partial x}" class="valign-m7" src="https://eli.thegreenplace.net/images/math/75a2ab078215106a1084cf5e262e98f32c1cc3b9.png" style="height: 25px;" /> is just the directional derivative in the direction of the <em>x</em> axis. 
In vector-speak, this is the directional derivative for <img alt="\vec{v}=\langle a,b \rangle=\widehat{e_x}=\langle 1,0 \rangle" class="valign-m5" src="https://eli.thegreenplace.net/images/math/36b4fd6cf884fd12b36c605cb6ec7a7c9b4ee65f.png" style="height: 19px;" />, the standard basis vector for <em>x</em>. Just plug <img alt="a=1,b=0" class="valign-m4" src="https://eli.thegreenplace.net/images/math/7feadfc4043894ed6a3de2cced949a91bea9e5b2.png" style="height: 17px;" /> into (1) to see why. Similarly, the partial derivative <img alt="\frac{\partial f}{\partial y}" class="valign-m9" src="https://eli.thegreenplace.net/images/math/5bc3d10d9714f8f7a95791fe29e497cf0ecbe3b0.png" style="height: 27px;" /> is the directional derivative in the direction of the standard basis vector <img alt="\widehat{e_y}=\langle 0,1 \rangle" class="valign-m6" src="https://eli.thegreenplace.net/images/math/3ce4793144c7bfd02d245b81f8bd44a595721196.png" style="height: 20px;" />.</p> </div> <div class="section" id="a-visual-interlude"> <h2>A visual interlude</h2> <p>Functions of two variables <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" /> are the last frontier for meaningful visualizations, for which we need 3D to plot the value of <img alt="f" class="valign-m4" src="https://eli.thegreenplace.net/images/math/4a0a19218e082a343a1b17e5333409af9d98f0f5.png" style="height: 16px;" /> for each given <em>x</em> and <em>y</em>. Even in 3D, visualizing gradients is significantly harder than in 2D, and yet we have to try since for anything above two variables all hopes of visualization are lost.</p> <p>Here's the function <img alt="f(x,y)=x^2+y^2" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d3eb0fc536d00e84cd63bb5af98b7e2bc01bde4f.png" style="height: 19px;" /> plotted in a small range around zero. 
I drew the standard basis vectors <img alt="\widehat{x}=\widehat{e_x}" class="valign-m3" src="https://eli.thegreenplace.net/images/math/0ea0752aa73540ee1e464a42d5d1b2b9741d3eab.png" style="height: 17px;" /> and <img alt="\widehat{y}=\widehat{e_y}" class="valign-m6" src="https://eli.thegreenplace.net/images/math/c0bf47cb98b1f01e6b47992929694ec9da20f8f7.png" style="height: 20px;" /> <a class="footnote-reference" href="#id8" id="id4"></a> and some combination of them <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" />.</p> <img alt="3D parabola with direction vector markers" class="align-center" src="https://eli.thegreenplace.net/images/2016/plot-3d-parabola.png" /> <p>I also marked the point on <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" /> where the vectors are based. The goal is to help us keep in mind how the independent variables <em>x</em> and <em>y</em> change, and how that affects <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" />. We change <em>x</em> and <em>y</em> by adding some small vector <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> to their current value. The result is &quot;nudging&quot; <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" /> in the direction of <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" />. 
Remember our goal for this article - find <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> such that this &quot;nudge&quot; gets us closer to a minimum.</p> </div> <div class="section" id="finding-directional-derivatives-using-the-gradient"> <h2>Finding directional derivatives using the gradient</h2> <p>As we've seen, the derivative of <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" /> in the direction of <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> is defined as:</p> <img alt="$D_{\vec{v}}f(x_0,y_0)=\lim_{h \to 0}\frac{f(x_0+ah,y_0+bh)-f(x_0,y_0)}{h}$" class="align-center" src="https://eli.thegreenplace.net/images/math/9f2c62d64f016bd77712873294a0f5e64537b1ab.png" style="height: 39px;" /> <p>Looking at the 3D plot above, this is how much the value of <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" /> changes when we add <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> to the vector <img alt="\langle x_0,y_0 \rangle" class="valign-m5" src="https://eli.thegreenplace.net/images/math/f74aa2c6fda35535931fad69ec339eaef3693913.png" style="height: 19px;" />. But how do we do that? This limit definition doesn't lend itself to easy analytical manipulation for arbitrary functions. 
Sure, we could plug <img alt="(x_0,y_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/f8b63792829adeff8314a72fa87be1a770dfca85.png" style="height: 18px;" /> and <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> in there and do the computation, but it would be nice to have an easier-to-use formula. Luckily, with the help of the gradient of <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" /> it becomes much easier.</p> <p>The gradient is a vector value we compute from a scalar function. It's defined as:</p> <img alt="$\nabla f=\left \langle \frac{\partial f}{\partial x},\frac{\partial f}{\partial y} \right \rangle$" class="align-center" src="https://eli.thegreenplace.net/images/math/03eab64984be412b6db132c2534bbecc006af47c.png" style="height: 43px;" /> <p>It turns out that given a vector <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" />, the directional derivative <img alt="D_{\vec{v}}f" class="valign-m4" src="https://eli.thegreenplace.net/images/math/03a3931c968b3b6f26e82958785539d74db94293.png" style="height: 16px;" /> can be expressed as the following dot product:</p> <img alt="$D_{\vec{v}}f=(\nabla f) \cdot \vec{v}$" class="align-center" src="https://eli.thegreenplace.net/images/math/49933775272512c4c8686d9f9692c8ea01e1c97d.png" style="height: 18px;" /> <p>If this looks like a mental leap too big to trust, please read the Appendix section at the bottom. Otherwise, feel free to verify that the two are equivalent with a couple of examples. 
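One way to run such a check is numerically. Here's a sketch of mine for f(x, y) = x^2 + y^2, with an arbitrarily chosen point and unit-length direction:

```python
# Compare the limit-definition directional derivative with the
# gradient dot product for f(x, y) = x**2 + y**2.
def f(x, y):
    return x ** 2 + y ** 2

x0, y0 = 1.0, 2.0
a, b = 0.6, 0.8  # a**2 + b**2 == 1, so v = <a, b> is normalized

# Straight from the limit definition (1), with a small finite h.
h = 1e-6
from_limit = (f(x0 + a * h, y0 + b * h) - f(x0, y0)) / h

# Via the gradient: grad f = <2x, 2y>, dotted with v.
from_gradient = 2 * x0 * a + 2 * y0 * b

print(from_limit, from_gradient)  # both close to 4.4
```

Both routes agree (up to the finite-step approximation error), which is exactly what the dot-product formula promises.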
For instance, try to find the derivative in the direction of <img alt="\vec{v}=\langle \frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}} \rangle" class="valign-m11" src="https://eli.thegreenplace.net/images/math/d614069c5beaf6fb858de40fa492a7b523a683d9.png" style="height: 27px;" /> at <img alt="(x_0,y_0)=(-1.5,0.25)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/61355565f13944faf85baec62c5fc1a682b0b5d5.png" style="height: 18px;" />. You should get <img alt="\frac{-2.5}{\sqrt{2}}" class="valign-m11" src="https://eli.thegreenplace.net/images/math/0c22fc563236a48f94882876c68f6edc0c3fb4da.png" style="height: 27px;" /> using both methods.</p> </div> <div class="section" id="direction-of-maximal-change"> <h2>Direction of maximal change</h2> <p>We're almost there! Now that we have a relatively simple way of computing any directional derivative from the partial derivatives of a function, we can figure out which direction to take to get the maximal change in the value of <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" />.</p> <p>We can rewrite:</p> <img alt="$D_{\vec{v}}f=(\nabla f) \cdot \vec{v}$" class="align-center" src="https://eli.thegreenplace.net/images/math/49933775272512c4c8686d9f9692c8ea01e1c97d.png" style="height: 18px;" /> <p>As:</p> <img alt="$D_{\vec{v}}f=\left \| \nabla f \right \| \left \| \vec{v} \right \| cos(\theta)$" class="align-center" src="https://eli.thegreenplace.net/images/math/8227de3117c60690ced3153cdc38d9bccd960fba.png" style="height: 19px;" /> <p>Where <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> is the angle between the two vectors. 
Now, recall that <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> is normalized so its magnitude is 1. Therefore, we only care about the <em>direction</em> of <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> w.r.t. the gradient. When is this equation maximized? When <img alt="\theta=0" class="valign-0" src="https://eli.thegreenplace.net/images/math/a1dffbe89f1ec5a919198de979fca459eb7fdf84.png" style="height: 12px;" />, because then <img alt="cos(\theta)=1" class="valign-m4" src="https://eli.thegreenplace.net/images/math/66a6eb87ec7f340e2e24bd46cdf02ab050013aac.png" style="height: 18px;" />. Since a cosine can never be larger than 1, that's the best we can have.</p> <p>So <img alt="\theta=0" class="valign-0" src="https://eli.thegreenplace.net/images/math/a1dffbe89f1ec5a919198de979fca459eb7fdf84.png" style="height: 12px;" /> gives us the largest positive change in <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" />. To get <img alt="\theta=0" class="valign-0" src="https://eli.thegreenplace.net/images/math/a1dffbe89f1ec5a919198de979fca459eb7fdf84.png" style="height: 12px;" />, <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> has to point in the same direction as the gradient. 
Similarly, for <img alt="\theta=180^{\circ}" class="valign-m1" src="https://eli.thegreenplace.net/images/math/f35bd3cc416e154fddabe833458147c566028a8c.png" style="height: 13px;" /> we get <img alt="cos(\theta)=-1" class="valign-m4" src="https://eli.thegreenplace.net/images/math/65b96d5ab442e325098894e80d655263a24b14d6.png" style="height: 18px;" /> and therefore the largest <em>negative</em> change in <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" />. So if we want to decrease <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" /> the most, <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" /> has to point in the opposite direction of the gradient.</p> </div> <div class="section" id="gradient-descent-update-for-multivariate-functions"> <h2>Gradient descent update for multivariate functions</h2> <p>To sum up, given some starting point <img alt="(x_0,y_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/f8b63792829adeff8314a72fa87be1a770dfca85.png" style="height: 18px;" />, to nudge it in the direction of the minimum of <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" />, we first compute the gradient of <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" /> at <img alt="(x_0,y_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/f8b63792829adeff8314a72fa87be1a770dfca85.png" style="height: 18px;" />. 
Then, we update (using vector notation):</p> <img alt="$\langle x_1,y_1 \rangle=\langle x_0,y_0 \rangle-\eta \nabla{f(x_0,y_0)}$" class="align-center" src="https://eli.thegreenplace.net/images/math/66a0a92b6ff9a4c0d2162a41484ab17115f57bd7.png" style="height: 19px;" /> <p>Generalizing to multiple dimensions, let's say we have the function <img alt="f:\mathbb{R}^n\rightarrow \mathbb{R}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/5b4aba3ea35b9daec583b61ecb5a556ae28103e3.png" style="height: 16px;" /> taking the n-dimensional vector <img alt="\vec{x}=(x_1,x_2 \dots ,x_n)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e8ece11f27b7cf7726e6ea055cfb0718761733e0.png" style="height: 18px;" />. We define the gradient update at step <em>k</em> to be:</p> <img alt="$\vec{x}_{(k)}=\vec{x}_{(k-1)} - \eta \nabla{f(\vec{x}_{(k-1)})}$" class="align-center" src="https://eli.thegreenplace.net/images/math/265d53b7832258e30f00049a1772e9f213140628.png" style="height: 20px;" /> <p>Previously, for the single-variable case we said that the derivative points the way to the minimum. Now we can say that while there are many ways to get to the minimum (eventually), the gradient points us to the <em>fastest</em> way from any given point.</p> </div> <div class="section" id="appendix-directional-derivative-definition-and-gradient"> <h2>Appendix: directional derivative definition and gradient</h2> <p>This is an optional section for those who don't like taking mathematical statements for granted. Now it's time to prove the equation shown earlier in the article, and on which its main result is based:</p> <img alt="$D_{\vec{v}}f=(\nabla f) \cdot \vec{v}$" class="align-center" src="https://eli.thegreenplace.net/images/math/49933775272512c4c8686d9f9692c8ea01e1c97d.png" style="height: 18px;" /> <p>As usual with proofs, it really helps to start by working through an example or two to build up some intuition into why the equation works. 
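Before diving into the proof, the multivariate update rule from the previous section can be sketched in a few lines (my code; the function, starting point, learning rate and step count are arbitrary choices), here for f(x, y) = x^2 + y^2, whose minimum sits at the origin:

```python
# Multivariate gradient descent: x_k = x_{k-1} - eta * grad f(x_{k-1}),
# applied to f(x, y) = x**2 + y**2.
def grad_f(x, y):
    return (2 * x, 2 * y)  # gradient of x**2 + y**2

x, y = -1.5, 0.25  # arbitrary starting point
eta = 0.1          # learning rate
for _ in range(100):
    gx, gy = grad_f(x, y)
    x, y = x - eta * gx, y - eta * gy

print(x, y)  # both approach 0, the minimum of f
```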
Feel free to do that if you'd like, using any function, starting point and direction vector <img alt="\vec{v}" class="valign-0" src="https://eli.thegreenplace.net/images/math/39a3a59a8f524cf72620db07b9ba7cdce9fc9391.png" style="height: 13px;" />.</p> <p>Suppose we define a function <img alt="w(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/0382ffc90ae7b4c24238f68a32bebd14bc53c8d7.png" style="height: 18px;" /> as follows:</p> <img alt="$w(t)=f(x,y)$" class="align-center" src="https://eli.thegreenplace.net/images/math/dc37eb3cf47966d7338e561faffeffbb291085c5.png" style="height: 18px;" /> <p>Where <img alt="x=x(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/97aeb925cf8f501cc8836794ee06fb357b9d9a83.png" style="height: 18px;" /> and <img alt="y=y(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ebacc26a97fccf1aa96e1b59f21fcb2ca66c8924.png" style="height: 18px;" /> defined as:</p> <img alt="\begin{align*} x(t)&amp;amp;=x_0+at \\ y(t)&amp;amp;=y_0+bt \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/27988a5772de0fe761873494e88f7cad887ede85.png" style="height: 45px;" /> <p>In these definitions, <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" />, <img alt="y_0" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2bb5817d0f3bf8490a8c7b1343f84f9635e683a3.png" style="height: 12px;" />, <em>a</em> and <em>b</em> are constants, so both <img alt="x(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/62b10cd9e1396c7ea33fd211e67de2fb29019cfc.png" style="height: 18px;" /> and <img alt="y(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ed8576b7227103b62d3648e7d1bbdff4052b27ff.png" style="height: 18px;" /> are truly functions of a single variable. 
Using <a class="reference external" href="http://eli.thegreenplace.net/2016/the-chain-rule-of-calculus">the chain rule</a>, we know that:</p> <img alt="$\frac{dw}{dt}=\frac{\partial f}{\partial x}\frac{dx}{dt}+\frac{\partial f}{\partial y}\frac{dy}{dt}$" class="align-center" src="https://eli.thegreenplace.net/images/math/d5f4f13aeba35328cd2bea9b247842acb7524724.png" style="height: 41px;" /> <p>Substituting the derivatives of <img alt="x(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/62b10cd9e1396c7ea33fd211e67de2fb29019cfc.png" style="height: 18px;" /> and <img alt="y(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/ed8576b7227103b62d3648e7d1bbdff4052b27ff.png" style="height: 18px;" />, we get:</p> <img alt="$\frac{dw}{dt}=a\frac{\partial f}{\partial x}+b\frac{\partial f}{\partial y}$" class="align-center" src="https://eli.thegreenplace.net/images/math/829069469d88717c9d95e3f788ed9e0c6cbeebc6.png" style="height: 41px;" /> <p>One more step, the significance of which will become clear shortly. 
Specifically, the derivative of <img alt="w(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/0382ffc90ae7b4c24238f68a32bebd14bc53c8d7.png" style="height: 18px;" /> at <img alt="t=0" class="valign-0" src="https://eli.thegreenplace.net/images/math/31056375cdff6a052261f18ceb3afe466731302a.png" style="height: 12px;" /> is:</p> <img alt="$\begin{equation} \frac{d}{dt}w(0)=a\frac{\partial}{\partial x}f(x_0,y_0)+b\frac{\partial}{\partial y}f(x_0,y_0) \tag{2} \end{equation}$" class="align-center" src="https://eli.thegreenplace.net/images/math/ea579cad8f6c62a817f2253e1d596178ea673d37.png" style="height: 41px;" /> <p>Now let's see how to compute the derivative of <img alt="w(t)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/0382ffc90ae7b4c24238f68a32bebd14bc53c8d7.png" style="height: 18px;" /> at <img alt="t=0" class="valign-0" src="https://eli.thegreenplace.net/images/math/31056375cdff6a052261f18ceb3afe466731302a.png" style="height: 12px;" /> using the formal limit definition:</p> <img alt="\begin{align*} \frac{d}{dt}w(0)&amp;amp;=\lim_{h \to 0}\frac{w(h)-w(0)}{h} \\ &amp;amp;=\lim_{h \to 0}\frac{f(x_0+ah,y_0+bh)-f(x_0,y_0)}{h} \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/10a224da7b7ab2424b9f88edcbfe17f273f3bd8b.png" style="height: 84px;" /> <p>But the latter is precisely the definition of the directional derivative in equation (1). 
Therefore, we can say that:</p> <img alt="$\frac{d}{dt}w(0)=D_{\vec{v}}f(x_0,y_0)$" class="align-center" src="https://eli.thegreenplace.net/images/math/4f377110022c468e46cbdb32bfb11a072d11b330.png" style="height: 37px;" /> <p>From this and (2), we get:</p> <img alt="$\frac{d}{dt}w(0)=D_{\vec{v}}f(x_0,y_0)=a\frac{\partial}{\partial x}f(x_0,y_0)+b\frac{\partial}{\partial y}f(x_0,y_0)$" class="align-center" src="https://eli.thegreenplace.net/images/math/d259ebf3697f480823a40247ce7191f9e954a584.png" style="height: 41px;" /> <p>This derivation is not special to the point <img alt="(x_0,y_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/f8b63792829adeff8314a72fa87be1a770dfca85.png" style="height: 18px;" /> - it works just as well for any point where <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" /> has partial derivatives w.r.t. <em>x</em> and <em>y</em>; therefore, for any point <img alt="(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d330d6e65470cb03e76e092ee47971f9e931f759.png" style="height: 18px;" /> where <img alt="f(x,y)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/720aabe593c880dc58881240e567ecda2b89bdf4.png" style="height: 18px;" /> is differentiable:</p> <img alt="\begin{align*} D_{\vec{v}}f(x,y)&amp;amp;=a\frac{\partial}{\partial x}f(x,y)+b\frac{\partial}{\partial y}f(x,y) \\ &amp;amp;=\left \langle \frac{\partial f}{\partial x},\frac{\partial f}{\partial y} \right \rangle \cdot \langle a,b \rangle \\ &amp;amp;=(\nabla f) \cdot \vec{v} \qedhere \end{align*}" class="align-center" src="https://eli.thegreenplace.net/images/math/7c306dfedd474d99a62894e6258cea186d8be428.png" style="height: 115px;" /> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id5" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a 
class="fn-backref" href="#id1"></a></td><td>The notation <img alt="\frac{d}{dx}f(x_0)" class="valign-m6" src="https://eli.thegreenplace.net/images/math/b0d6f765abf215972d5dbb982f77f1a83c233066.png" style="height: 22px;" /> means: the value of the derivative of <img alt="f" class="valign-m4" src="https://eli.thegreenplace.net/images/math/4a0a19218e082a343a1b17e5333409af9d98f0f5.png" style="height: 16px;" /> w.r.t. <em>x</em>, evaluated at <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" />. Another way to say the same would be <img alt="f{}&amp;#x27;(x_0)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e11c4ee90d42c3261aec6ef9c71893411b11cf34.png" style="height: 18px;" />.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id6" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>That said, in some advanced variations of gradient descent we actually want to probe different areas of the function early on in the process, so a larger step makes sense (remember, realistic functions have many local minima and we want to find the best one). Further along in the optimization process, when we've settled on a general area of the function we want the learning rate to be small so we actually get to the minimum. This approach is called <em>annealing</em> and I'll leave it for some future article.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id7" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>To avoid tracking vector magnitudes, from now on in the article we'll be dealing with <em>normalized</em> direction vectors. 
That is, we always assume that <img alt="\left \| \vec{v} \right \|=1" class="valign-m5" src="https://eli.thegreenplace.net/images/math/d68cb9ca8e7b5fd7fe4a7c4548ed5d98b63292eb.png" style="height: 19px;" />.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4"></a></td><td>Yes, <img alt="\widehat{y}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/8cf4f01720ca8008752c182a8d3443aa2b174442.png" style="height: 18px;" /> is actually going in the opposite direction so it's <img alt="-\widehat{e_y}" class="valign-m6" src="https://eli.thegreenplace.net/images/math/160a7a02c9645a3948812151b7a0cf38eb29c562.png" style="height: 20px;" />, but that really doesn't change anything. It was easier to draw :)</td></tr> </tbody> </table> </div> Broadcasting arrays in Numpy2015-12-22T06:00:00-08:002015-12-22T06:00:00-08:00Eli Benderskytag:eli.thegreenplace.net,2015-12-22:/2015/broadcasting-arrays-in-numpy/<p><em>Broadcasting</em> is Numpy's terminology for performing mathematical operations between arrays with different shapes. This article will explain why broadcasting is useful, how to use it and touch upon some of its performance implications.</p> <div class="section" id="motivating-example"> <h2>Motivating example</h2> <p>Say we have a large data set; each datum is a list of parameters. In …</p></div><p><em>Broadcasting</em> is Numpy's terminology for performing mathematical operations between arrays with different shapes. This article will explain why broadcasting is useful, how to use it and touch upon some of its performance implications.</p> <div class="section" id="motivating-example"> <h2>Motivating example</h2> <p>Say we have a large data set; each datum is a list of parameters. 
In Numpy terms, we have a 2-D array, where each row is a datum and the number of rows is the size of the data set. Suppose we want to apply some sort of scaling to all these data - every parameter gets its own scaling factor; in other words, every parameter is multiplied by some factor.</p> <p>Just to have something tangible to think about, let's count calories in foods using a macro-nutrient breakdown. Roughly put, the caloric parts of food are made of fats (9 calories per gram), protein (4 calories per gram) and carbs (4 calories per gram). So if we list some foods (our data), and for each food list its macro-nutrient breakdown (parameters), we can then multiply each nutrient by its caloric value (apply scaling) to compute the caloric breakdown of each food item <a class="footnote-reference" href="#id6" id="id1"></a>:</p> <img alt="Calories macros" class="align-center" src="https://eli.thegreenplace.net/images/2015/cal-data.png" /> <p>With this transformation, we can now compute all kinds of useful information. For example, what is the total number of calories in some food. Or, given a breakdown of my dinner - how many calories did I get from protein. 
And so on.</p> <p>Let's see a naive way of producing this computation with Numpy:</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">65</span><span class="p">]:</span> <span class="n">macros</span> <span class="o">=</span> <span class="n">array</span><span class="p">([</span> <span class="p">[</span><span class="mf">0.3</span><span class="p">,</span> <span class="mf">2.5</span><span class="p">,</span> <span class="mf">3.5</span><span class="p">],</span> <span class="p">[</span><span class="mf">2.9</span><span class="p">,</span> <span class="mf">27.5</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.4</span><span class="p">,</span> <span class="mf">1.3</span><span class="p">,</span> <span class="mf">23.9</span><span class="p">],</span> <span class="p">[</span><span class="mf">14.4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mf">2.3</span><span class="p">]])</span> <span class="c1"># Create a new array filled with zeros, of the same shape as macros.</span> <span class="n">In</span> <span class="p">[</span><span class="mi">67</span><span class="p">]:</span> <span class="n">result</span> <span class="o">=</span> <span class="n">zeros_like</span><span class="p">(</span><span class="n">macros</span><span class="p">)</span> <span class="n">In</span> <span class="p">[</span><span class="mi">69</span><span class="p">]:</span> <span class="n">cal_per_macro</span> <span class="o">=</span> <span class="n">array</span><span class="p">([</span><span class="mi">9</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">])</span> <span class="c1"># Now multiply each row of macros by cal_per_macro. 
In Numpy, * is</span> <span class="c1"># element-wise multiplication between two arrays.</span> <span class="n">In</span> <span class="p">[</span><span class="mi">70</span><span class="p">]:</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">macros</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span> <span class="o">....</span><span class="p">:</span> <span class="n">result</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:]</span> <span class="o">=</span> <span class="n">macros</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:]</span> <span class="o">*</span> <span class="n">cal_per_macro</span> <span class="o">....</span><span class="p">:</span> <span class="n">In</span> <span class="p">[</span><span class="mi">71</span><span class="p">]:</span> <span class="n">result</span> <span class="n">Out</span><span class="p">[</span><span class="mi">71</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span> <span class="mf">2.7</span><span class="p">,</span> <span class="mf">10.</span> <span class="p">,</span> <span class="mf">14.</span> <span class="p">],</span> <span class="p">[</span> <span class="mf">26.1</span><span class="p">,</span> <span class="mf">110.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">],</span> <span class="p">[</span> <span class="mf">3.6</span><span class="p">,</span> <span class="mf">5.2</span><span class="p">,</span> <span class="mf">95.6</span><span class="p">],</span> <span class="p">[</span> <span class="mf">129.6</span><span class="p">,</span> <span class="mf">24.</span> <span class="p">,</span> <span class="mf">9.2</span><span class="p">]])</span> </pre></div> <p>This is a reasonable approach when coding in a 
low-level programming language: allocate the output, loop over input performing some operation, write result into output. In Numpy, however, this is fairly bad for performance because the looping is done in (slow) Python code instead of internally by Numpy in (fast) C code.</p> <p>Since element-wise operators like <tt class="docutils literal">*</tt> work on arbitrary shapes, a better way would be to delegate all the looping to Numpy, by &quot;stretching&quot; the <tt class="docutils literal">cal_per_macro</tt> array vertically and then performing element-wise multiplication with <tt class="docutils literal">macros</tt>; this moves the per-row loop from above into Numpy itself, where it can run much more efficiently:</p> <div class="highlight"><pre><span></span><span class="c1"># Use the &#39;tile&#39; function to replicate cal_per_macro over the number</span> <span class="c1"># of rows &#39;macros&#39; has (rows is the first element of the shape tuple for</span> <span class="c1"># a 2-D array).</span> <span class="n">In</span> <span class="p">[</span><span class="mi">72</span><span class="p">]:</span> <span class="n">cal_per_macro_stretch</span> <span class="o">=</span> <span class="n">tile</span><span class="p">(</span><span class="n">cal_per_macro</span><span class="p">,</span> <span class="p">(</span><span class="n">macros</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">))</span> <span class="n">In</span> <span class="p">[</span><span class="mi">73</span><span class="p">]:</span> <span class="n">cal_per_macro_stretch</span> <span class="n">Out</span><span class="p">[</span><span class="mi">73</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span><span class="mi">9</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span> <span 
class="p">[</span><span class="mi">9</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span> <span class="p">[</span><span class="mi">9</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span> <span class="p">[</span><span class="mi">9</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">]])</span> <span class="n">In</span> <span class="p">[</span><span class="mi">74</span><span class="p">]:</span> <span class="n">macros</span> <span class="o">*</span> <span class="n">cal_per_macro_stretch</span> <span class="n">Out</span><span class="p">[</span><span class="mi">74</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span> <span class="mf">2.7</span><span class="p">,</span> <span class="mf">10.</span> <span class="p">,</span> <span class="mf">14.</span> <span class="p">],</span> <span class="p">[</span> <span class="mf">26.1</span><span class="p">,</span> <span class="mf">110.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">],</span> <span class="p">[</span> <span class="mf">3.6</span><span class="p">,</span> <span class="mf">5.2</span><span class="p">,</span> <span class="mf">95.6</span><span class="p">],</span> <span class="p">[</span> <span class="mf">129.6</span><span class="p">,</span> <span class="mf">24.</span> <span class="p">,</span> <span class="mf">9.2</span><span class="p">]])</span> </pre></div> <p>Nice, it's shorter too. And much, much faster! To measure the speed I created a large random data set, with 1 million rows of 10 parameters each. The loop-in-Python method takes ~2.3 seconds to churn through it. The stretching method takes 30 <em>milliseconds</em>, a ~75x speedup.</p> <p>And now, finally, comes the interesting part. 
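As an aside, the timing comparison above can be reproduced with a sketch along the following lines. The function names here are made up for the sketch, and the data set is smaller than the article's (100,000 rows instead of 1 million) so it runs quickly; exact numbers vary by machine and Numpy version:

```python
import timeit

import numpy as np

# Illustrative data set: 100,000 rows of 10 parameters each, plus a
# 10-element vector of scaling factors.
data = np.random.rand(100_000, 10)
factors = np.random.rand(10)

def scale_with_loop(data, factors):
    # Loop over the rows in Python - slow.
    result = np.zeros_like(data)
    for i in range(data.shape[0]):
        result[i, :] = data[i, :] * factors
    return result

def scale_with_stretch(data, factors):
    # Stretch factors vertically with tile, then multiply element-wise;
    # all the looping happens inside Numpy.
    return data * np.tile(factors, (data.shape[0], 1))

loop_t = timeit.timeit(lambda: scale_with_loop(data, factors), number=1)
stretch_t = timeit.timeit(lambda: scale_with_stretch(data, factors), number=1)
print('loop: {:.4f}s, stretch: {:.4f}s'.format(loop_t, stretch_t))
```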
You see, the operation we just did - stretching one array so that its shape matches that of another and then applying some element-wise operation between them - is actually pretty common. This often happens when we want to take a lower-dimensional array and use it to perform a computation along some axis of a higher-dimensional array. In fact, when taken to the extreme this is exactly what happens when we perform an operation between an array and a scalar - the scalar is <em>stretched</em> across the whole array so that the element-wise operation gets the same scalar value for each element it computes.</p> <p>Numpy generalizes this concept into <em>broadcasting</em> - a set of rules that permit element-wise computations between arrays of different shapes, as long as some constraints apply. We'll discuss the actual constraints later, but for the case at hand a simple example will suffice: our original <tt class="docutils literal">macros</tt> array is 4x3 (4 rows by 3 columns). <tt class="docutils literal">cal_per_macro</tt> is a 3-element array. Since its length matches the number of columns in <tt class="docutils literal">macros</tt>, it's pretty natural to apply some operation between <tt class="docutils literal">cal_per_macro</tt> and every row of <tt class="docutils literal">macros</tt> - each row of <tt class="docutils literal">macros</tt> has the exact same size as <tt class="docutils literal">cal_per_macro</tt>, so the element-wise operation makes perfect sense.</p> <p>Incidentally, this lets Numpy achieve two separate goals - usefulness as well as more consistent and general semantics. Binary operators like <tt class="docutils literal">*</tt> are element-wise, but what happens when we apply them between arrays of different shapes? Should it work or should it be rejected? If it works, how should it work? Broadcasting defines the semantics of these operations.</p> <p>Back to our example. 
Here's yet another way to compute the result data:</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">75</span><span class="p">]:</span> <span class="n">macros</span> <span class="o">*</span> <span class="n">cal_per_macro</span> <span class="n">Out</span><span class="p">[</span><span class="mi">75</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span> <span class="mf">2.7</span><span class="p">,</span> <span class="mf">10.</span> <span class="p">,</span> <span class="mf">14.</span> <span class="p">],</span> <span class="p">[</span> <span class="mf">26.1</span><span class="p">,</span> <span class="mf">110.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">],</span> <span class="p">[</span> <span class="mf">3.6</span><span class="p">,</span> <span class="mf">5.2</span><span class="p">,</span> <span class="mf">95.6</span><span class="p">],</span> <span class="p">[</span> <span class="mf">129.6</span><span class="p">,</span> <span class="mf">24.</span> <span class="p">,</span> <span class="mf">9.2</span><span class="p">]])</span> </pre></div> <p>Simple and elegant, and the fastest approach to boot <a class="footnote-reference" href="#id7" id="id2"></a>.</p> </div> <div class="section" id="defining-broadcasting"> <h2>Defining broadcasting</h2> <p>Broadcasting is often described as an operation between a &quot;smaller&quot; and a &quot;larger&quot; array. This doesn't necessarily have to be the case, as broadcasting applies also to arrays of the same size, though with different shapes. Therefore, I believe the following definition of broadcasting is the most useful one.</p> <p>Element-wise operations on arrays are only valid when the arrays' shapes are either equal or compatible. The equal shapes case is trivial - this is the stretched array from the example above. 
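As a minimal sketch of the equal-shapes case (the values here are arbitrary): two arrays with exactly the same shape combine element by element, with no stretching involved.

```python
import numpy as np

# Two arrays of identical shape: the element-wise operation simply
# pairs up corresponding elements.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[10.0, 20.0], [30.0, 40.0]])
c = a * b
print(c)  # [[ 10.  40.], [ 90. 160.]]
```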
What does &quot;compatible&quot; mean, though?</p> <p>To determine if two shapes are compatible, Numpy compares their dimensions, starting with the trailing ones and working its way backwards <a class="footnote-reference" href="#id8" id="id3"></a>. If two dimensions are equal, or if one of them equals 1, the comparison continues. Otherwise, you'll see a <tt class="docutils literal">ValueError</tt> raised (saying something like &quot;operands could not be broadcast together with shapes ...&quot;).</p> <p>When one of the shapes runs out of dimensions (because it has fewer dimensions than the other shape), Numpy will use 1 in the comparison process until the other shape's dimensions run out as well.</p> <p>Once Numpy determines that two shapes are compatible, the shape of the result is simply the maximum of the two shapes' sizes in each dimension.</p> <p>Put a little bit more formally, here's a pseudo-algorithm:</p> <div class="highlight"><pre><span></span>Inputs: array A with m dimensions; array B with n dimensions

p = max(m, n)
if m &lt; p:
    left-pad A&#39;s shape with 1s until it also has p dimensions
else if n &lt; p:
    left-pad B&#39;s shape with 1s until it also has p dimensions
result_dims = new list with p elements
for i in p-1 ... 0:
    A_dim_i = A.shape[i]
    B_dim_i = B.shape[i]
    if A_dim_i != 1 and B_dim_i != 1 and A_dim_i != B_dim_i:
        raise ValueError(&quot;could not broadcast&quot;)
    else:
        result_dims[i] = max(A_dim_i, B_dim_i)
</pre></div> </div> <div class="section" id="examples"> <h2>Examples</h2> <p>The definition above is precise and complete; to get a feel for it, we'll need a few examples.</p> <p>I'm using the Numpy convention of describing shapes as tuples. <tt class="docutils literal">macros</tt> is a 4-by-3 array, meaning that it has 4 rows with 3 columns each, or 4x3.
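The pseudo-algorithm above translates almost directly into Python. This is only an illustrative sketch of the shape computation (the function name is made up; this is not how Numpy implements broadcasting internally):

```python
def broadcast_result_shape(shape_a, shape_b):
    """Compute the broadcast result shape of two shape tuples, or raise
    ValueError if the shapes are incompatible."""
    p = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s until both have p dimensions.
    a = (1,) * (p - len(shape_a)) + tuple(shape_a)
    b = (1,) * (p - len(shape_b)) + tuple(shape_b)
    result_dims = []
    for dim_a, dim_b in zip(a, b):
        if dim_a != 1 and dim_b != 1 and dim_a != dim_b:
            raise ValueError('could not broadcast')
        # The result size in each dimension is the max of the two sizes.
        result_dims.append(max(dim_a, dim_b))
    return tuple(result_dims)

print(broadcast_result_shape((4, 3), (3,)))     # (4, 3)
print(broadcast_result_shape((5, 4, 1), (5, 1, 3)))  # (5, 4, 3)
```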
The Numpy way of describing the shape of <tt class="docutils literal">macros</tt> is <tt class="docutils literal">(4, 3)</tt>:</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">80</span><span class="p">]:</span> <span class="n">macros</span><span class="o">.</span><span class="n">shape</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">80</span><span class="p">]:</span> <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
</pre></div> <p>When we computed the caloric table using broadcasting, what we did was an operation between <tt class="docutils literal">macros</tt> - a <tt class="docutils literal">(4, 3)</tt> array, and <tt class="docutils literal">cal_per_macro</tt>, a <tt class="docutils literal">(3,)</tt> array <a class="footnote-reference" href="#id9" id="id4"></a>. Therefore, following the broadcasting rules outlined above, the shape <tt class="docutils literal">(3,)</tt> is left-padded with 1 to make comparison with <tt class="docutils literal">(4, 3)</tt> possible. The shapes are then deemed compatible and the result shape is <tt class="docutils literal">(4, 3)</tt>, which is exactly what we observed.</p> <p>Schematically:</p> <div class="highlight"><pre><span></span>(4, 3)                 (4, 3)
        == padding ==&gt;         == result ==&gt; (4, 3)
(3,)                   (1, 3)
</pre></div> <p>Here's another example, broadcasting between a 3-D and a 1-D array:</p> <div class="highlight"><pre><span></span>(3,)                   (1, 1, 3)
          == padding ==&gt;           == result ==&gt; (5, 4, 3)
(5, 4, 3)              (5, 4, 3)
</pre></div> <p>Note, however, that only left-padding with 1s is allowed.
Therefore:</p> <div class="highlight"><pre><span></span>(5,)                   (1, 1, 5)
          == padding ==&gt;           ==&gt; error (5 != 3)
(5, 4, 3)              (5, 4, 3)
</pre></div> <p>Theoretically, had the broadcasting rules been less rigid, we could say that this broadcasting would be valid if we <em>right-padded</em> <tt class="docutils literal">(5,)</tt> with 1s. However, this is not how the rules are defined - therefore these shapes are incompatible.</p> <p>Broadcasting is valid between higher-dimensional arrays too:</p> <div class="highlight"><pre><span></span>(5, 4, 3)                 (1, 5, 4, 3)
             == padding ==&gt;              == result ==&gt; (6, 5, 4, 3)
(6, 5, 4, 3)              (6, 5, 4, 3)
</pre></div> <p>Also, in the beginning of the article I mentioned that broadcasting does not necessarily occur between arrays with different numbers of dimensions. It's perfectly valid to broadcast arrays with the same number of dimensions, as long as they are compatible:</p> <div class="highlight"><pre><span></span>(5, 4, 1)
          == no padding needed ==&gt; result ==&gt; (5, 4, 3)
(5, 1, 3)
</pre></div> <p>Finally, scalars are treated specially as 1-dimensional arrays with size 1:</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">93</span><span class="p">]:</span> <span class="n">ones</span><span class="p">((</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">93</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">],</span> <span class="p">[</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">],</span> <span class="p">[</span> <span
class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">],</span> <span class="p">[</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">]])</span> <span class="c1"># Is the same as:</span> <span class="n">In</span> <span class="p">[</span><span class="mi">94</span><span class="p">]:</span> <span class="n">one</span> <span class="o">=</span> <span class="n">ones</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> <span class="n">In</span> <span class="p">[</span><span class="mi">95</span><span class="p">]:</span> <span class="n">one</span> <span class="n">Out</span><span class="p">[</span><span class="mi">95</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span> <span class="mf">1.</span><span class="p">]])</span> <span class="n">In</span> <span class="p">[</span><span class="mi">96</span><span class="p">]:</span> <span class="n">ones</span><span class="p">((</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span> <span class="o">+</span> <span class="n">one</span> <span class="n">Out</span><span class="p">[</span><span class="mi">96</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">],</span> <span class="p">[</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">],</span> <span class="p">[</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">],</span> <span 
class="p">[</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">]])</span> </pre></div> </div> <div class="section" id="explicit-broadcasting-with-numpy-broadcast"> <h2>Explicit broadcasting with numpy.broadcast</h2> <p>In the examples above, we've seen how Numpy employs broadcasting behind the scenes to match together arrays that have compatible, but not similar, shapes. We can also ask Numpy for a more explicit exposure of broadcasting, using the <tt class="docutils literal">numpy.broadcast</tt> class:</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">103</span><span class="p">]:</span> <span class="n">macros</span><span class="o">.</span><span class="n">shape</span> <span class="n">Out</span><span class="p">[</span><span class="mi">103</span><span class="p">]:</span> <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="n">In</span> <span class="p">[</span><span class="mi">104</span><span class="p">]:</span> <span class="n">cal_per_macro</span><span class="o">.</span><span class="n">shape</span> <span class="n">Out</span><span class="p">[</span><span class="mi">104</span><span class="p">]:</span> <span class="p">(</span><span class="mi">3</span><span class="p">,)</span> <span class="n">In</span> <span class="p">[</span><span class="mi">105</span><span class="p">]:</span> <span class="n">b</span> <span class="o">=</span> <span class="n">broadcast</span><span class="p">(</span><span class="n">macros</span><span class="p">,</span> <span class="n">cal_per_macro</span><span class="p">)</span> </pre></div> <p>Now <tt class="docutils literal">b</tt> is an object of type <tt class="docutils literal">numpy.broadcast</tt>, and we can query it for the result shape of broadcasting, as well as use it to iterate over pairs of elements 
from the input arrays in the order matched by broadcasting them:</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">108</span><span class="p">]:</span> <span class="n">b</span><span class="o">.</span><span class="n">shape</span> <span class="n">Out</span><span class="p">[</span><span class="mi">108</span><span class="p">]:</span> <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="n">In</span> <span class="p">[</span><span class="mi">120</span><span class="p">]:</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">b</span><span class="p">:</span> <span class="k">print</span> <span class="s1">&#39;{0}: {1} {2}&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">.....</span><span class="p">:</span> <span class="mi">1</span><span class="p">:</span> <span class="mf">0.3</span> <span class="mi">9</span> <span class="mi">2</span><span class="p">:</span> <span class="mf">2.5</span> <span class="mi">4</span> <span class="mi">3</span><span class="p">:</span> <span class="mf">3.5</span> <span class="mi">4</span> <span class="mi">4</span><span class="p">:</span> <span class="mf">2.9</span> <span class="mi">9</span> <span class="mi">5</span><span class="p">:</span> <span class="mf">27.5</span> <span class="mi">4</span> <span class="mi">6</span><span class="p">:</span> <span class="mf">0.0</span> <span class="mi">4</span> <span class="mi">7</span><span class="p">:</span> <span class="mf">0.4</span> <span class="mi">9</span> <span class="mi">8</span><span class="p">:</span> <span 
class="mf">1.3</span> <span class="mi">4</span> <span class="mi">9</span><span class="p">:</span> <span class="mf">23.9</span> <span class="mi">4</span> <span class="mi">10</span><span class="p">:</span> <span class="mf">14.4</span> <span class="mi">9</span> <span class="mi">11</span><span class="p">:</span> <span class="mf">6.0</span> <span class="mi">4</span> <span class="mi">12</span><span class="p">:</span> <span class="mf">2.3</span> <span class="mi">4</span> </pre></div> <p>This lets us see very explicitly how the &quot;stretching&quot; of <tt class="docutils literal">cal_per_macro</tt> is done to match the shape of <tt class="docutils literal">macros</tt>. So if you ever want to perform some complex computation on two arrays whose shapes aren't similar but compatible, and you want to use broadcasting, <tt class="docutils literal">numpy.broadcast</tt> can help.</p> </div> <div class="section" id="computing-outer-products-with-broadcasting"> <h2>Computing outer products with broadcasting</h2> <p>As another cool example of broadcasting rules, consider the outer product of two vectors.</p> <p>In linear algebra, it is customary to deal with column vectors by default, using a transpose to denote a row vector. Therefore, given two vectors <img alt="x" class="valign-0" src="https://eli.thegreenplace.net/images/math/11f6ad8ec52a2984abaafd7c3b516503785c2072.png" style="height: 8px;" /> and <img alt="y" class="valign-m4" src="https://eli.thegreenplace.net/images/math/95cb0bfd2977c761298d9624e4b4d4c72a39974a.png" style="height: 12px;" />, their &quot;outer product&quot; is defined as <img alt="xy^T" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2903d8355c674ee7c2e06b7ba1940714f100243e.png" style="height: 19px;" />.
Treating <img alt="x" class="valign-0" src="https://eli.thegreenplace.net/images/math/11f6ad8ec52a2984abaafd7c3b516503785c2072.png" style="height: 8px;" /> and <img alt="y" class="valign-m4" src="https://eli.thegreenplace.net/images/math/95cb0bfd2977c761298d9624e4b4d4c72a39974a.png" style="height: 12px;" /> as Nx1 matrices this matrix multiplication results in:</p> <img alt="$xy^T=\begin{bmatrix} x_1 \\ x_2 \\ ... \\ x_N \end{bmatrix}[y_1, y_2, ..., y_N]= \begin{bmatrix} x_1y_1 &amp;amp; x_1y_2 &amp;amp; \cdots &amp;amp; x_1y_N \\ x_2y_1 &amp;amp; x_2y_2 &amp;amp; \cdots &amp;amp; x_2y_N \\ \vdots\\ x_Ny_1 &amp;amp; x_Ny_2 &amp;amp; \cdots &amp;amp; x_Ny_N \\ \end{bmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/d4c1d88533494c698a0628fe472bb099b2195a13.png" style="height: 97px;" /> <p>How can we implement this in Numpy? Note that the shape of the row vector is <tt class="docutils literal">(1, N)</tt> <a class="footnote-reference" href="#id10" id="id5"></a>. The shape of the column vector is <tt class="docutils literal">(N, 1)</tt>. Therefore, if we apply an element-wise operation between them, broadcasting will kick in, find that the shapes are compatible and the result shape is <tt class="docutils literal">(N, N)</tt>. The row vector is going to be &quot;stretched&quot; over N rows and the column vector over N columns - so we'll get the outer product! 
Here's an interactive session that demonstrates this:</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">137</span><span class="p">]:</span> <span class="n">ten</span> <span class="o">=</span> <span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">11</span><span class="p">)</span> <span class="n">In</span> <span class="p">[</span><span class="mi">138</span><span class="p">]:</span> <span class="n">ten</span> <span class="n">Out</span><span class="p">[</span><span class="mi">138</span><span class="p">]:</span> <span class="n">array</span><span class="p">([</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span> <span class="n">In</span> <span class="p">[</span><span class="mi">139</span><span class="p">]:</span> <span class="n">ten</span><span class="o">.</span><span class="n">shape</span> <span class="n">Out</span><span class="p">[</span><span class="mi">139</span><span class="p">]:</span> <span class="p">(</span><span class="mi">10</span><span class="p">,)</span> <span class="c1"># Using Numpy&#39;s reshape method to convert the row vector into a</span> <span class="c1"># column vector.</span> <span class="n">In</span> <span class="p">[</span><span class="mi">140</span><span class="p">]:</span> <span class="n">ten</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span 
class="p">))</span> <span class="n">Out</span><span class="p">[</span><span class="mi">140</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span> <span class="mi">2</span><span class="p">],</span> <span class="p">[</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span> <span class="mi">4</span><span class="p">],</span> <span class="p">[</span> <span class="mi">5</span><span class="p">],</span> <span class="p">[</span> <span class="mi">6</span><span class="p">],</span> <span class="p">[</span> <span class="mi">7</span><span class="p">],</span> <span class="p">[</span> <span class="mi">8</span><span class="p">],</span> <span class="p">[</span> <span class="mi">9</span><span class="p">],</span> <span class="p">[</span><span class="mi">10</span><span class="p">]])</span> <span class="n">In</span> <span class="p">[</span><span class="mi">141</span><span class="p">]:</span> <span class="n">ten</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span><span class="o">.</span><span class="n">shape</span> <span class="n">Out</span><span class="p">[</span><span class="mi">141</span><span class="p">]:</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="c1"># Let&#39;s see what the &#39;broadcast&#39; class tells us about the resulting</span> <span class="c1"># shape of broadcasting ten and its column-vector version</span> <span class="n">In</span> <span class="p">[</span><span class="mi">142</span><span class="p">]:</span> <span class="n">broadcast</span><span class="p">(</span><span class="n">ten</span><span class="p">,</span> <span class="n">ten</span><span class="o">.</span><span class="n">reshape</span><span 
class="p">((</span><span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">)))</span><span class="o">.</span><span class="n">shape</span> <span class="n">Out</span><span class="p">[</span><span class="mi">142</span><span class="p">]:</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="n">In</span> <span class="p">[</span><span class="mi">143</span><span class="p">]:</span> <span class="n">ten</span> <span class="o">*</span> <span class="n">ten</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> <span class="n">Out</span><span class="p">[</span><span class="mi">143</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">10</span><span class="p">],</span> <span class="p">[</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="mi">20</span><span class="p">],</span> <span class="p">[</span> <span 
class="mi">3</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="mi">21</span><span class="p">,</span> <span class="mi">24</span><span class="p">,</span> <span class="mi">27</span><span class="p">,</span> <span class="mi">30</span><span class="p">],</span> <span class="p">[</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">24</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">36</span><span class="p">,</span> <span class="mi">40</span><span class="p">],</span> <span class="p">[</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">25</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">35</span><span class="p">,</span> <span class="mi">40</span><span class="p">,</span> <span class="mi">45</span><span class="p">,</span> <span class="mi">50</span><span class="p">],</span> <span class="p">[</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="mi">24</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">36</span><span class="p">,</span> <span class="mi">42</span><span class="p">,</span> <span class="mi">48</span><span 
class="p">,</span> <span class="mi">54</span><span class="p">,</span> <span class="mi">60</span><span class="p">],</span> <span class="p">[</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">21</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">35</span><span class="p">,</span> <span class="mi">42</span><span class="p">,</span> <span class="mi">49</span><span class="p">,</span> <span class="mi">56</span><span class="p">,</span> <span class="mi">63</span><span class="p">,</span> <span class="mi">70</span><span class="p">],</span> <span class="p">[</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">24</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">40</span><span class="p">,</span> <span class="mi">48</span><span class="p">,</span> <span class="mi">56</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">72</span><span class="p">,</span> <span class="mi">80</span><span class="p">],</span> <span class="p">[</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="mi">27</span><span class="p">,</span> <span class="mi">36</span><span class="p">,</span> <span class="mi">45</span><span class="p">,</span> <span class="mi">54</span><span class="p">,</span> <span class="mi">63</span><span class="p">,</span> <span class="mi">72</span><span class="p">,</span> <span class="mi">81</span><span class="p">,</span> <span class="mi">90</span><span class="p">],</span> <span class="p">[</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">40</span><span class="p">,</span> <span class="mi">50</span><span 
class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="mi">70</span><span class="p">,</span> <span class="mi">80</span><span class="p">,</span> <span class="mi">90</span><span class="p">,</span> <span class="mi">100</span><span class="p">]])</span> </pre></div> <p>The output should be familiar to anyone who's finished grade school, of course.</p> <p>Interestingly, even though Numpy has a function named <tt class="docutils literal">outer</tt> that computes the outer product between two vectors, my timings show that at least in this particular case broadcasting multiplication as shown above is more than twice as fast - so be sure to always measure.</p> </div> <div class="section" id="use-the-right-tool-for-the-job"> <h2>Use the right tool for the job</h2> <p>I'll end this article with another educational example that demonstrates a problem that can be solved in two different ways, one of which is much more efficient because it uses the right tool for the job.</p> <p>Back to the original example of counting calories in foods. Suppose I just want to know how many calories each serving of food has (total from fats, protein and carbs).</p> <p>Given the <tt class="docutils literal">macros</tt> data and a <tt class="docutils literal">cal_per_macro</tt> breakdown, we can use the broadcasting multiplication as seen before to compute a &quot;calories per macro&quot; table efficiently, for each food. 
All that's left is to add together the columns in each row into a sum - this will be the number of calories per serving in that food:</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">160</span><span class="p">]:</span> <span class="n">macros</span> <span class="o">*</span> <span class="n">cal_per_macro</span> <span class="n">Out</span><span class="p">[</span><span class="mi">160</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span> <span class="mf">2.7</span><span class="p">,</span> <span class="mf">10.</span> <span class="p">,</span> <span class="mf">14.</span> <span class="p">],</span> <span class="p">[</span> <span class="mf">26.1</span><span class="p">,</span> <span class="mf">110.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">],</span> <span class="p">[</span> <span class="mf">3.6</span><span class="p">,</span> <span class="mf">5.2</span><span class="p">,</span> <span class="mf">95.6</span><span class="p">],</span> <span class="p">[</span> <span class="mf">129.6</span><span class="p">,</span> <span class="mf">24.</span> <span class="p">,</span> <span class="mf">9.2</span><span class="p">]])</span> <span class="n">In</span> <span class="p">[</span><span class="mi">161</span><span class="p">]:</span> <span class="nb">sum</span><span class="p">(</span><span class="n">macros</span> <span class="o">*</span> <span class="n">cal_per_macro</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">Out</span><span class="p">[</span><span class="mi">161</span><span class="p">]:</span> <span class="n">array</span><span class="p">([</span> <span class="mf">26.7</span><span class="p">,</span> <span class="mf">136.1</span><span class="p">,</span> <span class="mf">104.4</span><span class="p">,</span> <span class="mf">162.8</span><span class="p">])</span> 
</pre></div> <p>Here I'm using the <tt class="docutils literal">axis</tt> parameter of the <tt class="docutils literal">sum</tt> function to tell Numpy to sum only over axis 1 (columns), rather than computing the sum of the whole multi-dimensional array.</p> <p>Looks easy. But is there a better way? Indeed, if you think for a moment about the operation we've just performed, a natural solution emerges. We've taken a vector (<tt class="docutils literal">cal_per_macro</tt>), element-wise multiplied it with each row of <tt class="docutils literal">macros</tt> and then added up the results. In other words, we've computed the dot-product of <tt class="docutils literal">cal_per_macro</tt> with each row of <tt class="docutils literal">macros</tt>. In linear algebra there's already an operation that will do this for the whole input table: matrix multiplication. You can work out the details on paper, but it's easy to see that multiplying the matrix <tt class="docutils literal">macros</tt> on the right by <tt class="docutils literal">cal_per_macro</tt> as a column vector, we get the same result. Let's check:</p> <div class="highlight"><pre><span></span><span class="c1"># Create a column vector out of cal_per_macro</span> <span class="n">In</span> <span class="p">[</span><span class="mi">168</span><span class="p">]:</span> <span class="n">cal_per_macro_col_vec</span> <span class="o">=</span> <span class="n">cal_per_macro</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> <span class="c1"># Use the &#39;dot&#39; function for matrix multiplication. 
Starting with Python 3.5,</span> <span class="c1"># we&#39;ll be able to use an operator instead: macros @ cal_per_macro_col_vec</span> <span class="n">In</span> <span class="p">[</span><span class="mi">169</span><span class="p">]:</span> <span class="n">macros</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">cal_per_macro_col_vec</span><span class="p">)</span> <span class="n">Out</span><span class="p">[</span><span class="mi">169</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span> <span class="mf">26.7</span><span class="p">],</span> <span class="p">[</span> <span class="mf">136.1</span><span class="p">],</span> <span class="p">[</span> <span class="mf">104.4</span><span class="p">],</span> <span class="p">[</span> <span class="mf">162.8</span><span class="p">]])</span> </pre></div> <p>On my machine, using <tt class="docutils literal">dot</tt> is 4-5x faster than composing <tt class="docutils literal">sum</tt> with element-wise multiplication. Even though the latter is implemented in optimized C code in the guts of Numpy, it has the disadvantage of moving too much data around - computing the intermediate matrix representing the broadcasted multiplication is not really necessary for the end product. 
<tt class="docutils literal">dot</tt>, on the other hand, performs the operation in one step using a highly optimized <a class="reference external" href="https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms">BLAS routine</a>.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id6" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>For the pedantic: I'm taking these numbers from <a class="reference external" href="http://www.calorieking.com">http://www.calorieking.com</a>, and I subtract the fiber from total carbs because it doesn't count for the calories.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id7" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>About 30% faster than the &quot;stretching&quot; method. This is mostly due to the creation of the <tt class="docutils literal"><span class="pre">..._stretch</span></tt> array, which takes time. Once the stretched array is there, the broadcasting method is ~5% faster - this difference being due to a better use of memory (we don't <em>really</em> have to create the whole stretched array, do we? 
It's just repeating the same data so why waste so much memory?)</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>For the shape <tt class="docutils literal">(4, 3, 2)</tt> the trailing dimension is 2, and working from 2 &quot;backwards&quot; produces: 2, then 3, then 4.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id9" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4"></a></td><td>Following the usual Python convention, single-element tuples also have a comma, which helps us distinguish them from other entities.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id10" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id5"></a></td><td>More precisely, <tt class="docutils literal">(1, N)</tt> is the shape of a 1-by-N matrix (matrix with one row and N columns). An actual row vector is just a 1D array with the single-dimension shape <tt class="docutils literal">(10,)</tt>. For most purposes, the two are equivalent in Numpy.</td></tr> </tbody> </table> </div> Memory layout of multi-dimensional arrays2015-09-26T06:06:00-07:002015-09-26T06:06:00-07:00Eli Benderskytag:eli.thegreenplace.net,2015-09-26:/2015/memory-layout-of-multi-dimensional-arrays/<p>When working with multi-dimensional arrays, one important decision programmers have to make fairly early on in the project is what memory layout to use for storing the data, and how to access such data in the most efficient manner.
Since computer memory is inherently linear - a one-dimensional structure, mapping multi-dimensional …</p><p>When working with multi-dimensional arrays, one important decision programmers have to make fairly early on in the project is what memory layout to use for storing the data, and how to access such data in the most efficient manner. Since computer memory is inherently linear - a one-dimensional structure, mapping multi-dimensional data on it can be done in several ways. In this article I want to examine this topic in detail, talking about the various memory layouts available and their effect on the performance of the code.</p> <div class="section" id="row-major-vs-column-major"> <h2>Row-major vs. column-major</h2> <p>By far the two most common memory layouts for multi-dimensional array data are <em>row-major</em> and <em>column-major</em>.</p> <p>When working with 2D arrays (matrices), row-major vs. column-major are easy to describe. The row-major layout of a matrix puts the first row in contiguous memory, then the second row right after it, then the third, and so on. Column-major layout puts the first column in contiguous memory, then the second, etc.</p> <p>Higher dimensions are a bit more difficult to visualize, so let's start with some diagrams showing how 2D layouts work.</p> </div> <div class="section" id="d-row-major"> <h2>2D row-major</h2> <p>First, some notes on the nomenclature of this article. Computer memory will be represented as a linear array with low addresses on the left and high addresses on the right. Also, we're going to use programmer notation for matrices: rows and columns start with zero, at the top-left corner of the matrix. 
Row indices go over rows from top to bottom; column indices go over columns from left to right.</p> <p>As mentioned above, in row-major layout, the first row of the matrix is placed in contiguous memory, then the second, and so on:</p> <img alt="Row major 2D" class="align-center" src="https://eli.thegreenplace.net/images/2015/row-major-2D.png" /> <p>Another way to describe row-major layout is that <em>column indices change the fastest</em>. This should be obvious by looking at the linear layout at the bottom of the diagram. If you read the element index pairs from left to right, you'll notice that the column index changes all the time, and the row index only changes once per row.</p> <p>For programmers, another important observation is that given a row index <img alt="i_{row}" class="valign-m3" src="https://eli.thegreenplace.net/images/math/256e11c46808f68dec43d4a7b0e271f05d697785.png" style="height: 15px;" /> and a column index <img alt="i_{col}" class="valign-m3" src="https://eli.thegreenplace.net/images/math/e0ebbfb8bc0af1c2247c6c3f9119be855fed933d.png" style="height: 15px;" />, the offset of the element they denote in the linear representation is:</p> <img alt="$offset=i_{row}*NCOLS+i_{col}$" class="align-center" src="https://eli.thegreenplace.net/images/math/9161443cbcdff4891bbda9b82127634630ad8952.png" style="height: 16px;" /> <p>Where NCOLS is the number of columns per row in the matrix. It's easy to see this equation fits the linear layout in the diagram shown above.</p> </div> <div class="section" id="d-column-major"> <h2>2D column-major</h2> <p>Describing column-major 2D layout is just taking the description of row-major and replacing every appearance of &quot;row&quot; by &quot;column&quot; and vice versa. 
The first column of the matrix is placed in contiguous memory, then the second, and so on:</p> <img alt="Column major 2D" class="align-center" src="https://eli.thegreenplace.net/images/2015/column-major-2D.png" /> <p>In column-major layout, <em>row indices change the fastest</em>. The offset of an element in column-major layout can be found using this equation:</p> <img alt="$offset=i_{col}*NROWS+i_{row}$" class="align-center" src="https://eli.thegreenplace.net/images/math/ab533f15375dcdb69e7affdd1a4c835e146b7751.png" style="height: 16px;" /> <p>Where NROWS is the number of rows per column in the matrix.</p> </div> <div class="section" id="beyond-2d-indexing-and-layout-of-n-dimensional-arrays"> <h2>Beyond 2D - indexing and layout of N-dimensional arrays</h2> <p>Even though matrices are the most common multi-dimensional arrays programmers deal with, they are by no means the only ones. The notation of multi-dimensional arrays is fully generalizable to more than 2 dimensions. These entities are commonly called &quot;N-D arrays&quot; or &quot;tensors&quot;.</p> <p>When we move to 3D and beyond, it's best to leave the row/column notation of matrices behind. This is because this notation doesn't easily translate to 3 dimensions due to a <a class="reference external" href="https://eli.thegreenplace.net/2014/meshgrids-and-disambiguating-rows-and-columns-from-cartesian-coordinates/">common confusion</a> between rows, columns and the Cartesian coordinate system. In 4 dimensions and above, we lose any purely-visual intuition to describe multi-dimensional entities anyway, so it's best to stick to a consistent mathematical notation instead.</p> <p>So let's talk about some arbitrary number of dimensions <em>d</em>, numbered from 1 to <em>d</em>. 
For each dimension <img alt="1\leq i\leq d" class="valign-m3" src="https://eli.thegreenplace.net/images/math/05446b54bd23d571898ab5f1ad448f7ca767f19a.png" style="height: 16px;" />, <img alt="N_i" class="valign-m3" src="https://eli.thegreenplace.net/images/math/855336587fa59262965cdb9a2a6114933586800b.png" style="height: 15px;" /> is the size of the dimension. Also, the index of an element in dimension <img alt="i" class="valign-0" src="https://eli.thegreenplace.net/images/math/042dc4512fa3d391c5170cf3aa61e6a638f84342.png" style="height: 12px;" /> is <object class="valign-m3" data="https://eli.thegreenplace.net/images/math/5b05dd3722f57cd7ac250228f9a1aaf3af86311d.svg" style="height: 11px;" type="image/svg+xml">n_i</object>. For example, in the latest matrix diagram above (where column-layout is shown), we have <img alt="d=2" class="valign-0" src="https://eli.thegreenplace.net/images/math/8587fbaabf40db5bd2eb87f7ec6112beb7200253.png" style="height: 13px;" />. If we choose dimension 1 to be the row and dimension 2 to be the column, then <img alt="N_1=N_2=3" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2f780e2fb7cabe0948456a71d435b0a136de60f9.png" style="height: 16px;" />, and the element in the bottom-left corner of the matrix has <img alt="n_1=2" class="valign-m4" src="https://eli.thegreenplace.net/images/math/0fb7ec5c44ec7de842ea61803aa4c9aec6412770.png" style="height: 16px;" /> and <img alt="n_2=0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/7d5ac2db0b27193cce99b1fe091ef5ce84eee9a9.png" style="height: 15px;" />.</p> <p>In row-major layout of multi-dimensional arrays, the <em>last</em> index is the fastest changing. 
In case of matrices the last index is columns, so this is equivalent to the previous definition.</p> <p>Given a <img alt="d" class="valign-0" src="https://eli.thegreenplace.net/images/math/3c363836cf4e16666669a25da280a1865c2d2874.png" style="height: 13px;" />-dimensional array, with the notation shown above, we compute the memory location of an element from its indices as:</p> <img alt="$offset=n_d + N_d \cdot (n_{d-1} + N_{d-1} \cdot (n_{d-2} + N_{d-2} \cdot (\cdots + N_2 n_1)\cdots))) = \sum_{i=1}^d \left( \prod_{j=i+1}^d N_j \right) n_i$" class="align-center" src="https://eli.thegreenplace.net/images/math/032b2eb5714fa4457fd349eec8f775c3e75584cd.png" style="height: 65px;" /> <p>For a matrix, <img alt="d=2" class="valign-0" src="https://eli.thegreenplace.net/images/math/8587fbaabf40db5bd2eb87f7ec6112beb7200253.png" style="height: 13px;" />, this reduces to:</p> <img alt="$offset=n_2 + N_2 \cdot n_1$" class="align-center" src="https://eli.thegreenplace.net/images/math/c80b9d7877819556e5059016a2426e9727e0f949.png" style="height: 16px;" /> <p>Which is exactly the formula we've seen above for row-major layout, just using a slightly more formal notation.</p> <p>Similarly, in column-major layout of multi-dimensional arrays, the <em>first</em> index is the fastest changing. 
Given a <img alt="d" class="valign-0" src="https://eli.thegreenplace.net/images/math/3c363836cf4e16666669a25da280a1865c2d2874.png" style="height: 13px;" />-dimensional array, we compute the memory location of an element from its indices as:</p> <img alt="$offset=n_1 + N_1 \cdot (n_2 + N_2 \cdot (n_3 + N_3 \cdot (\cdots + N_{d-1} n_d)\cdots))) = \sum_{i=1}^d \left( \prod_{j=1}^{i-1} N_j \right) n_i$" class="align-center" src="https://eli.thegreenplace.net/images/math/35ece5a7b18c317a71e6914ff62c7dd7840952cf.png" style="height: 65px;" /> <p>And again, for a matrix with <img alt="d=2" class="valign-0" src="https://eli.thegreenplace.net/images/math/8587fbaabf40db5bd2eb87f7ec6112beb7200253.png" style="height: 13px;" /> this reduces to the familiar:</p> <img alt="$offset=n_1+N_1\cdot n_2$" class="align-center" src="https://eli.thegreenplace.net/images/math/c084b1f35e57a402567ddc8058ea346d574cd207.png" style="height: 16px;" /> </div> <div class="section" id="example-in-3d"> <h2>Example in 3D</h2> <p>Let's see how this works out in 3D, which we can still visualize. Assuming 3 dimensions: rows, columns and depth. The following diagram shows the memory layout of a 3D array with <img alt="N_1=N_2=N_3=3" class="valign-m4" src="https://eli.thegreenplace.net/images/math/6275e16f8c4fe91d4591dedfec44bb859159bd4c.png" style="height: 16px;" />, in <em>row-major</em>:</p> <img alt="Row major 3D" class="align-center" src="https://eli.thegreenplace.net/images/2015/row-major-3D.png" /> <p>Note how the last dimension (depth, in this case) changes the fastest and the first (row) changes the slowest. The offset for a given element is:</p> <img alt="$offset=n_3+N_3*(n_2+N_2*n_1)$" class="align-center" src="https://eli.thegreenplace.net/images/math/3952a22345f3e71ecbf5b74899d875ca2b9035f2.png" style="height: 18px;" /> <p>For example, the offset of the element with indices 2,1,1 is 22.</p> <p>As an exercise, try to figure out how this array would be laid out in <em>column-major</em> order. 
But beware - there's a caveat! The term <em>column-major</em> may lead you to believe that columns are the slowest-changing index, but this is wrong. The <em>last</em> index is the slowest changing in column-major, and the last index here is depth, not columns. In fact, columns would be right in the middle in terms of change speed. This is exactly why in the discussion above I suggested dropping the row/column notation when going above 2D. In higher dimensions it becomes confusing, so it's best to refer to the relative change rate of the indices, since these are unambiguous.</p> <p>In fact, one could conceive of a hybrid (or &quot;mixed&quot;) layout where the second dimension changes faster than the first or the third. This would be neither row-major nor column-major, but in itself it's a consistent and perfectly valid layout that may benefit some applications. More details on why we would choose one layout over another come later in the article.</p> </div> <div class="section" id="history-fortran-vs-c"> <h2>History: Fortran vs. C</h2> <p>While knowing which layout a particular data set is using is critical for good performance, there's no single answer to the question of which layout &quot;is better&quot; in general. It's not much different from the big-endian vs. little-endian debate; what's important is to pick a consistent standard and stick to it. Unfortunately, as almost always happens in the world of computing, different programming languages and environments picked different standards.</p> <p>Among the programming languages still popular today, Fortran was definitely one of the pioneers. And Fortran (which is still very important for scientific computing) uses column-major layout. I read somewhere that the reason for this is that column vectors are more commonly used and considered &quot;canonical&quot; in linear algebra computations.
Personally I don't buy this, but you can make your own judgement.</p> <p>A slew of modern languages follow Fortran's lead - Matlab, R, Julia, to name a few. One of the strongest reasons for this is that they want to use LAPACK - a fast Fortran library for linear algebra, so using Fortran's layout makes sense.</p> <p>On the other hand, C and C++ use row-major layout. Following their example are a few other popular languages such as Python, Pascal and Mathematica. Since multi-dimensional arrays are a first-class type in the C language, the standard defines the layout very explicitly in section 6.5.2.1 <a class="footnote-reference" href="#id4" id="id1"></a>.</p> <p>In fact, having the first index change the slowest and the last index change the fastest makes sense if you think about how multi-dimensional arrays in C are indexed.</p> <p>Given the declaration:</p> <div class="highlight"><pre><span></span><span class="kt">int</span> <span class="n">x</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="mi">5</span><span class="p">];</span> </pre></div> <p>Then <tt class="docutils literal">x</tt> is an array of 3 elements, each of which is an array of 5 integers. <tt class="docutils literal"><span class="pre">x[1]</span></tt> is the address of the second array of 5 integers contained in <tt class="docutils literal">x</tt>, and <tt class="docutils literal"><span class="pre">x[1][4]</span></tt> is the fifth integer of the second 5-integer array in <tt class="docutils literal">x</tt>. These indexing rules imply row-major layout.</p> <p>None of this is to say that C could not have chosen column-major layout. It could, but then its multi-dimensional array indexing rules would have to be different as well. The result could be just as consistent as what we have now.</p> <p>Moreover, since C lets you manipulate pointers, you can decide on the layout of data in your program by computing offsets into multi-dimensional arrays on your own.
In fact, this is how most C programs are written.</p> </div> <div class="section" id="memory-layout-example-numpy"> <h2>Memory layout example - numpy</h2> <p>So far we've discussed memory layout purely conceptually - using diagrams and mathematical formulae for index computations. It's worthwhile to see a &quot;real&quot; example of how multi-dimensional arrays are stored in memory. For this purpose, the Numpy library of Python is a great tool since it supports both layout kinds and is easy to play with from an interactive shell.</p> <p>The <a class="reference external" href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html">numpy.array constructor</a> can be used to create multi-dimensional arrays. One of the parameters it accepts is <tt class="docutils literal">order</tt>, which is either &quot;C&quot; for C-style layout (row-major) or &quot;F&quot; for Fortran-style layout (column-major). &quot;C&quot; is the default. Let's see how this looks:</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">42</span><span class="p">]:</span> <span class="n">ar2d</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">11</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">13</span><span class="p">],</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">40</span><span class="p">]],</span> <span class="n">dtype</span><span class="o">=</span><span class="s1">&#39;uint8&#39;</span><span class="p">,</span> <span class="n">order</span><span class="o">=</span><span 
class="s1">&#39;C&#39;</span><span class="p">)</span> <span class="n">In</span> <span class="p">[</span><span class="mi">43</span><span class="p">]:</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">ar2d</span><span class="o">.</span><span class="n">data</span><span class="p">)</span> <span class="n">Out</span><span class="p">[</span><span class="mi">43</span><span class="p">]:</span> <span class="s1">&#39;1 2 3 11 12 13 10 20 40&#39;</span> </pre></div> <p>In &quot;C&quot; order, elements of rows are contiguous, as expected. Let's try Fortran layout now:</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">44</span><span class="p">]:</span> <span class="n">ar2df</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">11</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">13</span><span class="p">],</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">40</span><span class="p">]],</span> <span class="n">dtype</span><span class="o">=</span><span class="s1">&#39;uint8&#39;</span><span class="p">,</span> <span class="n">order</span><span class="o">=</span><span class="s1">&#39;F&#39;</span><span class="p">)</span> <span class="n">In</span> <span class="p">[</span><span 
class="mi">45</span><span class="p">]:</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">ar2df</span><span class="o">.</span><span class="n">data</span><span class="p">)</span> <span class="n">Out</span><span class="p">[</span><span class="mi">45</span><span class="p">]:</span> <span class="s1">&#39;1 11 10 2 12 20 3 13 40&#39;</span> </pre></div> <p>For a more complex example, let's encode the following 3D array as a <tt class="docutils literal">numpy.array</tt> and see how it's laid out:</p> <img alt="Numeric 3D array" class="align-center" src="https://eli.thegreenplace.net/images/2015/numeric-3D-mat.png" /> <p>This array has two rows (first dimension), 4 columns (second dimension) and depth 2 (third dimension). 
As a nested Python list, this is its representation:</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">47</span><span class="p">]:</span> <span class="n">lst3d</span> <span class="o">=</span> <span class="p">[[[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">11</span><span class="p">],</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">12</span><span class="p">],</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">13</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">14</span><span class="p">]],</span> <span class="p">[[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">15</span><span class="p">],</span> <span class="p">[</span><span class="mi">6</span><span class="p">,</span> <span class="mi">16</span><span class="p">],</span> <span class="p">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">17</span><span class="p">],</span> <span class="p">[</span><span class="mi">8</span><span class="p">,</span> <span class="mi">18</span><span class="p">]]]</span> </pre></div> <p>And the memory layout, in both C and Fortran orders:</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">50</span><span class="p">]:</span> <span class="n">ar3d</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">lst3d</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s1">&#39;uint8&#39;</span><span class="p">,</span> <span class="n">order</span><span class="o">=</span><span class="s1">&#39;C&#39;</span><span class="p">)</span> <span class="n">In</span> <span class="p">[</span><span 
class="mi">51</span><span class="p">]:</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">ar3d</span><span class="o">.</span><span class="n">data</span><span class="p">)</span> <span class="n">Out</span><span class="p">[</span><span class="mi">51</span><span class="p">]:</span> <span class="s1">&#39;1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18&#39;</span> <span class="n">In</span> <span class="p">[</span><span class="mi">52</span><span class="p">]:</span> <span class="n">ar3df</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">lst3d</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s1">&#39;uint8&#39;</span><span class="p">,</span> <span class="n">order</span><span class="o">=</span><span class="s1">&#39;F&#39;</span><span class="p">)</span> <span class="n">In</span> <span class="p">[</span><span class="mi">53</span><span class="p">]:</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">ar3df</span><span class="o">.</span><span class="n">data</span><span class="p">)</span> <span class="n">Out</span><span class="p">[</span><span class="mi">53</span><span class="p">]:</span> <span class="s1">&#39;1 5 2 6 3 7 4 8 11 15 12 16 13 17 14 18&#39;</span> </pre></div> <p>Note that in C layout (row-major), the first 
dimension (rows) changes the slowest while the third dimension (depth) changes the fastest. In Fortran layout (column-major) the first dimension changes the fastest while the third dimension changes the slowest.</p> </div> <div class="section" id="performance-why-it-s-worth-caring-which-layout-your-data-is-in"> <h2>Performance: why it's worth caring which layout your data is in</h2> <p>After reading the article thus far, one may wonder why any of this matters. Isn't it just another divergence of standards, à la endianness? As long as we all agree on the layout, isn't this just a boring implementation detail? Why would we care about this?</p> <p>The answer is: performance. We're talking about numerical computing here (number crunching on large data sets) where performance is almost always critical. It turns out that matching the way your algorithm works with the data layout can make or break the performance of an application.</p> <p>The short takeaway is: <strong>always traverse the data in the order it was laid out</strong>. If your data sits in memory in row-major layout, iterate over each row before going to the next one, etc. The rest of the section will explain why this is so and will also present a benchmark with some measurements to get a feel for the consequences of this decision.</p> <p>There are two aspects of modern computer architecture that have a large impact on code performance and are relevant to our discussion: caching and vector units. When we iterate over each row of a row-major array, we access the array sequentially. This pattern has <a class="reference external" href="https://en.wikipedia.org/wiki/Locality_of_reference">spatial locality</a>, which makes the code perfect for cache optimization. Moreover, depending on the operations we do with the data, the CPU's vector unit can kick in since it also requires consecutive access.</p> <p>Graphically, it looks something like the following diagram.
Let's say we have the array: <tt class="docutils literal">int <span class="pre">array</span></tt>, and we iterate over each row, jumping to the next one when all the columns in the current one have been visited. The number within each gray cell is the memory address - it grows by 4 since this is an array of integers. The blue numbered arrow enumerates accesses in the order they are made:</p> <img alt="Row access pattern" class="align-center" src="https://eli.thegreenplace.net/images/2015/row-access-pattern.png" /> <p>Here, the optimal usage of caching and vector instructions should be obvious. Since we always access elements sequentially, this is the perfect scenario for the CPU's caches to kick in - we will <em>always hit the cache</em>. In fact, we always hit the fastest cache - L1, because the CPU correctly pre-fetches all data ahead.</p> <p>Moreover, since we always read one 32-bit word <a class="footnote-reference" href="#id5" id="id2"></a> after another, we can leverage the CPU's vector units to load the data (and perhaps process it later). The purple arrows show how this can be done with SSE vector loads that grab 128-bit chunks (four 32-bit words) at a time. In actual code, this can either be done with intrinsics or by relying on the compiler's auto-vectorizer (as we will soon see in an actual code sample).</p> <p>Contrast this with accessing this row-major data one <em>column</em> at a time, iterating over each column before moving to the next one:</p> <img alt="Column access pattern" class="align-center" src="https://eli.thegreenplace.net/images/2015/column-access-pattern.png" /> <p>We lose spatial locality here, unless the array is very narrow. If there are few columns, consecutive rows <em>may</em> be found in the cache. However, in more typical applications the arrays are large and when access #2 happens it's likely that the memory it accesses is nowhere to be found in the cache.
Unsurprisingly, we also lose the vector units since the accesses are not made to consecutive memory.</p> <p>But what should you do if your algorithm <em>needs</em> to access data column-by-column rather than row-by-row? Very simple! This is precisely what column-major layout is for. With column-major data, this access pattern will hit all the same architectural sweet spots we've seen with consecutive access on row-major data.</p> <p>The diagrams above should be convincing enough, but let's do some actual measurements to see just how dramatic these effects are.</p> <p>The full code for the benchmark is <a class="reference external" href="https://github.com/eliben/code-for-blog/tree/master/2015/benchmark-row-col-major">available here</a>, so I'll just show a few selected snippets. We'll start with a basic matrix type laid out in linear memory:</p> <div class="highlight"><pre><span></span><span class="c1">// A simple Matrix of unsigned integers laid out row-major in a 1D array. M is</span> <span class="c1">// number of rows, N is number of columns.</span> <span class="k">struct</span> <span class="n">Matrix</span> <span class="p">{</span> <span class="kt">unsigned</span><span class="o">*</span> <span class="n">data</span> <span class="o">=</span> <span class="k">nullptr</span><span class="p">;</span> <span class="kt">size_t</span> <span class="n">M</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">N</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="p">};</span> </pre></div> <p>The matrix uses row-major layout: its elements are accessed using this C expression:</p> <div class="highlight"><pre><span></span><span class="n">x</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">row</span> <span class="o">*</span> <span class="n">x</span><span class="p">.</span><span class="n">N</span> <span class="o">+</span> <span
class="n">col</span><span class="p">]</span> </pre></div> <p>Here's a function that adds two such matrices together, using a &quot;bad&quot; access pattern - iterating over the rows in each column before going to the next column. The access pattern is easy to spot by looking at the C code - the inner loop iterates over the fastest-changing index, and in this case it's the rows:</p> <div class="highlight"><pre><span></span><span class="kt">void</span> <span class="nf">AddMatrixByCol</span><span class="p">(</span><span class="n">Matrix</span><span class="o">&amp;</span> <span class="n">y</span><span class="p">,</span> <span class="k">const</span> <span class="n">Matrix</span><span class="o">&amp;</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span> <span class="n">assert</span><span class="p">(</span><span class="n">y</span><span class="p">.</span><span class="n">M</span> <span class="o">==</span> <span class="n">x</span><span class="p">.</span><span class="n">M</span><span class="p">);</span> <span class="n">assert</span><span class="p">(</span><span class="n">y</span><span class="p">.</span><span class="n">N</span> <span class="o">==</span> <span class="n">x</span><span class="p">.</span><span class="n">N</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">col</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">col</span> <span class="o">&lt;</span> <span class="n">y</span><span class="p">.</span><span class="n">N</span><span class="p">;</span> <span class="o">++</span><span class="n">col</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">row</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">row</span> <span class="o">&lt;</span> <span class="n">y</span><span class="p">.</span><span
class="n">M</span><span class="p">;</span> <span class="o">++</span><span class="n">row</span><span class="p">)</span> <span class="p">{</span> <span class="n">y</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">row</span> <span class="o">*</span> <span class="n">y</span><span class="p">.</span><span class="n">N</span> <span class="o">+</span> <span class="n">col</span><span class="p">]</span> <span class="o">+=</span> <span class="n">x</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">row</span> <span class="o">*</span> <span class="n">x</span><span class="p">.</span><span class="n">N</span> <span class="o">+</span> <span class="n">col</span><span class="p">];</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> </pre></div> <p>And here's a version that uses a better pattern, iterating over the columns in each row before going to the next row:</p> <div class="highlight"><pre><span></span><span class="kt">void</span> <span class="nf">AddMatrixByRow</span><span class="p">(</span><span class="n">Matrix</span><span class="o">&amp;</span> <span class="n">y</span><span class="p">,</span> <span class="k">const</span> <span class="n">Matrix</span><span class="o">&amp;</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span> <span class="n">assert</span><span class="p">(</span><span class="n">y</span><span class="p">.</span><span class="n">M</span> <span class="o">==</span> <span class="n">x</span><span class="p">.</span><span class="n">M</span><span class="p">);</span> <span class="n">assert</span><span class="p">(</span><span class="n">y</span><span class="p">.</span><span class="n">N</span> <span class="o">==</span> <span class="n">x</span><span class="p">.</span><span class="n">N</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">row</span> <span 
class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">row</span> <span class="o">&lt;</span> <span class="n">y</span><span class="p">.</span><span class="n">M</span><span class="p">;</span> <span class="o">++</span><span class="n">row</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">col</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">col</span> <span class="o">&lt;</span> <span class="n">y</span><span class="p">.</span><span class="n">N</span><span class="p">;</span> <span class="o">++</span><span class="n">col</span><span class="p">)</span> <span class="p">{</span> <span class="n">y</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">row</span> <span class="o">*</span> <span class="n">y</span><span class="p">.</span><span class="n">N</span> <span class="o">+</span> <span class="n">col</span><span class="p">]</span> <span class="o">+=</span> <span class="n">x</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">row</span> <span class="o">*</span> <span class="n">x</span><span class="p">.</span><span class="n">N</span> <span class="o">+</span> <span class="n">col</span><span class="p">];</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> </pre></div> <p>How do the two access patterns compare? Based on the discussion in this article, we'd expect the by-row access pattern to be faster. But how much faster? And what role does vectorization play vs. efficient usage of cache?</p> <p>To try this, I ran the access patterns on matrices of various sizes, and added a variation of the by-row pattern where vectorization is disabled <a class="footnote-reference" href="#id6" id="id3"></a>. 
Here are the results; the vertical bars represent the bandwidth - how many billions of items (32-bit words) were processed (added) by the given function.</p> <img alt="Benchmark results" class="align-center" src="https://eli.thegreenplace.net/images/2015/rowcol-benchmark1.png" /> <p>Some observations:</p> <ul class="simple"> <li>For matrix sizes above 64x64, by-row access is significantly faster than by-column (6-8x, depending on size). In the case of 64x64, what I believe happens is that both matrices fit into the 32-KB L1 cache of my machine, so the by-column pattern actually manages to find the next row in cache. For larger sizes the matrices no longer fit in L1, so the by-column version has to go to L2 frequently.</li> <li>The vectorized version beats the non-vectorized one by 2-3x in all cases. On large matrices the speedup is a bit smaller; I think this is because at 256x256 and beyond the matrices no longer fit in L2 (my machine has 256KB of it) and need slower memory accesses. So the CPU spends a bit more time waiting for memory on average.</li> <li>The overall speedup of the vectorized by-row access over the by-column access is enormous - up to 25x for large matrices.</li> </ul> <p>I'll have to admit that, while I expected the by-row access to be faster, I didn't expect it to be <em>this much</em> faster. Clearly, choosing the proper access pattern for the memory layout of the data is absolutely crucial for the performance of an application.</p> </div> <div class="section" id="summary"> <h2>Summary</h2> <p>This article examined the issue of multi-dimensional array layout from multiple angles. The main takeaway is: know how your data is laid out and access it accordingly. In C-based programming languages, even though the default layout for 2D-arrays is row-major, when we use pointers to dynamically allocated data, we are free to choose whatever layout we like.
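</p> <p>The same freedom exists in NumPy, where the layout is a per-array property rather than a language rule. A small sketch of picking a layout and inspecting the resulting raw memory:</p>

```python
import numpy as np

# The same logical 2D array in both layouts; only the memory order differs.
a = np.arange(12, dtype='uint8').reshape(3, 4)  # row-major (C order) by default
f = np.asfortranarray(a)                        # column-major (Fortran order) copy

print(a.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # True True

# Logically identical - indexing is unaffected by the underlying layout.
print(a[1, 2] == f[1, 2])  # True

# But the raw memory differs (order='A' dumps bytes in the array's own order):
print(list(a.tobytes(order='A')))  # rows consecutive: 0, 1, 2, 3, 4, ...
print(list(f.tobytes(order='A')))  # columns consecutive: 0, 4, 8, 1, 5, ...
```

<p>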
After all, multi-dimensional arrays are just a logical abstraction above a linear storage system.</p> <p>Due to the wonders of modern CPU architectures, choosing the &quot;right&quot; way to access multi-dimensional data may result in colossal speedups; therefore, this is something that should always be on the programmer's mind when working on large multi-dimensional data sets.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id4" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>Taken from draft n1570 of the C11 standard.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id5" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>The term &quot;word&quot; used to be clearly associated with a 16-bit entity at some point in the past (with &quot;double word&quot; meaning 32 bits and so on), but these days it's too overloaded. In various references online you'll find &quot;word&quot; to be anything from 16 to 64 bits, depending on the CPU architecture. So I'm going to deliberately side-step the confusion by explicitly mentioning the bit size of words.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id6" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>See the <a class="reference external" href="https://github.com/eliben/code-for-blog/tree/master/2015/benchmark-row-col-major">benchmark repository</a> for the full details, including function attributes and compiler flags. 
A special thanks goes to <a class="reference external" href="https://twitter.com/nadavrot">Nadav Rotem</a> for helping me think through an issue I was initially having due to g++ ignoring my <tt class="docutils literal"><span class="pre">no-tree-vectorize</span></tt> attribute when inlining the function into the benchmark. I turned off inlining to fix this.</td></tr> </tbody> </table> </div> Change of basis in Linear Algebra2015-07-23T05:35:00-07:002015-07-23T05:35:00-07:00Eli Benderskytag:eli.thegreenplace.net,2015-07-23:/2015/change-of-basis-in-linear-algebra/<p>Knowing how to convert a vector to a different basis has many practical applications. Gilbert Strang has a nice quote about the importance of basis changes in his book <a class="footnote-reference" href="#id6" id="id1"></a> (emphasis mine):</p> <blockquote> The standard basis vectors for <img alt="\mathbb{R}^n" class="valign-0" src="https://eli.thegreenplace.net/images/math/98165cf6e8d5d442e040d1fa47aa6845f09294c5.png" style="height: 12px;" /> and <img alt="\mathbb{R}^m" class="valign-0" src="https://eli.thegreenplace.net/images/math/91d9290b46ace1360a8a715bd7a1fa701277697b.png" style="height: 12px;" /> are the columns of <em>I</em>. That choice leads to a standard matrix …</blockquote><p>Knowing how to convert a vector to a different basis has many practical applications. Gilbert Strang has a nice quote about the importance of basis changes in his book <a class="footnote-reference" href="#id6" id="id1"></a> (emphasis mine):</p> <blockquote> The standard basis vectors for <img alt="\mathbb{R}^n" class="valign-0" src="https://eli.thegreenplace.net/images/math/98165cf6e8d5d442e040d1fa47aa6845f09294c5.png" style="height: 12px;" /> and <img alt="\mathbb{R}^m" class="valign-0" src="https://eli.thegreenplace.net/images/math/91d9290b46ace1360a8a715bd7a1fa701277697b.png" style="height: 12px;" /> are the columns of <em>I</em>. 
That choice leads to a standard matrix, and <img alt="T(v)=Av" class="valign-m4" src="https://eli.thegreenplace.net/images/math/bb2fe0bcb727e67170597176144917877c871201.png" style="height: 18px;" /> in the normal way. But these spaces also have other bases, so the same <em>T</em> is represented by other matrices. <strong>A main theme of linear algebra is to choose the bases that give the best matrix for T</strong>.</blockquote> <p>This should serve as a good motivation, but I'll leave the applications for future posts; in this one, I will focus on the mechanics of basis change, starting from first principles.</p> <div class="section" id="the-basis-and-vector-components"> <h2>The basis and vector components</h2> <p>A <em>basis</em> of a vector space <img alt="V" class="valign-0" src="https://eli.thegreenplace.net/images/math/c9ee5681d3c59f7541c27a38b67edf46259e187b.png" style="height: 12px;" /> is a set of vectors in <img alt="V" class="valign-0" src="https://eli.thegreenplace.net/images/math/c9ee5681d3c59f7541c27a38b67edf46259e187b.png" style="height: 12px;" /> that is linearly independent and spans <img alt="V" class="valign-0" src="https://eli.thegreenplace.net/images/math/c9ee5681d3c59f7541c27a38b67edf46259e187b.png" style="height: 12px;" />. An <em>ordered basis</em> is a list, rather than a set, meaning that the order of the vectors in an ordered basis matters. This is important with respect to the topics discussed in this post.</p> <p>Let's now define <em>components</em>. 
If <img alt="U = u_1,u_2,...,u_n" class="valign-m4" src="https://eli.thegreenplace.net/images/math/7c4259c4e451f25663d0e2b0a5171ec904eacf1e.png" style="height: 16px;" /> is an ordered basis for <img alt="V" class="valign-0" src="https://eli.thegreenplace.net/images/math/c9ee5681d3c59f7541c27a38b67edf46259e187b.png" style="height: 12px;" /> and <img alt="v" class="valign-0" src="https://eli.thegreenplace.net/images/math/7a38d8cbd20d9932ba948efaa364bb62651d5ad4.png" style="height: 8px;" /> is a vector in <img alt="V" class="valign-0" src="https://eli.thegreenplace.net/images/math/c9ee5681d3c59f7541c27a38b67edf46259e187b.png" style="height: 12px;" />, then there's a unique <a class="footnote-reference" href="#id7" id="id2"></a> list of scalars <img alt="c_1,c_2,...,c_n" class="valign-m4" src="https://eli.thegreenplace.net/images/math/aa078c610a3018c0b8a60fb3f7625854c7ee0667.png" style="height: 12px;" /> such that:</p> <img alt="$v = c_1u_1+c_2u_2+...+c_nu_n$" class="align-center" src="https://eli.thegreenplace.net/images/math/400f6b84c3ee13d328880d7b29bb7c467c868a33.png" style="height: 14px;" /> <p>These are called the <em>components</em> of <img alt="v" class="valign-0" src="https://eli.thegreenplace.net/images/math/7a38d8cbd20d9932ba948efaa364bb62651d5ad4.png" style="height: 8px;" /> relative to the ordered basis <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" />. 
We'll introduce a useful piece of notation here: collect the components <img alt="c_1,c_2,...,c_n" class="valign-m4" src="https://eli.thegreenplace.net/images/math/aa078c610a3018c0b8a60fb3f7625854c7ee0667.png" style="height: 12px;" /> into a column vector and call it <img alt="[v]_{\text{\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/dc587d1ab07e4744144f02d47abbf148b6c339d4.png" style="height: 18px;" />: this is the <em>component vector</em> of <img alt="v" class="valign-0" src="https://eli.thegreenplace.net/images/math/7a38d8cbd20d9932ba948efaa364bb62651d5ad4.png" style="height: 8px;" /> relative to the basis <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" />.</p> </div> <div class="section" id="example-finding-a-component-vector"> <h2>Example: finding a component vector</h2> <p>Let's use <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" /> as an example. <img alt="U=(2,3), (4,5)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/8bbbb50c23562c7c7dfe92d51af940199d7b366e.png" style="height: 18px;" /> is an ordered basis for <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" /> (since the two vectors in it are independent). Say we have <img alt="v=(2,4)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2c140a090873ddce6a3a86023428c2c72250791e.png" style="height: 18px;" />. What is <img alt="[v]_{\text{\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/dc587d1ab07e4744144f02d47abbf148b6c339d4.png" style="height: 18px;" />? 
We'll need to solve the system of equations:</p> <img alt="$\begin{pmatrix} 2 \\ 4 \end{pmatrix}=c_1\begin{pmatrix} 2 \\ 3\end{pmatrix}+c_2\begin{pmatrix} 4 \\ 5 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/1793547a360ef6b1c94da80aebb1698bcd69e20e.png" style="height: 43px;" /> <p>In the 2-D case this is trivial - the solution is <img alt="c_1=3" class="valign-m4" src="https://eli.thegreenplace.net/images/math/a3a4efbce7552fed25d7566f2aa7bb187d035471.png" style="height: 16px;" /> and <img alt="c_2=-1" class="valign-m3" src="https://eli.thegreenplace.net/images/math/fae1728a9257249837a269fc81efb617a999f2a7.png" style="height: 15px;" />. Therefore:</p> <img alt="$[v]_{\text {\tiny U}}=\begin{pmatrix} 3 \\ -1 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/548b7f57bfc01932593b5cdaa597b77f531bd03b.png" style="height: 43px;" /> <p>In the more general case of <img alt="\mathbb{R}^n" class="valign-0" src="https://eli.thegreenplace.net/images/math/98165cf6e8d5d442e040d1fa47aa6845f09294c5.png" style="height: 12px;" />, this is akin to solving a linear system of n equations with n variables. Since the basis vectors are, by definition, linearly independent, solving the system is simply inverting a matrix <a class="footnote-reference" href="#id8" id="id3"></a>.</p> </div> <div class="section" id="change-of-basis-matrix"> <h2>Change of basis matrix</h2> <p>Now comes the key part of the post. Say we have two different ordered bases for the same vector space: <img alt="U = u_1,u_2,...,u_n" class="valign-m4" src="https://eli.thegreenplace.net/images/math/7c4259c4e451f25663d0e2b0a5171ec904eacf1e.png" style="height: 16px;" /> and <img alt="W= w_1,w_2,...,w_n" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2eb203def2a32478b504f9250479c3f56defe9c9.png" style="height: 16px;" />. 
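</p> <p>(Before moving on, note that the component-vector computation from the previous section is just a linear solve, so it can be checked numerically. A sketch with NumPy, where the basis vectors form the <em>columns</em> of the matrix:)</p>

```python
import numpy as np

# Basis U = (2,3), (4,5) as the columns of a matrix, and v = (2,4).
U = np.array([[2.0, 4.0],
              [3.0, 5.0]])
v = np.array([2.0, 4.0])

# Solve U @ c = v for the components of v relative to U.
c = np.linalg.solve(U, v)
print(c)  # [ 3. -1.], matching c1 = 3, c2 = -1 above
```

<p>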
For some <img alt="v\in V" class="valign-m1" src="https://eli.thegreenplace.net/images/math/081239435d752122bef07934bbfe0662cc5228e6.png" style="height: 13px;" />, we can find <img alt="[v]_{\text{\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/9c044f231102324a9c84edf98b7f5f37bcdc2e2e.png" style="height: 18px;" /> and <img alt="[v]_{\text{\tiny W}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/f8ae14559a972991d3c72d1014db829284f86f6a.png" style="height: 18px;" />. How are these two related?</p> <p>Surely, given <img alt="[v]_{\text{\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/dc587d1ab07e4744144f02d47abbf148b6c339d4.png" style="height: 18px;" /> we can find its coefficients in basis <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" /> the same way as we did in the example above <a class="footnote-reference" href="#id9" id="id4"></a>. It involves solving a linear system of <img alt="n" class="valign-0" src="https://eli.thegreenplace.net/images/math/d1854cae891ec7b29161ccaf79a24b00c274bdaa.png" style="height: 8px;" /> equations. We'll have to redo this operation for every vector <img alt="v" class="valign-0" src="https://eli.thegreenplace.net/images/math/7a38d8cbd20d9932ba948efaa364bb62651d5ad4.png" style="height: 8px;" /> we want to convert. Is there a simpler way?</p> <p>Luckily for science, yes. The key here is to find how the basis vectors of <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> look in basis <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" />. 
In other words, we have to find <img alt="[u_1]_{\text{\tiny W}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/f53ee2bf40857eff2038d6543b07f0cbcf02a651.png" style="height: 18px;" />, <img alt="[u_2]_{\text{\tiny W}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/f8f94de8a4ccd3f2d7594afd45caae1403310968.png" style="height: 18px;" /> and so on to <img alt="[u_n]_{\text{\tiny W}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/3f2515b5f7e26704e96f99b9947392836689cf34.png" style="height: 18px;" />.</p> <p>Let's say we do that and find the coefficients to be <img alt="a_{ij}" class="valign-m6" src="https://eli.thegreenplace.net/images/math/f50d06328d8d076870d59691bb4b30fcf23c8f08.png" style="height: 14px;" /> such that:</p> <img alt="$\begin{matrix} u_1=a_{11}w_1+a_{21}w_2+...+a_{n1}w_n \\ u_2=a_{12}w_1+a_{22}w_2+...+a_{n2}w_n \\ ... \\ u_n=a_{1n}w_1+a_{2n}w_2+...+a_{nn}w_n \end{matrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/4c5dd3dc9c5d0acedafd6cd20c09a4498802577a.png" style="height: 80px;" /> <p>Now, given some vector <img alt="v \in V" class="valign-m1" src="https://eli.thegreenplace.net/images/math/bba1fc2f81d8879b4f45b8874c136db6be494079.png" style="height: 13px;" />, suppose its components in basis <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> are:</p> <img alt="$[v]_{\text{\tiny U}}=\begin{pmatrix} c_1 \\ c_2 \\ ... \\ c_n \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/83437d273512b461cab0f132626d1d64df3b32ae.png" style="height: 86px;" /> <p>Let's try to figure out how it looks in basis <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" />. 
The above equation (by definition of components) is equivalent to:</p> <img alt="$v=c_1u_1+c_2u_2+...+c_nu_n$" class="align-center" src="https://eli.thegreenplace.net/images/math/fe3fcc27f581a058afb05460629e332bc2fae909.png" style="height: 14px;" /> <p>Substituting the expansion of the <img alt="u" class="valign-0" src="https://eli.thegreenplace.net/images/math/51e69892ab49df85c6230ccc57f8e1d1606caccc.png" style="height: 8px;" />s in basis <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" />, we get:</p> <img alt="$v=\begin{matrix} c_1(a_{11}w_1+a_{21}w_2+...+a_{n1}w_n)+ \\ c_2(a_{12}w_1+a_{22}w_2+...+a_{n2}w_n)+ \\ ... \\ c_n(a_{1n}w_1+a_{2n}w_2+...+a_{nn}w_n) \end{matrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/450faf4b2cc27042f6b5fd90cdf39b6588f89e67.png" style="height: 84px;" /> <p>Reordering a bit to find the multipliers of each <img alt="w" class="valign-0" src="https://eli.thegreenplace.net/images/math/aff024fe4ab0fece4091de044c58c9ae4233383a.png" style="height: 8px;" />:</p> <img alt="$v=\begin{matrix} (c_1a_{11}+c_2a_{12}+...+c_na_{1n})w_1+ \\ (c_1a_{21}+c_2a_{22}+...+c_na_{2n})w_2+ \\ ... \\ (c_1a_{n1}+c_2a_{n2}+...+c_na_{nn})w_n \end{matrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/2504c20a0377cc5defb98814b0eed08043a2bfc3.png" style="height: 84px;" /> <p>By our definition of vector components, this equation is equivalent to:</p> <img alt="$[v]_{\text{\tiny W}}=\begin{pmatrix} c_1a_{11}+c_2a_{12}+...+c_na_{1n} \\ c_1a_{21}+c_2a_{22}+...+c_na_{2n} \\ ... 
\\ c_1a_{n1}+c_2a_{n2}+...+c_na_{nn} \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/0c385a0913584dc008dc0f088d429cf7fc432002.png" style="height: 86px;" /> <p>Now we're in vector notation again, so we can decompose the column vector on the right-hand side into:</p> <img alt="$[v]_{\text{\tiny W}}=\begin{pmatrix} a_{11} &amp;amp; a_{12} &amp;amp; ... &amp;amp; a_{1n} \\ a_{21} &amp;amp; a_{22} &amp;amp; ... &amp;amp; a_{2n} \\ ... &amp;amp; ... &amp;amp; ... \\ a_{n1} &amp;amp; a_{n2} &amp;amp; ... &amp;amp; a_{nn} \end{pmatrix}\begin{pmatrix}c_1 \\ c_2 \\ ... \\ c_n \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/b55ca14e97d3c8c1be864c00d4995f02f0406845.png" style="height: 86px;" /> <p>This is a matrix times a vector. The vector on the right is <img alt="[v]_{\text{\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/dc587d1ab07e4744144f02d47abbf148b6c339d4.png" style="height: 18px;" />. The matrix should look familiar too because it consists of those <img alt="a_{ij}" class="valign-m6" src="https://eli.thegreenplace.net/images/math/f50d06328d8d076870d59691bb4b30fcf23c8f08.png" style="height: 14px;" /> coefficients we've defined above. In fact, this matrix just represents the basis vectors of <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> expressed in basis <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" />.
Let's call this matrix <img alt="A_{\text{\tiny U}\rightarrow \text{\tiny W}}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e98bd0329cf77376132b69670177abb0c09fd70a.png" style="height: 16px;" /> - the change of basis matrix from <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> to <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" />. It has <img alt="[u_1]_{\text{\tiny W}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/f53ee2bf40857eff2038d6543b07f0cbcf02a651.png" style="height: 18px;" /> to <img alt="[u_n]_{\text{\tiny W}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/3f2515b5f7e26704e96f99b9947392836689cf34.png" style="height: 18px;" /> laid out in its columns:</p> <img alt="$A_{\text{\tiny U}\rightarrow \text{\tiny W}}=\begin{pmatrix}[u_1]_{\text{\tiny W}},[u_2]_{\text{\tiny W}},...,[u_n]_{\text{\tiny W}}]\end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/48c37598af067782c747688c6f2f1b037539c14a.png" style="height: 22px;" /> <p>So we have:</p> <img alt="$[v]_{\text{\tiny W}}=A_{\text{\tiny U}\rightarrow \text{\tiny W}}[v]_{\text{\tiny U}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/40ce253d22d14c257a969dca84539e9d06be237d.png" style="height: 18px;" /> <p>To recap, given two bases <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> and <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" />, we can spend some effort to compute the &quot;change of basis&quot; matrix <img alt="A_{\text{\tiny U}\rightarrow \text{\tiny W}}" class="valign-m4" 
src="https://eli.thegreenplace.net/images/math/e98bd0329cf77376132b69670177abb0c09fd70a.png" style="height: 16px;" />, but then we can easily convert any vector in basis <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> to basis <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" /> if we simply left-multiply it by this matrix.</p> <p>A reasonable question to ask at this point is - what about converting from <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" /> to <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" />? Well, since the computations above are completely generic and don't special-case either base, we can just flip the roles of <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> and <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" /> and get another change of basis matrix, <img alt="A_{\text{\tiny W}\rightarrow \text{\tiny U}}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/7ead67eb4f93b24b2121304e1fa7fe62116cd30d.png" style="height: 16px;" /> - it converts vectors in base <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" /> to vectors in base <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> as follows:</p> <img alt="$[v]_{\text{\tiny U}}=A_{\text{\tiny W}\rightarrow \text{\tiny 
U}}[v]_{\text{\tiny W}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/11b3d1909fe5b2e306590dbb6c5ab4b99c911e43.png" style="height: 18px;" /> <p>And this matrix is:</p> <img alt="$A_{\text{\tiny W}\rightarrow \text{\tiny U}}=\begin{pmatrix}[w_1]_{\text{\tiny U}},[w_2]_{\text{\tiny U}},...,[w_n]_{\text{\tiny U}}]\end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/a3d5cd46635be2f2936ce51aa9c484d5c491a5be.png" style="height: 22px;" /> <p>We will soon see that the two change of basis matrices are intimately related; but first, an example.</p> </div> <div class="section" id="example-changing-bases-with-matrices"> <h2>Example: changing bases with matrices</h2> <p>Let's work through another concrete example in <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" />. We've used the basis <img alt="U=(2,3), (4,5)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/8bbbb50c23562c7c7dfe92d51af940199d7b366e.png" style="height: 18px;" /> before; let's use it again, and also add the basis <img alt="W=(-1,1), (1,1)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/880182e231343144829cea71b4a367c8308bfff1.png" style="height: 18px;" />. 
We've already seen that for <img alt="v=(2,4)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2c140a090873ddce6a3a86023428c2c72250791e.png" style="height: 18px;" /> we have:</p> <img alt="$[v]_{\text {\tiny U}}=\begin{pmatrix} 3 \\ -1 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/548b7f57bfc01932593b5cdaa597b77f531bd03b.png" style="height: 43px;" /> <p>Similarly, we can solve a set of two equations to find <img alt="[v]_{\text {\tiny W}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/81eaa18746b62d0b9942378509bc40309c799e6a.png" style="height: 18px;" />:</p> <img alt="$[v]_{\text {\tiny W}}=\begin{pmatrix} 1 \\ 3 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/cef00c5fb242059409d28cfd7a9de38cb87839a3.png" style="height: 43px;" /> <p>OK, let's see how a change of basis matrix can be used to easily compute one given the other. First, to find <img alt="A_{\text{\tiny U}\rightarrow \text{\tiny W}}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/b731f16fca15beb2c2a898400d28313be0adbfe9.png" style="height: 16px;" /> we'll need <img alt="[u_1]_{\text {\tiny W}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/3eeb78c68e01e053bad97be52f362f84bf4ba536.png" style="height: 18px;" /> and <img alt="[u_2]_{\text {\tiny W}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/a9401eed9ea04bfd2528cc46d8cc664b0fb5b35c.png" style="height: 18px;" />. We know how to do that. 
The result is:</p> <img alt="$[u_1]_{\text {\tiny W}}=\begin{pmatrix} 0.5 \\ 2.5 \end{pmatrix}\qquad[u_2]_{\text {\tiny W}}=\begin{pmatrix} 0.5 \\ 4.5 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/5eca6d9d809f75e136f1619b08ac6677448406d6.png" style="height: 43px;" /> <p>Now we can verify that given <img alt="[v]_{\text {\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/344b247c2488e81b7d0dcb75a6f5addd8746d0d9.png" style="height: 18px;" /> and <img alt="A_{\text{\tiny U}\rightarrow \text{\tiny W}}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/e599bac152aa7b275b4aa6f8b9ad2104a400cd34.png" style="height: 16px;" />, we can easily find <img alt="[v]_{\text {\tiny W}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/ef43978234a008fca8fb67a5534bd6880fb1771f.png" style="height: 18px;" />:</p> <img alt="$[v]_{\text{\tiny W}}=A_{\text{\tiny U}\rightarrow \text{\tiny W}}[v]_{\text{\tiny U}}= \\ \begin{pmatrix} 0.5 &amp;amp; 0.5 \\ 2.5 &amp;amp; 4.5 \end{pmatrix} \\ \begin{pmatrix} 3 \\ -1 \end{pmatrix}=\\ \begin{pmatrix} 1 \\ 3 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/95af5086112fead5258aabba524167e746969d37.png" style="height: 43px;" /> <p>Indeed, it checks out! Let's also verify the other direction. 
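</p>

<p>The same computations are easy to reproduce numerically. Here's a NumPy sketch; the arrays just restate this example's numbers, with basis vectors laid out as matrix columns:</p>

```python
import numpy as np

# Basis vectors of U and W as matrix columns.
U = np.array([[2.0, 4.0],
              [3.0, 5.0]])
W = np.array([[-1.0, 1.0],
              [1.0, 1.0]])

# Each column of A_{U->W} is a u vector expressed in basis W, i.e. the
# solution of W x = u. np.linalg.solve handles all the columns at once.
A_U_to_W = np.linalg.solve(W, U)
print(A_U_to_W)      # [[0.5 0.5]
                     #  [2.5 4.5]]

v_U = np.array([3.0, -1.0])
v_W = A_U_to_W @ v_U
print(v_W)           # [1. 3.]
```

<p>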
To find <img alt="A_{\text{\tiny W}\rightarrow \text{\tiny U}}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/7ead67eb4f93b24b2121304e1fa7fe62116cd30d.png" style="height: 16px;" /> we'll need <img alt="[w_1]_{\text {\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/6312e3ac9abfb423dcc52556e5fe037845c20cb6.png" style="height: 18px;" /> and <img alt="[w_2]_{\text {\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/783adfcaeb03f35b08d163ff1180bd038d777d0d.png" style="height: 18px;" />:</p> <img alt="$[w_1]_{\text {\tiny U}}=\begin{pmatrix} 4.5 \\ -2.5 \end{pmatrix}\qquad[w_2]_{\text {\tiny U}}=\begin{pmatrix}- 0.5 \\ 0.5 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/028bc2e59c73e299a1cdbaf05df5ed605b737512.png" style="height: 43px;" /> <p>And now to find <img alt="[v]_{\text {\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/344b247c2488e81b7d0dcb75a6f5addd8746d0d9.png" style="height: 18px;" />:</p> <img alt="$[v]_{\text{\tiny U}}=A_{\text{\tiny W}\rightarrow \text{\tiny U}}[v]_{\text{\tiny W}}= \\ \begin{pmatrix} 4.5 &amp;amp; -0.5 \\ -2.5 &amp;amp; 0.5 \end{pmatrix} \\ \begin{pmatrix} 1 \\ 3 \end{pmatrix}=\\ \begin{pmatrix} 3 \\ -1 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/4e697d4673dcffaf65875566c470a2113defa3b0.png" style="height: 43px;" /> <p>Checks out again! If you have a keen eye, or have recently spent some time solving linear algebra problems, you'll notice something interesting about the two basis change matrices used in this example. One is the inverse of the other! Is this some sort of coincidence? 
No - in fact, it's always true, and we can prove it.</p> </div> <div class="section" id="the-inverse-of-a-change-of-basis-matrix"> <h2>The inverse of a change of basis matrix</h2> <p>We've derived the change of basis matrix from <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> to <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" /> to perform the conversion:</p> <img alt="$[v]_{\text{\tiny W}}=A_{\text{\tiny U}\rightarrow \text{\tiny W}}[v]_{\text{\tiny U}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/40ce253d22d14c257a969dca84539e9d06be237d.png" style="height: 18px;" /> <p>Left-multiplying this equation by <img alt="A_{\text{\tiny W}\rightarrow \text{\tiny U}}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2bcd0ad887dc21b0aa83a4d6dd234afebd2ed56a.png" style="height: 16px;" />:</p> <img alt="$A_{\text{\tiny W}\rightarrow \text{\tiny U}}[v]_{\text{\tiny W}}=\\ A_{\text{\tiny W}\rightarrow \text{\tiny U}}A_{\text{\tiny U}\rightarrow \text{\tiny W}}[v]_{\text{\tiny U}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/487b88a4e75475a46621a4cdbffb7fc37e30c920.png" style="height: 18px;" /> <p>But the left-hand side is now, by our earlier definition, equal to <img alt="[v]_{\text{\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/dc587d1ab07e4744144f02d47abbf148b6c339d4.png" style="height: 18px;" />, so we get:</p> <img alt="$[v]_{\text{\tiny U}}=\\ A_{\text{\tiny W}\rightarrow \text{\tiny U}}A_{\text{\tiny U}\rightarrow \text{\tiny W}}[v]_{\text{\tiny U}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/8d6e70905656f42e21896dda00ce2590f6218766.png" style="height: 18px;" /> <p>Since this is true for every vector <img alt="[v]_{\text{\tiny U}}" class="valign-m5" 
src="https://eli.thegreenplace.net/images/math/dc587d1ab07e4744144f02d47abbf148b6c339d4.png" style="height: 18px;" />, it must be that:</p> <img alt="$A_{\text{\tiny W}\rightarrow \text{\tiny U}}A_{\text{\tiny U}\rightarrow \text{\tiny W}}=I$" class="align-center" src="https://eli.thegreenplace.net/images/math/ff93ca3481c29a874a9ab5e903321b1d6c4e38f0.png" style="height: 15px;" /> <p>From this, we can infer that <img alt="A_{\text{\tiny W}\rightarrow \text{\tiny U}}=A_{\text{\tiny U}\rightarrow \text{\tiny W}}^{-1}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/0055abbae1609478df5d1ff82e895db6c11a01d5.png" style="height: 20px;" /> and vice versa <a class="footnote-reference" href="#id10" id="id5"></a>.</p> </div> <div class="section" id="changing-to-and-from-the-standard-basis"> <h2>Changing to and from the standard basis</h2> <p>You may have noticed that in the examples above, we short-circuited a little bit of rigor by making up a vector (such as <img alt="v=(2,4)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2c140a090873ddce6a3a86023428c2c72250791e.png" style="height: 18px;" />) without explicitly specifying the basis its components are relative to. This is because we're so used to working with the &quot;standard basis&quot; we often forget it's there.</p> <p>The standard basis (let's call it <img alt="E" class="valign-0" src="https://eli.thegreenplace.net/images/math/e0184adedf913b076626646d3f52c3b49c39ad6d.png" style="height: 12px;" />) consists of unit vectors pointing in the directions of the axes of a Cartesian coordinate system. 
For <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" /> we have the basis vectors:</p> <img alt="$e_1=\begin{pmatrix} 1 \\ 0 \end{pmatrix}\qquad e_2=\begin{pmatrix} 0 \\ 1 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/cae83340f574d5d86529a8fc7e8bb578337b027b.png" style="height: 43px;" /> <p>And more generally in <img alt="\mathbb{R}^n" class="valign-0" src="https://eli.thegreenplace.net/images/math/98165cf6e8d5d442e040d1fa47aa6845f09294c5.png" style="height: 12px;" /> we have an ordered list of <img alt="n" class="valign-0" src="https://eli.thegreenplace.net/images/math/d1854cae891ec7b29161ccaf79a24b00c274bdaa.png" style="height: 8px;" /> vectors <img alt="\left\{ e_i:1\leq i \leq n \right\}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/465709e067cd5b2878edc2a90ccc3a6074bb1a25.png" style="height: 18px;" /> where <img alt="e_i" class="valign-m3" src="https://eli.thegreenplace.net/images/math/067d6602e65a6d628c3a60782ace6c359848f4bc.png" style="height: 11px;" /> has 1 in the <img alt="i" class="valign-0" src="https://eli.thegreenplace.net/images/math/042dc4512fa3d391c5170cf3aa61e6a638f84342.png" style="height: 12px;" />th position and zeros elsewhere.</p> <p>So when we say <img alt="v=(2,4)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2c140a090873ddce6a3a86023428c2c72250791e.png" style="height: 18px;" />, what we actually mean is:</p> <img alt="$\begin{matrix} v=2e_1+4e_2 \\[1em] [v]_{\text {\tiny E}}=\begin{pmatrix} 2 \\ 4 \end{pmatrix} \end{matrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/6e3fdb31e8f3be4dab7cd31785a9b649dfd472bd.png" style="height: 80px;" /> <p>The standard basis is so ingrained in our intuition of vectors that we usually neglect to mention it. This is fine, as long as we're only dealing with the standard basis. 
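</p>

<p>In code, the standard basis is just the columns of the identity matrix, and a vector's components relative to it are the vector itself. A small NumPy illustration:</p>

```python
import numpy as np

E = np.eye(2)              # the columns of the identity are e_1 and e_2
e1, e2 = E[:, 0], E[:, 1]

# Saying v = (2, 4) really means v = 2*e1 + 4*e2.
v = 2.0 * e1 + 4.0 * e2
print(v)                   # [2. 4.]
```

<p>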
Once change of basis is required, it's worthwhile to stick to a more consistent notation to avoid confusion. Moreover, it's often useful to change a vector's basis to or from the standard one. Let's see how that works. Recall how we use the change of basis matrix:</p> <img alt="$[v]_{\text{\tiny W}}=A_{\text{\tiny U}\rightarrow \text{\tiny W}}[v]_{\text{\tiny U}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/40ce253d22d14c257a969dca84539e9d06be237d.png" style="height: 18px;" /> <p>Replacing the arbitrary basis <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" /> by the standard basis <img alt="E" class="valign-0" src="https://eli.thegreenplace.net/images/math/e0184adedf913b076626646d3f52c3b49c39ad6d.png" style="height: 12px;" /> in this equation, we get:</p> <img alt="$[v]_{\text{\tiny E}}=A_{\text{\tiny U}\rightarrow \text{\tiny E}}[v]_{\text{\tiny U}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/db4f797209370d41a5069009b16269967a6ba3ea.png" style="height: 18px;" /> <p>And <img alt="A_{\text{\tiny U}\rightarrow \text{\tiny E}}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/640b44ab89c8033a037e90b0e661937ade5327a4.png" style="height: 16px;" /> is the matrix with <img alt="[u_1]_{\text {\tiny E}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/d74589a95920181b845aff60ea0ece107b2bd337.png" style="height: 18px;" /> to <img alt="[u_n]_{\text {\tiny E}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/2dbd88f177d5bc8bdd5ab5aac11c4b465d9a7406.png" style="height: 18px;" /> in its columns. But wait, these are just the basis vectors of <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" />! 
So finding the matrix <img alt="A_{\text{\tiny U}\rightarrow \text{\tiny E}}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/640b44ab89c8033a037e90b0e661937ade5327a4.png" style="height: 16px;" /> for any given basis <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> is trivial - simply line up <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" />'s basis vectors as columns in their order to get a matrix. This means that any square, invertible matrix can be seen as a change of basis matrix from the basis spelled out in its columns to the standard basis. This is a natural consequence of how multiplying a matrix by a vector works by <a class="reference external" href="http://eli.thegreenplace.net/2015/visualizing-matrix-multiplication-as-a-linear-combination">linearly combining the matrix's columns</a>.</p> <p>OK, so we know how to find <img alt="[v]_{\text {\tiny E}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/8ce451ac6ebbdcc88171aa67947c14a62f81a6d8.png" style="height: 18px;" /> given <img alt="[v]_{\text {\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/83026ce3725f05dfcac554c4d85a364a840e8958.png" style="height: 18px;" />. What about the other way around? 
We'll need <img alt="A_{\text{\tiny E}\rightarrow \text{\tiny U}}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d1ff7b0595a95dd95bac5689d042db02fdbabdf3.png" style="height: 16px;" /> for that, and we know that:</p> <img alt="$A_{\text{\tiny E}\rightarrow \text{\tiny U}}=A_{\text{\tiny U}\rightarrow \text{\tiny E}}^{-1}$" class="align-center" src="https://eli.thegreenplace.net/images/math/60679b364e239e7764aecc891a5579a3fc204ea3.png" style="height: 22px;" /> <p>Therefore:</p> <img alt="$[v]_{\text{\tiny U}}=\\ A_{\text{\tiny E}\rightarrow \text{\tiny U}}[v]_{\text{\tiny E}}=\\ A_{\text{\tiny U}\rightarrow \text{\tiny E}}^{-1}[v]_{\text{\tiny E}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/c9c3e8722c55c44dc21aec2ba823cdb0c1f8a5a0.png" style="height: 22px;" /> </div> <div class="section" id="chaining-basis-changes"> <h2>Chaining basis changes</h2> <p>What happens if we change a vector from one basis to another, and then change the resulting vector to yet another basis? 
I mean, for bases <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" />, <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" /> and <img alt="T" class="valign-0" src="https://eli.thegreenplace.net/images/math/c2c53d66948214258a26ca9ca845d7ac0c17f8e7.png" style="height: 12px;" /> and some arbitrary vector <img alt="v" class="valign-0" src="https://eli.thegreenplace.net/images/math/7a38d8cbd20d9932ba948efaa364bb62651d5ad4.png" style="height: 8px;" />, we'll do:</p> <img alt="$A_{\text{\tiny W}\rightarrow \text{\tiny T}}A_{\text{\tiny U}\rightarrow \text{\tiny W}}[v]_{\text{\tiny U}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/69c8d309f1e7fb5d65bd72570cdadc1769d315f0.png" style="height: 18px;" /> <p>This is simply applying the change of basis by matrix multiplication equation, twice:</p> <img alt="$A_{\text{\tiny W}\rightarrow \text{\tiny T}}(A_{\text{\tiny U}\rightarrow \text{\tiny W}}[v]_{\text{\tiny U}})=\\ A_{\text{\tiny W}\rightarrow \text{\tiny T}}[v]_{\text{\tiny W}}\\ =[v]_{\text{\tiny T}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/f80dd3f39e7e736972e31ec553e8541628ab038c.png" style="height: 19px;" /> <p>What this means is that changes of basis can be chained, which isn't surprising given their linear nature. 
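</p>

<p>Chaining is also easy to check numerically. In the NumPy sketch below, the helper <tt>change_of_basis</tt> and the third basis <tt>T</tt> are made up for illustration; basis vectors are laid out as matrix columns:</p>

```python
import numpy as np

# Basis vectors as columns; T is an arbitrary third basis for this sketch.
U = np.array([[2.0, 4.0],
              [3.0, 5.0]])
W = np.array([[-1.0, 1.0],
              [1.0, 1.0]])
T = np.array([[1.0, 1.0],
              [0.0, 1.0]])

def change_of_basis(X, Y):
    # A_{X->Y}: each column solves Y a = x for the matching column x of X.
    return np.linalg.solve(Y, X)

v_U = np.array([3.0, -1.0])

# Converting U -> W and then W -> T...
chained = change_of_basis(W, T) @ (change_of_basis(U, W) @ v_U)
# ...gives the same result as converting U -> T in one step.
direct = change_of_basis(U, T) @ v_U
print(np.allclose(chained, direct))   # True
```

<p>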
It also means that we've just found <img alt="A_{\text{\tiny U}\rightarrow \text{\tiny T}}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/48f2afad29bdd5da5db7740b67f78541aa502ac6.png" style="height: 16px;" />, since we found how to transform <img alt="[v]_{\text{\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/dc587d1ab07e4744144f02d47abbf148b6c339d4.png" style="height: 18px;" /> to <img alt="[v]_{\text{\tiny T}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/d3758ce460ab170a8949bab249de63bc9bb0e739.png" style="height: 18px;" /> (using an intermediary basis <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" />).</p> <img alt="$A_{\text{\tiny U}\rightarrow \text{\tiny T}}=\\ A_{\text{\tiny W}\rightarrow \text{\tiny T}}A_{\text{\tiny U}\rightarrow \text{\tiny W}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/e1235761ce18868df7f936908c80f49c464550bc.png" style="height: 15px;" /> <p>Finally, let's say that the intermediary basis is not just some arbitrary <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" />, but the standard basis <img alt="E" class="valign-0" src="https://eli.thegreenplace.net/images/math/e0184adedf913b076626646d3f52c3b49c39ad6d.png" style="height: 12px;" />. 
So we have:</p> <img alt="$A_{\text{\tiny U}\rightarrow \text{\tiny T}}=\\ A_{\text{\tiny E}\rightarrow \text{\tiny T}}A_{\text{\tiny U}\rightarrow \text{\tiny E}}=\\ A_{\text{\tiny T}\rightarrow \text{\tiny E}}^{-1}A_{\text{\tiny U}\rightarrow \text{\tiny E}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/5e2c2ecd7ad9e15cfa07d8d9f5eef1c26479c4cd.png" style="height: 22px;" /> <p>We prefer the last form, since finding <img alt="A_{\text{\tiny U}\rightarrow \text{\tiny E}}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/35e8996ab14387788bcd66d1a160d0458efdc05f.png" style="height: 16px;" /> for any basis <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> is, as we've seen above, trivial.</p> </div> <div class="section" id="example-standard-basis-and-chaining"> <h2>Example: standard basis and chaining</h2> <p>It's time to solidify the ideas of the last two sections with a concrete example. We'll use our familiar bases <img alt="U=(2,3), (4,5)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/8bbbb50c23562c7c7dfe92d51af940199d7b366e.png" style="height: 18px;" /> and <img alt="W=(-1,1), (1,1)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/880182e231343144829cea71b4a367c8308bfff1.png" style="height: 18px;" /> from the previous example, along with the standard basis for <img alt="\mathbb{R}^2" class="valign-0" src="https://eli.thegreenplace.net/images/math/2b688757b3d0949451e1fa97e71ac5f5f284a5e4.png" style="height: 15px;" />. 
Previously, we transformed a vector <img alt="v" class="valign-0" src="https://eli.thegreenplace.net/images/math/7a38d8cbd20d9932ba948efaa364bb62651d5ad4.png" style="height: 8px;" /> from <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> to <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" /> and vice-versa using the change of basis matrices between these bases. This time, let's do it by chaining via the standard basis.</p> <p>We'll pick <img alt="v=(2,4)" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2c140a090873ddce6a3a86023428c2c72250791e.png" style="height: 18px;" />. Formally, the components of <img alt="v" class="valign-0" src="https://eli.thegreenplace.net/images/math/7a38d8cbd20d9932ba948efaa364bb62651d5ad4.png" style="height: 8px;" /> relative to the standard basis are:</p> <img alt="$[v]_{\text{\tiny E}} = \begin{pmatrix} 2 \\ 4 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/69766b4a4bc3f500ac1f25d3367774958f163084.png" style="height: 43px;" /> <p>In the last example we've already computed the components of <img alt="v" class="valign-0" src="https://eli.thegreenplace.net/images/math/7a38d8cbd20d9932ba948efaa364bb62651d5ad4.png" style="height: 8px;" /> relative to <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> and <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" />:</p> <img alt="$[v]_{\text {\tiny U}}=\begin{pmatrix} 3 \\ -1 \end{pmatrix}\qquad [v]_{\text {\tiny W}}=\begin{pmatrix} 1 \\ 3 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/c7cac93827ffa171be55db031004356516fb98fa.png" 
style="height: 43px;" /> <p>Previously, one was computed from the other using the &quot;direct&quot; basis change matrices from <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> to <img alt="W" class="valign-0" src="https://eli.thegreenplace.net/images/math/e2415cb7f63df0c9de23362326ad3c37a9adfc96.png" style="height: 12px;" /> and vice versa. Now we can use chaining via the standard basis to achieve the same result. For example, we know that:</p> <img alt="$[v]_{\text{\tiny W}}=\\ A_{\text{\tiny E}\rightarrow \text{\tiny W}}A_{\text{\tiny U}\rightarrow \text{\tiny E}}[v]_{\text{\tiny U}}$" class="align-center" src="https://eli.thegreenplace.net/images/math/ec831bdd78639c2bf290e705c7efb0cb4908cd16.png" style="height: 18px;" /> <p>Finding the change of basis matrices from some basis to <img alt="E" class="valign-0" src="https://eli.thegreenplace.net/images/math/e0184adedf913b076626646d3f52c3b49c39ad6d.png" style="height: 12px;" /> is just laying out the basis vectors as columns, so we immediately know that:</p> <img alt="$A_{\text{\tiny U}\rightarrow \text{\tiny E}}=\begin{pmatrix} 2 &amp;amp; 4\\ 3 &amp;amp; 5 \end{pmatrix}\qquad \qquad \\ A_{\text{\tiny W}\rightarrow \text{\tiny E}}=\begin{pmatrix} -1 &amp;amp; 1\\ 1 &amp;amp; 1 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/56debe1dbb75f938c25fe7b4645b6602bb13e637.png" style="height: 43px;" /> <p>The change of basis matrix from <img alt="E" class="valign-0" src="https://eli.thegreenplace.net/images/math/e0184adedf913b076626646d3f52c3b49c39ad6d.png" style="height: 12px;" /> to some basis is the inverse, so by inverting the above matrices we find:</p> <img alt="$A_{\text{\tiny E}\rightarrow \text{\tiny U}}=A_{\text{\tiny U}\rightarrow \text{\tiny E}}^{-1}=\begin{pmatrix} -2.5 &amp;amp; 2 \\ 1.5 &amp;amp; -1 \end{pmatrix}\qquad \qquad \\ A_{\text{\tiny E}\rightarrow \text{\tiny 
W}}=A_{\text{\tiny W}\rightarrow \text{\tiny E}}^{-1}=\begin{pmatrix} -0.5 &amp;amp; 0.5 \\ 0.5 &amp;amp; 0.5 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/8810ea8701a2aba7df82e88302df93c890a26e26.png" style="height: 43px;" /> <p>Now we have all we need to find <img alt="[v]_{\text{\tiny W}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/f8ae14559a972991d3c72d1014db829284f86f6a.png" style="height: 18px;" /> from <img alt="[v]_{\text{\tiny U}}" class="valign-m5" src="https://eli.thegreenplace.net/images/math/dc587d1ab07e4744144f02d47abbf148b6c339d4.png" style="height: 18px;" />:</p> <img alt="$[v]_{\text{\tiny W}}=\\ A_{\text{\tiny E}\rightarrow \text{\tiny W}}A_{\text{\tiny U}\rightarrow \text{\tiny E}}[v]_{\text{\tiny U}}=\begin{pmatrix} -0.5 &amp;amp; 0.5 \\ 0.5 &amp;amp; 0.5 \end{pmatrix}\begin{pmatrix} 2 &amp;amp; 4\\ 3 &amp;amp; 5 \end{pmatrix}\begin{pmatrix} 3 \\ -1 \end{pmatrix}=\begin{pmatrix} 1 \\ 3 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/1f5875aba56faf3fda3c9f4c72b1421529958116.png" style="height: 43px;" /> <p>The other direction can be done similarly.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id6" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td><em>Introduction to Linear Algebra</em>, 4th edition, section 7.2</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id7" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>Why is this list unique? 
Because given a basis <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" /> for a vector space <img alt="V" class="valign-0" src="https://eli.thegreenplace.net/images/math/c9ee5681d3c59f7541c27a38b67edf46259e187b.png" style="height: 12px;" />, every <img alt="v\in V" class="valign-m1" src="https://eli.thegreenplace.net/images/math/081239435d752122bef07934bbfe0662cc5228e6.png" style="height: 13px;" /> can be expressed <em>uniquely</em> as a linear combination of the vectors in <img alt="U" class="valign-0" src="https://eli.thegreenplace.net/images/math/b2c7c0caa10a0cca5ea7d69e54018ae0c0389dd6.png" style="height: 12px;" />. The proof for this is very simple - just assume there are two different ways to express <img alt="v" class="valign-0" src="https://eli.thegreenplace.net/images/math/7a38d8cbd20d9932ba948efaa364bb62651d5ad4.png" style="height: 8px;" /> - two alternative sets of components. Subtract one from the other and use linear independence of the basis vectors to conclude that the two ways must be the same one.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>The matrix here has the basis vectors laid out in its columns. Since the basis vectors are independent, the matrix is invertible. 
In our small example, the matrix equation we're looking to solve is:</td></tr> </tbody> </table> <img alt="$\begin{pmatrix} 2 &amp;amp; 4 \\ 3 &amp;amp; 5 \end{pmatrix}\begin{pmatrix} c_1 \\ c_2 \end{pmatrix}=\begin{pmatrix} 2 \\ 4 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/6d840237e5940eaadf2002f888e8537e48e90158.png" style="height: 43px;" /> <table class="docutils footnote" frame="void" id="id9" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4"></a></td><td>The example converts from the standard basis to some other basis, but converting from a non-standard basis to another requires exactly the same steps: we try to find coefficients such that a combination of some set of basis vectors adds up to some components in another basis.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id10" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id5"></a></td><td>For square matrices <img alt="A" class="valign-0" src="https://eli.thegreenplace.net/images/math/6dcd4ce23d88e2ee9568ba546c007c63d9131c1b.png" style="height: 12px;" /> and <img alt="B" class="valign-0" src="https://eli.thegreenplace.net/images/math/ae4f281df5a5d0ff3cad6371f76d5c29b6d953ec.png" style="height: 12px;" />, if <img alt="AB=I" class="valign-0" src="https://eli.thegreenplace.net/images/math/845d8defe3847392f0e4b18b07786cd7f47ddf74.png" style="height: 12px;" /> then also <img alt="BA=I" class="valign-0" src="https://eli.thegreenplace.net/images/math/b1d87d31f656d8634f1e2d862810272a919c2806.png" style="height: 12px;" />.</td></tr> </tbody> </table> </div> The Normal Equation and matrix calculus2015-05-27T06:19:00-07:002015-05-27T06:19:00-07:00Eli Benderskytag:eli.thegreenplace.net,2015-05-27:/2015/the-normal-equation-and-matrix-calculus/<p>A few months ago I wrote <a 
class="reference external" href="http://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression">a post</a> on formulating the Normal Equation for linear regression. A crucial part in the formulation is using <a class="reference external" href="http://en.wikipedia.org/wiki/Matrix_calculus">matrix calculus</a> to compute a scalar-by-vector derivative. I didn't spend much time explaining how this step works, instead remarking:</p> <blockquote> Deriving by a vector may feel uncomfortable, but there's nothing to worry about. Recall that here we only use matrix notation to conveniently represent a system of linear formulae. So we derive by each component of the vector, and then combine the resulting derivatives into a vector again.</blockquote> <p>According to the comments received on the post, folks didn't find this convincing and asked for more details. One commenter even said that &quot;matrix calculus feels handwavy&quot;, something which I fully agree with. The reason matrix calculus feels handwavy is that it's not as commonly encountered as &quot;regular&quot; calculus, and hence its identities and intuitions are not as familiar. However, there's really not that much to it, as I want to show here.</p> <p>Let's get started with a simple example, which I'll use to demonstrate the principles. 
Say we have the function:</p> <img alt="$f(v)=a^Tv$" class="align-center" src="https://eli.thegreenplace.net/images/math/94f87149715376908db65a00f793836a4b2092a9.png" style="height: 21px;" /> <p>Where <strong>a</strong> and <strong>v</strong> are vectors with <em>n</em> components <a class="footnote-reference" href="#id4" id="id1"></a>. We want to compute its derivative by <strong>v</strong>. But wait, while a &quot;regular&quot; derivative by a scalar is clearly defined (using limits), what does deriving by a vector mean? It simply means that we derive by each component of the vector separately, and then combine the results into a new vector <a class="footnote-reference" href="#id5" id="id2"></a>. In other words:</p> <img alt="$\frac{\partial f}{\partial v}=\begin{pmatrix}\frac{\partial f}{\partial v_1}\\[1em] \frac{\partial f}{\partial v_2}\\ ...\\ \frac{\partial f}{\partial v_n}\\[1em] \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/13d227107c5323f47460ad077504fda60726d933.png" style="height: 131px;" /> <p>Let's see how this works out for our function <em>f</em>. 
It may be more convenient to rewrite it by using components rather than vector notation:</p> <img alt="$f(v)=a^Tv=a_1v_1+a_2v_2+...+a_nv_n$" class="align-center" src="https://eli.thegreenplace.net/images/math/e9e17e44bb85d825f304b09247a7f3cfbe11f64e.png" style="height: 21px;" /> <p>Computing the derivatives by each component, we'll get:</p> <img alt="$\begin{matrix}\frac{\partial f}{\partial v_1}=a_1\\[1em] \frac{\partial f}{\partial v_2}=a_2\\ ...\\ \frac{\partial f}{\partial v_n}=a_n\\[1em] \end{matrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/768563e5b7e2e8cddd00830e9b945419f598e4bb.png" style="height: 114px;" /> <p>So we have a sequence of partial derivatives, which we combine into a vector:</p> <img alt="$\frac{\partial f}{\partial v}=\begin{pmatrix}a_1\\ ...\\ a_n\\ \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/b13cc64568603d73240709c1fb49cfcc7f2a2b62.png" style="height: 65px;" /> <p>Or, in other words <img alt="\frac{\partial f}{\partial v}=a" class="valign-m7" src="https://eli.thegreenplace.net/images/math/1f3eaea99f7fab11ac1b70dc8b618635a9ed4c91.png" style="height: 25px;" />.</p> <p>This example demonstrates the algorithm for computing scalar-by-vector derivatives:</p> <ol class="arabic simple"> <li>Figure out what the dimensions of all vectors and matrices are.</li> <li>Expand the vector equations into their full form (a multiplication of two vectors is either a scalar or a matrix, depending on their orientation, etc.) Note that this will end up with a scalar.</li> <li>Compute the derivative of the scalar by each component of the variable vector separately.</li> <li>Combine the derivatives into a vector.</li> </ol> <p>Similarly to regular calculus, matrix and vector calculus rely on a set of identities to make computations more manageable. 
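Before reaching for identities, the result above can be verified mechanically: approximate each partial derivative with a central finite difference and compare against a. A minimal NumPy sketch (the vectors here are made-up example data):

```python
import numpy as np

# f(v) = a^T v for a fixed vector a; the claim is that df/dv = a.
a = np.array([2.0, -1.0, 3.0])
f = lambda v: a @ v

v0 = np.array([0.5, 1.5, -2.0])   # an arbitrary point
eps = 1e-6

# Central finite difference for each component of v separately.
grad = np.zeros_like(v0)
for i in range(len(v0)):
    step = np.zeros_like(v0)
    step[i] = eps
    grad[i] = (f(v0 + step) - f(v0 - step)) / (2 * eps)

assert np.allclose(grad, a)   # the componentwise derivatives collect into a
```

Since f is linear, the finite difference is exact up to floating point, and the gradient is a regardless of the point v0 chosen.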
We can either go the hard way (computing the derivative of each function from basic principles using limits), or the easy way - applying the plethora of convenient identities that were developed to make this task simpler. The identity for computing the derivative of <img alt="a^Tv" class="valign-0" src="https://eli.thegreenplace.net/images/math/ea7bffcd29c6bad40e358ad7313102670fb1a9cf.png" style="height: 15px;" /> shown above plays the role of <img alt="\frac{d}{dx}ax=a" class="valign-m6" src="https://eli.thegreenplace.net/images/math/999f262480b3690892d0af5651b96160d924997e.png" style="height: 22px;" /> in regular calculus.</p> <p>Now we have the tools to understand how the vector derivatives in the <a class="reference external" href="http://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression">normal equation article</a> were computed. As a reminder, this is the matrix form of the cost function <em>J</em>:</p> <img alt="$J(\theta)=\theta^TX^TX\theta-2(X\theta)^Ty+y^Ty$" class="align-center" src="https://eli.thegreenplace.net/images/math/2864b88546c007a79dc92271f5e01487ba608e43.png" style="height: 21px;" /> <p>And we're interested in computing <img alt="\frac{\partial J}{\partial \theta}" class="valign-m7" src="https://eli.thegreenplace.net/images/math/27ffac3eede7fce0b342abf8fc10d29f24c68263.png" style="height: 24px;" />. The equation for <em>J</em> consists of three terms added together. The last one <img alt="y^Ty" class="valign-m4" src="https://eli.thegreenplace.net/images/math/81015d6225923cec985bef47ca151ef1cb654c92.png" style="height: 19px;" /> doesn't contribute to the derivative because it doesn't depend on the variable. 
Let's start looking at the second (since it's simpler than the first) - and give it a name, for convenience:</p> <img alt="$P(\theta)=2(X\theta)^Ty$" class="align-center" src="https://eli.thegreenplace.net/images/math/35d3ddf05898e8bc2085030aa399ce98318674f9.png" style="height: 21px;" /> <p>We'll start by recalling what all the dimensions are. <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> is a vector of n components. <img alt="y" class="valign-m4" src="https://eli.thegreenplace.net/images/math/95cb0bfd2977c761298d9624e4b4d4c72a39974a.png" style="height: 12px;" /> is a vector of m components. <img alt="X" class="valign-0" src="https://eli.thegreenplace.net/images/math/c032adc1ff629c9b66f22749ad667e6beadf144b.png" style="height: 12px;" /> is an m-by-n matrix.</p> <p>Let's see what <em>P</em> expands to <a class="footnote-reference" href="#id6" id="id3"></a>:</p> <img alt="$P(\theta)=2\left [ \begin{pmatrix} x_1_1 &amp;amp; x_1_2 &amp;amp; ... &amp;amp; x_1_n\\ x_2_1 &amp;amp; ... &amp;amp; ... &amp;amp; x_2_n\\ ...\\ x_m_1 &amp;amp; ... &amp;amp; ... 
&amp;amp; x_m_n\\ \end{pmatrix}\begin{pmatrix} \theta_1\\ \theta_2\\ ...\\ \theta_n\\ \end{pmatrix} \right ]^T\begin{pmatrix} y_1\\ y_2\\ ...\\ y_m\\ \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/a7873ed04e274b30852e0f8d9450b5abc200ac17.png" style="height: 91px;" /> <p>Computing the matrix-by-vector multiplication inside the parens:</p> <img alt="$P(x)=2\left [ \begin{pmatrix} x_1_1\theta_1+...+x_1_n\theta_n\\ x_2_1\theta_1+...+x_2_n\theta_n\\ ...\\ x_m_1\theta_1+...+x_m_n\theta_n \end{pmatrix} \right ]^T\begin{pmatrix} y_1\\ y_2\\ ...\\ y_m\\ \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/6b9b8e2335579f352a19ef3da609be2e8b2d9925.png" style="height: 91px;" /> <p>And finally, multiplying the two vectors together:</p> <img alt="$P(x)=2(x_1_1\theta_1+...+x_1_n\theta_n)y_1+2(x_2_1\theta_1+...+x_2_n\theta_n)y_2+...+2(x_m_1\theta_1+...+x_m_n\theta_n)y_m$" class="align-center" src="https://eli.thegreenplace.net/images/math/3271758ac98b149969516dd809fd35b90aacf056.png" style="height: 18px;" /> <p>Working with such formulae makes you appreciate why mathematicians have long ago come up with shorthand notations like &quot;sigma&quot; summation:</p> <img alt="$P(x)=2\sum_{r=1}^{m}y_r(x_r_1\theta_1+...+x_r_n\theta_n)=2\sum_{r=1}^{m}y_r\sum_{c=1}^{n}x_r_c\theta_c$" class="align-center" src="https://eli.thegreenplace.net/images/math/6c71eb575ab3fafbc7b268be33d0d17a37bb1553.png" style="height: 50px;" /> <p>OK, so we've finally completed step 2 of the algorithm - we have the scalar equation for <em>P</em>. 
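As a sanity check that the expansion didn't lose anything, the sigma form can be compared against the original matrix form on small random data (sizes and values here are arbitrary):

```python
import numpy as np

# Small random instance just to exercise the algebra.
rng = np.random.default_rng(0)
m, n = 4, 3
X = rng.standard_normal((m, n))
theta = rng.standard_normal(n)
y = rng.standard_normal(m)

# Matrix form: P = 2 (X theta)^T y
P_matrix = 2 * (X @ theta) @ y

# Fully expanded sigma form: 2 * sum_r y_r * (sum_c x_rc * theta_c)
P_sigma = 2 * sum(y[r] * sum(X[r, c] * theta[c] for c in range(n))
                  for r in range(m))

assert np.isclose(P_matrix, P_sigma)
```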
Now it's time to compute its derivative by each <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" />:</p> <img alt="$\begin{matrix} \frac{\partial P}{\partial \theta_1}=2(x_1_1y_1+...+x_m_1y_m)\\[1em] \frac{\partial P}{\partial \theta_2}=2(x_1_2y_1+...+x_m_2y_m)\\ ...\\ \frac{\partial P}{\partial \theta_n}=2(x_1_ny_1+...+x_m_ny_m) \end{matrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/889eb3c4e50b4fbdf5380c4d4e31ac4c0c09dddd.png" style="height: 111px;" /> <p>Now comes the most interesting part. If we treat <img alt="\frac{\partial P}{\partial \theta}" class="valign-m7" src="https://eli.thegreenplace.net/images/math/3c653fa292156c8914f1463fcb6869633d37487c.png" style="height: 24px;" /> as a vector of n components, we can rewrite this set of equations using a matrix-by-vector multiplication:</p> <img alt="$\frac{\partial P}{\partial \theta}=2X^Ty$" class="align-center" src="https://eli.thegreenplace.net/images/math/7f75aa0f038ca73c58e95ef604ffb54468a18ae2.png" style="height: 38px;" /> <p>Take a moment to convince yourself this is true. It's just collecting the individual components of <strong>X</strong> into a matrix and the individual components of <strong>y</strong> into a vector. Since <strong>X</strong> is an m-by-n matrix and <strong>y</strong> is an m-by-1 column vector, the dimensions work out and the result is an n-by-1 column vector.</p> <p>So we've just computed the second term of the vector derivative of <em>J</em>. 
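If you'd rather have the computer do the convincing, the same finite-difference trick from before confirms the result on random data (a sketch; sizes and values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

P = lambda theta: 2 * (X @ theta) @ y   # scalar-valued function of theta

theta0 = rng.standard_normal(n)
eps = 1e-6
I = np.eye(n)

# Finite-difference gradient, one theta component at a time.
num_grad = np.array([(P(theta0 + eps * I[i]) - P(theta0 - eps * I[i])) / (2 * eps)
                     for i in range(n)])

# Matches the derived identity dP/dtheta = 2 X^T y.
assert np.allclose(num_grad, 2 * X.T @ y, atol=1e-6)
```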
In the process, we've discovered a useful vector derivative identity for a matrix <strong>X</strong> and vectors <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> and <strong>y</strong>:</p> <img alt="$\frac{\partial (X\theta)^T y}{\partial \theta}=X^Ty$" class="align-center" src="https://eli.thegreenplace.net/images/math/bf7325787bc464f067372a6d4ed612ea514d29b6.png" style="height: 41px;" /> <p>OK, now let's get back to the full definition of <em>J</em> and see how to compute the derivative of its first term. We'll give it the name <em>Q</em>:</p> <img alt="$Q(\theta)=\theta^TX^TX\theta$" class="align-center" src="https://eli.thegreenplace.net/images/math/0031acbab8dba6cef63f2605a15a0b7bc826766a.png" style="height: 21px;" /> <p>This derivation is somewhat more complex, since <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> appears twice in the equation. Here's the equation again with all the matrices and vectors fully laid out (note that I've already done the transposes):</p> <img alt="$Q(\theta)=(\theta_1...\theta_n)\begin{pmatrix}x_1_1 &amp;amp; x_2_1 &amp;amp; ... &amp;amp; x_m_1\\ x_1_2 &amp;amp; ... &amp;amp; ... &amp;amp; x_m_2\\ ...\\ x_1_n &amp;amp; ... &amp;amp; ... &amp;amp; x_m_n\\ \end{pmatrix}\begin{pmatrix}x_1_1 &amp;amp; x_1_2 &amp;amp; ... &amp;amp; x_1_n\\ x_2_1 &amp;amp; ... &amp;amp; ... &amp;amp; x_2_n\\ ...\\ x_m_1 &amp;amp; ... &amp;amp; ... &amp;amp; x_m_n\\ \end{pmatrix}\begin{pmatrix} \theta_1\\ \theta_2\\ ...\\ \theta_n\\ \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/b3f9b4ffe1853d6610f9814fc820d1a71825a06e.png" style="height: 87px;" /> <p>I'll just multiply the two matrices in the middle together. The result is an &quot;<strong>X</strong> squared&quot; matrix, which is n-by-n. 
The element in row <em>r</em> and column <em>c</em> of this square matrix is:</p> <img alt="$\sum_{i=1}^{m}x_i_rx_i_c$" class="align-center" src="https://eli.thegreenplace.net/images/math/f8628d68855e03195fb4fd01806c8655beaf7b30.png" style="height: 50px;" /> <p>Note that &quot;<strong>X</strong> squared&quot; is a symmetric matrix (this fact will be important later on). For simplicity of notation, we'll call its elements <img alt="X^{2}_{rc}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/c565201908a5c75f62849e7c1634b65e0930824c.png" style="height: 19px;" />. Multiplying by the <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> vector on the right we get:</p> <img alt="$Q(\theta)=(\theta_1...\theta_n)\begin{pmatrix}X^{2}_{11}\theta_1+...+X^{2}_{1n}\theta_n\\[1em] X^{2}_{21}\theta_1+...+X^{2}_{2n}\theta_n\\ ...\\ X^{2}_{n1}\theta_1+...+X^{2}_{nn}\theta_n\end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/5821cf256f6cf6debbdac48d6e9bbe698baa0a11.png" style="height: 107px;" /> <p>And left-multiplying by <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> to get the fully unwrapped formula for <em>Q</em>:</p> <img alt="$Q(\theta)=\theta_1(X^{2}_{11}\theta_1+...+X^{2}_{1n}\theta_n)+\theta_2(X^{2}_{21}\theta_1+...+X^{2}_{2n}\theta_n)+...+\theta_n(X^{2}_{n1}\theta_1+...+X^{2}_{nn}\theta_n)$" class="align-center" src="https://eli.thegreenplace.net/images/math/0451f9fa7c61ff3a61be8c1836c15667cd916330.png" style="height: 22px;" /> <p>Once again, it's now time to compute the derivatives. 
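Before doing so, both claims about &quot;X squared&quot; - its element formula and its symmetry - are easy to confirm on a small random matrix (a sketch with arbitrary sizes and data):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3
X = rng.standard_normal((m, n))
X2 = X.T @ X   # the n-by-n "X squared" matrix

# Element (r, c) equals sum over i of x_ir * x_ic ...
for r in range(n):
    for c in range(n):
        assert np.isclose(X2[r, c], sum(X[i, r] * X[i, c] for i in range(m)))

# ... and the matrix is symmetric.
assert np.allclose(X2, X2.T)
```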
Let's focus on <img alt="\frac{\partial Q}{\partial \theta_1}" class="valign-m9" src="https://eli.thegreenplace.net/images/math/5161830b1f644a3c2d1a650ccd6405e0fe5940aa.png" style="height: 27px;" />, from which we can infer the rest:</p> <img alt="$\frac{\partial Q}{\partial \theta_1}=(2\theta_1X^{2}_{11}+\theta_2X^{2}_{12}+...+\theta_nX^{2}_{1n})+\theta_2X^{2}_{21}+...+\theta_nX^{2}_{n1}$" class="align-center" src="https://eli.thegreenplace.net/images/math/f99e5e7024b4d13b0a767b98653b6ccc22fa1abd.png" style="height: 41px;" /> <p>Using the fact that <strong>X</strong> squared is symmetric, we know that <img alt="X^{2}_{12}=X^{2}_{21}" class="valign-m6" src="https://eli.thegreenplace.net/images/math/c14595d1000ad9a8da5be7f37da801eadfdfb698.png" style="height: 21px;" /> and so on. Therefore:</p> <img alt="$\frac{\partial Q}{\partial \theta_1}=2\theta_1X^{2}_{11}+2\theta_2X^{2}_{12}+...+2\theta_nX^{2}_{1n}$" class="align-center" src="https://eli.thegreenplace.net/images/math/832b294f472a23e500616db08d9d6832770af6a3.png" style="height: 40px;" /> <p>The partial derivatives by other <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> components are similar. Collecting the sequence of partial derivatives back into a vector equation, we get:</p> <img alt="$\frac{\partial Q}{\partial \theta}=2X^2\theta=2X^TX\theta$" class="align-center" src="https://eli.thegreenplace.net/images/math/541124d49fa78dcf92a15b14643b2ebc4187eaaf.png" style="height: 38px;" /> <p>Now back to <em>J</em>. 
Recall that for convenience we broke <em>J</em> up into three parts: <em>P</em>, <em>Q</em> and <img alt="y^Ty" class="valign-m4" src="https://eli.thegreenplace.net/images/math/81015d6225923cec985bef47ca151ef1cb654c92.png" style="height: 19px;" />; the latter doesn't depend on <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> so it doesn't play a role in the derivative. Collecting our results from this post, we then get:</p> <img alt="$\frac{\partial J}{\partial \theta}=\frac{\partial Q}{\partial \theta}-\frac{\partial P}{\partial \theta}=2X^TX\theta-2X^Ty$" class="align-center" src="https://eli.thegreenplace.net/images/math/9c3d0d108ada3bfc7290c2328c8e6171bc01d7de.png" style="height: 38px;" /> <p>Which is exactly the equation we were expecting to see.</p> <p>To conclude - if matrix calculus feels handwavy, it's because its identities are less familiar. In a sense, it's handwavy in the same way <img alt="\frac{dx^2}{dx}=2x" class="valign-m6" src="https://eli.thegreenplace.net/images/math/5fa725ae5b10a9249e9480d595770cf34accf533.png" style="height: 24px;" /> is handwavy. We remember the identity so we don't have to recalculate it every time from first principles. Once you get some experience with matrix calculus, parts of equations start looking familiar and you no longer need to engage in the long and tiresome computations demonstrated here. It's perfectly fine to just remember that the derivative of <img alt="\theta^TX\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/7616542d90e084c74423b2a9d93b7a3a6cadcd00.png" style="height: 15px;" /> with a symmetric <strong>X</strong> is <img alt="2X\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/7fa6bcc17eae56f6f3f4a6fdcadae3cb3ee2c5d7.png" style="height: 12px;" />. 
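The full gradient can also be checked end to end: compare a finite-difference gradient of J against the closed form, and verify that the gradient vanishes at the normal-equation solution. A NumPy sketch on random made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# J(theta) = theta^T X^T X theta - 2 (X theta)^T y + y^T y
J = lambda th: th @ X.T @ X @ th - 2 * (X @ th) @ y + y @ y

th0 = rng.standard_normal(n)
eps = 1e-6
I = np.eye(n)
num_grad = np.array([(J(th0 + eps * I[i]) - J(th0 - eps * I[i])) / (2 * eps)
                     for i in range(n)])
analytic = 2 * X.T @ X @ th0 - 2 * X.T @ y
assert np.allclose(num_grad, analytic, atol=1e-4)

# At the normal-equation solution theta* = (X^T X)^{-1} X^T y,
# the gradient is zero, as expected at the minimum.
th_star = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(2 * X.T @ X @ th_star - 2 * X.T @ y, 0, atol=1e-8)
```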
See the &quot;identities&quot; section of the <a class="reference external" href="http://en.wikipedia.org/wiki/Matrix_calculus">wikipedia article on matrix calculus</a> for many more examples.</p> <hr class="docutils" /> <table class="docutils footnote" frame="void" id="id4" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>A few words on notation: by default, a vector <strong>v</strong> is a <em>column</em> vector. To get its row version, we transpose it. Moreover, in the vector derivative equations that follow I'm using <a class="reference external" href="http://en.wikipedia.org/wiki/Matrix_calculus#Layout_conventions">denominator layout notation</a>. This is not super-important though; as the Wikipedia article suggests, many mathematical papers and writings aren't consistent about this and it's perfectly possible to understand the derivations regardless.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id5" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>Yes, this is exactly like computing a gradient of a multivariate function.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id6" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>Take a minute to convince yourself that the dimensions of this equation work out and the result is a scalar.</td></tr> </tbody> </table> Visualizing matrix multiplication as a linear combination2015-03-19T06:03:00-07:002015-03-19T06:03:00-07:00Eli Benderskytag:eli.thegreenplace.net,2015-03-19:/2015/visualizing-matrix-multiplication-as-a-linear-combination/<p>When multiplying two matrices, there's a manual procedure we all know how to go through. 
Each result cell is computed separately as the dot-product of a row in the first matrix with a column in the second matrix. While it's the easiest way to compute the result manually, it may obscure a very interesting property of the operation: <em>multiplying A by B is the linear combination of A's columns using coefficients from B</em>. Another way to look at it is that it's a <em>linear combination of the rows of B using coefficients from A</em>.</p> <p>In this quick post I want to show a colorful visualization that will make this easier to grasp.</p> <div class="section" id="right-multiplication-combination-of-columns"> <h2>Right-multiplication: combination of columns</h2> <p>Let's begin by looking at the right-multiplication of matrix <tt class="docutils literal">X</tt> by a column vector:</p> <img alt="$\begin{pmatrix} x_1 &amp;amp; y_1 &amp;amp; z_1 \\ x_2 &amp;amp; y_2 &amp;amp; z_2 \\ x_3 &amp;amp; y_3 &amp;amp; z_3 \\ \end{pmatrix}* \begin{pmatrix} a \\ b \\ c \\ \end{pmatrix}= \begin{pmatrix} ax_1+by_1+cz_1 \\ ax_2+by_2+cz_2 \\ ax_3+by_3+cz_3 \\ \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/ba570f74b122c0a20e4488a052b25fbda160c138.png" style="height: 65px;" /> <p>Representing the columns of <tt class="docutils literal">X</tt> by colorful boxes will help visualize this:</p> <img alt="Matrix by vector" class="align-center" src="https://eli.thegreenplace.net/images/2015/veccomb.png" /> <p>Sticking the white box with <tt class="docutils literal">a</tt> in it to a vector just means: multiply this vector by the scalar <tt class="docutils literal">a</tt>. 
The result is another column vector - a linear combination of <tt class="docutils literal">X</tt>'s columns, with <tt class="docutils literal">a</tt>, <tt class="docutils literal">b</tt>, <tt class="docutils literal">c</tt> as the coefficients.</p> <p>Right-multiplying <tt class="docutils literal">X</tt> by a matrix is more of the same. Each resulting column is a different linear combination of <tt class="docutils literal">X</tt>'s columns:</p> <img alt="$\begin{pmatrix} {\color{Red} x_1} &amp;amp; y_1 &amp;amp; z_1 \\ x_2 &amp;amp; y_2 &amp;amp; z_2 \\ x_3 &amp;amp; y_3 &amp;amp; z_3 \\ \end{pmatrix}* \begin{pmatrix} a &amp;amp; d &amp;amp; g \\ b &amp;amp; e &amp;amp; h \\ c &amp;amp; f &amp;amp; i \\ \end{pmatrix}= \begin{pmatrix} ax_1+by_1+cz_1 &amp;amp; dx_1+ey_1+fz_1 &amp;amp; gx_1+hy_1+iz_1 \\ ax_2+by_2+cz_2 &amp;amp; dx_2+ey_2+fz_2 &amp;amp; gx_2+hy_2+iz_2 \\ ax_3+by_3+cz_3 &amp;amp; dx_3+ey_3+fz_3 &amp;amp; gx_3+hy_3+iz_3 \\ \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/d6065791babbc5b967c06b57322711424097c83c.png" style="height: 65px;" /> <p>Graphically:</p> <img alt="Matrix by matrix" class="align-center" src="https://eli.thegreenplace.net/images/2015/matcomb.png" /> <p>If you look hard at the equation above and squint a bit, you can recognize this column-combination property by examining each column of the result matrix.</p> </div> <div class="section" id="left-multiplication-combination-of-rows"> <h2>Left-multiplication: combination of rows</h2> <p>Now let's examine left-multiplication. 
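Both combination views are easy to confirm in NumPy on small made-up matrices - columns for right-multiplication, rows for left-multiplication:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.integers(-5, 6, size=(3, 3)).astype(float)
B = rng.integers(-5, 6, size=(3, 3)).astype(float)

# Right-multiplication: column j of X @ B is a linear combination of
# X's columns, with coefficients taken from column j of B.
right = X @ B
for j in range(3):
    combo = sum(B[k, j] * X[:, k] for k in range(3))
    assert np.allclose(right[:, j], combo)

# Left-multiplication: row i of B @ X is a linear combination of
# X's rows, with coefficients taken from row i of B.
left = B @ X
for i in range(3):
    combo = sum(B[i, k] * X[k, :] for k in range(3))
    assert np.allclose(left[i, :], combo)
```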
Left-multiplying a matrix <tt class="docutils literal">X</tt> by a row vector is a linear combination of <tt class="docutils literal">X</tt>'s <em>rows</em>:</p> <img alt="$\begin{pmatrix} a &amp;amp; b &amp;amp; c \end{pmatrix}* \begin{pmatrix} x_1 &amp;amp; y_1 &amp;amp; z_1 \\ x_2 &amp;amp; y_2 &amp;amp; z_2 \\ x_3 &amp;amp; y_3 &amp;amp; z_3 \\ \end{pmatrix}= \begin{pmatrix} ax_1+bx_2+cx_3 &amp;amp; ay_1+by_2+cy_3 &amp;amp; az_1+bz_2+cz_3 \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/a36b019264582a035d8df4dc13158854f3477efe.png" style="height: 65px;" /> <p>Is represented graphically thus:</p> <img alt="Vector by matrix" class="align-center" src="https://eli.thegreenplace.net/images/2015/vecrowcomb.png" /> <p>And left-multiplying by a matrix is the same thing repeated for every result row: it becomes the linear combination of the rows of <tt class="docutils literal">X</tt>, with the coefficients taken from the rows of the matrix on the left. Here's the equation form:</p> <img alt="$\begin{pmatrix} a &amp;amp; b &amp;amp; c \\ d &amp;amp; e &amp;amp; f \\ g &amp;amp; h &amp;amp; i \\ \end{pmatrix}* \begin{pmatrix} x_1 &amp;amp; y_1 &amp;amp; z_1 \\ x_2 &amp;amp; y_2 &amp;amp; z_2 \\ x_3 &amp;amp; y_3 &amp;amp; z_3 \\ \end{pmatrix}= \begin{pmatrix} ax_1+bx_2+cx_3 &amp;amp; ay_1+by_2+cy_3 &amp;amp; az_1+bz_2+cz_3 \\ dx_1+ex_2+fx_3 &amp;amp; dy_1+ey_2+fy_3 &amp;amp; dz_1+ez_2+fz_3 \\ gx_1+hx_2+ix_3 &amp;amp; gy_1+hy_2+iy_3 &amp;amp; gz_1+hz_2+iz_3 \\ \end{pmatrix}$" class="align-center" src="https://eli.thegreenplace.net/images/math/35d9e54624bf17576372da3bf144dd4659b225e1.png" style="height: 65px;" /> <p>And the graphical form:</p> <img alt="Matrix by matrix from the left" class="align-center" src="https://eli.thegreenplace.net/images/2015/matrowcomb.png" /> </div> Meshgrids and disambiguating rows and columns from Cartesian coordinates2014-12-28T07:23:00-08:002014-12-28T07:23:00-08:00Eli 
Benderskytag:eli.thegreenplace.net,2014-12-28:/2014/meshgrids-and-disambiguating-rows-and-columns-from-cartesian-coordinates/<p>When plotting 3D graphs, a common source of confusion in Numpy and Matplotlib (and, by extension, I'd assume in Matlab as well) is how to reconcile between matrices that are indexed with rows and columns, and Cartesian coordinates.</p> <p>Let's use the function <img alt="z = f(x,y) = 4x^2+y^2" class="valign-m4" src="https://eli.thegreenplace.net/images/math/2b7a30bd2a6116249aaa03a1546780b31973ac68.png" style="height: 19px;" /> as an example. 
Here's its 3D plot, courtesy <a class="reference external" href="https://www.google.com/search?client=ubuntu&amp;channel=fs&amp;q=plot+2x^2+%2B+y^2&amp;ie=utf-8">of Google</a>:</p> <img alt="3D plot" class="align-center" src="https://eli.thegreenplace.net/images/2014/funcplot-google.png" /> <p>Now let's use Numpy and Matplotlib to make a contour plot of this function.</p> <div class="highlight"><pre><span></span><span class="n">xx</span> <span class="o">=</span> <span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">20</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span> <span class="n">yy</span> <span class="o">=</span> <span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">20</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span> <span class="n">Z</span> <span class="o">=</span> <span class="n">zeros</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">xx</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">yy</span><span class="p">)))</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">xx</span><span class="p">)):</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">yy</span><span class="p">)):</span> <span class="n">Z</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mi">4</span><span class="o">*</span><span 
class="n">xx</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">yy</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="o">**</span><span class="mi">2</span> </pre></div> <p>If the creation of <tt class="docutils literal">Z</tt> in the above looks fishy to you, you're right, and we'll get to it shortly. However, note that this is a vastly simplified demonstration - often <tt class="docutils literal">Z</tt> is created behind the scenes by a more complex computation.</p> <p>Finally, plotting the contour:</p> <div class="highlight"><pre><span></span><span class="n">contour</span><span class="p">(</span><span class="n">xx</span><span class="p">,</span> <span class="n">yy</span><span class="p">,</span> <span class="n">Z</span><span class="p">)</span> <span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;x&#39;</span><span class="p">);</span> <span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;y&#39;</span><span class="p">)</span> </pre></div> <p>We get:</p> <img alt="Contour plot" class="align-center" src="https://eli.thegreenplace.net/images/2014/contour-rowcol.png" /> <p>This plot doesn't look right. In the function we're plotting, the contour lines should be stretched in the <tt class="docutils literal">y</tt> direction, not the <tt class="docutils literal">x</tt> direction (this is obvious both from the formula for <tt class="docutils literal">z</tt> and from the 3D plot shown above). What's going on?</p> <p>This is a simple demonstration of a very common problem many people run into when plotting a matrix as a 3D scalar field (a scalar value for each <tt class="docutils literal">x, y</tt> coordinate). 
While we're used to thinking about <tt class="docutils literal">x</tt> as the &quot;first&quot; coordinate and <tt class="docutils literal">y</tt> as the &quot;second&quot;, in the way Numpy represents matrices this is exactly the opposite. Here's a simple matrix:</p> <div class="highlight"><pre><span></span>array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) </pre></div> <p>Imagine we'd like to plot it. Indexing into a matrix goes by <tt class="docutils literal">[row, col]</tt>, where <tt class="docutils literal">row</tt> counts from top-to-bottom, and <tt class="docutils literal">col</tt> counts from left-to-right. Now, if you just look at the matrix and visually interpose the Cartesian coordinate system on top, top-to-bottom is <tt class="docutils literal">y</tt> and left-to-right is <tt class="docutils literal">x</tt>. In other words, the indexing order is reversed.</p> <p>Here's a visualization that should make it clear:</p> <img alt="XY vs. row, col" class="align-center" src="https://eli.thegreenplace.net/images/2014/xy-rowcol.png" /> <p>There's a very simple solution to this problem - use a transpose. Plotting:</p> <div class="highlight"><pre><span></span><span class="n">contour</span><span class="p">(</span><span class="n">xx</span><span class="p">,</span> <span class="n">yy</span><span class="p">,</span> <span class="n">Z</span><span class="o">.</span><span class="n">T</span><span class="p">)</span> <span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;x&#39;</span><span class="p">);</span> <span class="n">ylabel</span><span class="p">(</span><span class="s1">&#39;y&#39;</span><span class="p">)</span> </pre></div> <p>Gives us the expected:</p> <img alt="Contour plot" class="align-center" src="https://eli.thegreenplace.net/images/2014/contour-xy.png" /> <p>A matrix transpose exchanges between rows and columns. 
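The flip is easy to see concretely on the 3-by-3 array above: the entry at Cartesian position x=2, y=0 (top row, third from the left) is indexed with y first and x second, and transposing swaps the two roles:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# "x" runs left-to-right (columns) and "y" top-to-bottom (rows), so the
# Cartesian point (x=2, y=0) lives at arr[row=0, col=2].
assert arr[0, 2] == 3

# A transpose swaps the indices, so arr.T can be indexed as [x, y]:
assert arr.T[2, 0] == arr[0, 2]
```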
It makes the original matrix's rows count from left-to-right and columns from top-to-bottom, matching the Cartesian coordinate system.</p> <p>Is a transpose always required, then? Not at all. As I mentioned above, the computation of <tt class="docutils literal">Z</tt> wasn't entirely correct, because matrix indices were conflated with Cartesian coordinates. In the double loop shown above it would be more correct to assign <tt class="docutils literal">Z[j, i]</tt>, and in general it's usually recommended to be explicit about <tt class="docutils literal">row, col</tt> or <tt class="docutils literal">x, y</tt> - as the <tt class="docutils literal">i, j</tt> pair is somewhat ambiguous. That said, we don't always easily control the creation of <tt class="docutils literal">Z</tt>, so the transpose is occasionally useful when the data we got is in the wrong order.</p> <div class="section" id="meshgrids"> <h2>Meshgrids</h2> <p>IMHO, by trying to be helpful, the <tt class="docutils literal">contour</tt> API helps spread the confusion. It does so by not enforcing <tt class="docutils literal">x</tt> and <tt class="docutils literal">y</tt> to be 2D data arrays, like all the 3D plotting routines do. It's better to be explicit and require a meshgrid.</p> <p>So what is a meshgrid? <tt class="docutils literal">meshgrid</tt> is a Numpy function that turns vectors such as <tt class="docutils literal">xx</tt> and <tt class="docutils literal">yy</tt> above into coordinate matrices. The idea is simple: when doing multi-dimensional operations (like 3D plotting), it's better to be very explicit about what maps to what. What we really want is three matrices: <tt class="docutils literal">X</tt>, <tt class="docutils literal">Y</tt> and <tt class="docutils literal">Z</tt>, where <tt class="docutils literal">Z[i, j]</tt> is the value of the function in question for <tt class="docutils literal">X[i, j]</tt> and <tt class="docutils literal">Y[i, j]</tt>. 
But more often than not, we don't have <tt class="docutils literal">X</tt> and <tt class="docutils literal">Y</tt> in this form. Rather, we just have vectors with their values running along the axes. This is what <tt class="docutils literal">meshgrid</tt> is for. Here's a simple demonstration (taken from an IPython terminal):</p> <div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">218</span><span class="p">]:</span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span> <span class="n">In</span> <span class="p">[</span><span class="mi">219</span><span class="p">]:</span> <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span> <span class="n">In</span> <span class="p">[</span><span class="mi">220</span><span class="p">]:</span> <span class="n">X</span><span class="p">,</span> <span class="n">Y</span> <span class="o">=</span> <span class="n">meshgrid</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="n">In</span> <span class="p">[</span><span class="mi">221</span><span class="p">]:</span> <span class="n">X</span> <span class="n">Out</span><span class="p">[</span><span class="mi">221</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span 
class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]])</span> <span class="n">In</span> <span class="p">[</span><span class="mi">222</span><span class="p">]:</span> <span class="n">Y</span> <span class="n">Out</span><span class="p">[</span><span class="mi">222</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">],</span> <span class="p">[</span><span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> </pre></div> <p>The <tt class="docutils literal">X</tt> and <tt class="docutils literal">Y</tt> matrices may appear strange at first sight, but looking more closely reveals that they're exactly the coordinate matrices we need; in tandem, they run over all the 9 pairs needed to map from the original <tt class="docutils literal">x</tt> and <tt class="docutils literal">y</tt> vectors. The values in <tt class="docutils literal">X</tt> increase from left to right; the values in <tt class="docutils literal">Y</tt> increase from top to bottom - the way it should be.</p> <p>And the best part about <tt class="docutils literal">meshgrid</tt> is that it enables vectorized computations, just the way we like them in Numpy. 
So the function we originally created can now be computed and plotted correctly without any loops:</p> <div class="highlight"><pre><span></span><span class="n">xx</span> <span class="o">=</span> <span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">20</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span> <span class="n">yy</span> <span class="o">=</span> <span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">20</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span> <span class="n">X</span><span class="p">,</span> <span class="n">Y</span> <span class="o">=</span> <span class="n">meshgrid</span><span class="p">(</span><span class="n">xx</span><span class="p">,</span> <span class="n">yy</span><span class="p">)</span> <span class="n">Z</span> <span class="o">=</span> <span class="mi">4</span><span class="o">*</span><span class="n">X</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">Y</span><span class="o">**</span><span class="mi">2</span> <span class="n">contour</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">,</span> <span class="n">Z</span><span class="p">)</span> </pre></div> <p>Produces the correct plot.</p> <p>Finally, what if we do get <tt class="docutils literal">Z</tt> from somewhere else, computed using matrix indexing rather than Cartesian indexing? Plotting its transpose is one alternative, but there's a better one.
We can create a meshgrid, using its <tt class="docutils literal">indexing</tt> keyword argument, like this:</p> <div class="highlight"><pre><span></span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span> <span class="o">=</span> <span class="n">meshgrid</span><span class="p">(</span><span class="n">xx</span><span class="p">,</span> <span class="n">yy</span><span class="p">,</span> <span class="n">indexing</span><span class="o">=</span><span class="s1">&#39;ij&#39;</span><span class="p">)</span> </pre></div> <p>This tells <tt class="docutils literal">meshgrid</tt> that we're going to plot a function computed using <tt class="docutils literal">row, col</tt>, rather than <tt class="docutils literal">x, y</tt> order, and it will flip the rows and columns accordingly.</p> </div> Derivation of the Normal Equation for linear regression2014-12-22T20:50:00-08:002014-12-22T20:50:00-08:00Eli Benderskytag:eli.thegreenplace.net,2014-12-22:/2014/derivation-of-the-normal-equation-for-linear-regression/<p>I was going through the Coursera &quot;Machine Learning&quot; course, and in the section on multivariate linear regression something caught my eye. Andrew Ng presented the <a class="reference external" href="http://en.wikipedia.org/w/index.php?title=Normal_equation&amp;redirect=no">Normal Equation</a> as an analytical solution to the linear regression problem with a least-squares cost function. He mentioned that in some cases (such as for …</p><p>I was going through the Coursera &quot;Machine Learning&quot; course, and in the section on multivariate linear regression something caught my eye. Andrew Ng presented the <a class="reference external" href="http://en.wikipedia.org/w/index.php?title=Normal_equation&amp;redirect=no">Normal Equation</a> as an analytical solution to the linear regression problem with a least-squares cost function. 
He mentioned that in some cases (such as for small feature sets) using it is more effective than applying gradient descent; unfortunately, he left its derivation out.</p> <p>Here I want to show how the normal equation is derived.</p> <p>First, some terminology. The following symbols are compatible with the machine learning course, not with the exposition of the normal equation on Wikipedia and other sites - semantically it's all the same, just the symbols are different.</p> <p>Given the hypothesis function:</p> <img alt="$h_{\theta}(x)=\theta_0x_0+\theta_1x_1+\cdots+\theta_nx_n$" class="align-center" src="https://eli.thegreenplace.net/images/math/dd8fad9bf111e83d47252d51dd037a6c6c3136aa.png" style="height: 18px;" /> <p>We'd like to minimize the least-squares cost:</p> <img alt="$J(\theta_{0...n})=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2$" class="align-center" src="https://eli.thegreenplace.net/images/math/c1abe0768f4deb31ed97f37d760236c94439a780.png" style="height: 50px;" /> <p>Where <img alt="x^{(i)}" class="valign-0" src="https://eli.thegreenplace.net/images/math/233014006c0adbee71ec71ba3a70f22ad1b906a1.png" style="height: 17px;" /> is the <tt class="docutils literal">i</tt>-th sample (from a set of <tt class="docutils literal">m</tt> samples) and <img alt="y^{(i)}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/d34414117d493106f731939df6bb7f1762365d3f.png" style="height: 21px;" /> is the <tt class="docutils literal">i</tt>-th expected result.</p> <p>To proceed, we'll represent the problem in matrix notation; this is natural, since we essentially have a system of linear equations here. 
The regression coefficients <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> we're looking for are the vector:</p> <img alt="$\begin{pmatrix} \theta_0\\ \theta_1\\ ...\\ \theta_n \end{pmatrix}\in\mathbb{R}^{n+1}$" class="align-center" src="https://eli.thegreenplace.net/images/math/b16fd3d2b3041f13cb70199837a7c02c756078c7.png" style="height: 86px;" /> <p>Each of the <tt class="docutils literal">m</tt> input samples is similarly a column vector with <tt class="docutils literal">n+1</tt> rows, <img alt="x_0" class="valign-m3" src="https://eli.thegreenplace.net/images/math/efbda784ad565c1c5201fdc948a570d0426bc6e6.png" style="height: 11px;" /> being 1 for convenience. So we can now rewrite the hypothesis function as:</p> <img alt="$h_{\theta}(x)=\theta^Tx$" class="align-center" src="https://eli.thegreenplace.net/images/math/be661047c89f6a48c7bc563b81207949c251de6a.png" style="height: 21px;" /> <p>When this is summed over all samples, we can dip further into matrix notation. We'll define the &quot;design matrix&quot; <tt class="docutils literal">X</tt> (uppercase X) as a matrix of <tt class="docutils literal">m</tt> rows, in which each row is the <tt class="docutils literal">i</tt>-th sample (the vector <img alt="x^{(i)}" class="valign-0" src="https://eli.thegreenplace.net/images/math/233014006c0adbee71ec71ba3a70f22ad1b906a1.png" style="height: 17px;" />). With this, we can rewrite the least-squares cost as follows, replacing the explicit sum by matrix multiplication:</p> <img alt="$J(\theta)=\frac{1}{2m}(X\theta-y)^T(X\theta-y)$" class="align-center" src="https://eli.thegreenplace.net/images/math/db5e3da78e25c18c8fc88f1291c1ac13a2645388.png" style="height: 36px;" /> <p>Now, using some matrix transpose identities, we can simplify this a bit.
I'll throw the <img alt="\frac{1}{2m}" class="valign-m6" src="https://eli.thegreenplace.net/images/math/7a2a3f6dba54b64f0e88e18c40e0f68c523713ea.png" style="height: 22px;" /> part away since we're going to compare a derivative to zero anyway:</p> <img alt="$J(\theta)=((X\theta)^T-y^T)(X\theta-y)$" class="align-center" src="https://eli.thegreenplace.net/images/math/c1368de1a0634c3fbeb92d67f368f253943d089f.png" style="height: 21px;" /> <img alt="$J(\theta)=(X\theta)^TX\theta-(X\theta)^Ty-y^T(X\theta)+y^Ty$" class="align-center" src="https://eli.thegreenplace.net/images/math/e41fc822adccf1f865b02100f5671e265e7b30bc.png" style="height: 21px;" /> <p>Note that <img alt="X\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/52f2de6065bdc187b876c5696041f3c716c446f5.png" style="height: 12px;" /> is a vector, and so is <tt class="docutils literal">y</tt>; multiplying one by the other yields a scalar, and a scalar is equal to its own transpose. Therefore the order of the multiplication doesn't matter (as long as the dimensions work out), and we can further simplify:</p> <img alt="$J(\theta)=\theta^TX^TX\theta-2(X\theta)^Ty+y^Ty$" class="align-center" src="https://eli.thegreenplace.net/images/math/2864b88546c007a79dc92271f5e01487ba608e43.png" style="height: 21px;" /> <p>Recall that here <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> is our unknown. To find where the above function has a minimum, we will differentiate with respect to <img alt="\theta" class="valign-0" src="https://eli.thegreenplace.net/images/math/cb005d76f9f2e394a770c2562c2e150a413b3216.png" style="height: 12px;" /> and compare the result to 0. Differentiating with respect to a vector may feel uncomfortable, but there's nothing to worry about. Recall that here we only use matrix notation to conveniently represent a system of linear formulae. So we differentiate with respect to each component of the vector, and then combine the resulting derivatives into a vector again.
The result is:</p> <img alt="$\frac{\partial J}{\partial \theta}=2X^TX\theta-2X^{T}y=0$" class="align-center" src="https://eli.thegreenplace.net/images/math/9b142c00e031c9db7f575b0542e86261732a4689.png" style="height: 38px;" /> <p>Or:</p> <img alt="$X^TX\theta=X^{T}y$" class="align-center" src="https://eli.thegreenplace.net/images/math/ab453f9f1f7bd4b1d646b9712fbe0b2fbe01740f.png" style="height: 21px;" /> <p>Now, assuming that the matrix <img alt="X^TX" class="valign-0" src="https://eli.thegreenplace.net/images/math/5c817c84ec1f83b23494df6125edd091a7c413dd.png" style="height: 15px;" /> is invertible, we can multiply both sides by <img alt="(X^TX)^{-1}" class="valign-m4" src="https://eli.thegreenplace.net/images/math/57f592cee6ceac659262d97e61c64f9ca405d7f1.png" style="height: 19px;" /> and get:</p> <img alt="$\theta=(X^TX)^{-1}X^Ty$" class="align-center" src="https://eli.thegreenplace.net/images/math/20baabd9d33dcd26003bc44c7d81ba39e1ad4caa.png" style="height: 21px;" /> <p>Which is the normal equation.</p> <p>[<strong>Update 27-May-2015</strong>: I've written <a class="reference external" href="http://eli.thegreenplace.net/2015/the-normal-equation-and-matrix-calculus/">another post</a> that explains in more detail how these derivatives are computed.]</p> Horner's rule: efficient evaluation of polynomials2010-03-30T15:10:32-07:002010-03-30T15:10:32-07:00Eli Benderskytag:eli.thegreenplace.net,2010-03-30:/2010/03/30/horners-rule-efficient-evaluation-of-polynomials <p>Here's a general degree-n polynomial:</p> <p><img src="https://eli.thegreenplace.net/images/math/79d0e193d7bd5ba889f5992beece98ca4ce715f8.gif" /></p> <p>To evaluate such a polynomial using a computer program, several approaches can be employed.</p> <p>The simplest, naive method is to compute each term of the polynomial separately and then add them up. 
Here's the Python code for it:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">poly_naive</span>(A, x): p = <span style="color: #007f7f">0</span> <span style="color: #00007f; font-weight: bold">for …</span></pre></div> <p>Here's a general degree-n polynomial:</p> <p><img src="https://eli.thegreenplace.net/images/math/79d0e193d7bd5ba889f5992beece98ca4ce715f8.gif" /></p> <p>To evaluate such a polynomial using a computer program, several approaches can be employed.</p> <p>The simplest, naive method is to compute each term of the polynomial separately and then add them up. Here's the Python code for it:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">poly_naive</span>(A, x): p = <span style="color: #007f7f">0</span> <span style="color: #00007f; font-weight: bold">for</span> i, a <span style="color: #0000aa">in</span> <span style="color: #00007f">enumerate</span>(A): p += (x ** i) * a <span style="color: #00007f; font-weight: bold">return</span> p </pre></div> <p><tt class="docutils literal"><span class="pre">A</span></tt> is an array of coefficients, lowest first, <img src="https://eli.thegreenplace.net/images/math/4a5997da73aadd118038761e69d01e24586bf958.gif" /> until <img src="https://eli.thegreenplace.net/images/math/278ab95d3a54aae8eaa25c34af66d93a19b5e75f.gif" />.</p> <p>This method is quite inefficient. 
It requires <tt class="docutils literal"><span class="pre">n</span></tt> additions (since there are <tt class="docutils literal"><span class="pre">n+1</span></tt> terms to be added) and <img src="https://eli.thegreenplace.net/images/math/73b6f7da8c4582390c7323a29770ab2e8cb7fb64.gif" /> multiplications.</p> <div class="section" id="iterative-method"> <h3>Iterative method</h3> <p>It's obvious that a lot of repetitive computation is being done by raising <tt class="docutils literal"><span class="pre">x</span></tt> to successive powers. We can make things much more efficient by simply keeping the previous power of <tt class="docutils literal"><span class="pre">x</span></tt> between iterations. This is the &quot;iterative method&quot;:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">poly_iter</span>(A, x): p = <span style="color: #007f7f">0</span> xn = <span style="color: #007f7f">1</span> <span style="color: #00007f; font-weight: bold">for</span> a <span style="color: #0000aa">in</span> A: p += xn * a xn *= x <span style="color: #00007f; font-weight: bold">return</span> p </pre></div> <p>In this code <tt class="docutils literal"><span class="pre">xn</span></tt> is the current power of <tt class="docutils literal"><span class="pre">x</span></tt>. We don't need to raise <tt class="docutils literal"><span class="pre">x</span></tt> to a power on each iteration of the loop; a single multiplication suffices. It's easy to see that there are <tt class="docutils literal"><span class="pre">2n</span></tt> multiplications and <tt class="docutils literal"><span class="pre">n</span></tt> additions for each computation. The algorithm is now linear instead of quadratic.</p> </div> <div class="section" id="horner-s-rule"> <h3>Horner's rule</h3> <p>It can be further improved, however.
Take a look at this polynomial:</p> <p><img src="https://eli.thegreenplace.net/images/math/03e98fbb410ca88f96c6124bd2fa98a88ed56d25.gif" /></p> <p>It can be rewritten as follows:</p> <p><img src="https://eli.thegreenplace.net/images/math/9a469b0cc8b4304d230b677c9f5c26129d1b73fe.gif" /></p> <p>And in general, we can always rewrite the polynomial:</p> <p><img src="https://eli.thegreenplace.net/images/math/3773c72b0bca68c7d911452088f2b9f459802b78.gif" /></p> <p>As:</p> <p><img src="https://eli.thegreenplace.net/images/math/2c6d90599184f76993d6474a226b8c03e8e7c475.gif" /></p> <p>This rearrangement is usually called &quot;Horner's rule&quot;. We can write the code to implement it as follows:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">poly_horner</span>(A, x): p = A[-<span style="color: #007f7f">1</span>] i = <span style="color: #00007f">len</span>(A) - <span style="color: #007f7f">2</span> <span style="color: #00007f; font-weight: bold">while</span> i &gt;= <span style="color: #007f7f">0</span>: p = p * x + A[i] i -= <span style="color: #007f7f">1</span> <span style="color: #00007f; font-weight: bold">return</span> p </pre></div> <p>Here we start by assigning <img src="https://eli.thegreenplace.net/images/math/278ab95d3a54aae8eaa25c34af66d93a19b5e75f.gif" /> to <cite>p</cite> and then successively multiplying by <cite>x</cite> and adding the next coefficient. 
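</p> <p>As a quick sanity check of <tt class="docutils literal">poly_horner</tt> (a self-contained sketch; the function is repeated verbatim and the expected value is worked out by hand):</p>

```python
def poly_horner(A, x):
    # Coefficients in A are ordered lowest power first, as in the other examples.
    p = A[-1]
    i = len(A) - 2
    while i >= 0:
        p = p * x + A[i]
        i -= 1
    return p

# 1 + 2x + 3x^2 at x=2: 1 + 4 + 12 = 17
assert poly_horner([1, 2, 3], 2) == 17

# Cross-check against direct term-by-term evaluation.
A = [5, -1, 0, 2, 7]
assert poly_horner(A, 3) == sum(a * 3**i for i, a in enumerate(A))
```

<p>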
This code requires <cite>n</cite> multiplications and <cite>n</cite> additions (I'm ignoring here the modification of the loop variable <tt class="docutils literal"><span class="pre">i</span></tt>, as I ignored it in all other algorithms, where it was implicit in the Python <tt class="docutils literal"><span class="pre">for</span></tt> loop).</p> <p>While asymptotically similar to the iterative method, Horner's method has better constants and thus is faster.</p> <p>Curiously, Horner's rule was discovered in the early 19th century, long before the advent of computers. It's obviously useful for manual computation of polynomials as well, for the same reason: it requires fewer operations.</p> <p>I've timed the 3 algorithms on a random polynomial of degree 500. The one using Horner's rule is about 5 times faster than the naive approach, and 15% faster than the iterative method.</p> </div> A group-theoretic proof of Euler's theorem2009-08-01T08:00:43-07:002009-08-01T08:00:43-07:00Eli Benderskytag:eli.thegreenplace.net,2009-08-01:/2009/08/01/a-group-theoretic-proof-of-eulers-theorem <p>A very important and useful theorem in number theory is named after Leonhard Euler:</p> <p><img src="https://eli.thegreenplace.net/images/math/b9dd84aaa5b3a778d39ea7b95f32fdeed4510389.gif" /></p> <p>Where <img src="https://eli.thegreenplace.net/images/math/20bdd8a8b971fd8582ce58915d9f42ff001daef3.gif" /> is <a class="reference external" href="http://en.wikipedia.org/wiki/Euler%27s_totient_function">Euler's totient</a> function - the count of numbers smaller than <tt class="docutils literal"><span class="pre">n</span></tt> that are coprime to it.</p> <p>Here I want to present a nice proof of this theorem, based on group theory.
I begin with some …</p> <p>A very important and useful theorem in number theory is named after Leonhard Euler:</p> <p><img src="https://eli.thegreenplace.net/images/math/b9dd84aaa5b3a778d39ea7b95f32fdeed4510389.gif" /></p> <p>Where <img src="https://eli.thegreenplace.net/images/math/20bdd8a8b971fd8582ce58915d9f42ff001daef3.gif" /> is <a class="reference external" href="http://en.wikipedia.org/wiki/Euler%27s_totient_function">Euler's totient</a> function - the count of numbers smaller than <tt class="docutils literal"><span class="pre">n</span></tt> that are coprime to it.</p> <p>Here I want to present a nice proof of this theorem, based on group theory. I begin with some preliminary definitions and gradually move towards the final goal.</p> <p><strong>(I) Congruence class</strong>: Let <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">n</span> <span class="pre">&gt;</span> <span class="pre">0</span></tt> be integers. The set of all integers that have the same remainder as <tt class="docutils literal"><span class="pre">a</span></tt> when divided by <tt class="docutils literal"><span class="pre">n</span></tt> is called the congruence class of <tt class="pre">a</tt> modulo <tt class="docutils literal"><span class="pre">n</span></tt> and is denoted by <img src="https://eli.thegreenplace.net/images/math/8755e0f4dc820d67ad665a79d34fd0bbb6fd9b1b.gif" />, where:</p> <p><img src="https://eli.thegreenplace.net/images/math/41e7da26fc47068db4c05b2400799401d07f648e.gif" /></p> <p><img src="https://eli.thegreenplace.net/images/math/e8969fcfdd076fdce4b4ca0244b6a6b05964a817.gif" /> is the set of all congruence classes modulo <tt class="docutils literal"><span class="pre">n</span></tt>.</p> <p><strong>(II) Units of</strong> <img src="https://eli.thegreenplace.net/images/math/e8969fcfdd076fdce4b4ca0244b6a6b05964a817.gif" />: If for <img 
src="https://eli.thegreenplace.net/images/math/8755e0f4dc820d67ad665a79d34fd0bbb6fd9b1b.gif" /> we find some <img src="https://eli.thegreenplace.net/images/math/2e7ad37347086c8f38ec146b7ab4eb6ccf519672.gif" /> such that <img src="https://eli.thegreenplace.net/images/math/44b1c7f2b4cbe6e3cb87d55030fc40b1e517c4c0.gif" />, we call <img src="https://eli.thegreenplace.net/images/math/8755e0f4dc820d67ad665a79d34fd0bbb6fd9b1b.gif" /> a unit of <img src="https://eli.thegreenplace.net/images/math/e8969fcfdd076fdce4b4ca0244b6a6b05964a817.gif" />. The set of units of <img src="https://eli.thegreenplace.net/images/math/e8969fcfdd076fdce4b4ca0244b6a6b05964a817.gif" /> is denoted by <img src="https://eli.thegreenplace.net/images/math/1699294c58aa602fb840c2215844cec3979a67ee.gif" /></p> <p>For example, <img src="https://eli.thegreenplace.net/images/math/8faf8e13bb0577f8b98f532aabd1a553d7fc66b9.gif" /> is a unit of <img src="https://eli.thegreenplace.net/images/math/d24fb3c22159cb774f33d4fbebe3b875f73e333d.gif" />, because <img src="https://eli.thegreenplace.net/images/math/0fc30f90ee4511066fa696c51e09ec1dd90ecab3.gif" />.</p> <p><strong>(III)</strong> The congruence class <img src="https://eli.thegreenplace.net/images/math/8755e0f4dc820d67ad665a79d34fd0bbb6fd9b1b.gif" /> is a unit of <img src="https://eli.thegreenplace.net/images/math/e8969fcfdd076fdce4b4ca0244b6a6b05964a817.gif" /> if and only if <img src="https://eli.thegreenplace.net/images/math/b88879c72c5d589d94fcecc37ebae71004b36eb5.gif" /> (the GCD of <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">n</span></tt> is 1, in other words they're co-prime).</p> <p>Proof: By definition of units, there exists some <img src="https://eli.thegreenplace.net/images/math/2e7ad37347086c8f38ec146b7ab4eb6ccf519672.gif" /> such that <img src="https://eli.thegreenplace.net/images/math/44b1c7f2b4cbe6e3cb87d55030fc40b1e517c4c0.gif" />. 
Therefore <img src="https://eli.thegreenplace.net/images/math/4de2086bb823b49a9ce0d5565e0d94d277fe21a4.gif" />, which implies that for some <tt class="docutils literal"><span class="pre">q</span></tt>, <img src="https://eli.thegreenplace.net/images/math/2fc6ecaa6433925d2e15d5421a752bcd4eabc3a1.gif" />. Thus <img src="https://eli.thegreenplace.net/images/math/fee61eb9d36e06b6d66cc7d225df42bb9872a2d7.gif" />. So 1 is a linear combination of <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">n</span></tt>. <a class="reference external" href="http://eli.thegreenplace.net/2009/07/10/the-gcd-and-linear-combinations/">Therefore</a> <img src="https://eli.thegreenplace.net/images/math/601a1603beda34af308ef779b2550ce0d9145854.gif" />. On the other hand, if <img src="https://eli.thegreenplace.net/images/math/601a1603beda34af308ef779b2550ce0d9145854.gif" />, there exist <tt class="docutils literal"><span class="pre">q</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt> such that <img src="https://eli.thegreenplace.net/images/math/2a3ae18a531b52442546e23bbf44377e5b584615.gif" />, or <img src="https://eli.thegreenplace.net/images/math/4de2086bb823b49a9ce0d5565e0d94d277fe21a4.gif" />, so <img src="https://eli.thegreenplace.net/images/math/44b1c7f2b4cbe6e3cb87d55030fc40b1e517c4c0.gif" />.</p> <p><strong>(IV)</strong> By definition, since every unit of <img src="https://eli.thegreenplace.net/images/math/e8969fcfdd076fdce4b4ca0244b6a6b05964a817.gif" /> is coprime to <tt class="docutils literal"><span class="pre">n</span></tt>, the number of units of <img src="https://eli.thegreenplace.net/images/math/e8969fcfdd076fdce4b4ca0244b6a6b05964a817.gif" /> (or, the number of elements of <img src="https://eli.thegreenplace.net/images/math/1699294c58aa602fb840c2215844cec3979a67ee.gif" />) is <img src="https://eli.thegreenplace.net/images/math/20bdd8a8b971fd8582ce58915d9f42ff001daef3.gif" />.</p> <p>Let's 
keep this result in mind and prepare some more theorems in order to attack the proof.</p> <p><strong>(V) Lagrange's theorem:</strong> If <em>H</em> is a subgroup of the finite group <em>G</em>, then the order of <em>H</em> is a divisor of the order of <em>G</em>.</p> <p>Proof: Let's first define <img src="https://eli.thegreenplace.net/images/math/eb2ac2e9ece06fb0136369f280a9b9e0fe90d0c7.gif" /> and <img src="https://eli.thegreenplace.net/images/math/74dc1b476dac9aa1fa28a3d82fb39370ae8aa2a9.gif" />. Also, let <img src="https://eli.thegreenplace.net/images/math/ec2243ccce8948e68890aabd1ce859dbb83defe1.gif" /> be the equivalence relation defined in example (III) of the <a class="reference external" href="http://eli.thegreenplace.net/2009/07/17/equivalence-classes-and-group-partitions/">previous post</a>. Since it's an equivalence relation, it partitions <em>G</em> into equivalence classes. Define <img src="https://eli.thegreenplace.net/images/math/424ca85d95fe170ac1502d98d8a623adb807fa41.gif" /> as the equivalence class of <em>a</em> with <img src="https://eli.thegreenplace.net/images/math/ec2243ccce8948e68890aabd1ce859dbb83defe1.gif" />, for any <img src="https://eli.thegreenplace.net/images/math/4cbe37e25ff6e34b50a2ef01190bc26af1cc355e.gif" />.</p> <p>To prove Lagrange's theorem, we're going to show that <img src="https://eli.thegreenplace.net/images/math/424ca85d95fe170ac1502d98d8a623adb807fa41.gif" /> has the same number of elements as <em>H</em>. For this purpose, let's define a function <img src="https://eli.thegreenplace.net/images/math/44ecebaf61b2207527b728e788c781d43c21e248.gif" /> by <img src="https://eli.thegreenplace.net/images/math/06f368612a045f555165ad1e02442d448d61cac2.gif" /> for all <img src="https://eli.thegreenplace.net/images/math/d2282b7258ea3a7b88850baba99bf31584143987.gif" /> and prove that it's a bijection (note that the equivalence class is in general not a subgroup, so a bijection is all we need here).
To do that, we'll have to separately prove that it's onto and one-to-one.</p> <p>But first, let's verify that the stated codomain of <img src="https://eli.thegreenplace.net/images/math/c13d3e630d6430dc77134d5df88542a73dfb1853.gif" /> is correct. If <img src="https://eli.thegreenplace.net/images/math/47710305b38e478c0091b68a632a5a7f1f9574a7.gif" /> then <img src="https://eli.thegreenplace.net/images/math/7e53ba9abb1ae5368c1676e44a3ce6a420c9d702.gif" /> because <img src="https://eli.thegreenplace.net/images/math/16938cce957255f90140fda84a82086811279f0f.gif" />, so by definition of <img src="https://eli.thegreenplace.net/images/math/ec2243ccce8948e68890aabd1ce859dbb83defe1.gif" /> we have <img src="https://eli.thegreenplace.net/images/math/a3a37b36fabdad84d64351523f1d8a025ceb2b6a.gif" />. So indeed the codomain of <img src="https://eli.thegreenplace.net/images/math/c13d3e630d6430dc77134d5df88542a73dfb1853.gif" /> is <img src="https://eli.thegreenplace.net/images/math/424ca85d95fe170ac1502d98d8a623adb807fa41.gif" />.</p> <ol class="arabic simple"> <li>Let's pick some <em>y</em> in <em>G</em> such that <img src="https://eli.thegreenplace.net/images/math/28ccea8fbeb58e8b2aafbe57b2bd1db5e2231ea7.gif" />. By definition of our <img src="https://eli.thegreenplace.net/images/math/ec2243ccce8948e68890aabd1ce859dbb83defe1.gif" /> it means that <img src="https://eli.thegreenplace.net/images/math/4aca76345e9c9a2476f2f490153e95bbd621d732.gif" /> for some <img src="https://eli.thegreenplace.net/images/math/47710305b38e478c0091b68a632a5a7f1f9574a7.gif" />. So <img src="https://eli.thegreenplace.net/images/math/b5a1c5b30321fa43940c63b394adc1f161f4d089.gif" /> has a solution <img src="https://eli.thegreenplace.net/images/math/2e7b84283f31f4ccb9f6b11c1007093203400eba.gif" /> (since <img src="https://eli.thegreenplace.net/images/math/66de997d8baedff86251b7c7fbaf81103eb9a8db.gif" />). 
Therefore <img src="https://eli.thegreenplace.net/images/math/c13d3e630d6430dc77134d5df88542a73dfb1853.gif" /> is onto.</li> <li>Suppose that <img src="https://eli.thegreenplace.net/images/math/84e5a51f34c8e39476b4e62db18d6a88b3f513c7.gif" /> with <img src="https://eli.thegreenplace.net/images/math/4b70a286b390809dc5095ab68b766b056123f843.gif" />. Then <img src="https://eli.thegreenplace.net/images/math/85212195c93331433ff9c4dafb4c066bd4eac844.gif" /> and by cancellation in groups we have <img src="https://eli.thegreenplace.net/images/math/2b8466a1849f730f97e3257cf26339443bf5af38.gif" />, which proves that <img src="https://eli.thegreenplace.net/images/math/c13d3e630d6430dc77134d5df88542a73dfb1853.gif" /> is one-to-one.</li> </ol> <p>So we've proved that <img src="https://eli.thegreenplace.net/images/math/c13d3e630d6430dc77134d5df88542a73dfb1853.gif" /> is a bijection, which means that <img src="https://eli.thegreenplace.net/images/math/85f3f39d0e9a695c2599b186606532ff4e1831c0.gif" /> (we can map each element of <img src="https://eli.thegreenplace.net/images/math/424ca85d95fe170ac1502d98d8a623adb807fa41.gif" /> to one and only one element of <img src="https://eli.thegreenplace.net/images/math/96ceb9b4d8ba9dc94b2358619f4de892b0cb392e.gif" />).</p> <p>We've <a class="reference external" href="http://eli.thegreenplace.net/2009/07/17/equivalence-classes-and-group-partitions/">previously shown</a> that the equivalence classes of <img src="https://eli.thegreenplace.net/images/math/ec2243ccce8948e68890aabd1ce859dbb83defe1.gif" /> partition <em>G</em>. But now we see that the size of each equivalence class is equal to <img src="https://eli.thegreenplace.net/images/math/96ceb9b4d8ba9dc94b2358619f4de892b0cb392e.gif" />. Therefore, all the equivalence classes are of the same size, and <img src="https://eli.thegreenplace.net/images/math/b8be964828998afcc8345c4168ff9e9bee619879.gif" /> where <em>t</em> is the number of equivalence classes.
This proves Lagrange's theorem.</p> <p>We're almost there. To see how all of this relates to Euler's theorem, let's first define the order of an element of a group.</p> <p><strong>(VI) Order of group element:</strong> Let <img src="https://eli.thegreenplace.net/images/math/4cbe37e25ff6e34b50a2ef01190bc26af1cc355e.gif" />. If there exists a positive integer <em>n</em> such that <img src="https://eli.thegreenplace.net/images/math/2d87ae91e6616a94fed293028372e33750f1cfc7.gif" />, then a is said to have <strong>finite order</strong> and the smallest such positive integer is called the <strong>order</strong> of <em>a</em>, denoted by <img src="https://eli.thegreenplace.net/images/math/da44ab0ab8337608c62ebeecc3ea57ebf47707a3.gif" />.</p> <p>We'll also define the subgroup <strong>generated</strong> by an element:</p> <p><strong>(VII) Cyclic subgroup:</strong> <img src="https://eli.thegreenplace.net/images/math/4c75b2ed8003dc064a1436983872dee8d997813b.gif" /> is a cyclic subgroup of <em>G</em> generated by <img src="https://eli.thegreenplace.net/images/math/4cbe37e25ff6e34b50a2ef01190bc26af1cc355e.gif" />. For a finite <em>G</em> this subgroup is also finite, and its size is: <img src="https://eli.thegreenplace.net/images/math/f107fd88a660909965da200640fae4efa5e1a8ba.gif" />.</p> <p>Armed with these definitions, we're now ready for the following corollary of Lagrange's theorem:</p> <p><strong>(VIII) Lagrange theorem corollary:</strong> Let <em>G</em> be a finite group of order <em>n</em>. 
Then:</p> <ol class="arabic simple"> <li>For any <img src="https://eli.thegreenplace.net/images/math/4cbe37e25ff6e34b50a2ef01190bc26af1cc355e.gif" />, <img src="https://eli.thegreenplace.net/images/math/da44ab0ab8337608c62ebeecc3ea57ebf47707a3.gif" /> divides <em>n</em></li> <li>For any <img src="https://eli.thegreenplace.net/images/math/4cbe37e25ff6e34b50a2ef01190bc26af1cc355e.gif" />, <img src="https://eli.thegreenplace.net/images/math/2d87ae91e6616a94fed293028372e33750f1cfc7.gif" /></li> </ol> <p>Proof: As we've seen, <img src="https://eli.thegreenplace.net/images/math/f107fd88a660909965da200640fae4efa5e1a8ba.gif" /> and by Lagrange's theorem <img src="https://eli.thegreenplace.net/images/math/f1d20c2a90429ea2c3ad65b999596d80f1dbbd8a.gif" /> divides <em>n</em> (since <img src="https://eli.thegreenplace.net/images/math/003dd0b9f9b592b676d030e7da22e89a06339deb.gif" /> is a subgroup of <em>G</em>). Therefore (1) is proven. For (2), note that if <em>a</em> has order <em>m</em>, then by (1) we have <img src="https://eli.thegreenplace.net/images/math/cd5cdf72717cbde659275a92956c6c74904afc18.gif" /> for some integer <em>q</em>. Thus <img src="https://eli.thegreenplace.net/images/math/346b1071cf373e6327585c473b0f7ff643c5318b.gif" />. But <em>a</em> has order <em>m</em>, so <img src="https://eli.thegreenplace.net/images/math/8b56cce28b56a3bfe81e1612d183dfc042bd402e.gif" /> and therefore <img src="https://eli.thegreenplace.net/images/math/9ccb8587d38930ca20bae3ad4c6c9d4215e3ced5.gif" />. <em>Q.E.D.</em></p> <p>We now finally have all the tools required to prove Euler's theorem.</p> <p>Proof of Euler's theorem: Let <img src="https://eli.thegreenplace.net/images/math/ae4ae2b95770589320bf2a1844bc34c8afac7f18.gif" /> be the group of units modulo <em>n</em>. The order of <em>G</em> is <img src="https://eli.thegreenplace.net/images/math/20bdd8a8b971fd8582ce58915d9f42ff001daef3.gif" /> (by <strong>(IV)</strong>).
Now, by <strong>(VIII)</strong> part (2), raising any congruence class to the power <img src="https://eli.thegreenplace.net/images/math/20bdd8a8b971fd8582ce58915d9f42ff001daef3.gif" /> must give the identity element. The statement <img src="https://eli.thegreenplace.net/images/math/41bfcd931fdf749bf761b9e98ae5bf9e1e050a5c.gif" /> is equivalent to <img src="https://eli.thegreenplace.net/images/math/5cf274690f4bed52d2f24bf39482492aaaf6d135.gif" /></p> <p><img src="https://eli.thegreenplace.net/images/math/7b47d4175993a732aa2287de666a82273110f26e.gif" /></p> Equivalence classes and group partitions2009-07-17T15:47:57-07:002009-07-17T15:47:57-07:00Eli Benderskytag:eli.thegreenplace.net,2009-07-17:/2009/07/17/equivalence-classes-and-group-partitions <p>In this post I want to show some interesting definitions and theorems about equivalence relations &amp; classes, and groups.</p> <p><em>Relations</em> are an important topic in algebra. Conceptually, a relation is a statement <tt class="docutils literal"><span class="pre">aRb</span></tt> about two elements of a set.
If the elements are integers, then <img src="https://eli.thegreenplace.net/images/math/ccff2fee4b15e0b46f79f86ce5d1de59163bb483.gif" /> is a relation, and so is <img src="https://eli.thegreenplace.net/images/math/291666cb9894498f52e69a8e08f287ca771c204d.gif" />.</p> <p>Here's a formal set-theoretic definition:</p> <p><strong>(I) Binary relation:</strong> A <em>binary relation</em> on a set <tt class="docutils literal"><span class="pre">A</span></tt> is a collection of ordered pairs of elements of <tt class="docutils literal"><span class="pre">A</span></tt>. In other words, it is a subset of <img src="https://eli.thegreenplace.net/images/math/bc659bc638626217264a2aa7a0cca55c0cc40ddc.gif" />. More generally, a binary relation between two sets <tt class="docutils literal"><span class="pre">A</span></tt> and <tt class="docutils literal"><span class="pre">B</span></tt> is a subset of <img src="https://eli.thegreenplace.net/images/math/61589f4d75ca185c6165e5108883b41f5b630222.gif" />.</p> <p>Note it says <em>ordered pairs</em>. What this means is that the order of elements in a relation is important.
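</p>

<p>To make the definition concrete (this small sketch is ours, not from the original post), a relation on a finite set can be modeled in Python directly as a set of ordered pairs:</p>

```python
# The ">=" relation on A = {1, 2, 3}, represented as a subset of A x A.
A = {1, 2, 3}
geq = {(a, b) for a in A for b in A if a >= b}

# The pairs are ordered: (3, 1) is in the relation, but (1, 3) is not.
assert (3, 1) in geq
assert (1, 3) not in geq
```

<p>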
Intuitively, given the relation <img src="https://eli.thegreenplace.net/images/math/14d75ce806cfa95e2165e15f3f40cbc02b2526a1.gif" /> and the set of integers, it's clear that <img src="https://eli.thegreenplace.net/images/math/291666cb9894498f52e69a8e08f287ca771c204d.gif" /> does not generally imply <img src="https://eli.thegreenplace.net/images/math/7edbd8459c300b48c7a5bfdb112e3294e51f3788.gif" />.</p> <p>An important sub-class of relations we'll be most interested in is the <em>equivalence relations</em>:</p> <p><strong>(II) Equivalence relation:</strong> A relation <img src="https://eli.thegreenplace.net/images/math/ec2243ccce8948e68890aabd1ce859dbb83defe1.gif" /> on a set <tt class="docutils literal"><span class="pre">S</span></tt> is called an <em>equivalence relation</em> if it's reflexive, symmetric and transitive:</p> <ul class="simple"> <li>Reflexive: for all <tt class="docutils literal"><span class="pre">a</span></tt> in <tt class="docutils literal"><span class="pre">S</span></tt> it holds that <img src="https://eli.thegreenplace.net/images/math/8837c782dfc7a199063f41e8a89f3d2f6968f863.gif" />.</li> <li>Symmetric: for all <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt> in <tt class="docutils literal"><span class="pre">S</span></tt> it holds that if <img src="https://eli.thegreenplace.net/images/math/e30651047d0c214a1dcc4ca726674497a16b692f.gif" /> then <img src="https://eli.thegreenplace.net/images/math/9541931763753defc9183053ae757c301d2de115.gif" />.</li> <li>Transitive: for all <tt class="docutils literal"><span class="pre">a</span></tt>, <tt class="docutils literal"><span class="pre">b</span></tt> and <tt class="docutils literal"><span class="pre">c</span></tt> in <tt class="docutils literal"><span class="pre">S</span></tt> it holds that if <img src="https://eli.thegreenplace.net/images/math/e30651047d0c214a1dcc4ca726674497a16b692f.gif" /> and <img
src="https://eli.thegreenplace.net/images/math/8ca3658ef957edae19b4b119308890e2b3f49796.gif" /> then <img src="https://eli.thegreenplace.net/images/math/0179f9af305ab8e1c766267cfd11318b6b4811bf.gif" />.</li> </ul> <p>Examples: equality is an equivalence relation, but greater-or-equal is not, as <img src="https://eli.thegreenplace.net/images/math/291666cb9894498f52e69a8e08f287ca771c204d.gif" /> doesn't imply <img src="https://eli.thegreenplace.net/images/math/7edbd8459c300b48c7a5bfdb112e3294e51f3788.gif" />: the symmetric condition doesn't hold.</p> <p>Another important equivalence relation is congruence modulo an integer, <img src="https://eli.thegreenplace.net/images/math/0c4ce290323e491e2482e46e01db353836b45d7d.gif" />.</p> <p><strong>(III) Example:</strong> Let <tt class="docutils literal"><span class="pre">G</span></tt> be a group and <tt class="docutils literal"><span class="pre">H</span></tt> a subgroup of <tt class="docutils literal"><span class="pre">G</span></tt>. For <img src="https://eli.thegreenplace.net/images/math/6678de816bbd60aedd8839895cd79707af6a97d8.gif" />, define <img src="https://eli.thegreenplace.net/images/math/e30651047d0c214a1dcc4ca726674497a16b692f.gif" /> if <img src="https://eli.thegreenplace.net/images/math/3071dc07c816d497eb3e1fc0340e7cac7016cf5c.gif" />. Then <img src="https://eli.thegreenplace.net/images/math/ec2243ccce8948e68890aabd1ce859dbb83defe1.gif" /> is an equivalence relation on <tt class="docutils literal"><span class="pre">G</span></tt>.</p> <p>Proof: To prove that some relation is an equivalence relation, we have to prove the three properties of equivalence relations.</p> <p>Reflexive: <img src="https://eli.thegreenplace.net/images/math/a17e86615bef6d2199ab48b9b1dcb5b014565084.gif" /> is the identity element <tt class="docutils literal"><span class="pre">e</span></tt>.
Since <tt class="docutils literal"><span class="pre">H</span></tt> is a subgroup of <tt class="docutils literal"><span class="pre">G</span></tt>, it contains the identity: <img src="https://eli.thegreenplace.net/images/math/3e0c664e92b67cb72df9d035b6b604b6350fe921.gif" />. Therefore <img src="https://eli.thegreenplace.net/images/math/34527efbfb9e31e54e0852df71b1c49e95ccdfbc.gif" />, so <img src="https://eli.thegreenplace.net/images/math/8837c782dfc7a199063f41e8a89f3d2f6968f863.gif" />.</p> <p>Symmetric: Assume that <img src="https://eli.thegreenplace.net/images/math/3071dc07c816d497eb3e1fc0340e7cac7016cf5c.gif" />. Since <tt class="docutils literal"><span class="pre">H</span></tt> is a subgroup, this element has an inverse in <tt class="docutils literal"><span class="pre">H</span></tt>: <img src="https://eli.thegreenplace.net/images/math/9e3f7534ed8dab84ae64ab9f0ec20772c69ddff0.gif" />. Using the associative law of groups several times it's possible to show that <img src="https://eli.thegreenplace.net/images/math/7317708012349dbcc4b3eb70f95119ae9c286203.gif" />. So <img src="https://eli.thegreenplace.net/images/math/553bb94e84e7af685533d0ee89de27b427e65802.gif" />, hence <img src="https://eli.thegreenplace.net/images/math/9541931763753defc9183053ae757c301d2de115.gif" />.</p> <p>Transitive: Suppose <img src="https://eli.thegreenplace.net/images/math/3071dc07c816d497eb3e1fc0340e7cac7016cf5c.gif" /> and <img src="https://eli.thegreenplace.net/images/math/f11dbfe83910887c218813c7a4cb90fc7fed83f8.gif" />.
Then <img src="https://eli.thegreenplace.net/images/math/2dc110b11d970dfda421a11dd07ecc95ded21823.gif" />. So <img src="https://eli.thegreenplace.net/images/math/cad936cd9155324d4258bc2368ee68a2c9a6b88e.gif" />, hence <img src="https://eli.thegreenplace.net/images/math/0179f9af305ab8e1c766267cfd11318b6b4811bf.gif" />.</p> <p>Thus, we've proved that this <img src="https://eli.thegreenplace.net/images/math/ec2243ccce8948e68890aabd1ce859dbb83defe1.gif" /> is an equivalence relation. Let's see a concrete application of the result we've just proved:</p> <p>Consider <img src="https://eli.thegreenplace.net/images/math/55a2a59ab8896d9110c3d3c055c8c8f3b5c88297.gif" /> with the operation <tt class="docutils literal"><span class="pre">+</span></tt>, and let <tt class="docutils literal"><span class="pre">H</span></tt> be the subgroup consisting of all multiples of some <img src="https://eli.thegreenplace.net/images/math/719262cd45830248133d8a9183d18f9b43c1a7cb.gif" />. Then <img src="https://eli.thegreenplace.net/images/math/3071dc07c816d497eb3e1fc0340e7cac7016cf5c.gif" /> actually means that <img src="https://eli.thegreenplace.net/images/math/54a610609dd020d4a60658c7d44b97d9dbc04dfb.gif" /> for some <img src="https://eli.thegreenplace.net/images/math/f8b63b37c9f85fea422386a6a34535a2f2a7cc07.gif" />. In other words, <img src="https://eli.thegreenplace.net/images/math/0c4ce290323e491e2482e46e01db353836b45d7d.gif" />.
This proves that congruence modulo <tt class="docutils literal"><span class="pre">n</span></tt> is an equivalence relation, since it's a special case of (III).</p> <p><strong>(IV) Equivalence class:</strong> If <img src="https://eli.thegreenplace.net/images/math/ec2243ccce8948e68890aabd1ce859dbb83defe1.gif" /> is an equivalence relation on <tt class="docutils literal"><span class="pre">S</span></tt>, then <tt class="docutils literal"><span class="pre">[a]</span></tt>, the <em>equivalence class of a</em>, is defined by <img src="https://eli.thegreenplace.net/images/math/4f1018740fe8826cbe9aae04c0ef6642794359e5.gif" />.</p> <p>For example, let's take the integers <img src="https://eli.thegreenplace.net/images/math/b719c7ce5a7442a3bf64a8fa268fc460dcd2f3a3.gif" /> and define an equivalence relation &quot;congruent modulo 5&quot;. For instance, <img src="https://eli.thegreenplace.net/images/math/15ea75a387b7bed5bffb2ec749050bcb69efda6d.gif" />. The congruence class of 1 modulo 5 (denoted <img src="https://eli.thegreenplace.net/images/math/51c8de74b436372084218b4c20c9b56b31841b9d.gif" />) is <img src="https://eli.thegreenplace.net/images/math/21a88de3471dc39ecebb41416ef2440fa87d41f7.gif" />.</p> <p><strong>(V) Group partition:</strong> If <img src="https://eli.thegreenplace.net/images/math/ec2243ccce8948e68890aabd1ce859dbb83defe1.gif" /> is an equivalence relation on <tt class="docutils literal"><span class="pre">S</span></tt>, then <img src="https://eli.thegreenplace.net/images/math/4e2b5a0104ce54ff58d1bfef5bf627f27b3c3144.gif" /> for all <img src="https://eli.thegreenplace.net/images/math/f27f1b6ae66beb6c65d284fef3c58b10699c72e3.gif" />, and <img src="https://eli.thegreenplace.net/images/math/002cf7d039da1d117c5f9020c2c22d6636a4c559.gif" /> implies that <img src="https://eli.thegreenplace.net/images/math/c0849a8679091eddc0ea75adcac02b4c8e4b8536.gif" />.
In other words, <img src="https://eli.thegreenplace.net/images/math/ec2243ccce8948e68890aabd1ce859dbb83defe1.gif" /> partitions <tt class="docutils literal"><span class="pre">S</span></tt> into disjoint equivalence classes.</p> <p>Proof: the first part is easy. Since <img src="https://eli.thegreenplace.net/images/math/52809cd6fe155fec2dc9497412b1a27ab85f1f6c.gif" /> always holds, <img src="https://eli.thegreenplace.net/images/math/c1f41fa09366f686d3c80f12c4bc1af5026d64c9.gif" />. To prove the second part, we'll show that if <img src="https://eli.thegreenplace.net/images/math/3674bbe6defde731553a80353589f1857d60505f.gif" /> then <img src="https://eli.thegreenplace.net/images/math/eb3b4462d76bf469f89dc6ded2d3acc1ab80a036.gif" />.</p> <p>Suppose that <img src="https://eli.thegreenplace.net/images/math/3674bbe6defde731553a80353589f1857d60505f.gif" />, and let <img src="https://eli.thegreenplace.net/images/math/f70628deb363032a930c8efe9c02ce456be4e741.gif" />. Therefore <img src="https://eli.thegreenplace.net/images/math/f0d3c43681c62263d12367105cd54ac13189adf3.gif" /> and <img src="https://eli.thegreenplace.net/images/math/a5c98d9612eb21bebe6860ac91e618336bd2c45a.gif" />. But <img src="https://eli.thegreenplace.net/images/math/ec2243ccce8948e68890aabd1ce859dbb83defe1.gif" /> is an equivalence relation and thus is transitive and symmetric. So <img src="https://eli.thegreenplace.net/images/math/e30651047d0c214a1dcc4ca726674497a16b692f.gif" />. But this means that <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt> are in the same equivalence class: <img src="https://eli.thegreenplace.net/images/math/eb3b4462d76bf469f89dc6ded2d3acc1ab80a036.gif" />.
<em>Q.E.D.</em></p> Detexify recognizes hand-written math symbols2009-07-13T05:31:16-07:002009-07-13T05:31:16-07:00Eli Benderskytag:eli.thegreenplace.net,2009-07-13:/2009/07/13/detexify-recognizes-hand-written-math-symbols Does it ever happen to you that you don't remember the LaTeX code for some mathematical symbol? What can you do then except wade through pages of LaTeX symbols trying to locate the right one? Well, no more! <a href="http://detexify.kirelabs.org/classify.html">Detexify</a> is a great new service that allows you to "draw" the symbol you're looking for: <p> <img src="https://eli.thegreenplace.net/images/2009/07/intg_handwriting.png" title="intg_handwriting" width="301" height="366" class="alignnone size-full wp-image-1801" /> </p> ... and it will suggest the LaTeX code. <p> <img src="https://eli.thegreenplace.net/images/2009/07/intg_suggestions.png" title="intg_suggestions" width="287" height="310" class="alignnone size-full wp-image-1802" /> </p> Detexify is a learning OCR classifier, and can be "trained" by users to improve its performance. Kudos to the <a href="http://kirelabs.org/">creator</a> of Detexify for a great project. It will definitely be useful...
Generating multi-subsets using arithmetic2009-07-11T07:27:13-07:002009-07-11T07:27:13-07:00Eli Benderskytag:eli.thegreenplace.net,2009-07-11:/2009/07/11/generating-multi-subsets-using-arithmetic <p>In the past <a class="reference external" href="http://eli.thegreenplace.net/2005/03/29/application-of-combinations/">I've written</a> about how simple arithmetic can be employed to compute a powerset of a given set.</p> <p>Here I want to show a generalization that uses n-ary arithmetic. But first, let's define the problem:</p> <p>Suppose you have a set of elements and you want to select multi-subsets from it. By multi-subset in this context I mean that an element can appear more than once in it. For example, given the set {0, 1, 2, 3, 4, 5}, then {1, 1, 2} is a multi-subset. So are {5, 5, 5, 5} and {0, 1, 2, 3, 4, 5}. Suppose you want to go over <em>all</em> multi-subsets of a set. How can this be done?</p> <p>Note that generating the powerset is a special case of this problem, restricting each element to appear either 0 or 1 times in the resulting subset.</p> <p>So the solution is a generalization of the <a class="reference external" href="http://eli.thegreenplace.net/2005/03/29/application-of-combinations/">binary-arithmetic solution</a> for the powerset problem.</p> <p>Intuitive motivation: consider the decimal numbers, for example 25. If we use the position of each digit (starting with the units) to convey information, this leads to an interesting observation.
If we have two elements to choose from, 25 may mean 5 occurrences of the first element and 2 of the second. Now, going over all numbers from 0 to 99, we are actually generating all multi-subsets of two elements where each can be picked from 0 to 9 times.</p> <p>Once this is clear, the algorithm is simple. Let's generalize to an n-ary base system, using position to point to an element and the 'digit' at this position to say how many times it appears in a given multi-subset. And the best part: the simple rules of addition with carry can now be used to efficiently generate all multi-subsets, given the number of elements we have (<tt class="docutils literal"><span class="pre">length</span></tt>) and the maximal number of times each can be picked (<tt class="docutils literal"><span class="pre">upto</span></tt>), the minimum being assumed 0.</p> <p>Here's the code:</p> <div class="highlight"><pre>
def multiselects(upto, length):
    # Arithmetically, we create an array of digits
    # (each in the range 0..upto).
    # It's initialized with '1'
    #
    ar = [1] + [0] * (length - 1)

    while True:
        yield ar

        # The index we're currently trying to
        # advance
        #
        idx = 0

        # Advance the current index. If it reaches
        # the limit (upto), perform a carry to the
        # next index (digit)
        #
        while idx &lt; length:
            ar[idx] += 1
            if ar[idx] &lt;= upto:
                break
            else:
                ar[idx] = 0
                idx += 1

        # We've reached the last number...
        #
        if idx == length:
            break
</pre></div> <p>An example run of:</p> <div class="highlight"><pre>
for s in multiselects(2, 3):
    print s
</pre></div> <p>Produces:</p> <div class="highlight"><pre>
[1, 0, 0]
[2, 0, 0]
[0, 1, 0]
[1, 1, 0]
[2, 1, 0]
[0, 2, 0]
[1, 2, 0]
[2, 2, 0]
[0, 0, 1]
[1, 0, 1]
[2, 0, 1]
[0, 1, 1]
[1, 1, 1]
[2, 1, 1]
[0, 2, 1]
[1, 2, 1]
[2, 2, 1]
[0, 0, 2]
[1, 0, 2]
[2, 0, 2]
[0, 1, 2]
[1, 1, 2]
[2, 1, 2]
[0, 2, 2]
[1, 2, 2]
[2, 2, 2]
</pre></div> <p>Note that the solution is general, as the lists it returns are lists of indices. These can be employed with any set to generate multi-subsets.</p> <p><strong>Background and links</strong></p> <p>I came up with this function while working on Project Euler's problem 77.
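</p>

<p>For comparison (this sketch is ours, not part of the original post), the same count arrays can be produced with the standard library's <tt class="docutils literal"><span class="pre">itertools.product</span></tt>, which performs the same n-ary counting:</p>

```python
from itertools import product

def multiselects_product(upto, length):
    # product() varies the last position fastest; reversing each tuple
    # reproduces the "least significant digit first" order of multiselects.
    for digits in product(range(upto + 1), repeat=length):
        yield list(reversed(digits))
```

<p>Unlike the original generator, this also yields the all-zero array (the empty multi-subset) as its first item; the rest of the sequence is identical.</p>

<p>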
I ended up using a different method, but visualizing the possible partitions of primes was very useful.</p> <p>Here are some interesting mathematical links related to this problem:</p> <ul class="simple"> <li><a class="reference external" href="http://mathworld.wolfram.com/EulerTransform.html">Euler transform</a></li> <li><a class="reference external" href="http://mathworld.wolfram.com/PartitionFunctionP.html">Partition function P</a></li> <li><a class="reference external" href="http://mathworld.wolfram.com/PrimePartition.html">Prime partition</a></li> </ul> The GCD and linear combinations2009-07-10T07:49:39-07:002009-07-10T07:49:39-07:00Eli Benderskytag:eli.thegreenplace.net,2009-07-10:/2009/07/10/the-gcd-and-linear-combinations <p>A linear combination of <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt> is some integer of the form
<img src="https://eli.thegreenplace.net/images/math/62b14b4483debf849fb89c978fc9d8de667d50ee.gif" />, where <img src="https://eli.thegreenplace.net/images/math/4d7d79c7dff51ad08353b3af8ec8de78276f7d02.gif" />.</p> <p>There's a very interesting theorem that gives a useful connection between linear combinations and the GCD of <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt>, called <a class="reference external" href="http://en.wikipedia.org/wiki/B%C3%A9zout%27s_identity">Bézout's identity</a>:</p> <p><strong>Bézout's identity:</strong> <img src="https://eli.thegreenplace.net/images/math/3faea922fd91a480e6e951bdbe6568c36de8a854.gif" /> (the GCD of <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt>) is the smallest positive linear combination of non-zero <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt>.</p> <p>Both Bézout's identity and its corollary I show below are very useful tools in elementary number theory, being used for the proofs of many of the most fundamental theorems. Let's see why it's true.</p> <p><strong>(I) Intuition:</strong> First I'd like to explain this (surprising at first sight) theorem intuitively. By definition, any common divisor of <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt> will divide <img src="https://eli.thegreenplace.net/images/math/62b14b4483debf849fb89c978fc9d8de667d50ee.gif" /> for all <img src="https://eli.thegreenplace.net/images/math/4d7d79c7dff51ad08353b3af8ec8de78276f7d02.gif" />.
In particular, <img src="https://eli.thegreenplace.net/images/math/3faea922fd91a480e6e951bdbe6568c36de8a854.gif" /> also divides any <img src="https://eli.thegreenplace.net/images/math/62b14b4483debf849fb89c978fc9d8de667d50ee.gif" />.</p> <p>Now, assume we've found some small <img src="https://eli.thegreenplace.net/images/math/d7f9da6f2362bc7b8efdffe11cdf4ba00597aba0.gif" /> which isn't the GCD. But we've just said that <img src="https://eli.thegreenplace.net/images/math/3faea922fd91a480e6e951bdbe6568c36de8a854.gif" /> divides all linear combinations, so it also divides <tt class="docutils literal"><span class="pre">x</span></tt>. Therefore, <tt class="docutils literal"><span class="pre">x</span></tt> cannot be smaller than the GCD. In other words, the smallest positive linear combination can only be <img src="https://eli.thegreenplace.net/images/math/3faea922fd91a480e6e951bdbe6568c36de8a854.gif" /> itself.</p> <p><strong>(II) Corollary:</strong> An integer is a linear combination of <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt> if and only if it is a multiple of their GCD.</p> <p>To prove Bézout's identity more formally, and along the way to see why the corollary is also true, let's first prove the following:</p> <p><strong>(III)</strong> Let <tt class="docutils literal"><span class="pre">I</span></tt> be a nonempty set of integers that is closed under addition and subtraction, and contains at least one non-zero integer.
Then there exists a smallest positive element <img src="https://eli.thegreenplace.net/images/math/beff059ebb50376acab168901db80ded7949e1fa.gif" />, and <tt class="docutils literal"><span class="pre">I</span></tt> consists of all multiples of <tt class="docutils literal"><span class="pre">b</span></tt> (<img src="https://eli.thegreenplace.net/images/math/e7937e0bb65c4938597a8228eeb17a91f07986f9.gif" />).</p> <p>Proof: <tt class="docutils literal"><span class="pre">I</span></tt> contains at least one non-zero integer. Then it definitely contains at least one positive integer, because it is closed under addition and subtraction. Assume we have <img src="https://eli.thegreenplace.net/images/math/b36dc5cdf3bcc0208bf87c0f98375fd5eabf0ed8.gif" /> for some <img src="https://eli.thegreenplace.net/images/math/f9e61579d79e8c1c1fa9119bd87d3d61f00e8c19.gif" />. Therefore <img src="https://eli.thegreenplace.net/images/math/55f6c1c701b762759d0ab1491f781999ed43db6e.gif" /> and then also <img src="https://eli.thegreenplace.net/images/math/6486f2162b7ae30383261a7e5cdaa6288cfa88c4.gif" />. Thus we have positive integers in <tt class="docutils literal"><span class="pre">I</span></tt>. According to the <a class="reference external" href="http://eli.thegreenplace.net/2009/07/09/the-well-ordering-principle">well-ordering principle</a>, <tt class="docutils literal"><span class="pre">I</span></tt> has a smallest positive element which we'll call <tt class="docutils literal"><span class="pre">b</span></tt>.</p> <p>Now we'll want to show that <img src="https://eli.thegreenplace.net/images/math/e7937e0bb65c4938597a8228eeb17a91f07986f9.gif" />. 
As usual, to prove equalities of sets, it will be shown that they contain one another.</p> <p><img src="https://eli.thegreenplace.net/images/math/55cd1587cf04efebfb6e92f8581388e230593775.gif" /> is obvious - since <tt class="docutils literal"><span class="pre">I</span></tt> contains <tt class="docutils literal"><span class="pre">b</span></tt> and is closed under addition and subtraction, it contains all the multiples of <tt class="docutils literal"><span class="pre">b</span></tt>.</p> <p>To prove <img src="https://eli.thegreenplace.net/images/math/4a8035ec5843674bf59e5e2e2b3d59ea6117698c.gif" /> we'll demonstrate that any element <img src="https://eli.thegreenplace.net/images/math/5cf7becda6c13605eca7634f4c7f794056f756e1.gif" /> is a multiple of <tt class="docutils literal"><span class="pre">b</span></tt>. Using the division algorithm we write <img src="https://eli.thegreenplace.net/images/math/9a77ecf748d7f958c6f632f3906a7a9e61e0ad39.gif" /> for some integers <tt class="docutils literal"><span class="pre">q</span></tt> and <tt class="docutils literal"><span class="pre">0</span> <span class="pre">&lt;=</span> <span class="pre">r</span> <span class="pre">&lt;</span> <span class="pre">b</span></tt>. But this means that <img src="https://eli.thegreenplace.net/images/math/4525123efb61ba233370b72315d8557151fadfaa.gif" /> (because <tt class="docutils literal"><span class="pre">I</span></tt> contains <tt class="docutils literal"><span class="pre">bq</span></tt> and <tt class="docutils literal"><span class="pre">c</span></tt> and is closed under subtraction and addition). However, recall that <tt class="docutils literal"><span class="pre">b</span></tt> was chosen to be the smallest positive element of <tt class="docutils literal"><span class="pre">I</span></tt>, so <tt class="docutils literal"><span class="pre">r</span></tt> must be equal to 0. 
Therefore <tt class="docutils literal"><span class="pre">c</span></tt> is a multiple of <tt class="docutils literal"><span class="pre">b</span></tt>, and we have shown that <img src="https://eli.thegreenplace.net/images/math/4a8035ec5843674bf59e5e2e2b3d59ea6117698c.gif" />. <em>Q.E.D.</em></p> <p>Now back to Bézout's identity. We'll define:</p> <p><img src="https://eli.thegreenplace.net/images/math/9ce9860e1da4b572d6a076d2b48b4219d6980495.gif" /></p> <p>This <tt class="docutils literal"><span class="pre">I</span></tt> is obviously non-empty and is closed under addition and subtraction (by its definition as a linear combination). Note, in particular, that it also contains <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt>. By <strong>(III)</strong>, <tt class="docutils literal"><span class="pre">I</span></tt> consists of all multiples of its smallest positive element, which we'll call <tt class="docutils literal"><span class="pre">d</span></tt> here.</p> <p>To show that <img src="https://eli.thegreenplace.net/images/math/021c7f3f9c969802e3a91c61b8bc9c24400350f6.gif" /> we have to show that <tt class="docutils literal"><span class="pre">d|a</span></tt>, <tt class="docutils literal"><span class="pre">d|b</span></tt> and if <tt class="docutils literal"><span class="pre">c|a</span></tt> and <tt class="docutils literal"><span class="pre">c|b</span></tt> then <tt class="docutils literal"><span class="pre">c|d</span></tt>. First, by definition <tt class="docutils literal"><span class="pre">d</span></tt> is a divisor of any element in <tt class="docutils literal"><span class="pre">I</span></tt>, so it also divides <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt>. 
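</p> <p>As an aside (not part of the proof itself), the coefficients whose existence Bézout's identity guarantees can be computed constructively with the extended Euclidean algorithm. A minimal sketch, with an illustrative function name of my choosing:</p>

```python
def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    x0, x1, y0, y1 = 1, 0, 0, 1
    while b:
        # Invariant: a == a_orig*x0 + b_orig*y0 and b == a_orig*x1 + b_orig*y1
        q, a, b = a // b, b, a % b
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
    return a, x0, y0

g, x, y = extended_gcd(240, 46)
assert g == 2 and 240 * x + 46 * y == 2
```

<p>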
If <tt class="docutils literal"><span class="pre">c|a</span></tt> and <tt class="docutils literal"><span class="pre">c|b</span></tt>, say <tt class="docutils literal"><span class="pre">a=cq</span></tt> and <tt class="docutils literal"><span class="pre">b=cp</span></tt>, then:</p> <p><img src="https://eli.thegreenplace.net/images/math/29c402dd016d08d3af7b751065de7690102403dc.svg" /></p> <p>So <tt class="docutils literal"><span class="pre">c|d</span></tt>, which completes our proof that <tt class="docutils literal"><span class="pre">d=(a,b)</span></tt>. <em>Q.E.D.</em></p> <p>Regarding the corollary, it stems trivially from the definition of <tt class="docutils literal"><span class="pre">I</span></tt> and the proof above.</p> The well-ordering principle2009-07-09T09:25:53-07:002009-07-09T09:25:53-07:00Eli Benderskytag:eli.thegreenplace.net,2009-07-09:/2009/07/09/the-well-ordering-principle <p>The <a class="reference external" href="http://en.wikipedia.org/wiki/Well-ordering_principle">well-ordering principle</a> states:</p> <p><strong>The well-ordering principle:</strong> Any nonempty set of nonnegative integers has a smallest element.</p> <p><em>DUH, you don't say!</em> - seems obvious, doesn't it? This principle is, nevertheless, a very important and fundamental tool for proving other basic principles of number theory.</p> <p>Consider, for instance, the <a class="reference external" href="http://en.wikipedia.org/wiki/Division_algorithm">Division Algorithm</a>:</p> <p><strong>The …</strong></p> <p>The <a class="reference external" href="http://en.wikipedia.org/wiki/Well-ordering_principle">well-ordering principle</a> states:</p> <p><strong>The well-ordering principle:</strong> Any nonempty set of nonnegative integers has a smallest element.</p> <p><em>DUH, you don't say!</em> - seems obvious, doesn't it? 
This principle is, nevertheless, a very important and fundamental tool for proving other basic principles of number theory.</p> <p>Consider, for instance, the <a class="reference external" href="http://en.wikipedia.org/wiki/Division_algorithm">Division Algorithm</a>:</p> <p><strong>The Division Algorithm:</strong> If <tt class="docutils literal"><span class="pre">m</span></tt> and <tt class="docutils literal"><span class="pre">n</span></tt> are integers with <tt class="docutils literal"><span class="pre">n</span> <span class="pre">&gt;</span> <span class="pre">0</span></tt>, then there exist integers <tt class="docutils literal"><span class="pre">q</span></tt> and <tt class="docutils literal"><span class="pre">r</span></tt>, with <tt class="docutils literal"><span class="pre">0</span> <span class="pre">&lt;=</span> <span class="pre">r</span> <span class="pre">&lt;</span> <span class="pre">n</span></tt>, such that <img src="https://eli.thegreenplace.net/images/math/c2245e91a0ec659885e704a303ba2eff8e6045a0.gif" />.</p> <p>Again, this is so basic that one may doubt whether it should even be proved. But the well-ordering principle allows us, in fact, to prove the division algorithm in a rigorous manner:</p> <p>Let <img src="https://eli.thegreenplace.net/images/math/29968af66b01fdd8c5de887a44db0b4e60f8fce7.gif" />. It is obvious that <tt class="docutils literal"><span class="pre">W</span></tt> contains nonnegative integers. Let <img src="https://eli.thegreenplace.net/images/math/d398cf3839320215d8508f5492d19d4b5dbd83b8.gif" />. <em>By the well-ordering principle</em>, <tt class="docutils literal"><span class="pre">V</span></tt> has a smallest element, which we'll call <tt class="docutils literal"><span class="pre">r</span></tt>. 
<img src="https://eli.thegreenplace.net/images/math/7c93b5e9572f2a5efb01641ce59940f15f54e1e4.gif" />, so <img src="https://eli.thegreenplace.net/images/math/2b2c0fb96d6759348beadd2c624aa1d92b13a264.gif" /> for some <tt class="docutils literal"><span class="pre">q</span></tt> and <tt class="docutils literal"><span class="pre">r</span> <span class="pre">&gt;=</span> <span class="pre">0</span></tt> (by the definition of sets <tt class="docutils literal"><span class="pre">W</span></tt> and <tt class="docutils literal"><span class="pre">V</span></tt>, correspondingly).</p> <p>Now, what's left to prove is that <tt class="docutils literal"><span class="pre">r</span> <span class="pre">&lt;</span> <span class="pre">n</span></tt>. Let's assume the opposite, namely that <img src="https://eli.thegreenplace.net/images/math/b1a067328be8dd74e839ad82f6ebd20089567001.gif" />. Rearranging: <img src="https://eli.thegreenplace.net/images/math/6d4c9ea92f063dd1610459a98e79b2332c2a17e4.gif" />. By the definition of <tt class="docutils literal"><span class="pre">V</span></tt>, <img src="https://eli.thegreenplace.net/images/math/48f703813d8cbec496b2dd24d6a87684843c6237.gif" /> (since it has the form <img src="https://eli.thegreenplace.net/images/math/44ed40b22260264b9048f920e66afd51a982e97f.gif" /> for some integer <tt class="docutils literal"><span class="pre">t</span></tt> and is nonnegative). But recall that we called <tt class="docutils literal"><span class="pre">r</span></tt> the smallest element of <tt class="docutils literal"><span class="pre">V</span></tt>, and <img src="https://eli.thegreenplace.net/images/math/4e8431424ab97947ea763a3df6fbbf1eb74b4d53.gif" />, so we have a contradiction.</p> <p>Therefore, we see that <tt class="docutils literal"><span class="pre">r</span> <span class="pre">&lt;</span> <span class="pre">n</span></tt>. This completes the proof. 
<em>Q.E.D.</em></p> Project Euler problem 66 and continued fractions2009-06-19T14:49:07-07:002009-06-19T14:49:07-07:00Eli Benderskytag:eli.thegreenplace.net,2009-06-19:/2009/06/19/project-euler-problem-66-and-continued-fractions <p><a class="reference external" href="http://projecteuler.net/index.php?section=problems&amp;id=66">Problem 66</a> is one of those problems that make Project Euler lots of fun. It doesn't have a brute-force solution, and to solve it one actually has to implement a non-trivial mathematical algorithm and get exposed to several interesting techniques.</p> <p>I will not post the solution or the full code …</p> <p><a class="reference external" href="http://projecteuler.net/index.php?section=problems&amp;id=66">Problem 66</a> is one of those problems that make Project Euler lots of fun. It doesn't have a brute-force solution, and to solve it one actually has to implement a non-trivial mathematical algorithm and get exposed to several interesting techniques.</p> <p>I will not post the solution or the full code for the problem here, just a couple of hints.</p> <p>After a very short bout of Googling, you'll discover that the Diophantine equation:</p> <p><img src="https://eli.thegreenplace.net/images/math/cdc11e760e8b319f652e19c6daf547cbe9d0b0f9.gif" /></p> <p>Is quite famous and is called <a class="reference external" href="http://en.wikipedia.org/wiki/Pell%27s_equation">Pell's equation</a>. From here, further web searches and Wikipedia-reading will bring you to at least two methods for finding the <em>fundamental solution</em>, which is the pair of <tt class="docutils literal"><span class="pre">x</span></tt> and <tt class="docutils literal"><span class="pre">y</span></tt> with minimal <tt class="docutils literal"><span class="pre">x</span></tt> solving it.</p> <p>One of the methods involves computing the continued-fraction representation of the square root of <tt class="docutils literal"><span class="pre">D</span></tt>. 
<a class="reference external" href="http://www.mcs.surrey.ac.uk/Personal/R.Knott/Fibonacci/cfINTRO.html">This page</a> is a must read on this topic, and will help you with other Euler problems as well.</p> <p>I want to post here a code snippet that implements the continued-fraction computation described in that link. Its steps follow the <em>Algebraic algorithm</em> given there:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">CF_of_sqrt</span>(n):
    <span style="color: #7f007f">&quot;&quot;&quot; Compute the continued fraction representation of the</span>
<span style="color: #7f007f">        square root of N.</span>
<span style="color: #7f007f">        The first element in the returned array is the whole</span>
<span style="color: #7f007f">        part of the fraction. The others are the denominators</span>
<span style="color: #7f007f">        up to (and not including) the point where it starts</span>
<span style="color: #7f007f">        repeating.</span>
<span style="color: #7f007f">        Uses the algorithm explained here:</span>
<span style="color: #7f007f">        http://www.mcs.surrey.ac.uk/Personal/R.Knott/Fibonacci/cfINTRO.html</span>
<span style="color: #7f007f">        In the section named: &quot;Methods of finding continued</span>
<span style="color: #7f007f">        fractions for square roots&quot;</span>
<span style="color: #7f007f">    &quot;&quot;&quot;</span>
    <span style="color: #00007f; font-weight: bold">if</span> is_square(n):
        <span style="color: #00007f; font-weight: bold">return</span> [<span style="color: #00007f">int</span>(math.sqrt(n))]

    ans = []
    step1_num = <span style="color: #007f7f">0</span>
    step1_denom = <span style="color: #007f7f">1</span>

    <span style="color: #00007f; font-weight: bold">while</span> <span style="color: #00007f">True</span>:
        nextn = <span style="color: #00007f">int</span>((math.floor(math.sqrt(n)) + step1_num) / step1_denom)
        ans.append(<span style="color: #00007f">int</span>(nextn))

        step2_num = step1_denom
        step2_denom = step1_num - step1_denom * nextn

        step3_denom = (n - step2_denom ** <span style="color: #007f7f">2</span>) / step2_num
        step3_num = -step2_denom

        <span style="color: #00007f; font-weight: bold">if</span> step3_denom == <span style="color: #007f7f">1</span>:
            ans.append(ans[<span style="color: #007f7f">0</span>] * <span style="color: #007f7f">2</span>)
            <span style="color: #00007f; font-weight: bold">break</span>

        step1_num, step1_denom = step3_num, step3_denom

    <span style="color: #00007f; font-weight: bold">return</span> ans
</pre></div> <p>As I said, this still isn't enough to solve the problem, but with this code in hand, the solution isn't too far. Read some more about Pell's equation and you'll discover how to use this code to reach a solution.</p> <p>It took my program ~30 milliseconds to find an answer to the problem, by the way. It took less than a second to solve a 10-times larger problem (for D &lt;= 10000), so I believe it to be a pretty good implementation.</p> Efficient modular exponentiation algorithms2009-03-28T09:51:29-07:002009-03-28T09:51:29-07:00Eli Benderskytag:eli.thegreenplace.net,2009-03-28:/2009/03/28/efficient-modular-exponentiation-algorithms <p><a class="reference external" href="http://eli.thegreenplace.net/2009/03/21/efficient-integer-exponentiation-algorithms/">Earlier this week</a> I've discussed efficient algorithms for exponentiation.</p> <p>However, for real-life needs of number theoretic computations, just raising numbers to large exponents isn't very useful, because extremely huge numbers start appearing very quickly <a class="footnote-reference" href="#id8" id="id1"></a>, and these don't have much use. 
What's much more useful is <a class="reference external" href="http://en.wikipedia.org/wiki/Modular_exponentiation">modular exponentiation</a>, raising integers …</p> <p><a class="reference external" href="http://eli.thegreenplace.net/2009/03/21/efficient-integer-exponentiation-algorithms/">Earlier this week</a> I've discussed efficient algorithms for exponentiation.</p> <p>However, for real-life needs of number theoretic computations, just raising numbers to large exponents isn't very useful, because extremely huge numbers start appearing very quickly <a class="footnote-reference" href="#id8" id="id1"></a>, and these don't have much use. What's much more useful is <a class="reference external" href="http://en.wikipedia.org/wiki/Modular_exponentiation">modular exponentiation</a>, raising integers to high powers <img src="https://eli.thegreenplace.net/images/math/5ed051f99a1984c11a5b2d4ea770f3dc527912d8.gif" /> <a class="footnote-reference" href="#id9" id="id2"></a></p> <p>Luckily, we can reuse the efficient algorithms developed in the previous article, with very few modifications to perform modular exponentiation as well. This is possible because of some convenient properties of modular arithmetic.</p> <div class="section" id="modular-multiplication"> <h3>Modular multiplication</h3> <p>Given two numbers, <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt>, their product modulo <tt class="docutils literal"><span class="pre">n</span></tt> is <img src="https://eli.thegreenplace.net/images/math/60a31df99204b91b44d3bc8b6c0b462d5302182c.gif" />. Consider the number <tt class="docutils literal"><span class="pre">x</span> <span class="pre">&lt;</span> <span class="pre">n</span></tt>, such that <img src="https://eli.thegreenplace.net/images/math/a24e63eaa528ead7411690545b6f1525adf6fd11.gif" />. 
Such a number always exists, and we usually call it the <em>remainder</em> of dividing <tt class="docutils literal"><span class="pre">a</span></tt> by <tt class="docutils literal"><span class="pre">n</span></tt>. Similarly, there is a <tt class="docutils literal"><span class="pre">y</span> <span class="pre">&lt;</span> <span class="pre">n</span></tt>, such that <img src="https://eli.thegreenplace.net/images/math/210006d46db94551446e69e871dd5aa85a917d18.gif" />. It follows from basic rules of modular arithmetic that <img src="https://eli.thegreenplace.net/images/math/13b1a1a7a7cab72a42640bc57c99dff9f9fc78dc.gif" /> <a class="footnote-reference" href="#id10" id="id3"></a></p> <p>Therefore, if we want to know the product of <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt> modulo <tt class="docutils literal"><span class="pre">n</span></tt>, we just have to keep their remainders when divided by <tt class="docutils literal"><span class="pre">n</span></tt>. Note: <tt class="docutils literal"><span class="pre">a</span></tt> and <tt class="docutils literal"><span class="pre">b</span></tt> may be arbitrarily large, but <tt class="docutils literal"><span class="pre">x</span></tt> and <tt class="docutils literal"><span class="pre">y</span></tt> are always smaller than <tt class="docutils literal"><span class="pre">n</span></tt>.</p> </div> <div class="section" id="a-naive-algorithm"> <h3>A naive algorithm</h3> <p>What is the most naive way you can think of for computing <img src="https://eli.thegreenplace.net/images/math/f342f6b456e722a0a7bf3fc6b194bcf73aab90e5.gif" />? Raise <tt class="docutils literal"><span class="pre">a</span></tt> to the power <tt class="docutils literal"><span class="pre">b</span></tt>, and then reduce modulo <tt class="docutils literal"><span class="pre">n</span></tt>. 
Right?</p> <p>Indeed, this is a very unsophisticated and slow method, because raising <tt class="docutils literal"><span class="pre">a</span></tt> to the power <tt class="docutils literal"><span class="pre">b</span></tt> can result in a really huge number that takes a long time to compute.</p> <p>For any useful number, this algorithm is so slow that I'm not even going to run it in the tests.</p> </div> <div class="section" id="using-the-properties-of-modular-multiplication"> <h3>Using the properties of modular multiplication</h3> <p>As we've learned above, modular multiplication allows us to just keep the intermediate result <img src="https://eli.thegreenplace.net/images/math/5ed051f99a1984c11a5b2d4ea770f3dc527912d8.gif" /> at each step. Here's the implementation of a simple repeated multiplication algorithm for computing modular exponents this way:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">modexp_mul</span>(a, b, n):
    r = <span style="color: #007f7f">1</span>
    <span style="color: #00007f; font-weight: bold">for</span> i <span style="color: #0000aa">in</span> <span style="color: #00007f">xrange</span>(b):
        r = r * a % n
    <span style="color: #00007f; font-weight: bold">return</span> r
</pre></div> <p>It's much better than the naive algorithm, but as we saw in the previous article it's quite slow, requiring <tt class="docutils literal"><span class="pre">b</span></tt> multiplications (and reductions modulo <tt class="docutils literal"><span class="pre">n</span></tt>).</p> <p>We can apply the same modular reduction rule to the more efficient exponentiation algorithms we've studied <a class="reference external" href="http://eli.thegreenplace.net/2009/03/21/efficient-integer-exponentiation-algorithms/">before</a>.</p> </div> <div class="section" id="modular-exponentiation-by-squaring"> <h3>Modular exponentiation by squaring</h3> <p>Here's the right-to-left method with modular reductions at each step:</p> 
<div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">modexp_rl</span>(a, b, n):
    r = <span style="color: #007f7f">1</span>
    <span style="color: #00007f; font-weight: bold">while</span> <span style="color: #007f7f">1</span>:
        <span style="color: #00007f; font-weight: bold">if</span> b % <span style="color: #007f7f">2</span> == <span style="color: #007f7f">1</span>:
            r = r * a % n
        b /= <span style="color: #007f7f">2</span>
        <span style="color: #00007f; font-weight: bold">if</span> b == <span style="color: #007f7f">0</span>:
            <span style="color: #00007f; font-weight: bold">break</span>
        a = a * a % n
    <span style="color: #00007f; font-weight: bold">return</span> r
</pre></div> <p>We use exactly the same algorithm, but reduce every multiplication <img src="https://eli.thegreenplace.net/images/math/5ed051f99a1984c11a5b2d4ea770f3dc527912d8.gif" />. So the numbers we deal with here are never very large.</p> <p>Similarly, here's the left-to-right method:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">modexp_lr</span>(a, b, n):
    r = <span style="color: #007f7f">1</span>
    <span style="color: #00007f; font-weight: bold">for</span> bit <span style="color: #0000aa">in</span> reversed(_bits_of_n(b)):
        r = r * r % n
        <span style="color: #00007f; font-weight: bold">if</span> bit == <span style="color: #007f7f">1</span>:
            r = r * a % n
    <span style="color: #00007f; font-weight: bold">return</span> r
</pre></div> <p>With <tt class="docutils literal"><span class="pre">_bits_of_n</span></tt> being, as before:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">_bits_of_n</span>(n):
    <span style="color: #7f007f">&quot;&quot;&quot; Return the list of the bits in the binary</span>
<span style="color: #7f007f">        representation of n, from LSB to MSB</span>
<span style="color: #7f007f">    
&quot;&quot;&quot;</span>
    bits = []
    <span style="color: #00007f; font-weight: bold">while</span> n:
        bits.append(n % <span style="color: #007f7f">2</span>)
        n /= <span style="color: #007f7f">2</span>
    <span style="color: #00007f; font-weight: bold">return</span> bits
</pre></div> </div> <div class="section" id="relative-performance"> <h3>Relative performance</h3> <p>As I've noted in the <a class="reference external" href="http://eli.thegreenplace.net/2009/03/21/efficient-integer-exponentiation-algorithms/">previous article</a>, the RL method does a worse job of keeping its multiplicands low than the LR method. And indeed, for smaller <tt class="docutils literal"><span class="pre">n</span></tt>, RL is somewhat faster than LR. For larger <tt class="docutils literal"><span class="pre">n</span></tt>, RL is somewhat slower.</p> <p>What's obvious is that now the built-in <tt class="docutils literal"><span class="pre">pow</span></tt> is superior to both hand-coded methods <a class="footnote-reference" href="#id11" id="id4"></a>. My tests show it's anywhere from twice to 10 times as fast.</p> <p>Why is <tt class="docutils literal"><span class="pre">pow</span></tt> so much faster? Is it only the efficiency of C versus Python? Not really. In fact, <tt class="docutils literal"><span class="pre">pow</span></tt> uses an even more sophisticated algorithm for large exponents <a class="footnote-reference" href="#id12" id="id5"></a>. Indeed, for small exponents the runtime of <tt class="docutils literal"><span class="pre">pow</span></tt> is similar to the runtime of the implementations I presented above.</p> </div> <div class="section" id="the-k-ary-lr-method"> <h3>The k-ary LR method</h3> <p>It turns out that the LR method of repeated squaring can be generalized. 
Instead of breaking the exponent into bits of its base-2 representation, we can break it into larger pieces, and save some computations this way.</p> <p>I'll present the k-ary LR method that breaks the exponent into its &quot;digits&quot; in base <img src="https://eli.thegreenplace.net/images/math/1e028fb602c123d0fe4958d8a84229d6803b289e.gif" /> for some integer <tt class="docutils literal"><span class="pre">k</span></tt>. The exponent can be written as:</p> <p><img src="https://eli.thegreenplace.net/images/math/2b3373f91d2f784798a343b046defcaf0bd22786.gif" /></p> <p>Where <img src="https://eli.thegreenplace.net/images/math/8fab90b047823b97522115f88da94c5d6797de3f.gif" /> are the digits of <tt class="docutils literal"><span class="pre">b</span></tt> in base <tt class="docutils literal"><span class="pre">m</span></tt>. <img src="https://eli.thegreenplace.net/images/math/2d4469bf98c45573ce8673265c3c9bde3520e5d2.gif" /> is then:</p> <p><img src="https://eli.thegreenplace.net/images/math/0550b0df0f672136e39d9d050781d467eff82bd8.gif" /></p> <p>We compute this iteratively as follows <a class="footnote-reference" href="#id13" id="id6"></a>:</p> <p>Raise <img src="https://eli.thegreenplace.net/images/math/fa23ac9ecbf9f0cead492e9227e26757a967c284.gif" /> to the <tt class="docutils literal"><span class="pre">m</span></tt>-th power and multiply by <img src="https://eli.thegreenplace.net/images/math/4df976d630809ae3d013ecb8b764cee38121efc2.gif" />. We get <img src="https://eli.thegreenplace.net/images/math/52b89408d7c7c3d7274df85588a4ce6ee8b1a871.gif" />. Next, raise <img src="https://eli.thegreenplace.net/images/math/83b3fdda5b127e3a4f9bcb7b45d2fa7ef3659493.gif" /> to the <tt class="docutils literal"><span class="pre">m</span></tt>-th power and multiply by <img src="https://eli.thegreenplace.net/images/math/b6ed9962207f950a607b7591bd50e6375efc014b.gif" />, obtaining <img src="https://eli.thegreenplace.net/images/math/ec6ac382c692b3afe0496e8ea45a1a780528db13.gif" />. 
If we continue with this, we'll eventually get <img src="https://eli.thegreenplace.net/images/math/2d4469bf98c45573ce8673265c3c9bde3520e5d2.gif" />.</p> <p>This translates into the following code:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">modexp_lr_k_ary</span>(a, b, n, k=<span style="color: #007f7f">5</span>):
    <span style="color: #7f007f">&quot;&quot;&quot; Compute a ** b (mod n)</span>
<span style="color: #7f007f">        K-ary LR method, with a customizable &#39;k&#39;.</span>
<span style="color: #7f007f">    &quot;&quot;&quot;</span>
    base = <span style="color: #007f7f">2</span> &lt;&lt; (k - <span style="color: #007f7f">1</span>)

    <span style="color: #007f00"># Precompute the table of exponents</span>
    table = [<span style="color: #007f7f">1</span>] * base
    <span style="color: #00007f; font-weight: bold">for</span> i <span style="color: #0000aa">in</span> <span style="color: #00007f">xrange</span>(<span style="color: #007f7f">1</span>, base):
        table[i] = table[i - <span style="color: #007f7f">1</span>] * a % n

    <span style="color: #007f00"># Just like the binary LR method, just with a</span>
    <span style="color: #007f00"># different base</span>
    <span style="color: #007f00">#</span>
    r = <span style="color: #007f7f">1</span>
    <span style="color: #00007f; font-weight: bold">for</span> digit <span style="color: #0000aa">in</span> reversed(_digits_of_n(b, base)):
        <span style="color: #00007f; font-weight: bold">for</span> i <span style="color: #0000aa">in</span> <span style="color: #00007f">xrange</span>(k):
            r = r * r % n
        <span style="color: #00007f; font-weight: bold">if</span> digit:
            r = r * table[digit] % n

    <span style="color: #00007f; font-weight: bold">return</span> r
</pre></div> <p>Note that we save some time by pre-computing the powers of <tt class="docutils literal"><span class="pre">a</span></tt> for exponents that can be digits in base <tt class="docutils literal"><span class="pre">m</span></tt>. 
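</p> <p>As a quick sanity check, here is a self-contained Python 3 re-sketch of the same k-ary idea (the original above targets Python 2, and the helper name below is mine, not the author's), compared against the built-in <tt class="docutils literal"><span class="pre">pow</span></tt>:</p>

```python
def modexp_k_ary(a, b, n, k=5):
    """Compute a ** b (mod n) with the k-ary LR method (Python 3 sketch)."""
    base = 2 << (k - 1)          # base == 2**k
    # Precompute a**0 .. a**(base-1), all reduced mod n.
    table = [1] * base
    for i in range(1, base):
        table[i] = table[i - 1] * a % n
    # Digits of b in base 2**k, LSB first.
    digits = []
    while b:
        digits.append(b % base)
        b //= base
    r = 1
    for digit in reversed(digits):   # process MSB first
        for _ in range(k):
            r = r * r % n            # shift the accumulator left by one digit
        if digit:
            r = r * table[digit] % n
    return r

# Cross-check against the built-in pow on a few inputs.
for a, b, n in [(3, 1000, 7919), (2, 10**6 + 3, 10**9 + 7), (12345, 6789, 97)]:
    assert modexp_k_ary(a, b, n) == pow(a, b, n)
```

<p>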
Also, <tt class="docutils literal"><span class="pre">_digits_of_n</span></tt> is the following generalization of <tt class="docutils literal"><span class="pre">_bits_of_n</span></tt>:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">_digits_of_n</span>(n, b):
    <span style="color: #7f007f">&quot;&quot;&quot; Return the list of the digits in the base &#39;b&#39;</span>
<span style="color: #7f007f">        representation of n, from LSB to MSB</span>
<span style="color: #7f007f">    &quot;&quot;&quot;</span>
    digits = []
    <span style="color: #00007f; font-weight: bold">while</span> n:
        digits.append(n % b)
        n /= b
    <span style="color: #00007f; font-weight: bold">return</span> digits
</pre></div> </div> <div class="section" id="performance-of-the-k-ary-method"> <h3>Performance of the k-ary method</h3> <p>In my tests, the k-ary LR method with <tt class="docutils literal"><span class="pre">k</span> <span class="pre">=</span> <span class="pre">5</span></tt> is about 25% faster than the binary LR method, and is within 20% of the built-in <tt class="docutils literal"><span class="pre">pow</span></tt> function.</p> <p>Experimenting with the value of <tt class="docutils literal"><span class="pre">k</span></tt> affects these results, but 5 seems to be a good value that produces the best performance in most cases. This is probably why it's also used as the value of <tt class="docutils literal"><span class="pre">k</span></tt> in the implementation of <tt class="docutils literal"><span class="pre">pow</span></tt>.</p> </div> <div class="section" id="python-s-built-in-pow"> <h3>Python's built-in <tt class="docutils literal"><span class="pre">pow</span></tt></h3> <p>I've mentioned Python's <tt class="docutils literal"><span class="pre">pow</span></tt> function several times in this article. The Python version I'm talking about is 2.5, though I doubt this functionality has changed in 2.6 or 3.0. 
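</p> <p>To make the comparison concrete, the three-argument form of <tt class="docutils literal"><span class="pre">pow</span></tt> mentioned in the footnotes performs modular exponentiation directly, without ever materializing the full power:</p>

```python
a, b, n = 7, 560, 561
# pow with a modulus never builds the huge intermediate a ** b ...
fast = pow(a, b, n)
# ... but agrees with the naive compute-then-reduce approach.
assert fast == (a ** b) % n
```

<p>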
The <tt class="docutils literal"><span class="pre">pow</span></tt> I'm interested in is implemented in the <tt class="docutils literal"><span class="pre">long_pow</span></tt> function in <tt class="docutils literal"><span class="pre">objects/longobject.c</span></tt> in the Python source code distribution. As mentioned in <a class="footnote-reference" href="#id12" id="id7"></a>, it uses the binary LR method for small exponents, and the k-ary LR method for large exponents.</p> <p>These implementations closely follow algorithms 14.79 and 14.82 in the excellent <em>Handbook of Applied Cryptography</em>, which is freely <a class="reference external" href="http://www.cacr.math.uwaterloo.ca/hac/">available online</a>.</p> </div> <div class="section" id="summary"> <h3>Summary</h3> <p>As we've seen, exponentiation and modular exponentiation are among those applications in which an efficient algorithm is required for feasibility. Using the trivial/naive algorithms is possible only for small cases which aren't very interesting. 
To process realistically large numbers (such as the ones required for cryptographic algorithms), one needs powerful methods in one's toolbox.</p> <div align="center" class="align-center"><img class="align-center" src="https://eli.thegreenplace.net/images/hline.jpg" style="width: 320px; height: 5px;" /></div> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>For instance, <img src="https://eli.thegreenplace.net/images/math/ec4a5eb840bac711d2930bed8d05d6f60f08050b.gif" /> is a 4772-digit number.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id9" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>Modular exponentiation is essential for the <a class="reference external" href="http://en.wikipedia.org/wiki/RSA">RSA algorithm</a>, for example.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id10" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td>To be a bit more rigorous, we start with <img src="https://eli.thegreenplace.net/images/math/a24e63eaa528ead7411690545b6f1525adf6fd11.gif" />. This means that <img src="https://eli.thegreenplace.net/images/math/b4554f9f77d7fc2c17e651e01166c8f1735489e3.gif" />, so also <img src="https://eli.thegreenplace.net/images/math/2486c350f0b60040b3f162ebeead1ff9ebe5f7d0.gif" />. Similarly <img src="https://eli.thegreenplace.net/images/math/4efa4bb9a60158d9497274087d139820d3c827d6.gif" />, so also <img src="https://eli.thegreenplace.net/images/math/84bf619e6f68d1d3bc9dc65522dca7e296a46dc0.gif" />. 
Adding these two we get <img src="https://eli.thegreenplace.net/images/math/a4d655317021fe3417f78df6ac22ee553cc833e7.gif" />, which means that <img src="https://eli.thegreenplace.net/images/math/13b1a1a7a7cab72a42640bc57c99dff9f9fc78dc.gif" />.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id11" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4"></a></td><td>Using the 3-argument form of <tt class="docutils literal"><span class="pre">pow</span></tt>, you can perform modular exponentiation.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id12" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id5"></a></td><td><tt class="docutils literal"><span class="pre">FIVEARY_CUTOFF</span></tt> in the code of <tt class="docutils literal"><span class="pre">pow</span></tt> is set to 8. This means that for exponents with more than 8 digits, a special 5-ary algorithm is used. 
For smaller exponents, the regular LR binary method is used - just like the one I presented, just coded in C.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id13" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id6"></a></td><td>Note that for <tt class="docutils literal"><span class="pre">m</span> <span class="pre">=</span> <span class="pre">2</span></tt> this is the familiar binary LR method.</td></tr> </tbody> </table> </div> Efficient integer exponentiation algorithms2009-03-21T19:10:57-07:002009-03-21T19:10:57-07:00Eli Benderskytag:eli.thegreenplace.net,2009-03-21:/2009/03/21/efficient-integer-exponentiation-algorithms <p>Did you ever think about the most efficient method to perform integer exponentiation, that is, raising an integer <tt class="docutils literal"><span class="pre">a</span></tt> to an integer power <tt class="docutils literal"><span class="pre">b</span></tt>, when either <tt class="docutils literal"><span class="pre">a</span></tt> or <tt class="docutils literal"><span class="pre">b</span></tt>, or both, are rather large?</p> <div class="section" id="repeated-multiplication"> <h3>Repeated multiplication</h3> <p>The naive method is, of course, repeated multiplications. 
<img src="https://eli.thegreenplace.net/images/math/fde22a2136b496ef6f8dca2c4278792da0e77678.gif" /> is <tt class="docutils literal"><span class="pre">a</span></tt> multiplied by itself <tt class="docutils literal"><span class="pre">b …</span></tt></p></div> <p>Did you ever think about the most efficient method to perform integer exponentiation, that is, raising an integer <tt class="docutils literal"><span class="pre">a</span></tt> to an integer power <tt class="docutils literal"><span class="pre">b</span></tt>, when either <tt class="docutils literal"><span class="pre">a</span></tt> or <tt class="docutils literal"><span class="pre">b</span></tt>, or both, are rather large?</p> <div class="section" id="repeated-multiplication"> <h3>Repeated multiplication</h3> <p>The naive method is, of course, repeated multiplications. <img src="https://eli.thegreenplace.net/images/math/fde22a2136b496ef6f8dca2c4278792da0e77678.gif" /> is <tt class="docutils literal"><span class="pre">a</span></tt> multiplied by itself <tt class="docutils literal"><span class="pre">b</span></tt> times. Here's how it's coded in my pseudo-code of choice, Python:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">expt_mul</span>(a, b): r = <span style="color: #007f7f">1</span> <span style="color: #00007f; font-weight: bold">for</span> i <span style="color: #0000aa">in</span> <span style="color: #00007f">xrange</span>(b): r *= a <span style="color: #00007f; font-weight: bold">return</span> r </pre></div> <p>Is this efficient? Not really, as we require <tt class="docutils literal"><span class="pre">b</span></tt> multiplications, and as I said earlier <tt class="docutils literal"><span class="pre">b</span></tt> can be very large (think number theory algorithms). 
In fact, there's a <em>much</em> more efficient method.</p> </div> <div class="section" id="exponentiation-by-squaring"> <h3>Exponentiation by squaring</h3> <p>The efficient exponentiation algorithm is based on the simple observation that for an even <tt class="docutils literal"><span class="pre">b</span></tt>, <img src="https://eli.thegreenplace.net/images/math/4d308eabc552e0744ecb53ebb55aeb7b5f6705da.gif" />. This may not look very brilliant, but now consider the following recursive definition:</p> <p><img src="https://eli.thegreenplace.net/images/math/ql_88b3da5b51bbcac021cceb33f708a130_l3.png" /></p> <p>The case of odd <tt class="docutils literal"><span class="pre">b</span></tt> is trivial, as it's obvious that <img src="https://eli.thegreenplace.net/images/math/6c6cc601fdd47eecef30907127482f149d2ed366.gif" />. So now we can compute <img src="https://eli.thegreenplace.net/images/math/fde22a2136b496ef6f8dca2c4278792da0e77678.gif" /> by doing only <tt class="docutils literal"><span class="pre">log(b)</span></tt> squarings and no more than <tt class="docutils literal"><span class="pre">log(b)</span></tt> multiplications, instead of <tt class="docutils literal"><span class="pre">b</span></tt> multiplications - and this is a vast improvement for a large <tt class="docutils literal"><span class="pre">b</span></tt>.</p> <p>This algorithm can be coded in a straightforward way:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">expt_rec</span>(a, b): <span style="color: #00007f; font-weight: bold">if</span> b == <span style="color: #007f7f">0</span>: <span style="color: #00007f; font-weight: bold">return</span> <span style="color: #007f7f">1</span> <span style="color: #00007f; font-weight: bold">elif</span> b % <span style="color: #007f7f">2</span> == <span style="color: #007f7f">1</span>: <span style="color: #00007f; font-weight: bold">return</span> a * expt_rec(a, b - <span style="color: 
#007f7f">1</span>) <span style="color: #00007f; font-weight: bold">else</span>: p = expt_rec(a, b / <span style="color: #007f7f">2</span>) <span style="color: #00007f; font-weight: bold">return</span> p * p </pre></div> <p>Indeed, this algorithm is about 10 times faster than the naive one for exponents on the order of a few thousand. When the exponent is about 100K, it is more than 100 times faster, and the difference keeps growing for larger exponents.</p> </div> <div class="section" id="an-iterative-implementation"> <h3>An iterative implementation</h3> <p>It will be useful to develop an iterative implementation for the fast exponentiation algorithm. For this purpose, however, we need to dive into some mathematics.</p> <p>We can represent the exponent <tt class="docutils literal"><span class="pre">b</span></tt> as:</p> <p><img src="https://eli.thegreenplace.net/images/math/ecbde38c735d854eb05d28e2b9b7e4b034c8cb0f.gif" /></p> <p>Where <img src="https://eli.thegreenplace.net/images/math/052ed07ef4a94acc0a6e5e21d68a64e602538236.gif" /> are the bits (0 or 1) of <tt class="docutils literal"><span class="pre">b</span></tt> in base 2. 
<img src="https://eli.thegreenplace.net/images/math/fde22a2136b496ef6f8dca2c4278792da0e77678.gif" /> is then:</p> <p><img src="https://eli.thegreenplace.net/images/math/d9cf6b4f10b1f4b4dce1f62d3411a4bdcdfc6fdb.gif" /></p> <p>Or, in other words:</p> <p><img src="https://eli.thegreenplace.net/images/math/87d95a4f7ba3d34779c01387e7c7b52985e48e36.gif" /> for <tt class="docutils literal"><span class="pre">k</span></tt> such that <img src="https://eli.thegreenplace.net/images/math/de3451bd16070e6cbfe61f85a2f5a48798db4399.gif" /></p> <p><img src="https://eli.thegreenplace.net/images/math/b5558f10d4f57a6c991f5bf4702e2a807b11eb9d.gif" /> can be computed by repetitive squaring, and moreover, we can reuse the result from a lower <tt class="docutils literal"><span class="pre">k</span></tt> to compute a higher <tt class="docutils literal"><span class="pre">k</span></tt>. This directly translates into the following iterative algorithm:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">expt_bin_rl</span>(a, b): r = <span style="color: #007f7f">1</span> <span style="color: #00007f; font-weight: bold">while</span> <span style="color: #007f7f">1</span>: <span style="color: #00007f; font-weight: bold">if</span> b % <span style="color: #007f7f">2</span> == <span style="color: #007f7f">1</span>: r *= a b /= <span style="color: #007f7f">2</span> <span style="color: #00007f; font-weight: bold">if</span> b == <span style="color: #007f7f">0</span>: <span style="color: #00007f; font-weight: bold">break</span> a *= a <span style="color: #00007f; font-weight: bold">return</span> r </pre></div> <p>To understand how the algorithm works, try to relate it to the formula from above. Using a standard &quot;divide by two and look at the LSB&quot; loop, the exponent <tt class="docutils literal"><span class="pre">b</span></tt> is broken into its binary representation. 
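To watch the loop at work, here is an instrumented Python 3 sketch of the same right-to-left algorithm (with <tt class="docutils literal"><span class="pre">//</span></tt> standing in for Python 2's integer <tt class="docutils literal"><span class="pre">/</span></tt>; the trace printing is my addition, purely for illustration):

```python
def expt_bin_rl_traced(a, b):
    # Right-to-left binary exponentiation, printing a, b, r each step.
    r = 1
    while True:
        if b % 2 == 1:
            r *= a       # current bit is 1: fold the current power of a in
        b //= 2
        print(f"a={a}, b={b}, r={r}")
        if b == 0:
            break
        a *= a           # a now holds the next square a^(2^(k+1))
    return r

print(expt_bin_rl_traced(3, 13))  # 3^13 = 1594323
```

Running it on <tt class="docutils literal"><span class="pre">3**13</span></tt> (binary 1101) shows the result being multiplied by a's successive squares exactly at the 1 bits.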
The lowest bits of <tt class="docutils literal"><span class="pre">b</span></tt> are considered first. <tt class="docutils literal"><span class="pre">a</span></tt> is continually squared to hold <img src="https://eli.thegreenplace.net/images/math/b5558f10d4f57a6c991f5bf4702e2a807b11eb9d.gif" />, and is multiplied into the result only when <img src="https://eli.thegreenplace.net/images/math/de3451bd16070e6cbfe61f85a2f5a48798db4399.gif" />.</p> <p>This algorithm is called <em>right-to-left binary exponentiation</em>, because the binary representation of the exponent is computed from right to left (from the LSB to the MSB) <a class="footnote-reference" href="#id4" id="id1"></a>.</p> <p>A related algorithm can be developed if we prefer to look at the binary representation of the exponent from left to right.</p> </div> <div class="section" id="left-to-right-binary-exponentiation"> <h3>Left-to-right binary exponentiation</h3> <p>Going over the bits of <tt class="docutils literal"><span class="pre">b</span></tt> from MSB to LSB, we get:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">expt_bin_lr</span>(a, b): r = <span style="color: #007f7f">1</span> <span style="color: #00007f; font-weight: bold">for</span> bit <span style="color: #0000aa">in</span> reversed(_bits_of_n(b)): r *= r <span style="color: #00007f; font-weight: bold">if</span> bit == <span style="color: #007f7f">1</span>: r *= a <span style="color: #00007f; font-weight: bold">return</span> r </pre></div> <p>Where <tt class="docutils literal"><span class="pre">_bits_of_n</span></tt> is a method returning the binary representation of its argument as an array of bits from LSB to MSB (which is then reversed, as you see):</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">_bits_of_n</span>(n): <span style="color: #7f007f">&quot;&quot;&quot; Return the list of the bits in the 
binary</span> <span style="color: #7f007f"> representation of n, from LSB to MSB</span> <span style="color: #7f007f"> &quot;&quot;&quot;</span> bits = [] <span style="color: #00007f; font-weight: bold">while</span> n: bits.append(n % <span style="color: #007f7f">2</span>) n /= <span style="color: #007f7f">2</span> <span style="color: #00007f; font-weight: bold">return</span> bits </pre></div> <p>Rationale: consider how you &quot;build&quot; a number from its binary representation when seen from MSB to LSB. You begin with 1 for the MSB (which is always 1, by definition, for numbers &gt; 0). For each new bit you see you double the result, and if the bit is 1, you add 1 <a class="footnote-reference" href="#id5" id="id2"></a>.</p> <p>For example consider the binary 1101. Begin with 1 for the leftmost 1. We have another bit, so we double. That's 2. Now, the new bit is 1, so we add 1, that's 3. We have another bit, so again double, that's 6. The new bit is 0, so nothing is added. And we have one more bit, so once again double, getting 12, and finally adding 1, getting 13. Indeed, 1101 is the binary representation of 13.</p> <p>Back to the exponentiation now. As you see in the code of <tt class="docutils literal"><span class="pre">expt_bin_lr</span></tt>, the binary representation of the exponent is read from MSB to LSB. Since this is the exponent, each &quot;doubling&quot; from the rationale above is squaring, and each &quot;adding 1&quot; is multiplying by the number itself. Hence, the algorithm works.</p> </div> <div class="section" id="performance"> <h3>Performance</h3> <p>As I've mentioned, the squaring method of exponentiation is far more efficient than the naive method of repeated multiplication. In the tests I ran, the iterative left-to-right method is about the same speed as the recursive one, while the iterative right-to-left method is somewhat slower. 
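A small self-contained benchmark sketch along those lines (Python 3 ports of the <tt class="docutils literal"><span class="pre">expt_mul</span></tt> and <tt class="docutils literal"><span class="pre">expt_bin_lr</span></tt> functions shown above; absolute timings will of course vary by machine):

```python
import timeit

def _bits_of_n(n):
    # Bits of n, from LSB to MSB.
    bits = []
    while n:
        bits.append(n % 2)
        n //= 2
    return bits

def expt_mul(a, b):
    # Naive repeated multiplication.
    r = 1
    for _ in range(b):
        r *= a
    return r

def expt_bin_lr(a, b):
    # Left-to-right binary exponentiation.
    r = 1
    for bit in reversed(_bits_of_n(b)):
        r *= r
        if bit == 1:
            r *= a
    return r

# Sanity check: both agree with Python's ** operator.
assert expt_mul(3, 4000) == expt_bin_lr(3, 4000) == 3 ** 4000

naive = timeit.timeit(lambda: expt_mul(3, 4000), number=10)
fast = timeit.timeit(lambda: expt_bin_lr(3, 4000), number=10)
print(f"naive: {naive:.4f}s, binary LR: {fast:.4f}s")
```

The gap widens quickly as the exponent grows, since the naive loop does b bignum multiplications against roughly 2·log2(b) for the binary method.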
In fact, both the recursive and the iterative left-to-right methods are so efficient they're completely on par with Python's built-in <tt class="docutils literal"><span class="pre">pow</span></tt> method <a class="footnote-reference" href="#id6" id="id3"></a>.</p> <p>This is surprising, as I'd actually expect the right-to-left method to be faster, because it skips the reversing of bits when computing the binary representation of the exponent. I'd also expect the built-in <tt class="docutils literal"><span class="pre">pow</span></tt> to be faster.</p> <p>However, thinking harder for a moment, I think I can see why this happens. The RL (right-to-left) version has to multiply larger numbers at all stages, because LR sometimes multiplies by <code>a</code> itself, which is relatively small. Python's bignum implementation can multiply by a small number faster, and this compensates for the need to reverse the bit list. I'll come back to this issue when I'll discuss modular exponentiation. But this is a topic for another article...</p> <div align="center" class="align-center"><img class="align-center" src="https://eli.thegreenplace.net/images/hline.jpg" style="width: 320px; height: 5px;" /></div> <table class="docutils footnote" frame="void" id="id4" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1"></a></td><td>From the looks of it (featuring the binary representation) you'd think this is a modern algorithm. Not at all! According to Knuth, it was first mentioned by the Persian mathematician Jamshīd al-Kāshī in 1427. 
The left-to-right method presented later in the article is even more ancient - it appeared in a Hindu book in about 200 BC.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id5" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2"></a></td><td>This holds for any base, by the way. You can similarly build a number from its decimal digits by multiplying by 10 for each digit you see and adding the digit, at each step.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id6" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3"></a></td><td><tt class="docutils literal"><span class="pre">pow</span></tt> is coded in C and uses the iterative left-to-right method I described with some optimizations and complicated tricks.</td></tr> </tbody> </table> </div> Computing modular square roots in Python2009-03-07T11:59:08-08:002009-03-07T11:59:08-08:00Eli Benderskytag:eli.thegreenplace.net,2009-03-07:/2009/03/07/computing-modular-square-roots-in-python <p>Consider the congruence of the form:</p> <p> <p><img src="https://eli.thegreenplace.net/images/math/27eafd28fcb458b435c774d120c67c85b4f381c8.gif" class="align-center" /></p> </p> <p><tt class="docutils literal"><span class="pre">n</span></tt> is a <em>quadratic residue (mod p)</em>. What is <tt class="docutils literal"><span class="pre">x</span></tt>? In normal arithmetic, this is equivalent to finding the square root of a number. 
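As a small numeric illustration (my own example, not from the original post): modulo p = 13, n = 10 is a quadratic residue, and a brute-force search finds its two modular square roots:

```python
p, n = 13, 10
# Brute force: collect every x in Z_p with x^2 = n (mod p).
roots = [x for x in range(p) if (x * x) % p == n]
print(roots)  # [6, 7] -- note that 7 = p - 6, as expected
```

Brute force is fine for a toy modulus like 13, but is hopeless for cryptographic-size primes, which is what motivates the algorithm below.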
In modular arithmetic, <tt class="docutils literal"><span class="pre">x</span></tt> is the <em>modular square root</em> of <tt class="docutils literal"><span class="pre">n</span></tt> modulo <tt class="docutils literal"><span class="pre">p</span></tt>.</p> <p>Now, in the general case, this is …</p> <p>Consider the congruence of the form:</p> <p> <p><img src="https://eli.thegreenplace.net/images/math/27eafd28fcb458b435c774d120c67c85b4f381c8.gif" class="align-center" /></p> </p> <p><tt class="docutils literal"><span class="pre">n</span></tt> is a <em>quadratic residue (mod p)</em>. What is <tt class="docutils literal"><span class="pre">x</span></tt>? In normal arithmetic, this is equivalent to finding the square root of a number. In modular arithmetic, <tt class="docutils literal"><span class="pre">x</span></tt> is the <em>modular square root</em> of <tt class="docutils literal"><span class="pre">n</span></tt> modulo <tt class="docutils literal"><span class="pre">p</span></tt>.</p> <p>Now, in the general case, this is a very difficult problem to solve. In fact, it's equivalent to integer factorization, because no efficient algorithm is known to find the modular square root modulo a composite number, and if the modulus is composite it has to be factored first.</p> <p>But when <tt class="docutils literal"><span class="pre">p</span></tt> is prime, an efficient polynomial algorithm exists for computing <tt class="docutils literal"><span class="pre">x</span></tt>. This is the <a class="reference external" href="http://en.wikipedia.org/wiki/Shanks-Tonelli_algorithm">Tonelli-Shanks algorithm.</a></p> <p>Computing modular square roots is probably not one of those things you do daily, but I ran into it while solving a Project Euler problem. So I'm posting the Python implementation of the Tonelli-Shanks algorithm here. 
It is based on the explanation in the paper <em>&quot;Square roots from 1; 24, 51, 10 to Dan Shanks&quot;</em> by <a class="reference external" href="http://www.math.vt.edu/people/brown/doc.html">Ezra Brown</a>, as I found the Wikipedia algorithm hard to follow.</p> <p>The code is tested, and as far as I can tell works correctly and efficiently:</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">modular_sqrt</span>(a, p): <span style="color: #7f007f">&quot;&quot;&quot; Find a square root of &#39;a&#39; modulo p. p</span> <span style="color: #7f007f"> must be an odd prime.</span> <span style="color: #7f007f"> Solve the congruence of the form:</span> <span style="color: #7f007f"> x^2 = a (mod p)</span> <span style="color: #7f007f"> and return x. Note that p - x is also a root.</span> <span style="color: #7f007f"> 0 is returned if no square root exists for</span> <span style="color: #7f007f"> these a and p.</span> <span style="color: #7f007f"> The Tonelli-Shanks algorithm is used (except</span> <span style="color: #7f007f"> for some simple cases in which the solution</span> <span style="color: #7f007f"> is known from an identity). 
This algorithm</span> <span style="color: #7f007f"> runs in polynomial time (unless the</span> <span style="color: #7f007f"> generalized Riemann hypothesis is false).</span> <span style="color: #7f007f"> &quot;&quot;&quot;</span> <span style="color: #007f00"># Simple cases</span> <span style="color: #007f00">#</span> <span style="color: #00007f; font-weight: bold">if</span> legendre_symbol(a, p) != <span style="color: #007f7f">1</span>: <span style="color: #00007f; font-weight: bold">return</span> <span style="color: #007f7f">0</span> <span style="color: #00007f; font-weight: bold">elif</span> a == <span style="color: #007f7f">0</span>: <span style="color: #00007f; font-weight: bold">return</span> <span style="color: #007f7f">0</span> <span style="color: #00007f; font-weight: bold">elif</span> p == <span style="color: #007f7f">2</span>: <span style="color: #00007f; font-weight: bold">return</span> 0 <span style="color: #00007f; font-weight: bold">elif</span> p % <span style="color: #007f7f">4</span> == <span style="color: #007f7f">3</span>: <span style="color: #00007f; font-weight: bold">return</span> <span style="color: #00007f">pow</span>(a, (p + <span style="color: #007f7f">1</span>) / <span style="color: #007f7f">4</span>, p) <span style="color: #007f00"># Partition p-1 to s * 2^e for an odd s (i.e.</span> <span style="color: #007f00"># reduce all the powers of 2 from p-1)</span> <span style="color: #007f00">#</span> s = p - <span style="color: #007f7f">1</span> e = <span style="color: #007f7f">0</span> <span style="color: #00007f; font-weight: bold">while</span> s % <span style="color: #007f7f">2</span> == <span style="color: #007f7f">0</span>: s /= <span style="color: #007f7f">2</span> e += <span style="color: #007f7f">1</span> <span style="color: #007f00"># Find some &#39;n&#39; with a legendre symbol n|p = -1.</span> <span style="color: #007f00"># Shouldn&#39;t take long.</span> <span style="color: #007f00">#</span> n = <span style="color: #007f7f">2</span> 
<span style="color: #00007f; font-weight: bold">while</span> legendre_symbol(n, p) != -<span style="color: #007f7f">1</span>: n += <span style="color: #007f7f">1</span> <span style="color: #007f00"># Here be dragons!</span> <span style="color: #007f00"># Read the paper &quot;Square roots from 1; 24, 51,</span> <span style="color: #007f00"># 10 to Dan Shanks&quot; by Ezra Brown for more</span> <span style="color: #007f00"># information</span> <span style="color: #007f00">#</span> <span style="color: #007f00"># x is a guess of the square root that gets better</span> <span style="color: #007f00"># with each iteration.</span> <span style="color: #007f00"># b is the &quot;fudge factor&quot; - by how much we&#39;re off</span> <span style="color: #007f00"># with the guess. The invariant x^2 = ab (mod p)</span> <span style="color: #007f00"># is maintained throughout the loop.</span> <span style="color: #007f00"># g is used for successive powers of n to update</span> <span style="color: #007f00"># both a and b</span> <span style="color: #007f00"># r is the exponent - decreases with each update</span> <span style="color: #007f00">#</span> x = <span style="color: #00007f">pow</span>(a, (s + <span style="color: #007f7f">1</span>) / <span style="color: #007f7f">2</span>, p) b = <span style="color: #00007f">pow</span>(a, s, p) g = <span style="color: #00007f">pow</span>(n, s, p) r = e <span style="color: #00007f; font-weight: bold">while</span> <span style="color: #00007f">True</span>: t = b m = <span style="color: #007f7f">0</span> <span style="color: #00007f; font-weight: bold">for</span> m <span style="color: #0000aa">in</span> <span style="color: #00007f">xrange</span>(r): <span style="color: #00007f; font-weight: bold">if</span> t == <span style="color: #007f7f">1</span>: <span style="color: #00007f; font-weight: bold">break</span> t = <span style="color: #00007f">pow</span>(t, <span style="color: #007f7f">2</span>, p) <span style="color: #00007f; font-weight: 
bold">if</span> m == <span style="color: #007f7f">0</span>: <span style="color: #00007f; font-weight: bold">return</span> x gs = <span style="color: #00007f">pow</span>(g, <span style="color: #007f7f">2</span> ** (r - m - <span style="color: #007f7f">1</span>), p) g = (gs * gs) % p x = (x * gs) % p b = (b * g) % p r = m <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">legendre_symbol</span>(a, p): <span style="color: #7f007f">&quot;&quot;&quot; Compute the Legendre symbol a|p using</span> <span style="color: #7f007f"> Euler&#39;s criterion. p is a prime, a is</span> <span style="color: #7f007f"> relatively prime to p (if p divides</span> <span style="color: #7f007f"> a, then a|p = 0)</span> <span style="color: #7f007f"> Returns 1 if a has a square root modulo</span> <span style="color: #7f007f"> p, -1 otherwise.</span> <span style="color: #7f007f"> &quot;&quot;&quot;</span> ls = <span style="color: #00007f">pow</span>(a, (p - <span style="color: #007f7f">1</span>) / <span style="color: #007f7f">2</span>, p) <span style="color: #00007f; font-weight: bold">return</span> -<span style="color: #007f7f">1</span> <span style="color: #00007f; font-weight: bold">if</span> ls == p - <span style="color: #007f7f">1</span> <span style="color: #00007f; font-weight: bold">else</span> ls </pre></div> Rabin-Miller primality test implementation2009-02-21T12:19:42-08:002009-02-21T12:19:42-08:00Eli Benderskytag:eli.thegreenplace.net,2009-02-21:/2009/02/21/rabin-miller-primality-test-implementation <p>Here's a fairly efficient Python (2.5) and well-documented implementation of the <a class="reference external" href="http://mathworld.wolfram.com/Rabin-MillerStrongPseudoprimeTest.html">Rabin-Miller primality test</a>, based on section 33.8 in CLR's <em>Introduction to Algorithms</em>. 
Due to Python's built-in arbitrary precision arithmetic, this works for numbers of any size.</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">from</span> <span style="color: #00007f">random</span> <span style="color: #00007f; font-weight: bold">import</span> randint <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">_bits_of_n</span>(n): <span style="color: #7f007f">&quot;&quot;&quot; Return the list of …</span></pre></div> <p>Here's a fairly efficient Python (2.5) and well-documented implementation of the <a class="reference external" href="http://mathworld.wolfram.com/Rabin-MillerStrongPseudoprimeTest.html">Rabin-Miller primality test</a>, based on section 33.8 in CLR's <em>Introduction to Algorithms</em>. Due to Python's built-in arbitrary precision arithmetic, this works for numbers of any size.</p> <div class="highlight"><pre><span style="color: #00007f; font-weight: bold">from</span> <span style="color: #00007f">random</span> <span style="color: #00007f; font-weight: bold">import</span> randint <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">_bits_of_n</span>(n): <span style="color: #7f007f">&quot;&quot;&quot; Return the list of the bits in the binary</span> <span style="color: #7f007f"> representation of n, from LSB to MSB</span> <span style="color: #7f007f"> &quot;&quot;&quot;</span> bits = [] <span style="color: #00007f; font-weight: bold">while</span> n: bits.append(n % <span style="color: #007f7f">2</span>) n /= <span style="color: #007f7f">2</span> <span style="color: #00007f; font-weight: bold">return</span> bits <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">_MR_composite_witness</span>(a, n): <span style="color: #7f007f">&quot;&quot;&quot; Witness functions for the Miller-Rabin</span> <span style="color: #7f007f"> test. 
If &#39;a&#39; can be used to prove that</span> <span style="color: #7f007f"> &#39;n&#39; is composite, return True. If False</span> <span style="color: #7f007f"> is returned, there&#39;s a high (though &lt; 1)</span> <span style="color: #7f007f"> probability that &#39;n&#39; is prime.</span> <span style="color: #7f007f"> &quot;&quot;&quot;</span> rem = <span style="color: #007f7f">1</span> <span style="color: #007f00"># Computes a^(n-1) mod n, using modular</span> <span style="color: #007f00"># exponentiation by repeated squaring.</span> <span style="color: #007f00">#</span> <span style="color: #00007f; font-weight: bold">for</span> b <span style="color: #0000aa">in</span> reversed(_bits_of_n(n - <span style="color: #007f7f">1</span>)): x = rem rem = (rem * rem) % n <span style="color: #00007f; font-weight: bold">if</span> rem == <span style="color: #007f7f">1</span> <span style="color: #0000aa">and</span> x != <span style="color: #007f7f">1</span> <span style="color: #0000aa">and</span> x != n - <span style="color: #007f7f">1</span>: <span style="color: #00007f; font-weight: bold">return</span> <span style="color: #00007f">True</span> <span style="color: #00007f; font-weight: bold">if</span> b == <span style="color: #007f7f">1</span>: rem = (rem * a) % n <span style="color: #00007f; font-weight: bold">if</span> rem != <span style="color: #007f7f">1</span>: <span style="color: #00007f; font-weight: bold">return</span> <span style="color: #00007f">True</span> <span style="color: #00007f; font-weight: bold">return</span> <span style="color: #00007f">False</span> <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">isprime_MR</span>(n, trials=<span style="color: #007f7f">6</span>): <span style="color: #7f007f">&quot;&quot;&quot; Determine whether n is prime using the</span> <span style="color: #7f007f"> probabilistic Miller-Rabin test. 
Follows</span> <span style="color: #7f007f"> the procedure described in section 33.8</span> <span style="color: #7f007f"> in CLR&#39;s Introduction to Algorithms</span> <span style="color: #7f007f"> trials:</span> <span style="color: #7f007f"> The amount of trials of the test.</span> <span style="color: #7f007f"> A larger amount of trials increases</span> <span style="color: #7f007f"> the chances of a correct answer.</span> <span style="color: #7f007f"> 6 is safe enough for all practical</span> <span style="color: #7f007f"> purposes.</span> <span style="color: #7f007f"> &quot;&quot;&quot;</span> <span style="color: #00007f; font-weight: bold">if</span> n &lt; <span style="color: #007f7f">2</span>: <span style="color: #00007f; font-weight: bold">return</span> <span style="color: #00007f">False</span> <span style="color: #00007f; font-weight: bold">for</span> ntrial <span style="color: #0000aa">in</span> <span style="color: #00007f">xrange</span>(trials): <span style="color: #00007f; font-weight: bold">if</span> _MR_composite_witness(randint(<span style="color: #007f7f">1</span>, n - <span style="color: #007f7f">1</span>), n): <span style="color: #00007f; font-weight: bold">return</span> <span style="color: #00007f">False</span> <span style="color: #00007f; font-weight: bold">return</span> <span style="color: #00007f">True</span> </pre></div> <p>The function you should call is <tt class="docutils literal"><span class="pre">isprime_MR</span></tt>.</p> <p>Although this test is probabilistic, the chances of it erring are extremely low. 
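For convenience, here is the same test as a self-contained Python 3 sketch (a direct port of the code above, with <tt class="docutils literal"><span class="pre">//</span></tt> and <tt class="docutils literal"><span class="pre">range</span></tt> replacing Python 2's <tt class="docutils literal"><span class="pre">/</span></tt> and <tt class="docutils literal"><span class="pre">xrange</span></tt>), plus a quick usage check:

```python
from random import randint

def _bits_of_n(n):
    # Bits of n, from LSB to MSB.
    bits = []
    while n:
        bits.append(n % 2)
        n //= 2
    return bits

def _MR_composite_witness(a, n):
    # Return True if 'a' witnesses that 'n' is composite.
    rem = 1
    # Compute a^(n-1) mod n left-to-right, watching for a
    # nontrivial square root of 1 along the way.
    for b in reversed(_bits_of_n(n - 1)):
        x = rem
        rem = (rem * rem) % n
        if rem == 1 and x != 1 and x != n - 1:
            return True
        if b == 1:
            rem = (rem * a) % n
    return rem != 1

def isprime_MR(n, trials=6):
    # Probabilistic Miller-Rabin primality test.
    if n < 2:
        return False
    for _ in range(trials):
        if _MR_composite_witness(randint(1, n - 1), n):
            return False
    return True

print(isprime_MR(2 ** 61 - 1))             # a Mersenne prime -> True
print(isprime_MR(2 ** 61 + 1, trials=25))  # divisible by 3 -> False
```

A prime input can never produce a witness, so the test is only one-sided probabilistic: errors can occur only by declaring a composite "prime".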
According to Bruce Schneier in &quot;Applied Cryptography&quot;, the chances of error for a 256-bit number with 6 trials are less than one in <img src="https://eli.thegreenplace.net/images/math/dced696965fcd541e19ed68b16f8b99fd7bdbead.gif" /> - this is <em>very low</em>.</p> <p>Therefore, you should always use this method instead of the naive one (trying to divide by all primes up to <img src="https://eli.thegreenplace.net/images/math/e13517ff4c4fef8f8f59a599e10028d5eebef947.gif" />), because it's much faster.</p> The limit of sin(h)/h, or deriving the sine function2009-01-13T21:45:45-08:002009-01-13T21:45:45-08:00Eli Benderskytag:eli.thegreenplace.net,2009-01-13:/2009/01/13/the-limit-of-sinhh-or-deriving-the-sine-function <strong>Deriving the sine</strong> It is a basic identity of calculus that <img src="https://eli.thegreenplace.net/images/math/fc929aff5b7aa92d35efbe7e60575e937bf49539.gif" />. But how does one prove it? Well, let's use the definition of derivatives (the symbol <img src="https://eli.thegreenplace.net/images/math/27d5482eebd075de44389774fce28c69f45c8a75.gif" /> is used instead of <img src="https://eli.thegreenplace.net/images/math/6d56447973863053dfb94416852d0392187be5b6.gif" /> for readability): <p><img src="https://eli.thegreenplace.net/images/math/d7aeb7a0b3c196328d4bb72c248cc619ed4e29d8.gif" class="align-center" /></p> Using a trigonometric identity and regrouping we'll get: <p><img src="https://eli.thegreenplace.net/images/math/553fe3c71b38857bcd459c7fc9ddd1afcdd0ec59.gif" class="align-center" /></p> Now, if we could only prove that <img src="https://eli.thegreenplace.net/images/math/1565cd10a93e120bd7cb1671995265af8df1a79a.gif" /> and <img src="https://eli.thegreenplace.net/images/math/27e81183a47ce0516512e5b71508d74020b1e423.gif" /> as <img src="https://eli.thegreenplace.net/images/math/2d779337053618dd98d696c55aedf6ea2d4e286b.gif" />, we'd get … <strong>Deriving the sine</strong> It is a basic identity of calculus that <img
src="https://eli.thegreenplace.net/images/math/fc929aff5b7aa92d35efbe7e60575e937bf49539.gif" />. But how does one prove it? Well, let's use the definition of derivatives (the symbol <img src="https://eli.thegreenplace.net/images/math/27d5482eebd075de44389774fce28c69f45c8a75.gif" /> is used instead of <img src="https://eli.thegreenplace.net/images/math/6d56447973863053dfb94416852d0392187be5b6.gif" /> for readability): <p><img src="https://eli.thegreenplace.net/images/math/d7aeb7a0b3c196328d4bb72c248cc619ed4e29d8.gif" class="align-center" /></p> Using a trigonometric identity and regrouping we'll get: <p><img src="https://eli.thegreenplace.net/images/math/553fe3c71b38857bcd459c7fc9ddd1afcdd0ec59.gif" class="align-center" /></p> Now, could we only prove that <img src="https://eli.thegreenplace.net/images/math/1565cd10a93e120bd7cb1671995265af8df1a79a.gif" /> and <img src="https://eli.thegreenplace.net/images/math/27e81183a47ce0516512e5b71508d74020b1e423.gif" /> as <img src="https://eli.thegreenplace.net/images/math/2d779337053618dd98d696c55aedf6ea2d4e286b.gif" />, we'd get the <img src="https://eli.thegreenplace.net/images/math/562597441eed562140c81684902007f6f275c940.gif" /> we want. But how do we prove those? 
<strong>L'Hopital's rule?</strong> At this point some people feel inclined to use the following "proof": <p><img src="https://eli.thegreenplace.net/images/math/189b37a4280947f43728a12a47048e9897058dee.gif" class="align-center" /></p> Using L'Hopital's rule, this is equivalent to: <p><img src="https://eli.thegreenplace.net/images/math/7a4a3dd39fc7be8922bcd31848a6d1e6ae927f3f.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/200d228c3e9c87a61824ce8524a4a714ec306270.gif" class="align-center" /></p> Which indeed goes to 1 as <img src="https://eli.thegreenplace.net/images/math/2d779337053618dd98d696c55aedf6ea2d4e286b.gif" />. There's nothing wrong with using L'Hopital's rule in general, but we can't use it here, because we're creating a circular argument! We can't just assume that <img src="https://eli.thegreenplace.net/images/math/fc929aff5b7aa92d35efbe7e60575e937bf49539.gif" /> (for applying L'Hopital) when we're trying to prove it! We'll have to find another method. <strong>A geometrical proof</strong> Consider this diagram: <p><img src="https://eli.thegreenplace.net/images/2009/01/circle_geometry.png" /></p> For simplicity, this is a unit circle (i.e. the length of QO is 1). QS is perpendicular to OP, and so is RP (which is a tangent). <img src="https://eli.thegreenplace.net/images/math/27d5482eebd075de44389774fce28c69f45c8a75.gif" /> is the angle QOS. From trigonometry, QS equals <img src="https://eli.thegreenplace.net/images/math/1dc91d45f0ae10b205595508b05921083b2842d7.gif" /> (QO is 1, recall) and OS equals <img src="https://eli.thegreenplace.net/images/math/c3668957974dc873ec58d32d59ec82eacff6d12b.gif" />. The triangles QOS and ROP are similar, so the ratio between RP and OP is the same as the ratio between QS and OS, which is <img src="https://eli.thegreenplace.net/images/math/33f736886b99c216de7b572c35a6d83a39a2d1b2.gif" />.
Since OP is 1, RP equals <img src="https://eli.thegreenplace.net/images/math/33f736886b99c216de7b572c35a6d83a39a2d1b2.gif" />. Now let's consider the areas of triangles ROP and QOP, and the "pie" section of the circle defined by Q, O and P. The area of the triangle QOS is: <p><img src="https://eli.thegreenplace.net/images/math/c56c8c2dff1756112c3b48d244c2ed19f668e647.gif" class="align-center" /></p> The area of ROP is similarly <img src="https://eli.thegreenplace.net/images/math/5b691de08b5a69c4ee7c85e70e84e869fab29d0d.gif" />. What is the area of the pie QOP? We'll compute it as follows: <p><img src="https://eli.thegreenplace.net/images/math/c9cdd0869aad07b026ed5f32bd6ed6f7e2d070b2.gif" class="align-center" /></p> The area of a unit circle is <img src="https://eli.thegreenplace.net/images/math/6ac47b6d7372b4087583cfd048d20f4c1571f5cf.gif" />, and its circumference is <img src="https://eli.thegreenplace.net/images/math/0833718ca4569f36e84dbdc7742eaec65e49b150.gif" />, but how do we express the length of arc PQ? <strong>Defining radians</strong> Did you know why the units for angles used in calculus are almost exclusively radians? Because the radian is defined as follows: <blockquote> One radian is the angle subtended at the center of a circle by an arc that is equal in length to the radius of the circle.</blockquote> Here's a diagram (courtesy <a href="http://en.wikipedia.org/wiki/Radian">of Wikipedia</a>): <p><img src="https://eli.thegreenplace.net/images/2009/01/Radian.png" /></p> This is a very convenient definition that allows us to make computations without messing with too many <img src="https://eli.thegreenplace.net/images/math/6ac47b6d7372b4087583cfd048d20f4c1571f5cf.gif" />s. <strong>Back to the proof</strong> So getting back to our arc PQ, it is simply equal to the angle <img src="https://eli.thegreenplace.net/images/math/27d5482eebd075de44389774fce28c69f45c8a75.gif" />, when that one is defined in radians.
That's because by the definition of radian above, if the angle is one radian, the arc length is 1 (since that's the radius of the unit circle). Hence if the angle is <img src="https://eli.thegreenplace.net/images/math/27d5482eebd075de44389774fce28c69f45c8a75.gif" /> radians, the arc length is <img src="https://eli.thegreenplace.net/images/math/27d5482eebd075de44389774fce28c69f45c8a75.gif" /> times 1. So we have: <p><img src="https://eli.thegreenplace.net/images/math/c0d158c96bd0c2fdc07fb01bb1bae83b801a0b4e.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/118785c5117da59954c7046a50f4f5fd0c43b6b4.gif" class="align-center" /></p> Now comes the punchline of the proof. It is obvious that the area of the triangle QOP is always smaller than the area of the pie QOP, which in turn is always smaller than the large triangle ROP. Mathematically: <p><img src="https://eli.thegreenplace.net/images/math/f85352fc830617ce91d58d24c044851d164ffbf2.gif" class="align-center" /></p> Dividing this by <img src="https://eli.thegreenplace.net/images/math/c47071837f6a1a6f645f20fedd6b96907d72b8df.gif" />: <p><img src="https://eli.thegreenplace.net/images/math/2579bf5237d2120671276572ba15c959b3c935cf.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/4dcf73f7082d86caeb2f4a969d3f4e4f4310c5ed.gif" class="align-center" /></p> Now, if we let <img src="https://eli.thegreenplace.net/images/math/2d779337053618dd98d696c55aedf6ea2d4e286b.gif" />, then <img src="https://eli.thegreenplace.net/images/math/d7962bee27672d1b7eb54f7346f597f118ed4ed0.gif" />, and it follows that <img src="https://eli.thegreenplace.net/images/math/1565cd10a93e120bd7cb1671995265af8df1a79a.gif" /> by the squeeze theorem. 
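<p>A quick numerical sanity check (not a proof, of course) shows the ratio creeping up toward 1 from below, exactly as the squeeze suggests:</p>

```python
import math

# Evaluate sin(h)/h for progressively smaller h (in radians).
ratios = [math.sin(h) / h for h in (0.5, 0.1, 0.01, 0.001)]
for h, r in zip((0.5, 0.1, 0.01, 0.001), ratios):
    print(h, r)
# Each ratio is below 1, and they increase toward 1 as h shrinks.
```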
Recall that we also have to prove that <img src="https://eli.thegreenplace.net/images/math/27e81183a47ce0516512e5b71508d74020b1e423.gif" />, but this is a simple step from <img src="https://eli.thegreenplace.net/images/math/1565cd10a93e120bd7cb1671995265af8df1a79a.gif" /> by using the identity: <p><img src="https://eli.thegreenplace.net/images/math/69fbcbe8f29f8a0ca0d00c3d38fe1e986ae6df69.gif" class="align-center" /></p> Now, if we go back to the limit we've developed for deriving the sine: <p><img src="https://eli.thegreenplace.net/images/math/553fe3c71b38857bcd459c7fc9ddd1afcdd0ec59.gif" class="align-center" /></p> Substituting the limits <img src="https://eli.thegreenplace.net/images/math/1565cd10a93e120bd7cb1671995265af8df1a79a.gif" /> and <img src="https://eli.thegreenplace.net/images/math/27e81183a47ce0516512e5b71508d74020b1e423.gif" /> here we get: <p><img src="https://eli.thegreenplace.net/images/math/fdfe2dc25c14b6822a443bfff309274d1d3bb3f5.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/edede5a60922621b26fcaaadbdf0d29ee2b4d7ff.gif" class="align-center" /></p> Variance of the sum of independent random variables2009-01-07T22:06:58-08:002009-01-07T22:06:58-08:00Eli Benderskytag:eli.thegreenplace.net,2009-01-07:/2009/01/07/variance-of-the-sum-of-independent-variables <p> Yesterday I was trying to brush up my skills in probability and came upon this formula on the Wikipedia page <a href="http://en.wikipedia.org/wiki/Variance">about variance</a>: </p> <p><img src="https://eli.thegreenplace.net/images/math/5a8707440af01e8319f02e80f8ea33d4600a4a4b.gif" class="align-center" /></p> <p> The article calls this the <em>Bienaymé formula</em> and gives neither proof nor a link to one. Googling this formula proved equally fruitless in terms of proofs. 
</p> <p> So, I …</p> <p> Yesterday I was trying to brush up my skills in probability and came upon this formula on the Wikipedia page <a href="http://en.wikipedia.org/wiki/Variance">about variance</a>: </p> <p><img src="https://eli.thegreenplace.net/images/math/5a8707440af01e8319f02e80f8ea33d4600a4a4b.gif" class="align-center" /></p> <p> The article calls this the <em>Bienaymé formula</em> and gives neither proof nor a link to one. Googling this formula proved equally fruitless in terms of proofs. </p> <p> So, I set out to find why this works. It took me a few hours of digging through books and removing dust from my University-learned probability skills of 8 years ago, but finally I've made it. Here's how. </p> <p> <em>Note: the Wikipedia article states the Bienaymé formula for uncorrelated variables. Here I'll prove the case of independent variables, which is a more useful and frequently used application of the formula. I'm also proving it for discrete random variables - the continuous case is equivalent.</em> </p> <h2>Expected value and variance</h2> <p> We'll start with a few definitions. Formally, the expected value of a (discrete) random variable X is defined by: </p> <p><img src="https://eli.thegreenplace.net/images/math/6e3bd6378c646ec0f285a69df7db72194f308f5b.gif" class="align-center" /></p> Where <img src="https://eli.thegreenplace.net/images/math/810bdf91cc65f953d130a2f239cee691fa024330.gif" /> is the <a href="http://en.wikipedia.org/wiki/Probability_mass_function">PMF</a> of X, <img src="https://eli.thegreenplace.net/images/math/30eaa4945a586b21346e14bd193b9914db6c2166.gif" />. 
For a function <img src="https://eli.thegreenplace.net/images/math/65405422ff71ebf2db437dbd89a41355f4f19183.gif" />: <p><img src="https://eli.thegreenplace.net/images/math/28e5d4f4c0d4dc5023e96687ce05e8851a8f8329.gif" class="align-center" /></p> <p> The variance of X is defined in terms of the expected value as: </p> <p><img src="https://eli.thegreenplace.net/images/math/fe590d0bcb58c4be73d18e751200721bbc402dc0.gif" class="align-center" /></p> <p> From this we can also obtain: </p> <p><img src="https://eli.thegreenplace.net/images/math/dfcad71ca591b4179d442332299b6a5d5963e628.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/03ccc00bf5863f84fa1c081dc26c4b450ee7afcc.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/c7e98eaab83b497e5716ffa78dcd80baff9d8c59.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/a10dd400a0c745f0f369d8919994ae0652b21024.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/8a43a908e597eeb8f279a521c4d833f3ac88f7f9.gif" class="align-center" /></p> Which is more convenient to use in some calculations. 
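<p>That convenience is easy to check numerically. Here is a small sketch comparing the definitional form of the variance against the E[X&sup2;] - (E[X])&sup2; shortcut, using a made-up discrete PMF:</p>

```python
# A made-up discrete distribution (any valid PMF works here).
pmf = {1: 0.2, 2: 0.5, 3: 0.3}

mean = sum(x * p for x, p in pmf.items())         # E[X]
mean_sq = sum(x * x * p for x, p in pmf.items())  # E[X^2]

# Variance straight from the definition E[(X - E[X])^2] ...
var_def = sum((x - mean) ** 2 * p for x, p in pmf.items())
# ... and via the shortcut E[X^2] - (E[X])^2.
var_shortcut = mean_sq - mean ** 2

print(var_def, var_shortcut)  # the two agree (up to float rounding)
```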
<h2>Linear function of a random variable</h2> <p> From the definitions given above it can be easily shown that given a linear function of a random variable: <img src="https://eli.thegreenplace.net/images/math/5abba821a5142e83482cea2117ec22b289d4a3d6.gif" />, the expected value and variance of Y are: </p> <p><img src="https://eli.thegreenplace.net/images/math/d8e69b52ca0d6dcb06ba7b8975266518278145e3.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/c420f86130e26c754241fd1f56cbe14cd86d1358.gif" class="align-center" /></p> <p> For the expected value, we can make a stronger claim for any g(x): </p> <p><img src="https://eli.thegreenplace.net/images/math/e4cdac3ba0c020100a9b98b82f3e0ac6b74b2b78.gif" class="align-center" /></p> <h2>Multiple random variables</h2> <p> When multiple random variables are involved, things start getting a bit more complicated. I'll focus on two random variables here, but this is easily extensible to N variables. Given two random variables that participate in an experiment, their joint PMF is: </p> <p><img src="https://eli.thegreenplace.net/images/math/955c39ca88717c6449629b633a7589a910c9555f.gif" class="align-center" /></p> <p> The joint PMF determines the probability of any event that can be specified in terms of the random variables X and Y. 
For example if A is the set of all pairs <img src="https://eli.thegreenplace.net/images/math/f09b5d4028feab230a8f9a4499e21a0b4db3ccce.gif" /> that have a certain property, then: </p> <p><img src="https://eli.thegreenplace.net/images/math/6b1ea5b82b9f9b2609973a2c0959ea2a44c80fd8.gif" class="align-center" /></p> <p> Note that from this PMF we can infer the PMF for a single variable, like this: </p> <p><img src="https://eli.thegreenplace.net/images/math/30eaa4945a586b21346e14bd193b9914db6c2166.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/60de9cadbeef9f0d1931ba799bc7790960ca3c61.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/fec2fcd06d4ac717101f02690278a8daa3f2e068.gif" class="align-center" /></p> <p> The expected value for functions of two variables naturally extends and takes the form: </p> <p><img src="https://eli.thegreenplace.net/images/math/0d00c9b76a37f91f0a2a1efe986e545ac7c90639.gif" class="align-center" /></p> <h2>Sum of random variables</h2> <p> Let's see how the sum of random variables behaves. From the previous formula: </p> <p><img src="https://eli.thegreenplace.net/images/math/a00cd8185a75e2256e4ec8a11d9c6fe7fd776d06.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/30e7d900e6261f0d926e1e0b34a7f8bf1d6bd9e8.gif" class="align-center" /></p> <p> But recall equation (1). The above simply equals: </p> <p><img src="https://eli.thegreenplace.net/images/math/ecece91067e8b6d6ae4250b7b59f92fc367cf433.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/3b2b53ab4299f6a63867ed4b5645018abee982e5.gif" class="align-center" /></p> <p> We'll also want to prove that <img src="https://eli.thegreenplace.net/images/math/08167b14267064495a086afab04d97c83751184c.gif" />.
This is only true for independent X and Y, so we'll have to make this assumption (assuming that they're independent means that <img src="https://eli.thegreenplace.net/images/math/9a31b8e023593b070b7bfdcfca07f2707d47265a.gif" />). </p> <p><img src="https://eli.thegreenplace.net/images/math/f7fec121582e25c06c1119e689ba6828d20883b2.gif" class="align-center" /></p> <p> By independence: </p> <p><img src="https://eli.thegreenplace.net/images/math/762a9e3e2ff39253bec8efb60e0eb17f8110af99.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/ee6ebbb2633e33514285d312a999a17354e6521c.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/0b6f7ad819416477a9e8a948c7a8c9d84be25c12.gif" class="align-center" /></p> <p> A very similar proof can show that for independent X and Y: </p> <p><img src="https://eli.thegreenplace.net/images/math/951d25c3af4d29c3cb9c1e69c4d546f2e7ac7521.gif" class="align-center" /></p> <p> For any functions g and h (because if X and Y are independent, so are g(X) and h(Y)). Now, at last, we're ready to tackle the variance of X + Y.
We start by expanding the definition of variance: </p> <p><img src="https://eli.thegreenplace.net/images/math/10b9e3c7ad7f5621b21440e0113ec31f11b4884d.gif" class="align-center" /></p> By (2): <p><img src="https://eli.thegreenplace.net/images/math/b9fa5c9ce518390c4e2fb4d672c29bcbfb2e91b7.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/74f08dfb9aa6322473e84f1afc42e1d9e53fac1f.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/45c1416ce8b6179109d77f80ac353251e525de2c.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/65ddaa4af89d23925dae7e3cbe4d7526df04ad25.gif" class="align-center" /></p> <p> Now, note that the random variables <img src="https://eli.thegreenplace.net/images/math/c5f6fd358d4c1cbb485216af24b715050ff8a121.gif" /> and <img src="https://eli.thegreenplace.net/images/math/3da3e48f75d2e03d9795c7617016701d1cd28b2c.gif" /> are independent, so: </p> <p><img src="https://eli.thegreenplace.net/images/math/29c181dd186163c688dcfec42ce8fb08e366cbfa.gif" class="align-center" /></p> But using (2) again: <p><img src="https://eli.thegreenplace.net/images/math/918d718f3654bfe32037276651f2f9e1c2472585.gif" class="align-center" /></p> <img src="https://eli.thegreenplace.net/images/math/ec36fba3bda13db973cd60f1910c7955f10757c2.gif" /> is obviously just <img src="https://eli.thegreenplace.net/images/math/769c095daa5985533efb5176c86007611e6f4eb5.gif" />, therefore the above reduces to 0. 
<p> So, coming back to the long expression for the variance of sums, the last term is 0, and we have: </p> <p><img src="https://eli.thegreenplace.net/images/math/97b8d284ae899aa4f6568479f30fb6c40cf7f13d.gif" class="align-center" /></p> <p><img src="https://eli.thegreenplace.net/images/math/2172289164dba259d1e5c274e2611e15862edecb.gif" class="align-center" /></p> <p> As I've mentioned before, proving this for the sum of two variables suffices, because the proof for N variables is a simple mathematical extension, and can be intuitively understood by means of a "mental induction". Therefore: </p> <p><img src="https://eli.thegreenplace.net/images/math/5a8707440af01e8319f02e80f8ea33d4600a4a4b.gif" class="align-center" /></p> <p> For N independent variables <img src="https://eli.thegreenplace.net/images/math/97fd495350d680b99411eaf425194e5b295465a6.gif" />. <img src="https://eli.thegreenplace.net/images/math/7b47d4175993a732aa2287de666a82273110f26e.gif" /> </p> Solution to the RC circuit puzzle2008-12-26T11:21:45-08:002008-12-26T11:21:45-08:00Eli Benderskytag:eli.thegreenplace.net,2008-12-26:/2008/12/26/solution-to-the-rc-circuit-puzzle Here, as promised, is the solution to the <a href="http://eli.thegreenplace.net/2008/12/22/an-rc-circuit-puzzle">RC circuit puzzle</a> I posted earlier this week. Let's look at the circuit again: <p><img src="https://eli.thegreenplace.net/images/2008/12/cap_resistor.png" /></p> The problem with my reasoning was the direction of current in the capacitor. I've quietly assumed that: <p><img src="https://eli.thegreenplace.net/images/math/bbdc708e5df1d3ca3312149e15dbecc98b8fea5a.gif" class="align-center" /></p> But this is wrong for the circuit above. Why? Because we … Here, as promised, is the solution to the <a href="http://eli.thegreenplace.net/2008/12/22/an-rc-circuit-puzzle">RC circuit puzzle</a> I posted earlier this week. 
Let's look at the circuit again: <p><img src="https://eli.thegreenplace.net/images/2008/12/cap_resistor.png" /></p> The problem with my reasoning was the direction of current in the capacitor. I've quietly assumed that: <p><img src="https://eli.thegreenplace.net/images/math/bbdc708e5df1d3ca3312149e15dbecc98b8fea5a.gif" class="align-center" /></p> But this is wrong for the circuit above. Why? Because we must obey the voltage & current directions we've chosen. In passive elements, the positive current flows from the higher voltage to the lower voltage, meaning that in our circuit: <p><img src="https://eli.thegreenplace.net/images/math/31dd3853efeb2e16318323bc12736da4de1277fa.gif" class="align-center" /></p> This small minus sign makes all the difference, and now the solution will be correct. Physically, the intuition is that the current here flows from a discharging capacitor, hence it's "against" the voltage direction. Had it been a capacitor-charging circuit, there would be no confusion. An RC circuit puzzle2008-12-22T22:16:05-08:002008-12-22T22:16:05-08:00Eli Benderskytag:eli.thegreenplace.net,2008-12-22:/2008/12/22/an-rc-circuit-puzzle If you're interested in electronics, you'll find the following simple "paradox" amusing. It's the usual case of "proving that 2+2=5". The fun is finding where the mistake in the reasoning is. Consider the following circuit: <p><img src="https://eli.thegreenplace.net/images/2008/12/cap_resistor.png" /></p> Assume that the capacitor is charged to some initial voltage before the switch … If you're interested in electronics, you'll find the following simple "paradox" amusing. It's the usual case of "proving that 2+2=5". The fun is finding where the mistake in the reasoning is. Consider the following circuit: <p><img src="https://eli.thegreenplace.net/images/2008/12/cap_resistor.png" /></p> Assume that the capacitor is charged to some initial voltage before the switch is closed. At time 0, the switch is closed. 
What is the current in the circuit as a function of time? Let's solve it using the familiar RC circuit methods. We know that <img src="https://eli.thegreenplace.net/images/math/5e23343bb687c00a0eb8ce9ef60e95b356568127.gif" /> because of Kirchhoff's voltage law. We'll differentiate both sides by time: $\dot{V}_{c}(t) = \dot{V}_{R}(t)$ We know that for a capacitor, the relation between current and voltage is: <p><img src="https://eli.thegreenplace.net/images/math/928acbef4f8f2eeb39d3c51ca68ab3e08279393f.gif" class="align-center" /></p> Substituting it into the equation above and also recalling that <img src="https://eli.thegreenplace.net/images/math/6809dfc8324feb51c746bc469c8bd7dbbe3ea32e.gif" />, we get: <p><img src="https://eli.thegreenplace.net/images/math/73fa9296a1e93050c9ba41b7bd8d5ddeaa1d84a6.gif" class="align-center" /></p> But the current through the capacitor and resistor is the same current, so this can be rewritten simply as: <p><img src="https://eli.thegreenplace.net/images/math/eba020de23be652ba084b91a27aa173aade8a360.gif" class="align-center" /></p> This is a simple first order differential equation, the solution of which is: <p><img src="https://eli.thegreenplace.net/images/math/46c7da1f88d1806982454f784d37742fbfa0c332.gif" class="align-center" /></p> For some initial current <img src="https://eli.thegreenplace.net/images/math/7dd1d81670e79a2861ab8214c079d2f03ee310a0.gif" />. But wait a second, how can the exponent be positive, won't it grow to infinity with time? There's obviously a mistake here, somewhere. Can you find it? This problem gave me some headache last night, and today I've successfully stumped a few co-workers with it. I'll post a solution in a couple of days.
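<p>As the solution post above explains, the correct sign gives a decaying exponential. A tiny forward-Euler simulation of the discharging capacitor (component values are arbitrary examples, not from the post) shows the decay rather than the impossible blow-up:</p>

```python
# Forward-Euler simulation of the discharging RC circuit.
# Arbitrary example values: R = 1 kOhm, C = 1 uF, so the
# time constant is RC = 1 ms.
R, C = 1000.0, 1e-6
v, dt = 5.0, 1e-6          # initial capacitor voltage 5 V, 1 us time steps
for _ in range(1000):      # simulate 1 ms = one time constant
    i = v / R              # current through the resistor
    v -= (i / C) * dt      # dv/dt = -i/C: the capacitor discharges
print(v)                   # roughly V0/e, i.e. about 1.84 V - a decay
```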
Posting mathematical formulae in a Wordpress blog2008-12-02T22:52:26-08:002008-12-02T22:52:26-08:00Eli Benderskytag:eli.thegreenplace.net,2008-12-02:/2008/12/02/posting-mathematical-formulae-in-a-wordpress-blog <p> <strong>(Update 11.07.2009: <a href="http://eli.thegreenplace.net/2009/07/11/posting-correctly-aligned-latex-formulae-in-a-wordpress-blog/">I've switched to another plugin</a>, but the rest of this post is still relevant)</strong> </p><p> When your blog often deals with technical matters, and especially math, it is very useful to be able to post complex mathematical formulae / equations. There's only so far that you can go …</p> <p> <strong>(Update 11.07.2009: <a href="http://eli.thegreenplace.net/2009/07/11/posting-correctly-aligned-latex-formulae-in-a-wordpress-blog/">I've switched to another plugin</a>, but the rest of this post is still relevant)</strong> </p><p> When your blog often deals with technical matters, and especially math, it is very useful to be able to post complex mathematical formulae / equations. There's only so far that you can go with "ASCII-equations" like a^2 + b^2 = c^2. Being able to write <img src="https://eli.thegreenplace.net/images/math/e215b13b5947314d303dde025db05c50eabfb9e8.gif" /> is so much nicer... </p><p> Several plugins exist for this in the world of WP. In the "simple" spectrum you can find an interface to <a href="http://www.xm1math.net/phpmathpublisher/">PhpMathPublisher</a>. But when it comes to mathematical equations, you can hardly compete with Latex, and while using it is more complex, this is the best path to take if you don't want to quickly run into limitations. Recall that Latex is used by 90% of academics to publish their papers packed with mathematical equations. The Latex syntax is widely accepted and quite standard among many implementations. </p><p> Enter the <a href="http://wordpress.org/extend/plugins/latex/installation/">WP latex plugin</a>. 
Just install it following the instructions and it will use the web service provided by <a href="http://wordpress.com">Wordpress.com</a> to render inline Latex equations into images for you. The images are stored in a local cache which means that once your post has been generated and viewed, the image is static and safe on your server (you can, of course, cancel this feature if you want to). </p><p> The syntax of the plugin is very simple and you can use both inline equations like <img src="https://eli.thegreenplace.net/images/math/b9dd84aaa5b3a778d39ea7b95f32fdeed4510389.gif" />, or larger equations centered and on a separate line: </p><p> <p><img src="https://eli.thegreenplace.net/images/math/5062ee491e280f17b99d2597642c9da6c40ed30c.gif" class="align-center" /></p> [This is, by the way, Gauss's <a href="http://en.wikipedia.org/wiki/Divergence_theorem">Divergence Theorem</a> which, I recall, was very useful in Calculus II] </p><p> If <a href="http://wordpress.com">Wordpress.com</a> ever ceases providing the Latex rendering service you can always switch to another - there are plenty. This is the real power of the Latex standard - many renderers will understand the same syntax. </p><p> If this isn't hard core enough for you, you can always install your own Latex service. <a href="http://www.forkosh.com/mathtex.html">mathtex</a> is a CGI script you can install on your server. It will communicate with a locally installed Latex program and an image renderer to generate images for you. The problem is - it's not very simple to install Latex on a shared hosting account. It's possible though, and many people have <a href="http://mcuprogramming.com/2007/03/30/installing-latex-on-web-hosts/">done it</a>. So if you don't feel "safe" enough using a remote web service for rendering equations, you can always spend some extra effort and roll your own. The WP latex plugin makes it easy to switch services. </p> Book review: "A Certain Ambiguity: A mathematical novel" by G.
Suri and H. Bal2008-11-14T19:33:59-08:002008-11-14T19:33:59-08:00Eli Benderskytag:eli.thegreenplace.net,2008-11-14:/2008/11/14/book-review-a-certain-ambiguity-a-mathematical-novel-by-g-suri-and-h-bal <p> Wrapped in a thin plot, the authors set out to reconcile mathematics and religious faith in this short 280-page book. And quite surprisingly, they do a far better job than one would expect. <p> <p> Seriously, any book that tries to dig this deep philosophically is an immediate suspect for some half-baked crappy …</p></p></p> <p> Wrapped in a thin plot, the authors set out to reconcile mathematics and religious faith in this short 280-page book. And quite surprisingly, they do a far better job than one would expect. <p> <p> Seriously, any book that tries to dig this deep philosophically is an immediate suspect for some half-baked crappy ending, but "A Certain Ambiguity" manages to end by actually leaving the reader thoughtful. At this, the authors have done a splendid job. </p> <p> The main character is Ravi, an Indian student at Stanford who enrolls in a math class named "Thinking about infinity". Together with the class's lecturer and a small group of friends, he embarks on a quasi-philosophical journey, guided by court records of his grandfather's discussions with a judge in the early 20s. </p> <p> The book contains a lot of interesting math, and while most of it is on a basic level, the philosophical connections are well developed and very believable. The book could easily be a work of non-fiction, as its main theme is quite real and deals with epistemological questions real philosophers have struggled with throughout the centuries. It is unlikely to change your view of life, but it will induce some interesting thinking on important topics. </p> <p> [Spoiler] I was surprised to find out that this book does a good job of explaining faith to people with a rational/mathematical view of life.
However, it only rationalizes the core faith - judge Taylor's "creation axiom", which really can't be disproven. But, as judge Taylor tells Vijay, his deductive method is solid, and only his axioms are in question. The faith judge Taylor rationalizes as an axiom cannot, in any way, connect to the modern monotheistic religions (not to mention the polytheistic ones), because it breaks down immediately as soon as the first deductions are made from it about actual human lives. Yes, that "everything must be created by something" is an axiom that has no refutation at the moment, but any attempt to prove from it that Jesus was born to a virgin and walked on water would have to transcend deductive methods. </p> <p> All in all, this book is really recommended. It actually made me think hard about the philosophical implications of basic math axioms, and encouraged me to read more on the subject. I couldn't possibly ask more of such a small book that can be easily finished in 2-3 sittings. </p> Intersection of 1D segments2008-08-15T11:22:21-07:002008-08-15T11:22:21-07:00Eli Benderskytag:eli.thegreenplace.net,2008-08-15:/2008/08/15/intersection-of-1d-segments <p>There is a simple mathematical problem that sometimes comes up in programming<sup class="footnote"><a href="#fn1" title="I ran into it while implementing a binary application format reader, that needed to support insertion data records. Each data record has a start and an end (memory address). The problem comes up when testing whether two records collide.">1</a></sup>. The problem is:</p> <blockquote> <p>Given two one-dimensional<sup class="footnote"><a href="#fn2" title="One-dimensional here means that they only have a single coordinate, i.e. all can be laid down on a line that's parallel to one of the axes.">2</a></sup> line segments, determine whether they intersect, i.e. have points in common.</p> </blockquote> <p>Here's a graphical representation of the problem. The two segments are drawn one above the other for demonstration …</p>
The two segments are drawn one above the other for demonstration purposes:</p> <img src="https://eli.thegreenplace.net/images/2008/08/twosegs.PNG" /> <p>At first sight, this looks like a problem with many annoying corner cases that takes a lot of dirty code to implement. But it turns out that the solution is actually very simple and clean. The two segments intersect if and only if <em>X2 >= Y1 and Y2 >= X1</em>. That's it.</p> <p>It may be difficult to convince yourself this works by simply looking at the image above, so here is another that makes it much clearer:</p> <img src="https://eli.thegreenplace.net/images/2008/08/manysegs.png" /> <p>In this image we see all the possibilities of the positions of the second segment relative to the first.
It should take only a few seconds to verify that the algorithm returns a correct result for all 5 cases.</p> <p>Here's Python code that implements this solution:</p> <pre lang="python">
def segments_intersect(x1, x2, y1, y2):
    # Assumes x1 <= x2 and y1 <= y2; if this assumption is not safe, the code
    # can be changed to have x1 being min(x1, x2) and x2 being max(x1, x2),
    # and similarly for the ys.
    return x2 >= y1 and y2 >= x1
</pre> <center><img src="https://eli.thegreenplace.net/images/hline.jpg" width="320" height="5" /></center> <p class="footnote" id="fn1"><sup>1</sup> I ran into it while implementing a binary application format reader that needed to support insertion of data records. Each data record has a start and an end (memory address). The problem comes up when testing whether two records collide.</p> <p class="footnote" id="fn2"><sup>2</sup> One-dimensional here means that they only have a single coordinate, i.e. all can be laid down on a line that's parallel to one of the axes.</p> Pythagorean - the theorem with most proofs?2008-01-17T20:59:33-08:002008-01-17T20:59:33-08:00Eli Benderskytag:eli.thegreenplace.net,2008-01-17:/2008/01/17/pythagorean-the-theorem-with-most-proofs <a href="http://www.cut-the-knot.org/pythagoras/index.shtml">This page</a> shows 76 different proofs of the Pythagorean theorem. However, if this isn't hardcore enough, you may want to read "The Pythagorean Proposition" by E. S. Loomis, which lists 367 proofs. <a href="http://www.cut-the-knot.org/pythagoras/index.shtml">This page</a> shows 76 different proofs of the Pythagorean theorem. However, if this isn't hardcore enough, you may want to read "The Pythagorean Proposition" by E. S. Loomis, which lists 367 proofs. 
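As a quick sanity check of the segment-intersection test from the post above, here is a short sketch that exercises it on all five relative positions of the second segment (the specific coordinates are made up for illustration):

```python
def segments_intersect(x1, x2, y1, y2):
    # Two 1D segments [x1, x2] and [y1, y2] (with x1 <= x2, y1 <= y2)
    # intersect iff x2 >= y1 and y2 >= x1.
    return x2 >= y1 and y2 >= x1

# First segment fixed at [3, 7]; second segment placed in each of the
# five possible relative positions from the second image.
cases = [
    ((0, 2),  False),  # entirely to the left
    ((1, 4),  True),   # overlapping the left end
    ((4, 6),  True),   # contained inside
    ((6, 9),  True),   # overlapping the right end
    ((8, 10), False),  # entirely to the right
]

for (y1, y2), expected in cases:
    assert segments_intersect(3, 7, y1, y2) == expected
```

Note that a shared endpoint counts as an intersection here, since the comparisons are non-strict.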
Solution of the Two Envelopes paradox2007-04-08T18:37:19-07:002007-04-08T18:37:19-07:00Eli Benderskytag:eli.thegreenplace.net,2007-04-08:/2007/04/08/solution-of-the-two-envelopes-paradox <p> A long time ago I wrote about the <a href="http://eli.thegreenplace.net/2003/10/24/a-probability-paradox/">Two Envelopes</a> paradox. </p><p> Once you understand the solution, it's hard to see why the paradox is so controversial and so widely misunderstood. As Dominus beautifully explains <a href="http://blog.plover.com/math/envelope.html">here</a>, the solution is: </p><p> There is a fundamental mistake in the reasoning of "50% chance of …</p> <p> A long time ago I wrote about the <a href="http://eli.thegreenplace.net/2003/10/24/a-probability-paradox/">Two Envelopes</a> paradox. </p><p> Once you understand the solution, it's hard to see why the paradox is so controversial and so widely misunderstood. As Dominus beautifully explains <a href="http://blog.plover.com/math/envelope.html">here</a>, the solution is: </p><p> There is a fundamental mistake in the reasoning of "50% chance of the sum in the other envelope being larger". This statement is based on the assumption that the sums are chosen uniformly at random from -inf to +inf (think about it: otherwise, how can we say that there is exactly a 50% chance of *any* number we see in one envelope being the smaller amount). However, <strong>there is no uniform random distribution from -inf to +inf</strong>. That is because in a <a href="http://en.wikipedia.org/wiki/Uniform_distribution_%28continuous%29">uniform distribution</a>, the probability density function is constant, and an integral from -inf to +inf over a constant doesn't converge. That's all there is to it. Simple. </p><p> So back to the original question. When you open one of the envelopes and find some sum of money in there, does it pay you to switch? It doesn't - because you don't know what algorithm / distribution is used to pick the numbers. 
The switching argument doesn't work because it is based on a fallacious assumption of a uniform distribution. </p> Sum of digits and divisibility by 32006-07-11T17:50:10-07:002006-07-11T17:50:10-07:00Eli Benderskytag:eli.thegreenplace.net,2006-07-11:/2006/07/11/sum-of-digits-and-divisibility-by-3 It's a known math curiosity that when we take a decimal (base 10) number and add its digits together, if the sum is divisible by 3 without a remainder, then the number itself is also divisible by 3 without a remainder. Example: <p> <pre><tt>
426 -> 4 + 2 + 6 = 12
12 (mod 3) = 0 …</tt></pre></p> It's a known math curiosity that when we take a decimal (base 10) number and add its digits together, if the sum is divisible by 3 without a remainder, then the number itself is also divisible by 3 without a remainder. Example: <p> <pre><tt>
426 -> 4 + 2 + 6 = 12
12 (mod 3) = 0     (12 / 3 = 4)
426 (mod 3) = 0    (426 / 3 = 142)
</tt></pre> <p> This fact is actually quite simple to prove. Consider the breakdown of 426 into multiples of powers of 10: <p> <pre><tt>426 = 4 * 100 + 2 * 10 + 6</tt></pre> <p> Written another way: <pre><tt>
426 = 4 * (99 + 1) + 2 * (9 + 1) + 6
    = (4 * 99 + 2 * 9) + (4 + 2 + 6)
</tt></pre> Since (4 * 99 + 2 * 9) is divisible by 3 (both 99 and 9 are), this clearly shows that for 426 to be divisible by 3, (4 + 2 + 6) must be divisible by 3. If this isn't immediately obvious, recall that: <ol> <li>If <code>X (mod N) = 0</code>, then <code>Y * X (mod N) = 0</code> for any Y <li>If <code>X (mod N) = 0</code> and <code>Y (mod N) = 0</code>, then <code>X + Y (mod N) = 0</code> for