Some notes on the Y combinator



The goal of this post is to jot down a few notes about the Y combinator, explaining how it works without getting too much into lambda-calculus theory. I'll be using Clojure and Python as the demonstration languages.

The idea is to build up intuition for the Y combinator from simple examples in a way that makes understanding it a sequences of small mental leaps rather than one large one.

Recursion with named functions

It wouldn't be a proper article about recursion if it didn't start with a factorial. Here's a fairly run-of-the-mill implementation in Clojure:

(defn factorial-rec
  [n]
  (if (zero? n)
    1
    (* n (factorial-rec (- n 1)))))

Recursion is accomplished by invoking the function, by name, within itself. Herein begins the thought experiment that will lead us to the Y combinator. Imagine that we're using a language where functions have no names - they're all anonymous. We can assign anonymous functions to symbols, but those symbols aren't visible or usable from within the function's body.

As an example of what I'm talking about, here is a non-recursive implementation of factorial in Clojure:

(def factorial-loop
  (fn [n]
    (loop [i n answer 1]
      (if (zero? i)
        answer
        (recur (- i 1) (* answer i))))))

Note how this is defined: we assign an anonymous function (lambda in Lisp/Scheme/Python parlance, fn in Clojure) to the symbol factorial-loop. This anonymous function computes the factorial of its parameter, and we can call it as follows:

ycombinator.core=> (factorial-loop 6)
720

To emphasize that factorial-loop is just a convenience symbol and plays no role in the implementation, we can forego it for a slightly more convoluted invocation:

ycombinator.core=> ((fn [n]
              #_=>     (loop [i n answer 1]
              #_=>       (if (zero? i)
              #_=>         answer
              #_=>         (recur (- i 1) (* answer i))))) 6)
720

No names in sight - we just invoke the anonymous function directly. But this implementation of factorial isn't recursive, so we don't really need to refer to the function's name from within its body. What if we do want to use recursion? This brings us back to the thought experiment.

Recursion with anonymous functions

It turns out this is absolutely possible by using some ingenuity and cranking the abstraction level up one notch. In our original factorial-rec, at the point where the function invokes itself all we need is an object that implements factorial, right? In factorial-rec we're using the fact that the symbol factorial-rec is bound to such an object (by the nature of defn). But we can't rely on that in our thought experiment. How else can we get access to such an object? Well, we can take it as a parameter... Here's how:

(def factorial-maker
  (fn [self]
    (fn [n]
      (if (zero? n)
        1
        (* n ((self self) (- n 1)))))))

And now we can compute factorials as follows:

ycombinator.core=> ((factorial-maker factorial-maker) 6)
720

A few things to note:

  1. factorial-maker is not computing a factorial. It creates an (anonymous) function that computes a factorial. It expects to be passed itself as a parameter.
  2. The expression (factorial-maker factorial-maker) does precisely that. It invokes factorial-maker and passes it itself as a parameter. The result of that is a function that computes a factorial, which we then apply to 6.
  3. The recursion inside the factorial is replaced by (self self); when the function created by (factorial-maker factorial-maker) runs for the first time, self is assigned to factorial-maker, so (self self) is (factorial-maker factorial maker). This is equivalent to the first call - recursion!

You may still feel uncomfortable about the def and the name factorial-maker. Aren't we just cheating? Nope, because we can do the same expansion as we did with factorial-loop; we don't need that def. Here's how it would look:

ycombinator.core=> (((fn [self]
              #_=>     (fn [n]
              #_=>       (if (zero? n)
              #_=>         1
              #_=>         (* n ((self self) (- n 1))))))
              #_=>   (fn [self]
              #_=>     (fn [n]
              #_=>       (if (zero? n)
              #_=>         1
              #_=>         (* n ((self self) (- n 1))))))) 6)
720

Pretty it is not... But hey, we've now implemented a recursive factorial function, without a single name in sight. How cool is that?

Understanding the example above is about 80% of the way to understanding the Y combinator, so make sure to spend the time required to thoroughly grok how it works. Tracing through the execution for 2-3 calls while drawing the "environments" (call frames) in action helps a lot.

To get a better feel of the direction we're taking, here's another recursive function that's slightly more complex than the factorial:

(defn tree-sum-rec
  [t]
  (if (nil? t)
    0
    (let [[nodeval left right] t]
      (+ nodeval
         (tree-sum-rec left)
         (tree-sum-rec right)))))

Given a binary tree represented as a list-of-lists with numbers for node deta, this function computes the sum of all the nodes in the tree. For example:

ycombinator.core=> (def t1 '(1 (2) (4 (3) (7))))
#'ycombinator.core/t1
ycombinator.core=> (tree-sum-rec t1)
17

We can rewrite it without using any symbol names within the function as follows:

(def tree-sum-maker
  (fn [self]
    (fn [t]
      (if (nil? t)
        0
        (let [[nodeval left right] t]
          (+ nodeval
             ((self self) left)
             ((self self) right)))))))

And invoke it as follows:

ycombinator.core=> ((tree-sum-maker tree-sum-maker) t1)
17

Note the similarities between tree-sum-maker and factorial-maker. They are transformed very similarly to synthesize the unnamed from the named-recursion variant. The recipe seems to be:

  1. Instead of a function taking a parameter, create a function factory that accepts itself as the self parameter, and returns the actual computation function.
  2. In every place where we'd previously call ourselves, call (self self) instead.
  3. The initial invocation of (foo param) is replaced by ((foo-maker foo-maker) param).

Y combinator - a tool for making anonymous functions recursive

Since there is a clear pattern here, we should be able to abstract it away and provide some method that transforms a given named-recursive function into an unnamed variant. This is precisely what the Y combinator does, though the nature of the problem makes it somewhat obscure at first sight:

(def Ycombinator
  (fn [func]
    ((fn [self]
       (func (fn [n] ((self self) n))))
     (fn [self]
       (func (fn [n] ((self self) n)))))))

I'll explain how it works shortly, but first let's see how we use it. We have to write our factorial as follows:

(def factorial-rec*
  (fn [recurse]
    (fn [n]
      (if (zero? n)
        1
        (* n (recurse (- n 1)))))))

Note the superficial similarity to the factorial-maker version. factorial-rec* also takes a function and returns the actual function computing the factorial, though in this case I don't call the function parameter self (it's not self in the strict sense, as we'll soon see). We can convert this function to a recursive computation of the factorial by invoking the Y combinator on it:

ycombinator.core=> ((Ycombinator factorial-rec*) 6)
720

It's easiest to understand how Ycombinator does its magic by unraveling this invocation step by step. Similarly to how we did earlier, we can get rid of the Ycombinator name and just apply the object it's defined to be directly:

(((fn [func]
    ((fn [self]
       (func (fn [n] ((self self) n))))
     (fn [self]
       (func (fn [n] ((self self) n))))))
  factorial-rec*)
 6)

As before, this does two things:

  1. Call the Y combinator (just a scary-looking anonymous function) on factorial-rec*.
  2. Call the result of (1) on 6.

If you look carefully at step 1, it invokes the following anonymous function:

(fn [self]
  (func (fn [n] ((self self) n))))

On itself, with func bound to factorial-rec*. So what we get is:

(((fn [self]
    (factorial-rec* (fn [n] ((self self) n))))
  (fn [self]
    (factorial-rec* (fn [n] ((self self) n)))))
 6)

And if we actually perform the call:

((factorial-rec* (fn [n] (((fn [self]
                             (factorial-rec* (fn [n] ((self self) n))))
                           (fn [self]
                             (factorial-rec* (fn [n] ((self self) n)))))
                          n)))
 6)

This calls factorial-rec*, passing it an anonymous function as recurse [1]. factorial-rec* returns a factorial-computing function. This is where the first step ends. Invoking this factorial-computing function on 6 is the second step.

It should now be obvious what's going on. When the invocation with 6 happens and the program gets to calling recurse, it calls the parameter of factorial-rec* as shown above. But we've already unwrapped this call before - it... recurses into factorial-rec*, while propagating itself forward so that the recurse parameter is always bound properly. It's just the same trick as was employed by factorial-maker earlier in the post.

So, the Y combinator is the magic sauce that lets us take code like factorial-rec and convert it into code like factorial-maker. Here's how we can implement an unnamed version of tree-sum-rec:

(def tree-sum-rec*
  (fn [recurse]
    (fn [t]
      (if (nil? t)
        0
        (let [[nodeval left right] t]
          (+ nodeval
             (recurse left)
             (recurse right)))))))

And using it with the Y combinator:

ycombinator.core=> ((Ycombinator tree-sum-rec*) t1)
17

Here is an alternative formulation of the Y combinator that can make it a bit easier to understand. In this version I'm using named Clojure functions for further simplification (since many folks find the syntax of anonymous functions applied to other anonymous functions too cryptic):

(defn apply-to-self [func] (func func))

(defn Ycombinator-alt
  [func]
  (apply-to-self
   (fn [self]
     (func (fn [n] ((self self) n))))))

The Y combinator in Python

Finally, just to show that the Y combinator isn't something unique to the Lisp family of languages, here's a Python implementation:

ycombinator = lambda func: \
                (lambda self: func(lambda n: (self(self))(n)))(
                        lambda self: func(lambda n: (self(self))(n)))

factorial = lambda recurse: \
                   lambda n: \
                     1 if n == 0 else n * recurse(n - 1)

And we can invoke it as follows:

>>> (ycombinator(factorial))(6)
720

There's no real difference between the Python and the Clojure versions. As long as the language supports creating anonymous functions and treats them as first-class citizens, all is good.

It's even possible to create the Y combinator in C++. Static typing makes it somewhat less elegant than in the more dynamic languages, but C++14's generic lambdas help a lot. Take a look at Rosetta Stone for an example.


[1]

Incidentally, note that by starting with (Ycombinator factorial-rec*), we now got to (factorial-rec* (Ycombinator factorial-rec*)). For this reason, the Y combinator is a fixed-point combinator in lambda calculus.

There's another interesting thing to note here - the equivalence mentioned above is imperfect. The call (Ycombinator factorial-rec*) results in a delayed fixed point equivalence (the delay achieved by means of wrapping the result in a fn). This is because we're using Clojure - an eagerly evaluated language. This version of the Y combinator is called the applicative-order Y combinator. Without the delay, we'd get an infinite loop. In lazily evaluated languages, it's possible to define the Y combinator somewhat more succinctly.

All of this is very interesting, but I'm deliberately avoiding getting too deep into lambda calculus and programming language theory in this post; I may write more about it some time in the future.


EOPL define-datatype and cases in Clojure



I'm going through the Essentials of Programming Languages (3rd ed.) book and it's been pretty good so far. In chapter 2, the authors use a pair of macros - define-datatype and cases - to make it easy to define data-driven programs, where objects belong to types, each of which has several "variants" with custom fields.

The canonical example used chapter 2 is the "Lambda calculus expression":

(define-datatype lc-exp lc-exp?
  (var-exp
   (var symbol?))
  (lambda-exp
   (bound-var symbol?)
   (body lc-exp?))
  (app-exp
   (rator lc-exp?)
   (rand lc-exp?)))

This means we create a type named lc-exp, with three variants:

  1. var-exp which has a field named var, a symbol.
  2. lambda-exp which has two fields: bound-var is a symbol, and body is a lc-exp.
  3. app-exp which has two fields: rator and rand, both a lc-exp.

The define-datatype invocation creates multiple helper functions; for example, the predicate lc-exp? that tests whether the object it's given is a lc-exp. It can also optionally create accessors such as app-exp->rand, that will extract a field from a given variant.

The companion cases macro lets us organize code that operates on types created with define-datatype succinctly. For example, a function that checks whether some symbol occurs as a free variable in a given lc-exp:

(defn occurs-free?
  [search-var exp]
  (cases lc-exp exp
         (var-exp (variable) (= variable search-var))
         (lambda-exp (bound-var body)
                     (and (not (= search-var bound-var))
                          (occurs-free? search-var body)))
         (app-exp (rator rand)
                  (or
                   (occurs-free? search-var rator)
                   (occurs-free? search-var rand)))))

[Note: this is actual Clojure code from my implementation; the book uses Scheme, so it has slightly different syntax.]

Alas, while the book explains how this pair of macros works and uses them all over the place, it provides no definition. The definitions found online are either hard to hunt down or very verbose (which may be due to Scheme's use of hygienic macros).

Therefore I rolled my own, in Clojure, and the full code is available here. The code comes with a large number of unit tests, many of which are taken from the exercises in chapter 2 of the book.

It's been quite a while since I last did any serious Lispy macro hacking, so my implementation is fairly cautious in its use of macros. One cool thing about the way Clojure's (Common Lisp-like) macros work is that writing them is very close to just manipulating lists of symbols (representing code) in regular functions. Here's my define-datatype:

(defn define-datatype-aux
  "Creates a datatype from the specification. This is a function, so all its
  arguments are symbols or quoted lists. In particular, variant-descriptors is a
  quoted list of all the descriptors."
  [typename predicate-name variant-descriptors]
  ...)

(defmacro define-datatype
  "Simple macro wrapper around define-datatype-aux, so that the type name,
  predicate name and variant descriptors don't have to be quoted but rather can
  be regular Clojure symbols."
  [typename predicate-name & variant-descriptors]
  (define-datatype-aux typename predicate-name variant-descriptors))

All the macro does here is to do the thing only macros can do - change the evaluation rules of expressions, by not actually evaluating the arguments passed to define-datatype; rather passing them as lists of symbols (code) to a function. The define-datatype-aux function can then manipulate these lists of symbols. The only problem with this approach is that while macros can simply inject defns into the namespace, functions have to work a bit harder for that; what I use instead is:

(defn internfunc
  "Helper for interning a function with the given name (as a string) in the
  current namespace."
  [strname func]
  (intern *ns* (symbol strname) func))

I'm sure the code could be made much shorter by doing more work in the macro, but writing it this way made it possible to break the implementation into a number of small and simple functions, each of which is easy to test and understand without peering into the output of macroexpand.

In the implementation of cases I was a bit more brave and left more work in the macro itself:

(defn make-cond-case
  "Helper function for cases that generates a single case for the variant cond.

  variant-case is one variant case as given to the cases macro.
  obj-variant is the actual object variant (a symbol) as taken from the object.
  obj-fields is the list of the actual object's fields.

  Produces the code for '(cond-case cond-action)."
  [variant-case obj-variant obj-fields]
  `((= (quote ~(first variant-case)) ~obj-variant)
    (apply (fn [~@(second variant-case)] ~(last variant-case)) ~obj-fields)))

(defmacro cases
  [typename obj & variant-cases]
  (let [obj-type-sym (gensym 'type)
        obj-variant-sym (gensym 'variant)
        obj-fields-sym (gensym 'fields)]
    `(let [[~obj-type-sym ~obj-variant-sym & ~obj-fields-sym] ~obj]
       (assert (= ~obj-type-sym (quote ~typename)) "Unexpected type")
       (cond
         ~@(mapcat (fn [vc] (make-cond-case vc obj-variant-sym obj-fields-sym))
                   variant-cases)
         :else (assert false "Unsupported variant")))))

As you can see, I still deferred some of the work to a function - make-cond-case - to avoid complex nested quoting within the macro.

The full code is on Github.


Logistic regression



This article covers logistic regression - arguably the simplest classification model in machine learning; it starts with basic binary classification, and ends up with some techniques for multinomial classification (selecting between multiple possibilities). The final examples using the softmax function can also be viewed as an example of a single-layer fully connected neural network.

This article is the theoretical part; in addition, there's quite a bit of accompanying code here. All the models discussed in the article are implemented from scratch in Python using only Numpy.

Linear model for binary classification

Using a linear model for binary classification is very similar to linear regression, except that we expect a binary (yes/no) answer rather than a numeric answer.

We want to come up with a parameter vector \theta, such that for every data vector x we can compute [1]:

\[\hat{y}(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n\]

And then make a binary decision based on the value of \hat{y}(x). A simple way to make a decision is to say "yes" if \hat{y}(x)\geq 0 and "no" otherwise. Note that this is arbitrary, as we could flip the condition for "yes" and for "no". We could also compare \hat{y}(x) to some value other than zero, and the model would learn equally well [2].

Let's make this more concrete, also assigning numeric values to "yes" and "no", which will make some computations simpler later on. For "yes" we'll (again, arbitrarily) select +1, and for "no" we'll go with -1. So, a linear model for binary classification is parameterized by some \theta, such that:

\[\hat{y}(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n\]

And:

\[class(x)=\left\{\begin{matrix} +1 & \operatorname{if}\ \hat{y}(x)\geq 0\\ -1 & \operatorname{if}\ \hat{y}(x)< 0 \end{matrix}\]

It helps seeing a graphical example of how this looks in practice. As usual, we'll have to stick to low dimensionality if we want to visualize things, so let's use 2D data points.

Since our data is in 2D, we need a 3D \theta (\theta_0 for the bias). Let's pick \theta=(4,-0.5, -1). Plotting \hat{y}(x)=\theta \cdot x will give us a plane in 3D, but what we're really interested in is just to know whether \hat{y}(x) \geq 0. So we can draw this plane's intersection with the x/y axis:

Line for binary classification

We can play with some sample points to see that everything "to the right" of the line gives us \hat{y}(x) > 0, and everything "to the left" of it gives us \hat{y}(x) < 0 [3].

Loss functions for binary classification

How do we find the right \theta for a classification problem? Similarly to linear regression, we're going to define a "loss function" and then train a classifier by minimizing this loss with gradient descent. However, here picking a good loss function is not as simple - it turns out square loss doesn't work very well, as we'll see soon.

Let's start by considering the most logical loss function to use for classification - the number of misclassified data samples. This is called the 0/1 loss, and it's the true measure of how well a classifier works. Say we have 1000 samples, our classifier placed 960 of them in the right category, and got the wrong answer for the other 40 samples. So the loss would be 40. A better classifier may get it wrong only 35 times, so its loss would be smaller.

It will be helpful to plot loss functions, so let's add another definition we're going to be using a lot here: the margin. For a given sample x, and its correct classification y, the margin of classification is m=\hat{y}(x)y. Recall that y is either +1 or -1, so the margin is either \hat{y}(x) or its negation, depending on the correct answer. Note that the margin is positive when our guess is correct (both \hat{y}(x) and y have the same sign) and negative when our guess is wrong. With this in hand, we define 0/1 loss as:

\[L_{01}(m) = \mathbb{I}(m \leq 0)\]

Where \mathbb{I} is an indicator function taking the value 1 when its condition is true and the value 0 otherwise. Here is the plot of L_{01} as a function of margin:

0/1 loss for binary classification

Unfortunately, the 0/1 loss is fairly hostile to gradient descent optimization, since it's not convex. This is easy to see intuitively. Suppose we have some \theta that gives us a margin of -1.5. The 0/1 loss for this margin is 1, but how can we improve it? Small nudges to \theta will still give us a margin very close to -1.5, which results in exactly the same loss. We don't know which way to nudge \theta since either way we get the same outcome. In other words, there's no slope to follow here.

That's not to say all is lost. Some work is being done with optimizing 0/1 losses for classification, but this is a bit outside the mainstream of machine learning. Here's an interesting paper that discusses some approaches. It's fascinating for computer science geeks since it uses combinatorial search techniques. The rest of this post, however, will use 0/1 loss only as an idealized limit, trying other kinds of loss we can actually run gradient descent with.

The first such loss that comes to mind is square loss, the same one we use in linear regression. We'll define the square loss as a function of margin:

\[L_2(m) = (m - 1)^2\]

The reason we do this is to get two desired outcomes at important points: at m=1 we want the loss to be 0, since this is actually the correct classification: we only get m=1 when either both y=1 and \hat{y}=1 or when both y=-1 and \hat{y}=-1.

Furthermore, to approximate the 0/1 loss, we want our loss at m=0 to be 1. Here's a plot of the square loss together with 0/1 loss:

0/1 loss and square loss for binary classification

A couple of problems are immediately apparent with the square loss:

  1. It penalizes correct classification as well, in case the margin is very positive. This is not something we want! Ideally, we want the loss to be 0 starting with m=1 and for all subsequent values of m.
  2. It very strongly penalizes outliers. One sample that we misclassified badly can shift the training too much.

We could try to fix these problems by using clamping of some sort, but there is another loss function which serves as a much better approximation to 0/1 loss. It's called "hinge loss":

\[L_h(m) = max(0, 1-m)\]

And its plot, along with the previously shown losses:

0/1 loss, square loss and hinge loss for binary classification

Note that the hinge loss also matches 0/1 loss on the two important points: m=0 and m=1. It also has some nice properties:

  1. It doesn't penalize correct classification after m=1.
  2. It penalizes incorrect classifications, but not as much as square loss.
  3. It's convex (at least where it matters - where the loss is nonzero)! If we get m=-1.5 we can actually examine the loss in its very close vicinity and find a slope we can use to improve the loss. So, unlike 0/1 loss, it's amenable to gradient descent optimization.

There are other loss functions used to train binary classifiers, such as log loss, but I will leave them out of this post.

This is a good place to mention that hinge loss leads naturally to SVMs (support vector machines), an interesting technique I'll leave for some other time.

Finding a classifier with gradient descent

With a loss function in hand, we can use gradient descent to find a good classifier for some data. The procedure is very similar to what we've been doing for linear regression:

Given a loss function, we compute the loss gradient with respect to each \theta and update \theta for the next step:

\[\theta_{j}=\theta_{j}-\eta\frac{\partial L}{\partial \theta_{j}}\]

Where \eta is the learning rate.

Computing gradients for our loss functions, with regularization

The only missing part remaining is computing the gradients for the square and loss hinge functions we've defined. In addition, I'm going to add "L_2 regularization" to the loss as a means to prevent overfitting for the training data. Regularization is an important component of the learning algorithm. L_2 regularization adds the sum of the squares of all parameters to the loss, and thus "tries" to keep parameters low. This way, we don't end up over-emphasizing one or a group of parameters over the others.

Here is square loss with regularization [4]:

\[L_2=\frac{1}{k}\sum_{i=1}^{k}(m^{(i)}-1)^2+\frac{\beta}{2}\sum_{j=0}^{n}\theta_{j}^2\]

This is assuming we have k data points (n+1 dimensional) and n+1 parameters (including the special 0th parameter representing the bias). The total loss is the square loss averaged over all data points, plus the regularization loss. \beta is the regularization "strength" (another hyper-parameter in the learning algorithm).

Let's start by computing the derivative of the margin. Using superscripts for indexing data items, recall that:

\[m^{(i)}=\hat{y}^{(i)}y^{(i)}=(\theta_0 x_0^{(i)}+\cdots + \theta_n x_n^{(i)})y^{(i)}\]

Therefore:

\[\frac{\partial m^{(i)}}{\partial \theta_j}=x_j^{(i)}y^{(i)}\]

With this in hand, it's easy to compute the gradient of L_2 loss.

\[\frac{\partial L_2}{\partial \theta_j}=\frac{2}{k}\sum_{i=1}^{k}(m^{(i)}-1)x_{j}^{(i)}y^{(i)}+\beta\theta_j\]

Now let's turn to hinge loss. The total loss for the data set with regularization is:

\[L_h=\frac{1}{k}\sum_{i=1}^{k}max(0, 1-m^{(i)})+\frac{\beta}{2}\sum_{j=0}^{n}\theta_{j}^2\]

The tricky part here is finding the derivative of the max function with respect to \theta_j. I find it easier to reason about functions like max when the different cases are cleanly separated:

\[max(0,1-m^{(i)})=\left\{\begin{matrix} 1-m^{(i)} & \operatorname{if}\ m^{(i)}< 1\\ 0 & \operatorname{if}\ m^{(i)}\geq 1 \end{matrix}\right.\]

We already know the derivative of m^{(i)} with respect to \theta_j. So it's easy to derive this expression case-by-case:

\[\frac{\partial max(0,1-m^{(i)})}{\partial \theta_j}=\left\{\begin{matrix} -x_j^{(i)}y^{(i)} & \operatorname{if}\ m^{(i)}< 1\\ 0 & \operatorname{if}\ m^{(i)}\geq 1 \end{matrix}\right.\]

And the overall gradient of the hinge loss is:

\[\frac{\partial L_h}{\partial \theta_j}=\frac{1}{k}\sum_{i=1}^{k}\frac{\partial max(0,1-m^{(i)})}{\partial \theta_j}+\beta\theta_j\]

Experiments with synthetic data

Let's see an example of learning binary classifier in action. This code sample generates some synthetic data in two dimensions and then uses the approach described so far in the post to train a binary classifier. Here's a sample data set:

Synthetic data for binary classification

The data points for which the correct answer is positive (y=1) are the green crosses; the ones for which the correct answer is negative (y=-1) are the red dots. Note that I include a small number of negative outliers (red dots where we'd expect only green crosses to be) to test the classifier on realistic, imperfect data.

The sample code can use combinatorial search to find a "best" set of parameters that results in the lowest 0/1 loss - the lowest number of misclassified data items. Note that misclassifying some items in this data set is inevitable (with a linear classifier), because of the outliers. Here is the contour line showing how the classification decision is made with parameters found by doing the combinatorial search:

Synthetic data for binary classification with only 0/1 loss

The 0/1 loss - number of misclassified data items - for this set of parameters is 20 out of 400 data items (95% correct prediction rate).

Next, the code trains a classifier using square loss, and another using hinge loss. I'm not using regularization for this data set, since with only 3 parameters there can't be too much selective bias between them; in other words, \beta=0.

A classifier trained with square loss misclassifies 32 items (92% success rate). A classifier trained with hinge loss misclassifies 26 items (93.5% success rate, much closer to the "perfect" rate). This is to be expected from the earlier discussion - square loss very strongly penalizes outliers, which makes it more skewed on this data [5]. Here are the contour plots for all losses that demonstrate this graphically:

Synthetic data for binary classification with all losses

Binary classification of MNIST digits

The MNIST dataset is the "hello world" of machine learning these days. It's a database of grayscale images representing handwritten digits, with a correct label for each of these images.

MNIST is usually employed for the more general multinomial classification problem - classifying a given data item into one of multiple classes (0 to 9 in the case of MNIST). We'll address this in a later section.

Here, however, we can experiment with training a binary classifier on MNIST. The idea is to train a classifier that recognizes some single label. For example, a classifier answering the question "is this an image of the digit 4". This is a binary classification problem, since there are only two answers - "yes" and "no".

Here's a code sample that trains such a classifier, using the hinge loss function (since we've already determined it gives better results than square loss for classification problems).

It starts by converting the correct labels of MNIST from the numeric range 0-9 to +1 or -1 based on whether the label is 4:

         0           -1
         1           -1
         4            1
         9           -1
y =      3     ==>   -1
         8           -1
         5           -1
        ...
         4            1

Then all we have is a binary classification problem, albeit one that is 785-dimensional (784 dimensions for each of the 28x28 pixels in the input images, plus one for bias). Visualizing the separating contours would be quite challenging here, but we can now trust the math to know what's going on. Other than this, the code for gradient descent is exactly the same as for the simple 2D synthetic data shown earlier.

My goal here is not to design a state-of-the-art machine learning architecture, but to explain how the main parts work. So I didn't tune the model too much, but it's possible to get 98% accuracy on this binary formulation of MNIST by tuning the code a bit. While 98% sounds great, recall that we could get 90% just by saying "no" to every digit :-) Feel free to play with the code to see if you can get even higher numbers; I don't really expect record-beating numbers from this model, though, since it's so simple.

Logistic regression - predicting probabilities

So far the predictors we've been looking at were trained to return a binary yes/no response; a more useful model would also tell us how sure it is. For example "what is the chance of rain tomorrow", rather than "will there be rain, yes or no"? The probability gives additional information. "90% chance of rain" vs. "56% chance of rain" gives us additional information over the binary "yes" for both cases (assuming a 50% cutoff).

Moreover, note that the linear model we've trained actually provides more information already, giving a numerical answer. We choose to cut it off at 0, saying yes for positive and no for negative numbers. But some numbers are more positive (or negative) than others!

Quick thought experiment: can we somehow interpret the response before cutoff as probability? The main problem here is that probabilities must be in the range [0, 1], while the linear model gives us an arbitrary real number. We may end up with negative probabilities or probabilities over 1, neither of which makes much sense. So we'll want to find some mathematical way to "squish" the result into the valid [0, 1] range. A common way to do this is to use the logistic function:

\[S(z) = \frac{1}{1 + e^{-z}}\]

It's also known as the "sigmoid" function because of its S-like shape:

Sigmoid function

We're going to assign \hat{y}(x) into the z variable of the sigmoid, to get the function:

\[S(x) = \frac{1}{1 + e^{-(\theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n)}}\]

And now, the answer we get can be interpreted as a probability between 0 and 1 (without actually touching either asymptote) [6]. We can train a model to get as close to 1 as possible for training samples where the true answer is "yes" and as close to 0 as possible for training samples where the true answer is "no". This is called "logistic regression" due to the use of the logistic function.

Training logistic regression with the cross-entropy loss

Earlier in this post, we've seen how a number of loss functions fare for the binary classifier problem. It turns out that for logistic regression, a very natural loss function exists that's called cross-entropy (also sometimes "logistic loss" or "log loss"). This loss function is derived from probability and information theory, and its derivation is outside the scope of this post (check out Chapter 3 of Michael Nielsen's online book for a nice intuitive explanation for why this loss function makes sense).

The formulation of cross-entropy we're going to use here starts from the most general:

\[C(x^{(i)})=-\sum_{t} p^{(i)}_t log(p(y^{(i)}=t|\theta))\]

Let's unravel this definition, step by step. The parenthesized superscript x^{(i)} denotes, as usual, the ith input sample. t runs over all the possible outcomes; p_t is the actual probability of outcome t and inside the log we have the conditional probability of this outcome given the regression parameters - in other words, this is the model's prediction [7].

To make this more concrete, in our case we have two possible outcomes in the training data: either y^{(i)}=+1 or y^{(i)}=-1. Given any such outcome, its "actual" probability is either 1 (when we get this outcome in the training data) or 0 (when we don't). So for any given sample, one of the two possible values of t has p^{(i)}_t=0 and the other has p^{(i)}=1. Therefore, we get [8]:

\[C(x^{(i)})=\left\{ \begin{matrix} -log(S(x^{(i)}) & \operatorname{if}\ y^{(i)}=+1 \\ -log(1-S(x^{(i)})) & \operatorname{if}\ y^{(i)}=-1 \end{matrix}\]

The second possibility has -log(1-S(x^{(i)})) because we define S(z) to predict the probability of the answer being +1; therefore, the probability of the answer being -1 is 1-S(z).

This is the cross-entropy loss for a single sample x^{(i)}. To get the total loss over a data set, we take the average sample loss, as usual:

\[C = \frac{1}{k}\sum_{i=1}^{k} C(x^{(i)})\]

Now let's compute the gradient of this loss function, so we can use it to train a model. Starting with the +1 case, we have:

\[C_{+1} = -log(S(x^{(i)}))\]

Then:

\[\frac{\partial C_{+1}}{\partial \theta_j} = \frac{-1}{S(x^{(i)})}\frac{\partial S(x^{(i)})}{\partial \theta_j}\]

Here it will be helpful to use the following identity, which can be easily verified by going through the math [9]:

\[S'(z)=S(z)(1-S(z))\]

Since in our case S(x^{(i)}) is actually S(\hat{y}(x^{(i}))) where \hat{y}(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n, we can apply the chain rule:

\[\frac{\partial S(x^{(i)})}{\partial \theta_j}=S(x^{(i)})(1-S(x^{(i)}))x^{(i)}_j\]

Substituting back into \frac{\partial C_{+1}}{\partial \theta_j}, we get:

\[\begin{align*} \frac{\partial C_{+1}}{\partial \theta_j} &= \frac{-1}{S(x^{(i)})}S(x^{(i)})(1-S(x^{(i)}))x^{(i)}_j \\                                        &= (S(x^{(i)})-1)x^{(i)}_j \end{align*}\]

Similarly, for C_{-1}=-log(1-S(x^{(i)})) we can compute:

\[\frac{\partial C_{-1}}{\partial \theta_j} = S(x^{(i)})x^{(i)}_j\]

Putting it all together, we find that the contribution of x^{(i)} to the gradient of \theta_j is:

\[\frac{\partial C(x^{(i)})}{\partial \theta_j}=\left\{ \begin{matrix} (S(x^{(i)})-1)x^{(i)}_j & \operatorname{if}\ y^{(i)}=+1 \\ S(x^{(i)})x^{(i)}_j & \operatorname{if}\ y^{(i)}=-1 \end{matrix}\]

Using these formulae, we can train a binary logistic classifier for MNIST that gives us a probability of some input image being a 4, rather than a yes/no answer. The binary MNIST code sample trains either a binary or a logistic classifier using a lot of shared infrastructure.

The probability gives us more information than just a yes/no answer. Consider, for example the following image from the MNIST database. When I trained a binary classifier with hinge loss to recognize the image 4 for 1200 steps, it wrongly predicted it is a 4:

Image of a 9 from MNIST

The model clearly made a mistake here, but can we know how wrong it was? It would be hard to know with a binary classifier that only gives us a yes/no answer. However, when I run a logistic regression model on the same image, it tells me it is 53% confident this is a 4. Since our cutoff for yes/no is 50%, this is quite close to the threshold and thus I'd say the model didn't make a huge mistake here.

Multiclass logistic regression

The previous example is a great transition into the topic of multiclass logistic regression. Most real-life problems have more than one possible answer and it would be nice to train models to select the most suitable answer for any given input.

Our input is still a vector x, but now instead of assigning +1 or -1 as the answer, we'll be assigning one of a fixed set of classes. If there are T classes, the answer will be a number in the closed range [0..T-1].

The good news is that we can use the building blocks developed in this post to put together a multiclass classifier. There are many ways to do this; here I'll focus on two: one-vs-all classification and softmax.

One-vs-all classification

The One-vs-all (OvA), also known as one-vs-rest (OvR) approach is a natural extension of binary classification:

  1. For each class t\in[0..T-1] we train a logistic classifier where we set t as the "correct" answer, and the other classes as the "incorrect" answers (+1 and -1 respectively).
  2. The result of each such classifier is the probability that an input sample belongs to class t.
  3. Given a new input, we run all T classifiers on it and the one that gives us the highest probability is chosen as the true class of the input.

As a completely synthetic example to make this clearer, suppose that T=3. We take the training data and train 3 logistic regressions. In the first - C_0, we set 0 as the right answer, 1 and 2 as the wrong answers. In the second - C_1 we set 1 as the right answer, 0 and 2 as the wrong answers. Finally in the third - C_2 we set 2 as the right answer, 0 and 1 the wrong answers.

Now, given a new input vector x we run C_0(x), C_1(x) and C_2(x). Each of these gives us the probability of x belonging to the respective class. If we put all the classifiers in a vector, we get:

\[C(x)=[C_0(x), C_1(x), C_2(x)]\]

We pick the class where the probability is highest. Mathematically, we can use the argmax function for this purpose. argmax returns the index of the maximal element in the given vector. For example, given:

\[C(x)=[0.45, 0.42, 0.09]\]

We get:

\[\underset{t \in [0..2]}{argmax}(C(x))=0\]

Therefore, the chosen class is 0. These class/index numbers are just labels of course. They can stand for anything depending on the problem domain: medical condition names, digits and so on.

This approach doesn't require any additional math over what we've already covered in this post. This multinomial MNIST classifier code sample implements it. The error rate it achieves is ~11%, similar to what LeCun's 1998 paper achieved with a simple linear classifier. Much better than 11% can be done for MNIST, even with a single-layer linear model. However, my model is very far from the state of art - there's no preprocessing, no artificially-enlarged training set, no adaptive learning rate; I didn't even spend time tuning the hyperparameters (regularization type and constants, learning rate, batch size etc.) The goal here was just to demonstrate the basics of logistic regression, not to compete for the state of the art in MNIST.

Softmax

An alternative to OvA is to use the softmax function. I covered softmax in some detail previously; just briefly, softmax is a function S(\mathbf{a}):\mathbb{R}^{N}\rightarrow \mathbb{R}^{N} such that:

\[S_j=\frac{e^{a_j}}{\sum_{k=1}^{N}e^{a_k}} \qquad \forall j \in 1..N\]

It is very useful for multiclass classification, since it lets us generate probabilities of the input belonging to one of N classes. Similarly to the OvA case, here we have to train 10 different parameter vectors, one for each digit. However, unlike OvA, this training doesn't happen separately but occurs at the same time. Instead of training a model to find a single parameter vector each time, we train a parameter matrix once.

The model structure is as follows:

Model of softmax logistic regression

I've chosen the number of classes to be 10 to reflect MNIST where we have 10 possible digits to assign to every input. In MNIST N is 785 (784 for each of 28x28 pixels in the image, plus one for bias). "Logits" is a common name to assign to the output of a fully connected layer (which is what we have with the matrix-vector multiplication in the first stage); the logits are arbitrary real numbers. The softmax function is responsibe for squeezing them into the range of probabilities (0, 1) and making sure they all add up to 1.

This diagram shows what happens to a single input as it goes through the model. In a realistic program, there will be another dimension - the batch dimension, used to vectorize the computation over a whole batch of inputs.

For training this model, we need a loss function. It turns out cross-entropy is a very popular loss function to use for softmax. In the softmax post I also covered how to compute the gradient of cross-entropy on a softmax, so we're all set to write some code: the full sample is here. Running it on MNIST for a couple of minutes produces a 9.5% error rate - slightly better than the OvA approach, but very close. This is to be expected, since OvA and softmax compute very similar results (finding the maximal probability from a set of probabilities), just in a different way. Softmax regression is much faster, however, since we can vectorize the training for all 10 digits in the same run.


[1]In this post I'm following many of the conventions established in my post on linear regression. In particular, by construction x_0=1 so that \theta_0 is the bias.
[2]Why? Because we have the bias as part of the model, so any constant offset can be absorbed into the learned bias.
[3]Note that this outcome is, once again, somewhat arbitrary. We could find another plane that intersects the x/y axis on the same line, and get a different classification. For example, if we flip the sign of all the elements of \theta, we get the same intersection line. In that case, however, values "to the right" of the line give us \hat{y}(x) < 0. Since the labels we attach are arbitrary, this really makes no difference. The only important thing is that we find a line that separates "true" from "false" samples and be consistent with our signs and labels throughout the process.
[4]Note that both the loss and the regularization are called L_2. This is a bit confusing, but both are essentially 2nd norms. It's best to ignore the name of the regularization factor and just refer to it as "regularization". I thought it's important to mention initially as there are other kinds of regularization being used for machine-learning algorithms and I wanted to make it clear which one is being used here.
[5]As an exercise, play with the code to increase or decrease the number of outliers (the code makes it easily controllable), and observe the effects on the misclassification rates of the different loss functions.
[6]Note that using the logistic function on the model's output is strictly a generalization of the binary classifier. We can still make a binary interpretation of the result if we're so inclined, interpreting S(z) \geq 0.5 as "yes" and otherwise as "no". In terms of the input to S(z), this means "yes" for z=\hat{y}(x) \geq 0 which is exactly the formulation we've been using for the binary classifier.
[7]In essence, cross entropy is computed between two probability distributions. Here, one of them is the "real" distribution observed in the y data. The other is what we predict given X data and our regression parameters \theta. The observed real probability is either 0 or 1 for any given data item, and the corresponding predicted probability is our model's output. I also discussed cross-entropy in the post about softmax.
[8]Many resources online condense this formula to a single line without the condition: C(x)=-ylog(S(x))-(1-y)log(1-S(x)). I'm avoiding this formulation on purpose, because it requires the possible values of y to be 0 and 1, not -1 and +1. Although it's possible to play with constants a bit to reformulate the -1/+1 case in a similarly condensed fashion, I find the version with the condition more explicit and thus easier to follow, even if it requires a bit more typing.
[9]See also my post about the chain rule, where this derivation is shown.