Part 2: Softmax Regression [Draft]

Sait Celebi (celebisait@gmail.com)
Last updated: January 06 2018

"What I cannot create, I do not understand." -- Richard Feynman

Introduction

Let's say we want to build a model to discriminate the following red, green and blue points in 2-dimensional space:

import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(precision=3, suppress=True)

X = np.array([[-0.1, -0.5, 1.3, -0.6, -1.5, 0.2, -0.3,  0.7,  1.1, -1.0,
               -0.5, -1.3,  -1.4, -0.9, 0.4 , -0.4, 0.3, -1.6, -0.5, -1.0],
              [1.4,  -0.1, 0.9,  0.4,  0.4, 0.2, -0.4, -0.8, -1.5,  0.9,
               1.5, -0.45, -1.2, -0.7, -1.3, 0.6, -0.5, -0.7, -1.4, -1.4]])
Y = np.array([[0, 0, 1, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 2, 1, 0, 1, 2, 2, 2]])
Y_one_hot = np.eye(3)[Y[0]].T
colormap = np.array(['r', 'g', 'b'])

def plot_scatter(X, Y, colormap, path):
   plt.grid()
   plt.xlim([-2.0, 2.0])
   plt.ylim([-2.0, 2.0])
   plt.xlabel('$x_1$', size=20)
   plt.ylabel('$x_2$', size=20)
   plt.title('Input 2D points', size=18)
   plt.scatter(X[0], X[1], s=50, c=colormap[Y[0]])
   plt.savefig(path)

plot_scatter(X, Y, colormap, 'image.png')

plt.close()
plt.clf()
plt.cla()

In other words, given a point, $(x_1, x_2)$, we want to output either red, green or blue.

We can use Softmax Regression for this problem. We first learn weights ($w_{1,1}, w_{1,2}, w_{2,1}, w_{2,2}, w_{3,1}, w_{3,2}$) and biases ($b_1, b_2, b_3$); this phase is called training. Then we use the learned weights and biases to predict whether a new point is red, green or blue; this phase is called prediction or inference.

One hot vector representation

We represent the output as a one hot vector. In other words, we represent red points using $\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$, green points using $\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$, and blue points using $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$.
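For example, here is a quick check (a small sketch, not part of the model itself) on the Y and Y_one_hot arrays defined in the snippet above:

# the first three labels are 0, 0, 1 (red, red, green); here are their one-hot columns
print(Y[0, :3])
print(Y_one_hot[:, :3])
# [0 0 1]
# [[ 1.  1.  0.]
#  [ 0.  0.  1.]
#  [ 0.  0.  0.]]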

Computation Graph

Here is a visual representation of our model:

We compute $\mathbf{z} = W \mathbf{x} + \mathbf{b}$, turn it into probabilities $\mathbf{a}$ using the softmax function, and simply pick the biggest $a_i$ to do the final prediction.

Feed-forward Phase

Let's assume that we are given the weights and bias. How do we calculate the output?

We represent $X$ as a matrix. $X$ contains all the points; in our case it contains $M=20$ samples, and each sample is a pair $(x_1, x_2)$. $Y$ contains all the labels (red, green and blue) as one-hot vectors. $W$ holds the weights and $b$ holds the biases:

$$ X = \begin{bmatrix} 0 & 0 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ \end{bmatrix}_{2 \times M}, \quad Y = \begin{bmatrix} 0 & 0 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ \end{bmatrix}_{3 \times M}, \quad W = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ \end{bmatrix}_{3 \times 2}, \quad b = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix}_{3 \times 1} $$

Feed-forward means that, given $X$, $Y$, $W$ and $b$, we compute the activations $A$ and the loss $L$. First:

$$ Z = W X + b $$

Here we can see it visually:

$$ \begin{bmatrix} 0 & 0 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ \end{bmatrix}_{3 \times M} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ \end{bmatrix}_{3 \times 2} \begin{bmatrix} 0 & 0 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ \end{bmatrix}_{2 \times M} + \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix}_{3 \times 1} $$

As you may have realized, adding the $(3 \times 1)$ vector $b$ to the $(3 \times M)$ matrix $WX$ here relies on broadcasting.
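To make the broadcasting explicit, here is a tiny sketch (the names W_demo and b_demo are made up for illustration); the $(3 \times 1)$ bias column is added to every column of the $(3 \times M)$ product:

# with W_demo all zeros, every column of Z_demo is simply a copy of b_demo
W_demo = np.zeros((3, 2))
b_demo = np.array([[1.0], [2.0], [3.0]])
Z_demo = np.matmul(W_demo, X) + b_demo   # (3, 2) x (2, 20) + (3, 1) -> (3, 20)
print(Z_demo[:, :2])
# [[ 1.  1.]
#  [ 2.  2.]
#  [ 3.  3.]]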

After getting $Z$, we apply softmax function over $Z$:

\begin{equation} \label{eq:softmax} a_i = \frac{e^{z_i}}{\sum_{j=1}^N e^{z_j}} \end{equation}

However, this may be problematic to compute for big values of $z_i$, because $e^{z_i}$ easily overflows a 64-bit (even a 128-bit) float. We call this phenomenon numerical instability. We need to approach it slightly differently.

Numerical Stability of Softmax function

The Softmax function takes an N-dimensional vector of real values and returns a new N-dimensional vector that sums up to $1$. The exact formula is the Softmax equation (Equation \ref{eq:softmax}) above.

Let's make an example:

def softmax(a):
  return np.exp(a) / np.sum(np.exp(a))

a = np.array([1.0, 2.0, 3.0])
print(a)
print(softmax(a))
[ 1.  2.  3.]
[ 0.09   0.245  0.665]

Intuitively, softmax increases/emphasizes the relative difference between large and small values.

However, let's look at this:

def softmax(a):
  return np.exp(a) / np.sum(np.exp(a))

a = np.array([1000.0, 2000.0, 3000.0])
print(a)
print(softmax(a))
[ 1000.  2000.  3000.]
[ nan  nan  nan]

We get nan values, along with a RuntimeWarning: overflow encountered in exp, simply because:

print(np.exp(1000))
inf
Instead, we can approach it differently: $$ a_i = \frac{e^{z_i}}{\sum_{j=1}^N e^{z_j}} = \frac{e^{z_i}e^K}{\sum_{j=1}^N e^{z_j} e^K} = \frac{e^{z_i + K}}{\sum_{j=1}^N e^{z_j + K}} $$

for some fixed $K$. And we can pick $K = -\max(z_1, z_2, \dots, z_N)$.

More practically:

def softmax(a):
  return np.exp(a-max(a)) / np.sum(np.exp(a-max(a)))

a = np.array([1000.0, 2000.0, 3000.0])
print(a)
print(softmax(a))
[ 1000.  2000.  3000.]
[ 0.  0.  1.]

As you can see, we still have some numerical issues: the first and second values of the softmax shouldn't be exactly $0$, they should be very close to $0$ but not exactly $0$. Still, this is not as bad as the nan issue.
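One common workaround, and the idea our numerically stable loss function below relies on, is to stay in log-space: the log of the softmax can be computed without ever exponentiating large values. Here is a hedged sketch (log_softmax is our own helper name, not used elsewhere in this tutorial):

# log-softmax: log(a_i) = (z_i - max(z)) - log(sum_j exp(z_j - max(z)))
def log_softmax(z):
  shifted = z - max(z)
  return shifted - np.log(np.sum(np.exp(shifted)))

z = np.array([1000.0, 2000.0, 3000.0])
print(log_softmax(z))
# [-2000. -1000.     0.]

In log-space the small probabilities stay distinguishable (log-probabilities of $-2000$ and $-1000$), even though they underflow to exactly $0$ once exponentiated.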

So, to wrap up the Feed-Forward phase, here is the finalized forward propagation step:

def stable_softmax(Z):
  return np.exp(Z - Z.max(axis = 0)) / np.sum(np.exp(Z - Z.max(axis = 0)), axis = 0)

def forward_propagate(X, W, b):
  Z = np.matmul(W, X) + b
  A = stable_softmax(Z)

  return Z, A

W = np.array([[ 0.31, 3.95],
      [ 7.07, -0.23],
      [-6.27, -2.35]])

b = np.array([[ 1.2  ],
      [ 2.93 ],
      [-4.14 ]])

Z, A = forward_propagate(X, W, b)
print(Y_one_hot[:,0:10])
print(A[:,0:10])
[[ 1.  1.  0.  1.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  1.  0.  0.  1.  1.  1.  1.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]]
[[ 0.992  0.652  0.001  0.963  0.118  0.096  0.186  0.     0.     0.988]
 [ 0.008  0.19   0.999  0.018  0.     0.904  0.734  1.     1.     0.   ]
 [ 0.     0.158  0.     0.019  0.882  0.     0.08   0.     0.     0.012]]

Above we print the first 10 labels and the first 10 predictions, and they look pretty good. In fact, we have 100% accuracy. So, given the weights and biases, it is straightforward to calculate the final predictions. The tricky part is to learn those weights properly.
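To make the 100% accuracy claim concrete, here is one small check (a sketch using the A and Y arrays above): pick the biggest $a_i$ in each column and compare it with the label.

# predicted class = index of the largest activation in each column of A
predictions = np.argmax(A, axis=0)
print(np.mean(predictions == Y[0]))
# should print 1.0 for the weights and bias given above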

Defining Loss function using Maximum Likelihood Estimation

In training, our goal is to learn a matrix $W$ of size $(3 \times 2)$ and a bias vector $\mathbf{b}$ of size $(3 \times 1)$ that best discriminate the red, green and blue points.

We want to find the $W$ and $\mathbf{b}$ that minimize some definition of a cost function. Let's attempt to write a cost function for this problem.

Let's say we have three points:

$$\mathbf{x} = \begin{bmatrix} -0.1 \\ 1.4 \end{bmatrix}, y=0$$ $$\mathbf{x} = \begin{bmatrix} 1.3 \\ 0.9 \end{bmatrix}, y=1$$

and similarly,

$$\mathbf{x} = \begin{bmatrix} -1.4 \\ -1.2 \end{bmatrix}, y=2$$

Now, let's list these $y$ values as one-hot vectors, along with their corresponding imaginary $\mathbf{a}$ values:

$$\mathbf{y} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \mathbf{a} = \begin{bmatrix} 0.9 \\ 0.1 \\ 0.0 \end{bmatrix} $$ $$\mathbf{y} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \mathbf{a} = \begin{bmatrix} 0.1 \\ 0.8 \\ 0.1 \end{bmatrix} $$ $$\mathbf{y} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}, \mathbf{a} = \begin{bmatrix} 0.1 \\ 0.2 \\ 0.7 \end{bmatrix} $$

Intuitively, we want a classifier that produces similar looking $\mathbf{a}$ and $\mathbf{y}$. This means, if $\mathbf{y} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$, then, for example, having $\mathbf{a} = \begin{bmatrix} 0.8 \\ 0.1 \\ 0.1 \end{bmatrix}$ is more desirable than having $\mathbf{a} = \begin{bmatrix} 0.6 \\ 0.2 \\ 0.2 \end{bmatrix}$.

In other words, we want to maximize:

$$P(\mathbf{y}|\mathbf{x}) = \prod_{j=1}^{3} a_j^{y_j} $$

Here, $a_j$ represents the jth item in the vector $\mathbf{a}$, and similarly $y_j$ represents the jth value in $\mathbf{y}$. For example, when $\mathbf{a} = \begin{bmatrix} 0.9 \\ 0.1 \\ 0.0 \end{bmatrix}$, then, $a_1 = 0.9, a_2 = 0.1$ and $a_3 = 0.0$.

$$\mathbf{y} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \mathbf{a} = \begin{bmatrix} 0.9 \\ 0.1 \\ 0.0 \end{bmatrix}, P(\mathbf{y}|\mathbf{x}) = 0.9 \times 1 \times 1 = 0.9 $$ $$\mathbf{y} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \mathbf{a} = \begin{bmatrix} 0.1 \\ 0.8 \\ 0.1 \end{bmatrix}, P(\mathbf{y}|\mathbf{x}) = 1 \times 0.8 \times 1 = 0.8 $$ $$\mathbf{y} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}, \mathbf{a} = \begin{bmatrix} 0.1 \\ 0.2 \\ 0.7 \end{bmatrix}, P(\mathbf{y}|\mathbf{x}) = 1 \times 1 \times 0.7 = 0.7 $$

The bigger $P(\mathbf{y}|\mathbf{x})$ is, the better.
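As a quick check (a minimal sketch using the first toy pair above), the product formula indeed picks out the probability assigned to the true class:

# P(y|x) = prod_j a_j^{y_j}; with a one-hot y this picks out the a_j of the true class
y = np.array([1.0, 0.0, 0.0])
a = np.array([0.9, 0.1, 0.0])
print(np.prod(a ** y))
# 0.9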

Similar to Logistic Regression, in order to define the loss over multiple samples, we simply multiply the per-sample values together (Maximum Likelihood Estimation):

$$J = \prod_{i=1}^{M} P(y^{(i)}|x^{(i)}) $$

here, $y^{(i)}$, $x^{(i)}$ and $a^{(i)}$ correspond to the ith sample in the training set out of $M$ training samples. We can rewrite it as:

$$J = \prod_{i=1}^{M} \prod_{j=1}^{3} (a_j^{(i)}) ^ {y_j^{(i)}} $$

Maximizing the above is equivalent to maximizing its logarithm:

$$J = \log \left( \prod_{i=1}^{M} \prod_{j=1}^{3} (a_j^{(i)}) ^ {y_j^{(i)}} \right ) $$

we can write it as:

$$J = \sum_{i=1}^{M} \sum_{j=1}^{3} y_j^{(i)} \log(a_j^{(i)}) $$

Since we like to minimize things instead of maximizing, we negate it:

$$J = - \sum_{i=1}^{M} \sum_{j=1}^{3} y_j^{(i)} \log(a_j^{(i)}) $$

If we add our Log Loss to our computation graph:

This loss function is also known as the Cross Entropy Loss in some contexts.
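As a tiny worked example (a sketch using the three imaginary $(\mathbf{y}, \mathbf{a})$ pairs from above), the loss is just the negative sum of the logs of the probabilities assigned to the correct classes:

# only the true class of each sample contributes, so J = -(log 0.9 + log 0.8 + log 0.7)
p_true = np.array([0.9, 0.8, 0.7])
print(-np.sum(np.log(p_true)))
# approximately 0.685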

Gradient Descent

Ideally, we want to start with random parameters and gradually improve them in an iterative manner. Gradient descent is simply:

$$ W = W - \alpha \frac{dL}{dW}, \quad \mathbf{b} = \mathbf{b} - \alpha \frac{dL}{d \mathbf{b}} $$

The tricky part here is to compute $\frac{dL}{dW}$ and $\frac{dL}{d \mathbf{b}}$ (here $L$ denotes the loss we called $J$ above). We need to do a small-scale back propagation of derivatives here.

But first, let's see what happens if we change $w_{2, 1}$ in the network, step by step:

In order to do gradient descent, we need the derivatives:

$$ \frac{dL}{dw_{m,n}} = \sum_{i=1}^3 \left (\frac{dL}{da_i} \right) \left (\frac{da_i}{dz_m} \right) \left(\frac{dz_m}{dw_{m,n}} \right ), \quad \frac{dL}{db_m} = \sum_{i=1}^3 \left (\frac{dL}{da_i} \right) \left (\frac{da_i}{dz_m} \right) \left(\frac{dz_m}{db_m} \right ) $$

Let's do some calculus:

$$ \frac{dL}{da_i} = \frac{d}{da_i} \left ( - \sum_{j=1}^3 y_j \log(a_j) \right ) = \frac{d}{da_i} \left ( - y_i \log(a_i) \right ) = \frac{-y_i}{a_i} $$ $$ \frac{dz_m}{dw_{m,n}} = x_n, \quad \frac{dz_m}{db_m} = 1 $$
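If you want to double-check $\frac{dL}{da_i} = \frac{-y_i}{a_i}$ numerically, a tiny finite-difference sketch (with made-up toy values) looks like this:

# numerical check of dL/da_i = -y_i / a_i on a toy (y, a) pair
y = np.array([0.0, 1.0, 0.0])
a = np.array([0.1, 0.8, 0.1])
eps = 1e-6

def toy_loss(a):
  return -np.sum(y * np.log(a))

a_bumped = a.copy()
a_bumped[1] += eps
print((toy_loss(a_bumped) - toy_loss(a)) / eps)   # numerical derivative w.r.t. a_2
print(-y[1] / a[1])                               # analytical: -1 / 0.8
# both should be approximately -1.25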

Derivative of Softmax Function

A nice property of the Softmax function is that it produces a legit probability distribution. In classification (or more generally in Machine Learning) we often want to assign probabilities to categories or classes. The Softmax function is known to work well in numerous applications/areas.

Softmax is a vector function -- it takes a vector as an input and returns another vector. Therefore, we cannot just ask for the derivative of softmax; we can only ask for the derivative of softmax with respect to particular elements.

For example,

$$ \frac{d}{d z_2} a_1 $$

refers to how much $a_1$ will change if we change $z_2$.

Using the same logic for each pair of elements $a_i$ and $z_j$ produces an $N \times N$ matrix of derivatives.

Let's take the derivative for one particular element:

$$ \frac{d}{d z_m} a_i = \frac{d}{d z_m} \frac{e^{z_i}}{\sum_{j=1}^N e^{z_j}} $$

We can use Quotient Rule here. Recall that:

$$ \frac{d}{dx} \frac{f(x)}{g(x)} = \frac{f'(x)g(x) - g'(x)f(x)}{ [g(x)]^2 } $$

In our case:

$$ f(x) = e^{z_i}, \quad g(x) = \sum_{j=1}^N e^{z_j} $$

Let's apply Quotient Rule, if $i=m$:

$$ \frac{d}{d z_m} a_i = \frac{d}{d z_m} \frac{e^{z_i}}{\sum_{j=1}^N e^{z_j}} = \frac{ (e^{z_i})' \sum_{j=1}^N e^{z_j} - (\sum_{j=1}^N e^{z_j})' e^{z_i} }{ [\sum_{j=1}^N e^{z_j}] ^ 2 } = \frac{ e^{z_i} \sum_{j=1}^N e^{z_j} - e^{z_m} e^{z_i} }{ [\sum_{j=1}^N e^{z_j}] ^ 2 } = \frac{ e^{z_i} } { \sum_{j=1}^N e^{z_j} } \frac{ \sum_{j=1}^N e^{z_j} - e^{z_m} } { \sum_{j=1}^N e^{z_j} } = (a_i)(1-a_m) $$

Notice that we simplify by plugging in $a_i$ and $a_m$ (see Equation \ref{eq:softmax}) in the last step above.

Similarly, if $i \neq m$:

$$ \frac{d}{d z_m} a_i = \frac{d}{d z_m} \frac{e^{z_i}}{\sum_{j=1}^N e^{z_j}} = \frac{ (e^{z_i})' \sum_{j=1}^N e^{z_j} - (\sum_{j=1}^N e^{z_j})' e^{z_i} }{ [\sum_{j=1}^N e^{z_j}] ^ 2 } = \frac{ 0 - e^{z_m} e^{z_i} }{ [\sum_{j=1}^N e^{z_j}] ^ 2 } = -\frac{ e^{z_m} } { \sum_{j=1}^N e^{z_j} } \frac{ e^{z_i} }{ \sum_{j=1}^N e^{z_j} } = - (a_m) (a_i) $$

More succinctly, we can summarize all of the above as:

$$ \frac{d}{d z_m} a_i = \begin{cases} (a_i)(1-a_m), & \text{if}\ \quad i = m \\ - (a_m) (a_i), & \text{if}\ \quad i \neq m \end{cases} $$

or equally:

$$ \frac{d}{d z_m} a_i = (a_i)(\delta_{i,m} - a_m) $$

where $\delta_{i,m} = 1$ if $i=m$, and $0$ otherwise.
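A quick finite-difference check of this formula (a sketch using the softmax() helper defined earlier and an arbitrary test vector) looks like this:

# entry (i, m) of the derivative matrix is a_i * (delta_{i,m} - a_m), i.e. diag(a) - a a^T
z = np.array([0.5, -1.0, 2.0])
a = softmax(z)
eps = 1e-6

jacobian_analytic = np.diag(a) - np.outer(a, a)
jacobian_numeric = np.zeros((3, 3))
for m in range(3):
  z_bumped = z.copy()
  z_bumped[m] += eps
  jacobian_numeric[:, m] = (softmax(z_bumped) - a) / eps   # column m approximates da/dz_m

print(np.allclose(jacobian_analytic, jacobian_numeric, atol=1e-5))
# True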

The reason we want to write it this way is that we don't want to use any loops; we can execute the above using matrix operations like:

$$ \frac{d}{dz} a = \mathbf{a} \mathbf{e^T} \circ (\mathbf{I} - \mathbf{e} \mathbf{a^T}) $$

where $\mathbf{e}$ is a vector of $1$'s of size $K \times 1$ for a suitable $K$, and $\circ$ represents the Hadamard product, in other words the element-wise product of matrices. The resulting matrix of derivatives is of size $(3 \times 3)$. We plug it in below:

$$ \begin{bmatrix} \frac{da_1}{dz_1} & \frac{da_2}{dz_1} & \frac{da_3}{dz_1} \\ \frac{da_1}{dz_2} & \frac{da_2}{dz_2} & \frac{da_3}{dz_2} \\ \frac{da_1}{dz_3} & \frac{da_2}{dz_3} & \frac{da_3}{dz_3} \end{bmatrix}_{3 \times 3} \begin{bmatrix} \frac{dL}{da_1} \\ \frac{dL}{da_2} \\ \frac{dL}{da_3} \end{bmatrix}_{3 \times 1} = \begin{bmatrix} \frac{dL}{dz_1} \\ \frac{dL}{dz_2} \\ \frac{dL}{dz_3} \end{bmatrix}_{3 \times 1} $$

This is exactly what we are going to do. However, this is defined for only one sample. We don't want to loop over each sample and compute this sequentially; ideally, we want to do everything using matrix operations, including this step. However, that is a bit trickier. So, first let's see whether this approach works using loops. Afterwards, we will see how to convert it to purely matrix operations.

Back propagation Phase

We will learn the weights using Back Propagation.

ALPHA = 2.0 # learning rate

# this simple implementation is numerically unstable, because
# np.log() returns -inf for inputs that have underflowed to exactly 0
def get_loss(A, Y_one_hot):
  loss = -1 * np.sum(Y_one_hot * np.log(A))
  return loss

# semantically the same as the function above, but numerically stable.
def get_loss_numerically_stable(Z, Y_one_hot):
  loss = -1 * np.sum(Y_one_hot * ((Z - Z.max(axis = 0)) -
                                  np.log(np.sum(np.exp(Z - Z.max(axis = 0)), axis = 0))))
  return loss

def get_gradients(Y_one_hot, Z, A):
  dA = (-Y_one_hot / A)

  dZ = np.zeros((3, 20))
  for i in range(20):
    a = A[:, [i]]
    da = dA[:, [i]]
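    # 'matrix' below is the softmax derivative matrix for sample i: entry (p, q) = a_p * (delta_{p,q} - a_q)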
    matrix = np.matmul(a, np.ones((1, 3))) * (np.identity(3) - np.matmul(np.ones((3, 1)), a.T))
    dZ[:, [i]] = np.matmul(matrix, da)

  dW = np.zeros((3,2))
  db = np.zeros((3,1))
  for i in range(20):
    x = X[:, [i]]
    dz = dZ[:, [i]]
    dw = dz * np.tile(x, 3).T
    dW += (1.0/20) * dw
    db += (1.0/20) * dz

  return dZ, dW, db

def gradient_descent(W, b, dW, db, alpha):
  W = W - alpha * dW
  b = b - alpha * db
  return W, b

# random initialization
W = np.random.rand(3, 2)
b = np.zeros((3, 1))

W_cache = []
b_cache = []
L_cache = []

for i in range(40):
  Z, A = forward_propagate(X, W, b)
  L = (1.0 / 20) * get_loss_numerically_stable(Z, Y_one_hot)

  W_cache.append(W)
  b_cache.append(b)
  L_cache.append(L)

  dZ, dW, db = get_gradients(Y_one_hot, Z, A)
  W, b = gradient_descent(W, b, dW, db, ALPHA)

plt.grid()
plt.title('Loss', size=18)
plt.xlabel('Number of iterations', size=15)
plt.ylabel('Loss', size=15)
plt.plot(L_cache)

plt.savefig('image.png')

plt.close()
plt.clf()
plt.cla()

Decision Boundary

So, essentially, we have 3 equations:

$$ \text{Equation A.} \quad w_{1,1} x_1 + w_{1,2} x_2 + b_1 = z_1 \\ \text{Equation B.} \quad w_{2,1} x_1 + w_{2,2} x_2 + b_2 = z_2 \\ \text{Equation C.} \quad w_{3,1} x_1 + w_{3,2} x_2 + b_3 = z_3 $$

We have 3 boundaries, one for each of the 3 choose 2 pairs of the equations above:

If we focus on Equation A - Equation B:

$$ w_{1,1} x_1 + w_{1,2} x_2 + b_1 = w_{2,1} x_1 + w_{2,2} x_2 + b_2 \\ w_{1,1} x_1 - w_{2,1} x_1 + w_{1,2} x_2 - w_{2,2} x_2 = b_2 - b_1 \\ (w_{1,1} - w_{2,1}) x_1 + (w_{1,2} - w_{2,2}) x_2 = b_2 - b_1 \\ $$
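As a quick sanity check (a small sketch using the trained W and b from the code above), any point on this line should give $z_1 = z_2$:

# pick an arbitrary x1, solve the boundary equation for x2, then verify z_1 == z_2
x1 = 0.5
x2 = ((b[1, 0] - b[0, 0]) - (W[0, 0] - W[1, 0]) * x1) / (W[0, 1] - W[1, 1])
z = np.matmul(W, np.array([[x1], [x2]])) + b
print(z[0, 0], z[1, 0])
# the two printed values should match up to floating point error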

If we plot these three lines:

def plot_decision_boundary(X, Y, W, b, path):
  plt.grid()
  plt.xlim([-2.0, 2.0])
  plt.ylim([-2.0, 2.0])
  plt.xlabel('$x_1$', size=20)
  plt.ylabel('$x_2$', size=20)
  plt.title('Decision boundary', size = 18)

  plt.scatter(X[0], X[1], s=50, c=colormap[Y[0]])

  xs = np.array([-2.0, 2.0])
  ys1 = ((b[1, 0] - b[0, 0]) - (W[0, 0] - W[1, 0]) * xs) / (W[0, 1] - W[1, 1])
  ys2 = ((b[2, 0] - b[0, 0]) - (W[0, 0] - W[2, 0]) * xs) / (W[0, 1] - W[2, 1])
  ys3 = ((b[2, 0] - b[1, 0]) - (W[1, 0] - W[2, 0]) * xs) / (W[1, 1] - W[2, 1])

  plt.plot(xs, ys1, c='black')
  plt.plot(xs, ys2, c='black')
  plt.plot(xs, ys3, c='black')

  plt.savefig(path)

plot_decision_boundary(X, Y, W, b, 'image.png')

plt.close()
plt.clf()
plt.cla()

Above, we simply find the boundaries and plot them. A boundary is the set of points where the predictions are equally confident for the two corresponding classes. Since we have three classes, there are 3 choose 2 = 3 boundaries.

Similarly, we can plot the same as our classifier progresses through the learning process. As you may guess, it should start from a random point and get smarter in each step.

import matplotlib.animation as animation

fig = plt.figure()

ax = fig.add_subplot(111)
ax.set_xlim([-2.0, 2.0])
ax.set_ylim([-2.0, 2.0])
ax.set_xlabel('$x_1$', size=20)
ax.set_ylabel('$x_2$', size=20)

ax.set_title('Decision boundary - Animated', size = 18)

def animate(i):
  xs = np.array([-2.0, 2.0])
  W = W_cache[i]
  b = b_cache[i]

  ys1 = ((b[1, 0] - b[0, 0]) - (W[0, 0] - W[1, 0]) * xs) / (W[0, 1] - W[1, 1])
  ys2 = ((b[2, 0] - b[0, 0]) - (W[0, 0] - W[2, 0]) * xs) / (W[0, 1] - W[2, 1])
  ys3 = ((b[2, 0] - b[1, 0]) - (W[1, 0] - W[2, 0]) * xs) / (W[1, 1] - W[2, 1])

  lines1.set_data(xs, ys1)
  lines2.set_data(xs, ys2)
  lines3.set_data(xs, ys3)

  text_box.set_text('Iteration: {}'.format(i))
  return lines1, lines2, lines3, text_box

lines1, = ax.plot([], [], c='black')
lines2, = ax.plot([], [], c='black')
lines3, = ax.plot([], [], c='black')

ax.scatter(X[0], X[1], s=50, c=colormap[Y[0]])
text_box = ax.text(1.1, 1.6, 'Iteration 0', size = 16)

anim = animation.FuncAnimation(fig, animate, len(W_cache), blit=True, interval=500)
anim.save('animation.mp4', writer='avconv', fps=6, codec="libx264")

plt.close()
plt.clf()
plt.cla()

As you can see, we start from a random classifier that does not seem to work well in the beginning. The learning process then figures out where to go next to find a better classifier. After the learning is done, the final classifier is pretty good; in fact, it has 100% accuracy.

Let's see the decision boundary in a lazier setting. Here, we simply classify every single point in a grid and then feed the predictions to a contour-style plot. Compared to the previous animation, this plot shows the prediction of every single point in the grid under the final version of the classifier's parameters, whereas the previous animation shows the parameters step by step through the gradient descent iterations.

NX = 100
NY = 100

def plot_decision_boundary_lazy(X, Y, W, b, path):
  plt.grid()
  plt.xlim([-2.0, 2.0])
  plt.ylim([-2.0, 2.0])
  plt.xlabel('$x_1$', size=20)
  plt.ylabel('$x_2$', size=20)
  plt.title('Decision boundary - Contour plot', size = 18)

  xs = np.linspace(-2.0, 2.0, NX)
  ys = np.linspace(2.0, -2.0, NY)
  xv, yv = np.meshgrid(xs, ys)
  X_fake = np.stack((xv.flatten(), yv.flatten()), axis=0)
  _, predictions = forward_propagate(X_fake, W, b)

  plt.imshow(predictions.T.reshape(NX, NY, 3), extent=[-2.0, 2.0, -2.0, 2.0])
  plt.scatter(X[0], X[1], s=50, c=colormap[Y[0]])

  plt.savefig(path)

plot_decision_boundary_lazy(X, Y, W_cache[-1], b_cache[-1], 'image.png')

plt.close()
plt.clf()
plt.cla()

Here, the color of the background depicts our prediction for that imaginary point. Remember that our prediction is $\mathbf{a}$ and it is three dimensional, so we simply interpret that vector as a color in RGB space. For example, if the prediction is $[0.98, 0.01, 0.01]$, the color will be almost a perfect red, and so on.

If you look closely, you will see some purple color between the red and blue points. That is because the predictions in that region are similar to $[0.45, 0.1, 0.45]$, which is a mix of red and blue and therefore gives us purple. A similar phenomenon happens near the other decision boundary intersections.

Gradient Descent Parameter Updates

Here we see the updates of parameters step by step in the gradient descent.

import matplotlib.animation as animation

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8), sharex=False, sharey=False)
fig.suptitle('Parameter Updates vs Loss - Animated', size=24)

ax1.grid()
ax1.set_xlim([-0.5, 8.6])
ax1.set_ylim([-5.0, 5.0])
ax1.set_xlabel('Parameter', size=20)
ax1.set_ylabel('Value', size=20)
ax1.set_title('Parameter Values', size = 18)

xlabels = ['$w_{1,1}$', '$w_{1,2}$', '$w_{2,1}$', '$w_{2,2}$',
   '$w_{3,1}$', '$w_{3,2}$', '$b_1$', '$b_2$', '$b_3$']
ax1.set_xticks(range(9))
ax1.set_xticklabels(xlabels, size=20)

ax2.grid()
ax2.set_xlim([0.0, 40])
ax2.set_ylim([0, max(L_cache) * 1.1])
ax2.set_xlabel('Number of iterations', size=20)
ax2.set_ylabel('Loss', size=20)
ax2.set_title('Loss', size=18)

def animate(i):
  ys = np.concatenate((W_cache[i].flatten(), b_cache[i].flatten()))
  xs = np.arange(len(ys))

  for j in range(len(ys)):
    bars[j].set_height(ys[j])

  lines.set_data(range(i+1), L_cache[0:i+1])

  text_box.set_text('Iteration: {}'.format(i))
  return list(bars) + [lines, text_box]

bars = ax1.bar(range(9), np.zeros(9), color='blue', align='center')
text_box = ax1.text(6.2, 3.6, 'Iteration 0', size = 16)
lines, = ax2.plot([], [], c='black')

anim = animation.FuncAnimation(fig, animate, len(W_cache), blit=True, interval=500)
anim.save('animation.mp4', writer='avconv', fps=4, codec="libx264")

plt.close()
plt.clf()
plt.cla()

This is a nice animation that shows the progress of our parameters $W$ and $\mathbf{b}$ in each iteration of gradient descent along with the corresponding loss value using the given parameters.

We start with small random initial values for $w_{1,1}, w_{1,2}, w_{2,1}, w_{2,2}, w_{3,1}, w_{3,2}$ and with all-$0$ values for $b_1, b_2, b_3$. Then we start applying gradient descent. Every step of gradient descent gives us a better set of parameters, and we see the loss decreasing.

Calculating Derivatives using Matrix Operations

Recall that we are trying to convert this to matrix operations:

$$ \mathbf{a} \mathbf{e^T} \circ (\mathbf{I} - \mathbf{e} \mathbf{a^T}) $$

Now, instead of $\mathbf{a}$, we have $A$:

$$ A = \left[ \begin{array}{c|c|c|c} \mathbf{a}^{(1)} & \mathbf{a}^{(2)} & \dots & \mathbf{a}^{(20)} \\ \end{array} \right]_{3 \times 20} $$

where each $\mathbf{a}^{(i)}$ is of size $(3 \times 1)$.

Similarly, we also have:

$$ \frac{dL}{dA} = \left[ \begin{array}{c|c|c|c} \frac{dL}{d \mathbf{a}^{(1)}} & \frac{dL}{d \mathbf{a}^{(2)}} & \dots & \frac{dL}{d \mathbf{a}^{(20)}} \\ \end{array} \right]_{3 \times 20} $$

again, each $\frac{dL}{d \mathbf{a}^{(i)}}$ is of size $(3 \times 1)$.

Goal: calculate $\frac{dL}{dZ}$ of size $(3 \times 20)$ using only matrix operations.

$$ \left[ \begin{array}{c|c|c|c} \mathcal{A}_1 & \mathcal{A}_2 & \dots & \mathcal{A}_{20} \\ \end{array} \right]_{3 \times (3 \times 20)} \left[ \begin{array}{c|c|c|c} \frac{dL}{d\mathbf{a}^{(1)}} & 0 & \dots & 0 \\ \hline 0 & \frac{dL}{d\mathbf{a}^{(2)}} & \dots & 0 \\ \hline \vdots & \vdots & \vdots & \vdots \\ \hline 0 & 0 & \dots & \frac{dL}{d\mathbf{a}^{(20)}} \\ \end{array} \right]_{(3 \times 20) \times 20} = \left[ \begin{array}{c|c|c|c} \frac{dL}{d\mathbf{z}^{(1)}} & \frac{dL}{d\mathbf{z}^{(2)}} & \dots & \frac{dL}{d\mathbf{z}^{(20)}} \end{array} \right]_{3 \times 20} $$

where each $\mathcal{A}_i$ of size $(3 \times 3)$ and each $\frac{dL}{d\mathbf{a}^{(i)}}$ of size $(3 \times 1)$.

We are almost done if we can define a proper $\mathcal{A}$ of size $3 \times (3 \times 20)$. Here it is:

$$ \left ( A_{3 \times 20} \left[ \begin{array}{c|c|c|c|c} e^T & 0 & 0 & \dots & 0 \\ \hline 0 & e^T & 0 & \dots & 0 \\ \hline \vdots & \vdots & \vdots & \vdots & \vdots \\ \hline 0 & 0 & 0 & \dots & e^T \\ \end{array} \right]_{20 \times (3 \times 20)} \right ) \circ \left ( \left[ \begin{array}{c|c|c|c} I & I & \dots & I \end{array} \right]_{3 \times (3 \times 20)} - ee^T_{3 \times 20} \left[ \begin{array}{c|c|c|c|c} a_1^T & 0 & 0 & \dots & 0 \\ \hline 0 & a_2^T & 0 & \dots & 0 \\ \hline \vdots & \vdots & \vdots & \vdots & \vdots \\ \hline 0 & 0 & 0 & \dots & a_{20}^T \\ \end{array} \right]_{20 \times (3 \times 20)} \right ) $$

where each $e$ is a vector of $1$s of size $(K \times 1)$ for a suitable $K$, and each $I$ is the $(3 \times 3)$ identity matrix.

Or equivalently, we can write this as:

$$ \left ( \left[ \begin{array}{c|c|c|c|c|c|c|c} a_1 & a_1 & a_1 & a_2 & a_2 & a_2 & \dots & a_{20} \end{array} \right]_{3 \times (3 \times 20)} \right ) \circ \left ( \left[ \begin{array}{c|c|c|c} I & I & \dots & I \end{array} \right]_{3 \times (3 \times 20)} - \left[ \begin{array}{c|c|c|c} a_1^T & a_2^T & \dots & a_{20}^T \\ \hline a_1^T & a_2^T & \dots & a_{20}^T \\ \hline a_1^T & a_2^T & \dots & a_{20}^T \end{array} \right]_{3 \times (3 \times 20)} \right ) $$

Note that this calculates exactly the same gradients as the function get_gradients(), which uses a for loop instead. The advantage of the matrix operations approach is that it is more parallelizable than the for loop approach.

However, in order to appreciate the speed of the matrix approach, we have to implement it carefully. This includes initializing the sparse matrices without using any explicit for loops. Implementing the above matrix operations in Numpy/Scipy carefully enough to actually beat the for loop approach is beyond the scope of this tutorial, so we leave it as an exercise to the reader.

Also keep in mind that the speed advantage of the matrix operations may not be obvious in a small sample of $20$ elements. You may want to compare both approaches using more sample points and also a higher dimensional feature space.
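For the curious, here is one hedged sketch of a fully vectorized version (the helper name get_gradients_vectorized is ours, and this is not the carefully tuned sparse-matrix implementation the exercise above asks for). It uses the fact that each per-sample product $(\mathrm{diag}(\mathbf{a}) - \mathbf{a}\mathbf{a}^T)\frac{dL}{d\mathbf{a}}$ collapses to $\mathbf{a} \circ \left( \frac{dL}{d\mathbf{a}} - (\mathbf{a}^T \frac{dL}{d\mathbf{a}}) \mathbf{e} \right)$, which can be broadcast across all columns at once:

# a vectorized alternative to get_gradients(); the function name is our own
def get_gradients_vectorized(X, Y_one_hot, A):
  M = X.shape[1]
  dA = -Y_one_hot / A
  # per column: (diag(a) - a a^T) da  =  a * (da - (a . da))
  dZ = A * (dA - np.sum(A * dA, axis=0, keepdims=True))
  dW = (1.0 / M) * np.matmul(dZ, X.T)
  db = (1.0 / M) * np.sum(dZ, axis=1, keepdims=True)
  return dZ, dW, db

dZ_v, dW_v, db_v = get_gradients_vectorized(X, Y_one_hot, A)
dZ_l, dW_l, db_l = get_gradients(Y_one_hot, Z, A)
print(np.allclose(dW_v, dW_l), np.allclose(db_v, db_l))
# should print: True True

In fact, if you plug $\frac{dL}{dA} = -Y / A$ into this expression and simplify, it collapses further to the well-known shortcut $\frac{dL}{dZ} = A - Y$ for softmax combined with the cross entropy loss.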

Applying Softmax Regression using low-level Tensorflow APIs

Here is how to train the same classifier for the above red, green and blue points using the low-level TensorFlow API. It produces almost exactly the same output as our own hand-crafted model. Be aware that there may be small differences because of the random initialization of both models. (Remember that $W$ is initialized with random values.)

import tensorflow as tf

t_X = tf.placeholder(tf.float32, [None, 2])
t_Y = tf.placeholder(tf.float32, [None, 3])

t_W = tf.Variable(tf.random_uniform((2, 3)))
t_b = tf.Variable(tf.zeros((1, 3)))

t_logits = tf.matmul(t_X, t_W) + t_b
t_Softmax = tf.nn.softmax(t_logits)
t_Accuracy = tf.contrib.metrics.accuracy(labels = tf.argmax(t_Y, axis=1),
                                         predictions = tf.argmax(t_Softmax, axis=1))

t_Loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
                         labels=t_Y, logits=t_logits))

train = tf.train.GradientDescentOptimizer(ALPHA).minimize(t_Loss)
init = tf.global_variables_initializer()

losses = []
accs = []
with tf.Session() as session:
   session.run(init)
   for i in range(40):
      ttrain, loss, acc = session.run([train, t_Loss, t_Accuracy], feed_dict={t_X:X.T, t_Y:Y_one_hot.T})
      losses.append(loss)
      accs.append(acc)

plt.grid()
plt.title('Loss', size=18)
plt.xlabel('Number of iterations', size=15)
plt.ylabel('Loss', size=15)
plt.plot(losses)

plt.savefig('image.png')

plt.close()
plt.clf()
plt.cla()

Exercises

  1. If you look at the decision boundary, the three lines always intersect at one single point. Is this always the case? (i.e., is this an invariant of some sort?) If it is an invariant, why is this the case?
  2. Looking at the structure of our decision boundary, try to come up with a linearly separable set of samples from three classes (red, green and blue) that our classifier would not be able to successfully discriminate.
  3. Assume we had four classes: red, green, blue and orange instead of only three. What would the decision boundary look like? Try to guess its potential structure. Would such a boundary be able to classify every possible linearly separable input containing four classes?
