Part 3: Building a Simple Neural Network [Draft]

Sait Celebi (celebisait@gmail.com)
Last updated: March 24 2018

"Truth is ever to be found in the simplicity, and not in the multiplicity and confusion of things." -- Sir Isaac Newton

Introduction

Let's say we want to build a model to discriminate the following red and green points in 2-dimensional space:

import numpy as np
import matplotlib.pyplot as plt
import collections

np.set_printoptions(precision=3, suppress=True)

X = np.array([[0.65, 0, -0.75, 1.299, -0.46, -1.299, 0, 0.46, 0.46, -1.299,
               -0.46, 0, 0.75, 1.299, -0.65, 1.5, -1.5, 0.75, -0.75, 0],
              [0, -0.65, -1.299, 0.75, -0.46, -0.75, -1.5, 0.46, -0.46, 0.75,
               0.46, 0.65, -1.299, -0.75, 0, 0, 0, 1.299, 1.299, 1.5]])
Y = np.array([[0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1]])

colormap = np.array(['r', 'g'])

def plot_scatter(X, Y, colormap, path):
   plt.grid()
   plt.xlim([-2.0, 2.0])
   plt.ylim([-2.0, 2.0])
   plt.xlabel('$x_1$', size=20)
   plt.ylabel('$x_2$', size=20)
   plt.title('Input 2D points', size=18)
   plt.scatter(X[0], X[1], s=50, c=colormap[Y[0]])
   plt.savefig(path)

plot_scatter(X, Y, colormap, 'image.png')

plt.close()
plt.clf()
plt.cla()
(Executed in 0.255 seconds.)

In other words, given a point $(x_1, x_2)$, we want to output either red or green. (In this tutorial, 0.0 means red and 1.0 means green.)

We can build a simple Neural Network for this problem. Neural Networks are widely used in applications ranging from face recognition and machine translation to speech-to-text and self-driving cars. Ours is a very simple neural network in terms of the number of layers and neurons it contains.

It's basically a Logistic Regression (or Softmax Regression) with more layers.

Computation Graph

Here is the visual representation of our model:

We compute $A^{(3)}$ and decide whether $A^{(3)} > 0.5$ for our final prediction.
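For instance, the thresholding step can be sketched as follows (the activation values here are made up for illustration):

```python
import numpy as np

# Hypothetical final-layer activations for 4 samples (made-up values)
A3 = np.array([[0.1, 0.7, 0.4, 0.9]])

# Threshold at 0.5: 1 (green) if A3 > 0.5, else 0 (red)
predictions = (A3 > 0.5).astype(int)
print(predictions)  # [[0 1 0 1]]
```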

Feed-forward Phase

Let's assume that we are given the weights and biases. How do we calculate the output?

We represent $X$ as a matrix. $X$ contains all the points. In our case, $X$ contains $M=20$ samples, and for each sample we have $(x_1, x_2)$. $Y$ contains all the labels (red or green):

$$ X = \begin{bmatrix} 0 & 0 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ \end{bmatrix}_{2 \times M}, \quad Y = \begin{bmatrix} 0 & 0 & \dots & 0 \\ \end{bmatrix}_{1 \times M}, \quad $$

Here are the parameters of our model:

$$ W^{(1)} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ \end{bmatrix}_{3 \times 2}, \quad b^{(1)} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix}_{3 \times 1} $$ $$ W^{(2)} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ \end{bmatrix}_{3 \times 3}, \quad b^{(2)} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix}_{3 \times 1} $$ $$ W^{(3)} = \begin{bmatrix} 0 & 0 & 0 \\ \end{bmatrix}_{1 \times 3}, \quad b^{(3)} = \begin{bmatrix} 0 \\ \end{bmatrix}_{1 \times 1} $$

Feed-forward means computing $A^{(3)}$ given $X$, $W^{(1)}, W^{(2)}, W^{(3)}$ and $b^{(1)}, b^{(2)}, b^{(3)}$. Here is the feed-forward phase, step by step:

$$ Z^{(1)} = W^{(1)} X + b^{(1)} $$ $$ A^{(1)} = g(Z^{(1)}) $$ $$ Z^{(2)} = W^{(2)} A^{(1)} + b^{(2)} $$ $$ A^{(2)} = g(Z^{(2)}) $$ $$ Z^{(3)} = W^{(3)} A^{(2)} + b^{(3)} $$ $$ A^{(3)} = g(Z^{(3)}) $$

We use the sigmoid function as the activation function $g()$, as we did before.

Let's try to apply the above equations using an initial random set of weights and biases.

sigmoid = lambda x: 1/(1+np.exp(-x))

def forward_propagate(X, W1, b1, W2, b2, W3, b3):
  Z1 = np.matmul(W1, X) + b1
  A1 = sigmoid(Z1)

  Z2 = np.matmul(W2, A1) + b2
  A2 = sigmoid(Z2)

  Z3 = np.matmul(W3, A2) + b3
  A3 = sigmoid(Z3)

  return Z1, A1, Z2, A2, Z3, A3

W1_initial = np.random.rand(3, 2)
W1 = W1_initial.copy()
b1 = np.zeros((3, 1))
W2_initial = np.random.rand(3, 3)
W2 = W2_initial.copy()
b2 = np.zeros((3, 1))
W3_initial = np.random.rand(1, 3)
W3 = W3_initial.copy()
b3 = np.zeros((1, 1))

Z1, A1, Z2, A2, Z3, A3 = forward_propagate(X, W1, b1, W2, b2, W3, b3)
print(Y)
print(A3)
(Executed in 0.003 seconds.)
[[0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 1 1 1 1 1]]
[[0.635 0.624 0.613 0.643 0.621 0.613 0.617 0.637 0.629 0.624 0.629 0.634
  0.625 0.634 0.623 0.641 0.617 0.643 0.633 0.64 ]]

Above, we print the predictions of a randomly initialized model: we initialize the weights randomly, set the biases to zero, and feed $X$ through. As you can see, the predictions are pretty random too, as expected. However, given the weights and biases, it is straightforward to calculate the final predictions. The tricky part is learning those weights properly.

Cross Entropy Loss Function

In training, our goal is to learn: $W^{(1)}, W^{(2)}, W^{(3)}, b^{(1)}, b^{(2)}, b^{(3)}$ that best discriminates red and green points. These are called the parameters of our model.

We want to find the parameters that minimize some cost function. We will use the same cost function we used before in Logistic Regression and Softmax Regression:

$$L = - \sum_{i=1}^{M} \left[ Y_{i} \log( A^{(3)}_{i} ) + (1-Y_{i}) \log( 1 - A^{(3)}_{i} ) \right]$$

Please see the previous lectures if you are interested in understanding how we derived the cross entropy loss function.
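As a quick numerical sanity check of the loss formula above (the labels and predictions here are made up), note that confident correct predictions contribute little loss:

```python
import numpy as np

Y = np.array([[1, 0, 1]])          # made-up true labels
A3 = np.array([[0.9, 0.2, 0.6]])   # hypothetical predictions

# Cross entropy loss, summed over the samples
L = -np.sum(Y * np.log(A3) + (1 - Y) * np.log(1 - A3))
print(L)  # about 0.8393
```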

If we add our loss to our computation graph:

Backpropagation

Now, we are going to have some fun.

Backprop is probably one of the most fundamental algorithms/techniques in Deep Learning. In my opinion, mastering backprop is crucial to (1) fully understand deep learning techniques and (2) be able to contribute new approaches to the field.

Backpropagation is simply calculating the derivatives of the loss w.r.t. each parameter in our model. We need this information because we feed the derivatives to Gradient Descent. The derivatives we want to compute with backprop are:

$$ \frac{dL}{dW^{(i)}}, \quad \frac{dL}{db^{(i)}} $$

We need to apply the chain rule, iteratively, to get there. We apply the chain rule in the direction opposite to the feed-forward direction. In fact, the name backpropagation comes from the fact that we propagate the error (well, the derivatives, really) backwards, step by step, compared to the feed-forward direction.

Here is our updated computation graph after we add the Backprop details.

Recall that we already calculated the first step of derivatives:

$$ \frac{dL}{dZ^{(3)}} = A^{(3)} - Y $$

in the Logistic Regression tutorial. Notice that we are using exactly the same loss function. (Both Logistic Regression and the model discussed in this tutorial are binary classifiers.)
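We can verify this derivative numerically with a finite-difference check (a sketch for a single made-up sample $z$ and label $y$):

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def loss(z, y):
    # Cross entropy loss for one sample, as a function of the logit z
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6

# Central finite difference approximation of dL/dz
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)

# The claimed closed form: dL/dz = sigmoid(z) - y
analytic = sigmoid(z) - y

print(numeric, analytic)  # both approximately -0.3318
```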

Let's try to go one step further and calculate $\frac{dL}{dW^{(3)}}$, keeping in mind that we already know $\frac{dL}{dZ^{(3)}}$. (This sentence mimics why we need the Chain Rule.)

In other words, we know how much $L$ changes if we play with $Z^{(3)}$ and we are trying to find out how much $L$ changes if we play with $W^{(3)}$.

Let's look at how we calculate $Z^{(3)}$:

$$ \begin{matrix} (W^{(3)}) \\ \begin{bmatrix} 0 & 0 & 0 \end{bmatrix} \\ \\ \mbox{} \end{matrix} \begin{matrix} (A^{(2)}) \\ \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix} \end{matrix} + \begin{matrix} (b^{(3)}) \\ \begin{bmatrix} 0 \end{bmatrix} \\ \\ \end{matrix} = \begin{matrix} (Z^{(3)}) \\ \begin{bmatrix} 0 \end{bmatrix} \\ \\ \end{matrix} $$

Let's look at this equation very carefully. Our simpler task for now is: "How much will $Z^{(3)}$ be affected if we play with $W^{(3)}$?", i.e.,

$$ \frac{dZ^{(3)}}{dW^{(3)}} $$

After finding out what the above is, we will extend it and find: $\frac{dL}{dW^{(3)}}$ using the Chain Rule.

It seems obvious visually that if we change, say, $W^{(3)}_1$, then $Z^{(3)}$ will change proportionally to $A^{(2)}_1$. Notice that we are assuming $A^{(2)}$ and $b^{(3)}$ are fixed and frozen. Using similar logic for the other entries of $W^{(3)}$, we can conclude:

$$ \frac{dZ^{(3)}}{dW^{(3)}} = A^{(2)}.T $$

Similarly, if we change $b^{(3)}$ by a small amount, $Z^{(3)}$ will change by the same amount in the same direction, so:

$$ \frac{dZ^{(3)}}{db^{(3)}} = 1 $$

To finalize the calculation of $\frac{dL}{dW^{(3)}}$, for one sample:

$$ \frac{dL}{dW^{(3)}} = \frac{dL}{dZ^{(3)}} \frac{dZ^{(3)}}{dW^{(3)}} $$

and similarly for $\frac{dL}{db^{(3)}}$:

$$ \frac{dL}{db^{(3)}} = \frac{dL}{dZ^{(3)}} $$

Nice.

At this moment, we calculated $\frac{dL}{dW^{(3)}}$ and $\frac{dL}{db^{(3)}}$. Now, we want to calculate $\frac{dL}{dW^{(2)}}$ and $\frac{dL}{db^{(2)}}$. In order to calculate them, we need $\frac{dL}{dA^{(2)}}$ and $\frac{dL}{dZ^{(2)}}$.

Looking to the same figure above visually, and using a similar logic, it looks obvious for one sample:

$$ \frac{dZ^{(3)}}{dA^{(2)}} = W^{(3)}.T $$

To finalize the calculation of $\frac{dL}{dA^{(2)}}$:

$$ \frac{dL}{dA^{(2)}} = \frac{dL}{dZ^{(3)}} \frac{dZ^{(3)}}{dA^{(2)}} $$

Given $\frac{dL}{dA^{(2)}}$, it is easy to get $\frac{dL}{dZ^{(2)}}$. We just need to apply the gradient of sigmoid function.

$$ \frac{dL}{dZ^{(2)}} = \frac{dL}{dA^{(2)}} \circ \left ( A^{(2)} \circ \left ( 1 - A^{(2)} \right ) \right ) $$
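The sigmoid gradient identity used here, $g'(z) = g(z)(1 - g(z))$, can be checked against a finite difference (a small sketch with arbitrary inputs):

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

z = np.array([-1.5, 0.0, 2.0])  # arbitrary test points
eps = 1e-6

# Central finite difference approximation of g'(z)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# The closed-form gradient: g(z) * (1 - g(z))
analytic = sigmoid(z) * (1 - sigmoid(z))

print(np.allclose(numeric, analytic, atol=1e-8))  # True
```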

Now, knowing $\frac{dL}{dZ^{(2)}}$, we are set to calculate $\frac{dL}{dW^{(2)}}$ and $\frac{dL}{db^{(2)}}$. Here is how we compute $Z^{(2)}$:

$$ \begin{matrix} (W^{(2)}) \\ \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ \end{bmatrix} \end{matrix} \begin{matrix} (A^{(1)}) \\ \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix} \end{matrix} + \begin{matrix} (b^{(2)}) \\ \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix} \end{matrix} = \begin{matrix} (Z^{(2)}) \\ \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix} \end{matrix} $$

It is easy to observe:

$$ \frac{dZ^{(2)}_{1}}{dW^{(2)}} = \begin{bmatrix} & A^{(1)}.T & \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ \end{bmatrix}, \quad \frac{dZ^{(2)}_{2}}{dW^{(2)}} = \begin{bmatrix} 0 & 0 & 0 \\ & A^{(1)}.T & \\ 0 & 0 & 0 \\ \end{bmatrix}, \quad \frac{dZ^{(2)}_{3}}{dW^{(2)}} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ & A^{(1)}.T & \\ \end{bmatrix} $$

Applying the chain rule, we should do something like:

$$ \frac{dL}{dW^{(2)}} = \sum_{i=1}^{3} \frac{dL}{dZ^{(2)}_i} \frac{dZ^{(2)}_i}{dW^{(2)}} $$

which is a fancier way of writing:

$$ \frac{dL}{dW^{(2)}} = \frac{dL}{dZ^{(2)}_1} \begin{bmatrix} & A^{(1)}.T & \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ \end{bmatrix} + \frac{dL}{dZ^{(2)}_2} \begin{bmatrix} 0 & 0 & 0 \\ & A^{(1)}.T & \\ 0 & 0 & 0 \\ \end{bmatrix} + \frac{dL}{dZ^{(2)}_3} \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ & A^{(1)}.T & \\ \end{bmatrix} $$

which is identical to:

$$ \frac{dL}{dW^{(2)}} = \begin{bmatrix} \\ \frac{dL}{dZ^{(2)}} \\ \\ \end{bmatrix} \begin{bmatrix} A^{(1)}.T \\ \end{bmatrix} $$

We are half-way there in terms of finishing the backprop.
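The collapse of the per-sample sum of outer products into one matrix multiplication can be verified numerically (a sketch with random data matching our layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
dZ2 = rng.standard_normal((3, 20))  # hypothetical dL/dZ2, one column per sample
A1 = rng.standard_normal((3, 20))   # hypothetical activations A1

# Sum of per-sample outer products, as in the chain-rule expansion
dW2_loops = np.zeros((3, 3))
for i in range(20):
    dW2_loops += np.outer(dZ2[:, i], A1[:, i])

# Equivalent single matrix multiplication
dW2_matmul = np.matmul(dZ2, A1.T)

print(np.allclose(dW2_loops, dW2_matmul))  # True
```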

We need to calculate $\frac{dL}{dA^{(1)}}$. Let's look at $\frac{dZ^{(2)}}{dA^{(1)}}$.

$$ \frac{dL}{dA^{(1)}_1} = (W^{(2)}_{:,1}).T \frac{dL}{dZ^{(2)}}, \quad \frac{dL}{dA^{(1)}_2} = (W^{(2)}_{:,2}).T \frac{dL}{dZ^{(2)}}, \quad \frac{dL}{dA^{(1)}_3} = (W^{(2)}_{:,3}).T \frac{dL}{dZ^{(2)}} $$

or, equivalently:

$$ \frac{dL}{dA^{(1)}} = W^{(2)}.T \frac{dL}{dZ^{(2)}} $$

Now we are only missing $\frac{dL}{dW^{(1)}}$ and $\frac{dL}{db^{(1)}}$. Let's look at how we calculate $Z^{(1)}$:

$$ \begin{matrix} (W^{(1)}) \\ \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ \end{bmatrix} \end{matrix} \begin{matrix} (X) \\ \begin{bmatrix} 0 \\ 0 \\ \end{bmatrix} \end{matrix} + \begin{matrix} (b^{(1)}) \\ \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix} \end{matrix} = \begin{matrix} (Z^{(1)}) \\ \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix} \end{matrix} $$

Now, let's look at:

$$ \frac{dL}{dW^{(1)}_{1,:}} = \frac{dL}{dZ^{(1)}_1} X.T, \quad \frac{dL}{dW^{(1)}_{2,:}} = \frac{dL}{dZ^{(1)}_2} X.T, \quad \frac{dL}{dW^{(1)}_{3,:}} = \frac{dL}{dZ^{(1)}_3} X.T $$

or equivalently:

$$ \frac{dL}{dW^{(1)}} = \frac{dL}{dZ^{(1)}} X.T $$

Let's put it all together and apply backprop!

ALPHA = 0.3 # learning rate
Model = collections.namedtuple('Model', ['W1', 'b1', 'W2', 'b2', 'W3', 'b3'])

# this simple implementation is numerically unstable, because:
# np.log() returns -inf for small inputs very close to 0
def get_loss(Y, Y_hat):
  loss = -1 * np.sum(Y * np.log(Y_hat) +
                     (1-Y) * np.log(1-Y_hat))
  return loss

# semantically same with above function, and numerically stable.
def get_loss_numerically_stable(Y, Z3):
  loss = -1 * np.sum(Y * -1 * np.log(1 + np.exp(-Z3)) +
                     (1-Y) * (-Z3 - np.log(1 + np.exp(-Z3))))
  return loss

def get_gradients_matrix_ops(Z1, A1, Z2, A2, Z3, A3, Y):
  dZ3 = A3 - Y  # size (1x20)

  dW3 = (1.0/Y.size) * np.matmul(dZ3, A2.T)  # size (1x3)
  db3 = (1.0/Y.size) * np.sum(dZ3, axis = 1, keepdims = True)  # size (1x1)

  dA2 = np.matmul(W3.T, dZ3)  # size (3x20)
  dZ2 = dA2 * (A2 * (1-A2)) # size (3x20)

  dW2 = (1.0/Y.size) * np.matmul(dZ2, A1.T)  # size (3x3)
  db2 = (1.0/Y.size) * np.sum(dZ2, axis = 1, keepdims= True)  # size (3x1)

  dA1 = np.matmul(W2.T, dZ2)  # size (3x20)
  dZ1 = dA1 * (A1 * (1 - A1))  # size (3x20)

  dW1 = (1.0/Y.size) * np.matmul(dZ1, X.T)  # size (3x2)
  db1 = (1.0/Y.size) * np.sum(dZ1, axis = 1, keepdims = True) # size (3x1)

  return dW1, db1, dW2, db2, dW3, db3

def get_gradients_loops(Z1, A1, Z2, A2, Z3, A3, Y):
  dZ3 = A3 - Y  # size (1x20)

  dW3 = np.zeros((1, 3))
  db3 = np.zeros((1, 1))
  for i in range(20):
    a2 = A2[:, [i]]
    dz3 = dZ3[:, [i]]
    dW3 += (1.0/20) * np.matmul(dz3, a2.T)
    db3 += (1.0/20) * dz3

  dA2 = np.zeros((3, 20))
  for i in range(20):
    dz3 = dZ3[:, [i]]
    dA2[:, [i]] = dz3 * W3.T

  dZ2 = dA2 * (A2 * (1-A2)) # size (3x20)

  dW2 = np.zeros((3, 3))
  db2 = np.zeros((3, 1))
  for i in range(20):
    a1 = A1[:, [i]]
    dz2 = dZ2[:, [i]]
    dW2 += (1.0/20) * np.matmul(dz2, a1.T)
    db2 += (1.0/20) * dz2

  dA1 = np.zeros((3, 20))
  for i in range(20):
    dz2 = dZ2[:, [i]]
    dA1[:, [i]] = np.matmul(W2.T, dz2)

  dZ1 = dA1 * (A1 * (1-A1)) # size (3x20)

  dW1 = np.zeros((3, 2))
  db1 = np.zeros((3, 1))
  for i in range(20):
    x = X[:, [i]]
    dz1 = dZ1[:, [i]]
    dW1 += (1.0/20) * np.matmul(dz1, x.T)
    db1 += (1.0/20) * dz1

  return dW1, db1, dW2, db2, dW3, db3

def gradient_descent(W1, b1, W2, b2, W3, b3, dW1, db1, dW2, db2, dW3, db3, alpha):
  W1 = W1 - alpha * dW1
  b1 = b1 - alpha * db1
  W2 = W2 - alpha * dW2
  b2 = b2 - alpha * db2
  W3 = W3 - alpha * dW3
  b3 = b3 - alpha * db3

  return W1, b1, W2, b2, W3, b3

L_cache = []
models_cache = []

for i in range(10000):
  Z1, A1, Z2, A2, Z3, A3 = forward_propagate(X, W1, b1, W2, b2, W3, b3)

  L = (1.0 / 20) * get_loss_numerically_stable(Y, Z3)

  # dW1, db1, dW2, db2, dW3, db3 = get_gradients_matrix_ops(Z1, A1, Z2, A2, Z3, A3, Y)
  dW1, db1, dW2, db2, dW3, db3 = get_gradients_loops(Z1, A1, Z2, A2, Z3, A3, Y)

  W1, b1, W2, b2, W3, b3 = gradient_descent(W1, b1, W2, b2, W3, b3, dW1, db1, dW2, db2, dW3, db3, ALPHA)
  m = Model(W1, b1, W2, b2, W3, b3)

  models_cache.append(m)
  L_cache.append(L)

  if L < 0.02:
    break

plt.grid()
plt.title('Loss', size=18)
plt.xlabel('Number of iterations', size=15)
plt.ylabel('Loss', size=15)
plt.plot(L_cache)

plt.savefig('image.png')

plt.close()
plt.clf()
plt.cla()
(Executed in 4.809 seconds.)

Let's see our inferences on the same points using our classifier. We are doing pretty well on inference (100% accuracy).

Z1, A1, Z2, A2, Z3, A3 = forward_propagate(X, W1, b1, W2, b2, W3, b3)
print(Y)
print(A3)
(Executed in 0.001 seconds.)
[[0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 1 1 1 1 1]]
[[0.022 0.041 0.98  0.993 0.019 0.988 0.981 0.028 0.014 0.984 0.043 0.015
  0.975 0.986 0.025 0.992 0.978 0.992 0.979 0.984]]

Backpropagation using Matrix Operations

Now, we will try to make things faster by using matrix operations for the backprop. Doing so eliminates the inner for loops.

To calculate $\frac{dL}{dW^{(3)}}$, using matrix operations:

$$ \frac{dL}{dW^{(3)}} = \begin{matrix} (\frac{dL}{dZ^{(3)}}) \\ \begin{bmatrix} 0 & 0 & ... & 0 \\ \end{bmatrix}_{1 \times 20} \end{matrix} \begin{matrix} (A^{(2)}.T) \\ \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ \vdots & \vdots & \vdots \\ 0 & 0 & 0 \\ \end{bmatrix}_{20 \times 3} \end{matrix} = \begin{matrix} (\frac{dL}{dW^{(3)}}) \\ \begin{bmatrix} 0 & 0 & 0 \\ \end{bmatrix}_{1 \times 3} \end{matrix} $$ and similarly for $\frac{dL}{db^{(3)}}$: $$ \begin{matrix} (\frac{dL}{dZ^{(3)}}) \\ \begin{bmatrix} 0 & 0 & ... & 0 \\ \end{bmatrix}_{1 \times 20} \end{matrix} \stackrel{\text{average}}{\rightarrow} \begin{matrix} (\frac{dL}{db^{(3)}}) \\ \begin{bmatrix} 0 \\ \end{bmatrix}_{1 \times 1} \end{matrix} $$

Similarly, using matrix operations (again, assuming $W^{(3)}$ and $b^{(3)}$ are fixed and frozen):

$$ \frac{dL}{dA^{(2)}} = \begin{matrix} (W^{(3)}.T)\\ \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix}_{3 \times 1} \end{matrix} \begin{matrix} (\frac{dL}{dZ^{(3)}})\\ \begin{bmatrix} 0 & 0 & ... & 0 \\ \end{bmatrix}_{1 \times 20} \end{matrix} = \begin{matrix} (\frac{dL}{dA^{(2)}}) \\ \begin{bmatrix} 0 & 0 & ... & 0 \\ 0 & 0 & ... & 0 \\ 0 & 0 & ... & 0 \\ \end{bmatrix}_{3 \times 20} \end{matrix} $$

And, finally, using matrix operations:

$$ \frac{dL}{dW^{(2)}} = \begin{matrix} (\frac{dL}{dZ^{(2)}}) \\ \begin{bmatrix} 0 & 0 & ... & 0 \\ 0 & 0 & ... & 0 \\ 0 & 0 & ... & 0 \\ \end{bmatrix}_{3 \times 20} \end{matrix} \begin{matrix} (A^{(1)}.T) \\ \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ \vdots & \vdots & \vdots \\ 0 & 0 & 0 \\ \end{bmatrix}_{20 \times 3} \end{matrix} $$
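As a sanity check that the vectorized form matches the per-sample loops used earlier (a sketch with random data in our shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
W3 = rng.standard_normal((1, 3))    # hypothetical third-layer weights
dZ3 = rng.standard_normal((1, 20))  # hypothetical dL/dZ3, one column per sample

# Column by column (one sample at a time), as in the looped implementation
dA2_loops = np.zeros((3, 20))
for i in range(20):
    dA2_loops[:, [i]] = W3.T * dZ3[:, [i]]

# One matrix multiplication over all samples at once
dA2_matmul = np.matmul(W3.T, dZ3)

print(np.allclose(dA2_loops, dA2_matmul))  # True
```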

Decision Boundary

Let's try to visualize the decision boundary that our final classifier learns.

The decision boundaries of both Logistic Regression and Softmax Regression were linear. However, the decision boundary of this lecture's model is not linear, thanks to our non-linear sigmoid function. (Please see Exercise 1.)

NX = 100
NY = 100

def plot_decision_boundary_lazy(X, Y, W1, b1, W2, b2, W3, b3):
  plt.grid()
  plt.xlim([-2.0, 2.0])
  plt.ylim([-2.0, 2.0])
  plt.xlabel('$x_1$', size=20)
  plt.ylabel('$x_2$', size=20)
  plt.title('Decision boundary - Contour plot', size = 18)

  xs = np.linspace(-2.0, 2.0, NX)
  ys = np.linspace(2.0, -2.0, NY)
  xv, yv = np.meshgrid(xs, ys)
  X_fake = np.stack((xv.flatten(), yv.flatten()), axis=0)
  _, _, _, _, _, predictions = forward_propagate(X_fake, W1, b1, W2, b2, W3, b3)

  predictions = np.stack( (1.0 - predictions, predictions, np.zeros((1, NX * NY))) )

  plt.imshow(predictions.T.reshape(NX, NY, 3), extent=[-2.0, 2.0, -2.0, 2.0])
  plt.scatter(X[0], X[1], s=50, c=colormap[Y[0]])

  plt.savefig('image.png')

plot_decision_boundary_lazy(X, Y, W1, b1, W2, b2, W3, b3)

plt.close()
plt.clf()
plt.cla()
(Executed in 0.214 seconds.)

Here, the color of the background depicts our prediction for that imaginary point. Remember that our prediction is one dimensional and lies in $[0,1]$. If it is $0.0$, we are classifying the point confidently as red, and if it is $1.0$, we are very confident it is green.

In order to create this figure, we simply convert our one-dimensional prediction into three dimensions (RGB) using the following logic: R = 1 - prediction, G = prediction, B = 0. So, for example, if our final prediction is, say, 0.9, then the color at that point will be $[0.1, 0.9, 0]$, which is a fairly dark green.
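A minimal sketch of this color mapping (the helper name is just illustrative):

```python
import numpy as np

def prediction_to_rgb(p):
    # Map a 1-D prediction in [0, 1] to an RGB triple: R = 1 - p, G = p, B = 0
    return np.array([1.0 - p, p, 0.0])

rgb = prediction_to_rgb(0.9)
print(rgb)  # roughly [0.1, 0.9, 0.0] -- mostly green
```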

Similarly, we can plot the same as our classifier progresses through the learning process. As you may guess, it should start from a random point and get smarter in each step.

import matplotlib.animation as animation

NX = 100
NY = 100

xs = np.linspace(-2.0, 2.0, NX)
ys = np.linspace(2.0, -2.0, NY)
xv, yv = np.meshgrid(xs, ys)
X_fake = np.stack((xv.flatten(), yv.flatten()), axis=0)

def get_predictions(X_fake, model):
  _, _, _, _, _, predictions = forward_propagate(X_fake, *model)
  predictions = np.stack( (1.0 - predictions, predictions, np.zeros((1, NX * NY))) )
  return predictions.T.reshape(NX, NY, 3)

fig = plt.figure()

ax = fig.add_subplot(111)
ax.set_xlim([-2.0, 2.0])
ax.set_ylim([-2.0, 2.0])
ax.set_xlabel('$x_1$', size=20)
ax.set_ylabel('$x_2$', size=20)

ax.set_title('Decision boundary - Animated', size = 18)

def animate(i):
  im.set_array(get_predictions(X_fake, models_cache[i * 50]))
  text_box.set_text('Iteration: {}'.format(i * 50))
  return im, text_box

im = ax.imshow(get_predictions(X_fake, models_cache[0]), extent=[-2.0, 2.0, -2.0, 2.0], animated = True)

ax.scatter(X[0], X[1], s=50, c=colormap[Y[0]])
text_box = ax.text(0.6, 1.6, 'Iteration 0', size = 16)

anim = animation.FuncAnimation(fig, animate, len(models_cache) // 50, blit=True, interval=500)
anim.save('animation.mp4', writer='avconv', fps=6, codec="libx264")

plt.close()
plt.clf()
plt.cla()
(Executed in 7.539 seconds.)

As you can see, it starts from a random classifier that does not seem to be working well in the beginning. And the learning process figures out where to go next to find a better classifier. After the learning is done, the final classifier is pretty good, in fact it has 100% accuracy.

Applying Simple Neural Network using low-level Tensorflow APIs

Here is how to train the same classifier for the above red and green points using the low-level TensorFlow API. It produces the exact same output as our own hand-crafted model. Notice that this is because we start both models from the same initial random starting point (same $W$s and same $b$s) and use the same learning rate.

import tensorflow as tf

t_X = tf.placeholder(tf.float32, [2, None])
t_Y = tf.placeholder(tf.float32, [1, None])

# t_W1 = tf.Variable(tf.random_uniform((3, 2)))
t_W1 = tf.Variable(W1_initial.astype('f'))
t_b1 = tf.Variable(tf.zeros([3, 1]))

# t_W2 = tf.Variable(tf.random_uniform((3, 3)))
t_W2 = tf.Variable(W2_initial.astype('f'))
t_b2 = tf.Variable(tf.zeros([3, 1]))

#t_W3 = tf.Variable(tf.random_uniform((1, 3)))
t_W3 = tf.Variable(W3_initial.astype('f'))
t_b3 = tf.Variable(tf.zeros([1]))

t_Z1 = tf.matmul(t_W1, t_X) + t_b1
t_A1 = tf.sigmoid(t_Z1)

t_Z2 = tf.matmul(t_W2, t_A1) + t_b2
t_A2 = tf.sigmoid(t_Z2)

t_Z3 = tf.matmul(t_W3, t_A2) + t_b3
t_A3 = tf.sigmoid(t_Z3)

t_Loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits = t_Z3,  labels = t_Y))

train = tf.train.GradientDescentOptimizer(ALPHA).minimize(t_Loss)
init = tf.global_variables_initializer()

with tf.Session() as session:
   session.run(init)
   losses = []
   for i in range(10000):
      ttrain, ttloss = session.run([train, t_Loss], feed_dict={t_X:X, t_Y:Y})
      losses.append(ttloss)

      if ttloss < 0.02:
        break

plt.grid()
plt.plot(losses)
plt.title('Tensorflow Loss', size = 18)
plt.xlabel('Number of iterations', size=15)
plt.ylabel('Loss', size=15)
plt.savefig('image.png')

plt.close()
plt.clf()
plt.cla()
(Executed in 2.187 seconds.)

Exercises

  1. We claim that the decision boundary of the 3-layer simple neural network explained in this chapter is not linear because we use a non-linear activation function (sigmoid). However, we used the same sigmoid function in Logistic Regression as well, and the decision boundary there was linear. Also, the softmax function doesn't look linear at all. (Actually, what is the definition of, or the test for, a linear function?) And, assuming sigmoid is a non-linear activation function, why did we get a linear decision boundary in Logistic Regression but not in the 3-layer simple neural network?
