Part 1: Logistic Regression [Draft]

Introduction

Let's say we want to build a model to discriminate between the following red and blue points in 2-dimensional space:

import numpy as np
import matplotlib.pyplot as plt

X = np.array([[-0.1, -0.5,  1.3, -0.6, -1.5,  0.2, -0.3,  0.7,  1.1,
               -1.0, -0.5, -1.3, -1.4, -0.9,  0.4, -0.4,  0.3],
              [ 1.4, -0.1,  0.9,  0.4,  0.4,  0.2, -0.4, -0.8, -1.5,
                0.9, -1.5, -0.45, -1.2, -1.1, -1.3,  0.6, -0.5]])
Y = np.array([[0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]])
colormap = np.array(['r', 'b'])  # label 0 -> red, label 1 -> blue

def plot_scatter(X, Y, colormap, path):
   plt.grid()
   plt.xlim([-2.0, 2.0])
   plt.ylim([-2.0, 2.0])
   plt.xlabel('$x_1$', size=20)
   plt.ylabel('$x_2$', size=20)
   plt.title('Input 2D points', size=18)
   plt.scatter(X[0], X[1], s=50, c=colormap[Y[0]])
   plt.savefig(path)

plot_scatter(X, Y, colormap, 'image.png')

plt.close()
plt.clf()
plt.cla()

In other words, given a point, $(x_1, x_2)$, we want to output either $0$ or $1$. (In this tutorial $0$ represents red, $1$ represents blue.)

We can use Logistic Regression for this problem. In Logistic Regression, we first learn weights ($w_1, w_2$) and a bias ($b$); this phase is called training. Then we use the following formula to predict whether a new point is blue or red; this phase is called prediction or inference.

\begin{equation} \label{eq:inference} \hat{y} = \begin{cases} 0, & \text{if}\ \quad \frac{1}{1+e^{-(w_1x_1 + w_2x_2 + b)}} < 0.5 \\ 1, & \text{otherwise} \end{cases} \end{equation}

In the above equation, $\hat{y}$ denotes our prediction of the label.

The parameters of a Logistic Regression model are the weights ($w_1, w_2$) and the bias ($b$). These parameters are learned with a learning algorithm. Once they are learned, we apply them with the formula above to predict a new sample's class.

Let's make an example prediction for a new point, assuming somebody has already learned some weights ($W$) and a bias ($b$) for us:

$$ W = \begin{bmatrix} 6.33 \\ -4.22 \end{bmatrix}, \quad b=1.99 $$

For a new point ($x_1, x_2$) in the two-dimensional space, say, $x = \begin{bmatrix} 1.1 \\ -0.6 \end{bmatrix}$, we can predict its class using Equation \ref{eq:inference}.

sigmoid = lambda x: 1/(1+np.exp(-x))

W = np.array([6.33, -4.22])     # some magical W
x_new = np.array([1.1, -0.6])   # the new point (named x_new so we do not overwrite our dataset X)
b = 1.99

print(sigmoid(W.dot(x_new) + b))
0.999989716915

Let's try another point, $x = \begin{bmatrix} -1.2 \\ 1.0 \end{bmatrix}$.

sigmoid = lambda x: 1/(1+np.exp(-x))

W = np.array([6.33, -4.22])     # some magical W
x_new = np.array([-1.2, 1.0])   # the new point we want to classify
b = 1.99

print(sigmoid(W.dot(x_new) + b))
5.40255201872e-05

We get a value close to $1$ for the first point and close to $0$ for the second. Remember: if the value is smaller than $0.5$, our prediction is red, and otherwise it is blue.
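As a small illustrative helper (the name `predict` is just something introduced here, not part of the original code), we can wrap the thresholding of Equation \ref{eq:inference} into a one-liner and reproduce both predictions:

# Illustrative helper: returns 1 (blue) if the sigmoid output is >= 0.5,
# and 0 (red) otherwise, exactly as in the inference equation above.
predict = lambda x, W, b: int(sigmoid(W.dot(x) + b) >= 0.5)

print(predict(np.array([1.1, -0.6]), W, b))   # 1 -> blue
print(predict(np.array([-1.2, 1.0]), W, b))   # 0 -> red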

Computation Graph

Here is a visual representation of our model:

or alternatively we can visualize the same:

We typically see the first version of the computation-graph visualization more often than the second one in the literature. The first version looks more compact; however, I personally prefer the second one since it is more explicit about the input and output of each node.

In the second representation, if there is an arrow from node $A$ to node $B$, it means $A$ is needed to compute $B$. It is that simple. We will dive deeper into computation-graph internals in the following tutorials.
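To make this concrete, here is a tiny sketch (plain Python with illustrative variable names) that evaluates the graph node by node, computing each value only from the nodes that point into it:

# Forward pass written node by node, mirroring the computation graph.
x1, x2 = 1.1, -0.6              # input nodes
w1, w2, b = 6.33, -4.22, 1.99   # parameter nodes

z = w1*x1 + w2*x2 + b           # linear node: needs x1, x2, w1, w2, b
y_hat = 1/(1 + np.exp(-z))      # sigmoid node: needs only z

print(z, y_hat)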

We use the sigmoid function as $g(z)$ in logistic regression:

$$ g(z) = \frac{1}{1+e^{-z}} $$
sigmoid = lambda x: 1/(1+np.exp(-x))

def plot_sigmoid():
   plt.grid()
   plt.xlim([-10.0, 10.0])
   plt.ylim([-0.1, 1.1])
   xs = np.arange(-10, 10, 0.001)
   plt.xlabel('$z$', size=20)
   plt.ylabel('$g(z)$', size=20)
   plt.title('Sigmoid function', size=18)
   plt.plot(xs, sigmoid(xs), label=r'$g(z)= \frac{1}{1+e^{-z}}$')
   plt.legend(loc='upper left', fontsize=17)
   plt.savefig('image.png')

plot_sigmoid()

plt.close()
plt.clf()
plt.cla()

Maximum Likelihood Estimation

In training, our goal is to learn three numbers, $w_1, w_2, b$, that best discriminate the red and blue points.

We want to find the $w_1, w_2, b$ that minimize some cost function. Let's attempt to write a cost function for this problem. Say we have two points:

$$x = \begin{bmatrix} -0.1 \\ 1.4 \end{bmatrix}, y=0$$

and similarly:

$$x = \begin{bmatrix} 1.3 \\ 0.9 \end{bmatrix}, y=1$$

We want a classifier that produces a very high $\hat{y}$ when $y=1$ and, conversely, a very low $\hat{y}$ when $y=0$. ($\hat{y}$ represents our prediction; we hope that $\hat{y}$ is close to $y$.)

In other words,

  1. If $y=1$, we want to maximize $\hat{y}$.
  2. If $y=0$, we want to maximize $1-\hat{y}$.

If we combine (1) and (2), we want to maximize:

$$P(y|x) = \hat{y}^y \cdot (1-\hat{y})^{(1-y)}$$

Note that when $y=1$ this reduces to $\hat{y}$, and when $y=0$ it reduces to $1-\hat{y}$, so maximizing it covers both cases. Maximizing the above is equivalent to maximizing its logarithm:

$$\log P(y|x) = \log\left(\hat{y}^y \cdot (1-\hat{y})^{(1-y)}\right) = y\log(\hat{y}) + (1-y)\log(1-\hat{y})$$

or we want to minimize:

$$L(y, \hat{y}) = - \left(y\log(\hat{y}) + (1-y)\log(1-\hat{y})\right)$$

The above formula defines a cost for only one sample. We call this $L$, where $L$ stands for loss. We also need a cost function for multiple samples, which we will call $J$.
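As a quick sanity check (a small illustrative snippet, not part of the training code), $L$ behaves the way we want: it is small when the prediction agrees with the label and large when it does not:

single_sample_loss = lambda y, y_hat: -(y * np.log(y_hat) + (1-y) * np.log(1-y_hat))

print(single_sample_loss(1, 0.99))  # ~0.01, confident and correct -> small loss
print(single_sample_loss(1, 0.01))  # ~4.61, confident and wrong   -> large loss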

Let's start with an example. Say we have three positive samples, and two different classifiers output the following $\hat{y}$ values for those three samples:

Which classifier is better? (Classifier 1 or Classifier 2?)

There are multiple ways to answer this question. One of them is Maximum Likelihood Estimation (MLE). MLE answers it by multiplying each classifier's predicted probabilities and picking the classifier with the larger product:
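For illustration, here is a tiny sketch with hypothetical predicted probabilities (the actual values in the table above may differ):

# Hypothetical y_hat values for the three positive samples (all have y = 1).
classifier_1 = np.array([0.8, 0.9, 0.3])
classifier_2 = np.array([0.7, 0.6, 0.8])

# MLE compares the products of the predicted probabilities.
print(np.prod(classifier_1))  # 0.216
print(np.prod(classifier_2))  # 0.336 -> larger product, so Classifier 2 wins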

So, in this case, Classifier 2 is more likely. More formally, for multiple samples, MLE wants to maximize:

$$ P(Y|X) = \prod P(y|x) $$

This quantity is called the likelihood. Maximizing the above is equivalent to maximizing the log-likelihood:

$$ \log P(Y|X) = \sum \log P(y|x) $$

or, equivalently, we need to minimize:

$$J = - \sum \log P(y|x) = - \sum \left ( y\log(\hat{y}) + (1-y)\log(1-\hat{y}) \right )$$

Here we use the notation $J$ for the quantity we want to minimize over all samples, and $L$ for a single sample. Let's add these to our computation graph visualization:
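In code, the extended graph corresponds to the following forward pass over all samples (a short sketch using the toy data and the "magical" $W$ and $b$ from the prediction example):

# Forward pass over all 17 samples, followed by the total cost J.
z = W.dot(X) + b                # linear node, shape (17,)
y_hat = sigmoid(z)              # sigmoid node, shape (17,)
L_all = -(Y * np.log(y_hat) + (1-Y) * np.log(1-y_hat))  # per-sample loss L
J = np.sum(L_all)               # total cost J
print(J)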

Gradient Descent

Gradient descent is, in my opinion, one of the most beautiful algorithms ever invented. It is an iterative algorithm that makes the output better and better at each step. In each iteration, we take a step in the direction opposite to the gradient; hence the name gradient descent.

By the way, the gradient is essentially a bundle of derivatives: it is the vector of partial derivatives with respect to each parameter.

In order to do gradient descent, we need the derivatives:

$$\frac{dL}{dw_1} = \frac{dL}{d\hat{y}} \frac{d\hat{y}}{dz} \frac{dz}{dw_1}, \quad \frac{dL}{dw_2} = \frac{dL}{d\hat{y}} \frac{d\hat{y}}{dz} \frac{dz}{dw_2}, \quad \frac{dL}{db} = \frac{dL}{d\hat{y}} \frac{d\hat{y}}{dz} \frac{dz}{db}$$

Because we want to do:

$$w_1 := w_1 - \alpha \frac{dL}{dw_1}, \quad w_2 := w_2 - \alpha \frac{dL}{dw_2}, \quad b := b - \alpha \frac{dL}{db} $$
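To see the update rule in isolation, here is a minimal, purely illustrative sketch of gradient descent on the one-dimensional function $f(w) = (w-3)^2$, whose derivative is $2(w-3)$ (this toy function is not part of our model):

# Gradient descent on f(w) = (w - 3)^2.
# Each step moves w against the derivative, so w converges towards 3.
w = 0.0
alpha = 0.1
for i in range(50):
   w -= alpha * 2 * (w - 3)
print(w)  # very close to 3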

Let's do some calculus:

$$ \frac{dL}{d\hat{y}} = \frac{d}{d\hat{y}} \left[ - \left(y\log(\hat{y}) + (1-y)\log(1-\hat{y})\right) \right] = \frac{-y}{\hat{y}} + \frac{1-y}{1-\hat{y}} $$

$$ \frac{d\hat{y}}{dz} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1 + e^{-z} - 1}{(1+e^{-z})^2} = \frac{1 + e^{-z}}{(1+e^{-z})^2} - \frac{1}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} - \left( \frac{1}{1+e^{-z}} \right)^2 = g(z) - g(z)^2 = g(z)(1-g(z)) $$

$$ \frac{dz}{dw_1} = x_1, \quad \frac{dz}{dw_2} = x_2, \quad \frac{dz}{db} = 1 $$

$$ \frac{dL}{dz} = \frac{dL}{d\hat{y}} \frac{d\hat{y}}{dz} = \left( \frac{-y}{\hat{y}} + \frac{1-y}{1-\hat{y}} \right) \hat{y}(1-\hat{y}) = -y(1-\hat{y}) + (1-y)\hat{y} = -y + y\hat{y} + \hat{y} - y\hat{y} = \hat{y} - y $$

$$ \frac{dL}{dw_1} = \frac{dL}{dz} \frac{dz}{dw_1} = (\hat{y} - y)\, x_1, \quad \frac{dL}{dw_2} = \frac{dL}{dz} \frac{dz}{dw_2} = (\hat{y} - y)\, x_2, \quad \frac{dL}{db} = \frac{dL}{dz} \frac{dz}{db} = \hat{y} - y $$

We took the derivative of the sigmoid function while deriving the above gradients:

$$ \frac{d\hat{y}}{dz} = g(z) (1-g(z)) $$

Let's see what this actually looks like:

sigmoid_der = lambda x: sigmoid(x)*(1-sigmoid(x))

def plot_sigmoid_der():
   plt.grid()
   plt.xlim([-10.0, 10.0])
   plt.ylim([-0.1, 0.4])
   xs = np.arange(-10, 10, 0.001)
   plt.xlabel('$z$', size=20)
   plt.ylabel("$g'(z)$", size=20)
   plt.title('Derivative of sigmoid function', size=18)
   plt.plot(xs, sigmoid_der(xs), label=r"$g'(z) = g(z)(1-g(z))$")
   plt.legend(loc='upper left', fontsize=17)
   plt.savefig('image.png')

plot_sigmoid_der()

plt.close()
plt.clf()
plt.cla()

Intuitively, this means that the sigmoid changes quickly near $z=0$, but the change becomes slower as you move away from $0$.
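Before using these gradients, we can sanity-check the result $\frac{dL}{dz} = \hat{y} - y$ numerically with a finite-difference approximation (a small illustrative check, not part of the original code):

# Numerical check of dL/dz = y_hat - y for a single sample.
def loss_of_z(y, z):
   y_hat = sigmoid(z)
   return -(y * np.log(y_hat) + (1-y) * np.log(1-y_hat))

y, z, eps = 1.0, 0.3, 1e-6
numerical = (loss_of_z(y, z + eps) - loss_of_z(y, z - eps)) / (2 * eps)
analytical = sigmoid(z) - y
print(numerical, analytical)  # both approximately -0.4256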

Applying Gradient Descent to Minimize Loss

ALPHA = 0.4 # learning rate

# this simple implementation is numerically unstable, because
# np.log() returns -inf when Y_hat rounds to exactly 0 or 1
def get_loss(Y, Y_hat):
   loss = -1 * np.sum(Y * np.log(Y_hat) +
                      (1-Y) * np.log(1-Y_hat))
   return loss

# semantically the same as the function above, but numerically stable.
def get_loss_numerically_stable(Y, z):
   loss = -1 * np.sum(Y * -1 * np.log(1 + np.exp(-z)) +
                      (1-Y) * (-z - np.log(1 + np.exp(-z))))
   return loss

W_cache = []
b_cache = []
losses_cache = []

# a hand-picked initial value, so that the later plots look nice.
W = np.array([[-4.0], [29.0]])
b = 0

for i in range(20):
   z = np.matmul(W.T, X) + b
   Y_hat = sigmoid(z)
   loss = get_loss_numerically_stable(Y, z)

   dw = np.matmul(X, (Y_hat - Y).T)
   db = np.sum(Y_hat - Y)

   W -= ALPHA * dw
   b -= ALPHA * db

   W_cache.append(W.copy())
   b_cache.append(b)
   losses_cache.append(loss)

plt.grid()
plt.title('Loss', size=18)
plt.xlabel('Number of iterations', size=15)
plt.ylabel('Loss', size=15)
plt.plot(losses_cache)

plt.savefig('image.png')

plt.close()
plt.clf()
plt.cla()
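To back up the accuracy claim below, here is a small check (not part of the original code) that applies the final parameters to the training points:

# Classify the training points with the learned parameters and
# compare against the true labels.
Y_pred = (sigmoid(np.matmul(W.T, X) + b) >= 0.5).astype(int)
print((Y_pred == Y).mean())  # expected: 1.0, i.e. 100% training accuracy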

It turns out we just trained a pretty good classifier for this problem: we achieved 100% accuracy on the training points. Let's try to visualize our decision boundary. Remember: if the value is smaller than $0.5$, our prediction is red, and otherwise it is blue (see Equation \ref{eq:inference}).

So, we predict red if:

$$ \frac{1}{1+e^{-(w_1x_1+w_2x_2+b)}} < 0.5 $$

and blue otherwise. So, our decision boundary is:

$$ \frac{1}{1+e^{-(w_1x_1+w_2x_2+b)}} = 0.5 $$

If we do some math:

$$ e^{-(w_1x_1+w_2x_2+b)} = 1 $$ $$ w_1x_1+w_2x_2+b = 0 $$ $$ x_2 = \frac{-w_1x_1 - b}{w_2} $$

Now, let's compute the value of $x_2$ at the two edges of our plot, $x_1 = -2$ and $x_1 = 2$.
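As a quick check (a small sketch using the final learned parameters), here are the two endpoints of the boundary line that the plot below draws:

# x2 on the decision boundary at the two edges of the plot.
w1, w2 = W_cache[-1].flatten()
b_final = b_cache[-1]
for x1 in (-2.0, 2.0):
   print(x1, (-w1 * x1 - b_final) / w2)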

Decision boundary

def plot_decision_boundary(X, Y, W, b, path):
   plt.grid()
   plt.xlim([-2.0, 2.0])
   plt.ylim([-2.0, 2.0])
   plt.xlabel('$x_1$', size=20)
   plt.ylabel('$x_2$', size=20)
   plt.title('Decision boundary', size = 18)

   xs = np.array([-2.0, 2.0])
   ys = (-W[0] * xs - b)/W[1]

   plt.scatter(X[0], X[1], s=50, c=colormap[Y[0]])
   plt.plot(xs, ys, c='black')
   plt.savefig(path)

plot_decision_boundary(X, Y, W_cache[-1], b_cache[-1], 'image.png')

plt.close()
plt.clf()
plt.cla()

We can do better. Let's watch the decision boundary evolve step by step as an animation. Everybody loves animations. Something everybody loves even more is an animation whose source code you can also see.

import matplotlib.animation as animation

fig = plt.figure()

ax = fig.add_subplot(111)
ax.set_xlim([-2.0, 2.0])
ax.set_ylim([-2.0, 2.0])
ax.set_xlabel('$x_1$', size=20)
ax.set_ylabel('$x_2$', size=20)

ax.set_title('Decision boundary - Animated', size = 18)

def animate(i):
  xs = np.array([-2.0, 2.0])
  ys = (-W_cache[i].flatten()[0] * xs - b_cache[i])/W_cache[i].flatten()[1]
  lines.set_data(xs, ys)

  text_box.set_text('Iteration: {}'.format(i))

  return lines, text_box

lines, = ax.plot([], [], c='black')
ax.scatter(X[0], X[1], s=50, c=colormap[Y[0]])
text_box = ax.text(1.1, 1.6, 'Iteration 0', size = 16)

anim = animation.FuncAnimation(fig, animate, len(W_cache), blit=True, interval=500)
anim.save('animation.mp4', writer='avconv', fps=2, codec="libx264")

plt.close()
plt.clf()
plt.cla()

Let's see the decision boundary in a lazier setting. Here, we simply classify every single point in a grid and then feed the predictions to a contour plot. Compared to the previous animation, the contour plot shows the prediction for every point in the grid using the final parameters of the classifier, whereas the animation shows how the parameters evolve step by step through the gradient descent iterations.

NX = 10
NY = 10

def plot_decision_boundary_lazy(X, Y, W, b, contour_levels):
   plt.grid()
   plt.xlim([-2.0, 2.0])
   plt.ylim([-2.0, 2.0])
   plt.xlabel('$x_1$', size=20)
   plt.ylabel('$x_2$', size=20)
   plt.title('Decision boundary - Contour plot', size = 18)

   xs = np.linspace(-2.0, 2.0, NX)
   ys = np.linspace(-2.0, 2.0, NY)
   xv, yv = np.meshgrid(xs, ys)

   X_fake = np.stack((xv.flatten(), yv.flatten()), axis=0)
   predictions = sigmoid(np.matmul(W.T, X_fake) + b)

   plt.contourf(xv, yv, predictions.reshape((NX, NY)), contour_levels)
   plt.scatter(X[0], X[1], s=50, c=colormap[Y[0]])

   plt.savefig('image.png')

plot_decision_boundary_lazy(X, Y, W_cache[-1], b_cache[-1], 50)

plt.close()
plt.clf()
plt.cla()

Here, a blue background means our prediction is very close to $0$, i.e., the point is confidently classified as red. Conversely, a red background means our prediction is very close to $1$, i.e., the point is confidently classified as blue.

In the greenish middle area we are not confident: the predictions there are close to $0.5$, as expected.

Visualizing the error surface

We can also visualize the cost as a function of $W$ over a 2-D grid. Here, we fix $b$ to its learned value, sweep over a range of $W$ values, and compute the loss for each $W$ with that fixed $b$. This surface has the nice property of being convex.

from mpl_toolkits.mplot3d.axes3d import Axes3D

NX = 100
NY = 100

def get_losses_numerically_stable(Y, z):
   temp = -1 * (Y * -1 * np.log(1 + np.exp(-z)) + (1-Y) * (-z - np.log(1 + np.exp(-z))))
   return np.sum(temp, axis = 0, keepdims = True)

def plot_error_surface(X, Y, b):
   plt.grid()

   xs = np.linspace(-30, 30, NX)
   ys = np.linspace(-30, 30, NY)
   xv, yv = np.meshgrid(xs, ys)

   W_fake = np.stack((xv.flatten(), yv.flatten()), axis=0)

   z = np.matmul(X.T, W_fake) + b
   losses = get_losses_numerically_stable(Y.T, z)

   best_W = W_cache[-1]
   min_loss = np.min(losses)

   fig = plt.figure(figsize=(12,6))
   fig.suptitle('Error surface - Optimal loss', fontsize=17)
   ax = fig.add_subplot(1, 2, 1, projection='3d')
   ax.set_xlabel('$w_1$', size=20)
   ax.set_ylabel('$w_2$', size=20)
   ax.plot_surface(xv, yv, losses.reshape(NX, NY), rstride=4, cstride=4, alpha=0.25)
   ax.scatter(best_W[0], best_W[1], [min_loss], s=30, c='red')

   ax = fig.add_subplot(1, 2, 2, projection='3d')
   ax.set_xlabel('$w_1$', size=20)
   ax.set_ylabel('$w_2$', size=20)
   ax.plot_surface(xv, yv, losses.reshape(NX, NY), rstride=4, cstride=4, alpha=0.25)
   ax.scatter(best_W[0], best_W[1], [min_loss], s=30, c='red')
   ax.view_init(45, 45)

   fig.tight_layout()
   plt.savefig('image.png')

plot_error_surface(X, Y, b)

plt.close()
plt.clf()
plt.cla()

Let's watch the parameters step by step through the iterations of gradient descent. We start from our chosen initial point and keep iterating to refine the parameters.

NX = 100
NY = 100

xs = np.linspace(-30, 30, NX)
ys = np.linspace(-30, 30, NY)
xv, yv = np.meshgrid(xs, ys)

W_fake = np.stack((xv.flatten(), yv.flatten()), axis=0)

z = np.matmul(X.T, W_fake) + b_cache[-1]
losses = get_losses_numerically_stable(Y.T, z)

fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(111, projection='3d')
ax.set_xlim3d([-40, 40.0])
ax.set_ylim3d([-40, 40.0])
ax.set_xlabel('$w_1$', size=20)
ax.set_ylabel('$w_2$', size=20)

ax.set_title('Gradient descent updates', fontsize=17)
ax.plot_surface(xv, yv, losses.reshape(NX, NY), rstride=4, cstride=4, alpha=0.25)

def animate(i):
   graph.set_offsets([W_cache[i][0], W_cache[i][1]])
   graph.set_3d_properties([losses_cache[i]], zdir='z')

   line_data[0].append(W_cache[i].flatten()[0])
   line_data[1].append(W_cache[i].flatten()[1])
   line_data[2].append(losses_cache[i])

   lines.set_data(line_data[0], line_data[1])
   lines.set_3d_properties(line_data[2])

   text_box.set_text('Iteration: {}'.format(i))

   return graph, lines, text_box

line_data = [[], [], []]
graph = ax.scatter([], [], [], s=50, c='red')
lines, = ax.plot([], [], [], c='red')

text_box = ax.text(20.0, 20.0, 500.0, 'Iteration 0', size = 16)

anim = animation.FuncAnimation(fig, animate, len(W_cache), interval=50, blit=True)
anim.save('animation.mp4', writer='avconv', fps=2, codec="libx264")

plt.close()
plt.clf()
plt.cla()

Applying Logistic Regression using low-level Tensorflow APIs

TensorFlow is an open-source software library for Machine Intelligence. Nodes in the TensorFlow graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.

It is very common to write your classifier using TensorFlow APIs rather than plain Python/NumPy, especially if you have a lot of data and want to parallelize computation across multiple machines, CPUs, or GPUs.

Here is how to train the same classifier for the above red and blue points using the low-level TensorFlow API:

import tensorflow as tf

t_X = tf.placeholder(tf.float32, [2, None])
t_Y = tf.placeholder(tf.float32, [1, None])

t_W = tf.Variable([[-4.0, 29.0]])
t_b = tf.Variable(tf.zeros([1]))

t_Z = tf.matmul(t_W, t_X) + t_b
t_Yhat = tf.sigmoid(t_Z)
t_Loss = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits = t_Z,  labels = t_Y))

train = tf.train.GradientDescentOptimizer(0.4).minimize(t_Loss)
init = tf.global_variables_initializer()

with tf.Session() as session:
   session.run(init)
   losses = []
   for i in range(20):
      _, loss_value = session.run([train, t_Loss], feed_dict={t_X: X, t_Y: Y})
      losses.append(loss_value)

plt.grid()
plt.plot(losses)
plt.title('Tensorflow Loss', size = 18)
plt.xlabel('Number of iterations', size=15)
plt.ylabel('Loss', size=15)
plt.savefig('image.png')

plt.close()
plt.clf()
plt.cla()
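As a final sanity check (a small NumPy sketch relying on the formulation documented for tf.nn.sigmoid_cross_entropy_with_logits, namely max(z, 0) - z*y + log(1 + exp(-|z|))), we can confirm that TensorFlow's loss agrees with our earlier numerically stable loss:

# Compare the documented sigmoid_cross_entropy_with_logits formulation
# with our numpy loss at the parameters learned earlier.
z = np.matmul(W_cache[-1].T, X) + b_cache[-1]
tf_style_loss = np.sum(np.maximum(z, 0) - z * Y + np.log(1 + np.exp(-np.abs(z))))
print(tf_style_loss)
print(get_loss_numerically_stable(Y, z))  # should match up to floating point error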

References

  1. Andrew Ng's Machine Learning course on Coursera.
  2. http://colah.github.io/posts/2015-08-Backprop/
  3. http://neuralnetworksanddeeplearning.com/chap2.html
  4. https://theclevermachine.wordpress.com/2014/09/08/derivation-derivatives-for-common-neural-network-activation-functions/
  5. http://ufldl.stanford.edu/tutorial/
  6. https://distill.pub/
  7. http://cs231n.github.io/
  8. https://www.tensorflow.org/