## Thursday, 12 January 2017

### Feedforward Artificial Neural Network pt4: Implementation!

So if you've survived this far, it's all worth it! We've done all the groundwork required to code up our ANN to classify our toy example (make moons dataset kindly provided by sklearn).

### Recap

Recall that a motivation for exploring ANN's was that our standard Logistic Regression classification technique could only ever give us a linear separating boundary:

### ANN structure and Matrices

In the last blog post we derived general expressions for a single hidden layer feedforward ANN. The results were;
For a network with the following features:
• 2 inputs, $x_1$, $x_2$
• 1 Hidden Layer with n neurons - we will treat n as a parameter and see how it affects the output classification
• 2 outputs, $\hat{y}_1$, $\hat{y}_2$
• Activation function: $\sigma(x) = \tanh(x)$
• Softmax output
• Loss function: Cross entropy loss $$L = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \ln{\hat{y}_{n,i}}$$

#### Forward Propagation Summary

• $Z^{(1)} = xW^{(1)} + b^{(1)}$
• $a^{(1)} = \tanh(Z^{(1)})$
• $Z^{(2)} = a^{(1)}W^{(2)} + b^{(2)}$
• $\hat{y} \equiv a^{(2)} = softmax(Z^{(2)})$
Backpropagation Summary

• $\delta^{(2)} = \hat{y} - y$
• $\delta^{(1)} = (1-\tanh^{2}(Z^{(1)}) \odot \delta^{(2)}{W^{(2)}}^T$
• $\frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$
• $\frac{\partial L}{\partial b^{(2)}} = \sum_{rows} \delta^{(2)}$
• $\frac{\partial L}{\partial W^{(1)}} = {x}^T \delta^{(1)}$
• $\frac{\partial L}{\partial b^{(1)}} = \sum_{rows} \delta^{(1)}$
With the above, we have entirely defined how our ANN classifies training examples, how the weights and biases of the network affect the loss function and hence have our gradient descent update rule completely specified (up to the learning rate parameter).

### Implementation in Python

Now instead of defining our network and neurons as a class, I've decided to closely follow this page (which actually inspired all of these ANN related posts) as it is more transparent and easier to grasp what is going on in the code. So without further delay, I present the code for an ANN with a single hidden layer of size 3:

### Code

 import numpy as np import sklearn.datasets
import pandas as pd import matplotlib.pyplot as plt 
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
df = pd.DataFrame()
df['x1'] = X[:,0]
df['x2'] = X[:,1]
df['z'] = y

##define ANN which takes inputs X and y with parameters n_iter as number of iterations and learning_rating the rate used in gradient descent
def calculate(X ,y, num_hidden_neurons, n_iter=2000, learning_rate = 0.01, regularisation_rate = 0):
loss = [];
#initialise weights and biases
X_size = len(X)
np.random.seed(0)
W1 = np.random.randn(2, num_hidden_neurons) / np.sqrt(2)
b1 = np.zeros((1, num_hidden_neurons))
W2 = np.random.randn(num_hidden_neurons, 2) / np.sqrt(num_hidden_neurons)
b2 = np.zeros((1, 2))

model = {}

for i in xrange(n_iter):
#feedforward
Z1 = X.dot(W1)+b1
a1 = np.tanh(Z1)
Z2 = a1.dot(W2)+b2
yhat = softmax(Z2)
#backpropagation
d2 = yhat
d2[range(X_size),y] -= 1
d1 = (1-a1**2)*d2.dot(W2.T)
dW2 = a1.T.dot(d2)
db2 = np.sum(d2, axis=0, keepdims = True)
dW1 = X.T.dot(d1)
db1 = np.sum(d1, axis=0)
#regularisation
dW1 += regularisation_rate * W1
dW2 += regularisation_rate * W2
W1 = W1 - learning_rate * dW1
b1 = b1 - learning_rate * db1
W2 = W2 - learning_rate * dW2
b2 = b2 - learning_rate * db2

model = {'W1' : W1, 'b1' : b1, 'W2' : W2, 'b2' : b2}
loss.append(predict(model,X,y,True))

return loss, model

The code is fairly self explanatory, but I'll walk through the steps
1. Importing all the libraries we need
2. Initialisation of the weights and biases (more on this later)
3. Feedforward step - implemented via matrices exactly how we calculated them in the previous blog post
4. Backpropagation step, again - implemented just how we derived them before. Note as the $b^{(i)}$ matrices consist of a single unique row each, we let python broadcast them into the appropriate dimensions when it does the matrix addition in the calculation of $Z^{(1)}$ and $Z^{(2)}$.
5. Gradient descent to alter the weights after each iteration in order to minimize the loss function.
6. Regularisation rate - I haven't mentioned this before, I've put it in there but set it to $0$ as to have no effect on the results. I will discuss this term later.
7. Output a dictionary Model which contains our weight and bias matrices $W^{(1)}$, $b^{(1)}$, $W^{(2)}$, $b^{(2)}$.
8. Note that I call a function (that's not defined here) called predict which will make predictions based on the input $X$ or return the cross entropy loss of a set of weights and biases when applied to $X$. The calculation of the cross entropy loss is the reason that $y$ is an input into the overall calculate function; if we were just calculating the weights and biases we would only require $X$ as an input. See code for predict at the bottom of this post.

### Results

#function call - 3 neurons in the hidden layer
loss, model = calculate(X, y, 3)
The decision boundary is plotted below
Notice the non-linear decision boundary, and how well it captures the separation of data points! This gives us great hope that if the data generating process is distinctly non-linear then even though the standard GLM (i.e logistic regression) decision boundaries are crude, with an ANN we can potentially get a good fit!
The corresponding cross entropy loss as a function of iterations is also plotted below:
Ok, the overall trend of the loss looks alright, but what the hell are the spikes??? Stay tuned...

What about if we have 10 neurons in the hidden layer?
#function call - 10 neurons in the hidden layer
loss, model = calculate(X, y, 10)
and the corresponding loss:
Now the spikes are gone and the loss is lower than in the 3 neuron case. Whoa this is a really good fit, or is it? How good is too good? Obviously here, with the 10 neurons the ANN is not only fitting the overall general shape of the data, but its capturing the noise from the input too. Just looking at the data (ignoring the decision boundary) we can imagine a kind of smooth sinusoidal shape separating the classes (as in the 3 neuron case above). Notice the jagged edge near the long red dot amongst the blue dots at coordinates (-0.5, 0.7), you can see that the single red training example has 'pulled' the decision boundary towards it. This is an ominous sign as the decision boundary is far too sensitive to a single data point, remember the idea is to capture the essence of the data generating process excluding noise.
If we were to assess the performance of our ANN purely on the cross entropy loss - it would seem that the network with 10 neurons in the hidden layer outperforms the one with 3. However we always need to test the ANN on data it hasn't seen before - cross validation is usually employed to do so.

Well there it is, if you've followed the posts from pt.1 of this series knowing the bare minimum about ANNs only armed with a pen, paper and a desire to learn, then you should be able to implement your own feedforward ANN from scratch! It was a laborious task (all the matrix algebra, etc) but I hope it provided insight into the construction and architecture of ANNs.

### Useful functions

# Predict y_hat from input X or calculate loss
def predict(model, X, y, calc_loss = False):
W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
# Forward propagation
Z1 = X.dot(W1) + b1
a1 = np.tanh(Z1)
Z2 = a1.dot(W2) + b2
probs = softmax(Z2)
if calc_loss:
loss = -np.log(probs[range(len(X)), y])
tot_loss = np.sum(loss)
return 1./len(X) * tot_loss
return np.argmax(probs, axis=1)
#Compute softmax values for each sets of scores in x.
def softmax(x):
return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims = True)
Given a model, the predict function returns a classification (0 or 1) for our input data X or the cross entropy loss. This is used to keep track of the loss at each iteration of the weight updates. Note that this is just for illustrative purposes, since it is an expensive calculation in reality, it isn't wise to call it every iteration - maybe every 100, or 1000 or so.

Softmax is used to calculate the softmax of the entries of a vector - used to calculate our $\hat{y}$.

### Next steps

Currently I'm working my way through the CS229 Machine Learning course by Andrew Ng which takes up the majority of my spare time. It's a fantastic course which has accompanying lectures available on youtube and is a great introduction to ML techniques.

However, in the future I am to continue with this implementation but tweak it to explore the following
• Why the weights were initialised in such a peculiar way (why not just set them all to 0?)
• Extending to more than 1 hidden layer
• How to prevent overfitting
• Regularisation (I mentioned it above, but did not explain what/how/why)
• Drop out techniques
• Stochastic gradient descent
• A variant of our batch gradient descent which is industry practise when training an ANN (due to massive datasets)
• Different applications - when I get around to it, I'll implement an ANN for the classic MNIST handwritten digit recognition problem.