We now have *all* the groundwork required to code up our ANN to classify our toy example (the *make moons* dataset kindly provided by sklearn).

### Recap

Recall that a motivation for exploring ANNs was that our standard Logistic Regression classification technique could only ever give us a *linear* separating boundary:

### ANN structure and Matrices

In the last blog post we derived general expressions for a single hidden layer feedforward ANN. The results were as follows.

For a network with the following features:

- 2 inputs, $x_1$, $x_2$
- 1 Hidden Layer with *n* neurons (we will treat *n* as a parameter and see how it affects the output classification)
- 2 outputs, $\hat{y}_1$, $\hat{y}_2$
- Activation function: $\sigma(x) = \tanh(x)$
- Softmax output
- Loss function: Cross entropy loss $$L = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \ln{\hat{y}_{n,i}}$$
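
As a quick concreteness check (the numbers here are my own toy values, not from the post), for one-hot labels the inner sum of the loss just picks out $-\ln{\hat{y}}$ of the correct class:

```
# Toy check of the cross entropy loss (illustrative numbers only)
import numpy as np

y_true = np.array([[1, 0], [0, 1]])          # one-hot labels for N=2 samples, C=2 classes
y_hat  = np.array([[0.9, 0.1], [0.3, 0.7]])  # softmax outputs

loss = -np.mean(np.sum(y_true * np.log(y_hat), axis=1))
print(loss)  # -(ln(0.9) + ln(0.7)) / 2, roughly 0.231
```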

#### Forward Propagation Summary

- $Z^{(1)} = xW^{(1)} + b^{(1)}$
- $a^{(1)} = \tanh(Z^{(1)})$
- $Z^{(2)} = a^{(1)}W^{(2)} + b^{(2)}$
- $\hat{y} \equiv a^{(2)} = softmax(Z^{(2)})$

#### Backpropagation Summary

- $\delta^{(2)} = \hat{y} - y$
- $\delta^{(1)} = \left(1-\tanh^{2}(Z^{(1)})\right) \odot \delta^{(2)}{W^{(2)}}^T$
- $\frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$
- $\frac{\partial L}{\partial b^{(2)}} = \sum_{rows} \delta^{(2)}$
- $\frac{\partial L}{\partial W^{(1)}} = {x}^T \delta^{(1)}$
- $\frac{\partial L}{\partial b^{(1)}} = \sum_{rows} \delta^{(1)}$
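
Before writing the full implementation, it can help to sanity check the matrix shapes these expressions imply. A throwaway sketch (the demo variable names are my own) with $N = 5$ samples and $n = 3$ hidden neurons:

```
# Throwaway shape check of the expressions above (demo variables only)
import numpy as np
N, n = 5, 3
X_demo = np.random.randn(N, 2)
W1_demo, b1_demo = np.random.randn(2, n), np.zeros((1, n))
W2_demo, b2_demo = np.random.randn(n, 2), np.zeros((1, 2))

a1_demo = np.tanh(X_demo.dot(W1_demo) + b1_demo)     # (N, n)
Z2_demo = a1_demo.dot(W2_demo) + b2_demo             # (N, 2), pre-softmax scores

d2_demo = np.random.randn(N, 2)                      # stand-in for yhat - y, shape (N, 2)
d1_demo = (1 - a1_demo**2) * d2_demo.dot(W2_demo.T)  # (N, n)
print(a1_demo.T.dot(d2_demo).shape)                  # (n, 2), matches W2, i.e. dL/dW2
print(X_demo.T.dot(d1_demo).shape)                   # (2, n), matches W1, i.e. dL/dW1
```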

### Implementation in Python

Now, instead of defining our network and neurons as a class, I've decided to closely follow this page (which actually inspired all of these ANN-related posts) as it is more transparent and easier to grasp what is going on in the code. So without further delay, I present the code for an ANN with a single hidden layer whose size is left as a parameter:

### Code

```
import numpy as np
import sklearn.datasets
import pandas as pd
import matplotlib.pyplot as plt
```

```
np.random.seed(0)

X, y = sklearn.datasets.make_moons(200, noise=0.20)

df = pd.DataFrame()
df['x1'] = X[:,0]
df['x2'] = X[:,1]
df['z'] = y
```
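
The DataFrame is only there to make plotting the raw data convenient. The plotting code isn't included in the post; a minimal sketch (my own code) of the scatter plot of the two classes might look like this:

```
# Rough sketch of a scatter plot of the two moon-shaped classes (my own plotting code)
plt.scatter(df['x1'], df['x2'], c=df['z'], cmap=plt.cm.Spectral)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
```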

```
# define ANN which takes inputs X and y, with parameters n_iter (number of iterations),
# learning_rate (the rate used in gradient descent) and regularisation_rate
def calculate(X, y, num_hidden_neurons, n_iter=2000, learning_rate=0.01, regularisation_rate=0):

    loss = []

    # initialise weights and biases
    X_size = len(X)
    np.random.seed(0)
    W1 = np.random.randn(2, num_hidden_neurons) / np.sqrt(2)
    b1 = np.zeros((1, num_hidden_neurons))
    W2 = np.random.randn(num_hidden_neurons, 2) / np.sqrt(num_hidden_neurons)
    b2 = np.zeros((1, 2))

    model = {}

    for i in range(n_iter):

        # feedforward
        Z1 = X.dot(W1) + b1
        a1 = np.tanh(Z1)
        Z2 = a1.dot(W2) + b2
        yhat = softmax(Z2)

        # backpropagation
        d2 = yhat
        d2[range(X_size), y] -= 1
        d1 = (1 - a1**2) * d2.dot(W2.T)
        dW2 = a1.T.dot(d2)
        db2 = np.sum(d2, axis=0, keepdims=True)
        dW1 = X.T.dot(d1)
        db1 = np.sum(d1, axis=0)

        # regularisation
        dW1 += regularisation_rate * W1
        dW2 += regularisation_rate * W2

        # gradient descent
        W1 = W1 - learning_rate * dW1
        b1 = b1 - learning_rate * db1
        W2 = W2 - learning_rate * dW2
        b2 = b2 - learning_rate * db2

        model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        loss.append(predict(model, X, y, True))

    return loss, model
```

### Comments

The code is fairly self-explanatory, but I'll walk through the steps:

- Importing all the libraries we need
- Initialisation of the weights and biases (more on this later)
- Feedforward step - implemented via matrices exactly how we calculated them in the previous blog post
- Backpropagation step - again, implemented just how we derived it before. Note that as the $b^{(i)}$ matrices consist of a single unique row each, we let Python broadcast them into the appropriate dimensions when it does the matrix addition in the calculation of $Z^{(1)}$ and $Z^{(2)}$ (see the short broadcasting sketch after this list).
- Gradient descent to alter the weights after each iteration in order to minimise the loss function.
- Regularisation rate - I haven't mentioned this before; I've put it in there but set it to $0$ so as to have no effect on the results. I will discuss this term later.
- Output a dictionary *model* which contains our weight and bias matrices $W^{(1)}$, $b^{(1)}$, $W^{(2)}$, $b^{(2)}$.
- Note that I call a function (that's not defined here) called *predict* which will make predictions based on the input $X$ or return the cross entropy loss of a set of weights and biases when applied to $X$. The calculation of the cross entropy loss is the reason that $y$ is an input into the overall *calculate* function; if we were just calculating the weights and biases we would only require $X$ as an input. See the code for *predict* at the bottom of this post.
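
As a quick illustration of the broadcasting mentioned above (my own toy example, not from the post), a $(1, n)$ bias row is automatically stretched across all rows when added to an $(N, n)$ matrix:

```
# Toy broadcasting example: XW1 has shape (4, 3), b1_demo has shape (1, 3)
XW1 = np.zeros((4, 3))
b1_demo = np.array([[1.0, 2.0, 3.0]])
print((XW1 + b1_demo).shape)  # (4, 3): the single row of b1_demo is added to every row of XW1
```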

### Results

```
#function call - 3 neurons in the hidden layer
loss, model = calculate(X, y, 3)
```

The decision boundary is plotted below
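
(The plotting code for the decision boundary isn't included in the post. A rough sketch of a helper I put together myself, built on the *predict* function defined at the bottom of this post, could look something like this.)

```
# Rough sketch of a decision boundary plot (my own helper, not from the original post)
def plot_decision_boundary(model, X, y):
    # evaluate the model's predictions on a grid covering the data
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
    Z = predict(model, np.c_[xx.ravel(), yy.ravel()], y)
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.4)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
    plt.show()

plot_decision_boundary(model, X, y)
```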

Notice the non-linear decision boundary, and how well it captures the separation of data points! This gives us great hope: even when the data generating process is distinctly non-linear, where the standard GLM (i.e. logistic regression) decision boundaries are crude, with an ANN we can potentially get a good fit!

The corresponding cross entropy loss as a function of iterations is also plotted below:
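
(Again, the plotting code isn't in the post; something along these lines, using the `loss` list returned by *calculate*, would produce the curve.)

```
# Plot the cross entropy loss recorded at each iteration (my own plotting code)
plt.plot(loss)
plt.xlabel('iteration')
plt.ylabel('cross entropy loss')
plt.show()
```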

Ok, the overall trend of the loss looks alright, but what the hell are the spikes??? Stay tuned...

What about if we have 10 neurons in the hidden layer?

```
#function call - 10 neurons in the hidden layer
loss, model = calculate(X, y, 10)
```

The decision boundary is plotted below

and the corresponding loss:

Now the spikes are gone and the loss is lower than in the 3 neuron case. Whoa, this is a *really* good fit, or is it? How good is too good? Obviously here, with 10 neurons the ANN is not only fitting the overall general shape of the data, but it's capturing the noise from the input too. Just looking at the data (ignoring the decision boundary) we can imagine a kind of smooth sinusoidal shape separating the classes (as in the 3 neuron case above). Notice the jagged edge near the lone red dot amongst the blue dots at coordinates (-0.5, 0.7); you can see that the single red training example has 'pulled' the decision boundary towards it. This is an ominous sign, as the decision boundary is far too sensitive to a single data point. Remember, the idea is to capture the essence of the data generating process *excluding* noise.
If we were to assess the performance of our ANN purely on the cross entropy loss, it would seem that the network with 10 neurons in the hidden layer outperforms the one with 3. However, we *always* need to test the ANN on data it hasn't seen before; cross validation is usually employed to do so.

Well, there it is. If you've followed the posts from pt. 1 of this series, knowing the bare minimum about ANNs and armed only with a pen, paper and a desire to learn, then you should be able to implement your own feedforward ANN from scratch! It was a laborious task (all the matrix algebra, etc.) but I hope it provided insight into the construction and architecture of ANNs.

### Useful functions

```
# Predict y_hat from input X or calculate loss
def predict(model, X, y, calc_loss=False):

    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']

    # Forward propagation
    Z1 = X.dot(W1) + b1
    a1 = np.tanh(Z1)
    Z2 = a1.dot(W2) + b2
    probs = softmax(Z2)

    if calc_loss:
        loss = -np.log(probs[range(len(X)), y])
        tot_loss = np.sum(loss)
        return 1./len(X) * tot_loss

    return np.argmax(probs, axis=1)


# Compute softmax values for each set of scores in x
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)
```

Given a model, the *predict* function returns a classification (0 or 1) for our input data X, or the cross entropy loss. This is used to keep track of the loss at each iteration of the weight updates. Note that this is just for illustrative purposes; since it is an expensive calculation, in reality it isn't wise to call it *every* iteration. Every 100 or 1000 iterations or so would do. *softmax* is used to calculate the softmax of the entries of a vector, which is used to calculate our $\hat{y}$.
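
For example, a quick (hypothetical, my own snippet) check of the training accuracy using *predict*:

```
# Hypothetical usage: classify the training data and compare against the true labels
predictions = predict(model, X, y)
accuracy = np.mean(predictions == y)
print(accuracy)
```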

### Next steps

Currently I'm working my way through the CS229 Machine Learning course by Andrew Ng, which takes up the majority of my spare time. It's a fantastic course with accompanying lectures available on YouTube, and is a great introduction to ML techniques.

However, in the future I aim to continue with this implementation and tweak it to explore the following:

- Why the weights were initialised in such a peculiar way (why not just set them all to 0?)
- Extending to more than 1 hidden layer
- How to prevent overfitting
- Regularisation (I mentioned it above, but did not explain what/how/why)
- Dropout techniques
- Stochastic gradient descent - a variant of our *batch gradient descent* which is industry practice when training an ANN (due to massive datasets)
- Different applications - when I get around to it, I'll implement an ANN for the classic MNIST handwritten digit recognition problem.
