The tricky part about training ANNs is that the loss function isn't necessarily convex, which means that we can't use our usual optimisation routines. The fact that the loss function isn't necessarily convex means that just because we find a local minimum, it doesn't mean it's the global minimum. Thus non convex problems may converge on different local minima depending on the parameters of the optimisation routine. We'll explore some of these parameters below.

### Learning Rate

Recall how the learning rate $\eta$ enters our optimisation procedure via the weight updates in gradient descent; $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ It essentially controls the step size at each update. Recall we had some funny bumps in our loss function at certain iterations, let's take a closer look: I've plotted the results of two different iterations of training the ANN below
The two lines correspond to the total loss as a function of the number of iterations in our training of the ANN. The blue line has $\eta = 0.001$ and the green line has $\eta = 0.01$. You can see that the green line has those funny bumps we witnessed before - this is the training example with a larger learning rate. The spikes occur when the step size is too large and we overshoot the minimum. Notice that the blue line doesn't have these overshoots, however it takes more iterations to approach the minimum. If we take a step size which is

*too large,*then we consistently overshoot the minima - never converging on the minimum:Depending on the problem at hand, you may value accuracy more than efficiency or vice versa, this will dictate how you choose your learning rate, of which you will usually calculate using cross validation.

### Regularisation / Weight Decay

Say we have our initial loss function (the cross entropy Loss) $L_0$ and we add a*regularisation*term such that we now have $$L = L_0 + \frac{\lambda}{2n} \sum_{w} w^2$$ where the sum is over all weights. Now if $\lambda$ is large then the second term will dominate $L$ and the task of optimising the entire expression will be reduced to minimising $\sum_w w^2$. If $\lambda$ is small then the first term dominates and there are less restrictions place on $w$. This regularisation term controls $w$ by preventing it from becoming overly large and helps us from overfitting the model. If we want to use gradient descent to minimise this regularised loss function we have $$ \frac{\partial L}{\partial w} = \frac{\partial L_0}{\partial w} + \frac{\lambda}{n} \sum_w w$$ so our update at each iteration is $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ becomes $$ w \rightarrow w - \eta \frac{\partial L_0}{\partial w} - \frac{\eta \lambda}{n} w$$ $$\implies w \rightarrow \left(1 - \frac{\eta \lambda}{n} \right) w - \eta \frac{\partial L_0}{\partial w}$$ That is at each update, the weight $w$ is rescaled by a factor of $\left( 1 - \frac{\eta \lambda}{n} \right)$ at each iteration; this is referred to as

*weight decay*and as mentioned before, limits the magnitude of $w$.

### Weight initialisation

In this section we'll take a look at why we chose to intialise our values as we did (from a normal distribution with specific parameters). Recall the definition of the weight update from our gradient descent algorithm $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$, if the second term in this expression is small or zero, then there is effectively no (or very little) weight update to $w$. This causes our training to slow down incredibly, such that after each iteration our weight $w$ is only changing ever so slightly; obviously we would like to avoid this situation at the start of the procedure. Recall the backpropagation rules for $W^{(1)}$:
we see that $(1-\tanh^{2}(Z^{(1)})$ term enters the equation (more generally, this will be the derivative of the activation function). So we have our update to the weight $w$ as $$w \rightarrow w - \eta (1-\tanh^{2}(Z^{(1)})) \odot {x}^T \delta^{(2)}{W^{(2)}}^T$$ That is the amount we update $w$ by is proportional to the derivative of our activation function. Thus we want to avoid initialising our weights in a region where this derivative is close to zero. Below is a plot of the function

We can see that this activation function has its derivative approach zero at both extremes: as $x \rightarrow \infty$ and as $x \rightarrow -\infty$. Let's think about a more general ANN for a moment - suppose we have an ANN with 1000 inputs and a single training example where each input is equal to $1$. We have as usual $$Z^{(1)} = x W^{(1)} +b^{(1)}$$ If we have intialised each entry $W^{(1)}_{ij}$ and $b^{(1)}_j$ as selected from a standard normal distribution (iid), then each entry $Z^{(1)}_{i}$ will be the sum of 1001 iid standard normal variables. Then since the sum of $N$ standard normal variables will have mean $0$ and standard deviation $\sqrt{1001}$ i.e a very wide distribution with a relatively high probability of giving a large negative or positive result (this almost looks like a uniform distribution), the derivative of the activation function will be very close to zero! This isn't what we want.

What about if we initialise with a random normal with mean $0$ and standard deviation $\frac{1}{\sqrt{1000}}$? Now we know that the variance of a sum of iid random normal variances is the sum of the variances so we now have each entry in $Z^{(1)}_{ij}$ has mean $0$ and standard deviation $$\sigma = \sqrt{\frac{1000}{1000}+1} = \sqrt{2}$$ which is a lot narrower than our distribution before - there is a lot smaller chance intialising at values where the derivative of the activation function is close to $0$. Below is a comparison of the resulting initialisation distributions from the toy example - the green line is the resulting distribution for the refined initialisation, where the red line results from initialisation by standard normal variables.

More generally for a given network we will initialise from a Gaussian distribution with mean $0$ and standard deviation $\frac{1}{\sqrt{N_{in}}}$ where $N_{in}$ is the number of inputs into the neural network.

Next time we'll have a look at optimising our network using stochastic gradient descent and maybe play around with some different datasets.

Next time we'll have a look at optimising our network using stochastic gradient descent and maybe play around with some different datasets.