Wednesday, 7 December 2016

Derivative of cross entropy loss with softmax

Derivative of the softmax function

I've dedicated a separate post for the derivation of the derivative of the cross entropy loss function with softmax as the activation function in the output layer, which I'll reference in the future. It's nothing groundbreaking but sometimes it's nice to work through some of the results which are often quoted without derivation. 

The softmax function

The softmax function is often used as the activation function for neurons in the output layer of an ANN. We borrow the notation from a previous blog post, where $\hat{y}$ is the output of a particular neuron in the output layer and $$z_o = \sum_{j \in input(o)} z_j w_{j \rightarrow o}$$ i.e. we sum over all paths $j$ which are an input into our output neuron $o$.

It is defined as the following:
$$ \hat{y}_o = \frac{e^{z_o}}{\sum_k e^{z_k}}$$ where the sum is over the $k$ output neurons in the network. The softmax essentially takes an input $z_o \in \mathbb{R}$ and maps it to the interval $(0,1)$ such that the sum of all outputs is 1 - it gives us a probability distribution over our $k$ outputs (i.e. $k$ different classifications). 

Cross entropy loss function

Recalling the definition of the cross entropy loss function from a previous blog post we have
$$L = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \ln{\hat{y}_{n,i}}$$
where $N$ is the number of training examples and $C$ is the number of classifications (i.e 2 for binary classification, etc...). Let us assume $N = 1$ without loss of generality, thus 
$$L = - \sum_{o} y_{o} \ln{\hat{y}_{o}}$$ where the sum is over the output neurons in our network (given that we have 1 output neuron for each class, the sum over $i \in C$ and $o$ are equivalent). Note that $y_o$ is the actual classification of our data, i.e. $y_o \in \{0,1\}$ and that if $y_o=1$ then all other $y_j =0$ where $ i \neq o$ see a description of one hot encoding. Now our object of interest is $$ \frac{ \partial L}{\partial z_i} = \frac{\partial L}{\partial \hat{y}_o} \times \frac{\partial \hat{y}_o}{\partial z_i}$$ Let's focus on the second term $\frac{\partial \hat{y}_o}{\partial z_i}$ for now and split it into two cases.

  • Case 1: $ i = o$
$$\frac{\partial \hat{y}_o}{\partial z_i} = \frac{\partial \hat{y}_o}{\partial z_o} =  \frac{\partial}{\partial z_o} \left( \frac{e^{z_o}}{\sum_k e^{z_k}} \right) = \frac{e^{z_o}}{\sum_k e^{z_k}} - e^{z_o} (\sum_k e^{z_k})^{-2}e^{z_o}$$
$$= \frac{e^{z_o}}{\sum_k e^{z_k}} \left ( 1 - \frac{e^{z_o}}{\sum_k e^{z_k}} \right) = \hat{y}_o (1 - \hat{y}_o)$$
  • Case 2: $ i \neq o $
$$ \frac{\partial \hat{y}_i}{\partial z_i} = \frac{\partial}{\partial z_i} \left( \frac{e^{z_o}}{\sum_k e^{z_k}} \right) = -e^{z_o}(\sum_k e^{z_k})^{-2} e^{z_i}$$
$$= -\hat{y}_o \hat{y}_i$$ Now $$\frac{\partial L}{\partial \hat{y}_o} = - \sum_o \frac{1}{\hat{y}_o} y_o$$ Therefore $$ \frac{\partial L}{\partial z_i} = -\sum_o \frac{1}{\hat{y}_o} y_o  \times \frac{\partial \hat{y}_o}{\partial z_i}$$ Substituting in our above computations and splitting out the $i=o$ case from the sum we get

$$\frac{\partial L}{\partial z_i} = -\frac{1}{\hat{y}_i} y_i \times \hat{y}_i(1-\hat{y}_i) - \sum_{o \neq i} \frac{1}{\hat{y}_o} y_o \times (- \hat{y}_o \hat{y}_i)$$
$$ = -y_i(1-\hat{y}_i) + \sum_{o \neq i} y_o \hat{y}_i$$ $$= y_i \hat{y}_i -y_i + \sum_{o \neq i} y_o \hat{y}_i$$ Moving the first term inside the sum and summing over all $o$ we get 
$$\frac{\partial L}{\partial z_i} = -y_i + \sum_{o} y_o \hat{y}_i$$ $$ = \hat{y}_i \sum_o y_o - y_i$$ But $\sum_o y_o$ is the sum over all the actual labels of our data which is one hot encoded (as mentioned above). Thus $y_o = 1$ for exactly one class and $0$ for all else - hence $\sum_o y_o = 1$. We finally arrive at a result we'll use in the future in constructing and implementing our ANN to classify the make moons dataset
$$ \frac{\partial L}{\partial z_i}  = \hat{y}_i - y_i$$

Monday, 5 December 2016

Feedforward Artificial Neural Network pt2

Feedforward Artificial Neural Network pt2

In the previous post, we briefly looked at the general structure of a Feedforward ANN and why such a construction should be useful. Now that we have a general handle of the idea behind a Feedforward ANN, we need to train the network - find a method to calculate the appropriate weights for our hidden layer neurons. In the context of our classification case, we want to train the network by providing it a dataset with labels so that it can predict the labels of new data once trained - a classic supervised learning scenario.


Backpropagation is the algorithm we'll use to train (i.e minimize the error function of) our ANN. It is essentially repeated applications of the chain rule - chaining together all possible paths in the network to establish how the error changes as a function of the inputs and outputs of each neuron. This may not seem like a very profound approach, nor may it be clear that this use of the chain rule should deserve it's own name, however the power from this method comes from the implementation. The set up of backpropagation allows the derivatives we use in the chain rule to be calculated using intermediate results in our network. This becomes especially important when the size of our network grows - it is efficient in that it saves the recalculation of derivatives.

Some notation

Suppose we have two neurons labelled by $i$ and $j$ respectively, as depicted below:

where $\sigma_j$ is a label for the neuron with the following meaning:
  • We define the activation function for neuron $i$ as $a_j(w_{i \rightarrow j} \cdot a_i)$, where $w_{i \rightarrow j}$ is the weight applied to the input $a_i$ of neuron $j$. 

Gradient Descent
In order to minimize the Loss function, one might think to simply take the derivatives and throw them into a prebuilt optimisation routine from your favourite library. This approach is fine when the network is small, but when it scales to millions of neurons it becomes computationally intractable. We will in fact use a standard optimisation routine, but it is the clever implementation which makes it so attractive. The optimisation method we'll be using is called the batch gradient descent method, this is in contrast to the stochastic gradient descent method. It is a fairly standard iterative method to minmize a loss function $L\left(w_{i \rightarrow j} \right)$ by updating the target variables by
$$ w_{i\rightarrow j} \rightarrow w_{i\rightarrow j} - \eta \frac{\partial L}{\partial w_{i\rightarrow j}} $$
at each iteration. $\eta$ is the so called learning rate which determines how big/small the steps are at each iteration. Intuitively small steps will take a longer time for the algorithm to converge, while larger steps cover more ground, they may overshoot the true minimum - the ideal value is dependent on the loss function and is somewhere in between.

The Loss Function

Once we have the architecture of our ANN set up, we need to measure its performance. In order to do so, we borrow from information theory, a loss function which is commonly found in ANN classification tasks: the cross entropy loss
$$ L = -\frac{1}{N} \sum_{n \in N} \sum_{ i \in C} y_{n,i} \ln \hat{y}_{n,i} $$
where we sum over N, the number of training examples and C, the number of classes in our classification problem. The hat $\hat{y}$ indicates an output classification from our network, where $y$ is the actual classification of the training sample. This is the Loss function we will use once we begin to classify the examples from the previous blog post. Note that if we were building the ANN with a continuous output (i.e. regression) we could use the usual Residual Sum of Squares
$$L_{reg} = \sum_{i=1}^{N} \frac{1}{2} \left( \hat{y_{i}} - y_{i} \right)^{2} $$
where the factor of a half is chosen for convenience (it prevents stray factors of $\frac{1}{2}$ floating around when we differentiate).
Thus the task of training our neural network then becomes an optimisation problem; minimize the classification error function $L$ with respect to the weights $w_{i \rightarrow j}$ of each neuron in the hidden layer -  see below for an explanation of the notation. Note this is ignoring any potential problems of overfitting.

Illuminating example - a sample ANN

I'm still working my way around TikZ which will allow me to make nice diagrams in LaTeX, but in the interest of time I've borrowed the below image from here. The network pictured has a single input $x_i$, 2 hidden layers and a single output. We will not assume the functional form of the loss function $L$ or the activation functions $\sigma$. 

We want to understand how the output $\hat{y}_i$ changes with respect to the input weight $w_{in \rightarrow i}$, we can use the chain rule to get this relationship. Noting that there are two paths from the input layer to the output, we need to consider each path separately. 
We define the following in the network, where $i$, $j$, $k$, $o$ index the neurons 
$$z_i = w_{in \rightarrow i} x_i,  \quad a_i = \sigma(z_i)$$
$$z_j = w_{i \rightarrow j} a_i, \quad a_j = \sigma(z_j)$$
$$z_k = w_{i \rightarrow k} a_i, \quad a_j = \sigma(z_k)$$
$$z_o = w_{k \rightarrow o} a_k + w_{j \rightarrow o} a_j, \quad \hat{y} \equiv a_j = output(z_o)$$

where $output(x)$ is the output function, which for now we will keep generic and $\sigma(x)$ is the activation function, which is the same for each neuron in the network.

We decompose $\frac{\partial L}{\partial w_{in \rightarrow j}}$: starting at the right side of the graph and moving through $o \rightarrow k \rightarrow i$ as below:

$$\frac{\partial L}{\partial z_o} \times \frac{\partial z_o}{\partial z_k} \times \frac{\partial z_k}{\partial z_i} \times \frac{\partial z_i}{\partial w_{in \rightarrow j}}$$

Let's look at an individual component $$\frac{\partial z_o}{ \partial z_k} = \frac{\partial}{\partial z_k} \left( w_{k \rightarrow o} a_k + w_{j \rightarrow o} a_j \right)$$ noting the second term does not depend on $z_k$
$$\frac{\partial}{\partial z_k}  w_{k \rightarrow o} \sigma(z_k)  = w_{k \rightarrow o} \sigma ' (z_k)$$With a similar analysis we see that the product of derivatives can be expressed as
$$\frac{\partial L}{\partial z_o} \times w_{k \rightarrow o} \sigma ' (z_k) \times w_{i \rightarrow k} \sigma ' (z_i) \times x_i$$
and now moving through $o \rightarrow j \rightarrow i$

$$\frac{\partial L}{\partial z_o} \times \frac{\partial z_o}{\partial z_j} \times \frac{\partial z_j}{\partial z_i} \times \frac{\partial z_i}{\partial w_{in \rightarrow j}}$$
$$=\frac{\partial L}{\partial z_o} \times w_{j \rightarrow o} \sigma ' (z_j) \times w_{i \rightarrow k} \sigma ' (z_i) \times x_i$$
Putting this all together we have $$\frac{\partial L}{\partial w_{in \rightarrow i}} = \frac{\partial L}{\partial z_o} \left[ w_{k \rightarrow o} \sigma ' (z_k) \times w_{i \rightarrow k} \sigma '(z_i) + w_{j \rightarrow o} \sigma ' (z_j) \times w_{i \rightarrow j} \sigma '(z_i) \right] x_i$$
where we have yet to specify the exact form of the loss function $L$ and activation function $\sigma(x)$ in this example. Once these are specified, we can just plug them in to the above formula. Hopefully this simplified example illuminates how the derivatives of the loss function wrt to each neuron is calculated. In terms of backpropagation, we can think of starting out the output and working backwards through each neuron in the path leading to it and understanding how the error changes at each point. In the next section I will generalise these results so they are easily extendable to a larger network.

Error propagation

We will now generalise the above computation, in order to get the derivative of the loss function with respect to any $z_j$ in any of the layers of the network. We define the error signal $$\delta_j = \frac{\partial L}{\partial z_j}$$ Now, using the chain rule as in the example above we have $$\delta_j = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z_j}$$ Once we specify our loss function $L$ we'll have a nice expression for the first term in the expression above, for now let's keep it generic. Now using the definitions of the $z_i$ and $a_i$ above: $$\frac{\partial \hat{y}}{\partial z_j} = \frac{\partial \hat{y}}{\partial a_j} \times \frac{\partial a_j}{\partial z_j} = \sigma ' (z_j) \sum_{k \in path(j)} \frac{\partial \hat{y}}{\partial z_k} \frac{\partial z_k}{\partial a_j}$$ where the sum is over the $k$ paths which are connected to the output of neuron $j$. We have $z_k = a_j w_{j \rightarrow k}$ so our expression becomes $$\frac{ \partial \hat{y}}{\partial z_j} = \sigma '(z_j) \sum_{k \in path(j)} \frac{\partial \hat{y}}{\partial z_k} w_{j \rightarrow k}$$ $$\implies \delta_j = \sigma '(z_j) \sum_{k \in path(j)} \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_k} w_{j \rightarrow k}$$ Note the first two terms in the summation are actually $\delta_k$ in disguise! Thus we have a recursive definition for our error propagation: $$\delta_j =  \sigma ' (z_j) \sum_{k \in path(j)} \delta_k w_{j \rightarrow k}$$
Thus if we have $\delta_\text{output}$, we can work our way backwards through the neural network, calculating the error for each neuron as a function of the error of the neurons that its output is connected to. This is the key concept of back propagation - although it is just an application of the chain rule, because we can propagate the error at each point in the network through subsequent neurons, the calculations can be done on the fly by utilising calculations that have already been performed.

Next time we'll describe the architecture of the actual ANN we'll use to classify the make moons example detailted in pt1. To be honest, these concepts didn't fully sink in until the first time I implemented an ANN, going through the details of the calculations, I suggest you work through this and the next blog post by hand in order to ensure you have a handle on how the feedforward and error propagations are affected by different topologies of ANNs.

Tuesday, 15 November 2016

Feedforward Artificial Neural Network pt1

Architecture of an ANN

The history of ANNs can be traced back to Warren McCulloch, who proposed a mechanism of how neurons might work by modelling a simple neural network with an electrical circuit. This line of thinking was reinforced by Donald Hebb, who discovered that neural pathways between neurons were strengthened through repetitive use. Understanding the motivation behind ANNs will help us understand why they are structured the way they are, particularly the action potential of a neuron.

An ANN consists of three major components;
  • Input Layer
    • This layer takes the inputs as is - they do not modify the data at all; they push it into the hidden layers.
  • Hidden Layer(s)
    • In line with the aforementioned action potential of a neuron in the brain, the hidden layer neurons take an input, applies an activation function and pushes the result to the next layer in the network.
    • The activation function $\phi :  \mathbb{R} \rightarrow \mathbb{R}$ takes $\sum_{i=1}^{n} x_i w_i + b$ as an argument, where $n$ is the number of inputs flowing into the particular hidden layer neuron, $w_{i}$ is the associated weight with input $x_{i}$ and $b$ is called the bias. Our task is to find the $w_{i}$ and $b$ for each neuron in the hidden layer which minimize our error function (more on this later).
    • The choice of activation function is entirely up to you. Obviously there are some standard choices for certain tasks. See below for a (by no means exhaustive) list of activation functions
    • The activation functions are they key to capturing non-linearities in our decision boundary.
  • Output Layer
    • The output of the neural network. In our case (classification) it will be a label classifying our input data to one of two classes.
These three components together with the topology of the network will define the entire structure of our ANN. Below is a stock image of a typical ANN
On the left there are 3 inputs which flow into the hidden layers (more on these below) which then in turn flow into the outputs. There are two hidden layers in this setup, each with 4 neurons - these are hyper parameters we can tune to increase performance and reduce overfitting.

Note that the flow of the network is strictly left to right, none of the neurons in the hidden layer are connected to one another and none of the neurons in the network loop back into themselves. This single direction of information flow and the fact that there are no loops or cycles in the network define a Feedforward Artificial Neural Network. Also note that the number of inputs and outputs are fixed once we define the network under this architecture. Other types of neural networks exist that are able to handle inputs with multiple lengths, but our focus for now will be Feedforward ANNs.

By why this structure?

The above structure of a neural network may seem like an arbitrary construction - how can we guarantee that it can be trained in such a way to produce our desired output?

Well..luckily for us there exists a theorem called the Universal Approximation Theorem which states that under reasonable circumstances, a feedforward network containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^{n}$.


Let $\phi \left( \cdot \right)$ be a nonconstant, bounded and monotonically-increasing continuous function. Let $I_{m}$ denote the $m$-dimensional unit hypercube $\left[ 0, 1 \right]$. The space of continuous functions on $I_{m}$ is $C \left( I_{m} \right)$. Then $\forall f \in C \left( I_{m} \right)$ and $ \epsilon > 0$ $\exists N \in \mathbb{Z}$, $v_{i}, b_{i} \in \mathbb{R}$ and $w_i \in \mathbb{R}^{m}$ where $ i = 1, \dots, N$ such that we may define

$$F(x) \equiv \sum_{i=1}^{N} v_{i} \phi \left( w_{i}^{T}x + b_{i} \right)$$
such that $$\left| F(x) - f(x) \right| < \epsilon \quad \forall x \in I_{m}$$

Which is quite fascinating really, that we can essentially approximate any continuous function to an arbitrary degree using a linear combination of our activation functions. Please note that this tells us nothing of actually how to approximate the function; what the parameters $N, v_{i}, b_{i}$ or $w_{i}$ are - only that they exist.


So there it is, a whirlwind introduction to Feedforward Artificial Networks. By no means is this supposed to be a comprehensive introduction, but to set the scene for the next few blog posts. In my next blog I'll define the error functions and consequently the Backpropogation algorithm which we will use to train our Feedforward ANN to find the weights and biases of our hidden layer neurons.

Monday, 14 November 2016

Back.. again! Artificial neural networks have tickled my fancy

I'm really impressed with my foresight when naming this blog all those years ago, it truly does live on the edge of abandon as I haven't touched it in 3 years! I've decided to fire it back up again and will (at least try) to actively post as I delve into the world of statistical / machine learning! I just want a place to document my thoughts/learnings/projects as I delve into this fascinating area. I've decided to use Python as my language of choice as I've always wanted to learn it but never got around to it or had the chance. Now I have no excuses!

So, without further ado...

Artificial Neural Networks

I'm going to assume some familiarity with ANNs - what they are, their general structure/philosophy. See here for a comprehensive introduction.
As a first pass, I'll introduce a motivating example, go through the math and then implement (in Python) an ANN from scratch.

Definition of the problem

We'll look at a classification problem in 2 dimensions (to make visualisation easier, read: possible) where our old friend Logistic Regression struggles. It will highlight a key fact that needs to be understood in any sort of machine learning / statistical problem; understanding when and what tools to use to attack a problem.

The data:

We'll use a toy dataset from sklearn called make_moons, with the following call:

import pandas as pd
import sklearn.datasets
import seaborn as sb
import sklearn.linear_model

X,y = sklearn.datasets.make_moons(200, noise=0.2) 
df = pd.DataFrame() 
df['x1'] = X[:,0] 
df['x2'] = X[:,1] 
df['z'] = y
sb.lmplot('x1','x2',hue='z', data=df, fit_reg = False)

The task:

We want to calculate the decision boundary between the two (blue and green) classes. At a glance it's clearly non-linear, so let's see how a Logistic Regression classification goes!


We can see that the Logistic Regression can only fit a linear decision boundary (see below) - even though the actual decision boundary is decidedly non-linear.

We can see why the Logistic Regression decision boundary is linear from the following:
$$ P_{boundary} \equiv \frac{1}{1+e^{- \theta \cdot x}} = 0.5 $$
$$ \implies 1 = e^{- \theta \cdot x}$$
$$ \implies \theta \cdot x = 0 $$ which defines a plane (line in 2 dimensions).

In part 2 I will define an ANN that we will work through to establish exactly how it works and how it is trained. We will then construct and implement it in python to get an idea of the decision boundaries it can produce.