Friday, 2 February 2018

Recommender Systems: Collaborative Filtering and matrix factorization: Part 2

The previous post looked at the mechanics of SVD and how we can interpret and use it in the collaborating filtering approach to building a recommender system.

User and Items rating matrix

Suppose we have a matrix $r \in \mathbb{R}^{m \times n}$ which each row corresponds to a single user (i.e. person) and each column corresponds to a single item (i.e. movie) such that $r_{ui}$ corresponds to the rating user $u$ gave rating $i$. Now this matrix will generally be constructed from all user and rating information that is available and it is quite obvious that not every user will have rated every item available. In fact when we use the MovieLens dataset we'll see that only about 6% of the matrix $r$ are valid entries - the rest are missing and are the ratings we would like to predict by using our recommender system once it is built.

The sparsity of the matrix $r$ poses an issue using the SVD approach covered previously, as SVD aims to reconstruct a matrix as is; it considers all entries in the matrix $r$ as valid in the reconstruction process. So we need to make a slight tweak to our approach to building the recommender system. We still want a latent variable representation, but we only want to use the 6% of ratings that are populated.

Latent space representation

In order to define the latent space representations, we make the following assumptions:
  1. Each user can be described by a $k$ dimensional latent vector $\textbf{x}_u =  \left[ x_{u1} \dots x_{uk} \right] \in \mathbb{R}^k$
  2. Each item can be described by a $k$ dimensional latent vector $\textbf{y}_i =  \left[ y_{i1} \dots y_{ik} \right]\in \mathbb{R}^k$
Hence we can approximate the rating user $u$ gave item $j$ as $$\hat{r}_{ui} = \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}$$
Hence we can define the loss function (with some regularisation) as trying to minimise the difference between the predicted and actual ratings a every user gave each item:
$$ L = \sum_{u,i \in S}\left( r_{ui} - \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}\right)^2 + \lambda_x \sum_{u} ||\textbf{x}_u||^2 + \lambda_y \sum_{i} ||\textbf{y}_i||^2$$ where $S$ is the set of valid ratings and $\lambda_x$, $\lambda_y$ are regularisation hyperparameters. We can rewrite a little more succinctly using matrix notation: $$ L = \left( \textbf{r}_u - \textbf{x}^T_uY^T \right)^2 + \lambda_x \sum_{u} ||\textbf{x}_u||^2 + \lambda_y \sum_{i} ||\textbf{y}_i||^2 $$ where $$\textbf{r}_u = \left[ r_{u_1} \dots r_{u_m} \right]$$ is the rating vector for user u's ratings of the m items and $$Y = \left[ \begin{array}{ccc} -& y^{T}_{1} & - \\ - & y^{T}_{2} & - \\ & \vdots    &          \\ - & y^{T}_{m} & - \end{array} \right] $$ is a composite vector of the $y_i$'s. Now this equation looks quite familiar (recall the definition of the OLS loss function...) hence if we fix one of the $\textbf{x}^T_u$ or $\textbf{y}_{i}$ then we have effectively reduced the cost function to that of an OLS problem. The procedure is simple, we fix $\textbf{y}_i$ and find the corresponding  $\textbf{x}^T_u$ which minimises the loss function, this then feeds into the next iteration where we fix  $\textbf{x}^T_u$ and find the  $\textbf{y}_i$ which minimises the updated loss function and continue until convergence (or some other appropriate stopping criteria). This process is known as (for obvious reasons) as Alternating Least Squares and we'll derive the close form solution in the next section.

Alternating Least Squares

We can then minimize $L$ by differentiating with respect to the components of our user vector $x_u$ and setting to $0$. Using summation notation:
$$ \frac{\partial L}{\partial x_{ua}} = \sum_{i}^{m} -2 \left(r_{ui} - \sum_{j}^k x_{uj} Y_{ij} \right)Y_{ai} + 2 \lambda_x x_{ua}$$ Hence for a minimum and in matrix form $$ 0 = -\left( \textbf{r}_u - \textbf{x}^T_uY^T \right) Y + \lambda_x \textbf{x}_{u}$$ we can re-arrange to get $$ \textbf{x}_{u}^T = \textbf{r}_u Y \left( I \lambda_x + Y^TY\right)^{-1}$$ We can perform an analogous calculation for $\textbf{y}_i$ which yields $$ \textbf{y}_i^T = r_i X \left( I \lambda_i + X^TX \right)^{-1} $$ where $I$ is the $k \times k$ identity matrix. Hence we have derived the updates for our parameters required at each iteration - the next section looks at the results.

Next time we'll have a look at the python implementation and results to see how well our recommendation system works!

Thursday, 26 October 2017

Recommender Systems: Collaborative Filtering and matrix factorization

What is a recommender system?

Think about any website which has a community of users who purchase/interact with content (items) on that website. Based on users' interactions with the items, various personalised recommendations can be tailored to specific user behaviour. More concretely, consider the purchase suggestions Amazon offers you, or movies that Netflix recommend you watch based on your past viewing behaviour. The machinery behind these recommendations is the basis of recommender systems.

Companies are looking to personalise a user's experience in a way that resonates with them - to make the user experience feel unique and tailored to their needs. These days websites capture all sorts of attributes from a user's experience/interaction - time spent on pages, scroll speed, viewing behaviour, purchasing behaviour etc... One may ask how we can use such detailed information to offer a better user experience; through the construction of a recommender system.

Now recommender systems are generally based on two types of user behaviour:
  1. Implicit feedback
    • Information based on user behaviour such as viewing/purchasing/click behaviour; user preferences can be inferred.
  2. Explicit feedback
    • Where the user has given an explicit rating to a movie or item (a 5 star rating for example).
There are generally two approaches to building a recommender system, content based filtering and collaborative filtering.

Content based filtering

The approach of content based filtering is as follows:
  • Each item is characterised by a series of properties or features. For example if we are using content based filtering to build a movie recommendation system then each movie could have attributes such as lead actor, director, genre, etc. 
  • A profile based on these attributes is built for each user and hence new items can be recommended to the user via a similarity measure between the user's profile and a specific items profile.
This approach requires the features of each item to be defined explicitly and hence it is limited to the level of feature engineering. It also means that as new items are added - which have new characteristics - these must be calculated for each of the existing items too and the dimensionality of our feature space will rise exponentially. We will not focus on this approach in the remainder of the blog.

Collaborative Filtering

As its name suggests, collaborative filtering uses an aggregate (or collaborative) approach to suggests new items, based on the premise that users who share a common interest in certain items will also share a common interest in other items. Unlike content based filtering, this approach doesn't require hand crafted features for each item and hence can be more easily scaled to larger and even different domains. We will focus on the collaborative filtering approach in building our recommender system and will use the MovieLens dataset in our example [1].

The approach we take in building the collaborative filter borrows from some linear algebra results, namely the Singular Value Decomposition (SVD) of a matrix. I will first define exactly what SVD is and then I'll add some context into how it helps us with creating a recommender system.

Singular Value Decomposition (SVD)

Given a matrix $M \in \mathbb{R}^{m \times n}$ there exists a factorisation $ M = U \Sigma V^{*}$ where 
  • $U \in \mathbb{R}^{m \times m}$ us a unitary matrix (i.e  $UU^{*} = U^{*}U = I$)
  • $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal, with non-negative numbers (singular values) 
  • $V^{*} \in \mathbb{R}^{n \times n}$ is the conjugate transpose of a unitary matrix $V$
Note that since I've taken $\mathbb{R}$ as the field the matrix entries come from, the conjugate transpose is just the transpose (i.e. $V^{*} \equiv V^{T}$) and the unitary matrices are also known as orthogonal matrices.

What this is telling us, is that any matrix can be decomposed into the above structure. If you think about the eigenvalue decomposition (EVD) of a matrix, it only exists for a square matrix, whereas the SVD of an arbitrarily (non-square) shaped matrix exists. There is a deep connection between the SVD and EVD of a matrix - the non zero entries of $\Sigma$ are the square root of the eigenvalues of $MM^T$. See section 3.2 of [2].

Yeah, yeah that's a cool result, but how does that help us with the recommender system? Well, the motivation is borrowed from Latent Semantic Analysis  (LSA) - a topic in Natural Language Processing which aims to understand relationships between documents and the words they contain. In LSA, a term-document matrix is created in which each entry contains the number of occurrences of a particular word in a particular document. This term-document matrix is then decomposed via SVD with the resultant matrices representing documents and terms which capture the pattern of term usage among documents. The top $k$ singular values are kept, such that the decomposition results in a representation of the original matrix in a lower dimensional space. The idea is that this lower dimensional representation captures the structure of the relationship between documents and terms, whilst filtering out the noise. In this lower dimensional 'latent' space one can simply find the similarity of documents or terms by using our $U$ and $V$ matrices.

So in our case, the analogy of a term-document matrix is a user-item matrix in which each row of the matrix is a user, and each column is an item. For our example the items will be movies. Each entry $(i,j)$ corresponds to the rating the $i^{th}$ user gave to the $j^{th}$ movie. The SVD will result in a latent space in which user and item vectors reside and we can calculate their similarities.

Interpretation of SVD

Geometric Interpretation

In the case of $M \in \mathbb{R}^{n \times n}$ - aka a linear map within the same space - then there is a nice geometrical interpretation of the action of $M$, the image below illustrates this
  • $V$ is an orthogonal matrix - which corresponds to a reflection or rotation
  • $\Sigma$ is diagonal and hence is a scaling matrix
  • $U$ is also an orthogonal matrix - which is another reflection or rotation

More generally, consider $M \in \mathbb{R}^{m \times n}$, a bilinear map between spaces, we can rewrite the SVD equation as $$ MV = U \Sigma $$ If we consider each of the $j$ columns of $V$ separately then it is apparent $Mv_{j} = \sigma_j u_{j}$. Recalling that since $U$ and $V$ are orthogonal matrices and hence their rows form orthonormal bases, the action of $M$ is as follows: the basis of a coordinate system $\{v_1, \dots, v_n \}$ is mapped to a 'scaled' basis of a different coordinate system $\{\sigma_{1} u_{1}, \dots, \sigma_{m} u_{m} \}$. In words, every matrix $M \in \mathbb{R}^{m \times n}$ can be interpreted as having the following action
  1. A first rotation in the input space
  2. A simple positive scaling that takes a vector in the input space to the output space
  3. And another rotation in the output space

SVD as an approximation

Note that in the SVD decomposition there is an equality; the matrix $M$ is exactly reconstructed. In practice, we don't actually want the exact matrix, instead we want a lower k-dimensional approximation which will have less noise and (hopefully) have captured the latent structure of the user/item relationships. Now it turns out through SVD -if we keep only the top k singular values - the resulting matrix is the best k-rank approximation of $M$, where the notion of 'best' is defined by the Frobenius norm. See Section 2.1 of [2] for details.

We can think of the rank k approximation via SVD of a real matrix $M$ as the decomposition of $M$ into the sum of k rank 1 matrices. That is
$$ M = \sum_{j=1}^{k} \sigma_j u_j v_{j}^{*}$$
with $\sigma_1 > \sigma_2 > \dots >\sigma_k > 0$. This is a kind of expansion of $M$ in terms of rank 1 matrices, with the first term capturing the 'most' of $M$, followed by the second term and so on, with each including term adding to the accuracy of the approximation.

For a great exposition on SVD and some various angles on interpretation, check out [3] by Jeremy Kun which should help you gain some intuition.

Well, that wraps up a short summary on the techniques we'll use to build our recommender system! Next post will be focused some of the subtleties of the implementation and some tweaks which have been used to win competitions in the past!



Thursday, 22 June 2017

Transfer learning and hotdog classification!

Inspired by a recent episode of Silicon Valley where Jian Yang builds an app to classify pictures as either Hotdog or Not Hotdog, I decided to have a crack at retraining the final layer of the Inception V3 network to do so.

Thankfully, the lovely folks at Google make it ridiculously easy to do so. So instead of maxing out my poor little Macbook pro for weeks (or months :/) of training - we can utilise the already trained network and simply adjust the final layer for our use, more on this below.

Note that this post assumes familiarity of CNN architecture and its use in computer vision - see this great course at Stanford for a great introduction to CNNs - CS231n: Convolutional Neural Networks for Visual Recognition.


The Inception v3 image recognition neural network came out of Google at the end of 2015. The focus of the architecture builds upon the original inception architecture of GoogLeNet [1] which was also designed to perform well even under strict constraints on memory and computational budget [2].

Inception v1

The basis of this (at the time) new architecture was the so called inception module. It builds on the traditional convolutional layer which has a fixed height and width. Instead it applies and assortment of different size filters to the input, allowing the model to perform multi-level feature extraction on each input:

Figure 1: Taken from Rethinking the inception architecture for computer vision[1] - the naive inception module.

As you can see above, there are $1 \times 1$, $3 \times 3$ and $5 \times 5$ convolutions as well as a $3 \times 3$ max pooling applied to the input. In theory, this is great - as we are able to exploit multiple level spatial correlations in the input, however using a large number of $5 \times 5$ filters can be computationally expensive.

This train of thought led to the next iteration of the inception module:
Figure 2: Taken from Rethinking the inception architecture for computer vision[1] - inception module with dimension reductions.

This architecture is essentially the same as the initial proposal except that each of the larger filters is preceded by a $1 \times 1$ convolution. These  $1 \times 1$ convolutions act as a dimension reduction tool in the channel dimension, leaving the spatial dimension untouched. The idea is that the important cross channel information will be captured without explicitly keeping every channel. Once the channel dimension reduction has been performed, these results feed into the larger filters which will capture both spatial and channel correlations. The $1 \times 1$ filters also have ReLU activation functions which introduce additional non-linearities into the system.

Put simply in [1]:

One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously. 

Inception v2/3

A key realisation of the earlier iterations of Inception was that the improved performance compared to its peers was driven by dimensionality reduction using the $1 \times 1$ filters. This has a natural interpretation in computer vision; as the authors put it [2]:

In a vision network, it is expected that the outputs of near-by activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation and that this should result in similarly expressive local representations.

With reducing computational complexity in mind (hence the number of parameters in the network) the authors sought a way to keep the power of the larger filters, but reduce how expensive the operations involving them were. Thus they investigated replacing the larger filters with a multi-layer network that has the same input/output size but less parameters.

Figure 3: Replacing a $5 \times 5$ filter with a network of $3 \times 3$ and a $1 \times 1$ filter.

Thus the inception module now looks like
Figure 4: Inception module where each $5 \times 5$ filter has been replaced by two $3 \times 3$ filters.

This replacement results in a 28% saving in computational efficiency [2]. Inception v2 and v3 share the same inception module architecture, the different arises in differences in training - namely batch normalization of the fully connected layer of the auxiliary classifier.

Transfer Learning

Now that we have a (very) high level understanding of the architecture of Inception v3, let's talk about transfer learning.

Given how computationally expensive it is to train a CNN from scratch - a task which can take months - we can use the results of a pre-trained CNN to do the feature extraction for us. The idea is that the generic features of an image (edges, shapes, etc) which are captured in the early layers of the CNN are common to most images. That is, we can utilise a CNN that has been trained for a specific task on millions of images to do the feature extraction for our image recognition task.

We then strip the CNN of the final fully connected layer (which feeds to the softmax classifier) and replace it with a fully connected layer built for our own task and train that. There are several approaches to transfer learning which are dependent on the task, such as we can use the pre-trained CNN as a feature extractor as detailed above, or we can use it as a starting point and retrain the weights in the convolutional layers. The amount of training data you have for your task and the details of the task will generally dictate the approach. See [3] for a detailed study on transfer learning.

The pre-trained CNN we'll be utilising an Inception v3 network as described above, trained on the ImageNet 2012 dataset which contained 1.2million images and 1000 categories.


Thankfully Google had the foresight to make it dead easy to perform the aforementioned transfer learning using Tensorflow.
For my training images, I utilised the work done in [4] which provided around 350 hotdog images and around 50 not hotdog (notdog) images. We then utilise the script which we'll pass parameters to from the terminal.

The script will determine the number of classes based on the number of folders in the parent folder. You don't need to worry about specific names for the images, you just need to make sure the folders contained the correctly labeled images.

The rest of the parameters are fairly self explanatory and are explained on the tutorial. One interesting set of parameters to note are the random_crop and random_scale, these distort the training images as to preserve their meaning but add new training examples. For example a picture of a hotdog rotated 90 degrees is still a hotdog; instead of sourcing additional hotdog pictures we can just rotate an existing training example to teach the classifier something new.


After running with 500 training steps, I received the below training summary

99% Validation accuracy - not bad at all!
Let's have a look at a specific example and try the below hotdog

hotdogs (score = 0.99956) 
random (score = 0.00044)
Great news, this image is in fact a hotdog and our CNN recognises it as such with a high degree of confidence. What about a notdog?

random (score = 0.92311)
hotdogs (score = 0.07689)
This is a sunflower not a hotdog! Yet the classifier is only rather confident rather than completely confident that it is a notdog. This can be attributed to the limited number of training examples that were used in training the final layer before classification.
What about if we try to trick it - with a HOTDOG CAR!?

random (score = 0.70404)
hotdogs (score = 0.29596)

Not too bad at all, obviously the shape is similar to that of a hot dog but the chassis, wheels and other car-like features offer enough difference for our CNN to make the distinction.

What's next?

I haven't decided if I'll build some CNNs or have a crack at an LSTM implementation next - both will be using Tensorflow but I've got some ideas I want to play around with first which will shape which direction I head in.


[1] Szegedy, C. et al. 2014. Going deeper with convolutions,
[2] Szegedy, C. et al. 2015. Rethinking the Inception Architecture for Computer Vision,
[3] Donahue, J. et al. 2013. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition,

Friday, 31 March 2017

word2vec and word embeddings

Where to now?

Now that we've got the basics of neural networks down pat, this opens up a world of related techniques that can build on this knowledge. I thought to myself that I'd get stuck into some Natural Language Processing (NLP) with a view to eventually implement some sort of Recurrent Neural Network using TensorFlow.

The wonderful folks at Stanford have all the lecture notes and assignments up online at which is a fantastic resource. 

Anyway before we can even begin to think of the applications, we must understand how to represent our text for input into any machine learning algorithm.

One-hot encoding

Naively, we may think to take our entire vocabulary (i.e all words in the training set - size $V$) and make a huge vector, with each entry in the vector corresponding to a word in our vocabulary. Thus each word is represented by a $V \times 1$ vector with exactly one entry equal to $1$ and all other entries $0$. For example, a word $x_i$ would be represented as
$$ x_i = \left[ \begin{array}{c} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{array} \right]$$This is called one-hot encoding. The representation seems simple enough, but it has one problem - there is no natural way to embed the contextual similarity of words. Imagine we had three words; $x_1$ = dog, $x_2$ = cat and $x_3$ = pencil. Intuitively, if you were asked which two words were "similar", you would say in a given context, $x_1$ and $x_2$ are similar, whilst $x_3$ shares less similarity with $x_1$ or $x_2$. Now supposed our three words have the following one-hot encoded representation $$ x_1 = \left[\begin{array}{c} \vdots  \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0  \\ \vdots\end{array} \right], x_2 = \left[ \begin{array}{c} \vdots  \\ 1 \\ \vdots \\ 0 \\ \vdots \\ 0 \\\vdots \end{array} \right], x_3 = \left[ \begin{array}{c} \vdots  \\ 0 \\ \vdots \\ 0 \\ \vdots \\ 1 \\\vdots \end{array} \right]$$ We could hope that the dot product between vectors would give us a notion of similarity, but each of the $w_i$ are orthogonal, that is $x_i \cdot x_j = 0 \hspace{.1in} \forall i \neq j$ This representation also isn't all that great given that our vocabulary could be of the order of hundreds of millions of words, which would result in humongous one-hot encoded vector representations in which almost all (except for one) entry is a 0. This called a sparse representation for obvious reasons. We'll now have a look at some more computationally efficient (and more accurate!) dense word representations.

Efficient Estimation of Word Representations in Vector Space

In 2013, Mikolov et al. presented Efficient Estimation of Word representations in Vector Space which proposed two architectures: 
  • Continuous Bag of Words (CBOW) and
  • Skip-gram model
with the aim of minimizing the computational complexity traditionally associated with NLP models. These representations somewhat surprisingly capture a lot of syntactic and semantic information in the structure of the vector space. That is, that similar words tend to be near one another (in terms of Euclidean distance) and that vector operations can be used to analogously relate pairs of words. There is the famous $\text{king} - \text{man} + \text{woman} \approx \text{queen}$ equation which is quite remarkable. In the one-hot view of the world, all of our vectors were orthogonal so dot products or any sort of vector addition/subtraction had no intuitive meaning or captured any relationship between words. See here for a cool visualisation.

Continuous Bag of words

Simply put, the CBOW aims to predict a word given an input context. For example, we would like a CBOW model to predict the word fox from an input context "The quick brown...". To perform such a task, it was proposed to use a simple feedforward neural network (wait - we know all about those now!) without an activation function in each neuron in the hidden layer. The idea is to train the NN on this classification task and then simply pick out the weight matrix in the hidden layer - this will be our dense representation.

Architecture of CBOW

The CBOW model looks at m words either side (or $C$ total as in the diagram below) of the target/center word $w_c$ in order to predict it . Below is an image the architecture of a CBOW model - notice that each input word is a one-hot encoded vector of dimension $V \times 1$ where $V$ is the size of our vocabulary. $W$ is a $N \times V$ matrix, where $N$ is the dimension of the representation of our input vectors $x_1, \dots, x_c$, where $x_{i}$ is the one hot encoded vector of the $i^{th}$ context word. We define $W x_{i} \equiv v_{i}$ which is a single column of $W$ due to the one hot encoding. Below is a visual representation of a CBOW model.

Forward propagation:
  •  We have $C$ one hot encoded input vectors which feed into the hidden layer via the weight matrix $W$.
  • Since we have $C$ inputs acting on $W$, the resulting matrix $\bar{v}$ is computed from the average as follows: $$ \hat{v} = \frac{1}{C} W \left( x_{1} + x_{2} + \dots + x_{C} \right) = \frac{1}{C} \left ( v_{1} + v_{2} + \dots + v_{C} \right)$$
  • We actually don't have an activation function here (it is the identity mapping). So from the hidden layer to the output layer, we propagate forward via weight matrix $W'$ which is $V \times N$
    • Define the $j^{th}$ row of $W'$ as $u^T_{j}$.
  • Thus the $j^{th}$ entry of the unnormalised output (a $V \times 1$ vector) is simply $z_j = u^{T}_{j} \hat{v}$
  • Finally, we apply our old friend the softmax function to obtain our output: $$\hat{y}_j = \frac{e^{z_j}}{\sum_{k=1}^{V} e^{z_k}}$$
  • To measure the loss of our network, we'll use the cross entropy loss function, which we saw in the derivation of a feedforward neural network. $$l = - \sum_{j=1}^{V} y_j \log(\hat{y}_j)$$
    • This is just the negative of the log likelihood $p \left( w_k | w_{I_1}, \dots , w_{I_C} \right)$ where $w_{I_j}$ represents the $j^{th}$ word of input context $I$.
    • Thus we have $$l = - u^{T}_{c} \hat{v} + \log \left(\sum_{k=1}^V e^{ u^{T}_k \hat{v}} \right)$$
    • where c is the index of the correct word - the one hot vector for the correct word as a $1$ at the $c^{th}$ position.
  • We now just need to optimise the loss function to recover our input representation $\hat{v}$ and the output representation $u^{T}_j$ for $j = 1, \dots, V$.
Let's now take a look at how we optimise our loss function $L$ via gradient descent using backpropagation. We won't go through algebra, but it is exactly analogous to what we've seen before. Now $$\frac{\partial J}{\partial u_j} = \left\{ \begin{array}{ll} (\hat{y}_j-1)\hat{v} & j = c \\ \hat{y}_j \hat{v} & \text{otherwise} \end{array} \right. $$ and $$\frac{\partial J}{\partial \hat{v}} = -u_c + \sum_{k=1}^{V} \hat{y}_k u_k$$
These define the update rules for stochastic gradient descent. Note the sum over $V$, the size of our vocabulary, which can contain potentially millions of tokens.  This isn't computationally efficient and will result in terrible performance if we have do do this when training the model. Suspend your disbelief for now and I'll cover some methods that people have come up with to tackle this problem in my next blogpost. We'll carry on for now and introduce the skip-gram model.


The skip-gram does somewhat the opposite of the CBOW. A skip-gram model takes a single word (the center word) as input and predicts surrounding words (context). It uses the same approach in that we'll train an ANN to perform the prediction task and we'll just pick up the weight matrices it learns as our dense representation of our words.

Architecture of the Skip-Gram model

The Skip-Gram model takes an input (context) word $x_k$ and will predict the context from it. The input word is a single one-hot encoded vector. We denote the output by $C$ -  a collection of vectors ${\hat{y_1}, \dots, \hat{y_C}}$. Each $\hat{y_j}$ is a vector of probabilities for each word in the vocabulary occurring. Below is an image the architecture of a Skip-Gram model - notice that we have the same matrices $W$ and $W'$ as in CBOW - these operate the same way on our single input as they did on $\hat{v}$ in CBOW.

Forward propagation:
  •  We have a single one hot encoded input vector $x_k$ which feeds into the hidden layer via the weight matrix $W$. Define $W x_k \equiv v_c$ - this is also called the center word input representation.
  • Again, We actually don't have an activation function here (it is the identity mapping). So from the hidden layer to the output layer, we propagate forward via weight matrix $W'$ which is $V \times N$
  • Thus the $j^th$ element of the unnormalised output (a $V \times 1$ vector) is simply $z_j = u^{T}_j v_c$. $C$ copies of the output vector are output - one vector for each of the words in the context we are trying to predict.
  • Again, we apply the softmax function to each of the $C$ output vectors (to obtain our output: $$\hat{y}_{i,j} = \frac{e^{z_j}}{\sum_{k=1}^{V} e^{z_k}} =  \frac{e^{u^{T}_j v_c}}{\sum_{k=1}^{V} e^{u^{T}_k v_c}}$$ where $i \in \{1, \dots , C\}$ indexes each of the output vectors.
  • Note that since we used $C$ copies of the same output matrix for the prediction of each surrounding word, this amounts to a conditional independence assumption. That is the probability of each surrounding word occurring given the center word is independent of it's relative positioning to it.
  • To measure the loss of our network, we'll again use the cross entropy loss function. However, now that there are $C$ copies of the output, we need to sum over them $$l = - \sum_{j=1}^{V} \sum_{i=1}^{C}  y_{i,j} \log(\hat{y}_{i,j}) $$ represents the index of the actual output word. For example if the actual output words for a center word like were I and ice-cream then there are two one hot encoded vectors $y_{1}, y_{2}$. Suppose I has position $10,000$ in our vocabulary then $y_{1,10000}$ would be the only non zero entry of $y_{1}$. Similarly $y_{2,5000}$ would be the only non-zero entry of $y_{2}$ if ice-cream had position $5000$ in the vocabulary.
  • This is just the negative of the log likelihood $p \left( w_{c-m},\dots, w_{c-1}, w_{c+1}, \dots , w_{c+m} | w_c \right)$ where $w_{j}$ where $(j \neq c)$ represents the $j^{th}$ word of surrounding window and $w_c$ is the center word.  We have introduced the index $m$ to index the words surrounding the center word. That is, the window of sized $C$ is symmetric about the center word $v_c$ such that there are $m$ words to the left and right of $v_c$.
    • Thus we have $$l = - \sum_{j=0, j\neq m}^{2m} u^{T}_{c-m+j} v_c + 2m \log{\sum_{j=1}^{V} e^{u^{T}_j v_c }} $$
  • It's important to see that this loss function also suffers from the problem that we need to sum over our entire vocabulary $V$ at each parameter update step - which is computationally far too expensive and sometimes not even possible.


There are some interesting remarks on the word2vec Google Code page regarding model selection (skip-gram vs CBOW) and hyper parameter selection. We won't get into the details until we've covered the aforementioned sampling approaches for evaluating the models, but one point to note is 
  • architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)
We can understand the skip-gram performing better for infrequent words as the embedded words are not averaged as they are in CBOW (i.e when we define $\hat{v}$ in CBOW). The averaging process will dampen information from infrequently occurring words, contrasted with the skip-gram model where no such process takes place.

In a second paper published by Mikolov et al Distributed Representations of Words and Phrasesand their Compositionality they published the following results for a $1000$ dimensional skip-gram model. Taking the first two principal components, the following image was created

The caption says it all really - note the geometric similarity of differences between capital cities and their countries! This was captured by the skip-gram model without providing any supervised information regarding these relationships. You can imagine simple vector addition holding approximately true in this 2D vector space: $$\text{Portugal} - \text{Lisbon} + \text{Beijing} \approx \text{China}$$


In the next blog I'll look at some clever sampling based approaches that people have come up with to tackle the problem of having to normalise over the entire vocabulary $V$ in order to efficiently train the model. Then we can mess around with some text classification tasks, by using our word embeddings in our favourite ML techniques! The hope is that once we've played around with some toy problems to get a strong handle of the implementations of word2vec, then I'll introduce Recurrent Neural Networks and then we can get into the nitty gritty of some super interesting deep learning architectures :)

Friday, 24 February 2017

Optimisation routines

Now that we are able to train the Neural Network, let's take a look at how we can optimise the training as in reality we will be dealing with deep (many hidden neurons) networks which will potentially take days to train. I'm going to take a step back and have a look at how we might generally attack a convex optimisation problem. Convex optimisation problems are very well researched and understood area, so it makes sense to have a look at these first.

The problem

We are going to fit a simple logistic regression to the following data 
There are two predictors (columns) in the first file and 1 outcome ($ y = \pm 1$) in the second. Note that I've borrowed these from the cs229 class that I'm currently working through. Our task is to minimise  the average empirical loss: $$ L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log (1 + \exp^{-y^{(i)} \theta^{T} x^{(i)}}) $$ where $y^{(i)}$ is the actual outcome for observation $x^{(i)}$. Note that $x^{(i)}$ is a vector containing the predictors for that observation.

Is this a convex optimisation problem?

It turns out that if we can show that the Hessian $H$ for this loss function satisfies $$z^{T} H z \ge 0$$ $ \forall z \in \mathbb{R}^{3}$ then $L(\theta) $ is a convex function (more generally if $H \in \mathbb{R}^{n \times n}$ then this result must hold true $\forall z \in \mathbb{R}^{n}$. Using the definition of $H$, we have $$ H_{pq} = \frac{\partial^2 L(\theta)}{\partial \theta_p \partial \theta_q}$$ The details of the calculation are quite straightforward, so I'll omit them but the result yields $$H_{pq} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{1 + \exp^{-\theta^T x^{(i)}}} \times \left( 1 - \frac{1}{1 + \exp^{-\theta^T x^{(i)}}} \right) \times x^{(i)}_p x^{(i)}_q $$ where the sum is over all of our training examples. We can write the last two terms as a matrix product (and subsequently drop the indices) as $x^{(i)} x^{(i)^{T}}$. Since the first two terms are $\in (0, 1]$ and $[0,1)$ respectively, then the product is $\in [0, 1]$, thus we can ignore this term when assessing if $z^T H z \ge 0 \hspace{.1in} \forall z$. Thus $$z^T H z \propto \sum_{i=1}^{m} z^T x^{(i)} x^{(i)^{T}} z = \sum_{i=1}^{m} (z^T x^{(i)})^2 \ge 0 \hspace{0.1in} \forall z$$ Hence $H$ is positive semidefinite which implies that $L(\theta) is a convex problem. This means it has a global optima, which makes our lives a lot easier. Since calculating the Hessian is rather easy in this setting, we can use Newton's method.

Newton's method

Newton's method utilises the Hessian and hence the curvature of the loss surface to find an optimal path to the minima. With this additional information, we can expect the algorithm to converge faster than gradient descent, which only uses first derivative information. The update rule is as follows $$ \theta \rightarrow \theta - H^{-1} \nabla_{\theta} L(\theta) $$
Let's see the performance of Newton's method vs Gradient descent:
I've performed 20 iterations for both Newton's method and gradient descent - clearly Newton's method converges a lot faster than gradient descent. Looking at the update step, it is obvious that this method won't scale well with more parameters since each step requires the calculation of the matrix of first derivatives and the Hessian. Furthermore, it's obvious that this method will fall apart if $H$ is singular. So in our toy example above which had a convex loss function, minimal parameters to estimate - Newton's method was king, but obviously in our neural network we couldn't apply such an algorithm. The beauty of backpropagation was that after a forward pass through the network, we had all of the first order derivatives we required in the update step. This fortune does not however extend to the second derivatives we would require to calculate $H$. It seems we're stuck with regular old batch gradient descent to train or neural network...or are we?

Stochastic Gradient Descent (SGD)

Recall our update step for batch gradient descent: $$\theta_j \rightarrow \theta_j - \eta \frac{\partial}{\partial \theta_j}L(\theta)$$ where  $$ L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log (1 + \exp^{-y^{(i)} \theta^{T} x^{(i)}}) $$$L(\theta)$ is a function of all m training examples. That is, for a single update to parameter $\theta_j$ we need to use all of our training data. What if we could calculate our loss on a subset of the training data? The hope is that this subset is "representative enough" of the entire dataset such that resulting update to $\theta_j$ is generally in the same direction to that of the update calculated on the entire dataset. This is the essence of Stochastic Gradient Descent.

Consider the fact when the training dataset has several hundred million rows, we may not even be able to fit all the data in memory to perform a batch gradient descent! 

I'll define the following terms commonly found when talking about SGD

  • Epoch - once every datapoint in the training set has been used, one epoch has occurred.
  • Mini batch - the subset of training data that is used in the parameter update
  • Batch size - the size of the mini batch
So per epoch, we will be able to update the parameters $\frac{N}{\text{batch size}}$ times. Compare this to batch gradient descent which by definition, the parameters are updated once per epoch.
The pseudocode is as follows
for epoch in num_epochs:
    for mini_batch in data:
        evaluate derivative of loss on mini_batch
        update parameters

Since we are taking a subset of our training data to update our parameters, the results may be volatile as each subset may contain slightly different information about our loss surface. This is slightly troublesome as we approach the optima, as in SGD the path of optimisation will tend to oscillate around the minimum we seek. In order to remedy this, we use a learning rate schedule which is simply a scaling (generally based on heuristics) of our learning rate at each iteration. The hope is that by the time the algorithm is near the minimum, the learning rate has been scaled down such that the successive parameter updates are relatively stable from this point forward. This process is also called annealing the learning rate. The trick is getting the balance between timing the schedule such that the learning rate is small enough when the algorithm is near the minimum - if you are too aggressive with the schedule, the learning rate will become too small too soon and you won't get near the minimum as the update to the parameters will tend to zero.

Let's have a look at the performance of SGD vs batch gradient descent on our neural network. If implemented SGD on the aforementioned neural network and have run the make moons dataset with 100000 data points and 10 neurons in the hidden layer. See below for a plot of the loss vs. epoch for both SGD and batch gradient descent.

Batch gradient descent:
  • Epochs: 200 (i.e 200 updates to the parameters, each based on the full training set)
  • Execution time: 17s
  • Epochs: 1
    • Mini batch size: 500 (i.e 200 updates to the parameters, each based on mini batch of size 500)
  • Execution time: 6s
We can see that SGD actually outperforms batch gradient descent here and takes about a third of the time to run! I haven't actually applied any learning rate schedule here, you can see an egregious spike in the loss from SGD at around the $160^{th}$ epoch. See what happens below when I simply set $\eta \rightarrow \frac{\eta}{10}$ at the $150^{th}$ epoch:

Notice that the aforementioned spike is now gone and the SGD results look very promising given the accuracy and performance (compared to batch gradient descent). In practice, batch gradient descent will never be used due to the memory constraints and the fact that the randomness of SGD can help the algorithm escape local minima that batch gradient descent would naturally get stuck in.


We can introduce the concept of momentum to our parameter updates. It is applied as follows:
Recall the usual update to our parameters is as follows $$\theta_j \rightarrow \theta_j - \eta \frac{\partial}{\partial \theta_j}L(\theta)$$ which we can write in two steps, define
$$\Delta \theta_j = \frac{\partial}{\partial \theta_j}L(\theta)$$ such that the update becomes $$\theta_j \rightarrow \theta_j - \eta \Delta \theta_j$$ We'll now modify our definition of $\Delta \theta_j$ as $$\Delta \theta_j = \gamma \Delta \theta_j - \eta \frac{\partial}{\partial \theta_j}L(\theta)$$ and our update is $$\theta_j \rightarrow \theta_j + \Delta \theta_j$$ What we've done is influenced our current update of $\theta_j$ by the amount it was updated by in the previous step. Yes we've introduced yet another hyperparameter, but the idea here is to give it some "momentum" such that the updates are influenced by previous updates and the optimsation continues towards the minimum. Consider the case when the loss surface is a long narrow ravine with steep walls - SGD will typically oscillate across the ravine as the update will point down the steep walls, the momentum term will help move the algorithm down the ravine towards to minimum we seek. See the effect of momentum on training our neural network below

We see that momentum allows the algorithm to descend very quickly and hits the minimum loss around 40 epochs in - compared to ~140 in our previous iteration.

Hopefully you've now got a good grasp of batch gradient descent vs SGD and some of the subtleties and hyperparameters that need to be considered when training a model with SGD. For a more in depth exploration of various optimisation techniques, including some more advanced methods see here. In production, we'll rely on the built in SGD routines which have been highly tuned for performance.

Friday, 27 January 2017

Feedforward Artificial Neural Network pt5: Additional analysis

Now that we've finally implemented our ANN, let's have a play around some of the parameters to get an understanding of how they affect our network and its results.
The tricky part about training ANNs is that the loss function isn't necessarily convex, which means that we can't use our usual optimisation routines. The fact that the loss function isn't necessarily convex means that just because we find a local minimum, it doesn't mean it's the global minimum. Thus non convex problems may converge on different local minima depending on the parameters of the optimisation routine. We'll explore some of these parameters below.

Learning Rate

Recall how the learning rate $\eta$ enters our optimisation procedure via the weight updates in gradient descent; $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ It essentially controls the step size at each update. Recall we had some funny bumps in our loss function at certain iterations, let's take a closer look: I've plotted the results of two different iterations of training the ANN below
The two lines correspond to the total loss as a function of the number of iterations in our training of the ANN. The blue line has $\eta = 0.001$ and the green line has $\eta = 0.01$. You can see that the green line has those funny bumps we witnessed before - this is the training example with a larger learning rate. The spikes occur when the step size is too large and we overshoot the minimum. Notice that the blue line doesn't have these overshoots, however it takes more iterations to approach the minimum. If we take a step size which is too large, then we consistently overshoot the minima - never converging on the minimum:

The key is finding a learning rate which will find the minimum within a reasonable timeframe. Although our selection ($\eta = 0.001$ vs $\eta = 0.01$) didn't make a huge difference in this case, consider an ANN with multiple hidden layers and thousands of neurons in each layer. This network may take hours (or days) to train depending on how we choose our learning rate.
Depending on the problem at hand, you may value accuracy more than efficiency or vice versa, this will dictate how you choose your learning rate, of which you will usually calculate using cross validation.

Regularisation / Weight Decay

Say we have our initial loss function (the cross entropy Loss) $L_0$ and we add a regularisation term such that we now have $$L = L_0 + \frac{\lambda}{2n} \sum_{w} w^2$$ where the sum is over all weights. Now if $\lambda$ is large then the second term will dominate $L$ and the task of optimising the entire expression will be reduced to minimising $\sum_w w^2$. If $\lambda$ is small then the first term dominates and there are less restrictions place on $w$. This regularisation term controls $w$ by preventing it from becoming overly large and helps us from overfitting the model. If we want to use gradient descent to minimise this regularised loss function we have $$ \frac{\partial L}{\partial w} = \frac{\partial L_0}{\partial w} + \frac{\lambda}{n} \sum_w w$$ so our update at each iteration is $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ becomes $$ w \rightarrow w - \eta \frac{\partial L_0}{\partial w} - \frac{\eta \lambda}{n} w$$ $$\implies w \rightarrow \left(1 - \frac{\eta \lambda}{n} \right) w - \eta \frac{\partial L_0}{\partial w}$$ That is at each update, the weight $w$ is rescaled by a factor of $\left( 1 - \frac{\eta \lambda}{n} \right)$ at each iteration; this is referred to as weight decay and as mentioned before, limits the magnitude of $w$.

Weight initialisation

In this section we'll take a look at why we chose to intialise our values as we did (from a normal distribution with specific parameters). Recall the definition of the weight update from our gradient descent algorithm $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$, if the second term in this expression is small or zero, then there is effectively no (or very little) weight update to $w$. This causes our training to slow down incredibly, such that after each iteration our weight $w$ is only changing ever so slightly; obviously we would like to avoid this situation at the start of the procedure. Recall the backpropagation rules for $W^{(1)}$:

  • $\delta^{(1)} = (1-\tanh^{2}(Z^{(1)}) \odot \delta^{(2)}{W^{(2)}}^T$
  • $\frac{\partial L}{\partial W^{(1)}} = {x}^T \delta^{(1)}$

  • we see that $(1-\tanh^{2}(Z^{(1)})$ term enters the equation (more generally, this will be the derivative of the activation function). So we have our update to the weight $w$ as $$w \rightarrow w - \eta (1-\tanh^{2}(Z^{(1)})) \odot {x}^T  \delta^{(2)}{W^{(2)}}^T$$ That is the amount we update $w$ by is proportional to the derivative of our activation function. Thus we want to avoid initialising our weights in a region where this derivative is close to zero. Below is a plot of the function
    We can see that this activation function has its derivative approach zero at both extremes: as $x \rightarrow \infty$ and as $x \rightarrow -\infty$. Let's think about a more general ANN for a moment - suppose we have an ANN with 1000 inputs and a single training example where each input is equal to $1$. We have as usual $$Z^{(1)} = x W^{(1)} +b^{(1)}$$ If we have intialised each entry $W^{(1)}_{ij}$ and $b^{(1)}_j$ as selected from a standard normal distribution (iid), then each entry $Z^{(1)}_{i}$ will be the sum of 1001 iid standard normal variables. Then since the sum of $N$ standard normal variables will have mean $0$ and standard deviation $\sqrt{1001}$ i.e a very wide distribution with a relatively high probability of giving a large negative or positive result (this almost looks like a uniform distribution), the derivative of the activation function will be very close to zero! This isn't what we want.
    What about if we initialise with a random normal with mean $0$ and standard deviation $\frac{1}{\sqrt{1000}}$? Now we know that the variance of a sum of iid random normal variances is the sum of the variances so we now have each entry in $Z^{(1)}_{ij}$ has mean $0$ and standard deviation $$\sigma = \sqrt{\frac{1000}{1000}+1} = \sqrt{2}$$ which is a lot narrower than our distribution before - there is a lot smaller chance intialising at values where the derivative of the activation function is close to $0$. Below is a comparison of the resulting initialisation distributions from the toy example - the green line is the resulting distribution for the refined initialisation, where the red line results from initialisation by standard normal variables.

    More generally for a given network we will initialise from a Gaussian distribution with mean $0$ and standard deviation $\frac{1}{\sqrt{N_{in}}}$ where $N_{in}$ is the number of inputs into the neural network.
    Next time we'll have a look at optimising our network using stochastic gradient descent and maybe play around with some different datasets.

    Thursday, 12 January 2017

    Feedforward Artificial Neural Network pt4: Implementation!

    So if you've survived this far, it's all worth it! We've done all the groundwork required to code up our ANN to classify our toy example (make moons dataset kindly provided by sklearn).


    Recall that a motivation for exploring ANN's was that our standard Logistic Regression classification technique could only ever give us a linear separating boundary:

    ANN structure and Matrices

    In the last blog post we derived general expressions for a single hidden layer feedforward ANN. The results were;
    For a network with the following features:
    • 2 inputs, $x_1$, $x_2$
    • 1 Hidden Layer with n neurons - we will treat n as a parameter and see how it affects the output classification
    • 2 outputs, $\hat{y}_1$, $\hat{y}_2$
    • Activation function: $\sigma(x) = \tanh(x)$
    • Softmax output
    • Loss function: Cross entropy loss $$L = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \ln{\hat{y}_{n,i}}$$

    Forward Propagation Summary

    • $Z^{(1)} = xW^{(1)} + b^{(1)}$
    • $a^{(1)} = \tanh(Z^{(1)})$
    • $Z^{(2)} = a^{(1)}W^{(2)} + b^{(2)}$
    • $\hat{y} \equiv a^{(2)} = softmax(Z^{(2)})$
    Backpropagation Summary

    • $\delta^{(2)} = \hat{y} - y$
    • $\delta^{(1)} = (1-\tanh^{2}(Z^{(1)}) \odot \delta^{(2)}{W^{(2)}}^T$
    • $\frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$
    • $\frac{\partial L}{\partial b^{(2)}} = \sum_{rows} \delta^{(2)}$
    • $\frac{\partial L}{\partial W^{(1)}} = {x}^T \delta^{(1)}$
    • $\frac{\partial L}{\partial b^{(1)}} = \sum_{rows} \delta^{(1)}$
    With the above, we have entirely defined how our ANN classifies training examples, how the weights and biases of the network affect the loss function and hence have our gradient descent update rule completely specified (up to the learning rate parameter).

    Implementation in Python

    Now instead of defining our network and neurons as a class, I've decided to closely follow this page (which actually inspired all of these ANN related posts) as it is more transparent and easier to grasp what is going on in the code. So without further delay, I present the code for an ANN with a single hidden layer of size 3:


    import numpy as np import sklearn.datasets
    import pandas as pd import matplotlib.pyplot as plt 
    X, y = sklearn.datasets.make_moons(200, noise=0.20)
    df = pd.DataFrame()
    df['x1'] = X[:,0]
    df['x2'] = X[:,1]
    df['z'] = y

    ##define ANN which takes inputs X and y with parameters n_iter as number of iterations and learning_rating the rate used in gradient descent
    def calculate(X ,y, num_hidden_neurons, n_iter=2000, learning_rate = 0.01, regularisation_rate = 0):
        loss = [];
        #initialise weights and biases
        X_size = len(X)
        W1 = np.random.randn(2, num_hidden_neurons) / np.sqrt(2)
        b1 = np.zeros((1, num_hidden_neurons))
        W2 = np.random.randn(num_hidden_neurons, 2) / np.sqrt(num_hidden_neurons)
        b2 = np.zeros((1, 2))
        model = {}
        for i in xrange(n_iter):
            Z1 =
            a1 = np.tanh(Z1)
            Z2 =
            yhat = softmax(Z2)
            d2 = yhat
            d2[range(X_size),y] -= 1
            d1 = (1-a1**2)*
            dW2 =
            db2 = np.sum(d2, axis=0, keepdims = True)
            dW1 =
            db1 = np.sum(d1, axis=0)
            dW1 += regularisation_rate * W1     
            dW2 += regularisation_rate * W2  
        #gradient descent    
            W1 = W1 - learning_rate * dW1
            b1 = b1 - learning_rate * db1
            W2 = W2 - learning_rate * dW2
            b2 = b2 - learning_rate * db2
            model = {'W1' : W1, 'b1' : b1, 'W2' : W2, 'b2' : b2}
        return loss, model


    The code is fairly self explanatory, but I'll walk through the steps
    1. Importing all the libraries we need
    2. Initialisation of the weights and biases (more on this later)
    3. Feedforward step - implemented via matrices exactly how we calculated them in the previous blog post
    4. Backpropagation step, again - implemented just how we derived them before. Note as the $b^{(i)}$ matrices consist of a single unique row each, we let python broadcast them into the appropriate dimensions when it does the matrix addition in the calculation of $Z^{(1)}$ and $Z^{(2)}$.
    5. Gradient descent to alter the weights after each iteration in order to minimize the loss function.
    6. Regularisation rate - I haven't mentioned this before, I've put it in there but set it to $0$ as to have no effect on the results. I will discuss this term later.
    7. Output a dictionary Model which contains our weight and bias matrices $W^{(1)}$, $b^{(1)}$, $W^{(2)}$, $b^{(2)}$.
    8. Note that I call a function (that's not defined here) called predict which will make predictions based on the input $X$ or return the cross entropy loss of a set of weights and biases when applied to $X$. The calculation of the cross entropy loss is the reason that $y$ is an input into the overall calculate function; if we were just calculating the weights and biases we would only require $X$ as an input. See code for predict at the bottom of this post.


    #function call - 3 neurons in the hidden layer
    loss, model = calculate(X, y, 3)
    The decision boundary is plotted below
    Notice the non-linear decision boundary, and how well it captures the separation of data points! This gives us great hope that if the data generating process is distinctly non-linear then even though the standard GLM (i.e logistic regression) decision boundaries are crude, with an ANN we can potentially get a good fit!
    The corresponding cross entropy loss as a function of iterations is also plotted below:
    Ok, the overall trend of the loss looks alright, but what the hell are the spikes??? Stay tuned...

    What about if we have 10 neurons in the hidden layer?
    #function call - 10 neurons in the hidden layer
    loss, model = calculate(X, y, 10)
    and the corresponding loss: 
    Now the spikes are gone and the loss is lower than in the 3 neuron case. Whoa this is a really good fit, or is it? How good is too good? Obviously here, with the 10 neurons the ANN is not only fitting the overall general shape of the data, but its capturing the noise from the input too. Just looking at the data (ignoring the decision boundary) we can imagine a kind of smooth sinusoidal shape separating the classes (as in the 3 neuron case above). Notice the jagged edge near the long red dot amongst the blue dots at coordinates (-0.5, 0.7), you can see that the single red training example has 'pulled' the decision boundary towards it. This is an ominous sign as the decision boundary is far too sensitive to a single data point, remember the idea is to capture the essence of the data generating process excluding noise.
    If we were to assess the performance of our ANN purely on the cross entropy loss - it would seem that the network with 10 neurons in the hidden layer outperforms the one with 3. However we always need to test the ANN on data it hasn't seen before - cross validation is usually employed to do so.

    Well there it is, if you've followed the posts from pt.1 of this series knowing the bare minimum about ANNs only armed with a pen, paper and a desire to learn, then you should be able to implement your own feedforward ANN from scratch! It was a laborious task (all the matrix algebra, etc) but I hope it provided insight into the construction and architecture of ANNs.

    Useful functions

    # Predict y_hat from input X or calculate loss
    def predict(model, X, y, calc_loss = False):
        W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
        # Forward propagation
        Z1 = + b1
        a1 = np.tanh(Z1)
        Z2 = + b2
        probs = softmax(Z2)
        if calc_loss:
            loss = -np.log(probs[range(len(X)), y])
            tot_loss = np.sum(loss)
            return 1./len(X) * tot_loss
        return np.argmax(probs, axis=1)  
    #Compute softmax values for each sets of scores in x.
    def softmax(x):
        return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims = True)
    Given a model, the predict function returns a classification (0 or 1) for our input data X or the cross entropy loss. This is used to keep track of the loss at each iteration of the weight updates. Note that this is just for illustrative purposes, since it is an expensive calculation in reality, it isn't wise to call it every iteration - maybe every 100, or 1000 or so.

    Softmax is used to calculate the softmax of the entries of a vector - used to calculate our $\hat{y}$.

    Next steps

    Currently I'm working my way through the CS229 Machine Learning course by Andrew Ng which takes up the majority of my spare time. It's a fantastic course which has accompanying lectures available on youtube and is a great introduction to ML techniques.

    However, in the future I am to continue with this implementation but tweak it to explore the following
    • Why the weights were initialised in such a peculiar way (why not just set them all to 0?)
    • Extending to more than 1 hidden layer
    • How to prevent overfitting
      • Regularisation (I mentioned it above, but did not explain what/how/why)
      • Drop out techniques
    • Stochastic gradient descent
      • A variant of our batch gradient descent which is industry practise when training an ANN (due to massive datasets)
    • Different applications - when I get around to it, I'll implement an ANN for the classic MNIST handwritten digit recognition problem.