## Friday, 2 February 2018

### Recommender Systems: Collaborative Filtering and matrix factorization: Part 2

The previous post looked at the mechanics of SVD and how we can interpret and use it in the collaborating filtering approach to building a recommender system.

### User and Items rating matrix

Suppose we have a matrix $r \in \mathbb{R}^{m \times n}$ which each row corresponds to a single user (i.e. person) and each column corresponds to a single item (i.e. movie) such that $r_{ui}$ corresponds to the rating user $u$ gave rating $i$. Now this matrix will generally be constructed from all user and rating information that is available and it is quite obvious that not every user will have rated every item available. In fact when we use the MovieLens dataset we'll see that only about 6% of the matrix $r$ are valid entries - the rest are missing and are the ratings we would like to predict by using our recommender system once it is built.

The sparsity of the matrix $r$ poses an issue using the SVD approach covered previously, as SVD aims to reconstruct a matrix as is; it considers all entries in the matrix $r$ as valid in the reconstruction process. So we need to make a slight tweak to our approach to building the recommender system. We still want a latent variable representation, but we only want to use the 6% of ratings that are populated.

### Latent space representation

In order to define the latent space representations, we make the following assumptions:
1. Each user can be described by a $k$ dimensional latent vector $\textbf{x}_u = \left[ x_{u1} \dots x_{uk} \right] \in \mathbb{R}^k$
2. Each item can be described by a $k$ dimensional latent vector $\textbf{y}_i = \left[ y_{i1} \dots y_{ik} \right]\in \mathbb{R}^k$
Hence we can approximate the rating user $u$ gave item $j$ as $$\hat{r}_{ui} = \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}$$
Hence we can define the loss function (with some regularisation) as trying to minimise the difference between the predicted and actual ratings a every user gave each item:
$$L = \sum_{u,i \in S}\left( r_{ui} - \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}\right)^2 + \lambda_x \sum_{u} ||\textbf{x}_u||^2 + \lambda_y \sum_{i} ||\textbf{y}_i||^2$$ where $S$ is the set of valid ratings and $\lambda_x$, $\lambda_y$ are regularisation hyperparameters. We can rewrite a little more succinctly using matrix notation: $$L = \left( \textbf{r}_u - \textbf{x}^T_uY^T \right)^2 + \lambda_x \sum_{u} ||\textbf{x}_u||^2 + \lambda_y \sum_{i} ||\textbf{y}_i||^2$$ where $$\textbf{r}_u = \left[ r_{u_1} \dots r_{u_m} \right]$$ is the rating vector for user u's ratings of the m items and $$Y = \left[ \begin{array}{ccc} -& y^{T}_{1} & - \\ - & y^{T}_{2} & - \\ & \vdots & \\ - & y^{T}_{m} & - \end{array} \right]$$ is a composite vector of the $y_i$'s. Now this equation looks quite familiar (recall the definition of the OLS loss function...) hence if we fix one of the $\textbf{x}^T_u$ or $\textbf{y}_{i}$ then we have effectively reduced the cost function to that of an OLS problem. The procedure is simple, we fix $\textbf{y}_i$ and find the corresponding  $\textbf{x}^T_u$ which minimises the loss function, this then feeds into the next iteration where we fix  $\textbf{x}^T_u$ and find the  $\textbf{y}_i$ which minimises the updated loss function and continue until convergence (or some other appropriate stopping criteria). This process is known as (for obvious reasons) as Alternating Least Squares and we'll derive the close form solution in the next section.

### Alternating Least Squares

We can then minimize $L$ by differentiating with respect to the components of our user vector $x_u$ and setting to $0$. Using summation notation:
$$\frac{\partial L}{\partial x_{ua}} = \sum_{i}^{m} -2 \left(r_{ui} - \sum_{j}^k x_{uj} Y_{ij} \right)Y_{ai} + 2 \lambda_x x_{ua}$$ Hence for a minimum and in matrix form $$0 = -\left( \textbf{r}_u - \textbf{x}^T_uY^T \right) Y + \lambda_x \textbf{x}_{u}$$ we can re-arrange to get $$\textbf{x}_{u}^T = \textbf{r}_u Y \left( I \lambda_x + Y^TY\right)^{-1}$$ We can perform an analogous calculation for $\textbf{y}_i$ which yields $$\textbf{y}_i^T = r_i X \left( I \lambda_i + X^TX \right)^{-1}$$ where $I$ is the $k \times k$ identity matrix. Hence we have derived the updates for our parameters required at each iteration - the next section looks at the results.

Next time we'll have a look at the python implementation and results to see how well our recommendation system works!

## Thursday, 26 October 2017

### What is a recommender system?

Think about any website which has a community of users who purchase/interact with content (items) on that website. Based on users' interactions with the items, various personalised recommendations can be tailored to specific user behaviour. More concretely, consider the purchase suggestions Amazon offers you, or movies that Netflix recommend you watch based on your past viewing behaviour. The machinery behind these recommendations is the basis of recommender systems.

Companies are looking to personalise a user's experience in a way that resonates with them - to make the user experience feel unique and tailored to their needs. These days websites capture all sorts of attributes from a user's experience/interaction - time spent on pages, scroll speed, viewing behaviour, purchasing behaviour etc... One may ask how we can use such detailed information to offer a better user experience; through the construction of a recommender system.

Now recommender systems are generally based on two types of user behaviour:
1. Implicit feedback
• Information based on user behaviour such as viewing/purchasing/click behaviour; user preferences can be inferred.
2. Explicit feedback
• Where the user has given an explicit rating to a movie or item (a 5 star rating for example).
There are generally two approaches to building a recommender system, content based filtering and collaborative filtering.

### Content based filtering

The approach of content based filtering is as follows:
• Each item is characterised by a series of properties or features. For example if we are using content based filtering to build a movie recommendation system then each movie could have attributes such as lead actor, director, genre, etc.
• A profile based on these attributes is built for each user and hence new items can be recommended to the user via a similarity measure between the user's profile and a specific items profile.
This approach requires the features of each item to be defined explicitly and hence it is limited to the level of feature engineering. It also means that as new items are added - which have new characteristics - these must be calculated for each of the existing items too and the dimensionality of our feature space will rise exponentially. We will not focus on this approach in the remainder of the blog.

### Collaborative Filtering

As its name suggests, collaborative filtering uses an aggregate (or collaborative) approach to suggests new items, based on the premise that users who share a common interest in certain items will also share a common interest in other items. Unlike content based filtering, this approach doesn't require hand crafted features for each item and hence can be more easily scaled to larger and even different domains. We will focus on the collaborative filtering approach in building our recommender system and will use the MovieLens dataset in our example [1].

The approach we take in building the collaborative filter borrows from some linear algebra results, namely the Singular Value Decomposition (SVD) of a matrix. I will first define exactly what SVD is and then I'll add some context into how it helps us with creating a recommender system.

### Singular Value Decomposition (SVD)

Given a matrix $M \in \mathbb{R}^{m \times n}$ there exists a factorisation $M = U \Sigma V^{*}$ where
• $U \in \mathbb{R}^{m \times m}$ us a unitary matrix (i.e  $UU^{*} = U^{*}U = I$)
• $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal, with non-negative numbers (singular values)
• $V^{*} \in \mathbb{R}^{n \times n}$ is the conjugate transpose of a unitary matrix $V$
Note that since I've taken $\mathbb{R}$ as the field the matrix entries come from, the conjugate transpose is just the transpose (i.e. $V^{*} \equiv V^{T}$) and the unitary matrices are also known as orthogonal matrices.

What this is telling us, is that any matrix can be decomposed into the above structure. If you think about the eigenvalue decomposition (EVD) of a matrix, it only exists for a square matrix, whereas the SVD of an arbitrarily (non-square) shaped matrix exists. There is a deep connection between the SVD and EVD of a matrix - the non zero entries of $\Sigma$ are the square root of the eigenvalues of $MM^T$. See section 3.2 of [2].

Yeah, yeah that's a cool result, but how does that help us with the recommender system? Well, the motivation is borrowed from Latent Semantic Analysis  (LSA) - a topic in Natural Language Processing which aims to understand relationships between documents and the words they contain. In LSA, a term-document matrix is created in which each entry contains the number of occurrences of a particular word in a particular document. This term-document matrix is then decomposed via SVD with the resultant matrices representing documents and terms which capture the pattern of term usage among documents. The top $k$ singular values are kept, such that the decomposition results in a representation of the original matrix in a lower dimensional space. The idea is that this lower dimensional representation captures the structure of the relationship between documents and terms, whilst filtering out the noise. In this lower dimensional 'latent' space one can simply find the similarity of documents or terms by using our $U$ and $V$ matrices.

So in our case, the analogy of a term-document matrix is a user-item matrix in which each row of the matrix is a user, and each column is an item. For our example the items will be movies. Each entry $(i,j)$ corresponds to the rating the $i^{th}$ user gave to the $j^{th}$ movie. The SVD will result in a latent space in which user and item vectors reside and we can calculate their similarities.

### Interpretation of SVD

#### Geometric Interpretation

In the case of $M \in \mathbb{R}^{n \times n}$ - aka a linear map within the same space - then there is a nice geometrical interpretation of the action of $M$, the image below illustrates this
• $V$ is an orthogonal matrix - which corresponds to a reflection or rotation
• $\Sigma$ is diagonal and hence is a scaling matrix
• $U$ is also an orthogonal matrix - which is another reflection or rotation

More generally, consider $M \in \mathbb{R}^{m \times n}$, a bilinear map between spaces, we can rewrite the SVD equation as $$MV = U \Sigma$$ If we consider each of the $j$ columns of $V$ separately then it is apparent $Mv_{j} = \sigma_j u_{j}$. Recalling that since $U$ and $V$ are orthogonal matrices and hence their rows form orthonormal bases, the action of $M$ is as follows: the basis of a coordinate system $\{v_1, \dots, v_n \}$ is mapped to a 'scaled' basis of a different coordinate system $\{\sigma_{1} u_{1}, \dots, \sigma_{m} u_{m} \}$. In words, every matrix $M \in \mathbb{R}^{m \times n}$ can be interpreted as having the following action
1. A first rotation in the input space
2. A simple positive scaling that takes a vector in the input space to the output space
3. And another rotation in the output space

#### SVD as an approximation

Note that in the SVD decomposition there is an equality; the matrix $M$ is exactly reconstructed. In practice, we don't actually want the exact matrix, instead we want a lower k-dimensional approximation which will have less noise and (hopefully) have captured the latent structure of the user/item relationships. Now it turns out through SVD -if we keep only the top k singular values - the resulting matrix is the best k-rank approximation of $M$, where the notion of 'best' is defined by the Frobenius norm. See Section 2.1 of [2] for details.

We can think of the rank k approximation via SVD of a real matrix $M$ as the decomposition of $M$ into the sum of k rank 1 matrices. That is
$$M = \sum_{j=1}^{k} \sigma_j u_j v_{j}^{*}$$
with $\sigma_1 > \sigma_2 > \dots >\sigma_k > 0$. This is a kind of expansion of $M$ in terms of rank 1 matrices, with the first term capturing the 'most' of $M$, followed by the second term and so on, with each including term adding to the accuracy of the approximation.

For a great exposition on SVD and some various angles on interpretation, check out [3] by Jeremy Kun which should help you gain some intuition.

Well, that wraps up a short summary on the techniques we'll use to build our recommender system! Next post will be focused some of the subtleties of the implementation and some tweaks which have been used to win competitions in the past!

### References

[1] https://grouplens.org/datasets/movielens/
[2] http://www.math.ucla.edu/~dakuang/cse6040/lectures/6040_lecture15.pdf
[3] https://jeremykun.com/2016/04/18/singular-value-decomposition-part-1-perspectives-on-linear-algebra/

## Thursday, 22 June 2017

### Transfer learning and hotdog classification!

Inspired by a recent episode of Silicon Valley where Jian Yang builds an app to classify pictures as either Hotdog or Not Hotdog, I decided to have a crack at retraining the final layer of the Inception V3 network to do so.

Thankfully, the lovely folks at Google make it ridiculously easy to do so. So instead of maxing out my poor little Macbook pro for weeks (or months :/) of training - we can utilise the already trained network and simply adjust the final layer for our use, more on this below.

Note that this post assumes familiarity of CNN architecture and its use in computer vision - see this great course at Stanford for a great introduction to CNNs - CS231n: Convolutional Neural Networks for Visual Recognition.

## Background

The Inception v3 image recognition neural network came out of Google at the end of 2015. The focus of the architecture builds upon the original inception architecture of GoogLeNet [1] which was also designed to perform well even under strict constraints on memory and computational budget [2].

### Inception v1

The basis of this (at the time) new architecture was the so called inception module. It builds on the traditional convolutional layer which has a fixed height and width. Instead it applies and assortment of different size filters to the input, allowing the model to perform multi-level feature extraction on each input:

Figure 1: Taken from Rethinking the inception architecture for computer vision[1] - the naive inception module.

As you can see above, there are $1 \times 1$, $3 \times 3$ and $5 \times 5$ convolutions as well as a $3 \times 3$ max pooling applied to the input. In theory, this is great - as we are able to exploit multiple level spatial correlations in the input, however using a large number of $5 \times 5$ filters can be computationally expensive.

This train of thought led to the next iteration of the inception module:
Figure 2: Taken from Rethinking the inception architecture for computer vision[1] - inception module with dimension reductions.

This architecture is essentially the same as the initial proposal except that each of the larger filters is preceded by a $1 \times 1$ convolution. These  $1 \times 1$ convolutions act as a dimension reduction tool in the channel dimension, leaving the spatial dimension untouched. The idea is that the important cross channel information will be captured without explicitly keeping every channel. Once the channel dimension reduction has been performed, these results feed into the larger filters which will capture both spatial and channel correlations. The $1 \times 1$ filters also have ReLU activation functions which introduce additional non-linearities into the system.

Put simply in [1]:

One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.

### Inception v2/3

A key realisation of the earlier iterations of Inception was that the improved performance compared to its peers was driven by dimensionality reduction using the $1 \times 1$ filters. This has a natural interpretation in computer vision; as the authors put it [2]:

In a vision network, it is expected that the outputs of near-by activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation and that this should result in similarly expressive local representations.

With reducing computational complexity in mind (hence the number of parameters in the network) the authors sought a way to keep the power of the larger filters, but reduce how expensive the operations involving them were. Thus they investigated replacing the larger filters with a multi-layer network that has the same input/output size but less parameters.

Figure 3: Replacing a $5 \times 5$ filter with a network of $3 \times 3$ and a $1 \times 1$ filter.

Thus the inception module now looks like
Figure 4: Inception module where each $5 \times 5$ filter has been replaced by two $3 \times 3$ filters.

This replacement results in a 28% saving in computational efficiency [2]. Inception v2 and v3 share the same inception module architecture, the different arises in differences in training - namely batch normalization of the fully connected layer of the auxiliary classifier.

### Transfer Learning

Now that we have a (very) high level understanding of the architecture of Inception v3, let's talk about transfer learning.

Given how computationally expensive it is to train a CNN from scratch - a task which can take months - we can use the results of a pre-trained CNN to do the feature extraction for us. The idea is that the generic features of an image (edges, shapes, etc) which are captured in the early layers of the CNN are common to most images. That is, we can utilise a CNN that has been trained for a specific task on millions of images to do the feature extraction for our image recognition task.

We then strip the CNN of the final fully connected layer (which feeds to the softmax classifier) and replace it with a fully connected layer built for our own task and train that. There are several approaches to transfer learning which are dependent on the task, such as we can use the pre-trained CNN as a feature extractor as detailed above, or we can use it as a starting point and retrain the weights in the convolutional layers. The amount of training data you have for your task and the details of the task will generally dictate the approach. See [3] for a detailed study on transfer learning.

The pre-trained CNN we'll be utilising an Inception v3 network as described above, trained on the ImageNet 2012 dataset which contained 1.2million images and 1000 categories.

### Implementation

Thankfully Google had the foresight to make it dead easy to perform the aforementioned transfer learning using Tensorflow.
For my training images, I utilised the work done in [4] which provided around 350 hotdog images and around 50 not hotdog (notdog) images. We then utilise the script retrain.py which we'll pass parameters to from the terminal.

The script will determine the number of classes based on the number of folders in the parent folder. You don't need to worry about specific names for the images, you just need to make sure the folders contained the correctly labeled images.

The rest of the parameters are fairly self explanatory and are explained on the tutorial. One interesting set of parameters to note are the random_crop and random_scale, these distort the training images as to preserve their meaning but add new training examples. For example a picture of a hotdog rotated 90 degrees is still a hotdog; instead of sourcing additional hotdog pictures we can just rotate an existing training example to teach the classifier something new.

### Results

After running retrain.py with 500 training steps, I received the below training summary

99% Validation accuracy - not bad at all!
Let's have a look at a specific example and try the below hotdog

hotdogs (score = 0.99956)
random (score = 0.00044)
Great news, this image is in fact a hotdog and our CNN recognises it as such with a high degree of confidence. What about a notdog?

random (score = 0.92311)
hotdogs (score = 0.07689)
This is a sunflower not a hotdog! Yet the classifier is only rather confident rather than completely confident that it is a notdog. This can be attributed to the limited number of training examples that were used in training the final layer before classification.
What about if we try to trick it - with a HOTDOG CAR!?

random (score = 0.70404)
hotdogs (score = 0.29596)

Not too bad at all, obviously the shape is similar to that of a hot dog but the chassis, wheels and other car-like features offer enough difference for our CNN to make the distinction.

### What's next?

I haven't decided if I'll build some CNNs or have a crack at an LSTM implementation next - both will be using Tensorflow but I've got some ideas I want to play around with first which will shape which direction I head in.

### References

[1] Szegedy, C. et al. 2014. Going deeper with convolutions, https://arxiv.org/pdf/1409.4842.pdf
[2] Szegedy, C. et al. 2015. Rethinking the Inception Architecture for Computer Vision, https://arxiv.org/pdf/1512.00567.pdf.
[3] Donahue, J. et al. 2013. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition,  https://arxiv.org/abs/1310.1531.

## Friday, 31 March 2017

### Where to now?

Now that we've got the basics of neural networks down pat, this opens up a world of related techniques that can build on this knowledge. I thought to myself that I'd get stuck into some Natural Language Processing (NLP) with a view to eventually implement some sort of Recurrent Neural Network using TensorFlow.

The wonderful folks at Stanford have all the lecture notes and assignments up online at http://web.stanford.edu/class/cs224n/syllabus.html which is a fantastic resource.

Anyway before we can even begin to think of the applications, we must understand how to represent our text for input into any machine learning algorithm.

### One-hot encoding

Naively, we may think to take our entire vocabulary (i.e all words in the training set - size $V$) and make a huge vector, with each entry in the vector corresponding to a word in our vocabulary. Thus each word is represented by a $V \times 1$ vector with exactly one entry equal to $1$ and all other entries $0$. For example, a word $x_i$ would be represented as
$$x_i = \left[ \begin{array}{c} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{array} \right]$$This is called one-hot encoding. The representation seems simple enough, but it has one problem - there is no natural way to embed the contextual similarity of words. Imagine we had three words; $x_1$ = dog, $x_2$ = cat and $x_3$ = pencil. Intuitively, if you were asked which two words were "similar", you would say in a given context, $x_1$ and $x_2$ are similar, whilst $x_3$ shares less similarity with $x_1$ or $x_2$. Now supposed our three words have the following one-hot encoded representation $$x_1 = \left[\begin{array}{c} \vdots \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ \vdots\end{array} \right], x_2 = \left[ \begin{array}{c} \vdots \\ 1 \\ \vdots \\ 0 \\ \vdots \\ 0 \\\vdots \end{array} \right], x_3 = \left[ \begin{array}{c} \vdots \\ 0 \\ \vdots \\ 0 \\ \vdots \\ 1 \\\vdots \end{array} \right]$$ We could hope that the dot product between vectors would give us a notion of similarity, but each of the $w_i$ are orthogonal, that is $x_i \cdot x_j = 0 \hspace{.1in} \forall i \neq j$ This representation also isn't all that great given that our vocabulary could be of the order of hundreds of millions of words, which would result in humongous one-hot encoded vector representations in which almost all (except for one) entry is a 0. This called a sparse representation for obvious reasons. We'll now have a look at some more computationally efficient (and more accurate!) dense word representations.

### Efficient Estimation of Word Representations in Vector Space

In 2013, Mikolov et al. presented Efficient Estimation of Word representations in Vector Space which proposed two architectures:
• Continuous Bag of Words (CBOW) and
• Skip-gram model
with the aim of minimizing the computational complexity traditionally associated with NLP models. These representations somewhat surprisingly capture a lot of syntactic and semantic information in the structure of the vector space. That is, that similar words tend to be near one another (in terms of Euclidean distance) and that vector operations can be used to analogously relate pairs of words. There is the famous $\text{king} - \text{man} + \text{woman} \approx \text{queen}$ equation which is quite remarkable. In the one-hot view of the world, all of our vectors were orthogonal so dot products or any sort of vector addition/subtraction had no intuitive meaning or captured any relationship between words. See here for a cool visualisation.

### Continuous Bag of words

Simply put, the CBOW aims to predict a word given an input context. For example, we would like a CBOW model to predict the word fox from an input context "The quick brown...". To perform such a task, it was proposed to use a simple feedforward neural network (wait - we know all about those now!) without an activation function in each neuron in the hidden layer. The idea is to train the NN on this classification task and then simply pick out the weight matrix in the hidden layer - this will be our dense representation.

#### Architecture of CBOW

The CBOW model looks at m words either side (or $C$ total as in the diagram below) of the target/center word $w_c$ in order to predict it . Below is an image the architecture of a CBOW model - notice that each input word is a one-hot encoded vector of dimension $V \times 1$ where $V$ is the size of our vocabulary. $W$ is a $N \times V$ matrix, where $N$ is the dimension of the representation of our input vectors $x_1, \dots, x_c$, where $x_{i}$ is the one hot encoded vector of the $i^{th}$ context word. We define $W x_{i} \equiv v_{i}$ which is a single column of $W$ due to the one hot encoding. Below is a visual representation of a CBOW model.

Forward propagation:
•  We have $C$ one hot encoded input vectors which feed into the hidden layer via the weight matrix $W$.
• Since we have $C$ inputs acting on $W$, the resulting matrix $\bar{v}$ is computed from the average as follows: $$\hat{v} = \frac{1}{C} W \left( x_{1} + x_{2} + \dots + x_{C} \right) = \frac{1}{C} \left ( v_{1} + v_{2} + \dots + v_{C} \right)$$
• We actually don't have an activation function here (it is the identity mapping). So from the hidden layer to the output layer, we propagate forward via weight matrix $W'$ which is $V \times N$
• Define the $j^{th}$ row of $W'$ as $u^T_{j}$.
• Thus the $j^{th}$ entry of the unnormalised output (a $V \times 1$ vector) is simply $z_j = u^{T}_{j} \hat{v}$
• Finally, we apply our old friend the softmax function to obtain our output: $$\hat{y}_j = \frac{e^{z_j}}{\sum_{k=1}^{V} e^{z_k}}$$
• To measure the loss of our network, we'll use the cross entropy loss function, which we saw in the derivation of a feedforward neural network. $$l = - \sum_{j=1}^{V} y_j \log(\hat{y}_j)$$
• This is just the negative of the log likelihood $p \left( w_k | w_{I_1}, \dots , w_{I_C} \right)$ where $w_{I_j}$ represents the $j^{th}$ word of input context $I$.
• Thus we have $$l = - u^{T}_{c} \hat{v} + \log \left(\sum_{k=1}^V e^{ u^{T}_k \hat{v}} \right)$$
• where c is the index of the correct word - the one hot vector for the correct word as a $1$ at the $c^{th}$ position.
• We now just need to optimise the loss function to recover our input representation $\hat{v}$ and the output representation $u^{T}_j$ for $j = 1, \dots, V$.
Let's now take a look at how we optimise our loss function $L$ via gradient descent using backpropagation. We won't go through algebra, but it is exactly analogous to what we've seen before. Now $$\frac{\partial J}{\partial u_j} = \left\{ \begin{array}{ll} (\hat{y}_j-1)\hat{v} & j = c \\ \hat{y}_j \hat{v} & \text{otherwise} \end{array} \right.$$ and $$\frac{\partial J}{\partial \hat{v}} = -u_c + \sum_{k=1}^{V} \hat{y}_k u_k$$
These define the update rules for stochastic gradient descent. Note the sum over $V$, the size of our vocabulary, which can contain potentially millions of tokens.  This isn't computationally efficient and will result in terrible performance if we have do do this when training the model. Suspend your disbelief for now and I'll cover some methods that people have come up with to tackle this problem in my next blogpost. We'll carry on for now and introduce the skip-gram model.

### Skip-Gram

The skip-gram does somewhat the opposite of the CBOW. A skip-gram model takes a single word (the center word) as input and predicts surrounding words (context). It uses the same approach in that we'll train an ANN to perform the prediction task and we'll just pick up the weight matrices it learns as our dense representation of our words.

#### Architecture of the Skip-Gram model

The Skip-Gram model takes an input (context) word $x_k$ and will predict the context from it. The input word is a single one-hot encoded vector. We denote the output by $C$ -  a collection of vectors ${\hat{y_1}, \dots, \hat{y_C}}$. Each $\hat{y_j}$ is a vector of probabilities for each word in the vocabulary occurring. Below is an image the architecture of a Skip-Gram model - notice that we have the same matrices $W$ and $W'$ as in CBOW - these operate the same way on our single input as they did on $\hat{v}$ in CBOW.

Forward propagation:
•  We have a single one hot encoded input vector $x_k$ which feeds into the hidden layer via the weight matrix $W$. Define $W x_k \equiv v_c$ - this is also called the center word input representation.
• Again, We actually don't have an activation function here (it is the identity mapping). So from the hidden layer to the output layer, we propagate forward via weight matrix $W'$ which is $V \times N$
• Thus the $j^th$ element of the unnormalised output (a $V \times 1$ vector) is simply $z_j = u^{T}_j v_c$. $C$ copies of the output vector are output - one vector for each of the words in the context we are trying to predict.
• Again, we apply the softmax function to each of the $C$ output vectors (to obtain our output: $$\hat{y}_{i,j} = \frac{e^{z_j}}{\sum_{k=1}^{V} e^{z_k}} = \frac{e^{u^{T}_j v_c}}{\sum_{k=1}^{V} e^{u^{T}_k v_c}}$$ where $i \in \{1, \dots , C\}$ indexes each of the output vectors.
• Note that since we used $C$ copies of the same output matrix for the prediction of each surrounding word, this amounts to a conditional independence assumption. That is the probability of each surrounding word occurring given the center word is independent of it's relative positioning to it.
• To measure the loss of our network, we'll again use the cross entropy loss function. However, now that there are $C$ copies of the output, we need to sum over them $$l = - \sum_{j=1}^{V} \sum_{i=1}^{C} y_{i,j} \log(\hat{y}_{i,j})$$ represents the index of the actual output word. For example if the actual output words for a center word like were I and ice-cream then there are two one hot encoded vectors $y_{1}, y_{2}$. Suppose I has position $10,000$ in our vocabulary then $y_{1,10000}$ would be the only non zero entry of $y_{1}$. Similarly $y_{2,5000}$ would be the only non-zero entry of $y_{2}$ if ice-cream had position $5000$ in the vocabulary.
• This is just the negative of the log likelihood $p \left( w_{c-m},\dots, w_{c-1}, w_{c+1}, \dots , w_{c+m} | w_c \right)$ where $w_{j}$ where $(j \neq c)$ represents the $j^{th}$ word of surrounding window and $w_c$ is the center word.  We have introduced the index $m$ to index the words surrounding the center word. That is, the window of sized $C$ is symmetric about the center word $v_c$ such that there are $m$ words to the left and right of $v_c$.
• Thus we have $$l = - \sum_{j=0, j\neq m}^{2m} u^{T}_{c-m+j} v_c + 2m \log{\sum_{j=1}^{V} e^{u^{T}_j v_c }}$$
• It's important to see that this loss function also suffers from the problem that we need to sum over our entire vocabulary $V$ at each parameter update step - which is computationally far too expensive and sometimes not even possible.

#### Results

There are some interesting remarks on the word2vec Google Code page regarding model selection (skip-gram vs CBOW) and hyper parameter selection. We won't get into the details until we've covered the aforementioned sampling approaches for evaluating the models, but one point to note is
• architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)
We can understand the skip-gram performing better for infrequent words as the embedded words are not averaged as they are in CBOW (i.e when we define $\hat{v}$ in CBOW). The averaging process will dampen information from infrequently occurring words, contrasted with the skip-gram model where no such process takes place.

In a second paper published by Mikolov et al Distributed Representations of Words and Phrasesand their Compositionality they published the following results for a $1000$ dimensional skip-gram model. Taking the first two principal components, the following image was created

The caption says it all really - note the geometric similarity of differences between capital cities and their countries! This was captured by the skip-gram model without providing any supervised information regarding these relationships. You can imagine simple vector addition holding approximately true in this 2D vector space: $$\text{Portugal} - \text{Lisbon} + \text{Beijing} \approx \text{China}$$

#### Conclusion

In the next blog I'll look at some clever sampling based approaches that people have come up with to tackle the problem of having to normalise over the entire vocabulary $V$ in order to efficiently train the model. Then we can mess around with some text classification tasks, by using our word embeddings in our favourite ML techniques! The hope is that once we've played around with some toy problems to get a strong handle of the implementations of word2vec, then I'll introduce Recurrent Neural Networks and then we can get into the nitty gritty of some super interesting deep learning architectures :)

## Friday, 24 February 2017

### Optimisation routines

Now that we are able to train the Neural Network, let's take a look at how we can optimise the training as in reality we will be dealing with deep (many hidden neurons) networks which will potentially take days to train. I'm going to take a step back and have a look at how we might generally attack a convex optimisation problem. Convex optimisation problems are very well researched and understood area, so it makes sense to have a look at these first.

### The problem

We are going to fit a simple logistic regression to the following data
• http://cs229.stanford.edu/ps/ps1/logistic_x.txt
• http://cs229.stanford.edu/ps/ps1/logistic_y.txt
There are two predictors (columns) in the first file and 1 outcome ($y = \pm 1$) in the second. Note that I've borrowed these from the cs229 class that I'm currently working through. Our task is to minimise  the average empirical loss: $$L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log (1 + \exp^{-y^{(i)} \theta^{T} x^{(i)}})$$ where $y^{(i)}$ is the actual outcome for observation $x^{(i)}$. Note that $x^{(i)}$ is a vector containing the predictors for that observation.

### Next steps

Currently I'm working my way through the CS229 Machine Learning course by Andrew Ng which takes up the majority of my spare time. It's a fantastic course which has accompanying lectures available on youtube and is a great introduction to ML techniques.

However, in the future I am to continue with this implementation but tweak it to explore the following
• Why the weights were initialised in such a peculiar way (why not just set them all to 0?)
• Extending to more than 1 hidden layer
• How to prevent overfitting
• Regularisation (I mentioned it above, but did not explain what/how/why)
• Drop out techniques