Journey into Machine Learning: March 2017

Where to now?

Now that we've got the basics of neural networks down pat, this opens up a world of related techniques that can build on this knowledge. I thought to myself that I'd get stuck into some Natural Language Processing (NLP) with a view to eventually implement some sort of Recurrent Neural Network using TensorFlow.

The wonderful folks at Stanford have all the lecture notes and assignments up online at http://web.stanford.edu/class/cs224n/syllabus.html which is a fantastic resource.

Anyway before we can even begin to think of the applications, we must understand how to represent our text for input into any machine learning algorithm.

One-hot encoding

Naively, we may think to take our entire vocabulary (i.e all words in the training set - size $V$) and make a huge vector, with each entry in the vector corresponding to a word in our vocabulary. Thus each word is represented by a $V \times 1$ vector with exactly one entry equal to $1$ and all other entries $0$. For example, a word $x_i$ would be represented as

$$ x_i = \left[ \begin{array}{c} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{array} \right]$$This is called one-hot encoding. The representation seems simple enough, but it has one problem - there is no natural way to embed the contextual similarity of words. Imagine we had three words; $x_1$ = dog, $x_2$ = cat and $x_3$ = pencil. Intuitively, if you were asked which two words were "similar", you would say in a given context, $x_1$ and $x_2$ are similar, whilst $x_3$ shares less similarity with $x_1$ or $x_2$. Now supposed our three words have the following one-hot encoded representation $$ x_1 = \left[\begin{array}{c} \vdots \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ \vdots\end{array} \right], x_2 = \left[ \begin{array}{c} \vdots \\ 1 \\ \vdots \\ 0 \\ \vdots \\ 0 \\\vdots \end{array} \right], x_3 = \left[ \begin{array}{c} \vdots \\ 0 \\ \vdots \\ 0 \\ \vdots \\ 1 \\\vdots \end{array} \right]$$ We could hope that the dot product between vectors would give us a notion of similarity, but each of the $w_i$ are orthogonal, that is $x_i \cdot x_j = 0 \hspace{.1in} \forall i \neq j$ This representation also isn't all that great given that our vocabulary could be of the order of hundreds of millions of words, which would result in humongous one-hot encoded vector representations in which almost all (except for one) entry is a 0. This called a sparse representation for obvious reasons. We'll now have a look at some more computationally efficient (and more accurate!) dense word representations.

Efficient Estimation of Word Representations in Vector Space

In 2013, Mikolov et al. presented Efficient Estimation of Word representations in Vector Space which proposed two architectures:

Continuous Bag of Words (CBOW) and
Skip-gram model

with the aim of minimizing the computational complexity traditionally associated with NLP models. These representations somewhat surprisingly capture a lot of syntactic and semantic information in the structure of the vector space. That is, that similar words tend to be near one another (in terms of Euclidean distance) and that vector operations can be used to analogously relate pairs of words. There is the famous $\text{king} - \text{man} + \text{woman} \approx \text{queen}$ equation which is quite remarkable. In the one-hot view of the world, all of our vectors were orthogonal so dot products or any sort of vector addition/subtraction had no intuitive meaning or captured any relationship between words. See here for a cool visualisation.

Continuous Bag of words

Simply put, the CBOW aims to predict a word given an input context. For example, we would like a CBOW model to predict the word fox from an input context "The quick brown...". To perform such a task, it was proposed to use a simple feedforward neural network (wait - we know all about those now!) without an activation function in each neuron in the hidden layer. The idea is to train the NN on this classification task and then simply pick out the weight matrix in the hidden layer - this will be our dense representation.

Architecture of CBOW

The CBOW model looks at m words either side (or $C$ total as in the diagram below) of the target/center word $w_c$ in order to predict it . Below is an image the architecture of a CBOW model - notice that each input word is a one-hot encoded vector of dimension $V \times 1$ where $V$ is the size of our vocabulary. $W$ is a $N \times V$ matrix, where $N$ is the dimension of the representation of our input vectors $x_1, \dots, x_c$, where $x_{i}$ is the one hot encoded vector of the $i^{th}$ context word. We define $W x_{i} \equiv v_{i}$ which is a single column of $W$ due to the one hot encoding. Below is a visual representation of a CBOW model.

Forward propagation:

We have $C$ one hot encoded input vectors which feed into the hidden layer via the weight matrix $W$.
Since we have $C$ inputs acting on $W$, the resulting matrix $\bar{v}$ is computed from the average as follows: $$ \hat{v} = \frac{1}{C} W \left( x_{1} + x_{2} + \dots + x_{C} \right) = \frac{1}{C} \left ( v_{1} + v_{2} + \dots + v_{C} \right)$$
We actually don't have an activation function here (it is the identity mapping). So from the hidden layer to the output layer, we propagate forward via weight matrix $W'$ which is $V \times N$

Define the $j^{th}$ row of $W'$ as $u^T_{j}$.

Thus the $j^{th}$ entry of the unnormalised output (a $V \times 1$ vector) is simply $z_j = u^{T}_{j} \hat{v}$
Finally, we apply our old friend the softmax function to obtain our output: $$\hat{y}_j = \frac{e^{z_j}}{\sum_{k=1}^{V} e^{z_k}}$$
To measure the loss of our network, we'll use the cross entropy loss function, which we saw in the derivation of a feedforward neural network. $$l = - \sum_{j=1}^{V} y_j \log(\hat{y}_j)$$

This is just the negative of the log likelihood $p \left( w_k | w_{I_1}, \dots , w_{I_C} \right)$ where $w_{I_j}$ represents the $j^{th}$ word of input context $I$.
Thus we have $$l = - u^{T}_{c} \hat{v} + \log \left(\sum_{k=1}^V e^{ u^{T}_k \hat{v}} \right)$$
where c is the index of the correct word - the one hot vector for the correct word as a $1$ at the $c^{th}$ position.

We now just need to optimise the loss function to recover our input representation $\hat{v}$ and the output representation $u^{T}_j$ for $j = 1, \dots, V$.

Let's now take a look at how we optimise our loss function $L$ via gradient descent using backpropagation. We won't go through algebra, but it is exactly analogous to what we've seen before. Now $$\frac{\partial J}{\partial u_j} = \left\{ \begin{array}{ll} (\hat{y}_j-1)\hat{v} & j = c \\ \hat{y}_j \hat{v} & \text{otherwise} \end{array} \right. $$ and $$\frac{\partial J}{\partial \hat{v}} = -u_c + \sum_{k=1}^{V} \hat{y}_k u_k$$
These define the update rules for stochastic gradient descent. Note the sum over $V$, the size of our vocabulary, which can contain potentially millions of tokens. This isn't computationally efficient and will result in terrible performance if we have do do this when training the model. Suspend your disbelief for now and I'll cover some methods that people have come up with to tackle this problem in my next blogpost. We'll carry on for now and introduce the skip-gram model.

Skip-Gram

The skip-gram does somewhat the opposite of the CBOW. A skip-gram model takes a single word (the center word) as input and predicts surrounding words (context). It uses the same approach in that we'll train an ANN to perform the prediction task and we'll just pick up the weight matrices it learns as our dense representation of our words.

Architecture of the Skip-Gram model

The Skip-Gram model takes an input (context) word $x_k$ and will predict the context from it. The input word is a single one-hot encoded vector. We denote the output by $C$ - a collection of vectors ${\hat{y_1}, \dots, \hat{y_C}}$. Each $\hat{y_j}$ is a vector of probabilities for each word in the vocabulary occurring. Below is an image the architecture of a Skip-Gram model - notice that we have the same matrices $W$ and $W'$ as in CBOW - these operate the same way on our single input as they did on $\hat{v}$ in CBOW.

Forward propagation:

We have a single one hot encoded input vector $x_k$ which feeds into the hidden layer via the weight matrix $W$. Define $W x_k \equiv v_c$ - this is also called the center word input representation.
Again, We actually don't have an activation function here (it is the identity mapping). So from the hidden layer to the output layer, we propagate forward via weight matrix $W'$ which is $V \times N$
Thus the $j^th$ element of the unnormalised output (a $V \times 1$ vector) is simply $z_j = u^{T}_j v_c$. $C$ copies of the output vector are output - one vector for each of the words in the context we are trying to predict.
Again, we apply the softmax function to each of the $C$ output vectors (to obtain our output: $$\hat{y}_{i,j} = \frac{e^{z_j}}{\sum_{k=1}^{V} e^{z_k}} = \frac{e^{u^{T}_j v_c}}{\sum_{k=1}^{V} e^{u^{T}_k v_c}}$$ where $i \in \{1, \dots , C\}$ indexes each of the output vectors.
Note that since we used $C$ copies of the same output matrix for the prediction of each surrounding word, this amounts to a conditional independence assumption. That is the probability of each surrounding word occurring given the center word is independent of it's relative positioning to it.
To measure the loss of our network, we'll again use the cross entropy loss function. However, now that there are $C$ copies of the output, we need to sum over them $$l = - \sum_{j=1}^{V} \sum_{i=1}^{C} y_{i,j} \log(\hat{y}_{i,j}) $$ represents the index of the actual output word. For example if the actual output words for a center word like were I and ice-cream then there are two one hot encoded vectors $y_{1}, y_{2}$. Suppose I has position $10,000$ in our vocabulary then $y_{1,10000}$ would be the only non zero entry of $y_{1}$. Similarly $y_{2,5000}$ would be the only non-zero entry of $y_{2}$ if ice-cream had position $5000$ in the vocabulary.
This is just the negative of the log likelihood $p \left( w_{c-m},\dots, w_{c-1}, w_{c+1}, \dots , w_{c+m} | w_c \right)$ where $w_{j}$ where $(j \neq c)$ represents the $j^{th}$ word of surrounding window and $w_c$ is the center word. We have introduced the index $m$ to index the words surrounding the center word. That is, the window of sized $C$ is symmetric about the center word $v_c$ such that there are $m$ words to the left and right of $v_c$.

Thus we have $$l = - \sum_{j=0, j\neq m}^{2m} u^{T}_{c-m+j} v_c + 2m \log{\sum_{j=1}^{V} e^{u^{T}_j v_c }} $$

It's important to see that this loss function also suffers from the problem that we need to sum over our entire vocabulary $V$ at each parameter update step - which is computationally far too expensive and sometimes not even possible.

Results

There are some interesting remarks on the word2vec Google Code page regarding model selection (skip-gram vs CBOW) and hyper parameter selection. We won't get into the details until we've covered the aforementioned sampling approaches for evaluating the models, but one point to note is

architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)

We can understand the skip-gram performing better for infrequent words as the embedded words are not averaged as they are in CBOW (i.e when we define $\hat{v}$ in CBOW). The averaging process will dampen information from infrequently occurring words, contrasted with the skip-gram model where no such process takes place.

In a second paper published by Mikolov et al Distributed Representations of Words and Phrasesand their Compositionality they published the following results for a $1000$ dimensional skip-gram model. Taking the first two principal components, the following image was created

The caption says it all really - note the geometric similarity of differences between capital cities and their countries! This was captured by the skip-gram model without providing any supervised information regarding these relationships. You can imagine simple vector addition holding approximately true in this 2D vector space: $$\text{Portugal} - \text{Lisbon} + \text{Beijing} \approx \text{China}$$

Conclusion

In the next blog I'll look at some clever sampling based approaches that people have come up with to tackle the problem of having to normalise over the entire vocabulary $V$ in order to efficiently train the model. Then we can mess around with some text classification tasks, by using our word embeddings in our favourite ML techniques! The hope is that once we've played around with some toy problems to get a strong handle of the implementations of word2vec, then I'll introduce Recurrent Neural Networks and then we can get into the nitty gritty of some super interesting deep learning architectures :)

Journey into Machine Learning

Friday 31 March 2017

word2vec and word embeddings