Sunday 8 March 2020

Variational Autoencoders


What is an autoencoder?

An autoencoder is a neural network used to learn efficient (read: lower dimensional) encodings of input data in an unsupervised manner. This just means that we design the architecture of a neural network which takes an input of $k$ dimensions and train it to output that very input (i.e. the output layer has the same dimension $k$ as the input). That is, the loss that we try to minimise isn't based on the difference between the prediction and some label or value (as in the supervised case) but on the difference between the prediction and the input (also known as the reconstruction error); we are training the network to recreate the $k$ dimensional input vector as closely as possible. But why? I hear you ask.

Dense Representation

Study the architecture of the network below; you can see it is naturally broken up into two parts: an encoder and a decoder.

The encoder takes the aforementioned $k$ dimensional input vector and successively feeds it through hidden layers with dimension $< k$ (there are two in the image below). This results in the middle layer of the network, which is a key part of an autoencoder and its associated uses - more on that later.

The decoder takes the middle layer and feeds it through additional hidden layers - upscaling the dimensionality - to produce the output vector of dimension $k$.
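
As a concrete sketch of this encoder/decoder split (purely illustrative, not taken from any particular implementation), an autoencoder might look like the following in PyTorch; the layer widths, the 2 dimensional middle layer and the choice of library are all assumptions made for the example.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, k=784, latent_dim=2):
        super().__init__()
        # Encoder: successively reduces the dimensionality down to the middle layer.
        self.encoder = nn.Sequential(
            nn.Linear(k, 128),
            nn.ReLU(),
            nn.Linear(128, 32),
            nn.ReLU(),
            nn.Linear(32, latent_dim),
        )
        # Decoder: upscales the dense representation back up to k dimensions.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 128),
            nn.ReLU(),
            nn.Linear(128, k),
        )

    def forward(self, x):
        z = self.encoder(x)        # the dense / latent representation
        return self.decoder(z), z  # reconstruction and latent vector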


The idea is that in a sufficiently trained network which has minimal error between the input and output vectors, the middle layer will have captured the essence (i.e. the signal) of the input in a lower dimensional, or dense, representation vector, stripping the input of irrelevant noise. The dense representation is also called a latent representation, which generally lives in a latent vector space.

So what?
That's all nice and dandy but what can we actually use autoencoders for?
  • Outlier detection
    • By definition, outliers have some characteristics or features distinct from the rest of the dataset - they are in the vast minority. Thus we can feed a given dataset through an autoencoder - training it to minimise the reconstruction error between input and output (a code sketch of this follows the list). We then feed through a previously unseen observation; the autoencoder will then hopefully be able to identify this observation as either
      • similar to "typical" observations it was trained on; characterised by a small reconstruction error, or
      • an outlier; characterised by a relatively large reconstruction error.
  • De-noising inputs / dimensionality reduction of inputs
    • Autoencoders can be used to strip noise from inputs, creating a "de-noised" and lower dimensional version ready for use in machine learning pipelines. This can be seen as a form of regularisation.
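
Here is a minimal, hedged sketch of the outlier detection idea: score each observation by its reconstruction error under a trained autoencoder (the Autoencoder class sketched earlier) and flag observations whose error exceeds a threshold. The names x_train and x_new, and the use of the 99th percentile as the threshold, are illustrative assumptions.

import torch

def reconstruction_errors(model, x):
    # Per-observation squared reconstruction error for a batch x of shape (n, k).
    model.eval()
    with torch.no_grad():
        x_hat, _ = model(x)
        return ((x - x_hat) ** 2).sum(dim=1)

# Calibrate a threshold on the training data, then flag unseen observations.
train_errors = reconstruction_errors(model, x_train)
threshold = torch.quantile(train_errors, 0.99)  # arbitrary cut-off for illustration
is_outlier = reconstruction_errors(model, x_new) > threshold
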
Latent vector space
One natural curiosity that may arise is how we can characterise or better understand the latent vector space. I've borrowed the results below from [2], which show a 2 dimensional dense representation of MNIST learned by an autoencoder.

What we can see are distinct and disjoint regions for each of the digits [0-9] - this makes it easy for the decoder to perform its job. However, this causes problems if we ever want to sample from this space to generate realistic examples. Because each digit's latent vectors are clustered locally, the decoder would likely decode a vector taken from (-15, 5) into junk - it would not resemble any of the digits [0-9] - as it hasn't seen any training examples from this local region. Hence we are limited to sampling from local regions of the latent space that were covered by training data - and the generated samples will likely just replicate the training samples, which is not very intelligent generation! Even if we were to sample within a cluster but in a region that wasn't in the training data - for example (-22, -10) - there is no guarantee that the decoded image would make sense! If we started at the 1 indicated in the image and travelled along the vector towards the 7, intuitively we would expect the output of the decoder to continuously deform from a 1 to a 7; however, this is not the case with the autoencoder - no such guarantee exists - and it is more than likely that unintelligible outputs would be produced.

The disjoint and discontinuous nature of the representations created by an autoencoder makes it a poor candidate for generating realistic samples; if we want to do so, we must use a Variational Autoencoder.

Variational Autoencoder (VAE)

The distinguishing feature of a VAE compared to an autoencoder is that - by design - the latent space is continuous, in the sense that two representations which are close in the latent space will result in similar looking outputs when decoded. Close here can be with respect to any metric, but let's consider the Euclidean metric for intuitive understanding. This is achieved by borrowing some machinery from Bayesian analysis (hence the variational part of the name) and cleverly manipulating the loss function.

The loss function

In the regular autoencoder, for input-target pairs $(x_i, y_i)$ we define the loss function as the standard sum of squared errors

$$ L(y, \hat{y}) = \sum_{i=1}^{n} (y_i - \hat{y}_{i})^2 $$

where $y_i = x_i$ for an autoencoder and $\hat{y}_{i}$ is the attempted reconstruction of the input $x_i$ by the autoencoder. This loss function has a single job - to minimise the reconstruction error - and it places no constraints on how the latent vectors are represented or distributed in the latent space.
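
To make this concrete, a minimal training loop for the reconstruction-only objective might look like the sketch below, reusing the Autoencoder class from earlier; data_loader, the Adam optimiser and the learning rate are assumptions made for illustration.

import torch

model = Autoencoder(k=784, latent_dim=2)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss(reduction="sum")  # sum of squared errors

for x in data_loader:            # x: batch of flattened inputs, shape (n, k)
    x_hat, _ = model(x)
    loss = loss_fn(x_hat, x)     # the target is the input itself (y = x)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()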

The trick is to introduce some form of penalty into this loss function such that we coerce the latent vectors to be drawn from a continuous probability distribution, thus ensuring the latent space has the aforementioned desirable properties for generating samples. This is achieved by using the Kullback-Leibler (KL) divergence.
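
Schematically (my notation, not prescribed above), the penalised loss adds a KL term to the reconstruction error, where $q(z \mid x)$ is the distribution the encoder places over latent vectors and $p(z)$ is the prior distribution we choose:

$$ L_{VAE}(y, \hat{y}) = \sum_{i=1}^{n} (y_i - \hat{y}_{i})^2 + D_{KL}\big(q(z \mid x) \,\|\, p(z)\big) $$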

The KL divergence is a very important piece of machinery, often used in Bayesian analysis. It gives a (non-symmetric) measure of how similar or dissimilar two probability distributions are. We can use it to ensure our latent vectors are approximately drawn from a distribution of our choice - this is almost always a Gaussian distribution due to the tractability of the maths and the existence of closed form solutions [3].
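
For reference, when the encoder outputs a diagonal Gaussian $\mathcal{N}\big(\mu, \operatorname{diag}(\sigma^{2})\big)$ over a $d$ dimensional latent space and the prior is the standard normal $\mathcal{N}(0, I)$, the KL divergence has the well-known closed form

$$ D_{KL}\Big(\mathcal{N}\big(\mu, \operatorname{diag}(\sigma^{2})\big) \,\Big\|\, \mathcal{N}(0, I)\Big) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_{j}^{2} + \sigma_{j}^{2} - \log \sigma_{j}^{2} - 1 \right) $$

which is the penalty term added to the reconstruction error during training.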

ADDITIONAL MATERIAL TO FOLLOW.