Thankfully, the lovely folks at Google make it ridiculously easy to do so. So instead of maxing out my poor little Macbook pro for weeks (or months :/) of training - we can utilise the already trained network and simply adjust the final layer for our use, more on this below.
Note that this post assumes familiarity of CNN architecture and its use in computer vision - see this great course at Stanford for a great introduction to CNNs - CS231n: Convolutional Neural Networks for Visual Recognition.
Background
The Inception v3 image recognition neural network came out of Google at the end of 2015. The focus of the architecture builds upon the original inception architecture of GoogLeNet [1] which was also designed to perform well even under strict constraints
on memory and computational budget [2].
Inception v1
The basis of this (at the time) new architecture was the so called inception module. It builds on the traditional convolutional layer which has a fixed height and width. Instead it applies and assortment of different size filters to the input, allowing the model to perform multi-level feature extraction on each input:Figure 1: Taken from Rethinking the inception architecture for computer vision[1] - the naive inception module.
As you can see above, there are $1 \times 1$, $3 \times 3$ and $5 \times 5$ convolutions as well as a $3 \times 3$ max pooling applied to the input. In theory, this is great - as we are able to exploit multiple level spatial correlations in the input, however using a large number of $5 \times 5$ filters can be computationally expensive.
This train of thought led to the next iteration of the inception module:
Figure 2: Taken from Rethinking the inception architecture for computer vision[1] - inception module with dimension reductions.
This architecture is essentially the same as the initial proposal except that each of the larger filters is preceded by a $1 \times 1$ convolution. These $1 \times 1$ convolutions act as a dimension reduction tool in the channel dimension, leaving the spatial dimension untouched. The idea is that the important cross channel information will be captured without explicitly keeping every channel. Once the channel dimension reduction has been performed, these results feed into the larger filters which will capture both spatial and channel correlations. The $1 \times 1$ filters also have ReLU activation functions which introduce additional non-linearities into the system.
Put simply in [1]:
One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.
Inception v2/3
A key realisation of the earlier iterations of Inception was that the improved performance compared to its peers was driven by dimensionality reduction using the $1 \times 1$ filters. This has a natural interpretation in computer vision; as the authors put it [2]:
In a vision network, it is expected that the outputs of near-by activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation and that this should result in similarly expressive local representations.
With reducing computational complexity in mind (hence the number of parameters in the network) the authors sought a way to keep the power of the larger filters, but reduce how expensive the operations involving them were. Thus they investigated replacing the larger filters with a multi-layer network that has the same input/output size but less parameters.
Figure 3: Replacing a $5 \times 5$ filter with a network of $3 \times 3$ and a $1 \times 1$ filter.
Thus the inception module now looks like
Figure 4: Inception module where each $5 \times 5$ filter has been replaced by two $3 \times 3$ filters.
This replacement results in a 28% saving in computational efficiency [2]. Inception v2 and v3 share the same inception module architecture, the different arises in differences in training - namely batch normalization of the fully connected layer of the auxiliary classifier.
In a vision network, it is expected that the outputs of near-by activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation and that this should result in similarly expressive local representations.
With reducing computational complexity in mind (hence the number of parameters in the network) the authors sought a way to keep the power of the larger filters, but reduce how expensive the operations involving them were. Thus they investigated replacing the larger filters with a multi-layer network that has the same input/output size but less parameters.
Figure 3: Replacing a $5 \times 5$ filter with a network of $3 \times 3$ and a $1 \times 1$ filter.
Thus the inception module now looks like
Figure 4: Inception module where each $5 \times 5$ filter has been replaced by two $3 \times 3$ filters.
This replacement results in a 28% saving in computational efficiency [2]. Inception v2 and v3 share the same inception module architecture, the different arises in differences in training - namely batch normalization of the fully connected layer of the auxiliary classifier.
Transfer Learning
Now that we have a (very) high level understanding of the architecture of Inception v3, let's talk about transfer learning.
Given how computationally expensive it is to train a CNN from scratch - a task which can take months - we can use the results of a pre-trained CNN to do the feature extraction for us. The idea is that the generic features of an image (edges, shapes, etc) which are captured in the early layers of the CNN are common to most images. That is, we can utilise a CNN that has been trained for a specific task on millions of images to do the feature extraction for our image recognition task.
We then strip the CNN of the final fully connected layer (which feeds to the softmax classifier) and replace it with a fully connected layer built for our own task and train that. There are several approaches to transfer learning which are dependent on the task, such as we can use the pre-trained CNN as a feature extractor as detailed above, or we can use it as a starting point and retrain the weights in the convolutional layers. The amount of training data you have for your task and the details of the task will generally dictate the approach. See [3] for a detailed study on transfer learning.
The pre-trained CNN we'll be utilising an Inception v3 network as described above, trained on the ImageNet 2012 dataset which contained 1.2million images and 1000 categories.
Implementation
Thankfully Google had the foresight to make it dead easy to perform the aforementioned transfer learning using Tensorflow.
For my training images, I utilised the work done in [4] which provided around 350 hotdog images and around 50 not hotdog (notdog) images. We then utilise the script retrain.py which we'll pass parameters to from the terminal.
The script will determine the number of classes based on the number of folders in the parent folder. You don't need to worry about specific names for the images, you just need to make sure the folders contained the correctly labeled images.
The rest of the parameters are fairly self explanatory and are explained on the tutorial. One interesting set of parameters to note are the random_crop and random_scale, these distort the training images as to preserve their meaning but add new training examples. For example a picture of a hotdog rotated 90 degrees is still a hotdog; instead of sourcing additional hotdog pictures we can just rotate an existing training example to teach the classifier something new.
For my training images, I utilised the work done in [4] which provided around 350 hotdog images and around 50 not hotdog (notdog) images. We then utilise the script retrain.py which we'll pass parameters to from the terminal.
The script will determine the number of classes based on the number of folders in the parent folder. You don't need to worry about specific names for the images, you just need to make sure the folders contained the correctly labeled images.
The rest of the parameters are fairly self explanatory and are explained on the tutorial. One interesting set of parameters to note are the random_crop and random_scale, these distort the training images as to preserve their meaning but add new training examples. For example a picture of a hotdog rotated 90 degrees is still a hotdog; instead of sourcing additional hotdog pictures we can just rotate an existing training example to teach the classifier something new.
Results
After running retrain.py with 500 training steps, I received the below training summary
99% Validation accuracy - not bad at all!
Let's have a look at a specific example and try the below hotdog
hotdogs (score = 0.99956)
random (score = 0.00044)
Great news, this image is in fact a hotdog and our CNN recognises it as such with a high degree of confidence. What about a notdog?
Great news, this image is in fact a hotdog and our CNN recognises it as such with a high degree of confidence. What about a notdog?
random (score = 0.92311)
hotdogs (score = 0.07689)
This is a sunflower not a hotdog! Yet the classifier is only rather confident rather than completely confident that it is a notdog. This can be attributed to the limited number of training examples that were used in training the final layer before classification.
What about if we try to trick it - with a HOTDOG CAR!?
What about if we try to trick it - with a HOTDOG CAR!?
random (score = 0.70404)
hotdogs (score = 0.29596)
Not too bad at all, obviously the shape is similar to that of a hot dog but the chassis, wheels and other car-like features offer enough difference for our CNN to make the distinction.
What's next?
I haven't decided if I'll build some CNNs or have a crack at an LSTM implementation next - both will be using Tensorflow but I've got some ideas I want to play around with first which will shape which direction I head in.
References
[1] Szegedy, C. et al. 2014. Going deeper with convolutions, https://arxiv.org/pdf/1409.4842.pdf
[2] Szegedy, C. et al. 2015. Rethinking the Inception Architecture for Computer Vision, https://arxiv.org/pdf/1512.00567.pdf.
[3] Donahue, J. et al. 2013. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, https://arxiv.org/abs/1310.1531.
[4] https://github.com/Fazelesswhite/Hotdog-Classification
[2] Szegedy, C. et al. 2015. Rethinking the Inception Architecture for Computer Vision, https://arxiv.org/pdf/1512.00567.pdf.
[3] Donahue, J. et al. 2013. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, https://arxiv.org/abs/1310.1531.
[4] https://github.com/Fazelesswhite/Hotdog-Classification