Wednesday 4 January 2017

Feedforward Artificial Neural Network pt3: Matrices!

So far we've looked at the building blocks of an ANN - the different layers, weights, activation functions and error functions. We now have a decent understanding of what these objects are and how they relate, but so far we've only looked at relatively small parts of the network in isolation, for individual training samples. In this post we'll implement the aforementioned ideas using matrices, which will be very helpful when we build our full network and have hundreds (or thousands) of training samples. This is a very natural thing to do - if we can express the propagation of our training samples through the network as matrix multiplication, then the computer can do the work of optimising the calculations. The alternative is the old fashioned way of using for loops to iterate over each neuron and each training example in the network, which is markedly slower and less efficient.
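
As a quick (hypothetical) illustration of the speed difference, the sketch below compares a per-sample loop against a single matrix multiplication in NumPy; the array names and sizes are arbitrary and the exact timings will vary by machine.

    import time
    import numpy as np

    X = np.random.randn(10000, 100)   # 10,000 "training samples", 100 features each
    W = np.random.randn(100, 50)      # a single layer of weights

    # Old fashioned way: loop over the samples one at a time
    start = time.time()
    out_loop = np.array([x @ W for x in X])
    loop_time = time.time() - start

    # Matrix way: one multiplication handles every sample at once
    start = time.time()
    out_mat = X @ W
    mat_time = time.time() - start

    print("loop: %.4fs, matrix: %.4fs" % (loop_time, mat_time))
    print("results match:", np.allclose(out_loop, out_mat))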

Sample ANN - using matrices

We'll use the below network to demonstrate the power of the matrix implementation. Note that we will do this step by step, so it can be rather laborious/repetitive, but it's important to understand how each element of the matrix relates to the inputs and how it is propagated through the network.
The network has the following features (a short code sketch of the corresponding parameter shapes follows the list):
  • 2 inputs, $x_1$, $x_2$
  • 1 Hidden Layer with 3 neurons
  • 2 outputs, $\hat{y}_1$, $\hat{y}_2$
  • Activation function: $\sigma(x) = \tanh(x)$
  • Softmax output
  • Loss function: Cross entropy loss $$L = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \ln{\hat{y}_{n,i}}$$
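
To make the dimensions concrete before diving into the algebra, here is a minimal sketch of parameter matrices with the shapes this architecture implies. The variable names are my own choices, not taken from any particular library.

    import numpy as np

    n_input, n_hidden, n_output = 2, 3, 2     # 2 inputs, 3 hidden neurons, 2 outputs

    # Shapes match the matrices defined below:
    #   W^(1): 2 x 3,  b^(1): a single row of 3 biases (one per hidden neuron)
    #   W^(2): 3 x 2,  b^(2): a single row of 2 biases (one per output neuron)
    W1 = np.random.randn(n_input, n_hidden)
    b1 = np.zeros((1, n_hidden))
    W2 = np.random.randn(n_hidden, n_output)
    b2 = np.zeros((1, n_output))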

Inputs

We define our input matrix based on two training samples: $$x = \left[ \begin{array}{cc} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \end{array} \right]$$ 
where the matrix is defined such that $x_{ij}$ is the $j^{th}$ input of the $i^{th}$ training sample.

Weights

Now we define a matrix $W^{(1)}$ of weights for the neurons in the Hidden Layer;
$$W^{(1)} = \left[ \begin{array}{ccc} w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)} \end{array} \right]$$ where $w_{ij}^{(1)}$ is the weight for the $i^{th}$ input in the $j^{th}$ neuron in the hidden layer. Thus each column of $W^{(1)}$ contains the weights that a single hidden neuron applies to each of the inputs.

Bias

We introduce the concept of bias in an activation function. The bias is a translation of the entire activation function and is implemented as follows $$\sigma(w \cdot x +b)$$ where $w$ is the weight, $x$ is the input into the particular neuron and $b$ is the bias. The bias adds flexibility to the activation function: by shifting the function we can achieve any output in its range for any given input.
For example, consider an activation function without bias $$\sigma(x) = \tanh(w \cdot x)$$ and say we want to achieve an output of $0.5$ when $x=0$. There is no such $w$ that will allow us to achieve this result, as $\tanh(0) = 0$. However if we introduce bias, $$\sigma(x) = \tanh(w \cdot x + b)$$ we can set $b = \tanh^{-1}(0.5) \approx 0.549$ to achieve the desired result. Now we introduce the bias matrix - the matrix of biases for the hidden layer $$b^{(1)} = \left[ \begin{array}{ccc} b_{11}^{(1)} & b_{12}^{(1)} & b_{13}^{(1)} \\ b_{21}^{(1)} & b_{22}^{(1)} & b_{23}^{(1)} \end{array} \right]$$ with $b^{(1)}_{ij} = b^{(1)}_{kj}$ for all $i, k$, i.e. $b^{(1)}$ has a single unique row; all other rows in the matrix are copies of it. This is just to ensure that the matrix addition in the forward pass has the right dimensions - the same biases are applied to every training sample.
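
In code we don't actually need to store the repeated rows: keeping $b^{(1)}$ as a single row and letting NumPy broadcasting repeat it across the samples gives the same result. A minimal sketch (the names are mine) of this, plus the $\tanh^{-1}(0.5)$ example above:

    import numpy as np

    # The bias needed to get an output of 0.5 when the weighted input is zero
    b = np.arctanh(0.5)                   # ~0.549
    print(np.tanh(0 + b))                 # ~0.5

    X = np.array([[0.1, 0.2],
                  [0.3, 0.4]])            # 2 training samples, 2 inputs each
    W1 = np.full((2, 3), 0.5)             # toy 2 x 3 weight matrix
    b1_row = np.array([[0.1, 0.2, 0.3]])  # the single unique bias row (1 x 3)

    # Broadcasting adds b1_row to every row of X @ W1, which is exactly what
    # the repeated-row matrix b^(1) in the text encodes explicitly.
    Z1 = X @ W1 + b1_row                  # shape 2 x 3
    print(Z1)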

Forward Propagation

We'll now work our way through the network, starting at the input, traversing through the hidden layer and arriving at the output.

Using the notation described in this previous blog post, we get our matrix $Z^{(1)}$ which contains the results of applying the weights and biases to our inputs. Thus $$Z^{(1)} = x W^{(1)}+b^{(1)}$$ is a $2 \times 3$ matrix with the following structure
$$Z^{(1)} = \left[ \begin{array}{ccc} x_{11}w_{11}^{(1)}+x_{12}w_{21}^{(1)}+b_{11}^{(1)} & \cdots & \cdots \\ x_{21}w_{11}^{(1)}+x_{22}w_{21}^{(1)}+b_{21}^{(1)} & \cdots & \cdots \end{array} \right]$$ where I've only included the first column of the matrix. In matrix index notation we have $$Z_{ij}^{(1)} = \sum_{k} x_{ik} W^{(1)}_{kj} + b_{ij}^{(1)}$$ We can interpret $Z_{ij}^{(1)}$ as the input that the $j^{th}$ hidden neuron receives from the $i^{th}$ training sample.

Once at the hidden layer, the activation function is applied (elementwise) to $Z^{(1)}$ as $$a^{(1)} = \left[ \begin{array}{ccc} a_{11}^{(1)} & a_{12}^{(1)} & a_{13}^{(1)} \\ a_{21}^{(1)} & a_{22}^{(1)} & a_{23}^{(1)} \end{array} \right] \equiv \tanh(Z^{(1)}) $$

We now propagate these values from the hidden layer to the output layer, applying the second set of weights and biases, resulting in a $2 \times 2$ matrix $Z^{(2)}$ defined as $$Z^{(2)} = a^{(1)}W^{(2)}+b^{(2)}$$ where $$W^{(2)} = \left[ \begin{array}{cc} w_{11}^{(2)} & w_{12}^{(2)} \\ w_{21}^{(2)} & w_{22}^{(2)} \\ w_{31}^{(2)} & w_{32}^{(2)} \end{array} \right]$$ and $$b^{(2)} = \left[ \begin{array}{cc} b_{11}^{(2)} & b_{12}^{(2)} \\ b_{21}^{(2)} & b_{22}^{(2)} \end{array} \right]$$ Similarly to $b^{(1)}$, this matrix has a single unique row, with the rest of the rows of the matrix being exact copies. Explicitly, $Z^{(2)}$ has the following entries
$$\left[ \begin{array}{cc} a_{11}^{(1)}w_{11}^{(2)}+a_{12}^{(1)}w_{21}^{(2)}+a_{13}^{(1)}w_{31}^{(2)}+b_{11}^{(2)} & a_{11}^{(1)}w_{12}^{(2)}+a_{12}^{(1)}w_{22}^{(2)}+a_{13}^{(1)}w_{32}^{(2)}+b_{12}^{(2)} \\ a_{21}^{(1)}w_{11}^{(2)}+a_{22}^{(1)}w_{21}^{(2)}+a_{23}^{(1)}w_{31}^{(2)}+b_{21}^{(2)} & a_{21}^{(1)}w_{12}^{(2)}+a_{22}^{(1)}w_{22}^{(2)}+a_{23}^{(1)}w_{32}^{(2)}+b_{22}^{(2)} \end{array} \right]$$ More compactly, we have $$Z_{ij}^{(2)} = \sum_{k} a^{(1)}_{ik} W^{(2)}_{kj} + b^{(2)}_{ij} $$ where $Z_{ij}^{(2)}$ can be considered the input to the $j^{th}$ output neuron from the $i^{th}$ training sample.

Finally, applying the softmax function in the output neurons we end up with $$a^{(2)} = \text{softmax}(Z^{(2)}) \equiv \left[ \begin{array}{cc} \hat{y}_{11} & \hat{y}_{12} \\ \hat{y}_{21} & \hat{y}_{22} \end{array} \right]$$ where $a_{ij}^{(2)}$ is the output of the $j^{th}$ output neuron for the $i^{th}$ training sample. Hopefully the explicit matrix multiplication helps illuminate how the input values are propagated through the network and how their values are affected by the various weights in the input/hidden/output layers. Next we'll take an explicit look at backpropagation and establish similar matrix results which will allow us to train the network in a relatively efficient manner.
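
Before moving on, here's a minimal NumPy sketch of the full forward pass just described. The function and variable names are my own; the softmax subtracts the row-wise maximum before exponentiating, a standard trick for numerical stability that doesn't change the result.

    import numpy as np

    def softmax(Z):
        # Softmax is invariant to subtracting a constant from each row,
        # so we subtract the row max to avoid overflow in exp().
        expZ = np.exp(Z - Z.max(axis=1, keepdims=True))
        return expZ / expZ.sum(axis=1, keepdims=True)

    def forward(X, W1, b1, W2, b2):
        Z1 = X @ W1 + b1       # N x 3: input received by each hidden neuron, per sample
        a1 = np.tanh(Z1)       # N x 3: hidden layer activations
        Z2 = a1 @ W2 + b2      # N x 2: input received by each output neuron, per sample
        a2 = softmax(Z2)       # N x 2: predicted probabilities y_hat
        return Z1, a1, Z2, a2

    # Toy run with the shapes used in this post (2 samples, 2 inputs, 3 hidden, 2 outputs)
    X = np.random.randn(2, 2)
    W1, b1 = np.random.randn(2, 3), np.zeros((1, 3))
    W2, b2 = np.random.randn(3, 2), np.zeros((1, 2))
    Z1, a1, Z2, y_hat = forward(X, W1, b1, W2, b2)
    print(y_hat, y_hat.sum(axis=1))   # each row of y_hat sums to 1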

Backpropagation

In this section we'll derive the backpropagation rules for the network. Make sure you understand matrix index notation as we'll use it here to succinctly write the results.
Recall our definition $$\delta_j \equiv \frac{\partial L}{\partial z_j} =  \sigma ' (z_j) \sum_{k \in path(j)} \delta_k w_{j \rightarrow k}$$ We'll slightly alter our notation to make it easier to track all of the indices floating around: $\delta^{(j)} \equiv \delta_j$. Starting at our output neurons and using the result from the previous blog post we have $$\delta^{(2)}_{ij} \equiv  \frac{\partial L}{\partial Z^{(2)}_{ij}} = \hat{y}_{ij} - y_{ij}$$ or in full matrix form
$$\delta^{(2)} = \left[ \begin{array}{cc} \hat{y}_{11}-y_{11} & \hat{y}_{12}-y_{12} \\ \hat{y}_{21}-y_{21} & \hat{y}_{22}-y_{22} \end{array} \right]$$ Now $$\delta^{(1)}_{ij} \equiv  \frac{\partial L}{\partial Z^{(1)}_{ij}}$$ using the chain rule we can decompose this into $$ \frac{\partial L}{\partial Z^{(1)}_{ij}} = \sum_{m,n,p,q} \frac{\partial L}{\partial{Z^{(2)}_{mn}}} \times \frac{\partial{Z^{(2)}_{mn}}}{\partial{a^{(1)}_{pq}}}\times \frac{\partial{a^{(1)}_{pq}}}{\partial {Z^{(1)}_{ij}}}$$ The first term is simply $\delta^{(2)}_{mn}$. For the second term, recall from above $$Z^{(2)}_{mn} = \sum_{k} a^{(1)}_{mk} W^{(2)}_{kn} + b^{(2)}_{mn} $$ $$\implies \frac{\partial Z^{(2)}_{mn}}{\partial a^{(1)}_{pq}} = W^{(2)}_{qn}$$ where the sum over $k$ has been collapsed since the derivative is only non-zero when $m=p$ and $k=q$. Now, the third term $$\frac{\partial a^{(1)}_{pq}}{\partial Z^{(1)}_{ij}} = \frac{\partial}{\partial Z^{(1)}_{ij}} \left( \tanh(Z^{(1)}_{pq}) \right) = 1 - \tanh^{2}(Z^{(1)}_{ij})$$ where the above expression is only non-zero when $p=i$ and $q=j$. Putting this all together we have $$\frac{\partial L}{\partial Z^{(1)}_{ij}} = \left( 1 - \tanh^{2}(Z^{(1)}_{ij}) \right)  \sum_{n} \delta^{(2)}_{in} W^{(2)}_{jn} $$ Note the sum is now only over $n$, instead of $m, n, p, q$ - the conditions above force $m=p=i$ and $q=j$, which collapses the other sums. We can now write this in full matrix form (as we will use in the implementation) as
$$\delta^{(1)} = \left( 1 - \tanh^{2}(Z^{(1)}) \right) \odot \left( \delta^{(2)} {W^{(2)}}^T \right)$$ where $\odot$ is the elementwise multiplication of the matrices (Hadamard product) and $T$ indicates the transpose of the matrix.
Now we can use the above results for $\delta^{(1)}$ and $\delta^{(2)}$ to calculate how the loss function changes with respect to the weights - recall this is the value we use to alter our weights at each iteration in our gradient descent routine. Now $$\frac{\partial L}{\partial W^{(2)}_{ij}} = \sum_{p,q} \frac{\partial L}{\partial Z^{(2)}_{pq}} \times \frac{\partial Z^{(2)}_{pq}}{\partial W^{(2)}_{ij}}$$ The first term is simply $\delta^{(2)}_{pq}$ and the second term is $$\frac{\partial}{\partial W^{(2)}_{ij}} \left[ \sum_{k} a^{(1)}_{pk} W^{(2)}_{kq} + b^{(2)}_{pq} \right]$$ which is non-zero only when $k=i$ and $q=j$, hence the sum collapses and the term evaluates to $a^{(1)}_{pi}$. Hence $$\frac{\partial L}{\partial W^{(2)}_{ij}} = \sum_{p} \delta^{(2)}_{pj} a^{(1)}_{pi}$$ or in matrix form, this is represented as $$ \frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$$ Similarly $$\frac{\partial L}{ \partial b^{(2)}_{ij}} = \sum_{p,q} \frac{\partial L}{\partial Z^{(2)}_{pq}} \times \frac{\partial Z^{(2)}_{pq}}{\partial b^{(2)}_{ij}} $$ where the second term forces $q=j$, resulting in $$\frac{\partial L}{ \partial b^{(2)}_{ij}} = \sum_p \delta^{(2)}_{pj}$$ i.e. a sum over the rows of $\delta^{(2)}$. Note the right hand side of this expression is independent of $i$, which implies that every row of $\frac{\partial L}{ \partial b^{(2)}}$ is the same. Similar analysis yields $$\frac{\partial L}{\partial W^{(1)}_{ij}} = \sum_{p} \delta^{(1)}_{pj} x_{pi} $$ and $$\frac{\partial L}{\partial b^{(1)}_{ij}} = \sum_p \delta^{(1)}_{pj}$$
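
Translated into NumPy, the four gradient expressions look roughly like the sketch below. This assumes the (hypothetical) quantities from the forward-pass sketch earlier, and `y` is the matrix of one-hot encoded true labels with the same shape as `y_hat`.

    import numpy as np

    def backward(X, y, Z1, a1, y_hat, W2):
        # delta^(2) = y_hat - y                                          (N x 2)
        delta2 = y_hat - y
        # delta^(1) = (1 - tanh^2(Z^(1))) elementwise-times (delta^(2) W^(2)^T)   (N x 3)
        delta1 = (1 - np.tanh(Z1) ** 2) * (delta2 @ W2.T)

        dW2 = a1.T @ delta2                      # dL/dW^(2), shape 3 x 2
        db2 = delta2.sum(axis=0, keepdims=True)  # dL/db^(2): sum over the rows (samples)
        dW1 = X.T @ delta1                       # dL/dW^(1), shape 2 x 3
        db1 = delta1.sum(axis=0, keepdims=True)  # dL/db^(1): sum over the rows (samples)
        return dW1, db1, dW2, db2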

Recap

Relax. Take a breath. The derivations are over (for now). I think going through all the derivations in detail at least once is paramount to understanding the inner workings of the neural network. Once you get your head around the notation, there actually isn't anything that fancy going on - just repeated applications of the chain rule. The key for me was attacking this problem via matrix index notation, which illustrated exactly how values are propagated (backwards and forwards) through the network via matrix multiplication.

Forward Propagation Summary

  • $Z^{(1)} = xW^{(1)} + b^{(1)}$
  • $a^{(1)} = \tanh(Z^{(1)})$
  • $Z^{(2)} = a^{(1)}W^{(2)} + b^{(2)}$
  • $\hat{y} \equiv a^{(2)} = \text{softmax}(Z^{(2)})$

Backpropagation Summary


  • $\delta^{(2)} = \hat{y} - y$
  • $\delta^{(1)} = (1-\tanh^{2}(Z^{(1)})) \odot \delta^{(2)}{W^{(2)}}^T$
  • $\frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$
  • $\frac{\partial L}{\partial b^{(2)}} = \sum_{rows} \delta^{(2)}$
  • $\frac{\partial L}{\partial W^{(1)}} = {x}^T \delta^{(1)}$
  • $\frac{\partial L}{\partial b^{(1)}} = \sum_{rows} \delta^{(1)}$
  • Note that $\frac{\partial L}{\partial W^{(i)}}$ and $\frac{\partial L}{\partial b^{(i)}}$ are not actually derivatives with respect to matrices; they are just labels for the matrices containing the derivatives with respect to the various weights and biases. The $\sum_{rows} \delta^{(i)}$ indicates that we sum over the rows of the matrix $\delta^{(i)}$, so that the matrices $\frac{\partial L}{\partial b^{(i)}}$ consist of a single unique row repeated $N$ times, where $N$ is the number of training examples. A single gradient descent update step using these matrices is sketched below.
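
To tie the two summaries together, here is a rough sketch of one batch gradient descent iteration. It reuses the hypothetical `forward` and `backward` helpers sketched earlier in this post, `lr` is a learning rate chosen arbitrarily, and the cross entropy loss is computed only to monitor progress.

    # Reuses forward() and backward() from the sketches above, plus toy X, y,
    # W1, b1, W2, b2 with the shapes described in this post.
    lr = 0.01   # learning rate (arbitrary choice for illustration)

    Z1, a1, Z2, y_hat = forward(X, W1, b1, W2, b2)
    dW1, db1, dW2, db2 = backward(X, y, Z1, a1, y_hat, W2)

    # Cross entropy loss, L = -(1/N) * sum_n sum_i y_ni * ln(y_hat_ni)
    loss = -np.mean(np.sum(y * np.log(y_hat), axis=1))

    # Step each parameter a little way down its gradient.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2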


With these matrices we are now in a position to finally implement our ANN to classify our dataset! The next blog will detail the implementation in Python and the associated results.
