Derivative of the softmax function
I've dedicated a separate post for the derivation of the derivative of the cross entropy loss function with softmax as the activation function in the output layer, which I'll reference in the future. It's nothing groundbreaking but sometimes it's nice to work through some of the results which are often quoted without derivation.
The softmax function
The softmax function is often used as the activation function for neurons in the output layer of an ANN. We borrow the notation from a previous blog post, where $\hat{y}$ is the output of a particular neuron in the output layer and $$z_o = \sum_{j \in input(o)} z_j w_{j \rightarrow o}$$ i.e. we sum over all paths $j$ which are an input into our output neuron $o$.
It is defined as the following:
$$ \hat{y}_o = \frac{e^{z_o}}{\sum_k e^{z_k}}$$ where the sum is over the $k$ output neurons in the network. The softmax essentially takes an input $z_o \in \mathbb{R}$ and maps it to the interval $(0,1)$ such that the sum of all outputs is 1 - it gives us a probability distribution over our $k$ outputs (i.e. $k$ different classifications).
Cross entropy loss function
Recalling the definition of the cross entropy loss function from a previous blog post we have
$$L = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \ln{\hat{y}_{n,i}}$$
where $N$ is the number of training examples and $C$ is the number of classifications (i.e 2 for binary classification, etc...). Let us assume $N = 1$ without loss of generality, thus
$$L = - \sum_{o} y_{o} \ln{\hat{y}_{o}}$$ where the sum is over the output neurons in our network (given that we have 1 output neuron for each class, the sum over $i \in C$ and $o$ are equivalent). Note that $y_o$ is the actual classification of our data, i.e. $y_o \in \{0,1\}$ and that if $y_o=1$ then all other $y_j =0$ where $ i \neq o$ see a description of one hot encoding. Now our object of interest is $$ \frac{ \partial L}{\partial z_i} = \frac{\partial L}{\partial \hat{y}_o} \times \frac{\partial \hat{y}_o}{\partial z_i}$$ Let's focus on the second term $\frac{\partial \hat{y}_o}{\partial z_i}$ for now and split it into two cases.
- Case 1: $ i = o$
$$= \frac{e^{z_o}}{\sum_k e^{z_k}} \left ( 1 - \frac{e^{z_o}}{\sum_k e^{z_k}} \right) = \hat{y}_o (1 - \hat{y}_o)$$
- Case 2: $ i \neq o $
$$= -\hat{y}_o \hat{y}_i$$ Now $$\frac{\partial L}{\partial \hat{y}_o} = - \sum_o \frac{1}{\hat{y}_o} y_o$$ Therefore $$ \frac{\partial L}{\partial z_i} = -\sum_o \frac{1}{\hat{y}_o} y_o \times \frac{\partial \hat{y}_o}{\partial z_i}$$ Substituting in our above computations and splitting out the $i=o$ case from the sum we get
$$\frac{\partial L}{\partial z_i} = -\frac{1}{\hat{y}_i} y_i \times \hat{y}_i(1-\hat{y}_i) - \sum_{o \neq i} \frac{1}{\hat{y}_o} y_o \times (- \hat{y}_o \hat{y}_i)$$
$$ = -y_i(1-\hat{y}_i) + \sum_{o \neq i} y_o \hat{y}_i$$ $$= y_i \hat{y}_i -y_i + \sum_{o \neq i} y_o \hat{y}_i$$ Moving the first term inside the sum and summing over all $o$ we get
$$\frac{\partial L}{\partial z_i} = -y_i + \sum_{o} y_o \hat{y}_i$$ $$ = \hat{y}_i \sum_o y_o - y_i$$ But $\sum_o y_o$ is the sum over all the actual labels of our data which is one hot encoded (as mentioned above). Thus $y_o = 1$ for exactly one class and $0$ for all else - hence $\sum_o y_o = 1$. We finally arrive at a result we'll use in the future in constructing and implementing our ANN to classify the make moons dataset
$$ \frac{\partial L}{\partial z_i} = \hat{y}_i - y_i$$