User and Items rating matrix
Suppose we have a matrix $r \in \mathbb{R}^{m \times n}$ which each row corresponds to a single user (i.e. person) and each column corresponds to a single item (i.e. movie) such that $r_{ui}$ corresponds to the rating user $u$ gave rating $i$. Now this matrix will generally be constructed from all user and rating information that is available and it is quite obvious that not every user will have rated every item available. In fact when we use the MovieLens dataset we'll see that only about 6% of the matrix $r$ are valid entries - the rest are missing and are the ratings we would like to predict by using our recommender system once it is built.
The sparsity of the matrix $r$ poses an issue using the SVD approach covered previously, as SVD aims to reconstruct a matrix as is; it considers all entries in the matrix $r$ as valid in the reconstruction process. So we need to make a slight tweak to our approach to building the recommender system. We still want a latent variable representation, but we only want to use the 6% of ratings that are populated.
Latent space representation
In order to define the latent space representations, we make the following assumptions:
- Each user can be described by a $k$ dimensional latent vector $\textbf{x}_u = \left[ x_{u1} \dots x_{uk} \right] \in \mathbb{R}^k$
- Each item can be described by a $k$ dimensional latent vector $\textbf{y}_i = \left[ y_{i1} \dots y_{ik} \right]\in \mathbb{R}^k$
Hence we can define the loss function (with some regularisation) as trying to minimise the difference between the predicted and actual ratings a every user gave each item:
$$ L = \sum_{u,i \in S}\left( r_{ui} - \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}\right)^2 + \lambda_x \sum_{u} ||\textbf{x}_u||^2 + \lambda_y \sum_{i} ||\textbf{y}_i||^2$$ where $S$ is the set of valid ratings and $\lambda_x$, $\lambda_y$ are regularisation hyperparameters. We can rewrite a little more succinctly using matrix notation: $$ L = \left( \textbf{r}_u - \textbf{x}^T_uY^T \right)^2 + \lambda_x \sum_{u} ||\textbf{x}_u||^2 + \lambda_y \sum_{i} ||\textbf{y}_i||^2 $$ where $$\textbf{r}_u = \left[ r_{u_1} \dots r_{u_m} \right]$$ is the rating vector for user u's ratings of the m items and $$Y = \left[ \begin{array}{ccc} -& y^{T}_{1} & - \\ - & y^{T}_{2} & - \\ & \vdots & \\ - & y^{T}_{m} & - \end{array} \right] $$ is a composite vector of the $y_i$'s. Now this equation looks quite familiar (recall the definition of the OLS loss function...) hence if we fix one of the $\textbf{x}^T_u$ or $\textbf{y}_{i}$ then we have effectively reduced the cost function to that of an OLS problem. The procedure is simple, we fix $\textbf{y}_i$ and find the corresponding $\textbf{x}^T_u$ which minimises the loss function, this then feeds into the next iteration where we fix $\textbf{x}^T_u$ and find the $\textbf{y}_i$ which minimises the updated loss function and continue until convergence (or some other appropriate stopping criteria). This process is known as (for obvious reasons) as Alternating Least Squares and we'll derive the close form solution in the next section.
We can then minimize $L$ by differentiating with respect to the components of our user vector $x_u$ and setting to $0$. Using summation notation:
$$ \frac{\partial L}{\partial x_{ua}} = \sum_{i}^{m} -2 \left(r_{ui} - \sum_{j}^k x_{uj} Y_{ij} \right)Y_{ai} + 2 \lambda_x x_{ua}$$ Hence for a minimum and in matrix form $$ 0 = -\left( \textbf{r}_u - \textbf{x}^T_uY^T \right) Y + \lambda_x \textbf{x}_{u}$$ we can re-arrange to get $$ \textbf{x}_{u}^T = \textbf{r}_u Y \left( I \lambda_x + Y^TY\right)^{-1}$$ We can perform an analogous calculation for $\textbf{y}_i$ which yields $$ \textbf{y}_i^T = r_i X \left( I \lambda_y + X^TX \right)^{-1} $$ where $I$ is the $k \times k$ identity matrix. Hence we have derived the updates for our parameters required at each iteration - the next section looks at the results.
Next time we'll have a look at the python implementation and results to see how well our recommendation system works!
Alternating Least Squares
We can then minimize $L$ by differentiating with respect to the components of our user vector $x_u$ and setting to $0$. Using summation notation:
$$ \frac{\partial L}{\partial x_{ua}} = \sum_{i}^{m} -2 \left(r_{ui} - \sum_{j}^k x_{uj} Y_{ij} \right)Y_{ai} + 2 \lambda_x x_{ua}$$ Hence for a minimum and in matrix form $$ 0 = -\left( \textbf{r}_u - \textbf{x}^T_uY^T \right) Y + \lambda_x \textbf{x}_{u}$$ we can re-arrange to get $$ \textbf{x}_{u}^T = \textbf{r}_u Y \left( I \lambda_x + Y^TY\right)^{-1}$$ We can perform an analogous calculation for $\textbf{y}_i$ which yields $$ \textbf{y}_i^T = r_i X \left( I \lambda_y + X^TX \right)^{-1} $$ where $I$ is the $k \times k$ identity matrix. Hence we have derived the updates for our parameters required at each iteration - the next section looks at the results.
Next time we'll have a look at the python implementation and results to see how well our recommendation system works!
Results
See my github for the python implementation in a jupyter notebook. I have run the model with the following parameters:
- $k = 10$ latent features
- $\lambda_x = 0.01$ user regularisation
- $\lambda_y = 0.01$ item regularisation
- $100$ iterations
You can see the training and test error plot for each iteration below:
We can see that learning occurs very quickly and after approximately the fifth
Additional flexibility and SGD
So far we've only considered a linear model which does not allow for the idiosyncrasies of individual user rating behaviour i.e. a given user might tend to rate movies more highly in general, compared to another user. Concretely, consider a movie I don't particularly like - I might give it a 3/5 because I'm a nice guy, however someone else might be more punitive and assign the movie a 1/5. Even though we have the same sentiment toward the movie, our individual rating philosophies differ. Similarly we can address the fact that certain movies tend to be rated lower compared to another movie even though the average sentimentality towards each movie might be similar.
We can address this by normalising each movie/rating pair by refining our rating estimate:
$$\hat{r}_{ui} = \mu + b_i + b_u + \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}$$ where $b_i$ is the user bias, $b_u$ is the item bias and $\mu$ is the global bias. This global bias term just adds additional flexibility (another parameter = another degree of freedom) for our model. Hence our loss function is now defined as:
$$ L = \sum_{u,i \in S}\left( r_{ui} - \mu - b_i - b_u - \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}\right)^2 + \lambda_x \sum_{u} ||\textbf{x}_u||^2 + \lambda_y \sum_{i} ||\textbf{y}_i||^2 + \lambda_{bi} \sum_{i} ||b_i||^2 + \lambda_{bu} \sum_{u} ||b_u||^2 $$
where we have just substituted our new definition for $$\hat{r}_{ui}$$ and added regularisation terms for our $b_i$ and $b_u$ biases.
The form of this loss function is no longer comparable to OLS, due to our new bias terms. In order to minimise, we turn to our old friend SGD. There's enough maths on this page so I won't go through the gory details of the updates for each parameter - if you're interested you can see the details here [1].
Optimising this new loss function using SGD with the same parameters above:
We can see that the training error shrinks to zero whilst the test error starts increasing with the number of iterations - these are the classic signs of overfitting! Looking at the plot, cutting off training at around iteration $25$ seems like a good idea. Comparing this to our ALS results, the test error is now around $1$, which is markedly better than the $8$ we observed earlier - this model fits the data a lot better, as expected with the additional flexibility.
Ok we have fit our models, now what? The whole purpose of this exercise was to make recommendations! What we're going to do is use the cosine-similarity to compare each of our latent vectors - given an input vector representation of a movie, we'll find the most 'similar' in our latent space to provide the user with recommendations!
On the right below we have the recommendations based on the bond movie Independence Day provided by our SGD model and on the left the ALS model.
Which do you think is better? There is some common recommendations - The Rock, Twister and the recommendations seem reasonable for the most part, aside from a stray Toy Story. In reality you'd probably combine these recommendations in the final model and perhaps supplement with some content based filtering to get the 'best' recommendation.
Until next time...
$$\hat{r}_{ui} = \mu + b_i + b_u + \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}$$ where $b_i$ is the user bias, $b_u$ is the item bias and $\mu$ is the global bias. This global bias term just adds additional flexibility (another parameter = another degree of freedom) for our model. Hence our loss function is now defined as:
$$ L = \sum_{u,i \in S}\left( r_{ui} - \mu - b_i - b_u - \textbf{x}_{u}^{T} \cdot \textbf{y}_{i}\right)^2 + \lambda_x \sum_{u} ||\textbf{x}_u||^2 + \lambda_y \sum_{i} ||\textbf{y}_i||^2 + \lambda_{bi} \sum_{i} ||b_i||^2 + \lambda_{bu} \sum_{u} ||b_u||^2 $$
where we have just substituted our new definition for $$\hat{r}_{ui}$$ and added regularisation terms for our $b_i$ and $b_u$ biases.
The form of this loss function is no longer comparable to OLS, due to our new bias terms. In order to minimise, we turn to our old friend SGD. There's enough maths on this page so I won't go through the gory details of the updates for each parameter - if you're interested you can see the details here [1].
Optimising this new loss function using SGD with the same parameters above:
- $k = 10$ latent features
- $\lambda_x = 0.01$ user regularisation
- $\lambda_y = 0.01$ item regularisation
- $\lambda_bu = 0.01$ user bias regularisation
- $\lambda_bi = 0.01$ item bias regularisation
- $100$ iterations
- $\eta = 0.005$ learning rate
We can see that the training error shrinks to zero whilst the test error starts increasing with the number of iterations - these are the classic signs of overfitting! Looking at the plot, cutting off training at around iteration $25$ seems like a good idea. Comparing this to our ALS results, the test error is now around $1$, which is markedly better than the $8$ we observed earlier - this model fits the data a lot better, as expected with the additional flexibility.
Ok we have fit our models, now what? The whole purpose of this exercise was to make recommendations! What we're going to do is use the cosine-similarity to compare each of our latent vectors - given an input vector representation of a movie, we'll find the most 'similar' in our latent space to provide the user with recommendations!
On the right below we have the recommendations based on the bond movie Independence Day provided by our SGD model and on the left the ALS model.
Which do you think is better? There is some common recommendations - The Rock, Twister and the recommendations seem reasonable for the most part, aside from a stray Toy Story. In reality you'd probably combine these recommendations in the final model and perhaps supplement with some content based filtering to get the 'best' recommendation.
Until next time...