Thursday 26 October 2017

Recommender Systems: Collaborative Filtering and Matrix Factorisation

What is a recommender system?

Think about any website with a community of users who purchase or interact with content (items) on that website. Based on users' interactions with those items, personalised recommendations can be tailored to each user's behaviour. More concretely, consider the purchase suggestions Amazon offers you, or the movies Netflix recommends based on your past viewing behaviour. The machinery behind these recommendations is the basis of recommender systems.

Companies are looking to personalise a user's experience in a way that resonates with them - to make the user experience feel unique and tailored to their needs. These days websites capture all sorts of attributes from a user's interactions - time spent on pages, scroll speed, viewing behaviour, purchasing behaviour, and so on. One may ask how we can use such detailed information to offer a better user experience; one answer is through the construction of a recommender system.

Now recommender systems are generally based on two types of user behaviour:
  1. Implicit feedback
    • Information gathered from user behaviour such as viewing, purchasing or click behaviour, from which user preferences can be inferred.
  2. Explicit feedback
    • Where the user has given an explicit rating to a movie or item (a 5 star rating for example).
There are generally two approaches to building a recommender system: content based filtering and collaborative filtering.

Content based filtering

The approach of content based filtering is as follows:
  • Each item is characterised by a series of properties or features. For example if we are using content based filtering to build a movie recommendation system then each movie could have attributes such as lead actor, director, genre, etc. 
  • A profile based on these attributes is built for each user, and hence new items can be recommended to the user via a similarity measure between the user's profile and a specific item's profile.
This approach requires the features of each item to be defined explicitly, and hence it is limited by the quality of the feature engineering. It also means that when new items with new characteristics are added, those new features must be computed for all of the existing items too, and the dimensionality of our feature space keeps growing. We will not focus on this approach in the remainder of the blog.
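Just to make the idea concrete before we move on, here is a minimal sketch of content based filtering, assuming we already have hand crafted binary feature vectors for a handful of movies and that a user's profile is simply the average of the feature vectors of the movies they liked (the movies, features and scoring choice below are purely illustrative).

```python
import numpy as np

# Toy item features - rows are movies, columns are hand crafted attributes:
# [action, comedy, drama, director_nolan, actor_dicaprio]
item_features = np.array([
    [1, 0, 0, 1, 1],   # Inception
    [0, 1, 0, 0, 0],   # a comedy
    [1, 0, 1, 1, 0],   # The Dark Knight
    [0, 0, 1, 0, 1],   # a drama
], dtype=float)

liked = [0]  # the user liked Inception
user_profile = item_features[liked].mean(axis=0)  # profile = average of liked items

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Score every unseen item against the user's profile and rank the items
scores = sorted(
    ((j, cosine(user_profile, item_features[j]))
     for j in range(len(item_features)) if j not in liked),
    key=lambda t: t[1], reverse=True)
print(scores)  # The Dark Knight ranks first - it shares the most attributes
```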


Collaborative Filtering

As its name suggests, collaborative filtering uses an aggregate (or collaborative) approach to suggest new items, based on the premise that users who share a common interest in certain items will also share a common interest in other items. Unlike content based filtering, this approach doesn't require hand crafted features for each item and hence can be more easily scaled to larger and even different domains. We will focus on the collaborative filtering approach in building our recommender system and will use the MovieLens dataset in our example [1].

The approach we take in building the collaborative filter borrows from a result in linear algebra, namely the Singular Value Decomposition (SVD) of a matrix. I will first define exactly what SVD is, and then I'll add some context on how it helps us create a recommender system.


Singular Value Decomposition (SVD)

Given a matrix $M \in \mathbb{R}^{m \times n}$ there exists a factorisation $ M = U \Sigma V^{*}$ where 
  • $U \in \mathbb{R}^{m \times m}$ is a unitary matrix (i.e. $UU^{*} = U^{*}U = I$)
  • $\Sigma \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix, with non-negative numbers (the singular values) on its diagonal
  • $V^{*} \in \mathbb{R}^{n \times n}$ is the conjugate transpose of a unitary matrix $V$
Note that since I've taken $\mathbb{R}$ as the field the matrix entries come from, the conjugate transpose is just the transpose (i.e. $V^{*} \equiv V^{T}$) and the unitary matrices are also known as orthogonal matrices.
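To make the definition tangible, here is a quick numerical check with NumPy (the matrix and its shape are arbitrary). Note that `numpy.linalg.svd` returns the singular values as a 1-D array, so we rebuild the rectangular $\Sigma$ explicitly to match the shapes above.

```python
import numpy as np

m, n = 5, 3
M = np.random.rand(m, n)

# Full SVD: U is (m, m), s holds the min(m, n) singular values, Vt is (n, n)
U, s, Vt = np.linalg.svd(M, full_matrices=True)
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

assert np.allclose(M, U @ Sigma @ Vt)        # exact reconstruction M = U Σ V^T
assert np.allclose(U.T @ U, np.eye(m))       # U is orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(n))     # V is orthogonal
```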

What this is telling us is that any matrix can be decomposed into the above structure. If you think about the eigenvalue decomposition (EVD) of a matrix, it only exists for certain square matrices, whereas the SVD exists for a matrix of any shape (square or not). There is a deep connection between the SVD and EVD of a matrix - the non-zero entries of $\Sigma$ are the square roots of the non-zero eigenvalues of $MM^{T}$ (equivalently, of $M^{T}M$). See section 3.2 of [2].

Yeah, yeah that's a cool result, but how does that help us with the recommender system? Well, the motivation is borrowed from Latent Semantic Analysis (LSA) - a topic in Natural Language Processing which aims to understand relationships between documents and the words they contain. In LSA, a term-document matrix is created in which each entry contains the number of occurrences of a particular word in a particular document. This term-document matrix is then decomposed via SVD, with the resulting factors representing documents and terms in a way that captures the pattern of term usage across documents. The top $k$ singular values are kept, such that the decomposition results in a representation of the original matrix in a lower dimensional space. The idea is that this lower dimensional representation captures the structure of the relationship between documents and terms, whilst filtering out the noise. In this lower dimensional 'latent' space one can simply find the similarity of documents or terms using the $U$ and $V$ matrices.

So in our case, the analogy of a term-document matrix is a user-item matrix in which each row of the matrix is a user, and each column is an item. For our example the items will be movies. Each entry $(i,j)$ corresponds to the rating the $i^{th}$ user gave to the $j^{th}$ movie. The SVD will result in a latent space in which user and item vectors reside and we can calculate their similarities.
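As a rough sketch of what that looks like in code, suppose the MovieLens ratings [1] have been loaded into a pandas DataFrame with columns userId, movieId and rating (the file name, the choice of $k$ and the use of scikit-learn's TruncatedSVD below are illustrative assumptions rather than the only way to do this).

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.read_csv("ratings.csv")  # hypothetical path to the MovieLens ratings file

# Build the user-item matrix: rows are users, columns are movies, entries are ratings.
# Filling missing entries with 0 is a crude simplification - handling unobserved
# ratings properly is one of the subtleties left for the next post.
R = ratings.pivot_table(index="userId", columns="movieId",
                        values="rating", fill_value=0)

k = 50  # number of singular values / latent dimensions to keep
svd = TruncatedSVD(n_components=k, random_state=0)
user_factors = svd.fit_transform(R)   # shape (n_users, k), roughly U Σ
item_factors = svd.components_.T      # shape (n_movies, k), roughly V

# Similarities between movies (or, analogously, users) computed in the latent space
item_similarities = cosine_similarity(item_factors)
print(item_similarities.shape)
```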

Interpretation of SVD

Geometric Interpretation

In the case of $M \in \mathbb{R}^{n \times n}$ - i.e. a linear map from a space to itself - there is a nice geometrical interpretation of the action of $M$:
  • $V^{*}$ is an orthogonal matrix (over $\mathbb{R}$) - which corresponds to a reflection or rotation of the input
  • $\Sigma$ is diagonal and hence is a scaling matrix
  • $U$ is also an orthogonal matrix - which is another reflection or rotation


More generally, consider $M \in \mathbb{R}^{m \times n}$, a linear map between two (possibly different) spaces. We can rewrite the SVD equation as $$ MV = U \Sigma $$ If we consider each of the columns $v_j$ of $V$ separately then it is apparent that $Mv_{j} = \sigma_j u_{j}$. Recalling that $U$ and $V$ are orthogonal matrices, so their columns form orthonormal bases, the action of $M$ is as follows: the orthonormal basis $\{v_1, \dots, v_n \}$ of the input space is mapped to a scaled set of orthonormal directions $\{\sigma_{1} u_{1}, \sigma_{2} u_{2}, \dots \}$ in the output space. In words, every matrix $M \in \mathbb{R}^{m \times n}$ can be interpreted as having the following action (a small numerical check of $Mv_j = \sigma_j u_j$ follows the list below):
  1. A first rotation in the input space
  2. A simple positive scaling that takes a vector in the input space to the output space
  3. And another rotation in the output space
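Here is the small numerical check promised above: for a random matrix, each column $v_j$ of $V$ really is sent to $\sigma_j u_j$ (the thin SVD is used purely for convenience).

```python
import numpy as np

m, n = 4, 3
M = np.random.rand(m, n)

# Thin SVD: U is (m, k), s has length k, Vt is (k, n), where k = min(m, n)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt.T

for j in range(len(s)):
    assert np.allclose(M @ V[:, j], s[j] * U[:, j])  # M v_j = σ_j u_j
```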

SVD as an approximation

Note that in the SVD decomposition there is an equality; the matrix $M$ is exactly reconstructed. In practice we don't actually want the exact matrix; instead we want a lower dimensional, rank-$k$ approximation, which will have less noise and (hopefully) will have captured the latent structure of the user/item relationships. It turns out that if we keep only the top $k$ singular values, the resulting matrix is the best rank-$k$ approximation of $M$, where the notion of 'best' is defined by the Frobenius norm. See Section 2.1 of [2] for details.

We can think of the rank-$k$ approximation via SVD of a real matrix $M$ as truncating the expansion of $M$ into a sum of rank 1 matrices. That is
$$ M \approx M_k = \sum_{j=1}^{k} \sigma_j u_j v_{j}^{*}$$
with $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_k > 0$. This is a kind of expansion of $M$ in terms of rank 1 matrices, with the first term capturing 'most' of $M$, followed by the second term and so on, each additional term adding to the accuracy of the approximation.
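To see this term-by-term improvement numerically, here is a small sketch that sums the first $k$ rank 1 terms and measures the Frobenius-norm error of the approximation (the matrix size and the values of $k$ are arbitrary).

```python
import numpy as np

M = np.random.rand(20, 10)
U, s, Vt = np.linalg.svd(M, full_matrices=False)

def rank_k_approx(k):
    # Sum of the first k rank 1 terms σ_j u_j v_j^T
    return sum(s[j] * np.outer(U[:, j], Vt[j, :]) for j in range(k))

for k in (1, 3, 5, 10):
    err = np.linalg.norm(M - rank_k_approx(k), ord="fro")
    print(f"k = {k:2d}, Frobenius error = {err:.4f}")  # error shrinks as k grows
```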

For a great exposition on SVD and some various angles on interpretation, check out [3] by Jeremy Kun which should help you gain some intuition.

Well, that wraps up a short summary of the techniques we'll use to build our recommender system! The next post will focus on some of the subtleties of the implementation and some tweaks which have been used to win competitions in the past!

References

[1] https://grouplens.org/datasets/movielens/
[2] http://www.math.ucla.edu/~dakuang/cse6040/lectures/6040_lecture15.pdf
[3] https://jeremykun.com/2016/04/18/singular-value-decomposition-part-1-perspectives-on-linear-algebra/
