Recommender systems address the following question: Given a set of users, items, and their past interactions, how can we recommend the best items to each user?
These past interactions are usually stored in a matrix called the user-item interaction matrix. It can contain movie ratings, video likes/dislikes, song play counts…
In our day-to-day life, we are constantly interacting with recommender systems: YouTube recommends new videos to us, Amazon shows us items of interest, Spotify proposes playlists…
Collaborative filtering methods are based solely on the interactions between users and items. Thus (unlike content-based methods) they do not process item features.
Pros/Cons:
Cold-start problem can be addressed in multiple ways:
Memory-based collaborative filtering usually applies kNN (k-nearest neighbors) search to the information in the user-item interaction matrix.
Pros/Cons: Same as k-NN algorithm in general:
Imagine we want to recommend something to a given user \(x\). These methods find the most similar users and recommend something those users enjoyed that \(x\) hasn’t consumed yet. The similarity is computed using some distance metric between each user’s row of the user-item interaction matrix, taking the un-interacted items into account.
Similarity between users is often evaluated using the centered cosine similarity (aka Pearson’s coefficient)
We can get an estimation of the rating a user \(x\) will give to item \(i\) as:
\begin{equation} r_{x,i} = \frac{\sum_{y \in \text{SIM}(x)} s_{xy} r_{yi}}{\sum_{y \in \text{SIM}(x)} s_{xy}} \end{equation}
Where:
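A minimal sketch of this user-user approach, assuming a hypothetical toy ratings matrix where \(0\) marks a missing interaction (the function names `centered_cosine` and `predict` are illustrative, not from any library):

```python
import numpy as np

def centered_cosine(a, b):
    """Pearson-style similarity: center each user's ratings on their own
    mean (over rated items only), then take cosine similarity on the
    items rated by both users."""
    mask = (a > 0) & (b > 0)                 # items rated by both users
    if not mask.any():
        return 0.0
    ac = a[mask] - a[a > 0].mean()
    bc = b[mask] - b[b > 0].mean()
    denom = np.linalg.norm(ac) * np.linalg.norm(bc)
    return float(ac @ bc / denom) if denom else 0.0

def predict(M, x, i, k=2):
    """Estimate user x's rating of item i as a similarity-weighted
    average over the k most similar users who rated i."""
    sims = [(centered_cosine(M[x], M[y]), y)
            for y in range(M.shape[0]) if y != x and M[y, i] > 0]
    top = [t for t in sorted(sims, reverse=True)[:k] if t[0] > 0]
    den = sum(s for s, _ in top)
    return sum(s * M[y, i] for s, y in top) / den if den else 0.0

# toy user-item matrix: rows = users, columns = items, 0 = unrated
M = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
print(round(predict(M, x=0, i=2), 2))  # → 1.0
```

User 0’s most similar neighbor (user 1) rated item 2 with a \(1\), so the estimate is low, matching the intuition that these two users share tastes.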
Item-item methods use the same idea but from the perspective of items instead of users. When recommending something, instead of looking at similar users, we can look at similar items to those already liked by the user and recommend these.
The estimated rating would then be:
\begin{equation} r_{x,i} = \frac{\sum_{j \in \text{SIM}(i)} s_{ij} r_{xj}}{\sum_{j \in \text{SIM}(i)} s_{ij}} \end{equation}
Where:
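The same sketch from the item perspective: similarities are now computed between columns of the toy matrix, and we average the ratings user \(x\) gave to the items most similar to \(i\) (again, hypothetical data and illustrative function names):

```python
import numpy as np

def item_sim(M, i, j):
    """Cosine similarity between two item columns, centered on each
    user's mean rating (only users who rated both items count)."""
    user_mean = np.nanmean(np.where(M > 0, M, np.nan), axis=1)
    mask = (M[:, i] > 0) & (M[:, j] > 0)
    if not mask.any():
        return 0.0
    a = M[mask, i] - user_mean[mask]
    b = M[mask, j] - user_mean[mask]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict_item_item(M, x, i, k=2):
    """Rating of item i by user x, from the k items most similar to i
    that x already rated (similarity-weighted average)."""
    sims = [(item_sim(M, i, j), j)
            for j in range(M.shape[1]) if j != i and M[x, j] > 0]
    top = [t for t in sorted(sims, reverse=True)[:k] if t[0] > 0]
    den = sum(s for s, _ in top)
    return sum(s * M[x, j] for s, j in top) / den if den else 0.0

M = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
print(round(predict_item_item(M, x=0, i=2), 2))  # → 1.0
```

On this toy matrix, the only item similar to item 2 that user 0 has rated is item 3 (rated \(1\)), so both perspectives agree on a low estimate here.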
Other ideas:
Depending on the nature of the service provided, one can exploit the data better and obtain item similarities by alternative means. For instance, Spotify users organize songs in playlists. These aggregations can be understood in different ways:
word2vec is essentially a shallow neural network that embeds words into a latent space according to their contextual similarity.
Comparison with user-user:
Pros/Cons:
The idea is to factorize the huge (and very sparse) user-item interaction matrix \(M \in \mathbb{R}^{n \times m}\) into two small and dense matrices:
\begin{equation} M = Q P^T \end{equation}
In essence, we learn a (linear) model to represent both users and items in a low-dimensional space (of \(l\) dimensions). The low-dimensional features of a user will match those of the items they enjoy.
A possible factorization would be using SVD (check out our dimensionality reduction post). However, SVD is very expensive (considering how big the user-item interaction matrix is) and does not account for the missing entries. Alternatively, we can see it as an optimization problem, where we look for the matrices which give the best reconstruction:
\begin{equation} P, Q = \arg \min_{P, Q} \sum_{(i, j) \in \text{RATINGS}} \left[ P_i Q_j^T - M_{ij} \right]^2 + \underbrace{\lambda_p \Vert P_i \Vert^2 + \lambda_q \Vert Q_j \Vert^2}_{\text{regularization}} \end{equation}
The regularization keeps the learned factors small, which pushes the predictions for unknown interactions toward \(0\).
We can then find \(P, Q\) by running gradient descent on this loss, with the advantage that we can train on batches of users and items independently.
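A minimal sketch of this optimization with plain per-sample gradient descent on the observed ratings (toy data; \(l = 2\) latent dimensions and the hyperparameter values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy observed ratings as (user, item, rating); everything else is unknown
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 1.0), (2, 2, 5.0), (3, 2, 4.0), (3, 0, 1.0)]
n_users, n_items, l = 4, 3, 2

P = rng.normal(scale=0.1, size=(n_users, l))   # user factors
Q = rng.normal(scale=0.1, size=(n_items, l))   # item factors
lr, lam = 0.05, 0.01                           # learning rate, regularization

for epoch in range(200):
    for u, i, r in ratings:
        err = P[u] @ Q[i] - r                  # reconstruction error
        # simultaneous gradient step on squared error + L2 regularization
        P[u], Q[i] = (P[u] - lr * (err * Q[i] + lam * P[u]),
                      Q[i] - lr * (err * P[u] + lam * Q[i]))

# predicted rating for a pair that was never observed
print(round(float(P[0] @ Q[2]), 2))
```

After training, `P[u] @ Q[i]` reconstructs the observed entries closely, and the same dot product fills in the missing ones.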
Once we have the matrices factorized we can either:
More generally, we can estimate the ratings with a fancier operation than the dot product, such as an ANN \(f_\phi\):
\begin{equation} P, Q, \phi = \arg \min_{P, Q, \phi} \sum_{(i, j) \in \text{RATINGS}} \left[ f_\phi (P_i, Q_j) - M_{ij} \right]^2 + \underbrace{\lambda_p \Vert P_i \Vert^2 + \lambda_q \Vert Q_j \Vert^2}_{\text{regularization}} \end{equation}
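As a sketch of what \(f_\phi\) could look like, here is the forward pass of a tiny hypothetical MLP that scores a (user, item) pair from their latent vectors; in practice \(P\), \(Q\), and \(\phi\) would all be trained jointly with the same gradient-descent loop as above (architecture, sizes, and names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
l, h = 2, 8                                  # latent dim, hidden width

# hypothetical MLP f_phi: concatenate the two embeddings, one hidden layer
W1 = rng.normal(scale=0.5, size=(2 * l, h))
b1 = np.zeros(h)
W2 = rng.normal(scale=0.5, size=(h, 1))

def f_phi(p_u, q_i):
    """Score a (user, item) pair from their latent vectors."""
    z = np.concatenate([p_u, q_i]) @ W1 + b1
    return float(np.maximum(z, 0) @ W2)      # ReLU hidden layer, linear out

p_u = rng.normal(size=l)                     # a user embedding (row of P)
q_i = rng.normal(size=l)                     # an item embedding (row of Q)
print(f_phi(p_u, q_i))                       # replaces the plain dot product
```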
Content-based methods process features of the items and/or users to find the best recommendations. In essence, they try to build a model that explains the observed user-item interactions. Once trained, we can run inference by inputting user features. The model will output a set of matching items.
Pros/Cons:
In a song recommender system:
Types:
If looking for simple models, we can use Naive Bayes or logistic regression for classification scenarios and linear regression for regression tasks. Alternatively, we can learn more complex models using ANNs.
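A user-centered sketch with the simplest of these models: logistic regression trained on one user's like/dislike history over item features (the features, data, and feature names here are invented for illustration):

```python
import numpy as np

# hypothetical song features for one user's history: [tempo, energy]
X = np.array([[0.9, 0.8], [0.8, 0.9], [0.7, 0.9],    # liked (upbeat)
              [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]])   # disliked (slow)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)

w, b = np.zeros(2), 0.0
for _ in range(500):                         # plain gradient descent
    p = 1 / (1 + np.exp(-(X @ w + b)))       # sigmoid probabilities
    grad = p - y                             # d(log-loss)/d(logits)
    w -= 0.5 * X.T @ grad / len(y)
    b -= 0.5 * grad.mean()

new_song = np.array([0.85, 0.9])             # an unseen upbeat song
p_like = float(1 / (1 + np.exp(-(new_song @ w + b))))
print(p_like > 0.5)                          # → True: predicted as a match
```

The trained model explains the observed interactions (this user likes upbeat songs) and generalizes that explanation to unseen items.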
While these item features can be hand-crafted by humans, it is interesting to study how they can be automatically extracted using ML techniques. For instance, Spotify can get characteristics of a song by analyzing its audio file and learning a model that relates songs by these signatures. This can be very helpful to solve the cold-start problem, as it can be trained in a supervised fashion on past songs whose similarity is already known from collaborative filtering techniques.
Item-centered methods answer the question: What is the probability that each user likes this item?, while user-centered methods answer the question: What is the probability that each item is liked by this user?.
It is also interesting to note that asking new users for information is usually much harder than obtaining item information. Thus, item features tend to be much richer than user features.
Collaborative filtering and content-based methods mainly belong to the supervised learning paradigm. SL presents some limitations to solve the recommendation problem:
Framing the problem as an interactive system where the recommender presents some content and users review this content might be better. However, it still presents some challenges:
RL problem setup:
How do you know if your model is any good? Notice that users will not like a model that proposes things too similar to what they already know; there also needs to be some exploration. This idea is captured by the serendipity metric: the diversity of the recommendations (how close the recommendations are to each other).
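One simple way to quantify this diversity is the average pairwise cosine distance between the recommended items' embeddings (the function name `intra_list_diversity` and the toy embeddings are illustrative assumptions):

```python
import numpy as np

def intra_list_diversity(embeddings):
    """Average pairwise cosine distance between recommended items:
    higher = more diverse (less similar) recommendations."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ E.T                           # pairwise cosine similarities
    n = len(E)
    off_diag = sims.sum() - n                # drop self-similarities (= 1)
    return 1 - off_diag / (n * (n - 1))     # mean pairwise cosine distance

# near-duplicate recommendations score low, spread-out ones score high
similar = np.array([[1.0, 0.0], [0.99, 0.1], [0.98, 0.2]])
diverse = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.1]])
print(intra_list_diversity(similar) < intra_list_diversity(diverse))  # → True
```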
Have you ever noticed that Netflix proposes new recommendations explaining why it thinks they are a good match? It has been shown that humans lose faith in the recommender system if they do not understand where the recommendations come from. Thus, we also desire models with high explainability.
Before unleashing our recommender system into the wild, it is interesting to run some of these metrics. The main idea is to split the dataset into train and test sets and then analyze:
When introducing changes into the model, A/B testing techniques are often used to compare its performance against some past version of the model and check whether there is any statistically significant improvement.