Context Aware Recommender Systems

Introduction

I’ve had the opportunity to work on a Context Aware Recommender System for a major European Broadcaster during my Bachelor’s Thesis. Everyone has heard of Recommender Systems (RS): that’s the Netflix stuff. But how do they work? Well, read on and you’ll find out.

What is a Recommender System?

A Recommender System is a system that recommends items/products to users. These items can be anything from movies to books to music to food, as long as they are of interest to the user. There are two main types of RS: Content Based and Collaborative Filtering.
The first one recommends items similar to the ones you already liked, while the second one recommends items that people similar to you liked. How do we know if a user likes an item? We can ask them to rate it (explicit feedback), or we can infer it from their behavior (implicit feedback). If you have a set of users \(U\) and a set of items \(I\), you can write the rating given by a user \(u\) to an item \(i\) as \(r_{u,i}\). This rating can be implicit or explicit. By observing interactions between users and items and collecting ratings \(r_{u,i}\), we obtain data points of a function \(f\left(u,i\right) \rightarrow r\). The goal is to approximate the whole function \(f\) so that we can predict the rating of an item \(i\) for a user \(u\) who has not yet interacted with \(i\). When recommending for a user \(u\), the system generates a rating for all the items and then selects the top \(k\) items.
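To make this concrete, here is a minimal sketch of that predict-then-rank loop. The `predict_rating` function is a hypothetical stand-in for the learned approximation of \(f\):

```python
def recommend(user, items, predict_rating, k=10):
    """Predict a rating for every item and return the k best ones."""
    scored = [(item, predict_rating(user, item)) for item in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest rating first
    return [item for item, _ in scored[:k]]
```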

Content Based Filtering

The first type of RS that we’ll explore is called content based filtering. As the name suggests, it is based on the content: it recommends items similar to the ones you already liked. The idea is to create a mathematical representation of the items and then compute the similarity between them. We use a mathematical representation because it introduces a notion of distance/similarity. One such representation is a vector: each item \(i \in I\) is converted to a vector \(\vec{i}\). Before going further, we need to introduce how to compare them.
We can use the cosine similarity, which is derived from the scalar product of two vectors. If the vectors are similar, the angle \(\theta\) between them is small, and the cosine of a small angle is close to 1. If the vectors are dissimilar, the angle \(\theta\) is large, and the cosine of a large angle is close to 0 (or even negative). The cosine similarity is defined as follows:

\begin{align} \vec{a} \cdot \vec{b} &= |\vec{a}| \cdot |\vec{b}| \cdot \cos(\theta) \\ \cos(\theta) &= \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| \cdot |\vec{b}|} \end{align}
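
As a quick sanity check, here is this formula computed directly with PyTorch (a small sketch; in practice `torch.nn.functional.cosine_similarity` does the same job):

```python
import torch

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(torch.dot(a, b) / (torch.norm(a) * torch.norm(b)))

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([1.0, 2.0, 2.9])   # almost the same direction
print(cosine_similarity(a, b))      # close to 1
```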

Let’s vectorize

How do you transform an item such as a movie into a vector? Well, back in my grandmother’s day you would use something like TF-IDF on the description of the movie, but nowadays you would mostly use a neural network. These models create vectors with many dimensions, and the similarity between two vectors is computed using the cosine similarity. The vectors are created in such a way that similar items have similar vectors; this property is acquired during the training of the ML model. These vectors are called embeddings.
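For illustration, here is the grandmother-era TF-IDF route with scikit-learn (a sketch; the movie descriptions are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "A hobbit travels across Middle-earth to destroy a ring.",
    "A young wizard attends a school of magic.",
    "A detective hunts a killer in a rainy city.",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(descriptions)  # one sparse vector per movie
print(cosine_similarity(vectors))  # pairwise similarity matrix
```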

Let’s recommend

So we have a set of users \(U\) and a set of items \(I\). We can transform an item \(i \in I\) into a vector \(\vec{i}\), and we have a way to compute the similarity between two vectors. Therefore, we can now compute which items we should recommend to a user: we first look at which content he liked the most, and we recommend similar items.
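
As a sketch, assuming we already have an embedding matrix and the index of an item the user liked, the whole recommendation step boils down to a similarity ranking:

```python
import torch
import torch.nn.functional as F

def similar_items(embeddings: torch.Tensor, liked: int, k: int = 5):
    # Cosine similarity between the liked item and every item
    sims = F.cosine_similarity(embeddings, embeddings[liked].unsqueeze(0), dim=1)
    sims[liked] = -float("inf")  # don't recommend the item itself
    return torch.topk(sims, k).indices

embeddings = torch.randn(100, 64)  # e.g. 100 items, 64-dimensional embeddings
print(similar_items(embeddings, liked=0))
```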

One last problem is the well-known cold start problem: what if we don’t know what a user likes? Then we can’t recommend anything. We can solve this by asking the user to rate some items, or by recommending popular content at first to get a sense of his taste.

Collaborative Filtering

In this variant, we recommend items that similar users liked. We again start by collecting interactions between users and items. If we have \(n\) users and \(m\) items, we can create a matrix \(R_{n\times m}\) where each \(r_{u,i}\) is the rating of the user \(u\) for the item \(i\).

\begin{equation} R_{n\times m} = \begin{bmatrix} r_{11} & \cdots & & & \cdots & r_{1m} \\ \vdots & 1 & ? & 0 & 5 & \vdots \\ & 3 & 2 & 1 & ? & \\ & ? & ? & ? & 5 & \\ \vdots & 3 & ? & ? & 1 & \vdots \\ r_{n1} & \cdots & & & \cdots & r_{nm} \\ \end{bmatrix} \end{equation} As you can see in (3), the matrix is sparse: we don’t have all the ratings (missing ratings are denoted by ?), because most users haven’t consumed all the content.
We can do a cool math trick: matrix factorization. \begin{equation} R_{n \times m} \approx \hat{R}_{n\times m} = P^{\intercal}_{k\times n} \times Q_{k\times m} \end{equation}

We say that \(\hat{r}_{u,i}\) is the approximated rating in \(\hat{R}\). \(P\) and \(Q\) are defined as follows:

\begin{equation} P = \begin{bmatrix} | & & | \\ \vec{p}_1 & \cdots & \vec{p}_n \\ | & & | \\ \end{bmatrix} \quad , \quad Q = \begin{bmatrix} | & & | \\ \vec{q}_1 & \cdots & \vec{q}_m \\ | & & | \end{bmatrix} \end{equation} You can deduce from the shape of the matrices that \(P\) is composed of vectors representing users, because there are \(n\) vectors in it (these are embeddings representing the users). A user used to be represented by all the content he rated out of the \(m\) items, but now he is represented by a vector of size \(k\). This dimensionality reduction creates an embedding of the user. Similarly, \(Q\) is composed of vectors representing items. We are trying to find \(P\) and \(Q\) such that \(\hat{R}\) is close to \(R\) for the known \(r_{u,i}\). We have the following optimization problem: \begin{equation} \min_{P,Q} \sum_{r_{u,i}} (r_{u,i} - \hat{r}_{u,i})^2 \end{equation}

We add a regularization term to avoid overfitting; the new parameter \(\lambda\) controls the weight of the regularization term:
\begin{equation} \min_{P,Q} \sum_{r_{u,i}} (r_{u,i} - \hat{r}_{u,i})^2 + \lambda \left(\sum_u ||p_u||^2 + \sum_i ||q_i||^2\right) \end{equation}

If you don’t like optimizing for \(P\) and \(Q\) without seeing them in the formula, we can use the description of \(P\) and \(Q\) in equation (5) and rewrite (7) as:

\begin{equation} \min_{P,Q} \sum_{r_{u,i}} (r_{u,i} - \vec{p}^{\intercal}_{u} \cdot \vec{q}_i)^2 + \lambda \left(\sum_u ||p_u||^2 + \sum_i ||q_i||^2\right) \end{equation}

From (8) it is clear that we can compute \(\hat{r}_{u,i}\) as the scalar product of the two vectors: \(\vec{p}^{\intercal}_{u} \cdot \vec{q}_i\).
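
In code, predicting a single rating is then just one dot product. A tiny sketch, with random vectors standing in for trained \(\vec{p}_u\) and \(\vec{q}_i\):

```python
import torch

k = 30
p_u = torch.randn(k)  # user embedding, a column of P
q_i = torch.randn(k)  # item embedding, a column of Q
r_hat = torch.dot(p_u, q_i)  # predicted rating for the pair (u, i)
print(r_hat.item())
```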

Let’s get dirty

How about we give it a try? It doesn’t seem too hard. We can do the optimization with SciPy or some fancy PyTorch stuff. I took a look at SciPy, and it seems to need the partial derivatives of the function to optimize. I don’t want to compute those myself, so I’ll use PyTorch, which computes them automatically with its autograd module.

We can use PyTorch and Lightning to minimize the error. We start with the necessary imports.

```python
import torch
import torch.optim as optim
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
```

Then we can create our model using Lightning. The first thing to do is to initialize all the data that we need, i.e. the matrices \(P\) and \(Q\).

```python
class MatrixFactorizationModel(pl.LightningModule):
    def __init__(self, n, m, k):
        super().__init__()
        self.n, self.m, self.k = n, m, k
        # We initialize P and Q with random values from a normal distribution.
        # Note: we store P as (n, k) and Q as (m, k), i.e. the transposes of
        # the matrices in (5), which is more convenient in code.
        self.P = torch.nn.Parameter(torch.randn(n, k))
        self.Q = torch.nn.Parameter(torch.randn(m, k))
        self.loss_function = torch.nn.MSELoss()  # squared-error term of (8)
        self.loss_history = []  # Store training loss
```

Having \(P\) and \(Q\), we can compute \(\hat{R}\); this is done in the forward method.

```python
    def forward(self):
        # torch.mm is a plain matrix multiplication (no broadcasting):
        # (n, k) @ (k, m) -> (n, m), i.e. R_hat
        return torch.mm(self.P, self.Q.t())
```

The training step just computes the loss; Lightning takes care of calling it for each batch, computing the gradients, and updating the parameters using PyTorch.

```python
    def training_step(self, batch, batch_idx):
        R, = batch  # Unpack the single tensor from the batch list
        R = R.reshape(self.n, self.m)
        R_hat = self.forward()
        # We add the regularization term of (8) to avoid overfitting:
        # squared row norms of P and Q, with lambda set to 0.01 here.
        # for the dim=1, see posts/what-is-this-dim/
        reg = (torch.norm(self.P, dim=1) ** 2).sum() + (torch.norm(self.Q, dim=1) ** 2).sum()
        loss = self.loss_function(R, R_hat) + 0.01 * reg
        self.loss_history.append(loss.item())
        return loss
```

Now we only need to configure the optimizer:

```python
    def configure_optimizers(self):
        # The first argument is the list of parameters to optimize
        return optim.SGD([self.P, self.Q], lr=0.01)
```

And now we can simply use the model:

```python
n, m, k = 100, 50, 30
num_epochs = 5000

# Pick the best available device: CUDA GPU, Apple Silicon (MPS), or CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Assuming you have your ratings tensor R (random data here, for the demo)
R = torch.randn(n, m)

# Move R to the same device as the model parameters
R = R.to(device)

# Create a dataset containing a single matrix of size n*m
dataset = TensorDataset(R.reshape(1, n, m))
data_loader = DataLoader(dataset)

model = MatrixFactorizationModel(n, m, k).to(device)  # Move model to the same device

trainer = pl.Trainer(max_epochs=num_epochs)
trainer.fit(model, data_loader)

import matplotlib.pyplot as plt
plt.plot(model.loss_history)
plt.xlabel("training step")
plt.ylabel("loss")
plt.show()
```

And TADA, you should have a nice plot of the loss going down, showing that we found values for \(P\) and \(Q\) that minimize the loss function. That said, you should not use this code for real work: it is highly inefficient. Use a library built for recommender systems instead, such as Surprise or implicit.
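
For instance, here is roughly what the same matrix factorization looks like with Surprise, based on its quickstart (it downloads the MovieLens 100k dataset on first use):

```python
from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

data = Dataset.load_builtin("ml-100k")  # classic MovieLens ratings dataset
algo = SVD()  # matrix factorization, close to what we built by hand
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)
```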