A comparison of matrix factorization against a neural recommender system.
In this post, we'll compare two anime recommender systems: one based on matrix factorization and one based on a neural network. We'll compare them on an anime review dataset containing 7.8 million reviews, on a scale of 1 to 10, from 73.5 thousand users across 11.2 thousand animes.
The source code for gathering our results can be found here.
We'll first go over our two models and then discuss how we will compare them.
Our first model uses matrix factorization. A particularly good explanation of matrix factorization can be found in this series (the series rarely uses the term "matrix factorization", but the method it teaches is a form of Funk matrix factorization). Our model is a Funk matrix factorization model. We applied L2 regularization to both the user representation vectors and the anime representation vectors, and we applied dropout to each representation vector prior to calculating their dot product (which serves as the predicted score the user would give the anime). Be sure to read the next section regarding our experiment results for further details about our choice to use dropout.
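To make this concrete, here's a minimal sketch of such a model. We use PyTorch for all code sketches in this post; the class name, dimensions, and the separate L2 penalty helper are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class FunkMatrixFactorization(nn.Module):
    """Sketch of a Funk matrix factorization model with dropout."""

    def __init__(self, num_users: int, num_animes: int,
                 embedding_dim: int = 64, dropout_probability: float = 0.1):
        super().__init__()
        self.user_vectors = nn.Embedding(num_users, embedding_dim)
        self.anime_vectors = nn.Embedding(num_animes, embedding_dim)
        self.dropout = nn.Dropout(dropout_probability)

    def forward(self, user_ids: torch.Tensor, anime_ids: torch.Tensor) -> torch.Tensor:
        # Dropout is applied to each representation vector before the dot product.
        users = self.dropout(self.user_vectors(user_ids))
        animes = self.dropout(self.anime_vectors(anime_ids))
        # The dot product is the predicted score.
        return (users * animes).sum(dim=-1)

    def l2_penalty(self, user_ids: torch.Tensor, anime_ids: torch.Tensor,
                   coefficient: float) -> torch.Tensor:
        # L2 regularization on both the user and anime representation vectors;
        # this term gets added to the loss during training.
        users = self.user_vectors(user_ids)
        animes = self.anime_vectors(anime_ids)
        return coefficient * (users.pow(2).sum() + animes.pow(2).sum())
```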
Below is a depiction of our neural recommender system:
In our neural model, we first embed every user and anime. We then concatenate the two embeddings and pass the concatenation through several dense linear layers with ReLU activations. We then take a linear combination of the dense layers' output by passing it through a final fully connected layer, and the result is squashed through a sigmoid and scaled by 10 to predict the user's score for the anime.
Our neural model used L2 regularization on the user and anime embeddings. Dropout was applied to all embedding and dense layers.
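A sketch of the neural model in the same style (again, the layer count and sizes are placeholders, not our tuned values):

```python
import torch
import torch.nn as nn

class NeuralRecommender(nn.Module):
    """Sketch of the neural recommender described above."""

    def __init__(self, num_users: int, num_animes: int,
                 embedding_dim: int = 64, hidden_dim: int = 256,
                 num_dense_layers: int = 3, dropout_probability: float = 0.1):
        super().__init__()
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.anime_embedding = nn.Embedding(num_animes, embedding_dim)
        self.embedding_dropout = nn.Dropout(dropout_probability)

        # Several dense linear layers with ReLU activations (and dropout).
        layers, in_dim = [], 2 * embedding_dim
        for _ in range(num_dense_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                       nn.Dropout(dropout_probability)]
            in_dim = hidden_dim
        self.dense = nn.Sequential(*layers)

        # Final fully connected layer producing a single value.
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, user_ids: torch.Tensor, anime_ids: torch.Tensor) -> torch.Tensor:
        users = self.embedding_dropout(self.user_embedding(user_ids))
        animes = self.embedding_dropout(self.anime_embedding(anime_ids))
        hidden = self.dense(torch.cat([users, animes], dim=-1))
        # Squash through a sigmoid and scale by 10 to get a score in (0, 10).
        return 10 * torch.sigmoid(self.output(hidden)).squeeze(-1)
```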
Note that the use of embedding layers in our neural model is very similar to matrix factorization since embedding layers are frequently implemented as matrices.
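A quick illustration of that point: an embedding lookup is just row selection from a learned matrix.

```python
import torch

embedding = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)
user_ids = torch.tensor([0, 2])

# Looking up embeddings for these ids...
looked_up = embedding(user_ids)

# ...is exactly selecting the corresponding rows of the weight matrix.
assert torch.equal(looked_up, embedding.weight[user_ids])
```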
The embeddings for the users and animes are not normalized in either model.
Note that both models are transductive. They cannot predict ratings for users or animes not present during training. This influences how we create our test set to evaluate our models.
We can't simply split our data by individual reviews, since the test set might then contain users that were never seen during training, and neither model would have any hope of returning reasonable estimates in that case.
Thus, we created our training/validation/testing split by taking each user's reviews and putting 65% of them in the training set, 15% in the validation set, and 20% in the testing set. Each split therefore contains some reviews from every user.
We discarded any users who had fewer than 100 reviews and any animes that had fewer than 100 reviews.
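A pandas sketch of this preprocessing (the column names and the order in which users and animes are filtered are illustrative assumptions):

```python
import pandas as pd

def split_reviews(reviews: pd.DataFrame, seed: int = 0):
    """Split each user's reviews 65/15/20 into train/validation/test sets."""
    # Discard users and animes with fewer than 100 reviews each.
    reviews = reviews[reviews.groupby("user_id")["user_id"].transform("size") >= 100]
    reviews = reviews[reviews.groupby("anime_id")["anime_id"].transform("size") >= 100]

    train_parts, validation_parts, test_parts = [], [], []
    for _, user_reviews in reviews.groupby("user_id"):
        shuffled = user_reviews.sample(frac=1, random_state=seed)
        train_end = int(0.65 * len(shuffled))
        validation_end = int(0.80 * len(shuffled))
        train_parts.append(shuffled.iloc[:train_end])
        validation_parts.append(shuffled.iloc[train_end:validation_end])
        test_parts.append(shuffled.iloc[validation_end:])

    # Every split contains some reviews from every remaining user.
    return pd.concat(train_parts), pd.concat(validation_parts), pd.concat(test_parts)
```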
We optimized both models using adaptive momentum with weight decay along with a linear learning rate schedule. We used mean squared error as our loss function. Be sure to read the next section regarding our results for further details about our optimizer choice.
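In PyTorch terms, the training setup looks roughly like the following. We read "adaptive momentum with weight decay" as AdamW; the learning rate, weight decay, epoch count, L2 coefficient, and `train_loader` are placeholders.

```python
import torch

# `FunkMatrixFactorization` is the sketch from earlier; the neural model trains the same way.
model = FunkMatrixFactorization(num_users=73_500, num_animes=11_200)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Linear learning rate schedule: decay the learning rate linearly to zero.
num_epochs = 20
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 - epoch / num_epochs)

loss_fn = torch.nn.MSELoss()  # mean squared error

for epoch in range(num_epochs):
    for user_ids, anime_ids, scores in train_loader:  # hypothetical DataLoader
        optimizer.zero_grad()
        predictions = model(user_ids, anime_ids)
        # Explicit L2 term on the representation vectors (coefficient is a placeholder).
        loss = loss_fn(predictions, scores) + model.l2_penalty(user_ids, anime_ids, 1e-4)
        loss.backward()
        optimizer.step()
    scheduler.step()
```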
We conducted hyperparameter search for both models using a tree-structured Parzen estimator (see the sketch below).
The hyperparameters for our matrix factorization model were:
The hyperparameters for our neural model were:
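As a sketch of how this kind of search can be run: Optuna's `TPESampler` implements a tree-structured Parzen estimator. The hyperparameter names and ranges below, as well as the `train_model` and `validation_mse` helpers, are hypothetical placeholders rather than our actual search space.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space; not our actual hyperparameter ranges.
    embedding_dim = trial.suggest_int("embedding_dim", 16, 256)
    dropout_probability = trial.suggest_float("dropout_probability", 0.0, 0.5)
    l2_coefficient = trial.suggest_float("l2_coefficient", 1e-6, 1e-1, log=True)
    learning_rate = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)

    model = train_model(embedding_dim, dropout_probability,
                        l2_coefficient, learning_rate)  # hypothetical helper
    return validation_mse(model)                        # hypothetical helper

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(),  # tree-structured Parzen estimator
)
study.optimize(objective, n_trials=100)
```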
Below are the results for the best models we've found so far from our hyperparameter search. The models are ranked by validation score.
Click on the model labels below to see the results.
Hover over the bars in the bar charts or points in the scatter plots for more details about the data shown.
Matrix Factorization Results:
[Interactive charts of the matrix factorization results, including User Count vs. MSE Loss histograms and User Mean MSE Loss vs. User Example Count scatter plots.]
Neural Model Results:
[Interactive charts of the neural model results, including User Count vs. MSE Loss histograms and User Mean MSE Loss vs. User Example Count scatter plots.]
Looking at our results, we see that both models perform very similarly with respect to validation and testing losses.
The first thing to notice is that our validation scores are consistently better than our testing scores. Since we selected models by validation score, this might be a sign of overfitting to the validation set. We're currently investigating this.
Early on, we suspected that our data processing might cause the models to perform better for users with many reviews than for users with few reviews. If we look at our User Mean MSE Loss vs. User Example Count scatter plots, we'll notice that the users with the highest mean loss tend to be the ones with lower example counts. Ideally, our models would not systematically incur higher loss on users with low example counts; if they did, we'd expect many users with low example counts to have high losses. Looking at the User Count vs. MSE Loss histograms for our models, however, most users have a low MSE loss, so the users with high losses appear to be outliers.
Looking at our matrix factorization models, we'll notice that the best-performing ones have a close-to-zero dropout probability. This makes sense, since dropout for generalized linear models is equivalent to adaptive L2 regularization, and adaptive regularization doesn't seem to have yielded much benefit over uniform L2 regularization on this dataset. If we repeat this experiment in the future or try matrix factorization on a different dataset, we'll reconsider whether dropout is needed when L2 regularization is already being applied.
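To sketch why that equivalence holds (this is the standard linear regression case; our matrix factorization model is bilinear, but the intuition carries over): if each input is dropped with probability $p$ and the survivors are rescaled by $1/(1-p)$, the dropout-averaged squared error of a linear model decomposes into the original loss plus a data-dependent L2 penalty:

$$\mathbb{E}_m\left[\left\lVert y - \tfrac{1}{1-p}\,(m \odot X)\,\beta \right\rVert^2\right] = \lVert y - X\beta \rVert^2 + \frac{p}{1-p}\sum_j \beta_j^2 \sum_i x_{ij}^2$$

Here $m$ is an elementwise Bernoulli$(1-p)$ mask. The second term is an L2 penalty whose per-weight strength adapts to the data through $\sum_i x_{ij}^2$, which is the sense in which dropout acts as adaptive L2 regularization.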
Our choice to use L2 regularization might have been redundant, or might even have adversely affected our gradient descent search: we optimized with adaptive momentum with weight decay, and L2 regularization is known to be less effective under Adam-style optimizers than decoupled weight decay. Given this, our choice of optimizer and loss function might be different in future experiments.
Our matrix factorization models and neural models achieve similar performance, so it might seem that the two architectures are interchangeable here. However, there's still work to be done (in particular, further exploration of optimizer and regularization choices), so despite our initial findings, further investigation is needed before we can make that claim conclusively.