[PYTHON] I applied LightFM to Movielens

While researching libraries related to Factorization Machines, I came across a library called LightFM, so I tried it out.

Eventually I would like to apply it to my own dataset, but this time I will apply it to Movielens (movie recommendation dataset) to get used to using LightFM.

LightFM repository: lyst/lightfm on GitHub (a Python implementation of LightFM, a hybrid recommendation algorithm).

Difference between LightFM and FM

LightFM has "FM" in its name, but it is not a Factorization Machines library.

In the LightFM paper, the authors state that "LightFM is a special case of FM."

I was looking for a Python implementation of FM, so I was a little disappointed. Still, after reading about it, the model looked like it could solve the task I had in mind, so I decided to try it. (It also has plenty of tutorials and documentation.)

In ordinary FM, besides the user id and item id, various context features can be used as input. Each feature gets an embedding, and the output is the sum of the inner products over all pairs of features.

f(x)=w_0 + \sum_{i=1}^d w_i x_i + \sum_{i<j}\langle \boldsymbol v_i, \boldsymbol v_j \rangle x_i x_j

That is, it takes the inner product between the user embedding and the item embedding, the inner products between context features, and so on.
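To make the FM formula concrete, here is a minimal NumPy sketch. The function name `fm_score` and the toy input are made up for illustration, and the pairwise sum is taken over pairs i < j, which is the usual FM convention and an assumption here.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Second-order Factorization Machine score for one feature vector x.

    x: (d,) feature values, w0: global bias, w: (d,) linear weights,
    V: (d, k) matrix whose row i is the embedding v_i.
    """
    linear = w0 + w @ x
    # Pairwise term over i < j, written as explicit loops for readability.
    pairwise = 0.0
    d = len(x)
    for i in range(d):
        for j in range(i + 1, d):
            pairwise += (V[i] @ V[j]) * x[i] * x[j]
    return linear + pairwise

# Toy input (hypothetical): one-hot user (2 users), one-hot item (2 items),
# and a single context flag.
x = np.array([1.0, 0.0, 0.0, 1.0, 1.0])
rng = np.random.default_rng(0)
w0, w, V = 0.1, rng.normal(size=5), rng.normal(size=(5, 3))
score = fm_score(x, w0, w, V)
```

Note that the context feature interacts with both the user and the item here, which is exactly what LightFM does not do.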

LightFM can include some context features like FM does, but not arbitrary ones.

Every feature you add must be either a user feature or an item feature.

As for the inner products, user features only take inner products with item features. Inner products among user features (or among item features) are not taken.

f(i, u)=  \langle \boldsymbol p_i,  \boldsymbol q_u \rangle + b_i +b_u 

where

\boldsymbol p_i = \sum_{j \in f_i}\boldsymbol e_{j}^I
\boldsymbol q_u = \sum_{j \in f_u}\boldsymbol e_{j}^U
b_i = \sum_{j \in f_i}b_{j}^I
b_u = \sum_{j \in f_u} b_{j}^U

Here f_i and f_u are the sets of features of item i and user u.

If you do not pass any user or item features, the model reduces to Matrix Factorization.
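The reduction to Matrix Factorization can be checked with a small NumPy sketch. This is not LightFM's code; `lightfm_style_score` and all arrays are hypothetical stand-ins that follow the formulas above, assuming each user and item has only its own id as a feature (identity feature matrices).

```python
import numpy as np

def lightfm_style_score(user_feats, item_feats, E_U, E_I, b_U, b_I):
    """Score one (user, item) pair from feature indicator vectors.

    user_feats: (n_user_features,), item_feats: (n_item_features,)
    E_U, E_I: feature embedding matrices; b_U, b_I: feature biases.
    """
    q_u = user_feats @ E_U   # sum of the user's feature embeddings
    p_i = item_feats @ E_I   # sum of the item's feature embeddings
    b_u = user_feats @ b_U   # sum of the user's feature biases
    b_i = item_feats @ b_I   # sum of the item's feature biases
    return p_i @ q_u + b_i + b_u

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 3
E_U, E_I = rng.normal(size=(n_users, k)), rng.normal(size=(n_items, k))
b_U, b_I = rng.normal(size=n_users), rng.normal(size=n_items)

# Identity feature matrices: the only feature of each user/item is its own id.
user_features = np.eye(n_users)
item_features = np.eye(n_items)

u, i = 2, 3
score = lightfm_style_score(user_features[u], item_features[i], E_U, E_I, b_U, b_I)
# Plain Matrix Factorization with biases gives the same number.
mf_score = E_I[i] @ E_U[u] + b_I[i] + b_U[u]
```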

The losses supported by LightFM are BPR, WARP, k-OS WARP (warp-kos), and logistic loss. The first three are ranking losses and are suitable for learning to rank from implicit feedback.

I wrote an article about BPR here, so please take a look if you are interested.

[Paper introduction] BPR: Bayesian Personalized Ranking from Implicit Feedback (UAI 2009)
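As a rough illustration of how a ranking loss such as BPR differs from a pointwise logistic loss, here is a small sketch. These are simplified textbook forms, not LightFM's internal implementation, and the function names are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bpr_loss(score_pos, score_neg):
    """Pairwise loss: large when a positive item is scored below a negative one."""
    return -math.log(sigmoid(score_pos - score_neg))

def logistic_loss(score, label):
    """Pointwise loss for a single item; label is 1 (positive) or 0 (negative)."""
    p = sigmoid(score)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Shifting both scores by the same constant leaves the BPR loss unchanged,
# because only the difference (the ordering) matters.
a = bpr_loss(2.0, 1.0)
b = bpr_loss(12.0, 11.0)

# The pointwise loss does change when the score is shifted.
c = logistic_loss(2.0, 1)
d = logistic_loss(12.0, 1)
```

Because BPR only looks at score differences, it optimizes the ordering of items directly, which is what ranking metrics measure.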

Let's run it

Get Movielens dataset

Movielens is a movie recommendation dataset that is often used in experiments in this area. This time I use a relatively small version.

LightFM kindly provides a class to load Movielens, so let's use it.

(LightFM can be installed with pip or conda; details omitted.)

from lightfm.datasets import fetch_movielens
data = fetch_movielens()

data.keys()

This returns:

dict_keys(["train", "test",  "item_features", "item_feature_labels", "item_labels"])

train and test are rating matrices of shape (n_user, n_item). Entries with no rating are 0.

I expected item_features to be a 0/1 matrix of shape (n_item, n_features) indicating which features each item has, but looking at the contents, it is a square matrix, and in fact an identity matrix.
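One way to confirm such an observation is a small helper like the following. `is_identity` is a hypothetical helper, not part of LightFM, and the matrices below are stand-ins; in the setting of this article you would pass data["item_features"] instead.

```python
import numpy as np

def is_identity(mat):
    """Return True if mat is a square identity matrix.

    Accepts a dense array or anything with a .toarray() method
    (for example a SciPy sparse matrix).
    """
    dense = mat.toarray() if hasattr(mat, "toarray") else np.asarray(mat)
    if dense.ndim != 2 or dense.shape[0] != dense.shape[1]:
        return False
    return bool(np.array_equal(dense, np.eye(dense.shape[0])))

# Stand-in matrices for illustration.
id_only = np.eye(4)                                         # one id feature per item
with_genre = np.hstack([np.eye(4), [[1], [0], [1], [0]]])   # ids plus one extra feature

print(is_identity(id_only), is_identity(with_genre))
```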

Also, the documentation says that item_feature_labels contains the name of each feature and item_labels contains the name of each item, as arrays of shape (n_feature,) and (n_item,) respectively. However, both of them were

array(['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)', ...,
       'Sliding Doors (1998)', 'You So Crazy (1994)',
       'Scream of Stone (Schrei aus Stein) (1991)'], dtype=object)

so all entries are item names.

Perhaps the intent is for this tutorial to run without item features (that is, as plain Matrix Factorization).

Training part

Here is the part that actually trains the model.

from lightfm import LightFM
from lightfm.evaluation import precision_at_k, auc_score

train = data["train"]
test = data["test"]

# Train a model with 10-dimensional embeddings using the BPR loss
model = LightFM(no_components=10, learning_rate=0.05, loss='bpr')
model.fit(train, epochs=10)

# Mean precision@10 over users
train_precision = precision_at_k(model, train, k=10).mean()
test_precision = precision_at_k(model, test, k=10).mean()

# Mean AUC over users
train_auc = auc_score(model, train).mean()
test_auc = auc_score(model, test).mean()

print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))

You can specify the embedding dimension with no_components. This time it is 10.
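For intuition about the two metrics printed above, here is a simplified per-user sketch in NumPy. It is not LightFM's implementation (which works on sparse interaction matrices for all users), and the function names and toy scores are made up for illustration.

```python
import numpy as np

def precision_at_k_user(scores, positives, k):
    """Fraction of the k highest-scored items that are in `positives`."""
    top_k = np.argsort(-scores)[:k]
    return len(set(top_k.tolist()) & set(positives)) / k

def auc_user(scores, positives):
    """Probability that a random positive item outscores a random negative one."""
    pos = np.zeros(len(scores), dtype=bool)
    pos[list(positives)] = True
    pos_scores, neg_scores = scores[pos], scores[~pos]
    # Compare every positive with every negative (ties count as half).
    diff = pos_scores[:, None] - neg_scores[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

scores = np.array([0.9, 0.1, 0.8, 0.3, 0.2])  # one user's scores for 5 items
positives = [0, 3]                             # items the user interacted with

p_at_2 = precision_at_k_user(scores, positives, k=2)  # top-2 is {0, 2}: 1 hit of 2
auc = auc_user(scores, positives)                      # 5 of 6 pairs ordered correctly
```

Averaging these per-user values over all users corresponds to the .mean() calls in the training code.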

When fitting the model and computing AUC, you can pass item_features like this:

model.fit(train, item_features=data["item_features"], epochs=10)
auc_score(model, train, item_features=data["item_features"]).mean()

However, this time item_features contains only the item ids, so the result should not change.

Without item_features
Precision: train 0.59, test 0.10.
AUC: train 0.90, test 0.86.

With item_features
Precision: train 0.59, test 0.10.
AUC: train 0.89, test 0.86.

That is what I expected, but the results differ slightly ... Is it just random variation?

Looking at the LightFM source, if item_features is not specified, an identity matrix of shape (n_item, n_item) is used as the item features, so both of the above should behave the same.

Getting the embeddings

The learned embeddings can be obtained as follows.

user_embedding=model.user_embeddings
item_embedding=model.item_embeddings

user_embedding is an array of shape (n_user_features, no_components), and item_embedding is an array of shape (n_item_features, no_components).

As I found when reading the source earlier, if no features are specified, you get one embedding per user id and per item id.
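As a sketch of what these arrays can be used for, the following ranks all items for one user from embeddings and biases. It assumes id-only features (no user or item features), and the random arrays are stand-ins for model.user_embeddings, model.item_embeddings, model.user_biases and model.item_biases; `top_n_items` is a hypothetical helper.

```python
import numpy as np

def top_n_items(user_id, user_emb, item_emb, user_bias, item_bias, n=3):
    """Rank all items for one user (id-only features assumed)."""
    # Inner product of the user's embedding with every item embedding, plus biases.
    scores = item_emb @ user_emb[user_id] + item_bias + user_bias[user_id]
    return np.argsort(-scores)[:n], scores

# Random stand-ins for the learned parameters: 6 users, 8 items, 10 components.
rng = np.random.default_rng(42)
user_emb, item_emb = rng.normal(size=(6, 10)), rng.normal(size=(8, 10))
user_bias, item_bias = rng.normal(size=6), rng.normal(size=8)

top, scores = top_n_items(0, user_emb, item_emb, user_bias, item_bias)
```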

predict

Finally, I stumbled a little on the predict part, so I will make a note of it.

Given a user id and an item id, predict calculates the score for that pair. As for how to pass the ids, first, you can pass multiple item ids for a single user.

model.predict(user_ids=0,item_ids=[1,3,4])

This calculates the scores of items 1, 3, and 4 for user 0. Building a ranking per user is a common situation, so this form is convenient.

There is one thing to watch out for when passing multiple user ids paired with multiple item ids. If you write

model.predict(user_ids=[4,3,1],item_ids=[1,3,4])

you get an AssertionError.

The correct way is:

import numpy as np
model.predict(user_ids=np.array([4,3,1]),item_ids=[1,3,4])

Reading the source, if user_ids is not a numpy array, it is repeated len(item_ids) times so that its length matches item_ids. When user_ids is an int (the one-user, many-items case above), this makes the lengths match. But when user_ids is given as a list, it is repeated as well, so the lengths no longer match. This is a trap ...

I think it would be better to decide whether to repeat based on whether len() can be taken of user_ids. Someone is bound to pass a list ...
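The behaviour described above can be reproduced with a small NumPy sketch. This mimics the described logic rather than quoting LightFM's source, and both function names are hypothetical; the second function shows one possible alternative that repeats only scalars.

```python
import numpy as np

def align_ids_as_described(user_ids, item_ids):
    """Mimic the behaviour described above (not LightFM's actual source)."""
    if not isinstance(user_ids, np.ndarray):
        # A list is not an ndarray, so it gets repeated as well.
        user_ids = np.repeat(np.int32(user_ids), len(item_ids))
    return user_ids, np.asarray(item_ids, dtype=np.int32)

def align_ids_by_scalar_check(user_ids, item_ids):
    """Alternative sketch: repeat only when user_ids is a scalar."""
    item_ids = np.asarray(item_ids, dtype=np.int32)
    if np.isscalar(user_ids):
        user_ids = np.repeat(np.int32(user_ids), len(item_ids))
    return np.asarray(user_ids, dtype=np.int32), item_ids

u1, i1 = align_ids_as_described(0, [1, 3, 4])             # lengths 3 and 3
u2, i2 = align_ids_as_described([4, 3, 1], [1, 3, 4])     # lengths 9 and 3: mismatch
u3, i3 = align_ids_by_scalar_check([4, 3, 1], [1, 3, 4])  # lengths 3 and 3

print(len(u1), len(u2), len(u3))
```

The length mismatch in the second call is what triggers the assertion on the lengths of the two id arrays.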

About the losses

BPR and WARP are implemented so that any positive value is treated as positive feedback. It seems you do not need to convert the ratings to 1 before passing them.

The documentation says that the logistic loss is for cases with both +1 and -1 feedback, but it also ran when I passed the rating matrix as is. (I could not quite work out what happens internally just by looking at the source.)

logistic


Precision: train 0.43, test 0.08.
AUC: train 0.87, test 0.84.

bpr (repeated from above)


Precision: train 0.59, test 0.10.
AUC: train 0.90, test 0.86.

However, you can see that the logistic loss scores lower on these ranking metrics, so it is less suitable for ranking.

In conclusion

With the above, I think I can now handle the basic operations.

I am glad that the tutorials and documentation are easy to understand and that multiple losses are supported.

I only touched on it briefly, but it is also convenient that evaluation metrics such as AUC and Precision@k are provided.

Next, I am going to apply LightFM to my own dataset!

Thank you for reading to the end.
