[PYTHON] Things to be aware of when building a recommender system using Item2Vec

I'm Kubota from NTT DoCoMo. This is my second post.

Do you know about a technique called Item2Vec? Item2Vec applies Word2Vec, which learns distributed representations of words from sentences, to recommender systems. Concretely, in the case of recommendation on an e-commerce site, the "words" of Word2Vec become the items and the "sentences" become the sets of items each user has evaluated; distributed representations of the items are learned, and recommendations are made based on the similarity between items.

Because it is easy to implement, there are plenty of "I tried it" articles around, but there are some points to be aware of when actually applying it to a recommender system.

Item2Vec implementation policy

There is a topic-modeling library called gensim that makes Item2Vec easy to implement. Prepare a text file (here, item_buskets.txt) in which each line is the set of items evaluated by one user, with the items separated by spaces; you can then train a model as in the following example. The parameters will be explained later. It's really easy!

from gensim.models import word2vec

# Each line of item_buskets.txt is one user's set of evaluated items, separated by spaces
sentences = word2vec.LineSentence('item_buskets.txt')
# Train Item2Vec with gensim's default hyperparameters
model = word2vec.Word2Vec(sentences)
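
As a minimal usage sketch, using the model trained above and a hypothetical item ID 'item_123', the most similar items, i.e. recommendation candidates, can be retrieved like this:

# Top 10 items most similar to a given item (hypothetical item ID)
similar_items = model.wv.most_similar('item_123', topn=10)
for item_id, similarity in similar_items:
    print(item_id, similarity)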

Points to be careful about when applying it to recommender systems

Item2Vec can be implemented easily using the topic-modeling library gensim. However, gensim was originally built with natural language processing in mind, so when applying it to recommender systems, whose problem setting is different, you need to adjust things to match that setting.

Difference between Word2Vec and Item2Vec

Given these differences, one would hypothesize that the optimal hyperparameters for Word2Vec and Item2Vec also differ.

A paper that tested this hypothesis is "Word2vec applied to Recommendation: Hyperparameters Matter", presented at RecSys 2018. The experimental settings and evaluation results below are quoted from that paper.

Experimental settings

The item-popularity distribution looks quite different depending on the dataset. In the 30Music dataset (data from last.fm) and the Deezer dataset (data from the music streaming service Deezer), there is a considerable gap between popular and unpopular songs. The Click-Stream dataset also shows a clear gap between popular and unpopular items. The E-commerce dataset, on the other hand, has a gentler popularity curve than the datasets above.

The evaluation metric is NDCG@K (each test case has a single correct item):

NDCG@K = \left\{
\begin{array}{ll}
\frac{1}{\log_{2} (j+1)} & (\text{if the } j^{th} \text{ predicted item is correct}) \\
0 & (\text{otherwise})
\end{array}
\right.
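
As a minimal sketch of this metric, assuming the task is predicting a single correct item from a ranked list (the function name and item IDs are hypothetical):

import math

# NDCG@K when there is exactly one correct item, following the formula above
def ndcg_at_k(predicted_items, correct_item, k=10):
    for j, item in enumerate(predicted_items[:k], start=1):
        if item == correct_item:
            return 1.0 / math.log2(j + 1)
    return 0.0

# Example: the correct item is ranked 2nd, so NDCG@10 = 1 / log2(3), about 0.63
print(ndcg_at_k(['item_a', 'item_b', 'item_c'], 'item_b'))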

Search parameters

In the paper shown above, the following parameters are searched and evaluated.

Parameters and the corresponding options in gensim's Word2Vec:
window size $L$: window
epochs $n$: iter
sub-sampling parameter $t$: sample
negative sampling distribution parameter $\alpha$: ns_exponent
embedding size: size
the number of negative samples: negative
learning rate: alpha, min_alpha

The ones you are probably not familiar with are $t$ and $\alpha$. The sub-sampling parameter $t$ controls the downsampling of high-frequency words. In natural language processing, high-frequency words such as "a" and "the" are downsampled because they carry little information compared with low-frequency words. In the recommender-system setting, however, popular items, which play the role of high-frequency words, should have a considerable effect on recommendation accuracy, so it is easy to see why this parameter is likely to have a large influence.
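
For reference, in the original Word2Vec paper the probability of discarding a word $w_i$ with relative frequency $f(w_i)$ is the following (gensim's implementation differs slightly in the details):

P(\text{discard} \ w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}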

Next, the negative sampling distribution parameter $\alpha$ changes the shape of the distribution from which negative samples are drawn. gensim's default is 0.75. With $\alpha = 1$, sampling follows the word (item) frequency; with $\alpha = 0$, sampling is uniform; and with negative values, infrequent items become easier to sample.

In the paper, the parameters shown in the table were investigated, but the parameters other than the four highlighted in bold there (window size $L$, number of epochs $n$, sub-sampling parameter $t$, and negative sampling distribution parameter $\alpha$) did not affect performance very much, so those four are evaluated in detail.
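
As a rough sketch of setting those four parameters with gensim (the values below are placeholders for illustration, not the paper's optima; the option names follow the older gensim API used in the table above, which gensim 4.x renamed to vector_size and epochs):

from gensim.models import word2vec

sentences = word2vec.LineSentence('item_buskets.txt')

# Placeholder values; the optimal settings depend on the dataset
model = word2vec.Word2Vec(
    sentences,
    window=5,          # window size L
    iter=50,           # number of epochs n ("epochs" in gensim 4.x)
    sample=1e-4,       # sub-sampling parameter t
    ns_exponent=-0.5,  # negative sampling distribution parameter alpha
    size=50,           # embedding size ("vector_size" in gensim 4.x)
    negative=5,        # number of negative samples
)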

The figure below shows the evaluation results from the paper. For now, it is enough to compare only Item2Vec trained with gensim's default parameters ("Out-of-the-box SGNS" in the table) and Item2Vec trained with the optimal values of the four parameters ("Fully optimised SGNS" in the table).

[Figure: item2vec結果.PNG (evaluation results from the paper)]

On the music datasets (30Music and Deezer), where the gap between popular and unpopular items is large, the optimized model performs about twice as well as the defaults! On the Click-Stream dataset, accuracy improves by roughly a factor of 10, which is amazing.

The paper also shows, for the 30Music dataset, how accuracy varies with the negative sampling distribution parameter $\alpha$ (ns_exponent in gensim).

[Figure: alpha.PNG (accuracy vs. $\alpha$ on the 30Music dataset)]

You can see that gensim's default of 0.75 is not the optimal value. Incidentally, based on the results of this paper, ns_exponent, which corresponds to $\alpha$, was added as an option in gensim.

Summary

This was an introduction to a paper on setting hyperparameters to match the problem at hand. Since ○○Vec methods are quite popular, it may be interesting to look into which parameters are worth optimizing for your own setting.
