Implementing item-based collaborative filtering in Python, using MovieLens as an example

Among the collaborative filtering methods, I implemented the simplest one, item-based collaborative filtering, in Python, so I am publishing it on Qiita. (This is my first time implementing it, so if there are any mistakes, please point them out.)

The MovieLens 100K Dataset is used as the data to recommend from. I will leave the detailed explanation to the linked page, but roughly speaking, it is a dataset in which users rated movies on a scale of 1 to 5.

About collaborative filtering

Collaborative filtering can be described as a "word-of-mouth recommendation system based on the evaluations of other people."

Collaborative filtering can be classified as follows, and item-based collaborative filtering is positioned as one of them.

- Memory-based methods: the accumulated data is used directly at recommendation time
  - User-based
  - Item-based (★ the method covered in this article)
- Model-based methods: regularities in the data, examined in advance, are used
  - Cluster model
  - Function model
  - etc.

For more information, please read "Recommendation System Explanation / Lecture Materials" by Professor Kamishima. The classification above is also based on that material.

Item-based collaborative filtering

As noted above, collaborative filtering is a recommender system that uses the evaluations of others. Item-based collaborative filtering rests on the following two assumptions.

Assumption 1: Among the items that user A has not yet rated (the recommendation candidates), the ones A is likely to enjoy are those similar to items A has rated highly.

Assumption 2: Two items can be considered similar if their rating patterns across users are similar.

For an intuitive picture, the explanation starting on page 36 of "Construction of recommender system using collaborative filtering" is easy to follow.

It is expressed by a mathematical formula as follows.

r'_{u,y} = \frac{\sum _{j \in Y_u} {s_{y,j}r_{u,j}}}{\sum _{j \in Y_u} {\mid s_{y,j} \mid}}

where
r'_{u,y} is the predicted rating of user u for item y,
s_{y,j} is the similarity between item y and item j,
Y_u is the set of items that user u has rated, and
r_{u,j} is user u's rating for item j.

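The numerator is the sum of the user's existing ratings, each weighted by how similar that item is to item y, which corresponds to Assumption 1. The denominator normalizes by the total similarity, turning the weighted sum into a weighted average.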
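As a small worked instance of the formula, the following sketch computes one prediction by hand. The ratings and similarities are made-up toy values, and the variable names (`r_u`, `s_y`) are chosen here only for illustration.

```python
import numpy as np

# Toy values (assumptions for illustration, not taken from MovieLens).
# User u has rated items 0 and 2; item 1 is the candidate item y.
r_u = np.array([5.0, 0.0, 1.0])  # r_{u,j}; 0 means "not rated"
s_y = np.array([0.2, 1.0, 0.1])  # s_{y,j}: similarity of item y=1 to each item j

rated = r_u > 0                               # Y_u as a boolean mask
numerator = np.sum(s_y[rated] * r_u[rated])   # 0.2*5 + 0.1*1
denominator = np.sum(np.abs(s_y[rated]))      # |0.2| + |0.1|
prediction = numerator / denominator

print(prediction)  # roughly 3.67
```

The prediction leans toward the rating of 5 because item 0 is twice as similar to the candidate as item 2.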

STEP1: Calculate the similarity of items

First, we need to calculate the similarity between items, which corresponds to Assumption 2.

We assume the data has already been read into a matrix R (rows are users, columns are movies, 0 means not rated), as described in "Convert csv, tsv data to matrix with python --- MovieLens as an example".

>>> print(R)
[[ 5.  3.  4. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ...,
 [ 5.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  5.  0. ...,  0.  0.  0.]]
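For readers who do not follow the linked article, a matrix like R could be built roughly as follows. This is only a sketch: it assumes the tab-separated `u.data` layout of MovieLens 100K (user id, item id, rating, timestamp, with 1-based ids), and `build_rating_matrix` is a hypothetical helper name, not the code from the linked article.

```python
import numpy as np

def build_rating_matrix(lines, n_users, n_items):
    """Build a user x item matrix from 'user<TAB>item<TAB>rating<TAB>timestamp' lines."""
    R = np.zeros((n_users, n_items))
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, item_id, rating, _timestamp = line.split("\t")
        # Ids in the file are 1-based; matrix indices are 0-based.
        R[int(user_id) - 1, int(item_id) - 1] = float(rating)
    return R

# Tiny in-memory stand-in for the file contents (made-up rows).
sample = ["1\t1\t5\t0", "1\t2\t3\t0", "2\t1\t4\t0"]
R_small = build_rating_matrix(sample, n_users=2, n_items=3)
print(R_small)
```

For the full dataset the same function could be fed an open file handle, for example `build_rating_matrix(open("u.data"), 943, 1682)`, where the path is an assumption about where the file was extracted.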

This time, we will use cosine similarity as the item similarity. Pearson correlation is apparently also used.

import numpy as np
from scipy.spatial.distance import cosine

def compute_item_similarities(R):
    # n: number of movies (items)
    n = R.shape[1]
    sims = np.zeros((n,n))

    for i in range(n):
        for j in range(i, n):
            if i == j:
                sim = 1.0
            else:
                # R[:, i] is a column vector of all users' ratings for item i
                sim = similarity(R[:,i], R[:,j])

            sims[i][j] = sim 
            sims[j][i] = sim 

    return sims 

def similarity(item1, item2):
    # Boolean mask of users who have rated both item1 and item2
    common = np.logical_and(item1 != 0, item2 != 0)

    v1 = item1[common]
    v2 = item2[common]

    sim = 0.0
    # Require at least 2 common raters; otherwise the similarity stays 0.0
    if v1.size > 1:
        sim = 1.0 - cosine(v1, v2)

    return sim

>>> sims = compute_item_similarities(R)
>>> print(sims)
[[ 1.          0.94873739  0.91329972 ...,  0.          0.          0.        ]
 [ 0.94873739  1.          0.90887971 ...,  0.          0.          0.        ]
 [ 0.91329972  0.90887971  1.         ...,  0.          0.          0.        ]
 ...,
 [ 0.          0.          0.         ...,  1.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          1.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          1.        ]]
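Since Pearson correlation was mentioned as an alternative, here is a rough sketch of what a drop-in replacement for `similarity` might look like. It is an illustration rather than part of the implementation above; `pearson_similarity` is a hypothetical name, and returning 0.0 when the correlation is undefined is an arbitrary choice made here.

```python
import numpy as np

def pearson_similarity(item1, item2):
    """Pearson correlation over users who rated both items (0 = not rated)."""
    common = np.logical_and(item1 != 0, item2 != 0)
    v1 = item1[common].astype(float)
    v2 = item2[common].astype(float)
    # Same rule as the cosine version: need at least 2 common raters.
    if v1.size < 2:
        return 0.0
    d1 = v1 - v1.mean()
    d2 = v2 - v2.mean()
    denom = np.sqrt(np.sum(d1 ** 2) * np.sum(d2 ** 2))
    # If all common users gave one item the same rating, the correlation
    # is undefined; fall back to 0.0 (an assumption of this sketch).
    if denom == 0:
        return 0.0
    return float(np.sum(d1 * d2) / denom)

a = np.array([5, 3, 0, 1])
b = np.array([4, 2, 5, 1])
print(pearson_similarity(a, b))
```

Unlike cosine similarity on positive ratings, Pearson correlation can be negative, which is the case where the absolute value in the denominator of the prediction formula actually matters.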

STEP2: Predict evaluation

STEP 1 gave us the similarity s in the formula below, so the predicted rating can be obtained by taking the similarity-weighted sum of the ratings and normalizing it.

r'_{u,y} = \frac{\sum _{j \in Y_u} {s_{y,j}r_{u,j}}}{\sum _{j \in Y_u} {\mid s_{y,j} \mid}}

def predict(u, sims):
    # x is 1 where the item is rated and 0 where it is not; used to compute the normalizers.
    x = np.zeros(u.size) 
    x[u > 0] = 1

    scores      = sims.dot(u)
    # abs matches the |s| in the formula; no effect when all similarities are non-negative
    normalizers = np.abs(sims).dot(x)

    prediction = np.zeros(u.size)
    for i in range(u.size):
        # The prediction is 0 when the denominator is 0 or the item is already rated
        if normalizers[i] == 0 or u[i] > 0:
            prediction[i] = 0
        else:
            prediction[i] = scores[i] / normalizers[i]

    # prediction[i] is the predicted rating of user u for item i
    return prediction

To check, I ran it on a simple example and got the following result.

u = np.array([5, 0, 1])
sims = np.array([ [1, 0.2, 0], [0.2, 1, 0.1], [0, 0.1, 1] ])

>>> print(predict(u, sims))
[ 0.          3.66666667  0.        ]
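To turn a prediction vector like this into actual recommendations, one option is to pick the items with the highest predicted ratings. The sketch below assumes the convention used by `predict` (0 means "already rated or not predictable"); `recommend_top_n` is a hypothetical helper, and the longer prediction vector is made up for illustration.

```python
import numpy as np

def recommend_top_n(prediction, n=10):
    """Return the indices of the n highest predicted items, best first."""
    # Items with prediction 0 are already rated or unpredictable; skip them.
    candidates = np.flatnonzero(prediction > 0)
    # Sort candidates by descending predicted rating (stable for ties).
    order = np.argsort(-prediction[candidates], kind="stable")
    return candidates[order[:n]]

# Made-up output of predict() for a user and five items.
prediction = np.array([0.0, 3.66666667, 0.0, 4.2, 1.5])
print(recommend_top_n(prediction, n=2))  # items 3 and 1, in that order
```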

Summary

I implemented item-based collaborative filtering using the MovieLens data. This article is based on the following materials.

- Recommendation system commentary / lecture materials
- Construction of recommender system using collaborative filtering
- 5th Collaborative Filtering: Basics of Information Recommendation System | gihyo.jp ... Gijutsu-Hyoronsha
