I implemented item-based collaborative filtering, the simplest kind of collaborative filtering, in Python, so I am publishing it here on Qiita. (This is my first implementation, so please point out any mistakes.)
The recommendation target is the MovieLens 100K Dataset. I will leave the detailed explanation to the linked page, but roughly speaking, it is a dataset of movie ratings on a scale of 1 to 5.
Collaborative filtering can be described as a "word-of-mouth recommendation system based on other people's ratings."
Collaborative filtering can be classified as follows, and item-based collaborative filtering is positioned as one of them.
- Memory-based methods: the accumulated data is used directly at recommendation time
  - User-based
  - Item-based (★ the subject of this article)
- Model-based methods: regularities found in the data in advance are used
  - Cluster models
  - Function models
  - etc.
For more information, please read Recommendation System Explanation / Lecture Materials by Professor Kamishima. This classification is also based on the above material.
I described collaborative filtering as a recommender system that "uses other people's ratings"; item-based collaborative filtering in particular rests on the following two assumptions.
Assumption 1: Among the items that user A has not yet rated (= recommendation candidates), the ones A is likely to like are those similar to items A has rated highly.
Assumption 2: Two items can be considered similar if their rating patterns are similar.
For an intuitive picture, the explanation from p. 36 onward of Construction of recommender system using collaborative filtering is easy to understand.
It is expressed by a mathematical formula as follows.
r'_{u,y} = \frac{\sum _{j \in Y_u} {s_{y,j}r_{u,j}}}{\sum _{j \in Y_u} {\mid s_{y,j} \mid}}
where
- r'_{u,y} is the predicted rating of user u for item y
- s_{y,j} is the similarity between item y and item j
- Y_u is the set of items that user u has rated
- r_{u,j} is user u's rating of item j
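The numerator is the sum of the user's ratings for the rated items, weighted by each item's similarity to item y; this corresponds to Assumption 1. The denominator normalizes by the sum of the absolute similarities so that the prediction stays on the rating scale.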
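To make the formula concrete, here is a minimal sketch that evaluates it literally for a single item. The function name `predict_single` and the toy numbers are made up for illustration; `s_y` is assumed to hold the similarities between item y and every item, and a rating of 0 is assumed to mean "not rated".

```python
import numpy as np

def predict_single(u, s_y):
    """Predicted rating for one item y, straight from the formula.

    u   : ratings of one user for all items (0 = not rated, an assumption)
    s_y : similarities between item y and all items
    """
    rated = u > 0                                  # Y_u: items the user has rated
    numerator = np.sum(s_y[rated] * u[rated])      # sum of s_{y,j} * r_{u,j}
    denominator = np.sum(np.abs(s_y[rated]))       # sum of |s_{y,j}|
    return numerator / denominator if denominator > 0 else 0.0

# Toy data: the user rated items 0 and 2; we predict item 1.
u_toy = np.array([4.0, 0.0, 2.0, 0.0])
s_toy = np.array([0.5, 1.0, 0.25, 0.0])
print(predict_single(u_toy, s_toy))  # (0.5*4 + 0.25*2) / (0.5 + 0.25), about 3.33
```

The prediction lands between the two known ratings (4 and 2), closer to 4 because item 0 is more similar to the target item.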
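First, we need to calculate the item similarities, which correspond to Assumption 2.
I assume the data has already been read into a matrix R, as described in "Convert csv, tsv data to matrix with python --- MovieLens as an example". Each row of R is a user, each column is a movie, and 0 means the user has not rated that movie.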
>>> print(R)
[[ 5. 3. 4. ..., 0. 0. 0.]
[ 4. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 5. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 5. 0. ..., 0. 0. 0.]]
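For readers who do not have R at hand, the following is one possible way to build such a matrix. It is a sketch, not necessarily the method of the linked article: the function name `load_ratings` is hypothetical, and it assumes tab-separated rows of user id, item id, rating, timestamp with 1-based ids, which is the layout of the `u.data` file in MovieLens 100K.

```python
import io
import numpy as np

def load_ratings(source, n_users, n_items):
    """Build a (users x items) rating matrix; unrated cells stay 0.

    source may be a file path or a file-like object containing
    tab-separated rows: user id, item id, rating, timestamp.
    """
    data = np.loadtxt(source, delimiter="\t", dtype=int, ndmin=2)
    R = np.zeros((n_users, n_items))
    # Ids in the file are assumed to be 1-based, so shift them to 0-based indices.
    R[data[:, 0] - 1, data[:, 1] - 1] = data[:, 2]
    return R

# Tiny in-memory stand-in for the real file.
sample = io.StringIO("1\t1\t5\t0\n1\t2\t3\t0\n2\t1\t4\t0\n3\t3\t1\t0\n")
R_small = load_ratings(sample, n_users=3, n_items=3)
print(R_small)
```

For the real dataset, the call would presumably look like `load_ratings("u.data", 943, 1682)`, since MovieLens 100K contains 943 users and 1682 movies; the actual path depends on where the archive was extracted.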
This time, I use cosine similarity to calculate the similarity between items. Pearson correlation is apparently also commonly used.
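As an aside, a Pearson-correlation variant might look like the sketch below. It is not part of this article's implementation: the name `pearson_similarity` is hypothetical, and computing each item's mean over the co-rating users only is just one of several possible conventions.

```python
import numpy as np

def pearson_similarity(item1, item2):
    """Pearson correlation between two item columns over co-rating users."""
    common = np.logical_and(item1 != 0, item2 != 0)
    v1 = item1[common]
    v2 = item2[common]
    # Same rule as the cosine version: require at least 2 common raters.
    if v1.size < 2:
        return 0.0
    d1 = v1 - v1.mean()
    d2 = v2 - v2.mean()
    denom = np.sqrt((d1 ** 2).sum() * (d2 ** 2).sum())
    # If either item got identical ratings from all common users,
    # the correlation is undefined; fall back to 0.
    if denom == 0:
        return 0.0
    return float((d1 * d2).sum() / denom)

a = np.array([5.0, 3.0, 0.0, 1.0])
b = np.array([4.0, 2.0, 5.0, 0.0])
print(pearson_similarity(a, b))  # users 0 and 1 rated both items
```

Unlike cosine similarity on positive ratings, Pearson correlation can be negative, which is one reason the denominator of the prediction formula uses absolute values. The cosine version actually used in this article follows.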
import numpy as np
from scipy.spatial.distance import cosine

def compute_item_similarities(R):
    # n: number of movies (items)
    n = R.shape[1]
    sims = np.zeros((n, n))
    for i in range(n):
        # The matrix is symmetric, so only the upper triangle is computed.
        for j in range(i, n):
            if i == j:
                sim = 1.0
            else:
                # R[:, i] is the column vector of all users' ratings for item i
                sim = similarity(R[:, i], R[:, j])
            sims[i, j] = sim
            sims[j, i] = sim
    return sims

def similarity(item1, item2):
    # Users who have rated both item1 and item2
    common = np.logical_and(item1 != 0, item2 != 0)
    v1 = item1[common]
    v2 = item2[common]
    sim = 0.0
    # Only compute a similarity when there are at least 2 common raters.
    if v1.size > 1:
        # scipy's cosine() returns the cosine distance, so convert it to a similarity.
        sim = 1.0 - cosine(v1, v2)
    return sim

sims = compute_item_similarities(R)
>>> print(sims)
[[ 1. 0.94873739 0.91329972 ..., 0. 0. 0. ]
[ 0.94873739 1. 0.90887971 ..., 0. 0. 0. ]
[ 0.91329972 0.90887971 1. ..., 0. 0. 0. ]
...,
[ 0. 0. 0. ..., 1. 0. 0. ]
[ 0. 0. 0. ..., 0. 1. 0. ]
[ 0. 0. 0. ..., 0. 0. 1. ]]
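The double loop above calls `similarity` once per item pair, which is over a million calls for 1682 movies. Under the same assumptions (0 means unrated, at least 2 common raters required), the same quantities can probably be computed with a few matrix products. The function name below is hypothetical and the sketch has not been benchmarked against the original.

```python
import numpy as np

def compute_item_similarities_vectorized(R):
    """Cosine similarity over co-rating users for all item pairs at once."""
    R = np.asarray(R, dtype=float)
    M = (R != 0).astype(float)   # M[u, i] = 1 if user u rated item i
    dot = R.T @ R                # dot[i, j] = sum of r_ui * r_uj over co-rating users
    sq = (R ** 2).T @ M          # sq[i, j] = sum of r_ui^2 over users who also rated j
    denom = np.sqrt(sq * sq.T)   # product of the two restricted norms
    counts = M.T @ M             # counts[i, j] = number of co-rating users
    sims = np.zeros(dot.shape)
    valid = (counts > 1) & (denom > 0)
    sims[valid] = dot[valid] / denom[valid]
    np.fill_diagonal(sims, 1.0)  # an item is always fully similar to itself
    return sims

R_test = np.array([[5, 3, 0],
                   [4, 0, 2],
                   [1, 1, 5]], dtype=float)
S_test = compute_item_similarities_vectorized(R_test)
print(S_test)
```

Because unrated cells are 0, they drop out of `R.T @ R` automatically; the mask `M` is only needed to restrict the norms to the users who rated both items.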
Since the similarities s in the formula below were obtained in STEP 1 (the similarity calculation above), the predicted value is obtained by taking the similarity-weighted sum of the ratings and normalizing it.
r'_{u,y} = \frac{\sum _{j \in Y_u} {s_{y,j}r_{u,j}}}{\sum _{j \in Y_u} {\mid s_{y,j} \mid}}
def predict(u, sims):
    # x is 1 for rated items and 0 for unrated ones; used to compute the normalizers.
    x = np.zeros(u.size)
    x[u > 0] = 1
    # Numerator: similarity-weighted sum of the user's ratings.
    scores = sims.dot(u)
    # Denominator: sum of |similarity| over the rated items, as in the formula.
    normalizers = np.abs(sims).dot(x)
    prediction = np.zeros(u.size)
    for i in range(u.size):
        # The prediction stays 0 when the denominator is 0 or the item is already rated.
        if normalizers[i] == 0 or u[i] > 0:
            prediction[i] = 0
        else:
            prediction[i] = scores[i] / normalizers[i]
    # prediction[i] is the predicted rating of user u for item i.
    return prediction
To check, I tried it on a simple example and got the expected result.
u = np.array([5, 0, 1])
sims = np.array([ [1, 0.2, 0], [0.2, 1, 0.1], [0, 0.1, 1] ])
>>> print(predict(u, sims))
[ 0. 3.66666667 0. ]
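The middle value can be checked by hand against the formula. Using 0-based item indices as in the code, the user has rated items 0 and 2, so Y_u = {0, 2}, and the prediction for item 1 is:

```latex
\begin{aligned}
r'_{u,1} &= \frac{s_{1,0}\,r_{u,0} + s_{1,2}\,r_{u,2}}{\mid s_{1,0} \mid + \mid s_{1,2} \mid} \\
         &= \frac{0.2 \times 5 + 0.1 \times 1}{0.2 + 0.1} \\
         &= \frac{1.1}{0.3} \approx 3.67
\end{aligned}
```

The other two entries are 0 because items 0 and 2 are already rated, so `predict` does not produce a prediction for them.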
I implemented item-based collaborative filtering using the MovieLens data. This article is based on the following materials.
- Recommendation system commentary / lecture materials
- Construction of recommender system using collaborative filtering
- 5th Collaborative Filtering: Basics of Information Recommendation System | gihyo.jp ... Gijutsu-Hyoronsha