[PYTHON] Collaborative filtering with principal component analysis and K-means clustering

Considering recommendation engines for API use

Recommendation engines are often served through web APIs, which must return results quickly while maintaining accuracy. So in this article, when implementing a recommendation feature based on movie viewing history, we use principal component analysis and K-means clustering to preserve as much recommendation accuracy as possible while balancing it against computation speed, so that it can stand up to API use.

The recommendation logic used this time is simple item-based collaborative filtering: for a selected movie, extract and recommend movies whose user ratings are similar.
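
As a toy illustration of the idea (made-up rating vectors, one entry per user, with -1 meaning unrated, as in the script later): two movies are similar when their rating vectors are close.

import numpy as np

# Hypothetical per-user rating vectors for three movies
movie_a = np.array([5, 3, -1, 4])
movie_b = np.array([5, 2, -1, 4])
movie_c = np.array([1, 5, 2, -1])

# Smaller Euclidean distance = more similar rating pattern
print(np.linalg.norm(movie_a - movie_b))  # small: good candidate to recommend
print(np.linalg.norm(movie_a - movie_c))  # large: poor candidate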

Principal component analysis and K-means clustering

Principal component analysis (PCA) is a method that extracts only the principal components of a high-dimensional vector, lowering its dimensionality and thereby reducing the amount of data. Since the number of dimensions after PCA is fixed in advance, the amount of data, and therefore the amount of computation, stays constant even if the source vectors grow enormous.
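
As a minimal illustration (synthetic data, not the MovieLens set): PCA fixes the output dimensionality regardless of how wide the input is.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
high_dim = rng.random((100, 1000))  # 100 samples in 1,000 dimensions

pca = PCA(n_components=10)
low_dim = pca.fit_transform(high_dim)
print(low_dim.shape)  # (100, 10): downstream cost is now fixed by the chosen dimension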

Furthermore, by grouping the movies roughly with K-means clustering, we limit which movies need to be compared, reducing the number of comparisons and improving response speed.
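
Similarly, a minimal K-means sketch (again with synthetic data): once items are clustered, a query only needs to be compared against items in its own cluster.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.random((300, 10))
labels = KMeans(n_clusters=3, n_init=10).fit_predict(points)

# Only items sharing a cluster with item 0 remain as comparison candidates
candidates = np.where(labels == labels[0])[0]
print(f"{len(candidates)} of {len(points)} items left to compare")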

This time, we use MovieLens, a freely available movie rating dataset: https://grouplens.org/datasets/movielens/100k/

Roughly speaking...

Movies that users rate similarly to a movie you like should also appeal to you! However, comparing against every movie would take too long, so first group roughly similar movies together (clustering), extract only the ratings that carry the most information (principal component analysis), and then compare!

A little more about principal component analysis

For example, given rating data for 10,000 movies, dimensions where everyone has watched or almost no one has watched carry little information, so they can be excluded from the comparison. Let's narrow down to about 100 dimensions that capture the characteristics well. (Strictly speaking, PCA is a bit different, but this is roughly the idea.)
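
One hedged way to check how much information survives a given choice of dimension is the cumulative explained variance ratio, which the script below also prints. A small sketch with stand-in data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
ratings = rng.random((500, 1000))  # stand-in for a movies x users rating matrix

pca = PCA(n_components=100).fit(ratings)
# Fraction of the original variance retained by the first 100 components
print(pca.explained_variance_ratio_.sum())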

Data acquisition

Download and unzip ml-100k.zip from https://grouplens.org/datasets/movielens/100k/. It contains several files, which are explained in the README. This time, I mainly use u1.base.
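
Incidentally, movie titles (used to interpret the results later) can be looked up in u.item, which, per the README, is pipe-separated with the movie ID and title in the first two columns and uses Latin-1 encoding:

import pandas as pd

# u.item: movie id | movie title | release date | ... (see the README)
items = pd.read_csv("u.item", sep="|", encoding="latin-1", header=None, usecols=[0, 1])
items.columns = ["movie_id", "title"]
print(items[items["movie_id"] == 1])  # movie ID 1 is Toy Story (1995)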

Program

The program below walks through each step with comments.

recommend.py


import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

#Read the TSV-format rating data: user ID, movie ID, rating (u1.base has no header row)
datum = np.loadtxt("u1.base", delimiter="\t", usecols=(0, 1, 2))

#Prepare sorted lists of user IDs and movie IDs
user_ids = []
movie_ids = []
for row in datum:
    user_ids.append(row[0])
    movie_ids.append(row[1])
#Sort after deduplicating so that list indices match the dataset rows built below
user_ids = sorted(set(user_ids))
movie_ids = sorted(set(movie_ids))

#Organize rating data by movie ID
vectors = {}
for movie_id in movie_ids:
    vectors[movie_id] = {}
    for user_id in user_ids:
        #Unwatched movies get a default rating of -1
        vectors[movie_id][user_id] = -1

#Store each user rating in a vector
for row in datum:
    vectors[row[1]][row[0]] = row[2]

dataset = []

#Flatten into a list of rating vectors, one row per movie (rows follow sorted movie ID order)
for movie_id in movie_ids:
    temp_data = []
    for user_id in user_ids:
        temp_data.append(vectors[movie_id][user_id])
    dataset.append(temp_data)

#Classify into 3 clusters with K-means
predict = KMeans(n_clusters=3).fit_predict(dataset)

#Number of dimensions after principal component analysis
DIMENSION_NUM = 128
#Principal component analysis
pca = PCA(n_components=DIMENSION_NUM)
dataset = pca.fit_transform(dataset)
print('Cumulative contribution rate: {0}'.format(sum(pca.explained_variance_ratio_)))

#Find movies similar to movie ID 1
MOVIE_ID = 1
#Get the cluster ID of movie ID 1
CLUSTER_ID = predict[movie_ids.index(MOVIE_ID)]

distance_data = {}
target_vector = np.array(dataset[movie_ids.index(MOVIE_ID)], dtype=float)
for index in range(len(predict)):
    #Compare vector distances only within the same cluster
    if predict[index] == CLUSTER_ID:
        distance = np.linalg.norm(np.array(dataset[index], dtype=float) - target_vector)
        distance_data[movie_ids[index]] = distance

#Display in vector distance order
print(sorted(distance_data.items(), key=lambda x: x[1]))

Result

Cumulative contribution rate: 0.7248119795849713
[(1.0, 0.0), (121.0, 67.0315681132561), (117.0, 69.90161652852805), (405.0, 71.07049485275981), (151.0, 71.39559068741323), (118.0, 72.04600188124728), (222.0, 72.78595965661094), (181.0, 74.18442192660996), (742.0, 76.10520742268852), (28.0, 76.27732956739469), (237.0, 76.31850794116573), (25.0, 76.82773190547944), (7.0, 76.96541606511116), (125.0, 77.07961442692692), (95.0, 77.42577990621398), (257.0, 77.87452368514414), (50.0, 78.80566867021435), (111.0, 78.9631520879044), (15.0, 78.97825600046046), (69.0, 79.22663656944697), (588.0, 79.64989759225082), (82.0, 80.23718315576053), (71.0, 80.26936193506091), (79.0, 81.02025503780014).....

The movie with ID = 1 is Toy Story, and the closest movie, ID = 121, is Independence Day. A cumulative contribution rate of 0.72 means the principal components retain about 72% of the original data's variance. Somehow that feels about right!

Subsequent implementation

This time, a single script handled everything: data shaping, clustering, principal component analysis, and comparison. In a real implementation, the PCA-transformed data would be stored in a database beforehand, so that each request only performs the comparison.
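
A minimal sketch of that split, where the arrays from the script above (dataset, predict, movie_ids) stand in for what a batch job would compute, and NumPy files stand in for a real database:

import numpy as np

# Batch side: persist the PCA-reduced vectors and cluster labels
np.save("movie_vectors.npy", np.array(dataset))
np.save("movie_clusters.npy", np.array(predict))
np.save("movie_id_list.npy", np.array(movie_ids))

# Request side: load once at startup, then only compare per request
movie_vectors = np.load("movie_vectors.npy")
movie_clusters = np.load("movie_clusters.npy")
id_list = list(np.load("movie_id_list.npy"))

def similar_movies(movie_id, top_n=10):
    idx = id_list.index(movie_id)
    # Compare only movies in the same cluster; others are pushed to the end
    dists = np.linalg.norm(movie_vectors - movie_vectors[idx], axis=1)
    masked = np.where(movie_clusters == movie_clusters[idx], dists, np.inf)
    return [(id_list[i], float(dists[i])) for i in np.argsort(masked)[:top_n]]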

In addition, since movie and rating data grow daily, clustering and principal component analysis would be re-run as batch jobs at an appropriate interval.

When turning this into an API, recommendation accuracy and response speed can be balanced by tuning the number of principal component dimensions and clusters, API caching, and so on. At first, this recommendation accuracy can only be judged by whether the results feel right; combining it with actual behavioral data such as CTR and CVR, and with deep learning, would make it a more modern machine learning setup.
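
As a teaser, here is a minimal Flask sketch (hypothetical route name) serving the similar_movies helper from the previous sketch; the heavy lifting already happened in batch, so each request only does the distance comparison:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/recommend/<int:movie_id>")
def recommend(movie_id):
    # similar_movies() and its precomputed arrays come from the previous sketch
    return jsonify([[int(mid), dist] for mid, dist in similar_movies(movie_id)])

if __name__ == "__main__":
    app.run()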

Next time, I plan to write about building the API itself with Python (Flask, etc.).
