**Introducing a recommender system using movie datasets: implementing a movie rating prediction system with matrix factorization**
Last time, I introduced anomaly detection with an autoencoder using unsupervised learning: https://qiita.com/nakanakana12/items/f238b9760af2c62fa0e8
However, those methods only detect anomalies; they cannot generate new data. Next, I would like to look at models that generate data.
Such models aim to learn the probability distribution of the dataset and make inferences about data they have never seen.
In the reference book ["Unsupervised Learning with Python" (Ankur Patel)](https://www.amazon.co.jp/Python%E3%81%A7%E3%81%AF%E3%81%98%E3%82%81%E3%82%8B%E6%95%99%E5%B8%AB%E3%81%AA%E3%81%97%E5%AD%A6%E7%BF%92-%E2%80%95%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E5%8F%AF%E8%83%BD%E6%80%A7%E3%82%92%E5%BA%83%E3%81%92%E3%82%8B%E3%83%A9%E3%83%99%E3%83%AB%E3%81%AA%E3%81%97%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%88%A9%E7%94%A8-Ankur-Patel/dp/4873119103), a recommender system was given as an example. In this article, I will first introduce a **"rating prediction system using matrix factorization"**, which seems to be a widely used approach.
The dataset is MovieLens 20M, a movie rating dataset with about 20,000,000 ratings. Since the full data is around 200 MB, we use only **1,000 highly rated movies and 1,000 randomly sampled users**, which leaves about 90,000 ratings.
The reduced data used in this article is here: https://drive.google.com/file/d/1mXioVp1LiBQt1TJyE9IStCCpjijpb4SX/view?usp=sharing
The original data can be downloaded here: https://grouplens.org/datasets/movielens/20m/
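The article loads a pre-reduced pickle later on, but for reference, here is a rough sketch of how such a reduction could be produced from the raw `ratings.csv`. The exact selection criterion (most-rated vs. highest-rated movies) and the re-indexing behind `ratingReducedPickle.pkl` are assumptions here, not the article's actual preprocessing.

```python
# Rough sketch (assumptions, not the article's actual preprocessing):
# reduce MovieLens 20M to 1000 movies x 1000 users.
import pandas as pd

ratings = pd.read_csv("ml-20m/ratings.csv")  # columns: userId, movieId, rating, timestamp

# Keep the 1000 movies with the most ratings (assumed criterion)
top_movies = ratings["movieId"].value_counts().head(1000).index
reduced = ratings[ratings["movieId"].isin(top_movies)]

# Randomly sample 1000 users among those who rated these movies
sampled_users = reduced["userId"].drop_duplicates().sample(1000, random_state=0)
reduced = reduced[reduced["userId"].isin(sampled_users)].copy()

# Re-index users and movies with consecutive IDs starting at 1,
# matching the newUserId / newMovieId columns used later in this article
reduced["newUserId"] = reduced["userId"].astype("category").cat.codes + 1
reduced["newMovieId"] = reduced["movieId"].astype("category").cat.codes + 1

reduced.to_pickle("ratingReducedPickle.pkl")
```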
In recommender systems, a method called **collaborative filtering** is often used. It makes recommendations based on the behavior of similar users; Netflix is a famous example. However, collaborative filtering has problems such as requiring a large amount of data and having difficulty capturing similarities between items (movies).
Therefore, **dimensionality reduction by matrix factorization** is used.
Here is a very simple overview; please see the reference article for the detailed principle. Reference article: Matrix Factorization, Recommendations and Me https://qiita.com/michi_wkwk/items/52660778ad6a900965ee
Matrix factorization first extracts latent factors for each user and each movie. A latent factor is a vector that represents the characteristics of a user (or a movie).
For example, if there are 1000 movies, each user's ratings form a vector of length 1000.
**This rating matrix is compressed to reduce its dimensionality; in the figure (from the original post), it is compressed to k columns.**
By predicting each user-movie rating from this compressed representation, predictions can be made with far less data.
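As a minimal illustration (not part of the article's code), the idea can be written in a few lines of NumPy: the n_users × n_movies rating matrix R is approximated by the product of a user-factor matrix U and a movie-factor matrix V, each with only k columns.

```python
# Minimal sketch of the matrix factorization idea (illustrative values only)
import numpy as np

n_users, n_movies, k = 1000, 1000, 3   # k latent factors, much smaller than 1000

U = np.random.rand(n_users, k)    # latent factors for each user
V = np.random.rand(n_movies, k)   # latent factors for each movie

# The predicted rating of user i for movie j is the dot product U[i] . V[j]
R_hat = U @ V.T
print(R_hat.shape)  # (1000, 1000)
```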
As before, the code follows the reference book.
```python
'''Main'''
import numpy as np
import pandas as pd
import os, time, re
import pickle, gzip, datetime
from datetime import datetime
'''Data Viz'''
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
import matplotlib as mpl
%matplotlib inline
'''Data Prep and Model Evaluation'''
from sklearn import preprocessing as pp
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score, mean_squared_error
'''Algos'''
import lightgbm as lgb
'''TensorFlow and Keras'''
# import tensorflow as tf
import keras
from keras import backend as K
from keras.models import Sequential, Model
from keras.layers import Activation, Dense, Dropout
from keras.layers import BatchNormalization, Input, Lambda
from keras.layers import Embedding, Flatten, dot
from keras import regularizers
from keras.losses import mse, binary_crossentropy
sns.set("talk")
```
```python
# DATA_DIR is assumed to be the directory where the downloaded pickle was saved; adjust to your environment
DATA_DIR = "./"
ratingDFX3 = pd.read_pickle(DATA_DIR + "data/datasets/movielens_data/ratingReducedPickle.pkl")
#Check the contents
n_users = ratingDFX3.userId.unique().shape[0]
n_movies = ratingDFX3.movieId.unique().shape[0]
n_ratings = len(ratingDFX3)
avg_ratings_per_user = n_ratings/n_users
print('Number of unique users: ', n_users)
print('Number of unique movies: ', n_movies)
print('Number of total ratings: ', n_ratings)
print('Average number of ratings per user: ', avg_ratings_per_user)
"""
Number of unique users: 1000
Number of unique movies: 1000
Number of total ratings: 90213
Average number of ratings per user: 90.213
"""
ratingDFX3.head()
```
Checking the contents with `head()` shows the columns: `rating` is the rating score, and `newMovieId` and `newUserId` are the re-indexed IDs used for modeling.
```python
# Split into training, validation, and test data
X_train, X_test = train_test_split(ratingDFX3, test_size=0.10,
                                   shuffle=True, random_state=2018)
X_validation, X_test = train_test_split(X_test, test_size=0.50,
                                        shuffle=True, random_state=2018)
```
Next, create a user × movie rating matrix. Since each user rates only a small number of movies, most entries are zero (a sparse matrix). The validation data is also flattened into a single vector of its nonzero ratings, which will be used for evaluation.
```python
# Create the training rating matrix: a mostly-zero (sparse) user x movie matrix
ratings_train = np.zeros((n_users, n_movies))
for row in X_train.itertuples():
    # row[6]-1 indexes the user, row[5]-1 the movie; row[3] is the rating
    ratings_train[row[6]-1, row[5]-1] = row[3]
ratings_train.shape
# (1000, 1000)

# Validation rating matrix
ratings_validation = np.zeros((n_users, n_movies))
for row in X_validation.itertuples():
    ratings_validation[row[6]-1, row[5]-1] = row[3]

# Flatten the nonzero validation ratings into a single vector
actual_validation = ratings_validation[ratings_validation.nonzero()].flatten()
```
For evaluation, we use the mean squared error (MSE) between the predicted and actual ratings. As a baseline, let's check
**the error when every rating is predicted to be 3.5**.
```python
# Baseline: MSE on the validation set when every rating is predicted to be 3.5
pred_validation = np.zeros((len(X_validation), 1))
pred_validation[pred_validation == 0] = 3.5
print("3.5 MSE:", mean_squared_error(pred_validation, actual_validation))
# 3.5 MSE: 1.055420084238528
```
**The result is 1.055. For now, this is our baseline.**
Can a recommendation system based on matrix factorization do better?
```python
# Matrix factorization: compress users and movies into low-dimensional embeddings
n_latent_factors = 3  # number of latent factors (embedding dimension)

# User embedding (compressed user representation)
user_input = Input(shape=[1], name="user")
user_embedding = Embedding(input_dim=n_users + 1, output_dim=n_latent_factors,
                           name="user_embedding")(user_input)
user_vec = Flatten(name="flatten_users")(user_embedding)

# Movie embedding (compressed movie representation)
movie_input = Input(shape=[1], name='movie')
movie_embedding = Embedding(input_dim=n_movies + 1,
                            output_dim=n_latent_factors,
                            name='movie_embedding')(movie_input)
movie_vec = Flatten(name='flatten_movies')(movie_embedding)

# The predicted rating is the dot product of the two latent vectors
product = dot([movie_vec, user_vec], axes=1)
model = Model(inputs=[user_input, movie_input], outputs=product)
model.compile('adam', 'mean_squared_error')

history = model.fit(x=[X_train.newUserId, X_train.newMovieId],
                    y=X_train.rating, epochs=30,
                    validation_data=([X_validation.newUserId,
                                      X_validation.newMovieId],
                                     X_validation.rating),
                    verbose=1)
```
Let's check the result.
```python
# Plot the validation loss (skipping the first 10 epochs)
pd.Series(history.history['val_loss'][10:]).plot(logy=False)
plt.xlabel("Epoch")
plt.ylabel("Validation Error")
print('Minimum MSE: ', min(history.history['val_loss']))
# Minimum MSE: 0.7946764826774597
**With matrix factorization, the MSE is 0.794, clearly below the baseline of 1.055.** This shows that the rating prediction system based on matrix factorization is working well.
In practice, a recommender system could be built by recommending movies that a user has not yet rated but is predicted to rate highly.
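As a rough sketch of that idea (this helper is not in the article or the reference book; the function name `recommend_for_user` is hypothetical, and `newUserId` is assumed to be 1-based, matching the indexing used for `ratings_train` above), one could rank the movies a user has not yet rated by their predicted rating:

```python
# Minimal sketch: recommend the top unrated movies for one user
import numpy as np

def recommend_for_user(model, ratings_train, new_user_id, top_n=10):
    """Return the movie IDs and predicted ratings of the top-N unrated movies.
    Assumes new_user_id is 1-based, consistent with how ratings_train was built."""
    n_movies = ratings_train.shape[1]
    all_movie_ids = np.arange(1, n_movies + 1)
    # Keep only movies this user has not rated in the training matrix
    unrated = all_movie_ids[ratings_train[new_user_id - 1] == 0]
    user_ids = np.full_like(unrated, new_user_id)
    preds = model.predict([user_ids, unrated], verbose=0).flatten()
    # Sort by predicted rating, highest first
    order = np.argsort(preds)[::-1][:top_n]
    return unrated[order], preds[order]

# Example: top 10 recommendations for user 1
movie_ids, scores = recommend_for_user(model, ratings_train, new_user_id=1)
```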
In this article, I introduced a rating prediction system based on matrix factorization, using a movie recommender system as the theme.
However, the true power of unsupervised learning lies in
**unsupervised generative models**.
I would like to keep studying and write about that next.
Reference: Matrix Factorization, Recommendations and Me https://qiita.com/michi_wkwk/items/52660778ad6a900965ee