Recommender system using matrix factorization [Unsupervised learning with python Chapter 10]

What to do in this article

- **Introduce a recommender system using a movie dataset**
- **Implement a movie rating prediction system using matrix factorization**

Introduction

Last time, I introduced anomaly detection by autoencoder using unsupervised learning. https://qiita.com/nakanakana12/items/f238b9760af2c62fa0e8

However, those methods can only discriminate between data points; they cannot generate new data. From here on, I will introduce models that generate data.

Such models aim to learn the probability distribution of the dataset and make inferences about data they have never seen.

The reference book, ["Unsupervised Learning with Python" (Ankur Patel)](https://www.amazon.co.jp/Python%E3%81%A7%E3%81%AF%E3%81%98%E3%82%81%E3%82%8B%E6%95%99%E5%B8%AB%E3%81%AA%E3%81%97%E5%AD%A6%E7%BF%92-%E2%80%95%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E5%8F%AF%E8%83%BD%E6%80%A7%E3%82%92%E5%BA%83%E3%81%92%E3%82%8B%E3%83%A9%E3%83%99%E3%83%AB%E3%81%AA%E3%81%97%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%88%A9%E7%94%A8-Ankur-Patel/dp/4873119103), gives a recommender system as an example. In this article, I will first introduce **rating prediction by matrix factorization**, which seems to be a widely used method.

Data to handle

The data is a movie rating dataset called MovieLens 20M, which consists of about 20,000,000 ratings. Since the full dataset is large (about 200 MB), we use **only the ratings of 1,000 highly rated movies by 1,000 randomly sampled users**. This leaves about 90,000 ratings.

The reduced dataset used here is available at https://drive.google.com/file/d/1mXioVp1LiBQt1TJyE9IStCCpjijpb4SX/view?usp=sharing

The original data can be downloaded from here. https://grouplens.org/datasets/movielens/20m/

What is matrix factorization?

In recommender systems, a method called **collaborative filtering** is often used. It makes recommendations based on the behavior of similar users; Netflix is a well-known example. However, collaborative filtering has drawbacks: it needs a large amount of data, and it cannot capture similarities between similar movies.
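As a rough illustration of "recommendations based on the behavior of similar users", the sketch below computes the cosine similarity between users on a tiny made-up rating matrix and predicts a missing rating as a similarity-weighted average. The rating values and the helper names (cosine_similarity, predict_rating) are hypothetical, and this is not the method used in the rest of this article.

```python
import numpy as np

# Toy rating matrix (rows: users, columns: movies); 0 means "not rated".
# All values are made up for illustration.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 4.0, 1.0],
    [1.0, 1.0, 2.0, 5.0],
])

def cosine_similarity(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_rating(ratings, user, movie):
    """Similarity-weighted average of the other users' ratings for `movie`."""
    num, den = 0.0, 0.0
    for other in range(ratings.shape[0]):
        # Skip the user themselves and users who have not rated the movie.
        if other == user or ratings[other, movie] == 0:
            continue
        sim = cosine_similarity(ratings[user], ratings[other])
        num += sim * ratings[other, movie]
        den += abs(sim)
    return num / den if den > 0 else 0.0

sim_01 = cosine_similarity(ratings[0], ratings[1])
sim_02 = cosine_similarity(ratings[0], ratings[2])
pred = predict_rating(ratings, user=0, movie=2)
print(sim_01, sim_02, pred)
```

User 0 is more similar to user 1 than to user 2, so the predicted rating for the unrated movie is pulled toward user 1's rating.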

Therefore, **dimensionality reduction by matrix factorization** is used.

Here's a super-simple overview. Please refer to the reference article for the detailed principle. Reference article: Matrix Factorization, Recommendations and Me https://qiita.com/michi_wkwk/items/52660778ad6a900965ee

Matrix factorization first extracts latent factors for each user and each movie. A latent factor vector is a compact vector that captures the characteristics of a user (or a movie).

For example, if a user has rated 1,000 movies, that user is represented by a vector of 1,000 values.

**This matrix is compressed to reduce its dimensionality, as shown in the figure, where it is compressed to k columns.**

image.png

The rating for each user-movie pair is then predicted from these compressed representations, which makes prediction possible with less data.
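To make the idea concrete, here is a minimal NumPy sketch that factorizes a tiny made-up rating matrix into user factors U and movie factors V with k = 2, using plain stochastic gradient descent on the observed entries only. The matrix values, the learning rate, and the regularization strength are all assumptions chosen for illustration; the implementation later in this article uses Keras embeddings instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix (4 users x 5 movies); 0 means "not rated". Made-up values.
R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 3],
    [1, 1, 0, 5, 0],
    [0, 1, 5, 4, 0],
], dtype=float)

n_users, n_movies = R.shape
k = 2                                          # number of latent factors
U = 0.1 * rng.standard_normal((n_users, k))    # user factors
V = 0.1 * rng.standard_normal((n_movies, k))   # movie factors

# Only the observed (nonzero) ratings contribute to the loss.
observed = [(u, m) for u in range(n_users) for m in range(n_movies) if R[u, m] > 0]

def mse(U, V):
    """Mean squared error over the observed ratings only."""
    return float(np.mean([(R[u, m] - U[u] @ V[m]) ** 2 for u, m in observed]))

initial_error = mse(U, V)
lr, reg = 0.02, 0.01                           # hypothetical hyperparameters
for epoch in range(1000):
    for u, m in observed:
        err = R[u, m] - U[u] @ V[m]
        u_old = U[u].copy()
        U[u] += lr * (err * V[m] - reg * U[u])
        V[m] += lr * (err * u_old - reg * V[m])
final_error = mse(U, V)

R_hat = U @ V.T    # predicted rating for every user-movie pair
print(initial_error, final_error)
print(R_hat.round(2))
```

The product U @ V.T fills in every cell of the matrix, including the ones that were unrated, which is what makes the factorization usable for prediction.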

Library import

The imports follow the reference book as-is.

python


'''Main'''
import numpy as np
import pandas as pd
import os, time, re
import pickle, gzip
from datetime import datetime

'''Data Viz'''
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
import matplotlib as mpl

%matplotlib inline

'''Data Prep and Model Evaluation'''
from sklearn import preprocessing as pp
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import StratifiedKFold 
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score, mean_squared_error

'''Algos'''
import lightgbm as lgb

'''TensorFlow and Keras'''
# import tensorflow as tf
import keras
from keras import backend as K
from keras.models import Sequential, Model
from keras.layers import Activation, Dense, Dropout
from keras.layers import BatchNormalization, Input, Lambda
from keras.layers import Embedding, Flatten, dot
from keras import regularizers
from keras.losses import mse, binary_crossentropy

sns.set("talk")

Data reading and confirmation

python


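#DATA_DIR is not defined in this article; "./" is an assumption, adjust it to your environment
DATA_DIR = "./"
ratingDFX3 = pd.read_pickle(DATA_DIR + "data/datasets/movielens_data/ratingReducedPickle.pkl")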

#Check the contents
n_users = ratingDFX3.userId.unique().shape[0]
n_movies = ratingDFX3.movieId.unique().shape[0]
n_ratings = len(ratingDFX3)
avg_ratings_per_user = n_ratings/n_users

print('Number of unique users: ', n_users)
print('Number of unique movies: ', n_movies)
print('Number of total ratings: ', n_ratings)
print('Average number of ratings per user: ', avg_ratings_per_user)
"""
Number of unique users:  1000
Number of unique movies:  1000
Number of total ratings:  90213
Average number of ratings per user:  90.213
"""

ratingDFX3.head()

Displaying the first rows gives the following. rating is the rating, and newMovieId and newUserId are the movie and user IDs used in the analysis below.

image.png

python


#Split into training (90%), validation (5%), and test (5%) data
X_train, X_test = train_test_split(ratingDFX3, test_size=0.10, \
                                   shuffle=True, random_state=2018)

#Split the held-out 10% in half: validation and test
X_validation, X_test = train_test_split(X_test, test_size=0.50, \
                                        shuffle=True, random_state=2018)

Creating a rating matrix

First, create a users x movies rating matrix. Since each user rates only a small number of movies, most entries are zero (a sparse matrix). Next, the nonzero ratings of the validation data are flattened into a one-dimensional array, which we use for evaluation.

python


#Create the rating matrix (users x movies); most entries stay zero (sparse)
ratings_train = np.zeros((n_users, n_movies))
for row in X_train.itertuples():
    #IDs start at 1, so subtract 1 to get 0-based indices
    ratings_train[row.newUserId - 1, row.newMovieId - 1] = row.rating

ratings_train.shape
#(1000,1000)

#Rating matrix for the validation data
ratings_validation = np.zeros((n_users, n_movies))
for row in X_validation.itertuples():
    ratings_validation[row.newUserId - 1, row.newMovieId - 1] = row.rating

#Flatten the nonzero validation ratings into a 1-D array
actual_validation = ratings_validation[ratings_validation.nonzero()].flatten()
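The same construction can be sketched on toy data without a loop. The arrays below are hypothetical stand-ins for the newUserId, newMovieId and rating columns (1-based IDs are assumed), and the sketch also shows how sparse the resulting matrix is and in which order the nonzero ratings come back.

```python
import numpy as np

# Hypothetical (user ID, movie ID, rating) triples with 1-based IDs.
user_ids = np.array([1, 1, 2, 3, 3])
movie_ids = np.array([1, 3, 2, 1, 4])
ratings = np.array([4.0, 3.5, 5.0, 2.0, 4.5])

n_users, n_movies = 3, 4
rating_matrix = np.zeros((n_users, n_movies))
# Fancy indexing fills all entries at once; subtract 1 to get 0-based indices.
rating_matrix[user_ids - 1, movie_ids - 1] = ratings

# Fraction of entries that are still zero (i.e. unrated).
sparsity = 1.0 - np.count_nonzero(rating_matrix) / rating_matrix.size

# Nonzero ratings come back in row-major order (by user, then by movie).
flat = rating_matrix[rating_matrix.nonzero()]
print(rating_matrix)
print(sparsity, flat)
```

With 90,213 ratings spread over 1,000 x 1,000 user-movie pairs, at most about 9% of the entries of the full matrix can be filled. Note also that the flattened array is ordered by matrix position, not by the row order of the original DataFrame, which matters if it is compared element by element with per-row predictions.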

Baseline settings

The evaluation metric is the mean squared error (MSE) between the predicted and actual ratings. As a baseline, consider:

**"What is the error if we predict 3.5 for every rating?"**

python


#Compute the MSE on the validation set when every prediction is 3.5
#Use this as the baseline

pred_validation = np.full(len(X_validation), 3.5)

print("3.5 MSE:", mean_squared_error(actual_validation, pred_validation))
#3.5 MSE: 1.055420084238528

**The result is 1.055. We use this as the baseline.**
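To make the baseline calculation concrete, here is the same MSE computed by hand on a handful of made-up ratings (the values are hypothetical, not taken from the dataset).

```python
# Hypothetical validation ratings (made-up values).
actual = [4.0, 3.0, 5.0, 2.5, 3.5]
baseline = 3.5

# MSE = mean of the squared differences between prediction and actual rating.
squared_errors = [(baseline - r) ** 2 for r in actual]
mse = sum(squared_errors) / len(squared_errors)
print(mse)  # 0.75
```

Any model worth using should produce a lower MSE than this constant prediction.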

Can a recommendation system based on matrix factorization exceed this?

Prediction by matrix factorization

python


#Matrix factorization
#Compress users and movies into low-dimensional latent factors

n_latent_factors = 3 #Number of latent factors (embedding dimension)

#User embedding: map each user ID to a vector of n_latent_factors values
user_input = Input(shape=[1], name="user")
user_embedding = Embedding(input_dim=n_users+1, output_dim=n_latent_factors,name="user_embedding")(user_input)
user_vec = Flatten(name="flatten_users")(user_embedding)

#Movie embedding: map each movie ID to a vector of n_latent_factors values
movie_input = Input(shape=[1], name='movie')
movie_embedding = Embedding(input_dim=n_movies + 1, \
                            output_dim=n_latent_factors,
                            name='movie_embedding')(movie_input)
movie_vec = Flatten(name='flatten_movies')(movie_embedding)

#The predicted rating is the dot product of the movie and user vectors
product = dot([movie_vec, user_vec], axes=1)
model = Model(inputs=[user_input, movie_input], outputs=product)
model.compile('adam', 'mean_squared_error')

history = model.fit(x=[X_train.newUserId, X_train.newMovieId], \
                    y=X_train.rating, epochs=30, \
                    validation_data=([X_validation.newUserId, \
                    X_validation.newMovieId], X_validation.rating), \
                    verbose=1)
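What this model computes for each (user, movie) pair can be sketched in plain NumPy: look up one row per ID in each embedding table and take the dot product. The tables below are random stand-ins rather than trained weights, and the sizes and the predict helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_latent_factors = 5, 4, 3

# Stand-ins for the learned embedding tables (random here, not trained).
# IDs are assumed to start at 1, so row 0 is unused; hence the +1.
user_table = rng.standard_normal((n_users + 1, n_latent_factors))
movie_table = rng.standard_normal((n_movies + 1, n_latent_factors))

def predict(user_ids, movie_ids):
    """Look up one row per ID and take the row-wise dot product."""
    u = user_table[user_ids]       # shape: (batch, n_latent_factors)
    m = movie_table[movie_ids]     # shape: (batch, n_latent_factors)
    return np.sum(u * m, axis=1)   # shape: (batch,)

batch_pred = predict(np.array([1, 2, 5]), np.array([4, 4, 1]))
print(batch_pred.shape)
```

Training adjusts the rows of both tables so that these dot products move closer to the observed ratings.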

Check the training result.

python



pd.Series(history.history['val_loss'][10:]).plot(logy=False)
plt.xlabel("Epoch")
plt.ylabel("Validation Error")
print('Minimum MSE: ', min(history.history['val_loss']))
#Minimum MSE:  0.7946764826774597

image.png

**With matrix factorization, the MSE is about 0.795, which is lower than the baseline of 1.055.** This shows that rating prediction by matrix factorization is working well.

In practice, a recommender system can be built on top of this by suggesting movies that a user has not rated yet but is predicted to rate highly.
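As a minimal sketch of that last step, the example below picks, for each user, the unrated movies with the highest predicted ratings. Both matrices are made-up stand-ins (in practice the predictions would come from the trained model), and the recommend helper is hypothetical.

```python
import numpy as np

# Hypothetical predicted ratings (rows: users, columns: movies).
predicted = np.array([
    [4.8, 3.1, 4.2, 2.0, 4.5],
    [2.2, 4.9, 3.3, 4.1, 1.5],
])
# Ratings the users have already given; 0 means "not rated yet".
observed = np.array([
    [5.0, 0.0, 0.0, 2.0, 0.0],
    [0.0, 5.0, 0.0, 0.0, 1.0],
])

def recommend(user, n=2):
    """Return the indices of the n unrated movies with the highest predictions."""
    scores = predicted[user].copy()
    scores[observed[user] > 0] = -np.inf   # exclude already-rated movies
    top = np.argsort(scores)[::-1][:n]     # sort descending, keep the first n
    return top.tolist()

print(recommend(0))  # [4, 2]
print(recommend(1))  # [3, 2]
```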

In closing

This time, I introduced a rating prediction system based on matrix factorization, using movie recommendation as the theme.

However, the true power of unsupervised learning lies in **unsupervised generative models**. I would like to keep studying and write those up as well.

Reference article

Matrix Factorization, Recommendations and Me https://qiita.com/michi_wkwk/items/52660778ad6a900965ee
