[PYTHON] A sample to try Factorization Machines quickly with fastFM

-- This is an introductory article for quickly trying out Factorization Machines, an algorithm that has been attracting attention in recommendation technology in recent years, using a library.
-- It is not a theoretical explanation; the goal is to get something running.
-- It basically follows the tutorial, with some supplementary notes.

Various reference materials

This time I used a library called fastFM. A paper describing the library and its performance is published on arXiv.

For those who want an overview of Factorization Machines and recent trends, there are reference articles both on and off Qiita, as well as material on fastFM itself. Roughly speaking, it is an algorithm that "uses matrix factorization to perform regression, classification, and ranking in a way that is robust to sparse data."
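
To make that one-line summary a little more concrete, here is a minimal sketch of how a second-order FM scores a single feature vector: a bias, a linear term, and a pairwise term whose weights come from dot products of low-rank latent vectors. It is plain Python for illustration only, not fastFM's implementation, and the names fm_predict, w0, w, and V as well as the toy values are assumptions of this sketch.

```python
from itertools import combinations


def fm_predict(x, w0, w, V):
    """Second-order FM score for one feature vector.

    x:  list of n feature values
    w0: global bias
    w:  list of n linear weights
    V:  list of n latent vectors, each of length k (the "rank")
    """
    # Linear part: sum of w[i] * x[i]
    linear = sum(w_i * x_i for w_i, x_i in zip(w, x))

    # Pairwise part: <V[i], V[j]> * x[i] * x[j] for every pair i < j
    pairwise = 0.0
    for i, j in combinations(range(len(x)), 2):
        dot = sum(a * b for a, b in zip(V[i], V[j]))
        pairwise += dot * x[i] * x[j]

    return w0 + linear + pairwise


# Hypothetical toy parameters (not learned from any data)
x = [1, 0, 1]
w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

score = fm_predict(x, w0, w, V)
print(score)
```

Because the weight of each pair is a dot product of latent vectors rather than a free parameter of its own, a pair of features that rarely appears together in sparse data still gets an interaction weight.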

Main part

Installation

There are some caveats about installation. In my own environment, the results were as follows.

For Python 3.6.10:

-- Can be installed with pip install fastFM

For Python 3.7.6:

-- pip install failed with an error.
-- It worked after installing from source on GitHub.

So, for a newer environment, the options are to install from source, to set up a tool such as pyenv that can run multiple Python versions and add a 3.6 series interpreter, or to create a 3.6 environment with Docker or similar.
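
Before going further, it can help to check which interpreter is actually running and whether fastFM can be found from it. The following is a small sketch using only the standard library; the helper name fastfm_status is an assumption of this example, and it only tests whether the package can be located, not whether it was built correctly.

```python
import importlib.util
import sys


def fastfm_status():
    """Return (python_version, fastFM_importable) for the current interpreter."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    # find_spec returns None when no top-level package of that name is installed
    available = importlib.util.find_spec("fastFM") is not None
    return version, available


version, available = fastfm_status()
print("Python", version, "- fastFM importable:", available)
```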

Sample data

In Factorization Machines (FM) examples, dictionary-style records are often used as sample data.

[
{user:A, item:X, ...},
{user:B, item:Y, ...}, ...
]

Like this. Sparse data is often supplied in this form, but this time I will assume a simple CSV.
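
For reference, the relationship between dictionary-style records and flag columns like the ones in the CSV can be sketched as below. This is plain Python for illustration, with a hypothetical helper one_hot_records and made-up records; in practice a library utility such as scikit-learn's DictVectorizer is the usual choice for this conversion.

```python
def one_hot_records(records):
    """Turn a list of dicts into (column_names, rows of 0/1 flags)."""
    # One column per distinct "key=value" pair, in sorted order
    columns = sorted({f"{k}={v}" for rec in records for k, v in rec.items()})
    index = {name: i for i, name in enumerate(columns)}

    rows = []
    for rec in records:
        row = [0] * len(columns)
        for k, v in rec.items():
            row[index[f"{k}={v}"]] = 1
        rows.append(row)
    return columns, rows


# Made-up records in the dictionary style shown above
records = [
    {"user": "A", "item": "X"},
    {"user": "B", "item": "Y"},
    {"user": "A", "item": "Y"},
]

columns, rows = one_hot_records(records)
print(columns)
for row in rows:
    print(row)
```

Most entries in each row are 0, which is why this kind of input is usually stored in a sparse format.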

Sample dummy data


category,rating,is_A,is_B,is_C,is_X,is_Y,is_Z
A,5,0,0,1,0,1,0
A,1,1,0,0,0,0,1
B,2,0,1,0,0,0,0
B,5,0,0,0,0,1,0
C,1,1,0,0,0,0,1
C,4,0,0,0,0,1,0
...

The full data is at the bottom of the article. The values are arbitrary dummies, and the columns are assumed to be:

-- category: a column with category information
-- rating: a column with a five-level rating from 1 to 5
-- is_?: columns containing flag information, for example a flag for having bought a product or a flag for a user attribute

Processing flow confirmation with regression analysis

The library itself is simple to use and will feel familiar to anyone who has used scikit-learn. First, let's build the processing flow with a plain linear regression. Details vary, but building a model generally goes as follows.

  1. Read data
  2. Preprocessing (convert data to a format that fits the model)
  3. Split into "training data" (and "validation data") and "test data"
  4. Create a model with training data
  5. Apply model to test data
  6. Define and evaluate performance evaluation indicators

This time, I will build a regression model that predicts rating. (For clarity, I import in each snippet, but you can put all imports at the top.)

Reading the data

import numpy as np
import pandas as pd

#read csv data
raw = pd.read_csv('fm_sample.csv')

#Separate target columns and other information
target_label = "rating"

data_y = raw[target_label]
data_X = raw.drop(target_label, axis=1)

Preprocessing and data split

# Preprocessing
## Use category_encoders, a convenient scikit-learn-compatible library for categorical data
import category_encoders as ce

## One-hot encode the specified column
enc = ce.OneHotEncoder(cols=['category'])

X = enc.fit_transform(data_X)
y = data_y

#Data split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=810)

Modeling and evaluation

The evaluation metric here is MAE (Mean Absolute Error), but MSE (Mean Squared Error) and others can be computed just as easily.

from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error

#Modeling
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

# Evaluation
## Error on the training data (how well the model fits)
print(mean_absolute_error(y_train, reg.predict(X_train)))
## Error on the test data
print(mean_absolute_error(y_test, reg.predict(X_test)))

It should look something like this. For a real evaluation I would do more, such as plotting the fitted regression and inspecting how the errors are distributed, but this time the goal is only to get it running.
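
As a reminder of what the two metrics measure, here is a hand-written sketch of MAE and MSE together with the raw residuals. The function names and the toy values are assumptions of this example; the scikit-learn functions imported above are what the article actually uses.

```python
def mean_absolute_error_py(y_true, y_pred):
    """Average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error_py(y_true, y_pred):
    """Average of (true - predicted) squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ratings and predictions
y_true = [5, 1, 2, 4]
y_pred = [4.0, 2.0, 2.0, 1.0]

# Residuals show where and in which direction the model is off
residuals = [t - p for t, p in zip(y_true, y_pred)]
mae = mean_absolute_error_py(y_true, y_pred)
mse = mean_squared_error_py(y_true, y_pred)
print(residuals, mae, mse)
```

The single large miss (a residual of 3) moves MSE much more than MAE, which is the practical difference between the two.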

Regression by FM

Once the flow above works, all that remains is to swap the modeling part for fastFM. One point to note: a DataFrame cannot be passed as is, so it is converted to a csr_matrix.

Modeling and evaluation

from fastFM import als
from scipy.sparse import csr_matrix

#Modeling
fm = als.FMRegression(n_iter=1000, init_stdev=0.1, rank=8, l2_reg_w=0.5, l2_reg_V=0.5, random_state=810)
fm.fit(csr_matrix(X_train), y_train)

# Evaluation
## Error on the training data (how well the model fits)
print(mean_absolute_error(y_train, fm.predict(csr_matrix(X_train))))
## Error on the test data
print(mean_absolute_error(y_test, fm.predict(csr_matrix(X_test))))

Sparse matrix, csr_matrix

This is where csr_matrix comes in. It is a format for sparse data. The idea is simple. Suppose a DataFrame or an ordinary matrix holds 2D data like this:

matrix


array([
  [0, 0, 1],
  [0, 0, 0],
  [0, 3, 0]
])

A sparse format stores only the entries that actually contain data:

Handling of sparse data


Size: 3 x 3
Where the data is:
(value 1 at position [0, 2])
(value 3 at position [2, 1])

That is the idea. There are several formats, such as csr_matrix, coo_matrix, csc_matrix, and lil_matrix, which differ in how they are handled and in processing speed, so if you are interested, search for "scipy sparse matrix" or see note.nkmk.me and similar sites.
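
The CSR (compressed sparse row) layout in particular can be sketched in a few lines of plain Python. This is only an illustration of the storage idea using the 3 x 3 example above, with hypothetical helper names dense_to_csr and csr_to_dense; scipy's csr_matrix exposes arrays with the same roles as its data, indices, and indptr attributes.

```python
def dense_to_csr(matrix):
    """Return (data, indices, indptr) for a list-of-lists matrix."""
    data, indices, indptr = [], [], [0]
    for row in matrix:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # its column index
        indptr.append(len(data))      # where the next row starts in data
    return data, indices, indptr


def csr_to_dense(data, indices, indptr, n_cols):
    """Rebuild the dense list-of-lists matrix from CSR arrays."""
    dense = []
    for r in range(len(indptr) - 1):
        row = [0] * n_cols
        for pos in range(indptr[r], indptr[r + 1]):
            row[indices[pos]] = data[pos]
        dense.append(row)
    return dense


matrix = [
    [0, 0, 1],
    [0, 0, 0],
    [0, 3, 0],
]

data, indices, indptr = dense_to_csr(matrix)
print(data, indices, indptr)
```

Only the two non-zero values are stored; the empty middle row shows up as two equal consecutive entries in indptr.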

What to remember this time

-- Converting from a DataFrame: csr_matrix(df)
-- Converting a csr_matrix back to a dense matrix with todense: csr_matrix(X_train).todense()

That should be enough for now.

Classification by FM

Binary classification is also possible, so I will try it. The task this time is to predict two classes: a rating of 4 or more, or a rating below 4. Continuing from the preprocessing and data split part, I create the labels for whether the rating is 4 or higher, then build and evaluate the model. Also note that fastFM classification expects labels of -1 or 1 rather than 0 or 1.

from fastFM import sgd
from sklearn.metrics import roc_auc_score

# Preprocessing, continued
## 1 if the rating is 4 or more, otherwise -1
y_ = np.array([1 if r > 3 else -1 for r in y])

##Creation of training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y_, random_state=810)

#Modeling
fm = sgd.FMClassification(n_iter=5000, init_stdev=0.1, l2_reg_w=0,
                          l2_reg_V=0, rank=2, step_size=0.1)
fm.fit(csr_matrix(X_train), y_train)

## Two kinds of predictions are available: class labels and probabilities
y_pred = fm.predict(csr_matrix(X_test))
y_pred_proba = fm.predict_proba(csr_matrix(X_test))

#Evaluation
##Example of evaluating the value of AUC
roc_auc_score(y_test, y_pred_proba)

Draw a ROC curve

I will draw the ROC curve, referring to the note.nkmk.me page.

import matplotlib.pyplot as plt
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba, drop_intermediate=False)

auc = metrics.auc(fpr, tpr)

#Plot ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %.2f)'%auc)
plt.legend()
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)
plt.show()

[Figure: the resulting ROC curve]

The curve was drawn successfully.
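
To get a feel for what the AUC value means, it can also be computed by direct pair counting: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below is plain Python with a hypothetical helper auc_by_pairs and made-up scores; it is meant as an illustration, not a replacement for roc_auc_score.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t != 1]

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels (1 / -1, as in fastFM) and predicted scores
y_true = [1, -1, 1, -1, -1]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]

auc_value = auc_by_pairs(y_true, scores)
print(auc_value)
```

Here 5 of the 6 positive/negative pairs are ordered correctly, so the value is 5/6.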

Other

The sample data is pasted below. It was made up arbitrarily, so it is not interesting data. Use it to check that the code runs.

fm_sample.csv


category,rating,is_A,is_B,is_C,is_X,is_Y,is_Z
A,5,0,0,1,0,1,0
A,1,1,0,0,0,0,1
A,3,0,0,1,0,1,0
A,2,1,0,0,0,0,1
A,4,0,0,0,0,0,1
A,5,1,0,0,1,1,0
A,1,0,1,0,0,0,1
A,2,0,0,0,0,0,1
B,2,0,1,0,0,0,0
B,5,0,0,0,0,1,0
B,3,1,1,0,0,1,0
B,2,0,0,1,0,0,0
B,1,0,0,0,0,0,1
B,3,0,0,1,0,0,1
B,4,0,1,0,0,0,0
B,1,0,0,0,0,0,1
B,2,0,1,0,0,0,1
C,1,1,0,0,0,0,1
C,4,0,0,0,0,1,0
C,2,1,0,1,0,1,0
C,4,0,0,0,0,0,0
C,5,0,0,1,1,1,0
C,2,0,1,0,0,0,1
C,5,1,0,0,0,1,0
C,3,0,0,1,1,1,0
C,2,0,0,0,0,0,1
C,3,0,0,0,0,1,0
A,2,0,0,0,0,0,1
A,4,1,0,0,0,1,0
A,3,0,0,0,0,0,0
A,1,0,0,0,0,0,1
A,3,1,0,0,0,0,0
A,4,0,0,1,0,1,0
A,5,1,1,0,0,1,0
A,3,1,0,0,1,0,0
B,4,0,0,0,0,1,0
B,1,0,0,0,0,0,1
B,5,0,0,0,0,1,0
B,3,0,0,0,0,0,0
B,1,0,0,0,1,0,1
B,3,0,0,1,0,0,0
B,2,0,1,0,0,0,1
B,5,1,0,0,0,1,0
B,4,0,0,0,1,1,1
C,1,0,0,0,0,0,0
C,2,0,0,0,0,0,1
C,3,0,0,1,0,0,0
C,4,0,1,0,0,1,0
C,1,0,0,1,0,0,1
C,1,0,0,0,0,0,0
C,3,0,0,1,0,0,0
C,3,0,0,1,0,1,0
C,5,0,0,0,1,1,0
C,3,0,0,1,0,1,0
