[PYTHON] I tried to implement a recommendation system (content-based filtering)

Introduction

Among the types of recommendation systems, this post implements content-based filtering, which makes recommendations based only on item characteristics. (See this article for the types of recommendation systems.)

Content-based filtering

Content-based filtering is a method of making recommendations based on item characteristics. It computes and presents the items that are most similar to the items in the user's browsing or purchase history.
The implementation follows these steps.

  1. Extract the feature vector of the item
  2. Calculate the similarity between items
  3. Recommend items with high similarity

Item vectorization

To calculate similarity, the item features (words and sentences) are first converted into feature vectors. There are several vectorization methods, such as one-hot encoding and TF-IDF. Because the item features here are word data, this post uses a one-hot representation.
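
As a minimal sketch of what such a feature vector looks like, assume a hypothetical vocabulary of four genre words (the names `vocab` and `encode` are made up for this illustration and are not part of the implementation later in the post). Each item becomes a 0/1 vector with a 1 at the position of every word it has.

```python
import numpy as np

# Hypothetical vocabulary of feature words (not taken from the dataset)
vocab = ["Action", "Drama", "Romance", "School"]

def encode(words, vocab):
    """Return a 0/1 vector with a 1 at the position of every word in `words`."""
    vec = np.zeros(len(vocab), dtype=int)
    for w in words:
        vec[vocab.index(w)] = 1
    return vec

# An item described by the words "Drama" and "School"
item_vec = encode(["Drama", "School"], vocab)
print(item_vec)
```

Because an item can have several words at once, the vector may contain more than one 1.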

Calculation of similarity

Once the items are vectorized, the next step is to calculate the similarity. There are several similarity measures, but this post uses the commonly used cosine similarity.

$$\cos(x, y) = \frac{x \cdot y}{|x| |y|}$$

For example, suppose the feature vectors of items $x$ and $y$ are as follows.

$x = (0,1,1,0,0,1,0,1)$
$y = (0,1,0,0,0,1,0,0)$

The cosine similarity is then calculated like this.

$x \cdot y = 1 + 1 = 2$
$|x| = \sqrt{1+1+1+1} = 2$
$|y| = \sqrt{1+1} = \sqrt{2}$
$\cos(x, y) = \frac{2}{2 \sqrt{2}} \approx 0.70711$
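
The formula can be checked numerically. The following is a minimal sketch using NumPy; the helper name `cosine_similarity` is chosen for this illustration, and `x` and `y` are the two vectors written above.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = [0, 1, 1, 0, 0, 1, 0, 1]
y = [0, 1, 0, 0, 0, 1, 0, 0]
print(round(cosine_similarity(x, y), 5))
```

For 0/1 vectors the dot product is simply the number of positions where both vectors have a 1, and each norm is the square root of the number of 1s in that vector.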

Implementation

Let's implement it using data from a Kaggle competition. First, check the data. (Only the genre column is used to compute the item features, so the type, episodes, rating, and members columns are dropped to make the output easier to read. The code also runs without dropping them.)

code


import pandas as pd
import numpy as np

# Read the data
anime_data = pd.read_csv("anime.csv")

# Check the number of rows
print("The number of data:", len(anime_data.anime_id))

# Drop the columns that are not used
anime_data = anime_data.drop(columns=['type', 'episodes', 'rating', 'members'])

# Check the contents of the data
anime_data.head()

The data is as follows.

Execution result


The number of data: 12294

   anime_id  name                              genre
0  32281     Kimi no Na wa.                    Drama, Romance, School, Supernatural
1  5114      Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Mili...
2  28977     Gintama                           Action, Comedy, Historical, Parody, Samurai, S...
3  9253      Steins;Gate                       SciFi, Thriller
4  9969      Gintama'                          Action, Comedy, Historical, Parody, Samurai, S...

Next, vectorize the items. The genre data consists of words, so one-hot encoding is used to turn it into feature vectors. Because the genre column of anime_data contains comma-separated genre names, the following code builds the list of genre names that will become the columns. Every genre is added to genre_col, and set() then removes the duplicates.

code


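# Split each genre string on commas into a list of genre names
genres = anime_data['genre'].map(lambda x: x.split(',')).to_list()

# Collect every genre name, then remove duplicates with set()
genre_col = list()
for i in genres:
    genre_col.extend(i)
genre_col = list(set(genre_col))

# Check the number of genre name columns
print(len(genre_col))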

Execution result


#Genre name column length
83
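
One thing to keep in mind: splitting on ',' alone keeps the space that follows each comma, so ' Romance' and 'Romance' are treated as different names, which may inflate the number of columns. The sketch below uses two hypothetical genre strings (not read from anime.csv) to show the effect and one possible way to strip the whitespace. It is an illustration only and does not change the results shown in this post.

```python
# Two hypothetical genre strings in the same comma-separated format
sample = ["Drama, Romance", "Romance, School"]

# Splitting on ',' keeps the leading space of every genre after the first
raw = {g for s in sample for g in s.split(',')}

# Stripping each piece merges ' Romance' and 'Romance' into one name
clean = {g.strip() for s in sample for g in s.split(',')}

print(sorted(raw))
print(sorted(clean))
```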

Use the genre name list to build the one-hot representation of each item's genres. Each row is built as row_list and appended to rows. Finally, a DataFrame is created and stored in genre_df.

code


# One-hot encoding
rows = list()
for row in genres:
    # Start from an all-zero vector with one slot per genre name
    row_list = np.zeros(len(genre_col), dtype=int)
    # Set the positions of this item's genres to 1
    index_list = [genre_col.index(item) for item in row]
    row_list[index_list] = 1
    rows.append(list(row_list))
genre_df = pd.DataFrame(rows, columns=genre_col)

# Join the one-hot columns to the original columns
one_hot_data = pd.concat([anime_data, genre_df], axis=1)

Printing one_hot_data, which keeps the anime id and name next to the one-hot columns for readability, gives the following.

   anime_id  name                 Kids  Game  Psychological  Fantasy  Space  School
0  32281     Kimi no Na wa.       0     0     0              0        0      0
1  5114      Fullmetal Alchemist  0     0     1              0        0      0
2  28977     Gintama              0     0     0              0        0      0

(12294×86)
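
As an alternative sketch, pandas can build the same kind of 0/1 table with `Series.str.get_dummies`. This is not the method used in this post. The tiny DataFrame `toy` below is hypothetical, and the separator `", "` assumes that genres are separated by a comma followed by exactly one space.

```python
import pandas as pd

# Hypothetical miniature of anime_data
toy = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "genre": ["Drama, Romance", "Action, Drama", "Romance"],
})

# One 0/1 column per distinct genre name, columns sorted alphabetically
toy_genres = toy["genre"].str.get_dummies(sep=", ")

# Put the id column back in front of the one-hot columns
toy_one_hot = pd.concat([toy[["anime_id"]], toy_genres], axis=1)
print(toy_one_hot)
```

The hand-written loop makes each step visible, while the pandas call is shorter once the separator is known.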


This feature vector is used to calculate the cosine similarity. Create a similarity matrix with the following code.

code


# Create an array from the one-hot part
item_vectors = np.array(one_hot_data[genre_col])

# Norm of each row vector
norm = np.linalg.norm(item_vectors, axis=1)

# Create the similarity matrix with the cosine similarity formula:
# dot products of all row pairs divided by the products of their norms
sim_mat = np.dot(item_vectors, item_vectors.T) / np.outer(norm, norm)
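
To see what the resulting matrix looks like, here is a minimal sketch on three hypothetical item vectors (`toy_vectors` is made up for this illustration). Entry (i, j) of the result is the cosine similarity between item i and item j, so the matrix is symmetric and its diagonal is 1.

```python
import numpy as np

# Hypothetical 0/1 item vectors (3 items, 4 features)
toy_vectors = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])

# Norm of each row
norms = np.linalg.norm(toy_vectors, axis=1)

# Entry (i, j) is dot(v_i, v_j) / (|v_i| * |v_j|)
toy_sim = (toy_vectors @ toy_vectors.T) / np.outer(norms, norms)
print(toy_sim.round(3))
```

Note that a row consisting only of zeros would have norm 0 and cause a division by zero, so this sketch assumes every item has at least one feature.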

Left as it is, sim_mat does not tell you which row is which anime, so create a dictionary that maps each anime_id to its row index.

code


itemindex = dict()
for num, item_id in enumerate(one_hot_data.anime_id):
    itemindex[item_id] = num
itemindex

Execution result


{32281: 0,
 5114: 1,
 28977: 2,
 9253: 3,
 9969: 4,
 32935: 5,
 11061: 6,

Now let's take the most similar items out of the similarity matrix. Here, the 10 items most similar to "Kimi no Na wa." (Your Name, anime_id: 32281) are displayed.

code


# Look up the index for the given anime_id and store it in row_num
row_num = itemindex[32281]

# Sort row row_num of the similarity matrix in descending order,
# skip the first entry, and take the next 10
top10_index = np.argsort(sim_mat[row_num])[::-1][1:11]

top10_index

Execution result


array([6394, 5805,  208, 1959,  504, 1494, 2300, 1201, 5127, 1436])
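
Note that the slice `[1:11]` assumes the item itself is sorted into first place. When other items have exactly the same similarity (for example 1.0 for identical genres), that order is not guaranteed. The following is a minimal sketch that excludes the item by its index instead; the helper `top_n_similar` and the matrix `toy_sim` are hypothetical and only illustrate the idea.

```python
import numpy as np

def top_n_similar(sim, row, n):
    """Indices of the n most similar items to `row`, never including `row` itself."""
    scores = sim[row].astype(float)  # astype returns a copy, so sim is not modified
    scores[row] = -np.inf            # push the item itself to the end
    # Sort in descending order; a stable sort keeps ties in index order
    order = np.argsort(-scores, kind="stable")
    return order[:n]

# Hypothetical similarity matrix: items 0 and 1 are identical
toy_sim = np.array([
    [1.0, 1.0, 0.5, 0.0],
    [1.0, 1.0, 0.5, 0.0],
    [0.5, 0.5, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
])
print(top_n_similar(toy_sim, 0, 2))
```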

Look up the anime_id that corresponds to each index in top10_index.

code


rec_id = list()
for search_index in top10_index:
    for anime_id, index in itemindex.items():
        if index == search_index:
            rec_id.append(anime_id)
rec_id

Execution result


[546, 547, 28725, 713, 6351, 20903, 12175, 10067, 1607, 8481]
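
Because itemindex was built by enumerating the anime_id column in order, position i of that column is the id for index i, so the ids can also be obtained by direct indexing instead of a nested loop. A minimal sketch with hypothetical ids (`toy_ids`, `toy_index`, and `toy_top` are made up for this illustration):

```python
import numpy as np

# Hypothetical ids standing in for one_hot_data.anime_id
toy_ids = np.array([32281, 5114, 28977, 9253])

# Same construction as itemindex: id -> row index
toy_index = {item_id: num for num, item_id in enumerate(toy_ids)}

# Indices in the form returned by np.argsort
toy_top = np.array([3, 1])

# Direct lookup; the order of toy_top is preserved
toy_rec_id = toy_ids[toy_top].tolist()
print(toy_rec_id)
```

With the variables in this post, the equivalent lookup would be `one_hot_data.anime_id.to_numpy()[top10_index]`.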

Let's display items with high similarity from the obtained anime_id.

code


anime_data.query("anime_id == [546, 547, 28725, 713, 6351, 20903, 12175, 10067, 1607, 8481] ")

Execution result


      anime_id  name                                         genre
208   28725     Kokoro ga Sakebitagatterunda.                Drama, Romance, School
504   6351      Clannad: After Story - Mou Hitotsu no Sekai  Drama, Romance, School
1201  10067     Angel Beats!: Another Epilogue               Drama, School, Supernatural
1436  8481      "Bungaku Shoujo" Memoire                     Drama, Romance, School
1494  20903     Harmonie                                     Drama, School, Supernatural
1959  713       Air Movie                                    Drama, Romance, Supernatural
2300  12175     Koi to Senkyo to Chocolate                   Drama, Romance, School
5127  1607      Venus Versus Virus                           Drama, Romance, Supernatural
5805  547       Wind: A Breath of Heart OVA                  Drama, Romance, School, Supernatural
6394  546       Wind: A Breath of Heart (TV)                 Drama, Romance, School, Supernatural

Supplement

I wrote the one-hot encoding by hand this time, but it is easier with Category Encoders. This article was easy to understand, so I recommend it.

Finally

There are other ways to vectorize items and calculate similarity, and other ways to write the code, so I will try a different approach next time. This post was based on this article. It also has many easy-to-understand articles on topics other than recommendation.
