Introduction

Among the types of recommendation systems, this article implements content-based filtering, which makes recommendations based only on item characteristics. (See this article for the types of recommendation systems.)

Content-based filtering

Content-based filtering is a method of making recommendations based on item characteristics. It calculates the similarity between items and presents those most similar to the items in the user's browsing or purchase history.
The implementation follows these steps.

Extract the feature vector of each item
Calculate the similarity between items
Recommend items with high similarity

Item vectorization

To calculate the similarity, first convert the item features (words and sentences) into feature vectors. There are several vectorization methods, such as One-Hot Encoding and TF-IDF. This time the item features are word data, so we will use the one-hot representation.
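As a small illustration of the one-hot representation, the sketch below encodes three items described by genre words. The item names, genres, and the `one_hot` helper are made up for this example and do not come from the dataset used later.

```python
# Toy example: one-hot encode items described by genre words.
# The item names and genres here are made up for illustration.
items = {
    "item_a": ["Drama", "Romance"],
    "item_b": ["Action", "Drama"],
    "item_c": ["Comedy"],
}

# Fixed, sorted vocabulary so every item uses the same column order.
vocabulary = sorted({genre for genres in items.values() for genre in genres})


def one_hot(genres, vocabulary):
    """Return a 0/1 list with a 1 for each vocabulary word present in genres."""
    present = set(genres)
    return [1 if word in present else 0 for word in vocabulary]


vectors = {name: one_hot(genres, vocabulary) for name, genres in items.items()}

print(vocabulary)
for name, vector in vectors.items():
    print(name, vector)
```

Each position in a vector stands for one genre, so two items that share a genre both have a 1 in the same position.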

Calculation of similarity

Once the items are vectorized, the next step is to calculate the similarity. There are several similarity measures, but this time we will use cosine similarity, which is one of the most common.

$$\cos(x, y) = \frac{x \cdot y}{|x| |y|}$$

For example, suppose items $x$ and $y$ have the following feature vectors.

$x = (0,1,1,0,0,1,0,1)$
$y = (0,1,0,0,0,1,0,0)$

The cosine similarity can then be calculated like this.

$x \cdot y = 1 + 1 = 2$
$|x| = \sqrt{1+1+1+1} = 2$
$|y| = \sqrt{1+1} = \sqrt{2}$
$\cos(x, y) = \frac{2}{2 \cdot \sqrt{2}} \approx 0.70711$
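The calculation for the two example vectors can be checked with NumPy. This is a minimal sketch; `cosine_similarity` is a helper name chosen here for illustration, not a function from a library.

```python
import numpy as np


def cosine_similarity(x, y):
    """Cosine similarity of two 1-D vectors: dot product over the product of norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


x = [0, 1, 1, 0, 0, 1, 0, 1]
y = [0, 1, 0, 0, 0, 1, 0, 0]

print(np.dot(x, y))             # dot product of x and y
print(np.linalg.norm(x))        # |x|
print(np.linalg.norm(y))        # |y|
print(cosine_similarity(x, y))  # cos(x, y)
```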

Implementation

Let's implement this using data from a Kaggle competition. First, check the data. (This time only the genre column is used as the item feature, so the type, episodes, rating, and members columns are dropped to make the output easier to read. The code also runs without dropping them.)

`code`


import pandas as pd
import numpy as np

# Read the data
anime_data = pd.read_csv("anime.csv")

# Check the number of rows
print("The number of data:", len(anime_data))

# Drop the columns that are not used
anime_data = anime_data.drop(columns=['type', 'episodes', 'rating', 'members'])

# Check the contents of the data
anime_data.head()

The data is as follows.

`Execution result`


The number of data: 12294

   anime_id  name                              genre
0  32281     Kimi no Na wa.                    Drama, Romance, School, Supernatural
1  5114      Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Mili...
2  28977     Gintama                           Action, Comedy, Historical, Parody, Samurai, S...
3  9253      Steins;Gate                       SciFi, Thriller
4  9969      Gintama&#039;                     Action, Comedy, Historical, Parody, Samurai, S...

Next, vectorize the items. Since the genre data consists of words, use One-Hot Encoding to turn it into feature vectors. The genre column of anime_data contains comma-separated genre names, so first build a list of genre names with the following code. Every genre is appended to genre_col, and then set() removes the duplicate elements.

`code`


genres = anime_data['genre'].map(lambda x: x.split(',')).to_list()
genre_col = list()
for i in genres:
    genre_col.extend(i)
genre_col = list(set(genre_col))

# Check the number of genre names
print(len(genre_col))

`Execution result`


#Genre name column length
83
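One detail of the code above is worth checking: `split(',')` keeps the space that follows each comma, so a name such as `' Romance'` and `'Romance'` can end up as two separate entries, and the count of 83 may include such pairs. The sketch below uses a made-up two-row Series to show the effect, together with a variant that strips the whitespace. Applying the variant to the real data would change the number of columns and possibly the results shown in this article, so treat it as an option and not as what was run here.

```python
import pandas as pd

# Made-up genre strings in the same comma-separated format as anime.csv.
sample = pd.Series(["Drama, Romance", "Romance, School"])

# Splitting on ',' alone keeps the leading space of every later genre.
raw = sample.map(lambda x: x.split(',')).to_list()
raw_col = sorted({g for row in raw for g in row})

# Stripping each piece merges ' Romance' and 'Romance' into one name.
clean = sample.map(lambda x: [g.strip() for g in x.split(',')]).to_list()
clean_col = sorted({g for row in clean for g in row})

print(raw_col)
print(clean_col)
```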

Use the genre name list to build the one-hot representation of each item's genres. Each row is built as row_list and appended to rows. Finally, create a DataFrame and store it in genre_df.

`code`


# One-Hot Encoding
rows = list()
for row in genres:
    # Start from an all-zero vector with one slot per genre name
    row_list = np.zeros(len(genre_col), dtype=int)
    # Set the slots of the genres that this item has to 1
    index_list = [genre_col.index(item) for item in row]
    row_list[index_list] = 1
    rows.append(list(row_list))
genre_df = pd.DataFrame(rows, columns=genre_col)
one_hot_data = pd.concat([anime_data, genre_df], axis=1)

The anime id and name are concatenated with the one-hot columns to make the result easier to read. Printing one_hot_data looks like this.

   anime_id  name                 Psychological
0  32281     Kimi no Na wa.       0
1  5114      Fullmetal Alchemist  1
2  28977     Gintama              0

(12294 rows × 86 columns)

These feature vectors are used to calculate the cosine similarity. Create a similarity matrix with the following code.

`code`


# Create an array from the one-hot encoded columns
item_vectors = np.array(one_hot_data[genre_col])

# Norm of each row vector
norm = np.linalg.norm(item_vectors, axis=1)

# Create the similarity matrix using the cosine similarity formula:
# pairwise dot products divided by the pairwise products of the norms
sim_mat = np.dot(item_vectors, item_vectors.T) / np.outer(norm, norm)
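To see what the similarity matrix contains, here is a minimal sketch on made-up one-hot vectors for three items. It uses an equivalent formulation (normalize each row to unit length, then take dot products) instead of the exact expression above, and it assumes no row is all zeros, since that would divide by zero.

```python
import numpy as np

# Made-up one-hot vectors for three items (rows) over four genres (columns).
item_vectors = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
])

# Length of each row vector, kept as a column so it broadcasts across the row.
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)

# Dividing each row by its norm gives unit vectors; their dot products
# are the cosine similarities.
unit_vectors = item_vectors / norms
sim_mat = unit_vectors @ unit_vectors.T

print(np.round(sim_mat, 3))
```

The matrix is symmetric with 1.0 on the diagonal, and entry (i, j) is the cosine similarity between item i and item j. Items 0 and 2 have identical vectors, so their similarity is also 1.0.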

Left as it is, sim_mat gives no indication of which row corresponds to which anime, so create a key-value lookup table from anime_id to row index.

`code`


itemindex = dict()
for num, item_id in enumerate(one_hot_data.anime_id):
    itemindex[item_id] = num
itemindex

`Execution result`


{32281: 0,
 5114: 1,
 28977: 2,
 9253: 3,
 9969: 4,
 32935: 5,
 11061: 6,

Now let's take items with high similarity out of the similarity matrix. Here, the 10 items most similar to "Kimi no Na wa." (Your Name, anime_id: 32281) are displayed.

`code`


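# Look up the row index for the given anime_id and store it in row_num
row_num = itemindex[32281]

# Sort row row_num of the similarity matrix in descending order and take the top 10,
# skipping the first entry (expected to be the item itself)
top10_index = np.argsort(sim_mat[row_num])[::-1][1:11]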

top10_index

`Execution result`


array([6394, 5805,  208, 1959,  504, 1494, 2300, 1201, 5127, 1436])
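The `[1:11]` slice relies on the item itself coming first after sorting. When other items have exactly the same feature vector, they also have a similarity of 1.0, and the order among such ties is not guaranteed. A variant that excludes the query row explicitly might look like the sketch below. `top_n_similar` is a hypothetical helper and the matrix is made up for illustration.

```python
import numpy as np


def top_n_similar(sim_mat, row_num, n=10):
    """Return indices of the n rows most similar to row_num, excluding row_num itself."""
    scores = sim_mat[row_num].astype(float)  # astype returns a copy
    scores[row_num] = -np.inf                # the item never recommends itself
    # A stable sort on the negated scores keeps lower indices first among ties.
    order = np.argsort(-scores, kind="stable")
    return order[:n]


# Made-up 4x4 similarity matrix; rows 0 and 2 are identical items.
sim_mat = np.array([
    [1.0, 0.5, 1.0, 0.0],
    [0.5, 1.0, 0.5, 0.2],
    [1.0, 0.5, 1.0, 0.0],
    [0.0, 0.2, 0.0, 1.0],
])

print(top_n_similar(sim_mat, row_num=2, n=2))
```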

Look up the anime_id that corresponds to each index in top10_index.

`code`


# Invert the lookup table so that each row index maps back to its anime_id
index_to_id = {index: anime_id for anime_id, index in itemindex.items()}
rec_id = [index_to_id[index] for index in top10_index]
rec_id

`Execution result`


[546, 547, 28725, 713, 6351, 20903, 12175, 10067, 1607, 8481]

Finally, display the highly similar items for the obtained anime_ids. Note that the result is listed in index order, not in order of similarity.

`code`


anime_data.query("anime_id == [546, 547, 28725, 713, 6351, 20903, 12175, 10067, 1607, 8481] ")

	anime_id	name	genre
208	28725	Kokoro ga Sakebitagatterunda.	Drama, Romance, School
504	6351	Clannad: After Story - Mou Hitotsu no Sekai	Drama, Romance, School
1201	10067	Angel Beats!: Another Epilogue	Drama, School, Supernatural
1436	8481	"Bungaku Shoujo" Memoire	Drama, Romance, School
1494	20903	Harmonie	Drama, School, Supernatural
1959	713	Air Movie	Drama, Romance, Supernatural
2300	12175	Koi to Senkyo to Chocolate	Drama, Romance, School
5127	1607	Venus Versus Virus	Drama, Romance, Supernatural
5805	547	Wind: A Breath of Heart OVA	Drama, Romance, School, Supernatural
6394	546	Wind: A Breath of Heart (TV)	Drama, Romance, School, Supernatural
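If the recommendations should appear ranked by similarity, with the score shown next to each item, one option is to select rows by position with `iloc`. The sketch below uses a made-up DataFrame and a made-up similarity row as stand-ins for anime_data and one row of sim_mat.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for anime_data and one row of sim_mat.
anime_data = pd.DataFrame({
    "anime_id": [101, 102, 103, 104],
    "name": ["A", "B", "C", "D"],
})
similarity_row = np.array([1.0, 0.3, 0.9, 0.6])  # similarity of item 0 to every item

# Positions of the 2 most similar items, skipping position 0 (the item itself).
top_index = [i for i in np.argsort(-similarity_row, kind="stable") if i != 0][:2]

# iloc keeps the order of top_index, so the rows come out ranked by similarity.
recommended = anime_data.iloc[top_index].assign(similarity=similarity_row[top_index])
print(recommended)
```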

Supplement

I implemented One-Hot Encoding by hand this time, but it is easier with Category Encoders. This article was easy to understand, so I would like to introduce it.
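For comma-separated strings like the genre column, pandas also offers `Series.str.get_dummies`, which may be enough without an extra library. The following is a minimal sketch on made-up genre strings. It normalizes the separator first, so its column names would differ from the genre_col built by hand above.

```python
import pandas as pd

# Made-up genre strings in the same comma-separated format as anime.csv.
genre = pd.Series(["Drama, Romance", "Action, Drama", "Comedy"])

# Normalize ", " to "," so no genre name keeps a leading space,
# then let pandas build one 0/1 column per genre.
genre_df = genre.str.replace(r"\s*,\s*", ",", regex=True).str.get_dummies(sep=",")

print(genre_df)
```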

Finally

There are other ways to vectorize items and calculate similarity, as well as other ways to write the code, so I would like to try those next. This time, I referred to this article. Beyond recommendation, there were many other easy-to-understand articles there as well.

[PYTHON] I tried to implement a recommendation system (content-based filtering)
