Introduction

Let's implement content-based filtering for a recommendation system using TF-IDF. In the previous article, One-Hot Encoding was used to calculate similarity from anime genre data. Please see that article if you are interested.
This time, we will implement the following steps.

Vectorize movie outline (text) with TF-IDF
Calculate similarity using cosine similarity
Recommend items with high similarity

We will implement it easily using scikit-learn.

TF-IDF

TF-IDF is one method of vectorizing items, like the One-Hot Encoding used last time. This time, I will use TF-IDF to vectorize the outline of each item (the text part).

TF-IDF is calculated as the product of TF (Term Frequency) and IDF (Inverse Document Frequency). It is expressed by the following formulas. The rarer a word is across sentences, the more important it is considered in expressing the characteristics of a sentence.

$ IDF = \log \frac{\text{total number of sentences}}{\text{number of sentences including word X}} $

$ TF = \frac{\text{frequency of occurrence of word X in sentence A}}{\text{sum of frequency of occurrence of all words in sentence A}} $

$ TFIDF = TF \cdot IDF $
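As a small worked example of the formulas above, the following sketch computes TF, IDF, and their product by hand in plain Python. The three-sentence corpus and the helper names `tf`, `idf`, and `tfidf` are made up for illustration. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its values will differ from this textbook version.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration); each string is one sentence.
docs = [
    "toy cowboy toy",
    "cowboy camp",
    "space ranger",
]
tokenized = [d.split() for d in docs]

def tf(word, tokens):
    # Occurrences of the word divided by the total number of words in the sentence.
    return Counter(tokens)[word] / len(tokens)

def idf(word, corpus):
    # log(total sentences / sentences containing the word).
    # Assumes the word appears in at least one sentence (otherwise division by zero).
    n_containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / n_containing)

def tfidf(word, tokens, corpus):
    return tf(word, tokens) * idf(word, corpus)

# "toy" appears only in sentence 0, "cowboy" appears in two sentences.
toy_score = tfidf("toy", tokenized[0], tokenized)
cowboy_score = tfidf("cowboy", tokenized[0], tokenized)
print(toy_score, cowboy_score)
```

In sentence 0, "toy" is both more frequent and rarer across the corpus than "cowboy", so it receives the higher score.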

For a detailed explanation of TF-IDF, I recommend this article.

Implementation

This time, we will implement it using this data. First, check the data. (Since there are many columns, only the columns used this time are extracted.)

`code`


import pandas as pd
import numpy as np

#Data reading
movies = pd.read_csv("movies_metadata.csv")

#Extract only the required columns
movies = movies[['id', 'original_title', 'overview']]

#Check the length of the data
print('The number of data:',len(movies.id))

movies.head()

`Execution result`


The number of data: 45466

	id	  original_title	               overview
0	862	    Toy Story	                Led by Woody, Andys toys live happily in his ...
1	8844	Jumanji    	                When siblings Judy and Peter discover an encha...  
2	15602	Grumpier Old Men	        A family wedding reignites the ancient feud be...  
3	31357	Waiting to Exhale	        Cheated on, mistreated and stepped on, the wom...
4	11862	Father of the Bride Part II	Just when George Banks has recovered from his ...

Since there are missing values in the overview column, delete the rows containing them. Please note that the purpose this time is just to try TF-IDF, so the preprocessing is deliberately rough.

`code`


#Confirmation of missing values
movies.isnull().sum()

#Delete rows with missing values
movies = movies.dropna(how='any')
print('The number of data:',len(movies.id))

`Execution result`


#Missing value
id                  0
original_title      0
overview          954
dtype: int64


The number of data: 44512
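The same cleanup can be tried on a small frame. This is a minimal sketch with made-up rows. The `subset` argument and the `reset_index` call are my additions, not part of the code above: `subset` restricts the check to the overview column, and resetting the index makes the DataFrame index line up with the row positions used later.

```python
import pandas as pd

# Toy frame with made-up rows; None marks a missing overview.
toy = pd.DataFrame({
    "id": ["1", "2", "3"],
    "original_title": ["A", "B", "C"],
    "overview": ["A story about toys.", None, "A board game comes alive."],
})

# Count missing values per column.
print(toy.isnull().sum())

# Drop only rows whose overview is missing, then renumber the index from 0.
cleaned = toy.dropna(subset=["overview"]).reset_index(drop=True)
print(len(cleaned))
```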

Next, we will actually vectorize the item with TF-IDF. TF-IDF calculations are easily implemented using scikit-learn's TfidfVectorizer.

`code`


#Vectorize the overview column with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(movies['overview'])

#Check the size of the created matrix
tfidf_matrix.shape

`Execution result`


(44512, 76132)
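To make the shape of this matrix concrete, here is a minimal sketch on three made-up overviews. The variable names are hypothetical. With default settings, TfidfVectorizer lowercases the text and ignores single-character tokens, so "Andy's" contributes the word "andy".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up overviews for illustration.
toy_overviews = [
    "Andy's toys live happily in his room",
    "Two siblings discover an enchanted board game",
    "Andy heads off to cowboy camp",
]

vectorizer = TfidfVectorizer()
toy_matrix = vectorizer.fit_transform(toy_overviews)

# Rows are documents, columns are the distinct words found in the corpus.
print(toy_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

The number of columns always equals the size of the learned vocabulary, which is why the real matrix has one column per distinct word in all overviews.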

The matrix has 44,512 rows (one per movie) and 76,132 columns (one per distinct word). Next, the similarity matrix is also calculated using scikit-learn. The explanation of cosine similarity is omitted because it was covered in the previous article.

`code`


#Create a similarity matrix
from sklearn.metrics.pairwise import pairwise_distances

cosine_sim = 1 - pairwise_distances(tfidf_matrix, metric = 'cosine')
cosine_sim

The result is a similarity matrix whose diagonal entries are all 1, because each movie is identical to itself.

`Execution result`


[[1.        , 0.0306972 , 0.01283222, ..., 0.00942304, 0.03492933, 0.01737238],
 [0.0306972 , 1.        , 0.05674315, ..., 0.00944854, 0.06448034, 0.03307954],
 [0.01283222, 0.05674315, 1.        , ..., 0.01271578, 0.05854697, 0.02767532],
 ...,
 [0.00942304, 0.00944854, 0.01271578, ..., 1.        , 0.02566797, 0.01480498],
 [0.03492933, 0.06448034, 0.05854697, ..., 0.02566797, 1.        , 0.0590926 ],
 [0.01737238, 0.03307954, 0.02767532, ..., 0.01480498, 0.0590926 , 1.        ]]
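For readers who want to see what the similarity matrix contains, the following is a minimal NumPy sketch of cosine similarity computed by hand. The vectors are made-up numbers standing in for TF-IDF rows, and the names `vectors`, `unit`, and `toy_sim` are hypothetical.

```python
import numpy as np

# Toy vectors (made-up numbers), one row per item.
vectors = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],  # same direction as row 0
    [0.0, 0.0, 3.0],  # shares no terms with row 0
])

# Scale each row to unit length; cosine similarity is then a plain dot product.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / norms
toy_sim = unit @ unit.T

print(np.round(toy_sim, 3))
```

Rows pointing in the same direction get a similarity of 1, rows with no terms in common get 0, and the matrix is symmetric.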

As it stands, it is hard to tell which movie is in which row, so create a key-value correspondence table between the row index and the movie id. Then, search for items with high similarity starting from a movie id. Here, we will extract the items most similar to Toy Story (id: 862).

`code`


#Create a correspondence table between index and movie id
itemindex = dict()
for num, item_id in enumerate(movies.id):
    itemindex[item_id] = num

#Look up the row index for the movie id ('862' is Toy Story; ids are stored as strings) and store it in row_num
row_num = itemindex['862']

#Store the row indices of the 10 most similar items in top10_index
#(the first entry of the sorted result is skipped because it is the movie itself)
top10_index = np.argsort(cosine_sim[row_num])[::-1][1:11]
top10_index

#Look up the movie id corresponding to each index stored in top10_index
rec_id = list()
for search_index in top10_index:
    for item_id, index in itemindex.items():
        if index == search_index:
            rec_id.append(item_id)
rec_id

`Execution result`


['10193', '863', '6957', '82424', '92848', '181801', '364123', '250434', '42816', '355984']

The movies with these ids are as follows.

|      | id     | original_title                               | overview                                           |
|----- | ------ | -------------------------------------------- | -------------------------------------------------- |
|2997  | 863    | Toy Story 2                                  | Andy heads off to Cowboy Camp, leaving his toy...  |
|8327  | 42816  | The Champ                                    | Dink Purcell loves his alcoholic father, ex-he...  |
|10301 | 6957   | The 40 Year Old Virgin                       | Andy Stitzer has a pleasant life with a nice a...  |
|15348 | 10193  | Toy Story 3                                  | Woody, Buzz, and the rest of Andy's toys haven...  |
|23843 | 92848  | Andy Hardy's Blonde Trouble                  | Andy is going to Wainwright College as did his...  |
|24523 | 82424  | Small Fry                                    | A fast food restaurant mini variant of Buzz fo...  |
|29202 | 181801 | Hot Splash                                   | Matt and Woody's summer in Cocoa Beach is goin...  |
|38476 | 250434 | Superstar: The Life and Times of Andy Warhol | Documentary portrait of Andy Warhol.               |
|42721 | 355984 | Andy Peters: Exclamation Mark Question Point | Exclamation Mark Question Point is the debut s...  |
|43427 | 364123 | Andy Kaufman Plays Carnegie Hall             | Andy Kaufman's legendary sold-out Carnegie Hal...  |
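The lookup steps above can also be wrapped in a small reusable function. This is only a sketch under my own assumptions: the function name `top_k_similar`, the toy ids, and the toy similarity values are all hypothetical. Instead of skipping the first sorted entry, it removes the target row explicitly, which stays correct even if another item ties with the target at similarity 1 (for example, two movies with identical overviews).

```python
import numpy as np

def top_k_similar(sim_matrix, item_ids, target_id, k=10):
    """Return the ids of the k items most similar to target_id, best first."""
    id_to_row = {item_id: row for row, item_id in enumerate(item_ids)}
    row = id_to_row[target_id]
    # Sort row indices by descending similarity, then drop the target itself.
    order = np.argsort(sim_matrix[row])[::-1]
    order = [i for i in order if i != row][:k]
    return [item_ids[i] for i in order]

# Toy data: ids and similarity values are made up for illustration.
toy_ids = ["a", "b", "c", "d"]
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.7, 0.3, 1.0, 0.5],
    [0.1, 0.4, 0.5, 1.0],
])

result = top_k_similar(toy_sim, toy_ids, "a", k=2)
print(result)
```

With the real data, the call would look like `top_k_similar(cosine_sim, list(movies.id), '862')`, assuming the variables from the earlier steps are still in scope.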

Supplement

By the way, if you just want to find the similarity between two items, scikit-learn's cosine_similarity is convenient.

`code`


from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)[0,1])

When the cosine similarity of rows 0 and 1 (Toy Story and Jumanji) is calculated from the TF-IDF matrix obtained above, it matches the value in the similarity matrix calculated earlier.

`Execution result`


0.0306972031053245

Finally

Using the outline (text) of each movie as a feature vector, I searched for the top 10 movies most similar to Toy Story. Some results were genuinely similar, such as Toy Story 2, Toy Story 3, and Small Fry, but others were not, such as The 40 Year Old Virgin and Andy Kaufman Plays Carnegie Hall (I suspect they matched on the character name Andy). In the case of movies and books, it may be more accurate to find the similarity purely by genre.
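The suspected effect of a shared name can be reproduced on a tiny corpus. The following is a sketch with made-up overviews, not an analysis of the real data. It relies on the `idf_` and `vocabulary_` attributes of a fitted TfidfVectorizer to compare the weight of a rarer word ("andy") with a word that appears in every overview ("the").

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up overviews: 0 and 1 share only the name "andy" and the common word "the".
toy_overviews = [
    "andy plays with the toys",
    "andy performs at the concert hall",
    "the detective solves the case",
    "the ship crosses the ocean",
]

vectorizer = TfidfVectorizer()
toy_matrix = vectorizer.fit_transform(toy_overviews)

# Learned IDF weights: the rarer "andy" outweighs "the", which is in every overview.
idf_andy = vectorizer.idf_[vectorizer.vocabulary_["andy"]]
idf_the = vectorizer.idf_[vectorizer.vocabulary_["the"]]
print(idf_andy, idf_the)

# Overview 0 is closer to the unrelated overview that also mentions "andy".
toy_sim = cosine_similarity(toy_matrix)
print(toy_sim[0, 1], toy_sim[0, 2])
```

Because a name is rare across the corpus, it carries a large weight, so two otherwise unrelated overviews that mention the same name can end up ranked as similar.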

This time, I calculated the similarity of sentences using TF-IDF, and it left me wanting to study the accuracy in more detail.

[PYTHON] Implementation of recommendation system ~ I tried to find the similarity from the outline of the movie using TF-IDF ~
