Let's implement content-based filtering for a recommendation system using TF-IDF. In the previous article, one-hot encoding was used to calculate similarity from anime genre data. Please take a look at it if you are interested.
This time, we will implement it in a simple way using scikit-learn.
TF-IDF
TF-IDF is, like the one-hot encoding used last time, one of the methods for vectorizing items. This time, I will use TF-IDF to vectorize the overview of each item (its text description).
TF-IDF is calculated as the product of TF (Term Frequency) and IDF (Inverse Document Frequency). The rarer a word is across all documents, the more weight it receives as a feature that characterizes a given text.
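A common textbook form of this definition is given below. It is shown as a reference under stated assumptions: scikit-learn's TfidfVectorizer uses a smoothed variant of the IDF term by default and also normalizes each row, so the numbers it produces differ slightly from this plain form.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

Here n_{t,d} is the number of times term t occurs in document d, N is the total number of documents, and df(t) is the number of documents that contain t.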
This article explains TF-IDF in detail, so I would like to introduce it.
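As a small worked example of the product of TF and IDF described above, the following sketch computes textbook TF-IDF weights by hand using only the standard library. The three sentences are a made-up toy corpus, and the unsmoothed formula (count / document length) * log(N / df) is an assumption for illustration; it is not the exact variant scikit-learn applies.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the movie dataset)
docs = [
    "toys live happily in the room",
    "the toys miss the boy",
    "the boy goes to camp",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: (count / document length) * log(N / df)."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * math.log(n_docs / df[t])
            for t, c in counts.items()}

weights = tfidf(tokenized[1])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the second sentence, "miss" appears in only one document and gets the largest weight, while "the" appears in every document and gets a weight of zero even though it occurs twice.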
This time, we will implement it using this data. First, check the data. (Since there are many columns, only the columns used this time are extracted.)
code
import pandas as pd
import numpy as np
#Data reading
movies = pd.read_csv("movies_metadata.csv")
#Extract only the required columns
movies = movies[['id', 'original_title', 'overview']]
#Check the length of the data
print('The number of data:',len(movies.id))
movies.head()
Execution result
The number of data: 45466
id original_title overview
0 862 Toy Story Led by Woody, Andys toys live happily in his ...
1 8844 Jumanji When siblings Judy and Peter discover an encha...
2 15602 Grumpier Old Men A family wedding reignites the ancient feud be...
3 31357 Waiting to Exhale Cheated on, mistreated and stepped on, the wom...
4 11862 Father of the Bride Part II Just when George Banks has recovered from his ...
Since there are missing values in the overview column, delete the rows that contain them. Please note that the purpose this time is simply to try TF-IDF, so the preprocessing is kept rough.
code
#Confirmation of missing values
movies.isnull().sum()
#Delete rows with missing values
movies = movies.dropna(how='any')
print('The number of data:',len(movies.id))
Execution result
#Missing value
id 0
original_title 0
overview 954
dtype: int64
The number of data: 44512
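With how='any', a row is dropped if any of the three columns is missing. Since only overview has missing values here, restricting the check to that column removes the same rows. The sketch below shows this variant on a made-up three-row table and also renumbers the index so that DataFrame positions line up with matrix rows built from it. The table contents are hypothetical, and resetting the index is an optional choice that the article itself does not make.

```python
import pandas as pd

# Hypothetical miniature of the movies table
toy = pd.DataFrame({
    "id": ["1", "2", "3"],
    "original_title": ["A", "B", "C"],
    "overview": ["some text", None, "more text"],
})

# Drop only the rows whose overview is missing, then renumber the rows from 0
clean = toy.dropna(subset=["overview"]).reset_index(drop=True)
print(clean)
```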
Next, we will actually vectorize the item with TF-IDF. TF-IDF calculations are easily implemented using scikit-learn's TfidfVectorizer.
code
#Vectorize the overview column with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(movies['overview'])
#Check the size of the created matrix
tfidf_matrix.shape
Execution result
(44512, 76132)
You can see that there are 44512 movies and 76132 distinct words. Next, the similarity matrix is also calculated using scikit-learn. The explanation of cosine similarity is omitted because it was covered in the previous article.
code
#Create a similarity matrix
from sklearn.metrics.pairwise import pairwise_distances
cosine_sim = 1 - pairwise_distances(tfidf_matrix, metric = 'cosine')
cosine_sim
The result is a similarity matrix whose diagonal elements are all 1.
Execution result
[[1. , 0.0306972 , 0.01283222, ..., 0.00942304, 0.03492933, 0.01737238],
[0.0306972 , 1. , 0.05674315, ..., 0.00944854, 0.06448034, 0.03307954],
[0.01283222, 0.05674315, 1. , ..., 0.01271578, 0.05854697, 0.02767532],
...,
[0.00942304, 0.00944854, 0.01271578, ..., 1. , 0.02566797, 0.01480498],
[0.03492933, 0.06448034, 0.05854697, ..., 0.02566797, 1. , 0.0590926 ],
[0.01737238, 0.03307954, 0.02767532, ..., 0.01480498, 0.0590926 , 1. ]]
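For reference, the quantity in this matrix is a normalized dot product between rows, which is why the diagonal is 1 and the matrix is symmetric. The following is a minimal NumPy sketch on a small dense array, assuming no row is all zeros; the helper name cosine_similarity_matrix and the sample values are hypothetical. Note also that a dense 44512 x 44512 matrix takes on the order of 16 GB if stored as 8-byte floats, so this approach needs a machine with plenty of memory.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of a dense 2-D array."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / norms  # assumes no all-zero rows
    return X_unit @ X_unit.T

# Hypothetical feature vectors: rows 0 and 1 are identical, row 2 is orthogonal
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0]])
sim = cosine_similarity_matrix(X)
print(sim)
```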
As it stands, it is hard to tell which movie corresponds to which row, so create a key-value correspondence table between row index and movie id. Then use the movie id to search for items with high similarity. Here, we will extract the items most similar to Toy Story (id: 862).
code
#Create a correspondence table between index and movie id
itemindex = dict()
for num, item_id in enumerate(movies.id):
    itemindex[item_id] = num
#Look up the row index for the movie id (862 is Toy Story; ids are strings in this data) and store it in row_num
row_num = itemindex['862']
#Store the row numbers of the 10 most similar items in top10_index (the first position is skipped because it is normally the item itself)
top10_index = np.argsort(cosine_sim[row_num])[::-1][1:11]
top10_index
#Look up the movie id corresponding to each index stored in top10_index
rec_id = list()
for search_index in top10_index:
    for id, index in itemindex.items():
        if index == search_index:
            rec_id.append(id)
rec_id
Execution result
['10193', '863', '6957', '82424', '92848', '181801', '364123', '250434', '42816', '355984']
The movies with these ids are as follows.
| | id | original_title | overview |
|---- | ---- | ---- | ---- |
| 2997 | 863 | Toy Story 2 | Andy heads off to Cowboy Camp, leaving his toy... |
| 8327 | 42816 | The Champ | Dink Purcell loves his alcoholic father, ex-he... |
| 10301 | 6957 | The 40 Year Old Virgin | Andy Stitzer has a pleasant life with a nice a... |
| 15348 | 10193 | Toy Story 3 | Woody, Buzz, and the rest of Andy's toys haven... |
| 23843 | 92848 | Andy Hardy's Blonde Trouble | Andy is going to Wainwright College as did his... |
| 24523 | 82424 | Small Fry | A fast food restaurant mini variant of Buzz fo... |
| 29202 | 181801 | Hot Splash | Matt and Woody's summer in Cocoa Beach is goin... |
| 38476 | 250434 | Superstar: The Life and Times of Andy Warhol | Documentary portrait of Andy Warhol. |
| 42721 | 355984 | Andy Peters: Exclamation Mark Question Point | Exclamation Mark Question Point is the debut s... |
| 43427 | 364123 | Andy Kaufman Plays Carnegie Hall | Andy Kaufman's legendary sold-out Carnegie Hal... |
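The lookup from row number back to movie id can also be done with a reverse dictionary instead of scanning the whole correspondence table for every result. The sketch below uses short hypothetical lists in place of movies.id and the top-10 row numbers; the name index_to_id is an assumption for illustration.

```python
# Hypothetical stand-ins for movies.id and the most similar row numbers
movie_ids = ['862', '8844', '15602', '31357', '11862']
top_index = [2, 4, 1]

# Forward table: movie id -> row number (same idea as itemindex)
itemindex = {item_id: num for num, item_id in enumerate(movie_ids)}
# Reverse table: row number -> movie id
index_to_id = {num: item_id for item_id, num in itemindex.items()}

# One dictionary lookup per result, keeping the similarity order
rec_id = [index_to_id[i] for i in top_index]
print(rec_id)  # ['15602', '11862', '8844']
```

One caveat with either approach: if the same id appears on more than one row, the forward dictionary keeps only the last row for that id.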
By the way, if you just want to find the similarity between two items, scikit-learn's cosine_similarity is convenient.
code
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)[0,1])
Computing the cosine similarity between rows 0 and 1 (Toy Story and Jumanji) of the TF-IDF matrix obtained above gives a value that matches the corresponding entry of the similarity matrix calculated earlier.
Execution result
0.0306972031053245
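The same call can also be used to rank all items against a single query item without building the full similarity matrix. The following is a minimal sketch on a made-up four-sentence corpus; the helper name top_k_similar and the sample sentences are hypothetical, and only TfidfVectorizer and cosine_similarity are scikit-learn functions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy overviews (hypothetical, not taken from the dataset)
docs = [
    "andy's toys live happily in his room",
    "andy leaves his toys and heads to camp",
    "two siblings discover an enchanted board game",
    "a family wedding reignites an ancient feud",
]
tfidf = TfidfVectorizer().fit_transform(docs)

def top_k_similar(matrix, row, k):
    """Return the k row numbers most similar to `row`, excluding `row` itself."""
    # Similarity of one row against all rows: shape (1, n_items) -> (n_items,)
    scores = cosine_similarity(matrix[row:row + 1], matrix).ravel()
    scores[row] = -np.inf  # never recommend the query item itself
    return np.argsort(scores)[::-1][:k]

result = top_k_similar(tfidf, 0, 2)
print(result)
```

Only one row of similarities is held in memory at a time, so this scales to the full dataset more easily than the dense matrix. In the toy corpus, the first result is row 1, the only other sentence that shares words with the query.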
Using the overview (text) of each movie as a feature vector, I searched for the 10 movies most similar to Toy Story. Some results were genuinely similar, such as Toy Story 2, Toy Story 3 and Small Fry, but others were not, such as The 40 Year Old Virgin and Andy Kaufman Plays Carnegie Hall (I suspect they matched on the character name Andy). In the case of movies and books, it may be more accurate to compute similarity purely by genre.
This time, I tried calculating text similarity using TF-IDF, and it made me want to examine its accuracy in more detail.