[PYTHON] Implementation of recommendation system ~ I tried to find the similarity from the outline of the movie using TF-IDF ~

Introduction

Let's implement content-based filtering for a recommendation system using TF-IDF. In the previous article, I used One-Hot Encoding to calculate similarity from anime genre data. Please take a look at it if you are interested.
This time, we will implement as follows.

  1. Vectorize movie outline (text) with TF-IDF
  2. Calculate similarity using cosine similarity
  3. Recommend items with high similarity

We will implement it easily using scikit-learn.

TF-IDF

TF-IDF is, like the One-Hot Encoding used last time, one of the methods for vectorizing items. This time, I will use TF-IDF to vectorize the outline of each item (its text).

TF-IDF is calculated as the product of TF (Term Frequency) and IDF (Inverse Document Frequency). It is expressed by the following formulas. The rarer a word is across documents, the larger its IDF, and the more important the word is considered in expressing the characteristics of the document it appears in.

$IDF = \log \frac{\text{total number of documents}}{\text{number of documents containing word } X}$

$TF = \frac{\text{number of occurrences of word } X \text{ in document } A}{\text{total number of occurrences of all words in document } A}$

$TFIDF = TF \cdot IDF$
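To make the formulas concrete, here is a minimal sketch that computes TF, IDF, and TF-IDF by hand. The three-document corpus and the helper names `tf`, `idf`, and `tfidf` are made up for illustration and are not part of the movie data.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): each string is one document
docs = [
    "toys live happily in the room",
    "the toys leave the room",
    "a family wedding in the town",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def tf(word, tokens):
    # Occurrences of `word` in the document / total number of words in the document
    return Counter(tokens)[word] / len(tokens)

def idf(word):
    # log(total number of documents / number of documents containing the word)
    df = sum(1 for tokens in tokenized if word in tokens)
    return math.log(n_docs / df)

def tfidf(word, tokens):
    return tf(word, tokens) * idf(word)

# "the" appears in every document, so its IDF (and therefore its TF-IDF) is 0
print(tfidf("the", tokenized[0]))
# "toys" appears in two of the three documents, "happily" in only one,
# so "happily" gets the larger weight in the first document
print(tfidf("toys", tokenized[0]), tfidf("happily", tokenized[0]))
```

Note that scikit-learn's TfidfVectorizer, with its default settings, uses a smoothed variant of IDF and L2-normalizes each row, so its values will not match this hand calculation exactly.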

This article explains TF-IDF in detail, so I would like to recommend it as a reference.

Implementation

This time, we will implement it using this data. First, check the data. (Since there are many columns, only the columns used this time are extracted.)

code


import pandas as pd
import numpy as np

#Data reading
movies = pd.read_csv("movies_metadata.csv")

#Extract only the required columns
movies = movies[['id', 'original_title', 'overview']]

#Check the length of the data
print('The number of data:',len(movies.id))

movies.head()

Execution result


The number of data: 45466

	id	  original_title	               overview
0	862	    Toy Story	                Led by Woody, Andys toys live happily in his ...
1	8844	Jumanji    	                When siblings Judy and Peter discover an encha...  
2	15602	Grumpier Old Men	        A family wedding reignites the ancient feud be...  
3	31357	Waiting to Exhale	        Cheated on, mistreated and stepped on, the wom...
4	11862	Father of the Bride Part II	Just when George Banks has recovered from his ...

Since there are missing values in the overview column, delete the rows that contain them. Please note that the purpose this time is just to try TF-IDF, so the preprocessing is kept rough.

code


#Confirmation of missing values
movies.isnull().sum()

#Delete rows with missing values
movies = movies.dropna(how='any')
print('The number of data:',len(movies.id))

Execution result


#Missing value
id                  0
original_title      0
overview          954
dtype: int64


The number of data: 44512

Next, we will actually vectorize the item with TF-IDF. TF-IDF calculations are easily implemented using scikit-learn's TfidfVectorizer.

code


#Vectorize the overview column with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(movies['overview'])

#Check the size of the created matrix
tfidf_matrix.shape

Execution result


(44512, 76132)

You can see that there are 44512 movies and that 76132 distinct words appear in their overviews. Next, the similarity matrix is also calculated using scikit-learn. The explanation of cosine similarity is omitted because it was covered in the previous article.

code


#Create a similarity matrix
from sklearn.metrics.pairwise import pairwise_distances

cosine_sim = 1 - pairwise_distances(tfidf_matrix, metric = 'cosine')
cosine_sim

The result is a similarity matrix whose diagonal components are all 1.

Execution result


[[1.        , 0.0306972 , 0.01283222, ..., 0.00942304, 0.03492933, 0.01737238],
 [0.0306972 , 1.        , 0.05674315, ..., 0.00944854, 0.06448034, 0.03307954],
 [0.01283222, 0.05674315, 1.        , ..., 0.01271578, 0.05854697, 0.02767532],
 ...,
 [0.00942304, 0.00944854, 0.01271578, ..., 1.        , 0.02566797, 0.01480498],
 [0.03492933, 0.06448034, 0.05854697, ..., 0.02566797, 1.        , 0.0590926 ],
 [0.01737238, 0.03307954, 0.02767532, ..., 0.01480498, 0.0590926 , 1.        ]]
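As a quick check of what `1 - pairwise_distances(..., metric='cosine')` returns, the following sketch compares it with cosine similarity computed directly from the definition. The small matrix `X` and the helper `cosine` are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Toy vectors (made up for illustration), one row per item
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # same direction as row 0
    [0.0, 0.0, 3.0],   # orthogonal to rows 0 and 1
])

def cosine(a, b):
    # cos(a, b) = (a . b) / (|a| |b|)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Cosine distance is 1 - cosine similarity, so subtracting from 1 recovers the similarity
sim = 1 - pairwise_distances(X, metric="cosine")

print(np.round(sim, 3))
print(cosine(X[0], X[1]), cosine(X[0], X[2]))
```

The matrix is symmetric with 1 on the diagonal, which is the same structure as the movie similarity matrix above.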

The similarity matrix alone does not tell us which movie is in which row, so create a key-value correspondence table between row index and movie id. Then, starting from a movie id, search for the items with the highest similarity. Here, we will extract the items that are most similar to Toy Story (id: 862).

code


#Create a correspondence table between index and movie id
itemindex = dict()
for num, item_id in enumerate(movies.id):
    itemindex[item_id] = num

#Look up the row index from the movie id (862 for Toy Story) and store it in row_num
row_num = itemindex['862']

#Store the row indices of the 10 most similar items in top10_index
#(the first entry after sorting is normally the movie itself, so skip it)
top10_index = np.argsort(cosine_sim[row_num])[::-1][1:11]
top10_index

#Look up the movie id corresponding to each index stored in top10_index
rec_id = list()
for search_index in top10_index:
    for item_id, index in itemindex.items():
        if index == search_index:
            rec_id.append(item_id)
rec_id

Execution result


['10193', '863', '6957', '82424', '92848', '181801', '364123', '250434', '42816', '355984']

The movies with these ids are as follows.

| | id | original_title | overview |
|---- | ---- | ---- | ---- |
| 2997 | 863 | Toy Story 2 | Andy heads off to Cowboy Camp, leaving his toy... |
| 8327 | 42816 | The Champ | Dink Purcell loves his alcoholic father, ex-he... |
| 10301 | 6957 | The 40 Year Old Virgin | Andy Stitzer has a pleasant life with a nice a... |
| 15348 | 10193 | Toy Story 3 | Woody, Buzz, and the rest of Andy's toys haven... |
| 23843 | 92848 | Andy Hardy's Blonde Trouble | Andy is going to Wainwright College as did his... |
| 24523 | 82424 | Small Fry | A fast food restaurant mini variant of Buzz fo... |
| 29202 | 181801 | Hot Splash | Matt and Woody's summer in Cocoa Beach is goin... |
| 38476 | 250434 | Superstar: The Life and Times of Andy Warhol | Documentary portrait of Andy Warhol. |
| 42721 | 355984 | Andy Peters: Exclamation Mark Question Point | Exclamation Mark Question Point is the debut s... |
| 43427 | 364123 | Andy Kaufman Plays Carnegie Hall | Andy Kaufman's legendary sold-out Carnegie Hal... |
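One way to go from a list of recommended ids to a table like this is a pandas lookup. The sketch below uses a small stand-in DataFrame (`movies_demo`) and a shortened id list (`rec_id_demo`), both made up for illustration, and it assumes the ids are unique strings.

```python
import pandas as pd

# Toy stand-in for the movies DataFrame (ids are strings, as in the output above)
movies_demo = pd.DataFrame({
    "id": ["862", "8844", "863", "10193"],
    "original_title": ["Toy Story", "Jumanji", "Toy Story 2", "Toy Story 3"],
})
rec_id_demo = ["10193", "863"]  # recommended ids, most similar first

# isin() keeps the DataFrame's own row order
by_row_order = movies_demo[movies_demo["id"].isin(rec_id_demo)]

# Selecting by id through the index keeps the recommendation order instead
by_rank = movies_demo.set_index("id").loc[rec_id_demo].reset_index()

print(by_row_order)
print(by_rank)
```

The first form gives rows sorted by their original index, while the second is more convenient when the ranking itself should be shown.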

Supplement

By the way, if you just want to find the similarity between two items, scikit-learn's cosine_similarity is convenient.

code


from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)[0,1])

Calculating the cosine similarity of the 0th and 1st rows (Toy Story and Jumanji) from the TF-IDF matrix obtained above shows that it matches the value in the similarity matrix calculated earlier.

Execution result


0.0306972031053245
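The same function can also rank every item against a single query row without building the full similarity matrix, which for 44512 movies would take roughly 16 GB if stored as float64. The following is a minimal sketch under that idea. The toy overviews and the helper `top_k_similar` are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy overviews (made up for illustration)
overviews = [
    "woody and the toys live in andy's room",
    "andy leaves and the toys go on a rescue mission",
    "two siblings discover an enchanted board game",
    "a family wedding reignites an old feud",
]
tfidf_demo = TfidfVectorizer().fit_transform(overviews)

def top_k_similar(matrix, row, k):
    # Similarity of one row against all rows: shape (1, n_items), not (n_items, n_items)
    scores = cosine_similarity(matrix[row:row + 1], matrix).ravel()
    order = np.argsort(scores)[::-1]
    # Drop the query row itself, then keep the k best
    order = order[order != row][:k]
    return order, scores[order]

idx, scores = top_k_similar(tfidf_demo, row=0, k=2)
print(idx, scores)
```

Here the second overview shares several words with the first, so it ranks first, while the unrelated overviews score 0.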

Finally

Using the outline (text) of each movie as a feature vector, I searched for the 10 movies most similar to Toy Story. Some results were genuinely similar, such as Toy Story 2, Toy Story 3, and Small Fry, but others were not, such as The 40 Year Old Virgin and Andy Kaufman Plays Carnegie Hall (I think they probably matched on the character name Andy). In the case of movies and books, it may be more accurate to find the similarity purely by genre.

This time, I tried calculating text similarity using TF-IDF, and it made me want to examine the accuracy in more detail.
