[PYTHON] [For beginners] Quantify the similarity of sentences with TF-IDF

0. In short

    1. **TF-IDF** is an easy way to quantify how similar sentences are.
    2. The similarity is computed from the frequency of each word in a sentence (**TF**) multiplied by its rarity across documents (**IDF**).
    3. If convenience matters, **janome** is recommended over **MeCab** for morphological analysis, and **TF-IDF** over **doc2vec** for similarity judgment. ⇒ **I think this is useful for quantifying similarity to prior literature in patent searches, or for objectively scoring written answers on the university entrance Common Test.** Whether a sentence actually makes sense, though, still has to be judged by a human reading it. Well, of course (laughs)

1. Problem awareness

When I read sentences, I sometimes feel that two of them are alike. **TF-IDF** is a convenient way to turn that vague sense of "these are somewhat similar" into an objective number. There are sites that explain TF-IDF in an easy-to-understand way, so please google it; one I would recommend is the following: [For beginners] I briefly summarized TFIDF

The point is: (1) if a word appears frequently in a given sentence (**TF: Term Frequency**, how often it occurs), and (2) if it is a rare word that does not appear often in ordinary sentences (**IDF: Inverse Document Frequency**, how rare it is), then the sentence is probably about a topic related to that word. The basic idea of TF-IDF is to multiply TF and IDF for each word and compare these values across sentences to judge how similar the sentences are.
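To get a feel for the numbers, here is a minimal hand-rolled sketch of the idea. The two word lists and the simple logarithmic IDF are purely illustrative assumptions; scikit-learn's actual formula (used later) adds smoothing and normalization.

import math

# Two toy "documents", already split into words (illustrative only)
docs = [['soccer', 'kawasaki', 'win', 'win'],
        ['soccer', 'baseball', 'tanaka']]

def tf(word, doc):
    return doc.count(word) / len(doc)          # how often the word appears in one document

def idf(word, docs):
    n_containing = sum(word in doc for doc in docs)
    return math.log(len(docs) / n_containing)  # rarer across documents -> larger value

for word in ['win', 'soccer']:
    print(word, tf(word, docs[0]) * idf(word, docs))
# 'win' is frequent in doc 0 and absent from doc 1 -> high TF-IDF
# 'soccer' appears in every document -> IDF (and hence TF-IDF) is 0 with this simple formula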

2. Preparation

Well, just talking about it gets us nowhere, so let's actually calculate it using **scikit-learn (sklearn)**, which is often used for AI-like things in Python. First, the preparation.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

For the TF-IDF calculation, **CountVectorizer** and **TfidfTransformer** can be used in combination instead of the **TfidfVectorizer** imported above (I will try that later). When combining them, word frequencies are first vectorized with CountVectorizer, and the TF-IDF weighting is then applied with TfidfTransformer. If all you want is TF-IDF, though, TfidfVectorizer does it in a single step, so that is what I used this time. The results differed slightly between the TfidfVectorizer version and the CountVectorizer + TfidfTransformer version (details later); in hindsight, the likely cause is that my TfidfVectorizer code below also passes its output through TfidfTransformer, so the IDF weighting and normalization get applied twice. In either case, the similarity itself is obtained by computing the **cosine similarity** between the resulting vectors.
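Since cosine similarity is just the dot product of two vectors divided by the product of their lengths, a minimal numpy sketch of that formula looks like this (the two vectors are made-up values, not real TF-IDF output):

import numpy as np

a = np.array([0.5, 0.3, 0.0, 0.2])   # pretend TF-IDF vector of sentence A (made-up)
b = np.array([0.4, 0.0, 0.1, 0.3])   # pretend TF-IDF vector of sentence B (made-up)

cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)  # 1.0 = same direction (very similar), 0.0 = no overlap at all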

When dealing with Japanese, morphological analysis is also required. English text already has words separated by spaces, so it can be analyzed as-is, but in Japanese the words run together within a sentence, so each word has to be split out first. The best-known morphological analyzer for Python is **MeCab**, but **janome** is easier to set up, so I will use **janome** this time.

from janome.tokenizer import Tokenizer
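To get a feel for what janome returns, here is a minimal sketch using the Tokenizer imported above (assuming janome is installed; the example sentence is arbitrary, and the exact tags depend on the bundled dictionary):

t = Tokenizer()
for token in t.tokenize('川崎フロンターレが首位を独走する'):
    # token.surface is the word itself; token.part_of_speech is a comma-separated
    # Japanese tag string such as '名詞,固有名詞,...'
    print(token.surface, token.part_of_speech)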

3. Analysis target

Now for the sentences to compare. Let's use some easy-to-understand news, chosen with my own bias toward my favorite sport, soccer. The first article is the news that Frontale, running in first place, beat second-place Cerezo.

text1.txt


Kawasaki F beat 2nd-place C Osaka with 3 goals and run away with the lead! An 8th straight win widens the gap to "14" points
Round 20 of the Meiji Yasuda Life J1 League was held on the 3rd, with Cerezo Osaka welcoming Kawasaki Frontale to their home, Yanmar Stadium Nagai.
After 20 games so far, Cerezo Osaka (42 points) have been chasing Kawasaki F (53 points). Home side Cerezo Osaka won for the first time in three games in the previous round, while Kawasaki F came into this top-of-the-table clash with the overwhelming momentum of seven straight wins.

Next, let's use another news article covering the same match. The hypothesis is that this should be the most similar to the first article; let's test it.

text2.txt


The gap grows to 14 points... Kawasaki F beat C Osaka for an 8th straight win!!
Round 20 of the J1 League was held on the 3rd, with second-place Cerezo Osaka (42 points) facing league leaders Kawasaki Frontale (53 points), 11 points ahead of them, at Yanmar Stadium Nagai. Kawasaki F took the lead with an own goal in the 37th minute of the first half, but Cerezo Osaka equalized through a goal from FW Hiroaki Okuno in the 17th minute of the second half. However, FW Leandro Damian in the 38th minute and MF Kaoru Mitoma in the 39th minute shook the net in quick succession, and Kawasaki F won 3-1.

The third is also soccer news, but about a different topic.

text3.txt


Yasuhito Endo leaves Gamba Osaka for Iwata! The "relationship with manager Miyamoto" behind the "decision"
Some sports newspapers have reported that Yasuhito Endo, Gamba Osaka's Japan-national-team legend, will move to J2 Iwata on loan. As soon as the news broke, not only Gamba Osaka supporters but many soccer fans reacted with surprise online. Endo has been a mainstay of Gamba Osaka since transferring from Kyoto in 2001. Wearing Gamba Osaka's No. 7 and directing play as the team's playmaker, he has been a core contributor to every title the club has won.

The fourth stays in the sports genre, but it is baseball news.

text4.txt


Masahiro Tanaka, a free agent this offseason, is "a pitcher worth the money"
Masahiro Tanaka, a pitcher who becomes a free agent (FA) at the end of this season, is already hearing calls from the team and local media for him to stay. The right-hander, who had pitched masterfully in the playoffs through last season, started Game 2 of the Wild Card Series against the Indians on September 30 (October 1 Japan time). In poor, rainy conditions he struggled, giving up 6 runs over 4 innings, but the team still clinched its advance to the Division Series.

Finally, it is still news, but from a completely different genre.

text5.txt


[New coronavirus] US President hospitalized; the White House "cluster" and the WHO
US President Trump traveled by presidential helicopter from the White House to the Walter Reed Medical Center near Washington on the 2nd to be treated for the new coronavirus infection (COVID-19). The move suggests widespread concern about the severity of his condition.

These texts will be read in and analyzed. The expectation is that similarity to text 1 falls in the order text 2 > text 3 > text 4 > text 5. Will it really turn out that way?

4. Trying it out

First, read in the text files. Don't forget to apply morphological analysis at the same time. Note that janome returns Japanese part-of-speech tags such as 名詞 (noun), so the filter below matches against those.

filenames = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt', 'text5.txt']
wakati_list = []
t = Tokenizer()
for filename in filenames:
    # Read a text file and assign it to text
    with open(filename, mode='r', encoding='utf-8-sig') as f:
        text = f.read()
    wakati = ''
    for token in t.tokenize(text):  # Morphological analysis
        hinshi = token.part_of_speech.split(',')[0]    # Part of speech (janome returns Japanese tags)
        hinshi_2 = token.part_of_speech.split(',')[1]  # Part-of-speech sub-category
        if hinshi == '名詞':  # Keep nouns only
            if hinshi_2 not in ['空白', '*']:  # Skip blanks and unclassified sub-categories
                word = str(token).split()[0]  # The surface form of the word
                if ',*,' not in word:
                    wakati = wakati + word + ' '  # Append the word and a space to wakati
    wakati_list.append(wakati)  # Add the space-separated result for this file to the list
wakati_list_np = np.array(wakati_list)  # Convert the list to an ndarray
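Before vectorizing, it is worth a quick sanity check that the preprocessing produced something sensible. A minimal sketch, assuming the loop above has already run:

for filename, wakati in zip(filenames, wakati_list):
    # Show the first handful of extracted nouns for each file
    print(filename, wakati.split()[:10])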

Finally, the calculation of similarity. Let's use TfidfVectorizer.

from sklearn.feature_extraction.text import TfidfTransformer  # needed for the extra step below

vectorizer = TfidfVectorizer(token_pattern=u'\\b\\w+\\b')  # token_pattern keeps one-character words
transformer = TfidfTransformer()  # TfidfTransformer (applies TF-IDF weighting)
tf = vectorizer.fit_transform(wakati_list_np)  # Vectorization (already TF-IDF weighted)
tfidf = transformer.fit_transform(tf)  # Applies the IDF weighting a second time (see note above)
tfidf_array = tfidf.toarray()
cs = cosine_similarity(tfidf_array, tfidf_array)  # Cosine similarity calculation
print(cs)

The results are as follows. The relative order of the similarities is, of course, just as expected.

[[1.         0.48812198 0.04399067 0.02065671 0.00164636]
 [0.48812198 1.         0.02875532 0.01380959 0.00149348]
 [0.04399067 0.02875532 1.         0.02595705 0.        ]
 [0.02065671 0.01380959 0.02595705 1.         0.00350631]
 [0.00164636 0.00149348 0.         0.00350631 1.        ]]
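Here cs[i][j] is the similarity between text i+1 and text j+1. If you just want the similarities to text 1 in ranked order, a small sketch using the cs matrix and filenames list from above:

scores = cs[0]  # similarities of text1.txt to every text (including itself)
for idx in np.argsort(scores)[::-1]:  # sort indices from most to least similar
    print(filenames[idx], round(scores[idx], 3))
# Expected order: text1 (itself), then text2 > text3 > text4 > text5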

By the way, the CountVectorizer + TfidfTransformer combination mentioned at the beginning looks like the following. The required classes have to be imported before they can be used.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Create the vectorizer. token_pattern=u'\\b\\w+\\b' keeps one-character words as tokens
vectorizer = CountVectorizer(token_pattern=u'\\b\\w+\\b')
# Create the transformer, which applies the TF-IDF weighting
transformer = TfidfTransformer()
tf = vectorizer.fit_transform(wakati_list_np)  # Word-count vectorization
tfidf = transformer.fit_transform(tf)  # TF-IDF weighting
tfidf_array = tfidf.toarray()
cs = cosine_similarity(tfidf_array, tfidf_array)  # Cosine similarity calculation
print(cs)

The results are as follows. The similarity values come out slightly higher this way.

[[1.         0.59097619 0.07991729 0.03932476 0.00441963]
 [0.59097619 1.         0.05323053 0.03037231 0.00418569]
 [0.07991729 0.05323053 1.         0.03980858 0.        ]
 [0.03932476 0.03037231 0.03980858 1.         0.01072682]
 [0.00441963 0.00418569 0.         0.01072682 1.        ]]
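If you want to confirm the explanation above, namely that TfidfVectorizer alone is equivalent to CountVectorizer + TfidfTransformer with default parameters, and that the first table differs only because TfidfTransformer was applied a second time, a minimal check (assuming wakati_list_np from earlier) might look like this:

tfidf_a = TfidfVectorizer(token_pattern=u'\\b\\w+\\b').fit_transform(wakati_list_np)
tfidf_b = TfidfTransformer().fit_transform(
    CountVectorizer(token_pattern=u'\\b\\w+\\b').fit_transform(wakati_list_np))

# With default parameters the two pipelines should produce the same TF-IDF matrix
print(np.allclose(tfidf_a.toarray(), tfidf_b.toarray()))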

5. Finally

If you want to compute sentence similarity in Python, **doc2vec** is also a good option. However, it requires preparing or loading a trained model, which is more work. In that sense, I think **TF-IDF** lets you calculate sentence similarity more easily.
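For reference, a minimal doc2vec sketch using gensim (my own assumption, not the setup used in this article; the hyperparameters are arbitrary, and the API differs slightly between gensim versions, e.g. model.docvecs instead of model.dv in older releases) could reuse the word-separated texts from above:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each pre-tokenized text with its index as a string
documents = [TaggedDocument(words=wakati.split(), tags=[str(i)])
             for i, wakati in enumerate(wakati_list)]

# Train a tiny model from scratch (real use needs far more data or a pre-trained model)
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=50)

# Documents most similar to text1.txt (tag '0')
print(model.dv.most_similar('0'))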

For the code, I referred to the site below. I would like to take this opportunity to say thanks.

Try various things with Python
