[PYTHON] Rank how closely each team is related to pitcher Masahiro Yamamoto with janome and TF-IDF cosine similarity

janome, which was introduced at PyCon 2015, is convenient. Using the text of Wikipedia articles, I would like to rank how closely each team is related to the retired pitcher Masahiro Yamamoto.

What to do this time

Extract Japanese nouns from Wikipedia articles with the morphological analyzer janome, and build a feature vector for each article with TF-IDF. Taking the inner product of two articles' feature vectors and dividing by their lengths gives cos θ, a similarity score between 0 and 1. Sort the articles by that similarity and you're done.
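
As a rough illustration of the cosine step, here is a minimal sketch in plain Python. It assumes each article has already been reduced to a dict that maps a noun to its weight. The function name `cosine_similarity` and the sample weights are made up for this example and are not part of janome or simple_tfidf_japanese.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Return cos(theta) between two sparse vectors given as {term: weight} dicts."""
    # Inner product over the terms that appear in both vectors.
    dot = sum(weight * vec_b[term] for term, weight in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty vector has no direction; treat it as unrelated.
    return dot / (norm_a * norm_b)


# Hypothetical weights, for illustration only.
article_a = {"baseball": 2.0, "pitcher": 1.0, "record": 1.0}
article_b = {"baseball": 1.0, "soccer": 2.0, "record": 1.0}
print(cosine_similarity(article_a, article_b))
```

Because noun weights are never negative, the score stays between 0 (no shared nouns) and 1 (same direction).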

Where janome is convenient

Installing MeCab is a hassle: it depends on the Python version and requires installing a dictionary separately. janome, which can be installed with pip alone, is convenient. You can take on morphological analysis easily whenever you need it.

Try using janome

# Install first from a shell: pip install janome
from janome.tokenizer import Tokenizer

t = Tokenizer()
text = """
Two years after joining the Hiroshima era, no buds appeared and were overtaken by younger Tomonori Maeda and Akira Eto.
The batting was so weak that the coach at that time said, "Roll and use your legs."
The outfielder is also nicknamed "Mole Killer" because of the bad habit of throwing the ball toward the ground....
"""
# Print each token: surface form followed by its part-of-speech details.
for token in t.tokenize(text):
    print(token)

------------------
Hiroshima noun,Proper noun,area,General,*,*,Hiroshima,Hiroshima,Hiroshima
jidai noun,General,*,*,*,*,jidai,Jidai,Jidai
nyudan noun,Sa-hen connection,*,*,*,*,nyudan,Nyudan,Nyudan
go noun,suffix,Adverbs possible,*,*,*,go,Go,Go
no particle,Attributive,*,*,*,*,no,No,No
2 noun,number,*,*,*,*,*,*,*
nenkan noun,suffix,Classifier,*,*,*,nenkan,Nenkan,Nenkan
wa particle,Binding particle,*,*,*,*,wa,Ha,Wa
me noun,General,*,*,*,*,me,Me,Me
ga particle,Case particle,General,*,*,*,ga,Ga,Ga
de verb,Independence,*,*,One step,Imperfective form,deru,De,De
zu auxiliary verb,*,*,*,Special,Continuous connection,nu,Zu,Zu
、 symbol,Comma,*,*,*,*,、,、,、
toshishita noun,General,*,*,*,*,toshishita,Toshishita,Toshishita
no particle,Attributive,*,*,*,*,no,No,No
Maeda noun,Proper noun,Personal name,Surname,*,*,Maeda,Maeda,Maeda
Tomonori noun,Proper noun,Personal name,Name,*,*,Tomonori,Tomonori,Tomonori
ya particle,Parallel particle,*,*,*,*,ya,Ya,Ya
Eto noun,Proper noun,Personal name,Surname,*,*,Eto,Eto,Eto
Satoshi noun,Proper noun,Personal name,Name,*,*,Satoshi,Satoshi,Satoshi
ra noun,suffix,General,*,*,*,ra,Ra,Ra
ni particle,Case particle,General,*,*,*,ni,Ni,Ni
oinuka verb,Independence,*,*,Five-dan / Ka line,Imperfective form,oinuku,Oinuka,Oinuka
...

What is TF-IDF?

The idea is that counting how often each noun appears in a sentence gives a feature vector for that sentence. Unlike English, Japanese does not separate words with white space, so morphological analysis is needed to split the text into words. This is where the morphological analyzer janome comes in.
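
To make the noun-picking step concrete, here is a minimal sketch. janome tokens expose a `surface` string and a comma-separated `part_of_speech` string, and the sketch relies only on those two attributes. `FakeToken` and `extract_nouns` are hypothetical names introduced for this example so that it runs without janome installed.

```python
from collections import namedtuple

# Stand-in for a janome token (assumption: only these two attributes are needed).
FakeToken = namedtuple("FakeToken", ["surface", "part_of_speech"])


def extract_nouns(tokens):
    """Keep the surface form of every token whose top-level part of speech is noun (名詞)."""
    return [t.surface for t in tokens if t.part_of_speech.split(",")[0] == "名詞"]


# Hand-written tokens for illustration; real ones would come from Tokenizer().tokenize(text).
tokens = [
    FakeToken("芽", "名詞,一般,*,*"),
    FakeToken("が", "助詞,格助詞,一般,*"),
    FakeToken("出", "動詞,自立,*,*"),
    FakeToken("年下", "名詞,一般,*,*"),
]
print(extract_nouns(tokens))
```

With janome installed, the same function should accept the result of `Tokenizer().tokenize(text)` directly.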

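TF-IDF (as used here) = number of occurrences of a given noun in the sentence / number of all nouns in the sentence

Strictly speaking, this ratio is only the term frequency (TF). A full TF-IDF would also multiply it by an inverse document frequency (IDF) term that down-weights nouns common to many documents.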
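
The ratio above can be computed in a few lines of Python. This is only a sketch of the formula as written, not the code inside simple_tfidf_japanese. The function name `term_frequency` and the noun list are assumptions made for the example.

```python
from collections import Counter


def term_frequency(nouns):
    """Map each noun to (its number of occurrences) / (total number of nouns)."""
    total = len(nouns)
    if total == 0:
        return {}
    return {noun: count / total for noun, count in Counter(nouns).items()}


# Hypothetical noun list, as if already extracted by a morphological analyzer.
nouns = ["meat", "festival", "night", "meat", "steak", "house", "steak"]
weights = term_frequency(nouns)

# Show the nouns from the highest score to the lowest.
for noun, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(noun, weight)
```

Here "meat" and "steak" each appear twice among seven nouns, so they score 2/7, and every other noun scores 1/7.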

Let's actually try

Example sentence


Meat festival NIIGATA for a night full of meat ❤︎ Steak House Azuma-san's Yukimuro Aged Niigata Prefecture Beef Steak Delicious*\(^o^)/*Perfect for salt or wasabi!

Result


from simple_tfidf_japanese.tfidf import TFIDF

# Raw string so that the backslash in the emoticon is kept literally.
text = r"Meat festival NIIGATA for a night full of meat ❤︎ Steak House Azuma-san's Yukimuro Aged Niigata Prefecture Beef Steak Delicious*\(^o^)/*Perfect for salt or wasabi!"
result = TFIDF.gen(text, enable_one_char=1)
# Each entry is a (noun, score) pair.
for key, value in result:
    print(key, value)

Meat 0.0952380952381
Steak 0.0952380952381
0 0.047619047619
Rice 0.047619047619
Snow 0.047619047619
Niigata 0.047619047619
Aging 0.047619047619
Salt 0.047619047619
Festival 0.047619047619
Azuma 0.047619047619
Wasabi 0.047619047619
Cow 0.047619047619
House 0.047619047619
Night 0.047619047619
Samadhi 0.047619047619

Meat is great! Meat and steak come out as the strongest features of this sentence, with rice and the remaining nouns tied behind them.

Let's extract the similarity between pitcher Masahiro Yamamoto and each team

I published the tool I wrote on PyPI.

pip install simple_tfidf_japanese

Let's compare how closely Masa is related to each team based on their Wikipedia pages. I'll also mix in a soccer article that has nothing to do with the subject, for comparison.

from simple_tfidf_japanese.tfidf import TFIDF

#Masahiro Yamamoto
_base_url = "https://ja.wikipedia.org/wiki/%E5%B1%B1%E6%9C%AC%E6%98%8C"

#Comparison
data = [
     ['Yakult', 'https://ja.wikipedia.org/wiki/%E6%9D%B1%E4%BA%AC%E3%83%A4%E3%82%AF%E3%83%AB%E3%83%88%E3%82%B9%E3%83%AF%E3%83%AD%E3%83%BC%E3%82%BA'],
     ['Giants', 'https://ja.wikipedia.org/wiki/%E8%AA%AD%E5%A3%B2%E3%82%B8%E3%83%A3%E3%82%A4%E3%82%A2%E3%83%B3%E3%83%84'],
     ['Hanshin', 'https://ja.wikipedia.org/wiki/%E9%98%AA%E7%A5%9E%E3%82%BF%E3%82%A4%E3%82%AC%E3%83%BC%E3%82%B9'],
     ['Hiroshima', 'https://ja.wikipedia.org/wiki/%E5%BA%83%E5%B3%B6%E6%9D%B1%E6%B4%8B%E3%82%AB%E3%83%BC%E3%83%97'],
     ['Chunichi', 'https://ja.wikipedia.org/wiki/%E4%B8%AD%E6%97%A5%E3%83%89%E3%83%A9%E3%82%B4%E3%83%B3%E3%82%BA'],
     ['Yokohama', 'https://ja.wikipedia.org/wiki/%E6%A8%AA%E6%B5%9CDeNA%E3%83%99%E3%82%A4%E3%82%B9%E3%82%BF%E3%83%BC%E3%82%BA'],
     ['Softbank', 'https://ja.wikipedia.org/wiki/%E7%A6%8F%E5%B2%A1%E3%82%BD%E3%83%95%E3%83%88%E3%83%90%E3%83%B3%E3%82%AF%E3%83%9B%E3%83%BC%E3%82%AF%E3%82%B9'],
     ['Nippon-Ham', 'https://ja.wikipedia.org/wiki/%E5%8C%97%E6%B5%B7%E9%81%93%E6%97%A5%E6%9C%AC%E3%83%8F%E3%83%A0%E3%83%95%E3%82%A1%E3%82%A4%E3%82%BF%E3%83%BC%E3%82%BA'],
     ['Lotte', 'https://ja.wikipedia.org/wiki/%E5%8D%83%E8%91%89%E3%83%AD%E3%83%83%E3%83%86%E3%83%9E%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%BA'],
     ['Seibu', 'https://ja.wikipedia.org/wiki/%E5%9F%BC%E7%8E%89%E8%A5%BF%E6%AD%A6%E3%83%A9%E3%82%A4%E3%82%AA%E3%83%B3%E3%82%BA'],
     ['Orix', 'https://ja.wikipedia.org/wiki/%E3%82%AA%E3%83%AA%E3%83%83%E3%82%AF%E3%82%B9%E3%83%BB%E3%83%90%E3%83%95%E3%82%A1%E3%83%AD%E3%83%BC%E3%82%BA'],
     ['Rakuten', 'https://ja.wikipedia.org/wiki/%E6%9D%B1%E5%8C%97%E6%A5%BD%E5%A4%A9%E3%82%B4%E3%83%BC%E3%83%AB%E3%83%87%E3%83%B3%E3%82%A4%E3%83%BC%E3%82%B0%E3%83%AB%E3%82%B9'],
     ['Japan national football team', 'https://ja.wikipedia.org/wiki/%E3%82%B5%E3%83%83%E3%82%AB%E3%83%BC%E6%97%A5%E6%9C%AC%E4%BB%A3%E8%A1%A8'],
]

#Calculation
result = TFIDF.some_similarity(_base_url, data)

#Result display
result.sort(key=lambda x: x[2], reverse=True)  # highest similarity first
for title, url, value in result:
    print(title, value)

"""
Giants 0.437053886215
Yakult 0.399745780763
Hanshin 0.383247816027
Hiroshima 0.356147904333
Lotte 0.351312791912
Chunichi 0.344772305253
Yokohama 0.334360056622
Nippon-Ham 0.326226324436
Orix 0.317250711462
Softbank 0.285703674673
Seibu 0.283181229507
Rakuten 0.275111280558
Japan national football team 0.177026402257
"""

Overall, the Central League (Se) teams tend to score highest, the Pacific League (Pa) teams lower, and the unrelated soccer article lowest. He pitched for the Chunichi Dragons for more than 30 years, yet surprisingly the top team is not Chunichi but the Giants. His Wikipedia article contains many episodes about games against the Giants, which seems to have raised the similarity with them.

Hiroshima also ranks higher than Chunichi, probably because its article has many mentions of Koji Yamamoto, the manager nicknamed "Mr. Akaheru" (Mr. Red Helmet). The shared surname Yamamoto seems to have pushed it up. Yakult, Hanshin, and Lotte rank above Chunichi as well, which seems to come down to how often the words "record", "victory", "baseball", "professional", and "player" appear.

How to use simple_tfidf_japanese

simple_tfidf_japanese is a Japanese-only TF-IDF calculation module that discards all alphabetic characters as noise.
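
To show what discarding alphabetic words as noise could look like, here is a small sketch. It is not taken from the module's source. The helper name `drop_alphabetic`, the regular expression, and the sample words are all assumptions for illustration.

```python
import re

# Matches any half-width or full-width Latin letter.
_LATIN = re.compile(r"[A-Za-zＡ-Ｚａ-ｚ]")


def drop_alphabetic(words):
    """Return only the words that contain no Latin letters."""
    return [w for w in words if not _LATIN.search(w)]


# Hypothetical word list mixing Japanese and alphabetic tokens.
words = ["肉", "NIIGATA", "ステーキ", "o", "新潟"]
print(drop_alphabetic(words))
```

A filter of this kind would explain why a word such as NIIGATA does not show up in the scores for the example sentence.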

# Get TF-IDF from text
from simple_tfidf_japanese.tfidf import TFIDF

# Raw string so that the backslash in the emoticon is kept literally.
text = r"Meat festival NIIGATA for a night full of meat ❤︎ Steak House Azuma-san's Yukimuro Aged Niigata Prefecture Beef Steak Delicious*\(^o^)/*Perfect for salt or wasabi!"
tfidf1 = TFIDF.gen(text, enable_one_char=1)
for key, value in tfidf1:
    print(key, value)

>>>Meat 0.0952380952381
>>>Steak 0.0952380952381
>>>0 0.047619047619
>>>Rice 0.047619047619
>>>Snow 0.047619047619
>>>Niigata 0.047619047619
>>>Aging 0.047619047619
...

# Get TF-IDF from a web page
url = "https://ja.wikipedia.org/wiki/%E6%B7%A1%E8%B7%AF%E3%83%93%E3%83%BC%E3%83%95"
tfidf2 = TFIDF.gen_web(url)
for key, value in tfidf2:
    print(key, value)

>>>Awaji 0.0453257790368
>>>Beef 0.0396600566572
>>>Tajima 0.0198300283286
>>>Awaji Island 0.0169971671388
>>>Page 0.0169971671388
>>>Display 0.014164305949

# Calculate TF-IDF cosine similarity
tfidf1 = [['Apple', 1], ['Orange', 2], ['Banana', 1], ['Kiwi', 0]]
tfidf2 = [['Apple', 1], ['Orange', 0], ['Banana', 2], ['Kiwi', 1]]
print(TFIDF.similarity(tfidf1, tfidf2))
>>> 0.5

I want to know more about janome

Read the slides that its creator presented at PyCon 2015: "Morphological analysis made and learned in Python".

I want to know more about TF-IDF Cosine Similarity

Qiita: TF-IDF Cos similarity estimation method
