[PYTHON] Implementation of TF-IDF using gensim

Last time, I scraped articles from Bunshun Online and performed negative/positive (sentiment) analysis. This time, I will use gensim to compute TF-IDF on the contents of the Bunshun Online articles and extract the important words!

## Flow

  1. Reference article
  2. What is TF-IDF?
  3. Morphological analysis
  4. Preparation before executing TF-IDF
  5. Implementation of TF-IDF
  6. Output high-ranking words with TF-IDF

## Reference article

  * About TF-IDF: https://mieruca-ai.com/ai/tf-idf_okapi-bm25/
  * About tfidf in gensim: https://qiita.com/tatsuya-miyamoto/items/f1539d86ad4980624111

## What is TF-IDF?

A word with a high TF-IDF is a word that characterizes the document; conversely, a word with a low TF-IDF can be considered not very important. For example, suppose you are reading a sports newspaper, you focus on one article, and the text of that article contains the word "home run". It is easy to guess that the article is about baseball. In such a case, the TF-IDF of the word "home run" is likely to be high. On the other hand, suppose the text of an article contains the word "player". Since it is a sports newspaper, that word is probably used in almost every article. In that case, the TF-IDF of the word "player" should be low. For details, please refer to the reference articles above.
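
As a rough reference, the definition I have in mind is the standard form below, where tf(t, d) is how many times word t appears in article d, df(t) is how many articles contain t, and N is the total number of articles. (This is just a sketch of the usual formula; as far as I know, gensim's `TfidfModel` uses essentially this weighting by default, with a base-2 logarithm, and its options are configurable.)

```math
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}
```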

This time, I will use TF-IDF to extract the important words for each article. Let's get to work. First, load the required libraries.

```python
from gensim import corpora
from gensim import models
from janome.tokenizer import Tokenizer
from pprint import pprint
import pandas as pd
import logging
```

I saved the Bunshun Online articles scraped in the previous post as a CSV file, so I will use that.

```python
df = pd.read_csv('/work/data/BunshuOnline/news.csv')
doc_list = list(df.news_page_list)
```

Let's check the content of the article.

```python
for i in range(10):
    print('%s.' % i, '〜%s〜' % df['title'][i])
    print(df['news_page_list'][i][:200], end='\n\n')
```
  1. ~ Does the turbulence depend on the running of the rookies? Pay attention to the first team of the "monster generation" Read the 97th Hakone Ekiden Preliminary Round ~ "Every year, people say, 'Ekiden is different from track racing, so you shouldn't count on the freshmen too much as a force.' But this year, the atmosphere is a little different." So says, with some surprise, a sports newspaper reporter in charge of ekiden coverage this season. Amid the coronavirus pandemic that has continued since early spring, the sports world has been greatly affected this year. The same is true for the student long-distance world, which has been hit not only by the cancellation of meets from spring to summer, but also by restrictions on recording sessions and practice at each school.

  2. ~ Isn't a smartphone or tablet insufficient? What is the best PC for "online lessons"? 2020 was truly an important year to note in the history of digital education. First of all, programming education became compulsory in elementary school. And with the unexpected arrival of Corona, learning with digital devices has become inevitable. While the whole family stayed at home in Corona, parents would work remotely and children would study online, so many readers would have had the experience of having to compete with their parents for only one PC at home.

  3. ~ A former wife and a child confess the DV of Mr. Yoshifumi Yokomine, the founder of early childhood education "Yokomine style" << To the court battle "full confrontation" >> ~ A nursery school child jumps 10 steps in the vaulting box and walks upside down. He also learned multiplication tables, read and write kanji, and read an average of 2000 books in three years. With the motto "All children are geniuses", there is an original educational method that has spread from Kagoshima to the whole country. This teaching method, which ignites children's aspirations and competitiveness and develops their ability to learn by themselves without forcing them, is called the "Yokomine style". Today, about 400 nursery schools and kindergartens nationwide have introduced this horizontal curriculum.

  4. ~ President of BTS company, 8th largest stock millionaire in South Korea ... "The common point is geeky" J.Y.Park said It was a big fuss before the listing. On October 15, "Big Hit Entertainment (BH Entertainment)", the agency of "BTS", was listed on KOSPI (Korea Exchange). The initial price is 2.6 times the public offering price, and the market capitalization is 8,816.9 billion won (about 880 billion yen as of 15:00 on the same day). Far beyond the combined market capitalization of the three major Korean music agencies, it has become the largest music agency in South Korea. founder,

  5. ~ The story that the behavior became suspicious enough to be surprised by the flow from "I do not need a plastic shopping bag" ~ Saho Yamamoto, a manga artist, spells out troublesome days that attract troublesome people. This time, I talked about shopping at a pharmacy and a supermarket. Updated every Thursday. The update has been delayed recently, but the next one will be released on Thursday, October 22nd (probably). The book "Today is a bad day" is now on sale. Volume 1 filled with Mr. Yamamoto's happenings is a great depiction of the "fight" with that ward office that was on fire !! Today is also a difficult day 1 Saho Yamamoto

  6. ~ Haruma Miura, Yuko Takeuchi, Hana Kimura ... Why does the entertainer "chain of suicide" occur? ~ It has been almost three months since Haruma Miura died, but reports of celebrity deaths have continued. It is not just Miura. Hana Kimura, who appeared in the reality program "Terrace House", Sei Ashina, Takashi Fujiki, Yuko Takeuchi ... Mr. Tamaki Saito, a professor at the University of Tsukuba, warns that the death of a celebrity can lead to the "next death". "It's hard to say, but when a patient dies in psychiatry, many patients choose to die in the same way.

  7. ~ Obtaining internal materials Daily allowance of 40,000 yen for employees seconded to "GoTo Travel Secretariat" The government's tourism support measure "GoTo Travel Project" has been added to travel to and from Tokyo from October 1st. According to an interview with Shukan Bunshun, a large travel agency employee who is seconded to the "GoTo Travel Secretariat", which is in charge of the operation, is paid a large daily allowance from the government. The GoTo Travel Secretariat consists of KNT-CT Holdings, which has Kinki Nippon Tourist under its umbrella, led by JTB, which is the largest in the industry, except for the All Nippon Travel Agents Association (ANTA).

  8. ~ "What's your age? Why am I the only one who" knocks up "!" "The last monster secretary general" Toshihiro Nikai was sharpened ~ Toshihiro Nikai, who is breaking the record of total employment of the secretary general of the LDP, is often reported as a sleeper, that is, a "skilled person in political technology." It is known that the policy is pro-Chinese and a member of the transportation and land improvement projects, but the political idea is not surprisingly known. Upstairs, who hasn't seen much in the media, talked about "the origin of politics" in an interview this time. I was given 30 minutes, and in the limited time, I asked a number of questions upstairs in quick succession.

  9. ~ "The dismembered body in a pot ..." Takahiro Shiraishi testified that the whole story of the horrifying crime ~ "I turned behind and squeezed my neck with my left arm." Takahiro Shiraishi was accused of robbery and forced sexual murder in October 2017 when the bodies of nine men and women were found in an apartment in Zama City, Kanagawa Prefecture. The lay judge trial of the defendant (30) was held on October 14th at the Tokyo District Court Tachikawa Branch (Judge Naokuni Yano). On this day, the accused was asked about the case of Mr. C (20, male at that time) who killed the third person. Defendant Shiraishi said that Mr. C did not agree to be killed at the time of the murder.

  10. ~ Not "length" ... "Unexpected words" that hairdressers teach when cutting hair ~ The comforters are getting more and more comfortable these days. For those of you who are thinking about putting on a thin jacket or changing your hairstyle to fall / winter specifications due to the sudden temperature difference between morning and night. When you make a reservation and actually meet with a beautician, do you ever lose track of how to convey your hope? How can I tell the beautician this time? It is a story. Here are some examples of common customer orders.

## Morphological analysis

Perform morphological analysis. It can be implemented easily using janome's Tokenizer(). We don't need symbols, so we use part_of_speech to exclude them.
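
For reference, here is a minimal sketch of what a janome token looks like for a single sentence (the sentence is just an arbitrary example), showing the attributes used below:

```python
# surface form, top-level POS tag, and dictionary (base) form of each token;
# the trailing '。' should come out with the POS tag '記号' (symbol)
for token in Tokenizer().tokenize('今日は野球を見ます。'):
    print(token.surface, token.part_of_speech.split(',')[0], token.base_form)
```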

```python
t = Tokenizer()
wakati_list = []
for doc in doc_list:
    tokens = t.tokenize(doc)
    wakati = []
    for token in tokens:
        # janome returns Japanese POS tags, so filter on '記号' (symbol)
        if token.part_of_speech.split(',')[0] not in ['記号']:
            wakati.append(token.base_form)
    wakati_list.append(wakati)
```

Let's check if the word-separation and symbol removal were successful.

```python
for wakati in wakati_list:
    print(wakati[:10])
```

['Annual',' Da',' Ekiden',' is','Truck',' and',' is',' Another',' Da','From']
['2020','year',' is','exactly','digital','education','',' history','for',' special note']
['Nursery',' Child',' Ga',' Vaulting Box', '10',' Step','',' Jump',' Inverted','Walking']
['Listed','Before','From','Fuss',' Da',' Ta', '10',' Month', '15',' Day']
['Manga',' House','',' Yamamoto',' Sa',' Hosu',' N',' Ga',' Trouble',' Da']
['Miura','Spring','Huma','san','ga','death','te','from', '3','months']
['10','Month', '1','Day','From','Tokyo','Departure / Arrival','',' Travel','Mo']
['Liberal Party',' Secretary',' Chief','',' Total',' In-service',' Record',' to',' Update','Middle']
['Back',' to',' turn',' te',' neck',' to',' left arm',' with','squeeze','
['Comforter','ga','gradually','comfortable','naru','te','come','ta','today','this']

## Preparation before executing TF-IDF

Next, use corpora from gensim to assign an ID to each word.

```python
dictionary = corpora.Dictionary(wakati_list)
print('==={word: ID}===')
pprint(dictionary.token2id)
```

==={word: ID}===
{'!': 1021,
 '!!': 1822,
 ',': 2023,
 '-': 568,
 '.': 569,
 'What': 1084,
 'Why': 1085,
 'Somehow': 2693,
 'Nah': 2694,
 'To': 104,
~~~~ <omitted> ~~~~~
 'Memo App': 2497,
 'Mental': 1890,
 'Members': 174,
 'Manufacturer': 742,
 'Modern': 743}
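
As an optional sanity check of this mapping: `token2id` goes from a word to its ID, and indexing the dictionary with an ID goes back to the word. A minimal sketch, assuming the word 'PC' made it into the dictionary (it does appear in the articles above):

```python
# word -> ID, and ID -> word (assumes 'PC' is actually in the dictionary)
word_id = dictionary.token2id['PC']
print(word_id)              # the ID assigned to 'PC'
print(dictionary[word_id])  # back from the ID to the word, i.e. 'PC'
```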


Next, count the number of occurrences of each word in each article.


```python
corpus = list(map(dictionary.doc2bow, wakati_list))
print('===(Word ID,Number of appearances)===')
pprint(corpus)
```

=== (word ID, number of occurrences) ===
   [[(0, 10),
     (1, 5),
     (2, 1),
     (3, 1),
     (4, 1),
~~~~ <omitted> ~~~~~
     (2872, 1),
     (2873, 1),
     (2874, 2),
     (2875, 1),
     (2876, 1)]]

## Implementation of TF-IDF
Next, calculate TF-IDF from these occurrence counts.
Use `models` from `gensim`.
`gensim` really has a lot of features.
Let's display part of the calculation result.


```python
test_model = models.TfidfModel(corpus)
corpus_tfidf = test_model[corpus]
print('===(Word ID, TF-IDF)===')
for doc in corpus_tfidf:
    print(doc[:4])
```

=== (word ID, TF-IDF) ===
   [(0, 0.010270560215244124), (1, 0.010876034850944512), (2, 0.011736346492935309), (3, 0.006756809998052433)]
   [(0, 0.0042816896254018535), (3, 0.005633687484063879), (5, 0.008303667688079712), (7, 0.005633687484063879)]
   [(0, 0.001569848428761509), (1, 0.006649579327355055), (3, 0.005163870001530652), (5, 0.00761118904775017)]
   [(0, 0.004119674666568976), (1, 0.006543809340846026), (3, 0.006775642806339103), (5, 0.02496707813315839)]
   [(0, 0.01276831026581211), (1, 0.013521033373868814), (7, 0.04200016584013773), (13, 0.04200016584013773)]
   [(1, 0.007831949836845296), (2, 0.0422573475812842), (7, 0.024328258270175436), (11, 0.00625933452375299)]
   [(0, 0.00115918318434994), (1, 0.004910079468851687), (9, 0.013246186396066643), (11, 0.0039241607229361)]
   [(0, 0.0014966665107978136), (1, 0.006339594643539755), (2, 0.06841066655360874), (3, 0.009846289814745559)]
   [(0, 0.0026696482016164125), (1, 0.003769373987012683), (5, 0.004314471125835394), (6, 0.007739059517798651)]
   [(0, 0.0011200718719816475), (25, 0.020161293695669654), (26, 0.06631869535917725), (27, 0.0037917579432113335)]
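
As an aside, these values are, as far as I understand gensim's defaults, TF-IDF weights that have been L2-normalized per document, so the squared weights of a single article should sum to roughly 1. A minimal sketch to check that assumption, reusing `test_model` and `corpus` from above:

```python
# Rough check of the assumption that TfidfModel L2-normalizes each document
# vector by default (normalize=True).
doc0_tfidf = test_model[corpus[0]]  # TF-IDF weights of the first article
squared_norm = sum(weight ** 2 for _, weight in doc0_tfidf)
print(squared_norm)  # expected to be approximately 1.0 if the assumption holds
```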


Since the word IDs are hard to interpret on their own, let's display the words themselves.


```python
texts_tfidf = []
for doc in corpus_tfidf:
    text_tfidf = []
    for word in doc:
        text_tfidf.append([dictionary[word[0]], word[1]])
    texts_tfidf.append(text_tfidf)
print('===[word, TF-IDF]===')
for i in texts_tfidf:
    print(i[20:24])
```

=== [word, TF-IDF] ===
[['U', 0.022445636964346205], ['m', 0.11222818482173101], ['To the end', 0.022445636964346205], ['Over there', 0.022445636964346205]]
[['Solid', 0.02616203449412644], ['Shut up', 0.002898944443406048], ['Ja', 0.004151833844039856], ['Let', 0.005797888886812096]]
[['This', 0.0026571889707680363], ['More', 0.013652529666738833], ['Shut up', 0.010628755883072145], ['Ja', 0.003805594523875085]]
[['Here', 0.006775642806339103], ['Koto', 0.008239349333137951], ['This', 0.0034865640168190394], ['Well', 0.015732555392960215]]
[['Article', 0.006384155132906055], ['Home', 0.04200016584013773], ['Recently', 0.07295284301359403], ['That', 0.09752136505414427]]
[['Te', 0.053620572470353435], ['Can', 0.01792908931110876], ['is', 0.03132779934738118], ['But', 0.0018489852575983943]]
[['That', 0.0034775495530498207], ['In the first place', 0.010081089692211678], ['That', 0.020865297318298923], ['It', 0.019640317875406748]]
[['Ja', 0.007256376823656625], ['Too', 0.017102666638402184], ['Let', 0.005066636590574954], ['Yes', 0.01451275364731325]]
[['Sure', 0.009037509134257363], ['Ja', 0.004314471125835394], ['Sell', 0.018075018268514726], ['Yes', 0.012943413377506183]]
[['Yes', 0.005430510747744854], ['That', 0.006720431231889886], ['In the first place', 0.00974094962350806], ['That', 0.01948189924701612]]

## Output high-ranking words with TF-IDF
Finally, let's display the top words for each article by TF-IDF!
This time, I will display only the words with a TF-IDF greater than 0.13!


```python
for i in range(len(texts_tfidf)):
    print('')
    print('%s.' % i, '〜%s〜' % df['title'][i])
    for text in texts_tfidf[i]:
        if text[1] > 0.13:
            print(text)
```

   
0. ~ Does the turbulence depend on the running of the rookies? Pay attention to the first team of the "monster generation" Read the 97th Hakone Ekiden Preliminary Round ~
['Rookie', 0.29179328053650067]
['Qualifying', 0.26934764357215446]
['Meeting', 0.2353324044944065]
['Grade', 0.13467382178607723]
['School', 0.17956509571476964]
['Director', 0.13467382178607723]
['Practice', 0.15711945875042344]
['Running', 0.15711945875042344]
['Ekiden', 0.3591301914295393]
['High School', 0.14119944269664392]
   
1. ~ Isn't a smartphone or tablet insufficient? What is the best PC for "online lessons"?
   ['.', 0.17005322421182187]
   ['/', 0.2807205709669065]
   ['LAVIE', 0.1684323425801439]
   ['Office', 0.1310029331178897]
   ['PC', 0.561441141933813]
['Children', 0.2092962759530115]
['Learning', 0.14971763784901682]
['Installed', 0.1684323425801439]
['Function', 0.1684323425801439]
   
2. ~ A former wife and a child confess the DV of Mr. Yoshifumi Yokomine, the founder of early childhood education "Yokomine style" << To the court battle "full confrontation" >> ~
['Education', 0.15248089693189756]
   ['A', 0.1435114324064918]
['Mine', 0.1715400483643072]
['Horizontal', 0.1715400483643072]
['Nursery', 0.37738810640147585]
['Yoshifumi', 0.3259260918921837]
['Child', 0.2744640773828915]
['Children', 0.2573100725464608]
['Violence', 0.2230020628735994]
['Mr.', 0.17938929050811475]
['Grandmother', 0.13723203869144576]
['Kagoshima', 0.13723203869144576]
   
3. ~ President of BTS company, 8th largest stock millionaire in South Korea ... "The common point is geeky" J.Y.Park said
   ['.', 0.14159299853664192]
   ['BH', 0.20257378379369387]
   ['BoA', 0.13504918919579592]
   ['JYP', 0.13504918919579592]
   ['SM', 0.13504918919579592]
['Artist', 0.13504918919579592]
['Office', 0.27009837839159184]
['Billion', 0.13504918919579592]
['Company', 0.28318599707328385]
['Korea', 0.40514756758738774]
   
4. ~ The story that the behavior became suspicious enough to be surprised by the flow from "I do not need a plastic shopping bag" ~
['Update', 0.14590568602718806]
   ['!!', 0.13952153089428201]
   ['22', 0.13952153089428201]
['Grate', 0.13952153089428201]
['Today', 0.27904306178856403]
['Maybe', 0.13952153089428201]
['Spell', 0.13952153089428201]
['Hosu', 0.13952153089428201]
['Horu', 0.13952153089428201]
['Trouble', 0.13952153089428201]
['Happening', 0.13952153089428201]
['Ward Office', 0.13952153089428201]
['Book', 0.13952153089428201]
['Troublesome', 0.13952153089428201]
['Troublesome day', 0.27904306178856403]
['Significantly', 0.13952153089428201]
['Popular', 0.13952153089428201]
['Yamamoto', 0.41856459268284607]
['Volume', 0.13952153089428201]
['Attract', 0.13952153089428201]
['Thursday', 0.13952153089428201]
['Next time', 0.13952153089428201]
['Weekly', 0.13952153089428201]
['Cartoon', 0.13952153089428201]
['Flame', 0.13952153089428201]
['Pharmacy', 0.13952153089428201]
['Clog', 0.13952153089428201]
['Shopping', 0.13952153089428201]
['Fight', 0.13952153089428201]
   
5. ~ Haruma Miura, Yuko Takeuchi, Hana Kimura ... Why does the entertainer "chain of suicide" occur? ~
['San', 0.14596954962105263]
['Projection', 0.16163344929474321]
['Kimura', 0.16163344929474321]
['Death', 0.4849003478842297]
['Suicide', 0.14790071653449471]
['Celebrity', 0.25419809869738275]
   
6. ~ Obtaining internal materials Daily allowance of 40,000 yen for employees seconded to "GoTo Travel Secretariat"
['Travel', 0.1593642568499776]
   [',', 0.20266551686226678]
   ['GoTo', 0.32933146490118353]
['Travel', 0.20266551686226678]
['Office work', 0.3546646545089669]
['Seconded', 0.1519991376467001]
['Major', 0.17733232725448345]
['Station', 0.3546646545089669]
['Engineer', 0.17733232725448345]
['Technology', 0.17707139649997508]
['Daily allowance', 0.25333189607783346]
['Employees', 0.17733232725448345]
   
7. ~ "What's your age? Why am I the only one who" knocks up "!" "The last monster secretary general" Toshihiro Nikai was sharpened ~
['2', 0.25653999957603274]
['Floor', 0.3200732773176539]
['Country', 0.13717426170756597]
['Toshihiro', 0.13083466706402622]
['Origin', 0.13083466706402622]
['Equilibrium', 0.16354333383003275]
['Politics', 0.26166933412805243]
['Development', 0.22896066736204587]
['Prefectural Assembly', 0.13083466706402622]
['Secretary', 0.16354333383003275]
['Road', 0.22896066736204587]
   
8. ~ "The dismembered body in a pot ..." Takahiro Shiraishi testified that the whole story of the horrifying crime ~
['San', 0.2575923910688616]
   ['C', 0.5029569855573655]
['Killing', 0.3889560913276654]
['Shiraishi', 0.3695082867612821]
['Defendant', 0.3889560913276654]
   
9. ~ Not "length" ... "Unexpected words" that hairdressers teach when cutting hair ~
['Master', 0.23953570973227745]
['Customer', 0.19582749984882405]
['Hair', 0.14687062488661806]
['Hairstyle', 0.44061187465985413]
['Cosmetology', 0.44061187465985413]


This was a simple use of `gensim`, but it seems able to extract words that are, to some extent, genuinely important.
I found `gensim` to be very useful.

Next time, I will use `TF-IDF` and `word2vec` to determine the themes of the articles and implement clustering!

