3. Natural language processing with Python 3-3. Looking back on a year of corona with TF-IDF

We look back on the past year by applying TF-IDF analysis to news articles related to the novel coronavirus.

** ⑴ Document creation **

** 1. Data source **

** 2. Data acquisition and preprocessing **

The table below shows the number of articles collected for each month of 2020 and the main events of that month.

| Month | Articles | Main events |
|---|---|---|
| 1 | 64 | 1/6 Ministry of Health, Labour and Welfare issues an alert on "pneumonia of unknown cause in Wuhan, China"<br>1/16 First confirmed case in Japan: a Chinese man who had traveled to Wuhan |
| 2 | 210 | 2/3 Cruise ship with infections confirmed among passengers enters Yokohama port<br>2/13 A woman in her 80s living in Kanagawa Prefecture becomes the first death in Japan |
| 3 | 88 | 3/9 Expert panel calls for avoiding the "Three Cs"<br>3/24 Postponement of the Tokyo Olympics decided |
| 4 | 320 | 4/7 State of emergency declared for the Greater Tokyo Area plus Osaka, Hyogo, and Fukuoka (7 prefectures)<br>4/16 State of emergency expanded nationwide |
| 5 | 357 | 5/4 State of emergency extended until May 31<br>5/25 State of emergency lifted completely |
| 6 | 65 | 6/19 Self-restraint on movement across prefectural borders eased nationwide<br>6/29 Worldwide deaths exceed 500,000 |
| 7 | 35 | 7/3 Daily infections in Japan exceed 200 for the first time in two months<br>7/22 GoTo Travel starts / 795 daily infections in Japan, a record high |
| 8 | 18 | 8/17 April-June GDP falls at an annualized rate of 27.8%<br>8/20 Countermeasures subcommittee judges that the epidemic has peaked |
| 9 | 7 | 9/5 WHO: "Vaccine distribution will start in the middle of next year"<br>9/18 GoTo Travel reservations for trips to and from Tokyo opened |
| 10 | 12 | 10/1 GoTo Eat starts<br>10/12 Rapid spread of infection in Europe |
| 11 | 25 | 11/19 Domestic daily infections hit a record high for the second consecutive day<br>11/20 Government subcommittee recommends that the government review GoTo |
| 12 | 15 | 12/14 GoTo Travel suspended nationwide<br>12/17 Tokyo reports 822 new daily infections and raises its alert to the highest level |
| Total | 1216 | |
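The monthly text files read in step 4 below are space-delimited lists of words extracted from these articles; the data creation itself is covered in a companion article of this series. As a minimal sketch of the assumed preprocessing, using the Janome morphological analyzer and a hypothetical monthly_articles dict (neither appears in the original article):

# Sketch of the assumed preprocessing: extract nouns from each article and
# save one space-delimited text file per month
from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()

def to_noun_text(article_text):
    # Keep only nouns and join them with single spaces
    nouns = [token.surface for token in tokenizer.tokenize(article_text)
             if token.part_of_speech.split(',')[0] == '名詞']
    return " ".join(nouns)

# monthly_articles is a hypothetical dict: {"01": [article, ...], ..., "12": [article, ...]}
# for month, articles in monthly_articles.items():
#     text = " ".join(to_noun_text(a) for a in articles)
#     with open("nipponcom_covid19_2020-" + month + ".txt", "w", encoding="utf-8") as f:
#         f.write(text)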

** 3. Uploading the text files **

from google.colab import files
uploaded = files.upload()

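In Colab, files.upload() returns a dictionary keyed by file name. As a quick check (a small sketch added here, not part of the original code), the uploaded file names can be listed:

# Sketch: confirm that all 12 monthly files were uploaded
print(sorted(uploaded.keys()))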

** 4. Reading the text files **

#Zero-pad the month numbers 1 to 12 to two digits
months = ['{0:02d}'.format(i) for i in range(1,13,1)]

docs = []
for month in months:
    #Generate file name
    file_name = "nipponcom_covid19_2020-" + month + ".txt"
    #Read as text
    with open(file_name, mode='rt', encoding='utf-8-sig') as f:
        text = f.read()
        docs.append(text)

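As another quick check (a small sketch, not in the original), the length of each monthly document can be printed:

# Sketch: number of characters read for each month
for month, doc in zip(months, docs):
    print(month, len(doc))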

** ⑵ Overview of data **

** 1. Monthly word count and vocabulary size **

import pandas as pd

metrics = []
for doc in docs:
    value = []
    #Split with whitespace as delimiter
    words = pd.Series(doc.split(" "))
    #Count the number of elements
    value.append(len(words))
    #Count the number of unique elements
    value.append(words.nunique())
    metrics.append(value)

#Format as a DataFrame
names = ["Word count", "Vocabulary size"]
months = ['Month {0}'.format(i) for i in range(1, 13, 1)]
pd.DataFrame(metrics, columns=names, index=months)

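The same numbers can also be plotted to see the monthly volume at a glance (a minimal sketch, assuming matplotlib is available in the environment):

# Sketch: bar chart of the total word count per month
import matplotlib.pyplot as plt

counts = [m[0] for m in metrics]  # total word count per month
plt.figure(figsize=(8, 4))
plt.bar(range(1, 13), counts)
plt.xticks(range(1, 13))
plt.xlabel("Month")
plt.ylabel("Word count")
plt.show()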

** 2. Top 10 words by monthly appearance frequency **

from collections import Counter

rank_frequency = []
for doc in docs:
    value = []
    #Split with whitespace as delimiter
    words = pd.Series(doc.split(" "))
    #Count the frequency of each word
    cnt = Counter(words)
    v = cnt.most_common(10) #Top 10
    value.append(v)
    rank_frequency.append(value)
    
rank_frequency

import numpy as np

#Get the top 10 words each month
ranking = []
for a in rank_frequency:
    #a is a one-element list holding that month's top-10 (word, count) pairs
    temp = [pair[0] for pair in a[0]]
    ranking.append(temp)

#Format as a DataFrame
data = np.array(ranking).T
rank = ['Rank {0}'.format(i) for i in range(1, 11, 1)]
pd.DataFrame(data, columns=months, index=rank)

** ⑶ TF-IDF analysis **

from sklearn.feature_extraction.text import TfidfVectorizer

#Generate model
vectorizer = TfidfVectorizer(smooth_idf=False)
X = vectorizer.fit_transform(docs)

#Format as a DataFrame
values = X.toarray()
feature_names = vectorizer.get_feature_names_out()  #get_feature_names() in scikit-learn < 1.0
month_num = ['{0:02d}'.format(i) for i in range(1,13,1)]
df_score = pd.DataFrame(values, columns=feature_names, index=month_num)

print(df_score)

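TfidfVectorizer(smooth_idf=False) computes tf-idf(t, d) = tf(t, d) * (ln(N / df(t)) + 1) and then L2-normalizes each document row by default. The toy example below (a sketch with made-up documents, not part of the corpus) reproduces the scores by hand:

# Toy sketch: reproduce TfidfVectorizer(smooth_idf=False) by hand on two made-up documents
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["corona vaccine corona", "vaccine travel"]
toy_vec = TfidfVectorizer(smooth_idf=False)
toy_X = toy_vec.fit_transform(toy_docs).toarray()

# Manual calculation for the first document (vocabulary order: corona, travel, vaccine)
tf  = np.array([2, 0, 1])           # raw counts in document 1
dfr = np.array([1, 1, 2])           # document frequency of each term
idf = np.log(2 / dfr) + 1           # N = 2 documents
manual = tf * idf
manual = manual / np.linalg.norm(manual)  # default norm='l2'

print(toy_X[0])  # matches `manual`
print(manual)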

#Show the top 10 TF-IDF words for each month
for i in range(0,12,1):
    df_score_ = df_score[i:i+1].T
    df_score_sorted = df_score_.sort_values(month_num[i], ascending=False)
    print(df_score_sorted.head(10))

result = []
for i,j in zip(range(0,12,1), month_num):
    test = df_score[i:i+1].T
    #Get the top 10 words
    test_sorted = test.sort_values(j, ascending=False)
    test_rank = test_sorted.head(10)
    #Keep only the word labels (the DataFrame index)
    r = test_rank.index
    result.append(r)

pd.DataFrame(result,columns=rank,index=months).T

** ⑷ Comparison of frequency of occurrence and TF-IDF analysis **

** Top 10 words by appearance frequency **

(figure: the monthly top-10 table by raw appearance frequency, built in ⑵)

** Top 10 words by TF-IDF **

(figure: the monthly top-10 table by TF-IDF score, built in ⑶)
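To quantify how much the two rankings differ, the overlap between each month's frequency top 10 and TF-IDF top 10 can be counted (a small sketch using the ranking and result lists built above, not part of the original article):

# Sketch: how many of each month's frequency top-10 words also appear in its TF-IDF top-10
overlap = [len(set(freq_top) & set(tfidf_top))
           for freq_top, tfidf_top in zip(ranking, result)]
pd.DataFrame({"Overlap (out of 10)": overlap}, index=months)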

** ⑸ Transition of new words **

import itertools

#Specify October
n = 10

#Collect the words that already appeared (non-zero TF-IDF) in the months before month n
word_list = []
for i in range(0,n,1):
    df = df_score[i:n-1]
    df = df.loc[:, (df != 0).any(axis=0)]
    word = list(df.columns)
    word_list.append(word)

#Flatten to one dimension
word_list = list(itertools.chain.from_iterable(word_list))

len(word_list)

#Extract only the current month (words with non-zero TF-IDF)
df_current = df_score[n-1:n]
df_current = df_current.loc[:, (df_current != 0).any(axis=0)]

#Remove the words already seen in earlier months
for i in word_list:
    if i in df_current:
        df_current = df_current.drop(i, axis=1)

#Extract the top 10 new words by TF-IDF
df_current = df_current.T
df_sorted = df_current.sort_values(month_num[n-1], ascending=False)  #the index is zero-padded ('10'), so use month_num
df_sorted.head(10)

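Finally, the new-word extraction above can be wrapped in a small helper so it can be rerun for any month (a sketch added here; new_words is a hypothetical name, not part of the original code):

# Sketch: top TF-IDF words that appear for the first time in month n (1-12)
def new_words(df_score, n, top=10):
    # Words with a non-zero score in any month before month n
    prev = df_score.iloc[:n-1]
    seen = set(prev.loc[:, (prev != 0).any(axis=0)].columns)
    # Words present in the current month that were not seen before
    cur = df_score.iloc[n-1:n]
    cur = cur.loc[:, (cur != 0).any(axis=0)]
    cur = cur.drop(columns=[w for w in cur.columns if w in seen])
    # Rank the remaining new words by this month's TF-IDF score
    return cur.T.sort_values(month_num[n-1], ascending=False).head(top)

new_words(df_score, 10)  # October (n=10), same logic as the cells above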
