- This is a summary of the data-creation procedure used in "A Year of Corona Looking Back at TF-IDF", day 22 of the Qiita Advent Calendar 2020 "Natural Language Processing".
- There are three steps: (1) scraping, (2) cleansing, and (3) morphological analysis.
** ⑴ Data acquisition by scraping **
** 1. Collect the URLs of each news article **
- We collect the URLs of corona-related news articles reported from January through December 2020 (up to 12/20).
- As the source we use the "[New Coronavirus](https://www.nippon.com/ja/tag/%E6%96%B0%E5%9E%8B%E3%82%B3%E3%83%AD%E3%83%8A%E3%82%A6%E3%82%A4%E3%83%AB%E3%82%B9/?pnum=1)" tag archive of the multilingual news site nippon.com. For the period before that tag starts (before 3/13), corona-related articles are likewise picked out of nippon.com's general news archive. A sketch of how this collection might be automated is shown after the June list below.
- Every URL has the form https://www.nippon.com/ja/ + hoge/hoge012345/, so what we actually keep is a list of the path parts. As an example, the list for June is shown below; out of consideration for copyright, only part of it is posted.
# covid-19_2020-06
pagepath = ["japan-topics/bg900175/",
"in-depth/d00592/",
"news/p01506/",
"news/p01505/",
"news/p01501/",
#Omission
"news/fnn2020060147804/",
"news/fnn2020060147795/",
"news/fnn2020060147790/"]
** 2. Get HTML data, extract necessary parts **
import requests
from bs4 import BeautifulSoup
- requests is Python's HTTP library: it sends a request to the URL to be scraped and **retrieves the HTML data**.
- BeautifulSoup is an HTML parser library: it parses the retrieved HTML and **extracts only the necessary parts**.
- The flow is: ➊ use requests to fetch the whole HTML, ➋ use BeautifulSoup to parse it and work out the conditions for locating the necessary parts, and ➌ use select to extract the parts that match those conditions.
docs = []
for i in pagepath:
    # ➊ Get the HTML data
    response = requests.get("https://www.nippon.com/ja/" + str(i))
    html_doc = response.text
    # ➋ Parse the HTML
    soup = BeautifulSoup(html_doc, 'html.parser')
    # ➌ Extract the <p> tags directly under <div class="editArea">
    target = soup.select('.editArea > p')
    # Keep only the text of each <p> tag
    value = []
    for t in target:
        val = t.get_text()
        value.append(val)
    # Drop empty strings from the list
    value_ = filter(lambda s: s != '', value)
    value_ = list(value_)
    # Remove full-width spaces (\u3000)
    doc = []
    for v in value_:
        val = v.replace('\u3000', '')
        doc.append(val)
    docs.append(doc)
- The necessary parts extracted this way are collected in docs.
- For reference, here is the process step by step. ➊ First, requests fetches the HTML data, which is converted to text and stored. ➋ Next, a BeautifulSoup object is created and the markup is formatted and printed with print(soup.prettify()); from this output we determine the conditions for locating the required parts. ➌ Finally, the extraction condition is passed to select, only the matching parts are kept, and the body text pulled out of them becomes docs. A minimal sketch of these three steps on a single article follows.
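- The sketch below applies steps ➊–➌ to a single article; the path used is simply one entry from the June list above, and the [:1000] slice is only there to keep the printed output short.
# ➊ Fetch the HTML of one article and convert it to text
page = "in-depth/d00592/"
response = requests.get("https://www.nippon.com/ja/" + page)
html_doc = response.text
# ➋ Parse it and pretty-print the markup to inspect its structure
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify()[:1000])
# ➌ Pass the extraction condition to select and keep only the body paragraphs
for p in soup.select('.editArea > p'):
    print(p.get_text())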
** ⑵ Text cleansing **
- We define cleansing as a function that removes characters and phrases that would be noise in the analysis.
** 1. Define cleansing function **
- Together with Python's regular-expression module re, we use neologdn, a module that normalizes Japanese text.
!pip install neologdn===0.3.2
- Here we specify the patterns to delete or replace. In addition to standard punctuation and parentheses, rules were added as needed after trial runs of the morphological analysis on the half year of data from January to June.
import re
import neologdn
def cleansing(text):
    text = ','.join(text)  # Flatten the list into one comma-delimited string
    text = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)  # Remove URLs
    text = neologdn.normalize(text)  # Normalize: alphanumerics to half-width, katakana to full-width
    text = re.sub(r'[0-9]{4}年', '', text)  # Remove dates (yyyy年)
    text = re.sub(r'[0-9]{2}年', '', text)  # Remove dates (yy年)
    text = re.sub(r'\d+月', '', text)  # Remove dates (month)
    text = re.sub(r'\d+日', '', text)  # Remove dates (day)
    text = re.sub(r'\d+時', '', text)  # Remove times (hour)
    text = re.sub(r'\d+分', '', text)  # Remove times (minute)
    text = re.sub(r'\d+代', '', text)  # Remove age brackets (e.g. 30代)
    text = re.sub(r'\d+人', '', text)  # Remove head counts (人)
    text = re.sub(r'\d+万人', '', text)  # Remove head counts (万人)
    text = re.sub(r'\d+\.\d+\%', '', text)  # Remove percentages (decimal)
    text = re.sub(r'\d+\%', '', text)  # Remove percentages (integer)
    text = re.sub(r'\d+\.\d+%', '', text)  # Remove percentages (decimal)
    text = re.sub(r'\d+%', '', text)  # Remove percentages (integer)
    text = re.sub(r'\d+カ月', '', text)  # Remove month counts (e.g. 3カ月)
    text = re.sub(r'\【.*\】', '', text)  # Remove 【 】 and their contents
    text = re.sub(r'\[.*\]', '', text)  # Remove [ ] and their contents
    text = re.sub(r'、|。', '', text)  # Remove punctuation
    text = re.sub(r'「|」|『|』|\(|\)|\(|\)', '', text)  # Remove brackets and parentheses
    text = re.sub(r':|:|=|=|/|/|~|~|・', '', text)  # Remove other symbols
    # News-source credits
    text = text.replace("Afro", "")
    text = text.replace("Jiji Press", "")
    text = text.replace("Current events", "")
    text = text.replace("TV nishinippon", "")
    text = text.replace("Kansai TV", "")
    text = text.replace("Fuji Television Network, Inc", "")
    text = text.replace("FNN Prime Online", "")
    text = text.replace("Nippon Dotcom Editorial Department", "")
    text = text.replace("unerry", "")
    text = text.replace("THE PAGE", "")
    text = text.replace("THE PAGE Youtube channel", "")
    text = text.replace("Live News it!", "")
    text = text.replace("AFP", "")
    text = text.replace("KDDI", "")
    text = text.replace("Pakutaso", "")
    text = text.replace("PIXTA", "")
    # Idioms / set phrases
    text = text.replace("Banner photo", "")
    text = text.replace("Photo courtesy", "")
    text = text.replace("Document photo", "")
    text = text.replace("Below photo", "")
    text = text.replace("Banner image", "")
    text = text.replace("Image courtesy", "")
    text = text.replace("Photographed by the author", "")
    text = text.replace("Provided by the author", "")
    text = text.replace("Click here for original articles and videos", "")
    text = text.replace("Click here for the original article", "")
    text = text.replace("Published", "")
    text = text.replace("photograph", "")
    text = text.replace("source", "")
    text = text.replace("Video", "")
    text = text.replace("Offer", "")
    text = text.replace("Newsroom", "")
    # Unnecessary spaces and line breaks
    text = text.rstrip()  # Strip trailing line breaks and spaces
    text = text.replace("\xa0", "")
    text = text.upper()  # Uppercase the alphabet
    text = re.sub(r'\d+', '', text)  # Remove remaining Arabic numerals
    return text
** 2. Execution of cleansing process **
docs_ = []
for i in docs:
    text = cleansing(i)
    docs_.append(text)
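- As a quick sanity check, here is a hypothetical example of what cleansing does; the sample sentences below are made up and are not taken from the scraped data.
# Hypothetical sample: a list of sentences, as produced by the scraping step
sample = ["2020年6月1日、新型コロナウイルスの感染者は100人だった。",
          "詳しくは https://example.com/foo を参照。"]
print(cleansing(sample))
# The dates, the count, the URL and the punctuation are stripped out,
# leaving essentially only the content words.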
** ⑶ Consideration and designation of stop words **
- Before specifying the stop words, I checked, just in case, what kinds of alphabetic words and phrases appear in the text.
** 1. Get the alphabetic phrases **
- The arguments of re.findall() are (pattern to search for, string to search, flags). Passing re.ASCII as the third argument restricts \w to ASCII characters only (half-width letters, digits and underscores), so only alphabetic tokens are picked up.
alphabets = []
for i in docs_:
    alphabet = re.findall(r'\w+', i, re.ASCII)
    if alphabet:
        alphabets.append(alphabet)
print(alphabets)
** 2. Get the 10 most frequent words **
- itertools is a standard-library module of iterator-building functions that handle loop-heavy processing more efficiently than plain for statements.
- itertools.chain.from_iterable flattens all the elements of a nested list into a single list.
- The occurrence count of each word is obtained with the Python standard library collections: collections.Counter counts the words, and most_common(10) returns the 10 most frequent.
import itertools
import collections
from collections import Counter
import pandas as pd
#Flattening a multidimensional array
alphabets_list = list(itertools.chain.from_iterable(alphabets))
#Get the number of appearances
cnt = Counter(alphabets_list)
#Get the top 10 words
cnt_sorted = cnt.most_common(10)
#Data frame
pd.DataFrame(cnt_sorted, columns=["English words", "Number of appearances"])
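- For a quick feel of what most_common returns, a tiny hypothetical example:
# Hypothetical illustration of Counter.most_common
Counter(["COVID", "PCR", "COVID"]).most_common(2)
# -> [('COVID', 2), ('PCR', 1)]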
- The point is to check whether words that are not really part of the article text, such as news-provider credits, appear with high frequency.
** 3. Specify the stop words **
- Words that do not have a specific meaning by themselves are excluded as stop words.
stopwords = ["one", "two", "three", "four", "Five", "Six", "Seven", "Eight", "Nine", "〇",  # Kanji numerals
             "which one", "Which", "Which", "Where", "who is it", "Who", "what", "When",  # Interrogatives / indefinites
             "this", "It", "that", "Here", "there", "there",  # Demonstratives
             "here", "Over there", "Over there", "here", "There", "あそこ",
             "I", "I", "me", "you", "You", "he", "彼女",  # Personal pronouns
             "Pieces", "Case", "Times", "Every time", "door", "surface", "Basic", "Floor", "Eaves", "Building",  # Counters
             "Stand", "Sheet", "Discount", "Anniversary", "Man", "Circle", "Year", "Time", "Person", "Ten thousand",
             "number", "Stool", "Eye", "Billion", "age", "Total", "point", "Period", "Day",
             "of", "もの", "thing", "Yo", "Sama", "Sa", "For", "Per",  # Formal nouns
             "Should be", "Other", "reason", "Yellowtail", "By the way", "home", "Inside", "Hmm",
             "Next", "Field", "limit", "Edge", "One", "for",
             "Up", "During ~", "under", "Before", "rear", "left", "right", "以上", "以下",  # Suffixes
             "Other than", "Within", "Or later", "Before", "To", "while", "Feeling", "Key", "Target",
             "Faction", "Schizophrenia", "Around", "city", "Mr", "Big", "Decrease", "ratio", "rate",
             "Around", "Tend to", "so", "Etc.", "Ra", "Mr.",
             "©", "◎", "○", "●", "▼", "*"]  # Symbols
** ⑷ Data creation by morphological analysis **
- Using the morphological analysis engine MeCab together with the mecab-ipadic-NEologd dictionary, each document is analyzed and a list of nouns only, with the stop words removed, is created.
** 1. Install MeCab and mecab-ipadic-NEologd **
# MeCab
!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!pip install mecab-python3 > /dev/null
# mecab-ipadic-NEologd
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1
#Error avoidance by symbolic links
!ln -s /etc/mecabrc /usr/local/etc/mecabrc
- Check the dictionary path.
!echo `mecab-config --dicdir`"/mecab-ipadic-neologd"
import MeCab
path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
m_neo = MeCab.Tagger(path)
- This creates a Tagger instance with path, that is, with mecab-ipadic-NEologd specified as the dictionary.
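- To see what the analyzer returns, here is a small check on a made-up sentence. Each output line is the surface form, a tab, and a comma-separated feature string whose first field is the part of speech (名詞 for a noun) and whose seventh field is the base form; the extraction loop below relies on exactly this format.
# Check the analyzer's output format (the sample sentence is made up)
print(m_neo.parse("新型コロナウイルスの感染が拡大した。"))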
** 2. Extract nouns by morphological analysis **
- Each article in the one month of article-by-article data docs_ is processed in turn, and the results are stored in noun.
noun = []
for d in docs_:
    result = []
    v1 = m_neo.parse(d)   # Result of the morphological analysis
    v2 = v1.splitlines()  # Split into one line per word
    for v in v2:
        v3 = v.split("\t")  # Split one word's line into "surface form" and "feature string" at the tab
        if len(v3) == 2:    # Skip "EOS" and empty lines
            v4 = v3[1].split(',')  # The feature string
            if (v4[0] == "名詞") and (v4[6] not in stopwords):
                # print(v4[6])
                result.append(v4[6])
    noun.append(result)
print(noun)
** 3. Format data for TF-IDF **
- sum is used to flatten the two-dimensional list into a one-dimensional list, and ' '.join turns that list into a single string of words separated by half-width spaces.
doc_06 = sum(noun, [])
text_06 = ' '.join(doc_06)
print(text_06)
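- A quick, hypothetical illustration of what these two steps do:
# Hypothetical illustration: flatten with sum, then join with half-width spaces
nested = [["感染", "拡大"], ["ワクチン"]]
flat = sum(nested, [])   # ['感染', '拡大', 'ワクチン']
joined = ' '.join(flat)  # '感染 拡大 ワクチン'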
- With the above, one month's worth of noun data for each article has been collected into one monthly document.
** 4. Download to local PC **
- Now text_06 is written to a file named 'nipponcom_covid19_2020-06.txt'. The argument 'w' specifies write mode.
with open('nipponcom_covid19_2020-06.txt', 'w') as f:
    f.write(text_06)
- google.colab.files is a module for uploading and downloading files between Colaboratory and the local PC.
from google.colab import files
files.download('nipponcom_covid19_2020-06.txt')
- The above process was carried out for all 12 months, and the files downloaded to the local PC were used for the TF-IDF analysis in the main article.