- This is a summary of the data-creation procedure used in "A Year of Corona Looking Back at TF-IDF", day 22 of the Qiita Advent Calendar 2020 "Natural Language Processing".
- There are three steps: (1) scraping, (2) cleansing, and (3) morphological analysis.
** ⑴ Data acquisition by scraping **
** 1. Collect the URLs of each news article **
- We collect the URLs of corona-related news articles reported from January through December 2020 (up to 12/20).
- As the source we use the "[New Coronavirus](https://www.nippon.com/ja/tag/%E6%96%B0%E5%9E%8B%E3%82%B3%E3%83%AD%E3%83%8A%E3%82%A6%E3%82%A4%E3%83%AB%E3%82%B9/?pnum=1)" tag archive of the multilingual news site nippon.com. For the period before that tag starts (before 3/13), corona-related articles are likewise picked out of nippon.com's general news archive. A sketch of how this collection might be automated is shown after the June list below.
- Every URL has the form https://www.nippon.com/ja/ + hoge/hoge012345/, so what we actually keep is a list of the path parts. As an example, the list for June is shown below; out of consideration for copyright, only part of it is posted.
# covid-19_2020-06
pagepath = ["japan-topics/bg900175/",
"in-depth/d00592/",
"news/p01506/",
"news/p01505/",
"news/p01501/",
#Omission
"news/fnn2020060147804/",
"news/fnn2020060147795/",
"news/fnn2020060147790/"]
** 2. Get HTML data, extract necessary parts **
import requests
from bs4 import BeautifulSoup
- requests is Python's HTTP library: it sends a request to the URL to be scraped and **retrieves the HTML data**.
- BeautifulSoup is an HTML parser library: it parses the retrieved HTML and **extracts only the necessary parts**.
- The flow is: ➊ use requests to fetch the whole HTML, ➋ use BeautifulSoup to parse it and work out the conditions for locating the necessary parts, and ➌ use select to extract the parts that match those conditions.
docs = []
for i in pagepath:
    # ➊ Get the HTML data
    response = requests.get("https://www.nippon.com/ja/" + str(i))
    html_doc = response.text
    # ➋ Parse the HTML
    soup = BeautifulSoup(html_doc, 'html.parser')
    # ➌ Extract the <p> tags directly under <div class="editArea">
    target = soup.select('.editArea > p')
    # Keep only the text of each <p> tag
    value = []
    for t in target:
        val = t.get_text()
        value.append(val)
    # Drop empty strings from the list
    value_ = filter(lambda s: s != '', value)
    value_ = list(value_)
    # Remove full-width spaces (\u3000)
    doc = []
    for v in value_:
        val = v.replace('\u3000', '')
        doc.append(val)
    docs.append(doc)
- The necessary parts extracted this way are collected in docs.
- For reference, here is the process step by step. ➊ First, requests fetches the HTML data, which is converted to text and stored. ➋ Next, a BeautifulSoup object is created and the markup is formatted and printed with print(soup.prettify()); from this output we determine the conditions for locating the required parts. ➌ Finally, the extraction condition is passed to select, only the matching parts are kept, and the body text pulled out of them becomes docs. A minimal sketch of these three steps on a single article follows.
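- The sketch below applies steps ➊–➌ to a single article; the path used is simply one entry from the June list above, and the [:1000] slice is only there to keep the printed output short.
# ➊ Fetch the HTML of one article and convert it to text
page = "in-depth/d00592/"
response = requests.get("https://www.nippon.com/ja/" + page)
html_doc = response.text
# ➋ Parse it and pretty-print the markup to inspect its structure
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify()[:1000])
# ➌ Pass the extraction condition to select and keep only the body paragraphs
for p in soup.select('.editArea > p'):
    print(p.get_text())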
** ⑵ Text cleansing **
- We define cleansing as a function that removes characters and phrases that would be noise in the analysis.
** 1. Define cleansing function **
- Together with Python's regular-expression module re, we use neologdn, a module that normalizes Japanese text.
!pip install neologdn===0.3.2
- Here we specify the patterns to delete or replace. In addition to standard punctuation and parentheses, rules were added as needed after trial runs of the morphological analysis on the half year of data from January to June.
import re
import neologdn
def cleansing(text):
    text = ','.join(text)  # Flatten the list into one comma-delimited string
    text = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)  # Remove URLs
    text = neologdn.normalize(text)  # Normalize: alphanumerics to half-width, katakana to full-width
    text = re.sub(r'[0-9]{4}年', '', text)  # Remove dates (yyyy年)
    text = re.sub(r'[0-9]{2}年', '', text)  # Remove dates (yy年)
    text = re.sub(r'\d+月', '', text)  # Remove dates (month)
    text = re.sub(r'\d+日', '', text)  # Remove dates (day)
    text = re.sub(r'\d+時', '', text)  # Remove times (hour)
    text = re.sub(r'\d+分', '', text)  # Remove times (minute)
    text = re.sub(r'\d+代', '', text)  # Remove age brackets (e.g. 30代)
    text = re.sub(r'\d+人', '', text)  # Remove head counts (人)
    text = re.sub(r'\d+万人', '', text)  # Remove head counts (万人)
    text = re.sub(r'\d+\.\d+\%', '', text)  # Remove percentages (decimal)
    text = re.sub(r'\d+\%', '', text)  # Remove percentages (integer)
    text = re.sub(r'\d+\.\d+%', '', text)  # Remove percentages (decimal)
    text = re.sub(r'\d+%', '', text)  # Remove percentages (integer)
    text = re.sub(r'\d+カ月', '', text)  # Remove month counts (e.g. 3カ月)
    text = re.sub(r'\【.*\】', '', text)  # Remove 【 】 and their contents
    text = re.sub(r'\[.*\]', '', text)  # Remove [ ] and their contents
    text = re.sub(r'、|。', '', text)  # Remove punctuation
    text = re.sub(r'「|」|『|』|\(|\)|\(|\)', '', text)  # Remove brackets and parentheses
    text = re.sub(r':|:|=|=|/|/|~|~|・', '', text)  # Remove other symbols
    # News-source credits
    text = text.replace("Afro", "")
    text = text.replace("Jiji Press", "")
    text = text.replace("Current events", "")
    text = text.replace("TV nishinippon", "")
    text = text.replace("Kansai TV", "")
    text = text.replace("Fuji Television Network, Inc", "")
    text = text.replace("FNN Prime Online", "")
    text = text.replace("Nippon Dotcom Editorial Department", "")
    text = text.replace("unerry", "")
    text = text.replace("THE PAGE", "")
    text = text.replace("THE PAGE Youtube channel", "")
    text = text.replace("Live News it!", "")
    text = text.replace("AFP", "")
    text = text.replace("KDDI", "")
    text = text.replace("Pakutaso", "")
    text = text.replace("PIXTA", "")
    # Idioms / set phrases
    text = text.replace("Banner photo", "")
    text = text.replace("Photo courtesy", "")
    text = text.replace("Document photo", "")
    text = text.replace("Below photo", "")
    text = text.replace("Banner image", "")
    text = text.replace("Image courtesy", "")
    text = text.replace("Photographed by the author", "")
    text = text.replace("Provided by the author", "")
    text = text.replace("Click here for original articles and videos", "")
    text = text.replace("Click here for the original article", "")
    text = text.replace("Published", "")
    text = text.replace("photograph", "")
    text = text.replace("source", "")
    text = text.replace("Video", "")
    text = text.replace("Offer", "")
    text = text.replace("Newsroom", "")
    # Unnecessary spaces and line breaks
    text = text.rstrip()  # Strip trailing line breaks and spaces
    text = text.replace("\xa0", "")
    text = text.upper()  # Uppercase the alphabet
    text = re.sub(r'\d+', '', text)  # Remove remaining Arabic numerals
    return text
** 2. Execution of cleansing process **
docs_ = []
for i in docs:
    text = cleansing(i)
    docs_.append(text)
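- As a quick sanity check, here is a hypothetical example of what cleansing does; the sample sentences below are made up and are not taken from the scraped data.
# Hypothetical sample: a list of sentences, as produced by the scraping step
sample = ["2020年6月1日、新型コロナウイルスの感染者は100人だった。",
          "詳しくは https://example.com/foo を参照。"]
print(cleansing(sample))
# The dates, the count, the URL and the punctuation are stripped out,
# leaving essentially only the content words.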
** ⑶ Consideration and designation of stop words **
- Before specifying the stop words, I checked, just in case, what kinds of alphabetic words and phrases appear in the text.
** 1. Get the alphabetic phrases **
- The arguments of re.findall() are (pattern to search for, string to search, flags). Passing re.ASCII as the third argument restricts \w to ASCII characters only (half-width letters, digits and underscores), so only alphabetic tokens are picked up.
alphabets = []
for i in docs_:
    alphabet = re.findall(r'\w+', i, re.ASCII)
    if alphabet:
        alphabets.append(alphabet)
print(alphabets)
** 2. Get the 10 most frequent words **
- itertools is a standard-library module of iterator-building functions that handle loop-heavy processing more efficiently than plain for statements.
- itertools.chain.from_iterable flattens all the elements of a nested list into a single list.
- The occurrence count of each word is obtained with the Python standard library collections: collections.Counter counts the words, and most_common(10) returns the 10 most frequent.
import itertools
import collections
from collections import Counter
import pandas as pd
#Flattening a multidimensional array
alphabets_list = list(itertools.chain.from_iterable(alphabets))
#Get the number of appearances
cnt = Counter(alphabets_list)
#Get the top 10 words
cnt_sorted = cnt.most_common(10)
#Data frame
pd.DataFrame(cnt_sorted, columns=["English words", "Number of appearances"])
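- For a quick feel of what most_common returns, a tiny hypothetical example:
# Hypothetical illustration of Counter.most_common
Counter(["COVID", "PCR", "COVID"]).most_common(2)
# -> [('COVID', 2), ('PCR', 1)]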
- The point is to check whether words that are not really part of the article text, such as news-provider credits, appear with high frequency.
** 3. Specify the stop words **
- Words that do not have a specific meaning by themselves are excluded as stop words.
stopwords = ["one", "two", "three", "four", "Five", "Six", "Seven", "Eight", "Nine", "〇",  # Kanji numerals
             "which one", "Which", "Which", "Where", "who is it", "Who", "what", "When",  # Interrogatives / indefinites
             "this", "It", "that", "Here", "there", "there",  # Demonstratives
             "here", "Over there", "Over there", "here", "There", "あそこ",
             "I", "I", "me", "you", "You", "he", "彼女",  # Personal pronouns
             "Pieces", "Case", "Times", "Every time", "door", "surface", "Basic", "Floor", "Eaves", "Building",  # Counters
             "Stand", "Sheet", "Discount", "Anniversary", "Man", "Circle", "Year", "Time", "Person", "Ten thousand",
             "number", "Stool", "Eye", "Billion", "age", "Total", "point", "Period", "Day",
             "of", "もの", "thing", "Yo", "Sama", "Sa", "For", "Per",  # Formal nouns
             "Should be", "Other", "reason", "Yellowtail", "By the way", "home", "Inside", "Hmm",
             "Next", "Field", "limit", "Edge", "One", "for",
             "Up", "During ~", "under", "Before", "rear", "left", "right", "以上", "以下",  # Suffixes
             "Other than", "Within", "Or later", "Before", "To", "while", "Feeling", "Key", "Target",
             "Faction", "Schizophrenia", "Around", "city", "Mr", "Big", "Decrease", "ratio", "rate",
             "Around", "Tend to", "so", "Etc.", "Ra", "Mr.",
             "©", "◎", "○", "●", "▼", "*"]  # Symbols
** ⑷ Data creation by morphological analysis **
- Using the morphological analysis engine MeCab together with the mecab-ipadic-NEologd dictionary, each document is analyzed and a list of nouns only, with the stop words removed, is created.
** 1. Install MeCab and mecab-ipadic-NEologd **
# MeCab
!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!pip install mecab-python3 > /dev/null
# mecab-ipadic-NEologd
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1
#Error avoidance by symbolic links
!ln -s /etc/mecabrc /usr/local/etc/mecabrc
- Check the dictionary path.
!echo `mecab-config --dicdir`"/mecab-ipadic-neologd"
import MeCab
path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
m_neo = MeCab.Tagger(path)
- This creates a Tagger instance with path, that is, with mecab-ipadic-NEologd specified as the dictionary.
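- To see what the analyzer returns, here is a small check on a made-up sentence. Each output line is the surface form, a tab, and a comma-separated feature string whose first field is the part of speech (名詞 for a noun) and whose seventh field is the base form; the extraction loop below relies on exactly this format.
# Check the analyzer's output format (the sample sentence is made up)
print(m_neo.parse("新型コロナウイルスの感染が拡大した。"))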
** 2. Extract nouns by morphological analysis **
- Each article in the one month of article-by-article data docs_ is processed in turn, and the results are stored in noun.
noun = []
for d in docs_:
    result = []
    v1 = m_neo.parse(d)   # Result of the morphological analysis
    v2 = v1.splitlines()  # Split into one line per word
    for v in v2:
        v3 = v.split("\t")  # Split one word's line into "surface form" and "feature string" at the tab
        if len(v3) == 2:    # Skip "EOS" and empty lines
            v4 = v3[1].split(',')  # The feature string
            if (v4[0] == "名詞") and (v4[6] not in stopwords):
                # print(v4[6])
                result.append(v4[6])
    noun.append(result)
print(noun)
** 3. Format data for TF-IDF **
- sum is used to flatten the two-dimensional list into a one-dimensional list, and ' '.join turns that list into a single string of words separated by half-width spaces.
doc_06 = sum(noun, [])
text_06 = ' '.join(doc_06)
print(text_06)
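- A quick, hypothetical illustration of what these two steps do:
# Hypothetical illustration: flatten with sum, then join with half-width spaces
nested = [["感染", "拡大"], ["ワクチン"]]
flat = sum(nested, [])   # ['感染', '拡大', 'ワクチン']
joined = ' '.join(flat)  # '感染 拡大 ワクチン'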
- With the above, one month's worth of noun data for each article has been collected into one monthly document.
** 4. Download to local PC **
- Now text_06 is written to a file named 'nipponcom_covid19_2020-06.txt'. The argument 'w' specifies write mode.
with open('nipponcom_covid19_2020-06.txt', 'w') as f:
    f.write(text_06)
- google.colab.files is a module for uploading and downloading files between Colaboratory and the local PC.
from google.colab import files
files.download('nipponcom_covid19_2020-06.txt')
- The above process was carried out for all 12 months, and the files downloaded to the local PC were used for the TF-IDF analysis in the main article.