3. Natural language processing with Python 3-4. Looking back on a year of corona with TF-IDF [Data creation]

** ⑴ Data acquisition by scraping **

** 1. Collect the URLs for each news article **

# covid-19_2020-06
pagepath = ["japan-topics/bg900175/",
            "in-depth/d00592/",
            "news/p01506/",
            "news/p01505/",
            "news/p01501/",
            # ... (omitted) ...
            "news/fnn2020060147804/",
            "news/fnn2020060147795/",
            "news/fnn2020060147790/"]

** 2. Get the HTML data and extract the necessary parts **

import requests
from bs4 import BeautifulSoup
docs = []
for i in pagepath:
    #➊ Get HTML data
    response = requests.get("https://www.nippon.com/ja/" + i)
    html_doc = response.text

    #➋ Parse the HTML
    soup = BeautifulSoup(html_doc, 'html.parser')
    # ➌ Extract the <p> tags directly under <div class="editArea">
    target = soup.select('.editArea > p')

    # Extract only the text from each <p> tag
    value = []
    for t in target:
        val = t.get_text()
        value.append(val)

    # Remove empty strings from the list
    value_ = list(filter(lambda s: s != '', value))

    #"Full-width blank"\delete "u3000"
    doc = []
    for v in value_:
        val = v.replace('\u3000', '')
        doc.append(val)

    docs.append(doc)
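
As a quick sanity check (optional), each element of docs should now hold the paragraph list for one article:

print(len(docs))    # should equal len(pagepath)
print(docs[0][:3])  # first few paragraphs of the first article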

** ⑵ Text cleansing **

** 1. Define a cleansing function **

!pip install neologdn==0.3.2
import re
import neologdn

def cleansing(text):
    text = ','.join(text)  # Flatten the list into one comma-delimited string
    text = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)  # Remove URLs
    text = neologdn.normalize(text)  # Normalize: alphanumerics to half-width, katakana to full-width
    text = re.sub(r'[0-9]{4}年', '', text)  # Remove dates (yyyy年)
    text = re.sub(r'[0-9]{2}年', '', text)  # Remove dates (yy年)
    text = re.sub(r'\d+月', '', text)  # Remove dates (〜月)
    text = re.sub(r'\d+日', '', text)  # Remove dates (〜日)
    text = re.sub(r'\d+時', '', text)  # Remove times (〜時)
    text = re.sub(r'\d+分', '', text)  # Remove times (〜分)
    text = re.sub(r'\d+代', '', text)  # Remove age brackets (〜代)
    text = re.sub(r'\d+人', '', text)  # Remove head counts (〜人)
    text = re.sub(r'\d+万人', '', text)  # Remove head counts (〜万人)
    text = re.sub(r'\d+\.\d+%', '', text)  # Remove percentages (decimal, full-width %)
    text = re.sub(r'\d+%', '', text)  # Remove percentages (integer, full-width %)
    text = re.sub(r'\d+\.\d+%', '', text)  # Remove percentages (decimal)
    text = re.sub(r'\d+%', '', text)  # Remove percentages (integer)
    text = re.sub(r'\d+ヶ月', '', text)  # Remove durations (〜ヶ月)
    text = re.sub(r'【.*】', '', text)  # Remove 【】 and their contents
    text = re.sub(r'\[.*\]', '', text)  # Remove [] and their contents
    text = re.sub(r'、|。', '', text)  # Remove Japanese punctuation
    text = re.sub(r'「|」|『|』|\(|\)|(|)', '', text)  # Remove brackets
    text = re.sub(r':|:|=|=|/|/|~|〜|・', '', text)  # Remove other signs

    # News source credits
    text = text.replace("アフロ", "")
    text = text.replace("時事通信社", "")
    text = text.replace("時事", "")
    text = text.replace("テレビ西日本", "")
    text = text.replace("関西テレビ", "")
    text = text.replace("フジテレビ", "")
    text = text.replace("FNNプライムオンライン", "")
    text = text.replace("ニッポンドットコム編集部", "")
    text = text.replace("unerry", "")
    text = text.replace("THE PAGE", "")
    text = text.replace("THE PAGE YouTubeチャンネル", "")
    text = text.replace("Live News it!", "")
    text = text.replace("AFP", "")
    text = text.replace("KDDI", "")
    text = text.replace("ぱくたそ", "")
    text = text.replace("PIXTA", "")

    # Idiomatic boilerplate phrases
    text = text.replace("バナー写真", "")
    text = text.replace("写真提供", "")
    text = text.replace("資料写真", "")
    text = text.replace("写真下", "")
    text = text.replace("バナー画像", "")
    text = text.replace("画像提供", "")
    text = text.replace("筆者撮影", "")
    text = text.replace("筆者提供", "")
    text = text.replace("元記事・動画はこちら", "")
    text = text.replace("元記事はこちら", "")
    text = text.replace("掲載", "")
    text = text.replace("写真", "")
    text = text.replace("出典", "")
    text = text.replace("動画", "")
    text = text.replace("提供", "")
    text = text.replace("編集部", "")

    # Unneeded spaces and line breaks
    text = text.rstrip()  # Strip trailing line breaks / spaces
    text = text.replace("\xa0", "")  # Remove non-breaking spaces

    text = text.upper()  # Uppercase the alphabet
    text = re.sub(r'\d+', '', text)  # Remove remaining Arabic numerals

    return text

** 2. Run the cleansing process **

docs_ = []
for i in docs:
    text = cleansing(i)
    docs_.append(text)
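
A quick check (optional): each element of docs_ should now be one cleansed string per article.

print(len(docs_))      # number of articles
print(docs_[0][:100])  # first 100 characters of the first article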


** ⑶ Examining and specifying stop words **

** 1. Extract alphabetic tokens **

alphabets = []
for i in docs_:
    alphabet = re.findall(r'\w+', i, re.ASCII)
    if alphabet:
        alphabets.append(alphabet)

print(alphabets)


** 2. Get the 10 most frequent words **

import itertools
from collections import Counter
import pandas as pd

# Flatten the nested list
alphabets_list = list(itertools.chain.from_iterable(alphabets))

# Count occurrences
cnt = Counter(alphabets_list)
# Get the 10 most frequent words
cnt_sorted = cnt.most_common(10)

# Show as a data frame
pd.DataFrame(cnt_sorted, columns=["English words", "Number of appearances"])


** 3. Specify stop words **

stopwords = ["one", "two", "three", "four", "Five", "Six", "Seven", "Eight", "Nine", "〇",  #Chinese numeral
             "which one", "Which", "Which", "Where", "who is it", "Who", "what", "When",  #Infinitive
             "this", "It", "that", "Here", "there", "there",  #Demonstrative
             "here", "Over there", "Over there", "here", "There", "あThere",
             "I", "I", "me", "you", "You", "he", "he女",  #Personal pronoun
             "Pieces", "Case", "Times", "Every time", "door", "surface", "Basic", "Floor", "Eaves", "Building",  #Classifier
             "Stand", "Sheet", "Discount", "Anniversary", "Man", "Circle", "Year", "Time", "Person", "Ten thousand", 
             "number", "Stool", "Eye", "Billion", "age", "Total", "point", "Period", "Day",
             "of", "もof", "thing", "Yo", "Sama", "Sa", "For", "Per",  #Modified noun
             "Should be", "Other", "reason", "Yellowtail", "By the way", "home", "Inside", "Hmm", 
             "Next", "Field", "limit", "Edge", "One", "for",     
             "Up", "During ~", "under", "Before", "rear", "left", "right", "以Up", "以under",  #Suffix
             "Other than", "Within", "Or later", "Before", "To", "while", "Feeling", "Key", "Target", 
             "Faction", "Schizophrenia", "Around", "city", "Mr", "Big", "Decrease", "ratio", "rate",
              "Around", "Tend to", "so", "Etc.", "Ra", "Mr.",
             "©", "◎", "○", "●", "▼", "*"]  #symbol

** ⑷ Data creation by morphological analysis **

** 1. Install MeCab and mecab-ipadic-NEologd **

# MeCab
!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!pip install mecab-python3 > /dev/null

# mecab-ipadic-NEologd
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null 
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1

# Create a symlink to avoid a mecabrc lookup error
!ln -s /etc/mecabrc /usr/local/etc/mecabrc
# Show the NEologd dictionary path
!echo `mecab-config --dicdir`"/mecab-ipadic-neologd"


import MeCab

path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
m_neo = MeCab.Tagger(path)
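
As a quick smoke test (optional), parsing a recent coinage confirms that the NEologd dictionary actually loaded; NEologd typically keeps 新型コロナウイルス as a single token where the standard IPA dictionary would split it.

print(m_neo.parse("新型コロナウイルス"))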

** 2. Extract nouns by morphological analysis **

noun = []
for d in docs_:
    result = []
    v1 = m_neo.parse(d)   # Morphological analysis result
    v2 = v1.splitlines()  # One line per token
    for v in v2:
        v3 = v.split("\t")  # Split a line into surface form and feature string
        if len(v3) == 2:    # Skip "EOS" and empty lines
            v4 = v3[1].split(',')  # Feature fields (comma-separated)
            if (v4[0] == "名詞") and (v4[6] not in stopwords):
                result.append(v4[6])  # Keep the base form (7th field)
    noun.append(result)

print(noun)
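
Optionally, check how many nouns were extracted from each article:

print([len(r) for r in noun])  # noun counts per article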


** 3. Format data for TF-IDF **

doc_06 = sum(noun, [])      # Flatten into a single list of nouns
text_06 = ' '.join(doc_06)  # One space-delimited string for TF-IDF

print(text_06)
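
As a preview of how this output is consumed in the TF-IDF analysis of part 3-3, here is a minimal sketch, assuming scikit-learn is available and that one such space-delimited string has been prepared per month:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch only: in practice the corpus would hold one string per month
corpus = [text_06]
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')  # keep single-character tokens too
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (number of documents, vocabulary size)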


** 4. Download to the local PC **

from google.colab import files

with open('nipponcom_covid19_2020-06.txt', 'w') as f:
    f.write(text_06)

files.download('nipponcom_covid19_2020-06.txt')
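
Note that files.download works only in Google Colab; in a local environment the open() block above has already written the file, so the download step can simply be skipped.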
