[PYTHON] [Introduction to WordCloud] Let's play with scraping ♬

Because a WordCloud works like a headline for a text, I tried scraping articles and generating word clouds from their sentences. Scraping itself is easy to try, but parsing the HTML had always seemed troublesome. This time, getting the article text turned out to be easy, so I am writing it up as an article.
【Reference】
① I tried scraping the main news titles of "Yahoo! News" with Python
② Get the title of Nihon Keizai Shimbun with Python3

What I did

・ Easy scraping
・ Get Yahoo News headlines
・ WC Nikkei articles

・ Easy scraping

Reference ① looked like the closest match to what I wanted to do, but it did not work as-is. Reference ② works well and is easy to understand, so I started from there. Reading Reference ② first and then going back to Reference ① leads to the following code.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import urllib3
import re

url = "http://www.nikkei.com/"
"""
#If you use requests, use this block instead
r = requests.get(url)
print(r.headers)
print("--------")
print(r.encoding)
print(r.content)
print(r.text)
soup = BeautifulSoup(r.text, 'html.parser')
#print(soup)
"""

#If you use urllib3, use this block
http = urllib3.PoolManager()
r = http.request('GET', url)

soup = BeautifulSoup(r.data, 'html.parser')
#print(soup)
#Get the title element →<title>Economic, stock, business and political news:Nikkei electronic version</title>
title_tag = soup.title
#Get the element string → Economic, Stock, Business, Political News:Nikkei electronic version
title = title_tag.string
#Output title element
print(title_tag)
#Output the title as a character string
print(title)

The same result is obtained with requests. Only the argument passed to BeautifulSoup differs slightly, because r is a different kind of object: r.text with requests, r.data with urllib3. The result is as follows.

http://www.nikkei.com/


<title>Nihon Keizai Shimbun</title>
Nihon Keizai Shimbun
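On the point that only the BeautifulSoup argument changes between the two libraries: requests exposes the decoded body as a str (r.text), while urllib3 exposes raw bytes (r.data), and BeautifulSoup accepts either. The following is a minimal offline sketch of that idea. The HTML string is a made-up stand-in for a fetched page, not the real Nikkei markup.

```python
from bs4 import BeautifulSoup

# Hypothetical page body standing in for a real HTTP response
html_text = "<html><head><title>Sample News</title></head><body><p>Hello</p></body></html>"
html_bytes = html_text.encode("utf-8")  # the kind of value urllib3's r.data holds

# str input, as with requests (r.text)
soup_from_text = BeautifulSoup(html_text, "html.parser")
# bytes input, as with urllib3 (r.data); BeautifulSoup decodes it itself
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser")

title_from_text = soup_from_text.title.string
title_from_bytes = soup_from_bytes.title.string
print(title_from_text, title_from_bytes)
```

Both paths end up with the same title string, which matches the observation that the two libraries give the same result.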

Fetching the page of a Nikkei article gives the following.

https://www.nikkei.com/article/DGXMZO56522090X00C20A3000000/


<title>Over 100,000 new corona infections become a pandemic in 3 months (photo = Reuters): Nihon Keizai Shimbun</title>
Over 100,000 new corona infections become a pandemic in 3 months (photo = Reuters): Nihon Keizai Shimbun

・ Get Yahoo News headlines

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib3

url = "https://news.yahoo.co.jp"
#If you use urllib3, use this block
http = urllib3.PoolManager()
r = http.request('GET', url)

soup = BeautifulSoup(r.data, 'html.parser')
#Get the title element
title_tag = soup.title
#Get the string of the element
title = title_tag.string
#Output title element
print(title_tag)
#Output the title as a character string
print(title)

#Added below: print the text of every <p> element
for p in soup.select("p"):
    print(p.getText())

The following results were obtained. I meant to fetch the top page, but the output differs a little from what is actually displayed there, and I am not sure why.

Result for https://news.yahoo.co.jp


<title>Yahoo!news</title>
Yahoo!news
keyword:
Search
Get new more conveniently with ID
Login
JavaScript is currently disabled. Yahoo!Please enable JavaScript settings to use all the features of the news. Click here for how to change the JavaScript settings.
See more
List of topics in all categories
Behind-door horse racing is advantageous for the most popular horses?
why? My address on Amazon ... Posted as the location of the seller. "I'm dying to worry" I feel distrust of women
"I want to jump off the ship", appealing one after another in a private room waiting, new corona
Rush back to Japan, farewell party canceled "What should I do?" Screams of Japanese residents in China and South Korea
South Korea rages at Japan's immigration restrictions, "anti-Japanese" countermeasures full of contradictions
Does the thorn stuck in the finger go to the heart? If you pull it out yourself, use a 5-yen coin
Yahoo!JAPAN special page
Summary of new coronavirus infections
Shizuoka Prefectural Assembly puts up a large number of masks on the net.
First confirmed infection in Gunma Prefecture: New coronavirus
The first infected person in Hiroshima "Seki" from a month ago
The first infected person in Hiroshima "Seki" from a month ago
New corona, domestic infection 56 people in 6 days alone
A lie that "foreign matter is mixed".
Former AKB48 "Mayu Watanabe" and "UTAGE!" Who disappeared are also absent, what is happening to her
Concerns about new corona infection: Reasons why 50% of dental clinics are dangerous
Too stupid behavior of fools who lose themselves in Corona
Japan is finally over
Why is "Shinkansen Tobira" so narrow ...? Reason for that convincing
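One possible reason the list above does not match the page as displayed is that select("p") returns every <p> element, navigation text and notices included, while headlines may sit in other tags (the output itself also notes that JavaScript is disabled). The following is a minimal sketch of narrowing the selection with a more specific CSS selector. It runs on made-up HTML, and the class names (nav, topics, footer) are assumptions for illustration, not the real Yahoo! News markup.

```python
from bs4 import BeautifulSoup

# Made-up markup: the class names are illustrative only
html = """
<html><body>
  <p class="nav">Login</p>
  <ul class="topics">
    <li><a href="/a">Headline A</a></li>
    <li><a href="/b">Headline B</a></li>
  </ul>
  <p class="footer">See more</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <p>, whatever its role on the page
all_paragraphs = [p.get_text(strip=True) for p in soup.select("p")]
# Only the links inside the (hypothetical) topics list
headlines = [a.get_text(strip=True) for a in soup.select("ul.topics li a")]

print(all_paragraphs)
print(headlines)
```

On a real page, the selector would have to be chosen by inspecting the actual HTML in the browser's developer tools.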

Next, let's output the Nikkei article in the same way. This time the information on the page was retrieved reasonably well.

https://www.nikkei.com/article/DGXMZO56522090X00C20A3000000/


<title>Over 100,000 new corona infections become a pandemic in 3 months (photo = Reuters): Nihon Keizai Shimbun</title>
Over 100,000 new corona infections become a pandemic in 3 months (photo = Reuters): Nihon Keizai Shimbun
Article save
Only available to paying members. You can also view the saved articles on your smartphone or tablet.
> New member registration
> Login
Save Evernote
Membership registration is required to use
> New member registration
> Login
If you would like to share articles at companies, reprint / duplicate in meeting materials, print orders, etc., please see the link.
Click here for details

The number of people infected with the new coronavirus is increasing rapidly in Iran (Tehran, the capital)=Reuters
[Geneva=Rintaro Hosokawa] The number of people infected with the new coronavirus in the world has exceeded 100,000. About three months after the first patient appeared in Wuhan, Hubei Province, China in December 2019, it spread rapidly to other Asian countries, Europe and the United States. Even now, the spread of the infection is not expected to end, and travel restrictions and large-scale events are being canceled one after another. The threat of viruses casts a big shadow on people's lives and business activities.

The number of infected people greatly exceeded the severe acute respiratory syndrome (SARS) that occurred in 2002-03 and the Middle East respiratory syndrome (MERS) that occurred in 2012, and became a global epidemic.
According to aggregated data from Johns Hopkins University in the United States, the number of infected people worldwide is about 102,000 (as of 6:00 am on the 7th of Japan time). About 80% of this is from China (mainland). Other than China, South Korea has the largest number, with about 6,600 people. This is followed by Iran (about 4750 people) and Italy (about 4640 people). Japan has exceeded 400 people.
Recently, while the number of new infections has decreased in China, the number of infections in other countries / regions has increased rapidly. World Health Organization (WHO) Secretary-General Tedros said he was "deeply concerned" about the global spread at a press conference on the 6th. In particular, he expressed strong concern about the spread of infection to developing countries with vulnerable medical infrastructure. He said, "I want all countries to make virus containment a top priority," and called for further measures such as strengthening medical systems and border measures.
Research on vaccines and therapeutic agents has begun in each country. According to WHO, 20 vaccines are currently under development and many clinical studies are underway for therapeutic agents. However, there are many unclear points such as infectivity of the new corona. Some say that the infection will slow down as the temperature rises toward the summer. However, WHO urged not to relax, saying, "There is no evidence that the virus will disappear in the summer, and we should assume that it will continue to spread."
Due to the spread of the new corona infection, travel restrictions and business trip / travel cancellations are spreading in each country. Auto parts factories have been forced to shut down in northern Italy, where infections are rampant. Many are concerned that the rapid stagnation of movements of people and goods will have a serious impact on the world economy. The International Monetary Fund (IMF) has indicated that its 20-year global economic growth forecast could be the lowest growth in 11 years since 2009, immediately after the financial crisis.
WHO declared the new Corona "an internationally concerned public health emergency" on January 30. About a month later, on February 28, the global risk level was raised to the highest level of "very high" in four stages.
According to WHO, it is important to thoroughly wash your hands and regularly wipe your computer with a disinfectant to prevent infection. On the other hand, asymptomatic people do not need to wear masks for preventive purposes and are urged to refrain from excessive use.

Select free / paid plan
Here are the members
Article save
Only available to paying members. You can also view the saved articles on your smartphone or tablet.
> New member registration
> Login
Save Evernote
Membership registration is required to use
> New member registration
> Login
If you would like to share articles at companies, reprint / duplicate in meeting materials, print orders, etc., please see the link.
Click here for details
Electronic version top
Press release
Typhoon No. 19 Relief Fund Acceptance
The electronic version is free for the first month! Click here for details
weather
Press release search
Account list
Correction / Apology

・ WC Nikkei articles

At this point, all that remains is to add the WordCloud (WC) code. The script is WordCloud/yahoo_title.py.

#$ python3 yahoo_title.py -d /usr/lib/aarch64-linux-gnu/mecab/dic/mecab-ipadic-neologd conversation_anzen.csv -s stop_words.txt

import argparse
import os
import re

import matplotlib.pyplot as plt
import urllib3
from bs4 import BeautifulSoup
from MeCab import Tagger
from wordcloud import WordCloud

parser = argparse.ArgumentParser(description="convert csv")
#The positional csv argument is kept for the command line above; it is not read in this script
parser.add_argument("input", type=str, help="csv file")
parser.add_argument("--dictionary", "-d", type=str, help="mecab dictionary")
parser.add_argument("--stop_words", "-s", type=str, help="stop words list")
args = parser.parse_args()

#Fall back to MeCab's default dictionary when -d is not given
t = Tagger(("-d " + args.dictionary) if args.dictionary else "")

#url = "https://news.yahoo.co.jp"
url = "https://www.nikkei.com/article/DGXMZO56522090X00C20A3000000/"

stop_words = []
if args.stop_words:
    with open(args.stop_words, "r", encoding="utf-8") as f:
        stop_words = [line.strip() for line in f]

#A function that converts a list of tokens to a space-separated string
def join_list_str(tokens):
    return ' '.join(tokens)

#Stopword exclusion function
def exclude_stopword(text):
    changed_text = [token for token in text.lower().split(" ") if token != "" and token not in stop_words]
    #The result above is a list, so convert it back to a space-separated string
    return join_list_str(changed_text)

#Part-of-speech names to drop, as MeCab prints them in Japanese:
#particle, auxiliary verb, adverb, adnominal, verb
exclude_pos = ["助詞", "助動詞", "副詞", "連体詞", "動詞"]

#Use urllib3
http = urllib3.PoolManager()
r = http.request('GET', url)
yahoo = BeautifulSoup(r.data, 'html.parser')

#Make sure the output directory exists before saving images
os.makedirs('./output_yahoo', exist_ok=True)

wc = WordCloud(font_path="/home/muauan/.fonts/NotoSansCJKjp-Regular.otf")
sk = 0
for p in yahoo.select("p"):
    title = p.getText()
    #Keep only kanji, hiragana, katakana and digits (character class assumed)
    title = re.sub(r"[^一-龥ぁ-んァ-ン0-9]", "", title)

    #Only paragraphs longer than 50 characters are treated as article body
    if len(title) > 50:
        print("content{};{}".format(sk, title))
        #Take the surface form of each token whose part of speech is not excluded ([:-1] drops the EOS line)
        splitted = " ".join([x.split("\t")[0] for x in t.parse(title).splitlines()[:-1] if x.split("\t")[1].split(",")[0] not in exclude_pos])
        splitted = exclude_stopword(splitted)
        wc.generate(splitted)
        plt.axis("off")
        plt.title("content_{};".format(sk))
        plt.tight_layout()
        plt.imshow(wc)
        plt.pause(0.05)
        plt.savefig('./output_yahoo/yahoo{}_{}.png'.format(sk, title[0:10]))
        plt.close()
        sk += 1
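The one-line list comprehension in the loop is dense, so here is a minimal sketch of the same part-of-speech filter unpacked into a function. It does not call MeCab. Instead it uses a hand-written sample in the "surface, tab, comma-separated features" layout that MeCab's IPA-style dictionaries print, with part-of-speech names in Japanese. The sample text and the function name keep_content_words are assumptions for illustration.

```python
# Hand-written sample in MeCab's "surface<TAB>features" output layout
sample_parse = "\n".join([
    "新型\t名詞,一般,*,*,*,*,新型,シンガタ,シンガタ",
    "コロナ\t名詞,一般,*,*,*,*,コロナ,コロナ,コロナ",
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ",
    "感染\t名詞,サ変接続,*,*,*,*,感染,カンセン,カンセン",
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ",
    "広がる\t動詞,自立,*,*,五段・ラ行,基本形,広がる,ヒロガル,ヒロガル",
    "EOS",
])

# Particle, auxiliary verb, adverb, adnominal, verb
EXCLUDE_POS = {"助詞", "助動詞", "副詞", "連体詞", "動詞"}

def keep_content_words(parsed, exclude_pos=EXCLUDE_POS):
    """Return the surface forms whose top-level part of speech is not excluded."""
    words = []
    for line in parsed.splitlines():
        # Skip the EOS marker and any line without a tab
        if line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        if features.split(",")[0] not in exclude_pos:
            words.append(surface)
    return " ".join(words)

result = keep_content_words(sample_parse)
print(result)
```

The space-separated result is the kind of string that can be handed on to the stop-word filter and then to WordCloud.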

For the stop words, I again used japanese.txt from the previous article.
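As a minimal sketch of what the stop-word step does to a space-separated token string: the stop-word entries below are illustrative only and are not the contents of japanese.txt, and the function takes the list as a parameter (an assumption made here so the example is self-contained).

```python
def exclude_stopwords(text, stop_words):
    """Lowercase the text, split on whitespace, and drop tokens found in stop_words."""
    stop_set = set(stop_words)
    return " ".join(tok for tok in text.lower().split() if tok not in stop_set)

# Illustrative entries only
sample_stop_words = ["こと", "これ", "the"]
cleaned = exclude_stopwords("感染 こと 拡大 The WHO", sample_stop_words)
print(cleaned)
```

Note that lowercasing also affects the tokens that are kept, so "WHO" comes out as "who".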

Item number | Sentence | WordCloud
content0 | Rintaro Hosokawa, Geneb The number of people infected with the new coronavirus in the world exceeded 100,000. Even now, there is no prospect that the spread of infection will end, and the threat of viruses such as travel restrictions and cancellation of large-scale events is casting a big shadow on people's lives and corporate activities. | yahoo0_Geneva Rintaro Hosokawa.png
content1 | The number of infected people greatly exceeded the severe acute respiratory syndrome that occurred in 20023 and the Middle East respiratory syndrome that occurred in 2012, and became a global epidemic. | yahoo1_The number of infected people is 20020.png
content2 | According to the total data of Johns Hopkins University in the United States, the number of infected people in the world is about 102,000 as of 6:00 am on the 7th of Japan time. Of these, about 80% is occupied by mainland China. Iran has about 4750 people, Italy has about 4640 people, and Japan has more than 400 people. | yahoo2_Johns Hopkins.png
content3 | Recently, the number of new infections has decreased in China, but the number of infections in other countries is increasing rapidly. Tedros, the secretary general of the World Health Organization, was deeply concerned about the global spread at a press conference on the 6th. All countries that have expressed strong concern about the spread of infection to developing countries with particularly vulnerable medical infrastructure have called for the containment of viruses to be a top priority, and have taken further measures such as strengthening medical systems and border measures. Asked to take | yahoo3_Recently a new feeling in China.png
content4 | According to the fact that research on vaccines and therapeutic agents has started in each country, 20 vaccines are currently under development and many clinical studies are underway for therapeutic agents. However, there are many unclear points such as infectivity of the new corona in the coming summer. There are also voices that the infection will slow down as the temperature rises, but there is no evidence that the virus will disappear in the summer and it should be assumed that it will continue to have the ability to spread, and he appealed not to loosen the caution. | yahoo4_Vaccines and cures in each country.png
content5 | Due to the spread of the new corona infection, travel restrictions and cancellations of business trips are spreading in each country. Many are concerned that it will be a serious blow to the world economy The International Monetary Fund has indicated that it may be the lowest growth in 11 years since 2009 immediately after the financial crisis in its 20-year global economic growth forecast. | yahoo5_Spread of infection of new corona.png
content6 | Approximately one month after declaring the new Corona as a public health emergency of international concern on January 30, the global risk level was raised to the highest level in four stages on February 28. Raised | yahoo6_Is a new model on January 30th.png
content7 | According to the report, it is important to thoroughly wash hands and regularly wipe the computer with a disinfectant to prevent infection, but people without symptoms do not need to wear a mask for preventive purposes and use it excessively. Asking to refrain | yahoo7_According to to prevent infection.png
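The sentences above have lost their punctuation and symbols because of the cleanup regex applied before tokenizing. This is likely also why a range such as 2002-03 shows up as a run of digits in content1. The following is a minimal sketch of that cleanup. The character class (kanji, hiragana, katakana in the range ァ-ン, ASCII digits) is an assumption, and the sample string is made up.

```python
import re

# Assumed character class: kanji, hiragana, katakana (ァ-ン) and ASCII digits
KEEP_ONLY = re.compile(r"[^一-龥ぁ-んァ-ン0-9]")

def clean(text):
    """Delete every character outside the kept character class."""
    return KEEP_ONLY.sub("", text)

sample = "【ジュネーブ=細川倫太郎】2002-03年、約10万人。"
cleaned = clean(sample)
print(cleaned)
```

Brackets, the hyphen and the punctuation are removed, and so is the long-vowel mark ー, because it lies outside the ァ-ン range. Adding ー to the class would keep it, if that is preferred.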

I am not fully satisfied, but I think the word clouds do show words that represent each sentence. How far to take this is an open question. In the first sentence, for example, I would like phrases such as the cancellation of events and the lack of any prospect of the infection ending to be emphasized more than place and personal names such as Geneva.

Summary

・ I tried scraping news articles into WordCloud.
・ It turned out that it can be converted in almost real time.

・ I want to use it more actively
・ Expand the application of scraping
