It's been half a year since I became an engineer, so as a way of looking back I scraped the text of the articles I've posted so far and turned it into a word cloud. This post records the procedure I used.
I was able to make something like this. The huge word "component" is probably because my article introducing Vue.js components talks about components a lot. Beyond that there were plenty of words like heroku, Docker, and Flask from articles I wrote, which made me feel nostalgic. There are also many generic words such as "change" and "addition", so you may want to add those as stop words. Incidentally, I also scraped my Hatena Blog and turned it into a word cloud; it is now my Twitter header image, so please take a look. (It's interesting because it says something completely different.)
I built it with Python, using the Qiita API for the article text, MeCab for splitting the Japanese into words, and the WordCloud library for drawing. Here is the procedure.
Use the Qiita API's item list endpoint (GET /api/v2/items). The request URL syntax is as follows.
https://qiita.com/api/v2/items?page={{page number}}&per_page={{Number of articles per page}}&query=user%3A{{User ID}}
For example, to fetch up to 100 of my own articles (kiyokiyo_kzsby), you can send the following request.
https://qiita.com/api/v2/items?page=1&per_page=100&query=user%3Akiyokiyo_kzsby
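If it helps, here is a minimal sketch of sending the same request from Python with the requests library. The user ID is just an example, and as far as I know the API caps per_page at 100, so fetching more articles would mean looping over page.

import requests

# Minimal sketch: same request as above, with the query parameters passed as a dict.
# Replace the user ID with your own; per_page is capped at 100 by the API.
params = {"page": 1, "per_page": 100, "query": "user:kiyokiyo_kzsby"}
r = requests.get("https://qiita.com/api/v2/items", params=params)
articles = r.json()
print(len(articles))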
The response will be returned in JSON format.
[
  {
    "rendered_body": "<h1>Example1</h1>",
    (omitted)
    "title": "Example title 1",
    (omitted)
  },
  {
    "rendered_body": "<h1>Example2</h1>",
    (omitted)
    "title": "Example title 2",
    (omitted)
  },
  ...
]
Extract rendered_body and title from this response and use them for the word cloud. Written as Python code, it looks like this.
qiita_scraper.py
import requests
import json
from bs4 import BeautifulSoup

def scrape_all(user_id):
    text = ""
    # Fetch up to 100 of the user's articles from the Qiita API
    r = requests.get("https://qiita.com/api/v2/items?page=1&per_page=100&query=user%3A" + user_id)
    json_list = json.loads(r.text)
    for article in json_list:
        print("scrape " + article["title"])
        text += article["title"]
        content = article["rendered_body"]
        soup = BeautifulSoup(content, "html.parser")
        # Keep only the text inside these tags (skip embedded code blocks etc.)
        for valid_tag in soup.find_all(["p", "li", "h1", "h2", "h3", "h4", "table"]):
            text += valid_tag.text
    return text
requests is a library for making HTTP requests, json is a library for handling JSON, and BeautifulSoup is a library for parsing HTML. json ships with Python's standard library, so only the other two need to be installed with pip. (For Beautiful Soup, this article is a good reference.)
$ pip install requests
$ pip install beautifulsoup4
The HTML tags to read are specified with soup.find_all(["p","li","h1","h2","h3","h4","table"]) near the bottom of the function. At first I tried to read the entire body, but the embedded code was picked up as well, and the resulting word cloud ended up being nothing but words that appear frequently in code, such as for and if, so I restricted extraction to the text parts only. Adjust this part to your liking.
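If you would rather read the whole body and just drop the embedded code, one alternative (a rough sketch I did not use here, assuming the code lives in code/pre tags) is to remove those tags with BeautifulSoup before extracting the text:

from bs4 import BeautifulSoup

def extract_text_without_code(html):
    # Sketch: remove embedded code blocks, then take the remaining plain text.
    soup = BeautifulSoup(html, "html.parser")
    for code_tag in soup.find_all(["code", "pre"]):
        code_tag.decompose()  # delete the tag and everything inside it
    return soup.get_text()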
Unlike English, the scraped Japanese text is not separated into words, so if you simply feed it into a word cloud library it will not work. Therefore we perform morphological analysis to split the text into individual words. This time I use the morphological analyzer MeCab. I referred to this article quite a bit.
First, install the tools required to set up MeCab.
brew install git curl xz
Next, install MeCab itself and its dictionary.
brew install mecab mecab-ipadic
Type mecab in the terminal and enter a sentence to try morphological analysis. (Exit with Control + C.)
$ mecab
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
(The input is the tongue twister "sumomo mo momo mo momo no uchi", roughly "plums and peaches are both kinds of peach"; each output line shows a word followed by its part of speech and other features.)
Now you're ready. Next, let's write Python code to perform the morphological analysis. For the Python code, I referred to this article.
mecab.py
import MeCab as mc

def mecab_analysis(text):
    print("start mecab analysis")
    t = mc.Tagger('-Ochasen')
    node = t.parseToNode(text)
    output = []
    while node:
        if node.surface != "":
            word_type = node.feature.split(",")[0]
            if word_type in ["形容詞", "名詞"]:  # adjectives and nouns
                output.append(node.surface)
        node = node.next
    print("end mecab analysis")
    return output
We will use the MeCab Python binding, so install it with pip as well.
$ pip install mecab-python3
The check if word_type in ["形容詞", "名詞"]: limits the words included in output to adjectives (形容詞) and nouns (名詞). If you also want to include adverbs or verbs, add their part-of-speech names to this list.
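For example, if you also wanted to count verbs, one possible tweak (a sketch based on the IPAdic feature format, not part of the original code) is to use the verb's base form, which is the seventh comma-separated field of node.feature, so that different conjugations are grouped together:

import MeCab as mc

def mecab_analysis_with_verbs(text):
    # Sketch: also collect verbs (動詞), using their base form so that
    # different conjugations of the same verb count as one word.
    t = mc.Tagger('-Ochasen')
    node = t.parseToNode(text)
    output = []
    while node:
        if node.surface != "":
            features = node.feature.split(",")
            if features[0] in ["形容詞", "名詞"]:
                output.append(node.surface)
            elif features[0] == "動詞" and len(features) > 6 and features[6] != "*":
                output.append(features[6])  # base form (原形)
        node = node.next
    return output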
Now that the scraped text has been split into words, let's finally feed it into the WordCloud library to finish things off.
First, install the WordCloud library with pip, following its README. Also install matplotlib, which is used to draw the image.
$ pip install wordcloud
$ pip install matplotlib
Next, write the Python code as follows. For the Python code, I referred to this article.
word_cloud.py
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def create_wordcloud(text):
    print("start create wordcloud")
    # Specify the font path according to your environment.
    fpath = "/System/Library/Fonts/Hiragino Mincho ProN.ttc"
    # Stop words: common Japanese function words plus words that showed up
    # too often in my articles. Adjust to taste.
    stop_words = [u'てる', u'いる', u'なる', u'れる', u'する', u'ある', u'こと', u'これ', u'さん', u'して',
                  u'くれる', u'やる', u'くれ', u'そう', u'せる', u'した', u'思う',
                  u'それ', u'ここ', u'ちゃん', u'くん', u'', u'て', u'に', u'を', u'は', u'の', u'が', u'と', u'た', u'し', u'で',
                  u'ない', u'も', u'な', u'い', u'か', u'ので', u'よ', u'', u'もの', u'今週', u'まとめ', u'ため',
                  u'指定', u'場合', u'以下', u'作成', u'よう', u'部分', u'ファイル', u'使用', u'使う']
    wordcloud = WordCloud(background_color="white", font_path=fpath, width=800, height=500,
                          stopwords=set(stop_words)).generate(text)
    print("end create wordcloud")
    print("now showing")
    plt.figure(figsize=(15, 12))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
The font path is specified by fpath = "/System/Library/Fonts/Hiragino Mincho ProN.ttc" near the top of the function. The path and font name may differ depending on the machine you are using, so adjust them accordingly.
The stop words are specified in stop_words. The words listed here will no longer appear in the word cloud. Specify words that carry little meaning on their own but would otherwise show up in huge letters, such as こと and もの (both roughly "thing").
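If you are not sure which words to add as stop words, one quick way (just a sketch using the standard library) is to count the words returned by mecab_analysis and look at the most frequent ones:

from collections import Counter

def show_frequent_words(wordlist, n=30):
    # Sketch: print the n most frequent words as stop word candidates.
    # `wordlist` is the list returned by mecab_analysis.
    for word, count in Counter(wordlist).most_common(n):
        print(word, count)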
Finally, let's create a file that runs everything, from scraping to word cloud generation, in one go.
main.py
from qiita_scraper import scrape_all
from mecab import mecab_analysis
from word_cloud import create_wordcloud
text = scrape_all("kiyokiyo_kzsby")
wordlist = mecab_analysis(text)
create_wordcloud(" ".join(wordlist))
The Qiita user_id is passed as the argument to scrape_all on the 4th line. You can create a word cloud for another user by changing this.
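If you want to keep the image as a file (for example, to use it as a header image, as mentioned at the beginning) rather than only displaying it, one option (an untested sketch mirroring the settings of create_wordcloud) is to save it with the WordCloud object's to_file method:

from wordcloud import WordCloud

def save_wordcloud(text, path="wordcloud.png"):
    # Sketch: generate the word cloud and save it to a PNG file.
    # The font path mirrors create_wordcloud above; adjust to your environment.
    fpath = "/System/Library/Fonts/Hiragino Mincho ProN.ttc"
    wc = WordCloud(background_color="white", font_path=fpath,
                   width=800, height=500).generate(text)
    wc.to_file(path)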
When you run main.py, the word cloud window opens after log messages like the following are printed.
scrape Study GoF design patterns
scrape Somehow understand the frequently used terms of DDD
(omitted)
scrape AtCoder 400 points algorithm summary(Java edition)
scrape AWS Solutions Architect-I want to get an associate
start mecab analysis
end mecab analysis
start create wordcloud
end create wordcloud
now showing
It worked!!
From here, I think you can polish the result further by adjusting the parts of speech and the stop words. It's fun, so please play around with it.
Making a word cloud is nice because you can see at a glance what kind of things you have been writing. I think it would also be interesting to change the scraping target to Hatena Blog or Twitter and compare the results.