[PYTHON] Scraping your Qiita articles to create a word cloud

Introduction

It's been half a year since I became an engineer, so as a way of looking back, I scraped the text of the articles I've posted so far and turned it into a word cloud. This post records the procedure I followed.

The finished product

I was able to make something like this. (Screenshot: the generated word cloud) The big word "component" is no doubt because my article "Introduction to Vue.js 'components'" talks about components a lot. Beyond that there were plenty of words like heroku, Docker, and Flask from articles I'd written, which felt nostalgic. A lot of general-purpose words such as "change" and "addition" also show up, so you may want to set stop words to taste. Incidentally, I also scraped my Hatena Blog to make a word cloud; I use it as my Twitter header image, so take a look if you like. (It's interesting because it says something completely different.)

Environment

I built this on macOS with Python 3; the required libraries are installed with Homebrew and pip in the steps below.

Rough procedure

  1. Scraping to get text data
  2. Divide into words through a morphological analyzer (MeCab)
  3. Create a word cloud

Let's go through each step in order.

Scraping

Use the Qiita API's "get items" endpoint (GET /api/v2/items). The request URL has the following syntax:

https://qiita.com/api/v2/items?page={{page number}}&per_page={{Number of articles per page}}&query=user%3A{{User ID}}

For example, to fetch up to 100 of my (kiyokiyo_kzsby) articles, send the following request:

https://qiita.com/api/v2/items?page=1&per_page=100&query=user%3Akiyokiyo_kzsby
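
If you want to check the response before writing any Python, you can hit the endpoint with curl (any HTTP client works):

$ curl "https://qiita.com/api/v2/items?page=1&per_page=100&query=user%3Akiyokiyo_kzsby"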

The response will be returned in JSON format.

[
  {
    "rendered_body": "<h1>Example1</h1>",
    (omitted)
    "title": "Example title 1",
    (omitted)
  },
  {
    "rendered_body": "<h1>Example2</h1>",
    (omitted)
    "title": "Example title 2",
    (omitted)
  },
  ...
]

Extract rendered_body and title from this and use them for the word cloud.

Putting this into Python code looks like the following.

qiita_scraper.py


import requests
import json
from bs4 import BeautifulSoup

def scrape_all(user_id):
    text = ""
    # Fetch up to 100 of the user's articles from the Qiita API.
    r = requests.get("https://qiita.com/api/v2/items?page=1&per_page=100&query=user%3A" + user_id)
    json_list = json.loads(r.text)
    for article in json_list:
        print("scrape " + article["title"])
        text += article["title"]
        content = article["rendered_body"]
        soup = BeautifulSoup(content, "html.parser")
        # Collect only body-text tags so embedded code blocks are excluded.
        for valid_tag in soup.find_all(["p","li","h1","h2","h3","h4","table"]):
            text += valid_tag.text
    return text

requests is a library for making HTTP requests, and BeautifulSoup is a library for parsing HTML; install both with pip. (json handles JSON and ships with the Python standard library, so it needs no installation. For Beautiful Soup, this article is a good reference.)

$ pip install requests
$ pip install beautifulsoup4
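
As a quick check that the libraries are working, here is a minimal, self-contained sketch (the sample HTML is made up) of the kind of tag-to-text extraction used above:

from bs4 import BeautifulSoup

html = "<h1>Title</h1><p>Body text.</p><pre><code>for i in range(3): pass</code></pre>"
soup = BeautifulSoup(html, "html.parser")

# Only the listed tags are collected, so the embedded code block is skipped.
texts = [tag.text for tag in soup.find_all(["p", "h1"])]
print(texts)  # ['Title', 'Body text.']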

The call `soup.find_all(["p","li","h1","h2","h3","h4","table"])` on the third line from the bottom specifies which HTML tags to read. At first I tried reading the whole text, but the embedded code blocks came along too, and the resulting word cloud was dominated by words that appear frequently in code, such as for and if, so I restricted extraction to the body-text tags. Adjust this to your liking.
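
Alternatively, if you'd rather keep the whole body text and just drop the code, BeautifulSoup's decompose() can strip those tags before extracting text. This is a sketch of an alternative approach, not what the code above does, and body_text_without_code is a hypothetical helper name:

from bs4 import BeautifulSoup

def body_text_without_code(html):
    # Alternative approach: keep all text but remove embedded code blocks first.
    soup = BeautifulSoup(html, "html.parser")
    for code_tag in soup.find_all(["pre", "code"]):
        code_tag.decompose()  # delete the tag and its contents from the parse tree
    return soup.text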

Morphological analysis

Unlike English, scraped Japanese text is not divided into words, so feeding it straight into the word cloud generation library won't work. Therefore we perform morphological analysis to split it into individual words. This time we use a morphological analyzer called MeCab. I referred to this article quite a bit.

First, install the prerequisite tools for MeCab.

$ brew install git curl xz

Next, install MeCab itself together with its dictionary.

$ brew install mecab mecab-ipadic

Type mecab in the terminal and enter a sentence, and it will run morphological analysis on it. (Exit with Control+C.)

$ mecab
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

(The input is the classic example sentence すもももももももものうち, "plums and peaches are both kinds of peaches"; each output line shows a word followed by its part-of-speech features.)

Now we're ready. Next, let's write Python code to run the morphological analysis. For the Python code, I referred to this article.

mecab.py


import MeCab as mc

def mecab_analysis(text):
    print("start mecab analysis")
    t = mc.Tagger('-Ochasen')
    t.parse('')  # known mecab-python3 workaround so node.surface is populated correctly
    node = t.parseToNode(text)
    output = []
    while node:
        if node.surface != "":
            # The first feature field is the part of speech, e.g. 名詞 (noun).
            word_type = node.feature.split(",")[0]
            if word_type in ["形容詞", "名詞"]:  # keep adjectives and nouns
                output.append(node.surface)
        node = node.next
    print("end mecab analysis")
    return output

We will use the MeCab library, so let's do pip install.

$ pip install mecab-python3

The check `if word_type in ["形容詞", "名詞"]:` limits the words included in output to adjectives (形容詞) and nouns (名詞). If you want to include adverbs and verbs as well, add them to this list.
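
For example, a variant that also keeps verbs (動詞) and adverbs (副詞), using the IPA dictionary's part-of-speech labels, would be:

# Keep verbs and adverbs in addition to adjectives and nouns.
if word_type in ["形容詞", "名詞", "動詞", "副詞"]:
    output.append(node.surface)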

Creating a word cloud

Now that the scraped text has been split into words, let's finally feed it to the WordCloud library to finish the job. First, follow the word cloud library's README and pip install it. Also install matplotlib, which draws the image on screen.

$ pip install wordcloud
$ pip install matplotlib

Next, write the Python code as follows. For the Python code, I referred to this article.

word_cloud.py


import matplotlib.pyplot as plt
from wordcloud import WordCloud

def create_wordcloud(text):
    print("start create wordcloud")

    # Specify the font path according to your environment.
    fpath = "/System/Library/Fonts/Hiragino Mincho ProN.ttc"

    # Stop words: Japanese filler words that would otherwise dominate the cloud.
    stop_words = [ u'てる', u'いる', u'なる', u'れる', u'する', u'ある', u'こと', u'これ', u'さん', u'して', \
             u'くれる', u'やる', u'くださる', u'そう', u'せる', u'した', u'思う', \
             u'それ', u'ここ', u'ちゃん', u'くん', u'', u'て', u'に', u'を', u'は', u'の', u'が', u'と', u'た', u'し', u'で', \
             u'ない', u'も', u'な', u'い', u'か', u'ので', u'よう', u'', u'もの', u'今週', u'まとめ', u'ため', \
             u'指定', u'場合', u'以下', u'作成', u'よう', u'部分', u'ファイル', u'利用', u'使用']

    wordcloud = WordCloud(background_color="white", font_path=fpath, width=800, height=500, \
                          stopwords=set(stop_words)).generate(text)
    print("end create wordcloud")

    print("now showing")
    plt.figure(figsize=(15,12))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

The font path is set at `fpath = "/System/Library/Fonts/Hiragino Mincho ProN.ttc"` near the top of the function. The path and font name may differ depending on your machine, so adjust as needed (pick a font that can render Japanese). Stop words are listed in `stop_words`; any word listed there will no longer appear in the word cloud. Add words such as こと and もの ("things") that carry no real meaning but would otherwise be displayed in huge letters.
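
If you also want to save the image (say, for a Twitter header as mentioned above) rather than only display it, the WordCloud object can write a PNG directly. This would be a one-line addition inside create_wordcloud after generate(); the file name is arbitrary:

# Save the generated cloud to a PNG file as well.
wordcloud.to_file("qiita_wordcloud.png")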

Finally, let's create a file that collectively processes these from scraping to word cloud generation.

main.py


from qiita_scraper import scrape_all
from mecab import mecab_analysis
from word_cloud import create_wordcloud

text = scrape_all("kiyokiyo_kzsby")
wordlist = mecab_analysis(text)
create_wordcloud(" ".join(wordlist))

The Qiita user_id is passed as the argument to scrape_all. Change it to create a word cloud for another user.
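
If you switch users often, a small variation on main.py (a sketch, not from the original article) can take the user ID as a command-line argument:

import sys

from qiita_scraper import scrape_all
from mecab import mecab_analysis
from word_cloud import create_wordcloud

# Usage: python main.py <qiita_user_id>  (falls back to a hard-coded default)
user_id = sys.argv[1] if len(sys.argv) > 1 else "kiyokiyo_kzsby"
text = scrape_all(user_id)
wordlist = mecab_analysis(text)
create_wordcloud(" ".join(wordlist))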

When you run main.py, log messages like the following are printed, and then the word cloud window opens.

Study scrape GoF design patterns
Somehow understand the frequently used terms of scrape DDD
(Omission)
scrape AtCoder 400 points algorithm summary(Java edition)
scrape AWS Solutions Architect-I want to get an associate
start mecab analysis
end mecab analysis
start create wordcloud
end create wordcloud
now showing

It worked! (Screenshot: the generated word cloud)

From here, you can refine the result by adjusting the parts of speech and the stop words. It's fun, so please play around with it.

In conclusion

Making a word cloud is nice because you can see at a glance what kind of output you've been producing. It would also be interesting to change the scraping target to Hatena Blog or Twitter and compare the results.
