[Python] Visualize 2ch threads with WordCloud (Part 1: Scraping)

Introduction

2ch is a well-known anonymous bulletin board system that holds a huge amount of information. However, reading every reply in every thread to get the whole picture would take an enormous amount of time. So I tried visualizing this information with WordCloud to grasp the big picture easily.

  • The image above is the result of a thread search for "FFRK": roughly the last eight months of replies, rendered with WordCloud.
  • It has been about a year since Synchro Soul Breaks were implemented, yet you can see that Awakening Soul Breaks are still talked about more. Besides the FFRK original characters Deshi and Urara, characters such as Bartz, Edge, Cloud and Mog also come up relatively often.

I'm a beginner at both scraping and natural language processing, but since I got it working in my own way, I'd like to write it up. This time, as the first part, I cover collecting everything from the thread list to the reply contents by web scraping.

Overall flow

  1. [Scraping "log speed" to extract the URL of the target thread](#### [Scraping the thread list from "log speed"]) ← Explanation this time
  2. [Scraping 2ch thread to extract less](#### [Scraping 2ch thread]) ← Explanation this time
  3. Morphological analysis of the extracted less content with Mecab
  4. Output with WordCloud

Full code

Click to view the full code (including processing other than scraping)
#Library import
import requests, bs4
import re
import time
import pandas as pd
from urllib.parse import urljoin

#Install fonts locally in Colab
from google.colab import drive
drive.mount("/content/gdrive")
#Create a folder called font at the top of My Drive in your Google Drive in advance, and put the desired font file in it.
#Copy each folder locally to Colab
!cp -a "gdrive/My Drive/font/" "/usr/share/fonts/"

# ------------------------------------------------------------------------
#Preparation
log_database = []  #A list that stores thread information
base_url = "https://www.logsoku.com/search?q=FFRK&p="

#Implementation of web scraping
for i in range(1,4):  # how many search-result pages to fetch (tentatively pages 1-3 here)
  logs_url = base_url+str(i)

  # The scraping itself
  res = requests.get(logs_url)
  soup = bs4.BeautifulSoup(res.text, "html.parser")

  # Stop when there are no search results
  if soup.find(class_="search_not_found"):break

  #Get table / row where thread information is stored
  thread_table = soup.find(id="search_result_threads")
  thread_rows = thread_table.find_all("tr")

  #Processing for each row
  for thread_row in thread_rows:
    tmp_dict = {}
    tags = thread_row.find_all(class_=["thread","date","length"])

    #Organize the contents
    for tag in tags:
      if "thread" in str(tag):
        tmp_dict["title"] = tag.get("title")
        tmp_dict["link"] = tag.get("href")
      elif "date" in str(tag):
        tmp_dict["date"] = tag.text
      elif "length" in str(tag):
        tmp_dict["length"] = tag.text

    # Only threads with more than 50 replies are added to the database
    if tmp_dict.get("length", "").isdecimal() and int(tmp_dict["length"]) > 50:
      log_database.append(tmp_dict)

  time.sleep(1)

#Convert to DataFrame
thread_df = pd.DataFrame(log_database)

# ------------------------------------------------------------------------
# Get the replies from the archived threads
log_url_base = "http://nozomi.2ch.sc/test/read.cgi/"
res_database = []

for thread in log_database:
  # Extract the board code and thread number from the link and build the archived-thread URL
  board_and_code_match = re.search("[a-zA-Z0-9_]*?/[0-9]*?/$",thread["link"])
  board_and_code = board_and_code_match.group()
  thread_url = urljoin(log_url_base, board_and_code)

  #HTML extraction from past log page
  res = requests.get(thread_url)
  soup = bs4.BeautifulSoup(res.text, "html5lib")

  tmp_dict = {}
  # The dt tag holds the date and other header information
  # The dd tag holds the reply body
  dddt = soup.find_all(["dd","dt"])

  for tag in dddt[::-1]:  # iterate from the end

    #Extract only the date from the dt tag
    if "<dt>" in str(tag):
      date_result = re.search(r"\d*/\d*/\d*",tag.text)  # matches dates like 2020/01/23
      if date_result:
        date_str = date_result.group()
        tmp_dict["date"] = date_str

    # Extract the reply body from the dd tag
    if "<dd>" in str(tag):
      tmp_dict["comment"] = re.sub("\n","",tag.text)

    # Once both a date and a comment are present, append tmp_dict to res_database
    if "date" in tmp_dict and "comment" in tmp_dict:
      tmp_dict["thread_title"] = thread["title"]
      res_database.append(tmp_dict)
      tmp_dict = {}

  time.sleep(1)  # etiquette: do not overload the server

#Convert to DataFrame
res_df = pd.DataFrame(res_database)

# ------------------------------------------------------------------------

# Install the morphological analysis library MeCab and the mecab-ipadic-NEologd dictionary
!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null 
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1
!pip install mecab-python3 > /dev/null

#Error avoidance by symbolic links
!ln -s /etc/mecabrc /usr/local/etc/mecabrc

#Wordcloud installation
!pip install wordcloud

# Join the replies in groups of n (=1000), separated by commas
# The split is needed because MeCab cannot handle an input that is too long
sentences_sep = []
n = 1000
for i in range(0, len(res_df["comment"]), n):
  sentences_sep.append(",".join(res_df["comment"][i: i + n]))

# ------------------------------------------------------------------------
import MeCab

# Specify the path where the mecab-ipadic-NEologd dictionary is stored
path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
# The path above (/usr/...) can be obtained with the following command
# !echo `mecab-config --dicdir`"/mecab-ipadic-neologd"

#Creating a Tagger object
mecab = MeCab.Tagger(path)

#Perform morphological analysis for each separated group
chasen_list = [mecab.parse(sentence) for sentence in sentences_sep]

word_list = []

# Break chasen_list down line by line
# e.g.  鉄巨人 名詞,固有名詞,一般,*,*,*,鉄巨人,テツキョジン,テツキョジン
for chasen in chasen_list:
  for line in chasen.splitlines():
    
    if len(line) <= 1: break

    speech = line.split()[-1]  # the part-of-speech / feature field
    if "名詞" in speech:  # keep nouns only
      if (not "非自立" in speech) and (not "代名詞" in speech) and (not "数" in speech):  # exclude dependent nouns, pronouns and numerals
        word_list.append(line.split()[0])

word_line = ",".join(word_list)

# ------------------------------------------------------------------------
from wordcloud import WordCloud
import matplotlib.pyplot as plt
#Fonts must be installed locally in Colab in advance
f_path = "BIZ-UDGothicB.ttc"
stop_words = ["https","imgur","net","jpg","com","so"]

wordcloud = WordCloud(
    font_path=f_path,
    width=1024, height=640,   # default width=400, height=200
    background_color="white",   # default=”black”
    stopwords=set(stop_words),
    max_words=350,   # default=200
    max_font_size=200,   # default=None
    min_font_size=5,   #default=4
    collocations = False   #default = True
    ).generate(word_line)
plt.figure(figsize=(18,15))
plt.imshow(wordcloud)
plt.axis("off") #Hide memory
plt.show()

Environment ~ Google Colaboratory ~

Use Google Colaboratory. Google Colaboratory is a browser-based Python execution environment that anyone with a Google account can use. It is often used for machine learning because powerful GPUs are available, but if you only want to scrape, no library installation is needed, so it is also a good choice for casual scraping. (Additional installation is required for MeCab, WordCloud, etc., which will be explained next time.) See the article below for how to use Google Colaboratory ⇒ Summary of how to use Google Colab (I haven't tried it, but it should also work without Colab as long as you install the necessary libraries.)
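If you do run this locally instead of in Colab, the third-party packages used here can be installed with pip. This is just a minimal setup sketch; the package names below are the standard PyPI ones and are not spelled out in the original article.

pip install requests beautifulsoup4 pandas   # used by the scraping part
pip install mecab-python3 wordcloud          # only needed for the next part (plus a MeCab dictionary such as mecab-ipadic-NEologd)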

Commentary

Basics of scraping

A web page is fetched with requests.get() and parsed with bs4.BeautifulSoup(); this is the entry point of all the scraping here. "html.parser" specifies the parser. For Logsoku the commonly used "html.parser" is fine, but for 2ch the "html5lib" parser is used instead, because "html.parser" fails to parse those pages for some reason.

  # The scraping itself
  res = requests.get(logs_url)
  soup = bs4.BeautifulSoup(res.text, "html.parser")
Etiquette

When accessing a website repeatedly while scraping, insert time.sleep(1) between requests so as not to overload the server.
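One way to keep that rule in a single place is a small helper that waits after every request. This is only a sketch: the function name fetch_soup and the one-second default are my own choices, reusing the requests and bs4 imports from the code above.

import time
import requests, bs4

def fetch_soup(url, parser="html.parser", wait=1.0):
  # Fetch a page, parse it, then pause so the server is not overloaded
  res = requests.get(url)
  soup = bs4.BeautifulSoup(res.text, parser)
  time.sleep(wait)
  return soup

# e.g. soup = fetch_soup(logs_url)                 # Logsoku search pages
#      soup = fetch_soup(thread_url, "html5lib")   # 2ch threads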

[Scraping the thread list from "Logsoku"]

Search "Logsoku" for threads (including currently active threads) that contain a given keyword, and extract the results. Logsoku: https://www.logsoku.com/

Get the URL to scrape

Trying a search on the site above produces URLs like the following.
Search for "FFRK": https://www.logsoku.com/search?q=FFRK
Second and subsequent pages: https://www.logsoku.com/search?q=FFRK&p=2

From the search-result URLs, the following can be seen.
・A URL of the form https://www.logsoku.com/search? with q= appended is a search-result page (q for quest?).
・Each page of the results can be accessed directly with p= (p for page?).
・The first page can also be displayed with p=1.
・Pages with no search results can still be accessed.

Given the above, the URLs of the pages to access can be generated simply by looping over the page number. Conveniently, a page can be fetched even when the search returns nothing, and such pages contain an element with the class "search_not_found", so that is used to decide when to stop.

base_url = "https://www.logsoku.com/search?q=FFRK&p="
for i in range(1,100):
  logs_url = base_url+str(i)

  #Performing scraping
  res = requests.get(logs_url)
  soup = bs4.BeautifulSoup(res.text, "html.parser")

  # Stop when there are no search results
  if soup.find(class_="search_not_found"):break
 :
(Processing for each page)
 :
Scraping for each search page

Before actually scraping, look at the HTML of the target page and work out how to process it. The "developer tools" built into every browser are useful for this. Press F12 to open them; if you then click "Select an element in the page" and hover over any part of the web page, you can see which part of the HTML it corresponds to.

Examining the search page with the developer tools revealed the following.
・All the necessary information is under div#search_result_threads.
・The information for one thread is stored in each tr tag.
・The thread title and link are in a.thread inside the tr tag.
・The number of replies is in td.length inside the tr tag.
・The thread's update date and time are in td.date inside the tr tag.

Based on this, scraping proceeds as follows. Threads with 50 or fewer replies are skipped because they are likely to be duplicates. Storing the extracted values in a dictionary keeps things easy to follow and comes in handy when converting to the DataFrame format described later.

  #Get table / row where thread information is stored
  thread_table = soup.find(id="search_result_threads")
  thread_rows = thread_table.find_all("tr")

  #Processing for each row
  for thread_row in thread_rows:
    tmp_dict = {}
    tags = thread_row.find_all(class_=["thread","date","length"])

    #Organize the contents
    for tag in tags:
      if "thread" in str(tag):
        tmp_dict["title"] = tag.get("title")
        tmp_dict["link"] = tag.get("href")
      elif "date" in str(tag):
        tmp_dict["date"] = tag.text
      elif "length" in str(tag):
        tmp_dict["length"] = tag.text

    # Only threads with more than 50 replies are added to the database
    if tmp_dict.get("length", "").isdecimal() and int(tmp_dict["length"]) > 50:
      log_database.append(tmp_dict)

  time.sleep(1)

With that, the information for the 2ch threads matching the search has been collected. Convert it to a pandas DataFrame so it is easy to handle later.

thread_df = pd.DataFrame(log_database)  #conversion

display

thread_df
First, I was able to extract the thread list.
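As an aside that is not in the original article: while experimenting, the intermediate DataFrame can be saved to the Google Drive mounted earlier and reloaded later, so the thread list does not have to be scraped again. The file name below is arbitrary.

# Optional: persist the intermediate result (file name is arbitrary)
thread_df.to_csv("gdrive/My Drive/thread_df.csv", index=False)

# ...and reload it later without re-scraping
thread_df = pd.read_csv("gdrive/My Drive/thread_df.csv")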

[Scraping 2ch threads]

Based on the thread information acquired above, the contents of each thread are extracted. On 2ch, a thread URL is specified as "http://nozomi.2ch.sc/test/read.cgi/" + "board code/" + "thread number/". The board code and thread number are extracted from the link obtained by the scraping above with the regular expression re.search("[a-zA-Z0-9_]*?/[0-9]*?/$", thread["link"]).

# Get the replies from the archived threads
log_url_base = "http://nozomi.2ch.sc/test/read.cgi/"
res_database = []

for thread in log_database:
  # Extract the board code and thread number from the link and build the archived-thread URL
  board_and_code_match = re.search("[a-zA-Z0-9_]*?/[0-9]*?/$",thread["link"])
  board_and_code = board_and_code_match.group()  # take the matched string out of the match object
  thread_url = urljoin(log_url_base, board_and_code)
 :
(Processing for each thread)
 :
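To illustrate how the URL is assembled, here is a small example reusing re, urljoin and log_url_base from the code above. The link is a made-up, hypothetical one, not a real thread: the regex keeps only the trailing "board/number/" part, and urljoin appends it to the read.cgi base.

link = "https://www.logsoku.com/r/2ch.sc/ffo/1234567890/"   # hypothetical example link
board_and_code = re.search("[a-zA-Z0-9_]*?/[0-9]*?/$", link).group()
print(board_and_code)                          # ffo/1234567890/
print(urljoin(log_url_base, board_and_code))   # http://nozomi.2ch.sc/test/read.cgi/ffo/1234567890/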
Getting the reply contents and posting dates

Use the browser developer tools (F12) to examine the page, just as with Logsoku. The following tendencies could be seen on 2ch.
・One reply is a pair of dt and dd tags.
・The dt and dd tags are unlikely to be used anywhere other than the replies.
・The dt tag contains the reply's date and time along with information such as the name and ID.
・The body of the reply is stored in the dd tag.

Based on this, scraping was performed as follows. As mentioned above, the parser here is "html5lib".

  #HTML extraction from past log page
  res = requests.get(thread_url)
  soup = bs4.BeautifulSoup(res.text, "html5lib")  #Use html5lib for 2ch

  tmp_dict = {}
  # The dt tag holds the date and other header information
  # The dd tag holds the reply body
  dddt = soup.find_all(["dd","dt"])

  for tag in dddt[::-1]:  # iterate from the end

    #Extract only the date from the dt tag
    if "<dt>" in str(tag):
      date_result = re.search(r"\d*/\d*/\d*",tag.text)  # matches dates like 2020/01/23
      if date_result:
        date_str = date_result.group()
        tmp_dict["date"] = date_str

    # Extract the reply body from the dd tag
    if "<dd>" in str(tag):
      tmp_dict["comment"] = re.sub("\n","",tag.text)

    # Once both a date and a comment are present, append tmp_dict to res_database
    if "date" in tmp_dict and "comment" in tmp_dict:
      tmp_dict["thread_title"] = thread["title"]
      res_database.append(tmp_dict)
      tmp_dict = {}

  time.sleep(1)  # etiquette: do not overload the server

The flow is: extract all the dd and dt tags in one go, walk through them one by one, decide whether each is a dt or a dd tag, and store the result in a dictionary. (In practice the extracted dd and dt tags alternate regularly, so the check is not strictly necessary, but I added it so that other patterns can be handled if they appear.)
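If you are willing to rely on that regular alternation, the pairing can also be written more directly. This is only an alternative sketch under that assumption, not the approach used in the article; it reuses soup, re, res_database and thread from the code above.

# Alternative sketch: assume every reply is exactly one dt followed by one dd
dts = soup.find_all("dt")
dds = soup.find_all("dd")
for dt, dd in zip(dts, dds):
  date_result = re.search(r"\d*/\d*/\d*", dt.text)
  if date_result:
    res_database.append({
        "date": date_result.group(),
        "comment": dd.text.replace("\n", ""),
        "thread_title": thread["title"],
    })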

Convert to DataFrame
res_df = pd.DataFrame(res_database)

display

res_df

The replies and their posting dates have now been extracted from the 2ch threads without trouble.

Future plans

Next time, the extracted replies will be run through morphological analysis and then output with WordCloud.
