[PYTHON] Turning WEB manga comments into word clouds is an interesting way to see at a glance what kind of manga each one is

Introduction

I read WEB manga from time to time, but there are so many that I never know which one to pick. I wondered whether comments could serve as an index for choosing what to read. Popular manga get many comments, but there are also plenty of interesting manga with only a few. So I'm planning to analyze the comments in various ways, and as a first step I turned them into word clouds. That let me grasp the comments visually and get an intuitive sense of whether a manga looked intriguing.

By offering a new way to pick manga based on word clouds, I hope this can become a gateway to works their authors poured effort into, and contribute a little to energizing the manga world. Perhaps that's a bit grandiose?

Environment

  * Python 3.7.6
  * selenium 3.141.0
  * ChromeDriver 80.0.3987.16
  * wordcloud 1.6.0
  * BeautifulSoup 4.8.2
  * mecab-python-windows 0.996.3

Target

Nico Nico Seiga

WEB Manga Word Cloud Reference Site

I have created the site below where you can browse the results. Clicking a word cloud takes you to that manga.

WEB Manga Word Cloud

Word cloud output result

Here is an example of the output. Doesn't it make you wonder what kind of manga it is? With comments like "beautiful woman" and "like", it makes me want to read a bit.

(word cloud image)

Check the terms

Since this involves scraping, let's check the terms of service first.


Excerpt from niconico Terms of Service
**5. Prohibitions** The following acts are prohibited in connection with users' use of "niconico".

  * Acts listed in paragraphs 3 and 4 of the niconico Activity Guidelines, or acts equivalent to them (including acts carried out by means other than writing comments, posting videos, etc.)
  * Acts that violate the provisions of these Terms of Use
  * Acts that violate the Public Offices Election Act
  * **Acts that place an excessive load on the "niconico" servers**
  * Acts that interfere with the operation of "niconico"
  * Linking to child prostitution/pornography sites, uncensored video download sites, etc.
  * Selling, auctioning, soliciting monetary payments, and other similar acts without the permission of the operating company
  * Advertising products without the permission of the operating company, publishing profile content for promotional purposes, and other acts aimed at soliciting spam mail, chain mail, etc.
  * Use of "niconico" by a minor aged 13 or older without the consent of a legal representative (parent or guardian)
  * Acts that the operating company deems inappropriate
  * Other acts similar to the above

The point to be careful about, then, is not to place an excessive load on the servers. The scraper avoids running continuously and puts sleep calls between requests, as in the sketch below.
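As a minimal sketch of what that pacing might look like (the helper name and the 3-second interval are my own choices, not from the original code):

import time

FETCH_INTERVAL_SEC = 3  # pause between requests; tune conservatively

def polite_get(driver, url):
    # Fetch a page with the WebDriver, then pause so requests are spaced out.
    driver.get(url)
    time.sleep(FETCH_INTERVAL_SEC)

Every driver.get in the sections below can be routed through a wrapper like this.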

Process flow

Execute the process according to the following flow.

  1. Log in to niconico
  2. List Nico Nico Seiga manga in update order and get the URL list from the manga list
  3. Go to the manga detail pages
  4. Get comments
  5. Process the comments with WordCloud

Log in to niconico

You need to log in to view Nico Nico Seiga. Here we use selenium to log in to niconico in the background.

It is assumed that selenium and ChromeDriver are already installed (selenium via pip install selenium; ChromeDriver as a binary matching your Chrome version).

Library import

Import the required libraries below.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import urllib.parse

WebDriver construction

Set the options and build the driver. The --headless option makes Chrome run in the background, and set_page_load_timeout sets the page-load timeout to 30 seconds.

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1024,768')

driver = webdriver.Chrome(options=options)
driver.set_page_load_timeout(30)

Login

First, access https://account.nicovideo.jp/login?site=seiga&next_url=%2F. Next, find the e-mail address and password fields by their IDs and fill each in. Finally, click the login button. Replace [mail address] and [password] with your own credentials.

driver.get('https://account.nicovideo.jp/login?site=seiga&next_url=%2F')

e = driver.find_element(By.ID, "input__mailtel")
e.send_keys('[mail address]')
e = driver.find_element(By.ID, "input__password")
e.send_keys('[password]')

e = driver.find_element(By.ID, 'login__submit')
e.click()

You can also log in by POSTing with requests, but in that case you need to pick up `auth_id` from the login screen and post it along with the credentials; with selenium none of that plumbing is needed. Similarly, when a page is updated by JavaScript after it loads, handling it with requests takes a lot of work, while selenium conveniently lets you process the page without worrying about it.
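For reference, here is a rough sketch of that requests variant. The form field names (mail_tel, password) and the hidden auth_id input are assumptions based on the description above, not verified against the current login form:

import requests
from bs4 import BeautifulSoup

LOGIN_URL = 'https://account.nicovideo.jp/login?site=seiga&next_url=%2F'

session = requests.Session()

# Fetch the login page first to extract the hidden auth_id field (assumed name).
login_page = session.get(LOGIN_URL)
soup = BeautifulSoup(login_page.text, 'html.parser')
auth_id = soup.find('input', {'name': 'auth_id'})['value']

# POST the credentials together with auth_id (field names are assumptions).
session.post(LOGIN_URL, data={
    'mail_tel': '[mail address]',
    'password': '[password]',
    'auth_id': auth_id,
})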

List Nico Nico Seiga manga in update order and get the URL list from the manga list

Get a list of manga in order of update

We will fetch the Nico Nico Seiga manga listing shown in the screenshot (omitted). While moving through the pages, collect the URL of each manga in the list. Considering the load, we only fetch pages 1 through 3 here.

url_root = 'https://seiga.nicovideo.jp'
desc_urls = []

for n in range(1, 4):
    target_url = urllib.parse.urljoin(url_root, 'manga/list?page=%d&sort=manga_updated' % n)
    try:
        driver.get(target_url)
        html = driver.page_source.encode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')

        # Collect the link to each manga from its description block.
        for desc in soup.select('.mg_description'):
            title = desc.select('.title')
            desc_urls.append(urllib.parse.urljoin(url_root, title[0].find('a').get('href')))
    except Exception as e:
        print(e)
        continue

The manga URLs are accumulated in the desc_urls list. target_url holds the URL of each listing page; the page is selected by the page= query-string parameter, so set the number of the page you want to fetch there.

Fetch the page with driver.get. Once it has loaded, grab the HTML with driver.page_source.encode('utf-8') and hand it to BeautifulSoup for ease of use. You can manage without BeautifulSoup, but I'm used to it, so I decided to use it. WebDriver can also work with XPath directly, so staying with WebDriver alone is fine too, as the sketch below shows.
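For comparison, a sketch of the same lookup done purely with WebDriver's XPath support (assuming the same .mg_description/.title structure used below):

# Equivalent lookup with WebDriver XPath, no BeautifulSoup involved.
for desc in driver.find_elements(By.XPATH, '//*[contains(@class, "mg_description")]'):
    link = desc.find_element(By.XPATH, './/*[contains(@class, "title")]//a')
    desc_urls.append(link.get_attribute('href'))

Note that get_attribute('href') already returns an absolute URL, so no urljoin is needed there.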

Since BeautifulSoup's select takes a CSS selector, we grab each .mg_description, find the .title inside it, and take the href of the `a` tag set there (screenshot of the HTML omitted).

You now have a list of the manga URLs on the page.

Go to the manga detail pages

Get page by URL in list

Fetch each page using the URLs stored in desc_urls; the fetch itself is driver.get(desc_url). As before, grab the HTML and hand it to BeautifulSoup.

for desc_url in desc_urls:
    try:
        driver.get(desc_url)

        html = driver.page_source.encode('utf-8')
        soupdesc = BeautifulSoup(html, 'html.parser')

Get and confirm title and author

Get the div element whose id is mg_main_column, then the `.main_title` element inside it, and from that the title and author. Print them to check that they are retrieved correctly.

        maindesc = soupdesc.find('div', id = 'mg_main_column')

        titlediv = maindesc.select('.main_title')[0]

        title = titlediv.find('h1').text.strip()
        author = titlediv.find('span').text.strip()

        print(title)
        print(author)

The HTML structure is as in the screenshot (omitted): the title sits in an h1 and the author in a span inside .main_title.

Get each episode's subtitle and detail URL from the episode list and navigate there

Each episode sits in an element with the class .episode_item, so fetch them all with the select CSS selector. From each of the resulting elements, take the subtitle and the URL to the episode's detail page.

        for eps in soupdesc.select('.episode_item'):
            eps_ttl_div = eps.select('.title')
            eps_title = eps_ttl_div[0].find('a')
            eps_url = urllib.parse.urljoin(url_root, eps_title.get('href'))
            eps_t = eps_title.text
            print(eps_t)

            try:
                driver.get(eps_url)
                html = driver.page_source.encode('utf-8')
                soupeps = BeautifulSoup(html, 'html.parser')

The subtitle is taken from the .title element and the URL from the href of its a tag (screenshot omitted). The detail screen is fetched with driver.get(eps_url) and, once obtained, handed to BeautifulSoup.

Get comments

Get the list of comments and collect their text into an array

We select the element whose class is .comment_list and then all the .comment elements inside it. Each comment's string is taken with c.text and collected into the comments_text array, using a list comprehension. (Python's comprehension syntax is apparently even Turing complete.)

                crlist = soupeps.select('.comment_list')
                comments = crlist[0].select('.comment')
                comments_text = [c.text for c in comments]

The HTML structure of the comment area is as in the screenshot (omitted). It looks like it could also be found via comment_viewer; pick whatever selector works cleanly for this area.
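If the .comment_list markup ever changes, a sketch of that comment_viewer alternative might look like this (whether comment_viewer is an id or a class is not visible here; this assumes an id):

                # Alternative (assumed): scope the lookup to the comment_viewer container.
                comments = soupeps.select('#comment_viewer .comment')
                comments_text = [c.text for c in comments]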

Process the comments with WordCloud

Morphological analysis with MeCab

The collected comment strings are morphologically analyzed with MeCab. First, add the import statement.

import MeCab

Perform morphological analysis with parse in MeCab.

                m = MeCab.Tagger('')
                parsed = m.parse('。'.join(comments_text))

The result of morphological analysis is as follows.

さすが	名詞,形容動詞語幹,*,*,*,*,さすが,サスガ,サスガ
に	助詞,副詞化,*,*,*,*,に,ニ,ニ
ない	形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,ない,ナク,ナク
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
www	名詞,一般,*,*,*,*,*
。	記号,句点,*,*,*,*,。,。,。
それ	名詞,代名詞,一般,*,*,*,それ,ソレ,ソレ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
筆記用具	名詞,一般,*,*,*,*,筆記用具,ヒッキヨウグ,ヒッキヨーグ
で	助詞,格助詞,一般,*,*,*,で,デ,デ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
あり	動詞,自立,*,*,五段・ラ行,連用形,ある,アリ,アリ
ませ	助動詞,*,*,*,特殊・マス,未然形,ます,マセ,マセ
ん	助動詞,*,*,*,不変化型,基本形,ん,ン,ン
…	記号,一般,*,*,*,*,…,…,…
。	記号,句点,*,*,*,*,。,。,。
...

Each morpheme is on its own line, so we split the result line by line with splitlines. Within each line the surface form and the feature string are separated by \t, the features are comma-separated, and the base form of the morpheme is the 7th field (index 6). While extracting, we exclude particles (助詞), auxiliary verbs (助動詞), and pronouns (代名詞), as well as a few frequent strings such as する and てる. Without this exclusion, those words would dominate the word cloud and be drawn in huge letters.

                words = ' '.join([x.split('\t')[1].split(',')[6]
                                  for x in parsed.splitlines()[:-1]
                                  if x.split('\t')[1].split(',')[0] not in ['助詞', '助動詞']
                                  and x.split('\t')[1].split(',')[1] not in ['代名詞']
                                  and x.split('\t')[1].split(',')[6] not in ['する', 'てる', 'なる', 'さん', 'そう', 'これ', 'ある']])
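To make the indexing concrete, here is how a single (made-up) line of MeCab output splits:

# One illustrative line: surface form, then a tab, then comma-separated features.
line = 'あり\t動詞,自立,*,*,五段・ラ行,連用形,ある,アリ,アリ'
surface, features = line.split('\t')
print(features.split(',')[0])   # part of speech: 動詞 (verb)
print(features.split(',')[6])   # base form: ある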

Create a word cloud with WordCloud

Create the word cloud and save it with WordCloud's to_file. comic_titles, comic_subtitles, comic_images, and comic_urls are lists declared elsewhere and used later when building the HTML; they hold the title, subtitle, image file name, and URL respectively.

When constructing the WordCloud, the font, background color, and size are specified. The font used here is "Ranobe POP", which apparently is often used on YouTube; feel free to choose whatever you like.

I am outputting to a file with wordcloud.to_file.
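The import and the initialization of those lists are not shown in the article; presumably something along these lines runs before the loops:

from wordcloud import WordCloud

# Accumulators for the HTML page built later (initialization assumed, not shown in the article).
comic_titles = []
comic_subtitles = []
comic_images = []
comic_urls = []
comic_index = 0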

                if len(words) > 0:
                    try:
                        comic_titles.append(title)
                        comic_subtitles.append(eps_t)
                        comic_images.append('%d.png' % (comic_index))
                        comic_urls.append(eps_url)
                        wordcloud = WordCloud(font_path=r"C:\WINDOWS\Fonts\Ranobe POP.otf", background_color="white", width=800, height=800).generate(words)
                        wordcloud.to_file("[Path you want to save]/wordcloud/%d.png" % (comic_index))
                        comic_index += 1
                    except Exception as e:
                        print(e)

The output is what was shown at the beginning. From these lists we create HTML and publish it on the site; a sketch follows.
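The HTML generation itself is not shown in the article; a minimal sketch (not the author's actual code) that links each word-cloud image to its episode could look like this:

# Write one linked word-cloud image per episode into a simple index page.
with open('index.html', 'w', encoding='utf-8') as f:
    f.write('<html><body>\n')
    for title, subtitle, image, url in zip(comic_titles, comic_subtitles,
                                           comic_images, comic_urls):
        f.write('<a href="%s" title="%s %s"><img src="wordcloud/%s" width="200"></a>\n'
                % (url, title, subtitle, image))
    f.write('</body></html>\n')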

Published site

https://comic.g-at.net/

When you access the URL above, a list of word clouds like the one in the screenshot (omitted) is displayed. Clicking a word cloud opens the corresponding manga.

In conclusion

Comments on services like Jump+ and Manga One can be quite harsh, but niconico has many gentle ones. Its users are, after all, accustomed to commenting.

It would be great to go beyond word clouds, analyze the comments in various other ways, and open the door to masterpieces that have been hard to encounter until now.
