[PYTHON] I tried morphological analysis of the Kusoge of the Year general reviews

I will play with morphological analysis.

Preface

When I was reading the KOTY [^1] reviews to kill time the other day, it suddenly occurred to me that **it would be fun to run morphological analysis on all the KOTY general reviews so far**. I had never done morphological analysis before, so I'd like to try it as a learning exercise.

Downloading the general reviews

First, let's grab the general reviews from the KOTY home-console wiki with Nokogiri. The ones posted cover 2005 through 2018. By the way, I'm passing on the mobile and eroge divisions. [^2] The URL of each year's general review is

https://koty.wiki/(Year)GC

so the URLs follow a single pattern and are easy to fetch. One thing to note: the HTML markup of the general review differs across three periods, so the three patterns are handled separately below. Here is the code to fetch everything with Nokogiri.

KOTY_Scrape.rb


require 'nokogiri'
require 'open-uri'
if ! Dir::exist?("KOTY general comment") #Create folder for saving
    Dir::mkdir("KOTY general comment")
end
for year in 2005..2018 do #Initialize the text file in the folder
    File.open("KOTY general comment/#{year}Year.txt","w") do |text|
    end
end
for year in 2005..2009 do # 2005-2009: the review text is in p elements under div#body; no br tags between the p's
    sleep 1
    doc = Nokogiri::HTML(URI.open("https://koty.wiki/#{year}GC"))
    doc.xpath("//div[@id='body']//p").each do |paragraph|
        File.open("KOTY general comment/#{year}Year.txt","a") do |text|
            text.puts paragraph.inner_text
        end
    end
end
for year in 2010..2011 do # 2010-2011: the HTML source has no line breaks; lines are separated only by br tags
    sleep 1
    doc = Nokogiri::HTML(URI.open("https://koty.wiki/#{year}GC"))
    comment = doc.xpath("//p[@class='aapro']")
    comment.search('br').each do |br|
        br.replace("\n")
    end
    File.open("KOTY general comment/#{year}Year.txt","a") do |text|
        text.puts comment.inner_text
    end
end
for year in 2012..2018 do # 2012-2018: the review text is inside a blockquote element
    sleep 1
    doc = Nokogiri::HTML(URI.open("https://koty.wiki/#{year}GC"))
    File.open("KOTY general comment/#{year}Year.txt","a") do |text|
        text.puts doc.xpath("//blockquote").inner_text
    end
end

Got them all. The saved files look like this:

[screenshot of the saved text files]

Looking at it like this, it seems that the file size tends to increase with each passing year.

The problem here

Now, I was originally planning to do the morphological analysis with Win64 / Ruby / MeCab. But **"Win64"** plus **"Ruby"** plus **"MeCab"**... building that environment is a huge pain.

I did try. About twice. I read through various existing articles, but... **what doesn't work just doesn't work.** So...

**Use Python**

Environment

So, the environment. Setting up MeCab for Python on Win64 is surprisingly quick and easy:

  1. Download the unofficial Win64 build installer from here and install it with the UTF-8 dictionary.
  2. Install the mecab library for Python by referring to this article.

That's all. No hellish DLL rewriting required. Honestly, I'm impressed; this is disruptive, it shatters the old assumptions, it's a paradigm shift.
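Just to confirm that the setup works, parsing a single sentence is enough (the test sentence below is just something I made up):

import MeCab

# Create a tagger with the dictionary installed above
tagger = MeCab.Tagger()

# Parse an arbitrary test sentence and print one morpheme per line
print(tagger.parse("クソゲーオブザイヤーを形態素解析する"))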

Playing around

For now, let's run morphological analysis on the 2018 general review and render the result with WordCloud. The plan is to extract only the nouns.

MecabKOTY.py


import MeCab
from wordcloud import WordCloud
t = MeCab.Tagger()

with open('KOTY general comment/2018Year.txt',encoding="UTF-8") as txt_file:
    text = txt_file.read()

nodes = t.parseToNode(text)
s = []

while nodes:
    if nodes.feature[:2] == "名詞":  # IPAdic part-of-speech tags are in Japanese; "名詞" = noun
        s.append(nodes.surface)
    nodes = nodes.next

wc = WordCloud(width=720, height=480, background_color="black",
               stopwords={"this", "For", "It", "Yo", "thing"},
               font_path=r"C:\Windows\Fonts\HGRGE.TTC")
wc.generate(" ".join(s))
wc.to_file('KOTY_wc.png')

[word cloud for 2018]

Ahh, that feels good, that feels good. Now this looks like morphological analysis! For 2007, for example:

[word cloud for 2007]

**"Scenario"** stands out like this, and for 2014:

[word cloud for 2014]

the word **"Rider"** stands out. Each year's KOTY general review has a character of its own.

Next, let's analyze all the reviews together.

MecabKOTY.py


import MeCab
from wordcloud import WordCloud
t = MeCab.Tagger()
s = []

for y in range(2005, 2019):  # 2005 through 2018 inclusive
    with open(f'KOTY general comment/{y}Year.txt',encoding="UTF-8") as txt_file:
        text = txt_file.read()
    nodes = t.parseToNode(text)
    while nodes:
        if nodes.feature[:2] == "名詞":  # "名詞" = noun
            s.append(nodes.surface)
        nodes = nodes.next

wc = WordCloud(width=720, height=480, background_color="black",
               stopwords={"this", "For", "It", "Yo", "thing"},
               font_path=r"C:\Windows\Fonts\HGRGE.TTC")
wc.generate(" ".join(s))
wc.to_file('KOTY_wc.png')

It comes out like this:

[word cloud of all the reviews combined]

It's a masterpiece. **"Player"**, **"Game"**, **"Kusoge"**: it feels like the very essence of KOTY.

Now, what if we narrow this down further and try to extract only proper nouns?

MecabKOTY.py


import MeCab
from wordcloud import WordCloud
t = MeCab.Tagger()
s = []

for y in range(2005, 2019):  # 2005 through 2018 inclusive
    with open(f'KOTY general comment/{y}Year.txt',encoding="UTF-8") as txt_file:
        text = txt_file.read()
    nodes = t.parseToNode(text)
    while nodes:
        if nodes.feature[:7] == "名詞,固有名詞":  # "名詞,固有名詞" = proper noun
            s.append(nodes.surface)
        nodes = nodes.next

wc = WordCloud(width=720, height=480, background_color="black",
               font_path=r"C:\Windows\Fonts\HGRGE.TTC")
wc.generate(" ".join(s))
wc.to_file('KOTY_wc.png')

[word cloud of proper nouns across all years]

It's history. **You can feel the history.**

By the way, I wondered whether I could pull out the **"kusoge makers"** by narrowing it down even further, to **"organization names"** only.
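It's the same pattern as the scripts above, just with the part-of-speech filter narrowed one more level; roughly something like this (the output file name is arbitrary):

import MeCab
from wordcloud import WordCloud

t = MeCab.Tagger()
s = []

for y in range(2005, 2019):
    with open(f'KOTY general comment/{y}Year.txt', encoding="UTF-8") as txt_file:
        text = txt_file.read()
    nodes = t.parseToNode(text)
    while nodes:
        # IPAdic tags organization names as 名詞,固有名詞,組織 (noun, proper noun, organization)
        if nodes.feature.startswith("名詞,固有名詞,組織"):
            s.append(nodes.surface)
        nodes = nodes.next

wc = WordCloud(width=720, height=480, background_color="black",
               font_path=r"C:\Windows\Fonts\HGRGE.TTC")
wc.generate(" ".join(s))
wc.to_file('KOTY_wc_org.png')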

[word cloud of organization names]

I stopped there, because too many things that **aren't organizations at all** got mixed in. It might work better with a different dictionary.
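For reference, MeCab.Tagger accepts the same -d (dictionary directory) option as the command-line tool, so swapping dictionaries is just a matter of pointing it at another install; a sketch with a placeholder path:

import MeCab

# -d points MeCab at a different dictionary directory
# (C:\neologd is a placeholder; substitute wherever your dictionary is installed)
t = MeCab.Tagger(r"-d C:\neologd")
print(t.parse("クソゲーオブザイヤー"))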

Next, let's change tack a little and look at **how the frequency of a specific word changes from year to year**. The first word to check is... let's go with **"bug"**. I'll draw a line graph with matplotlib.

KOTYPlot.py


import MeCab
import matplotlib.pyplot as plt
t = MeCab.Tagger()
c = []
for y in range(2005, 2019):  # 2005 through 2018 inclusive
    c.append(0)
    with open(f'KOTY general comment/{y}Year.txt',encoding="UTF-8") as txt_file:
        text = txt_file.read()
    nodes = t.parseToNode(text)
    while nodes:
        if nodes.surface == "バグ":  # "バグ" is the surface form of "bug"
            c[-1] += 1
        nodes = nodes.next

plt.plot(range(2005, 2019), c, linewidth=4)
plt.xlabel("Year", fontsize=24)
plt.ylabel("Occurrences: bug", fontsize=24)
plt.grid(True)
plt.savefig("KOTYgraph.png")

[line graph of "bug" occurrences by year]

Scary. 2015 is off the charts. Probably the result of two towering bug-fests, "Ajinoko" and "Tetaru", colliding head-on. Also, overshadowed by that spike, the fact that **"bug"** never appears in 2013 is remarkable. Indeed, I think 2013's contest for crappiness ran in a direction other than bugs.

The next word to graph is... **"year-end"**.
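Only the target word changes from the previous script, so here is a sketch that pulls the counting into a small reusable helper (count_by_year is a name of my own; "年末" is the Japanese surface form for "year-end"):

import MeCab
import matplotlib.pyplot as plt

t = MeCab.Tagger()
years = range(2005, 2019)

def count_by_year(word):
    """Count how often `word` appears as a surface form in each year's review."""
    counts = []
    for y in years:
        with open(f'KOTY general comment/{y}Year.txt', encoding="UTF-8") as txt_file:
            text = txt_file.read()
        n = 0
        nodes = t.parseToNode(text)
        while nodes:
            if nodes.surface == word:
                n += 1
            nodes = nodes.next
        counts.append(n)
    return counts

plt.plot(years, count_by_year("年末"), linewidth=4)
plt.xlabel("Year", fontsize=24)
plt.ylabel("Occurrences: year-end", fontsize=24)
plt.grid(True)
plt.savefig("KOTYgraph_nenmatsu.png")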

[line graph of "year-end" occurrences by year]

!?!?!? I sense a suspicious amount of regularity!? It's a roller-coaster shape, isn't it: it shoots up in an instant and then takes its time coming back down. I wonder if the appearance of "year-end monsters" has some kind of periodicity.

Summary

It was fun.

[^1]: Short for "Kusoge of the Year", the 2channel/5channel thread that decides "the crappiest game of the year".
[^2]: The reason for skipping them is that they stray a bit from the point here, and that I am a minor.
