I will play with morphological analysis.
While reading the KOTY[^1] general reviews to kill time the other day, it suddenly hit me: **it would be fun to run morphological analysis on all the KOTY general reviews so far.** I have never done morphological analysis before, so I'm doing this as a learning exercise.
First, I fetch the general reviews from the home console edition of the KOTY wiki with Nokogiri. The ones posted there cover 2005 through 2018. The handheld and eroge editions are skipped, by the way.[^2] The URL of each year's general review follows a single pattern,
https://koty.wiki/(Year)GC
so fetching them is easy. One thing to watch out for:
The general review is marked up differently in the HTML source in each of three periods, so I handled the three patterns separately. Here is the code that fetches them with Nokogiri:
KOTY_Scrape.rb
require 'nokogiri'
require 'open-uri'

# Create the output folder if it does not exist yet
Dir.mkdir("KOTY general comment") unless Dir.exist?("KOTY general comment")

# Start every year's text file empty
for year in 2005..2018 do
  File.write("KOTY general comment/#{year}Year.txt", "")
end

# 2005-2009: the text is in p elements under the body div; no br between paragraphs
for year in 2005..2009 do
  sleep 1 # pause between requests
  doc = Nokogiri::HTML(URI.open("https://koty.wiki/#{year}GC"))
  File.open("KOTY general comment/#{year}Year.txt", "a") do |text|
    doc.xpath("//div[@id='body']//p").each do |paragraph|
      text.puts paragraph.inner_text
    end
  end
end

# 2010-2011: the HTML source has no line breaks, only br tags
for year in 2010..2011 do
  sleep 1
  doc = Nokogiri::HTML(URI.open("https://koty.wiki/#{year}GC"))
  comment = doc.xpath("//p[@class='aapro']")
  # Turn each br into a real newline before extracting the text
  comment.search('br').each do |br|
    br.replace("\n")
  end
  File.open("KOTY general comment/#{year}Year.txt", "a") do |text|
    text.puts comment.inner_text
  end
end

# 2011-2018: the text is inside a blockquote element
# (2011 is also in the previous range, so both selectors are applied to it)
for year in 2011..2018 do
  sleep 1
  doc = Nokogiri::HTML(URI.open("https://koty.wiki/#{year}GC"))
  File.open("KOTY general comment/#{year}Year.txt", "a") do |text|
    text.puts doc.xpath("//blockquote").inner_text
  end
end
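The script above amounts to a lookup from year to XPath selector. As a minimal sketch in Python (the names `PERIODS`, `review_url`, and `selectors_for` are hypothetical; the year ranges and XPath strings are copied from the Ruby loops, including the overlap at 2011), it could look like this:

```python
# Year ranges and XPath selectors copied from the three Ruby loops.
# PERIODS, review_url and selectors_for are illustrative names only.
PERIODS = [
    (range(2005, 2010), "//div[@id='body']//p"),
    (range(2010, 2012), "//p[@class='aapro']"),
    (range(2011, 2019), "//blockquote"),
]


def review_url(year):
    """Build the general review URL for one year."""
    return f"https://koty.wiki/{year}GC"


def selectors_for(year):
    """Return every selector whose year range contains `year`."""
    return [xpath for years, xpath in PERIODS if year in years]


for year in (2005, 2011, 2018):
    print(review_url(year), selectors_for(year))
```

Because 2011 falls inside two ranges, it gets two selectors here, just as it is visited by two loops in the Ruby script.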
That fetches everything.
Looking at the resulting files, the file size seems to grow with each passing year.
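The size trend can also be checked in code rather than by eye. A minimal sketch, assuming the folder and file naming used by the scraper (`KOTY general comment/<year>Year.txt`); `review_sizes` is a hypothetical helper name:

```python
import os


def review_sizes(folder, years):
    """Map each year to the size in bytes of its saved text file.

    Years whose file is missing are skipped.
    """
    sizes = {}
    for year in years:
        path = os.path.join(folder, f"{year}Year.txt")
        if os.path.exists(path):
            sizes[year] = os.path.getsize(path)
    return sizes


# Prints nothing if the scraper has not been run yet.
for year, size in review_sizes("KOTY general comment", range(2005, 2019)).items():
    print(year, size)
```

Sizes are in bytes, so with UTF-8 Japanese text they are only a rough proxy for the length of each review.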
Ah, I was originally planning to do the morphological analysis on Win64 / Ruby / MeCab. But the combination of **"Win64"**, **"Ruby"**, and **"MeCab"** is... a huge pain to set up.
No, I did try, about twice, and I read various existing articles, but... **what doesn't work doesn't work**. So...
**Use Python**
And yes, setting up MeCab on Python / Win64 is quick and easy.
Following this article, all you have to do is install the mecab library. That's it. No need to rewrite DLLs in hell. I'm honestly impressed: this is destructive, it destroys the whole concept, it's a paradigm shift.
To start, let's run morphological analysis on the 2018 general review and render the result with WordCloud. I'll extract only the nouns.
MecabKOTY.py
import MeCab
from wordcloud import WordCloud

t = MeCab.Tagger()
# The scraper saves each year as "<year>Year.txt"
with open('KOTY general comment/2018Year.txt', encoding="UTF-8") as txt_file:
    text = txt_file.read()

# Walk the node list and keep the surface form of every noun
nodes = t.parseToNode(text)
s = []
while nodes:
    if nodes.feature.startswith("名詞"):  # 名詞 = noun
        s.append(nodes.surface)
    nodes = nodes.next

# Stopwords are Japanese surface forms: this, for, it, like, thing, thing
wc = WordCloud(width=720, height=480, background_color="black",
               stopwords={"これ", "ため", "それ", "よう", "こと", "もの"},
               font_path=r"C:\Windows\Fonts\HGRGE.TTC")  # raw string keeps the backslashes literal
wc.generate(" ".join(s))
wc.to_file('KOTY_wc.png')
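The noun check in the script works on the `feature` string of each node. With the IPAdic dictionary (an assumption; other dictionaries lay out their fields differently), that string is comma-separated with the part of speech first. Here is a minimal sketch using hand-written sample strings instead of real MeCab output, with `is_noun` as a hypothetical helper:

```python
# Hand-written (surface, feature) pairs in the IPAdic field layout.
# They stand in for MeCab nodes so the sketch runs without MeCab installed.
SAMPLE_NODES = [
    ("ゲーム", "名詞,一般,*,*,*,*,ゲーム,ゲーム,ゲーム"),
    ("を", "助詞,格助詞,一般,*,*,*,を,ヲ,ヲ"),
    ("遊ぶ", "動詞,自立,*,*,五段・バ行,基本形,遊ぶ,アソブ,アソブ"),
]


def is_noun(feature):
    """True when the first comma-separated field is 名詞 (noun)."""
    return feature.split(",")[0] == "名詞"


nouns = [surface for surface, feature in SAMPLE_NODES if is_noun(feature)]
print(nouns)
```

Comparing the first field, rather than a fixed-length slice, keeps the check valid even if a tag has a different number of characters.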
Ahh, that feels good. This really looks like morphological analysis!!! For 2007, for example, **"Scenario"** stands out like this,
and for 2014, **"Rider"** stands out. Each year's KOTY general review has its own personality.
Next, let's analyze all the reviews together.
MecabKOTY.py
import MeCab
from wordcloud import WordCloud

t = MeCab.Tagger()
s = []
# range() excludes its end, so 2019 is needed to include 2018
for y in range(2005, 2019):
    with open(f'KOTY general comment/{y}Year.txt', encoding="UTF-8") as txt_file:
        text = txt_file.read()
    nodes = t.parseToNode(text)
    while nodes:
        if nodes.feature.startswith("名詞"):  # 名詞 = noun
            s.append(nodes.surface)
        nodes = nodes.next

# Stopwords are Japanese surface forms: this, for, it, like, thing, thing
wc = WordCloud(width=720, height=480, background_color="black",
               stopwords={"これ", "ため", "それ", "よう", "こと", "もの"},
               font_path=r"C:\Windows\Fonts\HGRGE.TTC")
wc.generate(" ".join(s))
wc.to_file('KOTY_wc.png')
Here is the result. It's quite a sight. **"Player"**, **"Game"**, **"Kusoge"**: these feel like the very symbols of KOTY.
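The word cloud sizes words by how often they occur, but it does not show the counts themselves. As a rough sketch, `collections.Counter` can list the most frequent entries of a noun list like `s`; the list below is made-up sample data, not the real result, and `top_words` is a hypothetical helper:

```python
from collections import Counter

# Made-up stand-in for the noun list `s` built by the script.
sample_nouns = ["ゲーム", "プレイヤー", "ゲーム", "バグ", "ゲーム", "プレイヤー"]


def top_words(words, n):
    """Return the n most frequent words with their counts, most frequent first."""
    return Counter(words).most_common(n)


print(top_words(sample_nouns, 2))
```

Passing the real `s` instead of `sample_nouns` would give the numbers behind the largest words in the image.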
Now, what if we narrow this down further and try to extract only proper nouns?
MecabKOTY.py
import MeCab
from wordcloud import WordCloud

t = MeCab.Tagger()
s = []
# range() excludes its end, so 2019 is needed to include 2018
for y in range(2005, 2019):
    with open(f'KOTY general comment/{y}Year.txt', encoding="UTF-8") as txt_file:
        text = txt_file.read()
    nodes = t.parseToNode(text)
    while nodes:
        # 名詞,固有名詞 = noun, proper noun
        if nodes.feature.startswith("名詞,固有名詞"):
            s.append(nodes.surface)
        nodes = nodes.next

wc = WordCloud(width=720, height=480, background_color="black",
               font_path=r"C:\Windows\Fonts\HGRGE.TTC")
wc.generate(" ".join(s))
wc.to_file('KOTY_wc.png')
It's history. **You can feel the history.**
By the way, I wondered whether narrowing this down further to **"organization names"** would pull out the **"kusoge makers"**,
but I gave up because a lot of things that are **"not organizations"** got mixed in. It might work out with a different dictionary.
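Limiting the extraction to organization names means matching one more field of the feature string. A minimal sketch, assuming the IPAdic layout in which organizations are tagged `名詞,固有名詞,組織`; `has_pos_prefix` is a hypothetical helper and the samples are hand-written, not real MeCab output:

```python
def has_pos_prefix(feature, prefix):
    """True when the leading part-of-speech fields of `feature` equal `prefix`."""
    fields = feature.split(",")
    return tuple(fields[:len(prefix)]) == tuple(prefix)


# noun, proper noun, organization (IPAdic tag names, assumed)
ORGANIZATION = ("名詞", "固有名詞", "組織")

# Hand-written samples in the IPAdic field layout.
samples = [
    ("任天堂", "名詞,固有名詞,組織,*,*,*,任天堂,ニンテンドウ,ニンテンドー"),
    ("東京", "名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー"),
]

organizations = [surface for surface, feature in samples
                 if has_pos_prefix(feature, ORGANIZATION)]
print(organizations)
```

The filter is only as good as the dictionary's tagging, which is why misclassified words slip through.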
Next, a change of pace: let's check **how the frequency of a specific word changes from year to year**. The first word to look at is... well, let's go with **"bug"**. I'll draw a line graph with matplotlib.
KOTYPlot.py
import MeCab
import matplotlib.pyplot as plt

t = MeCab.Tagger()
years = range(2005, 2019)  # the end is exclusive, so this covers 2005 through 2018
c = []  # one counter per year
for y in years:
    c.append(0)
    with open(f'KOTY general comment/{y}Year.txt', encoding="UTF-8") as txt_file:
        text = txt_file.read()
    nodes = t.parseToNode(text)
    while nodes:
        if nodes.surface == "バグ":  # バグ = bug
            c[-1] += 1
        nodes = nodes.next

plt.plot(years, c, linewidth=4)
plt.xlabel("Year", fontsize=24)
plt.ylabel("Occurrence:Bug", fontsize=24)
plt.grid(True)
plt.savefig("KOTYgraph.png")
Too scary. 2015 is off the charts. It is probably the result of two great towers of bugs, "Ajinoko" and "Tetaru", colliding. Also, though it is overshadowed by that spike, 2013 with zero **"bugs"** is remarkable too. Come to think of it, I think 2013 was a contest of crappiness in a different direction from bugs.
The next word to graph is... **"Year-end"**.
!?!?!? There is a surprising amount of regularity here!? It has a roller coaster shape, doesn't it? Part of each cycle happens in an instant and the rest takes its time. I wonder if the appearance of "year-end monsters" has some periodicity.
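Both graphs come from the same per-year counting, with only the target word changed. A minimal sketch with the word as a parameter; `count_by_year` is a hypothetical helper, the token lists are made up, and "バグ" and "年末" are assumed to be the Japanese surface forms of "bug" and "year-end":

```python
def count_by_year(tokens_by_year, word):
    """Count exact occurrences of `word` in each year's token list."""
    return {year: tokens.count(word) for year, tokens in tokens_by_year.items()}


# Made-up token lists standing in for the surfaces MeCab would return.
tokens_by_year = {
    2016: ["バグ", "年末", "ゲーム"],
    2017: ["年末", "年末", "魔物"],
}

print(count_by_year(tokens_by_year, "年末"))
print(count_by_year(tokens_by_year, "バグ"))
```

The values of the returned dictionary can be passed to `plt.plot` in place of the list `c`.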
That was fun.
[^1]: Short for "Kusoge of the Year", a 2channel/5channel thread that decides "the crappiest game of the year".
[^2]: The reasons are that a single year there can have several reviews, which is a little off the point here, and that I am a minor.