[PYTHON] Topic-model analysis of Shōsetsuka ni Narō ("Let's Become a Novelist") with gensimPy3

gensim is a Python library for topic modeling. Officially, it supports only Python 2 (2.5 <= Python < 3.0).

However, Samantp has released a library called gensimPy3. It's a fork of gensim for Python 3.3.

This time, I used gensimPy3 to see whether I could reproduce Shoto's article "Analyzing the Shōsetsuka ni Narō ranking with a topic model (gensim)".

Installation of GensimPy3

Clone the source code from https://github.com/samantp/gensimPy3:

git clone git@github.com:samantp/gensimPy3.git

Next, I ran the tests:

python setup.py test

which failed with the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 3973: invalid start byte

The error made me uneasy, but I ignored it and went ahead with the installation.

python setup.py install

A SyntaxError was reported for some reason, but the installation succeeded.

Differences in environment

The differences from Shoto's article http://sucrose.hatenablog.com/entry/2013/04/27/225218 are described below.

- Python 3.3.1 installed with pyenv
- gensimPy3 instead of gensim
- pyquery instead of BeautifulSoup
- requests instead of urllib2

Because pyquery accepts the same selector notation as jQuery, it is easy to try selectors out in the Chrome Developer Tools console before putting them in the script. Convenient.

Source code

topic_model_in_narou.py


# -*- coding: utf-8 -*-
import requests
from pyquery import PyQuery as pq
import gensim


def fetch_narou_ranking_html():
    # Download the cumulative ranking page and decode it as UTF-8.
    r = requests.get('http://yomou.syosetu.com/rank/list/type/total_total/')
    r.encoding = 'utf-8'
    return r.text


def collect_tags(d):
    # Collect one list of tag names per element matched by '.s'.
    d_novels = d('.s')
    tags = []
    for d_novel in d_novels:
        d_tag_name_list = d_novel.findall('a')
        tags_in_a_novel = [d_tag_name.text for d_tag_name in d_tag_name_list]
        tags.append(tags_in_a_novel)
    return tags


if __name__ == "__main__":
    html = fetch_narou_ranking_html()
    d = pq(html.encode('utf-8'))
    tags = collect_tags(d)

    dictionary = gensim.corpora.Dictionary(tags)
    # Drop tags that appear in fewer than 3 novels (other arguments keep their defaults).
    dictionary.filter_extremes(no_below=3)
    corpus = [dictionary.doc2bow(text) for text in tags]
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)
    # Print the top 5 tags of every topic (-1 selects all topics).
    for x in lda.show_topics(-1, 5):
        print(x)
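
For readers unfamiliar with the bag-of-words step, here is a minimal plain-Python sketch of what gensim.corpora.Dictionary and doc2bow do conceptually: give every distinct tag an integer id, then represent each novel as (id, count) pairs. The helper names build_token_ids and to_bow and the sample tags are hypothetical, and gensim's actual id assignment and internals may differ.

```python
from collections import Counter


def build_token_ids(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Convert one token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Hypothetical sample data: one list of tags per novel.
sample_tags = [
    ["Fantasy", "magic", "cheat"],
    ["Fantasy", "Reincarnation"],
    ["magic", "Fantasy", "magic"],
]

token2id = build_token_ids(sample_tags)
corpus = [to_bow(doc, token2id) for doc in sample_tags]
print(token2id)
print(corpus)
```

Each inner list of corpus plays the role of one document in the corpus that is passed to LdaModel.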

Running topic_model_in_narou.py built the model and printed the topics. The results are as follows.

0.115*Fantasy + 0.068*magic + 0.041*cheat + 0.034*Harem + 0.028*Reincarnation
0.173*magic + 0.095*Fantasy + 0.039*dark + 0.033*Trip + 0.027*Reincarnation
0.106*Reincarnation + 0.087*Harem + 0.074*The strongest hero + 0.063*cheat + 0.052*Fantasy
0.079*Fantasy + 0.069*cheat + 0.062*love + 0.059*Reincarnation + 0.041*Another world trip
0.088*Fantasy + 0.063*Another world trip + 0.051*Harem + 0.039*adventure + 0.039*OVL Bunko Grand Prize entry
0.105*cheat + 0.103*Fantasy + 0.062*The strongest hero + 0.058*magic + 0.044*Reincarnation
0.099*Fantasy + 0.089*Reincarnation + 0.045*The strongest hero + 0.045*strongest + 0.034*magic
0.051*magic + 0.051*Upstart + 0.051*monster + 0.039*VRMMO + 0.039*serious
0.140*Fantasy + 0.077*cheat + 0.054*Reincarnation + 0.043*magic + 0.038*adventure
0.168*Fantasy + 0.073*magic + 0.052*love + 0.026*adventure + 0.026*war

Yes, it is all fantasy... The ranking is dominated by a single genre, so the topics barely separate. It would probably be better to take the data from somewhere like pixiv novels.
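
To put a number on that impression, the printed topic strings can be parsed and the tags counted. The following is only a minimal sketch: parse_topic is a hypothetical helper, and the three input strings are taken from the output above.

```python
from collections import Counter


def parse_topic(line):
    """Split a 'weight*tag + weight*tag' string into (tag, weight) pairs."""
    pairs = []
    for term in line.split("+"):
        weight, tag = term.strip().split("*", 1)
        pairs.append((tag.strip(), float(weight)))
    return pairs


# Three of the ten topic strings printed above.
topic_lines = [
    "0.115*Fantasy+ 0.068*magic+ 0.041*cheat+ 0.034*Harem+ 0.028*Reincarnation",
    "0.051*magic+ 0.051*Upstart+ 0.051*monster+ 0.039*VRMMO + 0.039*serious",
    "0.168*Fantasy+ 0.073*magic+ 0.052*love+ 0.026*adventure+ 0.026*war",
]

# Count in how many topics each tag shows up among the top five.
tag_counts = Counter(tag for line in topic_lines for tag, _ in parse_topic(line))
print(tag_counts.most_common(3))
```

Applied to all ten lines, the same count shows Fantasy among the top five tags of nine topics.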

Postscript

I thought that changing the arguments of dictionary.filter_extremes (defaults: no_below=5, no_above=0.5, keep_n=100000) and filtering more aggressively might break up the fantasy-only result, so I modified the script.

topic_model_in_narou.py


# -*- coding: utf-8 -*-
import requests
from pyquery import PyQuery as pq
import gensim


def fetch_narou_ranking_html():
    # Download the cumulative ranking page and decode it as UTF-8.
    r = requests.get('http://yomou.syosetu.com/rank/list/type/total_total/')
    r.encoding = 'utf-8'
    return r.text


def collect_tags(d):
    # Collect one list of tag names per element matched by '.s'.
    d_novels = d('.s')
    tags = []
    for d_novel in d_novels:
        d_tag_name_list = d_novel.findall('a')
        tags_in_a_novel = [d_tag_name.text for d_tag_name in d_tag_name_list]
        tags.append(tags_in_a_novel)
    return tags


if __name__ == "__main__":
    html = fetch_narou_ranking_html()
    d = pq(html.encode('utf-8'))
    tags = collect_tags(d)

    dictionary = gensim.corpora.Dictionary(tags)
    # Changed: also drop tags that appear in more than 5% of the novels.
    dictionary.filter_extremes(no_below=5, no_above=0.05, keep_n=10000)
    corpus = [dictionary.doc2bow(text) for text in tags]
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=20, id2word=dictionary)
    # Print the top 5 tags of every topic (-1 selects all topics).
    for x in lda.show_topics(-1, 5):
        print(x)

I set no_above to 0.05, so tags that appear in more than 5% of all novels are dropped from the dictionary. Besides no_above, the modified script uses no_below=5, keep_n=10000, and num_topics=20.

gensim official website http://radimrehurek.com/gensim/corpora/dictionary.html
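
As a rough illustration of what no_below and no_above do, here is a plain-Python sketch of the filtering rule as I understand it from the gensim documentation: a tag is kept when it occurs in at least no_below documents and in no more than the fraction no_above of all documents. filter_tokens and the sample data are hypothetical, keep_n is ignored for brevity, and this is not gensim's implementation.

```python
from collections import Counter


def filter_tokens(documents, no_below, no_above):
    """Return the tokens whose document frequency passes both thresholds."""
    num_docs = len(documents)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    return {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    }


# Hypothetical sample: "Fantasy" is on every novel, the "unique" tags on one each.
sample_tags = [
    ["Fantasy", "magic", "war"],
    ["Fantasy", "magic", "unique-a"],
    ["Fantasy", "war", "unique-b"],
    ["Fantasy", "magic", "war"],
]

kept = filter_tokens(sample_tags, no_below=2, no_above=0.75)
print(sorted(kept))  # ['magic', 'war']
```

The tag that is on every novel and the tags that occur only once are both removed, which is the effect the lower no_above value is meant to have on "Fantasy" in the ranking data.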

Here are the results.

0.166*growth + 0.133*comedy + 0.100*battle + 0.100*Upstart + 0.067*Narokon Grand Prize
0.106*SF + 0.054*Trip + 0.054*VRMMO + 0.054*Narokon Grand Prize + 0.054*Upstart
0.142*Dragon + 0.142*dark + 0.073*slave + 0.073*Aristocrat + 0.073*battle
0.120*Wizard/witch + 0.081*spirit + 0.081*Elf + 0.081*Beastman + 0.081*Upstart
0.136*Brave + 0.136*comedy + 0.092*Labyrinth + 0.092*strongest + 0.092*VRMMO
0.140*Summon another world + 0.106*Nation/People + 0.071*VRMMO + 0.053*doting + 0.036*slave
0.153*Knight + 0.078*Misunderstanding + 0.078*middle Ages + 0.078*Wizard/witch + 0.078*Summon another world
0.125*Summon + 0.125*Adventurer + 0.125*Beastman + 0.125*strongest + 0.125*Convenience
0.151*Brave + 0.091*monster + 0.091*Beautiful + 0.061*war + 0.061*doting
0.163*monster + 0.122*friendship + 0.082*Aristocrat + 0.082*Upstart + 0.082*strongest
0.260*spirit + 0.054*Aristocrat + 0.054*comedy + 0.054*serious + 0.054*Nation/People
0.143*Adventurer + 0.096*battle + 0.096*serious + 0.096*Domestic affairs + 0.049*Upstart
0.189*comedy + 0.143*Misunderstanding + 0.096*VRMMORPG + 0.096*comedy + 0.049*High school student
0.147*Dragon + 0.118*strongest + 0.060*Elf + 0.060*battle + 0.060*war
0.173*slave + 0.088*Trip + 0.088*Magic + 0.045*growth + 0.045*monster
0.173*Domestic affairs + 0.088*Brave + 0.088*Trip + 0.045*strongest + 0.045*slave
0.130*serious + 0.088*battle + 0.088*High school student + 0.088*Senki + 0.088*Transfer to another world
0.143*skill + 0.107*Template + 0.072*war + 0.072*Upstart + 0.072*Magic
0.206*war + 0.070*Domestic affairs + 0.070*middle Ages + 0.070*Summon + 0.070*Nation/People
0.130*slave + 0.130*guild + 0.088*Aristocrat + 0.088*Summon + 0.045*war

In the end, it is still all fantasy...

On closer inspection, though, groups such as "serious, battle, High school student, Senki, Transfer to another world", "SF, Trip, VRMMO, Narokon Grand Prize, Upstart", and "war, Domestic affairs, middle Ages, Summon, Nation/People" look like somewhat different genres, so the result is better than the first run, which left no_above at its default value.
