[PYTHON] Voice actor network analysis (using word2vec and networkx) (1/2)

The rough flow introduced in this article

・Scrape a list of voice actors' names and genders
・Tokenize each listed voice actor's Wikipedia text with MeCab and train a word2vec model on it
・Play with word similarity using the trained word2vec model

Rough flow of analysis to be introduced next time

・Build and visualize a network of female voice actors (a rough sketch of the idea appears below)
・Use the network structure to cluster similar voice actors into groups
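As a preview, the idea for next time looks roughly like the following sketch. The `topn` and `cutoff` values here are illustrative placeholders, not the parameters actually used in the next article:

import networkx as nx

def build_similarity_network(model, names, topn=10, cutoff=0.3):
    # Connect pairs of voice actors whose word2vec similarity clears the
    # cutoff; both parameters are illustrative, not the real ones.
    G = nx.Graph()
    for name in names:
        if name not in model.wv:  # skip names that fell below min_count
            continue
        for other, sim in model.wv.most_similar(name, topn=topn):
            if other in names and sim >= cutoff:
                G.add_edge(name, other, weight=sim)
    return G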

Nice to meet you, my name is bamboo-nova. On Qiita I will mainly post **just-for-fun analyses**; for serious stories and analysis, please see my Hatena blog below. Also, NLP is not my specialty at all, so the code may be amateurish, but please bear with me...!

http://yukr.hatenablog.com/

This is my first post, and for my first fun analysis I tried **voice actor network analysis**.

This time, I fetched each voice actor's Wikipedia article, built a network of similar and related voice actors from that text, and then clustered it further.

The model is trained and analyzed only on female voice actors from their teens through their thirties.

**Update:** I have put the source code covering this whole series of steps on GitHub, so please have a look if you are interested!

bamboo-nova/seiyu_network

First, get a list of female voice actors.

Scrape the following site to get the list of voice actors' names. Voice actors whose age is not disclosed are not included in this site's list, so some famous names may be missing from the analysis.

http://lain.gr.jp/voicedb/individual

First, import the required modules.

import MeCab
import codecs
import urllib
import urllib.parse as parser
import urllib.request as request
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

from gensim.models import word2vec

import re

# Two taggers: 'mecab' (-Owakati) is used with parseToNode below, 'title' with parse.
mecab = MeCab.Tagger('-Owakati -d /usr/local/mecab/lib/mecab/dic/mecab-ipadic-neologd')
title = MeCab.Tagger('-d /usr/local/mecab/lib/mecab/dic/mecab-ipadic-neologd')
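For reference, the two taggers return different output formats. A quick sketch (the sample sentence is arbitrary, and the exact output depends on your dictionary install):

print(mecab.parse('声優のネットワーク分析をする'))  # wakati: tokens separated by spaces
print(title.parse('声優のネットワーク分析をする'))   # one token per line with its full feature string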

Fetch the listing page for each age bracket.

url = "http://lain.gr.jp/voicedb/individual/age/avg/10"
 
response = requests.get(url)
response.encoding = response.apparent_encoding
 
soup10 = BeautifulSoup(response.text, 'html.parser')
 
url = "http://lain.gr.jp/voicedb/individual/age/avg/20"
 
response = requests.get(url)
response.encoding = response.apparent_encoding
 
soup20 = BeautifulSoup(response.text, 'html.parser')

url = "http://lain.gr.jp/voicedb/individual/age/avg/30"
 
response = requests.get(url)
response.encoding = response.apparent_encoding
 
soup30 = BeautifulSoup(response.text, 'html.parser')

First, get the voice actors' names. On the pages fetched above, every voice actor is linked through a tag whose href attribute contains "voicedb/profile", so we collect all tags with a matching href. The image below shows one of those pages inspected in Chrome; you can confirm that the names are stored as the text of the "voicedb/profile" links.

(Screenshot: the "voicedb/profile" links inspected in Chrome's developer tools)

Below is the code to get the names:

corpus = []
for soup in soups.values():
    for p in soup.find_all(href=re.compile("voicedb/profile")):
        corpus.append(p.text)

Next, get the genders. From the inspected page above, it is enough to read the src and alt attributes of the img tags, so collect them as follows. As for the `alt=img.attrs.get('alt', 'N')` part: if the alt attribute exists, its value is used; otherwise 'N' is substituted.

data = []
for soup in soups.values():
    for img in soup.find_all('img'):
        data.append(dict(src=img['src'],
                         alt=img.attrs.get('alt', 'N')))

gender = []
for res in data:
    for k, v in res.items():
        # The site's alt text is Japanese: '女性' = female, '男性' = male
        if v == '女性' or v == '男性':
            gender.append(v)

Then combine the names and genders into one data frame and save it in CSV format.

name = pd.DataFrame(corpus)
gen = pd.DataFrame(gender)
res = pd.concat([name, gen],axis=1)
res.columns = ['Name','Gender']
res.to_csv('seiyu.csv')
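A quick optional sanity check of the scraped list:

print(res.head())
print(res['Gender'].value_counts())  # rough gender balance of the list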

Get the Wikipedia profile text of the female voice actors in the list.

Now select only the female voice actors from the saved data frame and fetch each one's profile from Wikipedia. The retrieved text is saved as pwiki.txt.

df = pd.read_csv('seiyu.csv')

# Wikipedia base URL
link = "https://ja.wikipedia.org/wiki/"
# Target only female voice actors ('女性' = female)
df_women = df[df.Gender == '女性']
keyword = list(df_women.Name)
corpus = []
for word in keyword:
    # Download the article for each voice actor
    try:
        with request.urlopen(link + parser.quote_plus(word)) as response:
            # response is raw HTML
            html = response.read().decode('utf-8')
            soup = BeautifulSoup(html, "lxml")
            # Get the <p> tags
            p_tags = soup.find_all('p')
            for p in p_tags:
                corpus.append(p.text.strip())
    except urllib.error.HTTPError as err:
        # Voice actors without a Wikipedia page return a 404, so skip them
        if err.code == 404:
            continue
        else:
            raise

with codecs.open("pwiki.txt", "w", "utf-8") as f:
    f.write("\n".join(corpus))

Tokenize the text file and train and save a word2vec model.

Now let's tokenize the text we collected. Since the corpus contains many anime titles, compound proper nouns have to be kept intact (for example, "High School Fleet" is treated as a proper noun / organization name, and a plain parseToNode run leaves its base-form field as just '*'). Also, as the example below shows, **for some reason tokenizing "Precure" (プリキュア) with mecab + neologd turns its base form into an alphabetic string**, so an if-branch handles that case too. (So many voice actors list Precure appearances on their profiles that it could not be ignored as a feature, which is why I made sure the tokenization reflects it properly.)

node = title.parse('プリキュア')  # "Precure"; the token must be the Japanese one

node.split(",")[6]
# Output: 'PulCheR'
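In the same spirit, here is a quick way to see the '*' base-form issue mentioned above (a sketch; the exact fields depend on your neologd version):

# Print each token's surface and base form (7th feature field); unregistered
# base forms come back as '*', which is why the training loop below guards
# with len(...) > 1
for row in title.parse('ハイスクール・フリートに出演した').splitlines():
    if row == 'EOS':
        break
    surface, feats = row.split('\t')
    print(surface, feats.split(',')[6])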

Below is the actual source code. First, preprocess the text data.

fi = codecs.open('pwiki.txt')
fo = open('try.csv', 'w')

lines = fi.readlines()
for line in lines:
    line = re.sub('[\n\r]', "", line)
    line = re.sub('[　]', " ", line)  # full-width spaces to half-width, or some compound nouns are not extracted
    line = re.sub('(年|月|日)', "", line)  # strip year/month/day date markers
    line = re.sub('[0-9_]', "", line)
    line = re.sub('[#]', "", line)
    line = re.sub('[!]', "", line)
    line = re.sub('[*]', "", line)
    fo.write(line + '\n')

fi.close()
fo.close()

Next, tokenize the preprocessed text and train word2vec on it. I trained with the parameters below without much tuning. An explanation of how word2vec works is omitted here; please see the following article for the mechanism.

Word2Vec: The amazing power of the word vector that the inventor is surprised at

fi = open('try.csv', 'r')
fo = open('res.csv', 'w')

lines = fi.readlines()
mecab.parse("")  # work around the mecab-python quirk where node.surface is broken unless parse() is called once first

for line in lines:
    node = mecab.parseToNode(line)
    node_org = title.parse(line)

    # Token by token: keep adjectives, nouns and adverbs
    # (MeCab/ipadic emits POS names in Japanese)
    while node:
        hinshi = node.feature.split(",")[0]
        if hinshi in ('形容詞', '名詞', '副詞') or len(node.feature.split(",")[6]) > 1:
            fo.write(node.feature.split(",")[6] + ' ')
        node = node.next

    # Whole-line parse: if the line leads with a proper noun recognized by
    # neologd, also write its base form (once per line, not once per token)
    if node_org != 'EOS\n' and node_org.split(",")[1] == '固有名詞':
        fo.write(node_org.split(",")[6] + ' ')
        # Base forms neologd turned into alphabet (e.g. 'PulCheR' for
        # Precure) are followed by the reading in field 7
        if node_org.split(",")[6].isalpha():
            fo.write(node_org.split(",")[7] + ' ')
    fo.write('\n')  # one output line per input line for LineSentence

fi.close()
fo.close()

print('Wakati phase completed!')

sentences = word2vec.LineSentence('res.csv')
model = word2vec.Word2Vec(sentences,
                          sg=1,         # skip-gram
                          size=200,     # embedding dimensions
                          min_count=5,  # ignore rare tokens
                          window=5,     # context window
                          hs=1,         # hierarchical softmax
                          iter=100,     # training epochs
                          negative=0)   # no negative sampling (hs is used instead)


#Save with pickle
import pickle
with open('mecab_word2vec_seiyu.dump', mode='wb') as f:
    pickle.dump(model, f)
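To reuse the model later, load it back the same way (gensim's own model.save() / Word2Vec.load() would also work):

import pickle

with open('mecab_word2vec_seiyu.dump', mode='rb') as f:
    model = pickle.load(f)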

Trying out the trained word2vec model

Let's play with word similarity using the trained model. As a test, I will try to surface voice actors who do not seem to be marketed as idols.

ret = model.wv.most_similar(negative=['Idol'],topn=1000) 

for item in ret:
    if len(item[0])>2 and (item[0] in list(df.Name)):
        print(item[0],item[1])

Output result

Minako Kotobuki 0.13453319668769836
Shouta Aoi 0.1175239235162735
Aki Toyosaki 0.1002458706498146
Tomoka Tamura 0.08929911255836487
Akane Yamaguchi 0.05830669403076172
Ayaka Mori 0.056574173271656036
Saki Fujita 0.05241834372282028
Yui Kano 0.051871318370103836
Saori Hayami 0.04932212829589844
Mika Kikuchi 0.04044754058122635
Anne Suzuki 0.034879475831985474
Ryota Osaka 0.029612917453050613
Yuka Iguchi 0.02767171896994114
Aoi Yuki 0.02525651454925537
Chieko Higuchi 0.022603293880820274

Next, let's extract voice actors who seem to have a strong idol image.

ret = model.wv.most_similar(positive=['Idol'],topn=300) 

for item in ret:
    if len(item[0])>2 and (item[0] in list(df.Name)):
        print(item[0],item[1])

Output result

Machico 0.1847614347934723
Kaori Fukuhara 0.1714700609445572
Sachika Misawa 0.1615515947341919
Mai Nakahara 0.15694507956504822
Yui Ogura 0.1562490165233612
Shoko Nakagawa 0.1536223590373993
Nao Toyama 0.15278896689414978
Yui Sakakibara 0.14662891626358032
Ai Shimizu 0.14592087268829346
Kaori Ishihara 0.14554426074028015

Next, let's extract the voice actors most associated with awards.

ret = model.wv.most_similar(positive=['Award'],topn=500) 

for item in ret:
    if len(item[0])>2 and (item[0] in list(df.Name)):
        print(item[0],item[1])

Output result


Ibuki Kido 0.19377963244915009
Kaori Fukuhara 0.16889861226081848
Minami Tsuda 0.16868139803409576
Maaya Uchida 0.1677364706993103
Kaori Nazuka 0.1669023633003235
Ai Kayano 0.16403883695602417
Maaya Sakamoto 0.16285887360572815
Yui Makino 0.14633819460868835
Erii Yamazaki 0.1392231583595276
Eri Kitamura 0.13390754163265228
Kana Asumi 0.13131362199783325
Arisa Noto 0.13084736466407776
Ayaka Ohashi 0.1297467052936554
Ryota Osaka 0.12972146272659302

Hmm... this actually feels about right (**by the way, even when I tried "marriage" as the query, names came up lol**).

Summary

This time, I collected voice actors' Wikipedia profiles, trained a word2vec model on that text, and tried the model out. Wikipedia profile text is limited and short on detail, so a serious analysis would need far more data and more careful modeling; please forgive me, this post is just for fun. **Also, because the text data is small, results shift somewhat with the word2vec parameters, so you may not reproduce exactly the results in this post.** The overall trends don't change that dramatically, though.

Next time, we will visualize the trained model as a network and cluster that network to group similar voice actors.
