[PYTHON] Voice actor network analysis (using word2vec and networkx) (2/2)

Rough flow of analysis introduced this time

・ Build and visualize a female voice actor network
・ Cluster and categorize similar voice actors using network information
・ Visualize the categorized results with Cytoscape

Hi, this is bamboo-nova. On Qiita, I mainly post just-for-fun analyses.

If you are interested in my more serious posts and analyses, please see my other blog (my fallback site):

Bamboo shoot blog

I have also put the source code covering the whole series of steps on GitHub, so please take a look if you like!

bamboo-nova/seiyu_network

Last time, as preparation for the voice actor network analysis, we scraped a list of voice actors' names and genders and trained a word2vec model on the Wikipedia text of each voice actor on the list.

Visualize and cluster similar voice actors by network analysis (1/2)

This time, we will build a network of voice actors from the model trained last time and then categorize the voice actors based on it. For the network visualization and clustering, this article only analyzes relationships among female voice actors in their 20s, but the same analysis can be run on all voice actors from their teens to their 30s, so give it a try if you are interested!

Note that the model itself was trained on the profile text of female voice actors from their teens to their 30s; here I start by visualizing and categorizing only the network of female voice actors in their 20s.

Build and visualize female voice actor network

First of all, import the required modules and load the word2vec model trained in the previous article.

#Load the required modules
import numpy as np
import pandas as pd
import pickle
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline

#Load the word2vec model trained in the previous article
with open('mecab_word2vec_seiyu.dump', mode='rb') as f:
    model = pickle.load(f)

#Set the plot style (on matplotlib 3.6 and later this style is named 'seaborn-v0_8')
plt.style.use('seaborn')
#Use a Japanese-capable font so that labels do not render as tofu (empty boxes)
plt.rcParams['font.family'] = 'IPAexGothic'

Countermeasures for the garbled (tofu) characters mentioned above are described at the following link, so please refer to it and set up your fonts accordingly (incidentally, I wrote the answer there).

Please tell me how to avoid graphviz tofu.

Next, load seiyu20.csv obtained by scraping in the previous article and extract the list of names of female voice actors in their 20s.

df = pd.read_csv('seiyu20.csv')
name = df[df.Gender=='Female']
white_list = name.Name
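
As a quick check of the filtering step above, here is a minimal sketch on an in-memory CSV. Only the column names Name and Gender come from the article; the rows and the variable names df_demo and female_names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical rows standing in for seiyu20.csv
csv_text = """Name,Gender
Voice Actor A,Female
Voice Actor B,Male
Voice Actor C,Female
"""

df_demo = pd.read_csv(io.StringIO(csv_text))

# Keep only the Name column of the rows whose Gender is 'Female'
female_names = df_demo.loc[df_demo["Gender"] == "Female", "Name"].tolist()
print(female_names)
```

Selecting the rows and the column in a single .loc call gives the same names as filtering first and then taking the Name column.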

Then, the word vector of word2vec corresponding to each voice actor's name is extracted, and the correlation coefficient of the word vector is calculated for all the combinations of voice actors.

output = []
label = []
for name in white_list:
    try:
        vector = model.wv[name]
        output.append(vector)
        label.append(name)
    except KeyError:
        #Skip names that are not in the model's vocabulary
        continue
res = np.corrcoef(output)
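
To see what the loop and np.corrcoef produce, here is a minimal sketch with a plain dict standing in for model.wv. The dict, its vectors, and the names toy_vectors and toy_white_list are assumptions for illustration, not the trained model.

```python
import numpy as np

# Stand-in for model.wv: name -> word vector (values are made up)
toy_vectors = {
    "A": np.array([1.0, 2.0, 3.0, 4.0]),
    "B": np.array([2.0, 4.0, 6.0, 8.5]),
    "C": np.array([4.0, 3.0, 2.0, 1.0]),
}
toy_white_list = ["A", "B", "C", "D"]  # "D" has no vector, like a name missing from the vocabulary

vectors, labels = [], []
for name in toy_white_list:
    try:
        vectors.append(toy_vectors[name])
    except KeyError:
        continue  # skip names without a vector
    labels.append(name)

# Each row is treated as one variable, so the result is a len(labels) x len(labels) matrix
corr = np.corrcoef(vectors)
print(labels)
print(corr.round(3))
```

The matrix is symmetric with ones on the diagonal, which is why only one triangle of it is needed when building the edge list.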

This completes the correlation matrix between each voice actor. Now, let's finally visualize the voice actor network.

#Prepare edge list to pass to networkx
edge_lists = []
df = pd.DataFrame(res)

edge_name = label
df.index = df.columns = edge_name


#Mask the upper-right triangle (including the diagonal) of the correlation DataFrame and any value below 0.05
tmp_df = df.mask(np.triu(np.ones(df.shape)).astype(bool) | (df < 0.05))
#Generate the edge list (dropna() keeps masked cells out even if stack() retains NaN)
edge_lists = tmp_df.stack().dropna().reset_index().apply(tuple, axis=1).values

G = nx.Graph()
G.add_weighted_edges_from(edge_lists)
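
The masking and stacking above can be checked on a toy correlation matrix. This is only a sketch: the 3x3 values and the names A, B, C are made up, and the 0.05 threshold is the one used in the article.

```python
import networkx as nx
import numpy as np
import pandas as pd

names = ["A", "B", "C"]
corr_df = pd.DataFrame(
    [[1.0, 0.8, 0.01],
     [0.8, 1.0, 0.3],
     [0.01, 0.3, 1.0]],
    index=names, columns=names,
)

# True on the diagonal and above, so each pair is kept only once and self-loops are dropped
upper = np.triu(np.ones(corr_df.shape)).astype(bool)
masked = corr_df.mask((corr_df < 0.05) | upper)

# stack() turns the remaining cells into (row, column, value) records
edges = [tuple(row) for row in masked.stack().dropna().reset_index().values]
print(edges)

G_demo = nx.Graph()
G_demo.add_weighted_edges_from(edges)
```

Only the A-B and B-C pairs survive here: the A-C correlation of 0.01 falls below the threshold, so no edge is created for it.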




#Preparing for drawing
plt.figure(figsize=(8,8))  #Set according to the drawing target

pos = nx.circular_layout(G)

line_width = [d['weight']*10 for u,v,d in G.edges(data=True)]
nx.draw_networkx(G, pos=pos, font_size=10, node_color='gray', width=line_width, font_family='IPAexGothic')
plt.savefig('netres.png')

Output result:

スクリーンショット 2020-02-08 19.32.32.png

The edges are drawn thicker for more similar voice actors (pairs with a higher correlation coefficient). Looking closely, the thick edges are concentrated in particular places, so the structure does not look random and there seems to be some real relationship!

The edges are filtered by a correlation threshold in the line tmp_df = df.mask(np.triu(np.ones(df.shape)).astype(bool) | (df < 0.05)) in the source code above. This threshold was tuned based on the clustering coefficient of the network.

The clustering coefficient (a value between 0 and 1) is computed for each node as (the number of links among the node's neighbors) / (k(k-1)/2), where k is the number of neighbors, and then averaged over all nodes. The higher the clustering coefficient, the denser the network. According to the Wikipedia article on complex networks (複雑ネットワーク), the clustering coefficients observed in real-world networks are said to be roughly 0.1 to 0.7. This time, however, the network has few nodes and the underlying text data is small, so I think it works better to aim for about 0.4 to 0.7. When I actually do network analysis, I often tune the threshold based on the clustering coefficient before producing output.
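
To make the definition concrete, here is a small sketch that computes the coefficient by hand on a toy graph and compares it with networkx. The graph and the helper name local_clustering are made up for illustration, and edge weights are ignored, as they are when nx.average_clustering is called without a weight argument.

```python
from itertools import combinations

import networkx as nx


def local_clustering(G, node):
    """Links among the node's neighbors divided by k(k-1)/2 (0 if k < 2)."""
    neighbors = list(G[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if G.has_edge(u, v))
    return links / (k * (k - 1) / 2)


# Toy graph: a triangle A-B-C with one extra node D attached to A
G_toy = nx.Graph()
G_toy.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"), ("A", "D")])

by_hand = {node: local_clustering(G_toy, node) for node in G_toy.nodes()}
by_hand_average = sum(by_hand.values()) / len(by_hand)

print(by_hand)
print(by_hand_average, nx.average_clustering(G_toy))
```

Node A has three neighbors with one link among them (B-C), so its coefficient is 1/3; B and C score 1, D scores 0, and the average is 7/12.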

After tuning, the clustering coefficient came out as follows.

print(nx.average_clustering(G))
0.610103228782055
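
One way to do this kind of tuning is to sweep a few thresholds and look at how the clustering coefficient changes. The sketch below runs on random vectors, not on the article's data; the helper graph_from_corr, the threshold values, and the toy names are all assumptions for illustration.

```python
import networkx as nx
import numpy as np


def graph_from_corr(corr, names, threshold):
    """Build an undirected weighted graph with one edge per pair at or above threshold."""
    G = nx.Graph()
    # Unlike the article's code, isolated nodes are kept so the graph is never empty
    G.add_nodes_from(names)
    for i in range(len(names)):
        for j in range(i):
            if corr[i, j] >= threshold:
                G.add_edge(names[i], names[j], weight=float(corr[i, j]))
    return G


rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(12, 20))  # 12 made-up "voice actors", 20-dimensional vectors
toy_corr = np.corrcoef(toy_vectors)
toy_names = [f"actor_{i}" for i in range(12)]

sweep = []
for threshold in (0.0, 0.05, 0.1, 0.2, 0.3):
    G_t = graph_from_corr(toy_corr, toy_names, threshold)
    sweep.append((threshold, G_t.number_of_edges(), nx.average_clustering(G_t)))
    print(sweep[-1])
```

Raising the threshold can only remove edges, so the edge count never increases; the clustering coefficient then tells you how dense the remaining network is.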

Now that the network is visualized, let's try PageRank analysis on it. Since the graph is built from text data trained with word2vec, I think what I am doing is closer to the LexRank algorithm than to plain PageRank. As an interpretation, I think female voice actors who are central to the visualized network (indispensable to its structure) receive higher values.

The mechanism of LexRank is described in detail in the following article, so please refer to it if you like.

Python: Summarize Japanese articles with LexRank

pr = nx.pagerank(G)
pos1 = nx.spring_layout(G)


#Visualization
plt.figure(figsize=(30, 30))
nx.draw_networkx_edges(G, pos1)
#font_family only applies to labels, so it is not passed when drawing the nodes
nx.draw_networkx_nodes(G, pos=pos1, node_color=list(pr.values()), cmap=plt.cm.Reds, node_size=[100000*v for v in pr.values()])
nx.draw_networkx_labels(G,pos1,font_size=20, font_family='IPAexGothic')

plt.axis('off')
plt.show()

#Display the top 20 voice actors with the highest PageRank values
score_sorted = sorted(pr.items(), key=lambda x:-x[1])
print(score_sorted[0:20])

Output result:

Top 20 voice actors with high PageRank values
[('Ayaka Ohashi', 0.04555164528689211), 
('Erii Yamazaki', 0.039995796742290854), 
('Kido Ibuki', 0.03944625539215552), 
('Nao Toyama', 0.03748530983574666),
('Yui Ogura', 0.036855264907774486), 
('Shiina Natsukawa', 0.03677562557911518), 
('Asakura Momo', 0.034883881713288184), 
('Sachika Misawa', 0.03461691295075832),
('Azusa Tadokoro', 0.03458330057908542), 
('Maaya Uchida', 0.034217274567621005), 
('Moe Toyota', 0.033018028594089754), 
('Minako Kotobuki', 0.03235768418454441),
('Kaori Ishihara', 0.03065891275211937), 
('Sora Amamiya', 0.030072491383579297), 
('Sumire Uesaka', 0.02920698033366847), 
('Machico', 0.028396698868023), 
('Aimi', 0.02759156041213482), 
('Inori Minase', 0.027163302263303015), 
('Miku Ito', 0.026500204152951186), 
('Haruka Yamazaki', 0.02511585403440591)]
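
For intuition about where these scores come from, here is a hand-rolled power-iteration sketch of PageRank on a tiny weighted graph. It is not the networkx implementation: the function name, the toy weights, and the assumption that every node has at least one edge are mine; the damping factor 0.85 matches the default of nx.pagerank.

```python
import numpy as np


def pagerank_power_iteration(weights, damping=0.85, tol=1e-10, max_iter=1000):
    """PageRank by power iteration; assumes every row of weights has a nonzero sum."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    # Row-normalize so that each node splits its score among neighbors by edge weight
    transition = W / W.sum(axis=1, keepdims=True)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = damping * (transition.T @ rank) + (1.0 - damping) / n
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank


# Toy star graph: "hub" is connected to a, b, c with decreasing similarity
toy_names = ["hub", "a", "b", "c"]
toy_weights = [
    [0.0, 0.9, 0.5, 0.2],
    [0.9, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0],
]

scores = pagerank_power_iteration(toy_weights)
ranking = [toy_names[i] for i in np.argsort(-scores)]
print(dict(zip(toy_names, scores.round(4))))
print(ranking)
```

The hub, which ties the other nodes together, gets the highest score, and among the leaves the one with the strongest link to the hub ranks highest. This mirrors the interpretation that voice actors central to the network receive higher values.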

Try clustering and categorizing similar voice actors using network information

Finally, we will categorize the voice actors. This time, we use a clustering technique called Louvain. Louvain is a modularity-based method: it defines a value (modularity) that represents how tightly the network is bound into clusters and optimizes it. Roughly speaking, it searches for a local optimum of a score that compares the fraction of edges inside clusters with the fraction expected if the edges were placed at random (intuitively: many edges within clusters, few between clusters).
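
As a rough illustration of the score being optimized, the sketch below computes the standard (Newman) modularity of a given partition by hand and compares it with networkx. It only evaluates a partition; it does not run Louvain itself. The toy graph and the helper name modularity_by_hand are made up, and edge weights are ignored to keep the arithmetic easy to follow.

```python
import networkx as nx
from networkx.algorithms.community import modularity


def modularity_by_hand(G, communities):
    """Sum over clusters of (edges inside / m) - (degree sum / 2m)^2, unweighted."""
    m = G.number_of_edges()
    q = 0.0
    for nodes in communities:
        inside = G.subgraph(nodes).number_of_edges()
        degree_sum = sum(degree for _, degree in G.degree(nodes))
        q += inside / m - (degree_sum / (2 * m)) ** 2
    return q


# Toy graph: two triangles joined by a single bridge edge (2-3)
G_toy = nx.Graph()
G_toy.add_edges_from([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

two_clusters = [{0, 1, 2}, {3, 4, 5}]
one_cluster = [{0, 1, 2, 3, 4, 5}]

print(modularity_by_hand(G_toy, two_clusters), modularity(G_toy, two_clusters))
print(modularity_by_hand(G_toy, one_cluster))
```

Splitting the graph into its two triangles gives 5/14 (about 0.357), while lumping everything into one cluster gives 0, so the split is the better partition under this score.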

Network metrics and clustering techniques are outlined in the tutorial on the GitHub wiki of the Matsuo Laboratory at the University of Tokyo, so please refer to it if you like.

[Network Analysis](https://github.com/matsuolab/Tutorial/wiki/%E3%83%8D%E3%83%83%E3%83%88%E3%83%AF%E3%83%BC%E3%82%AF%E5%88%86%E6%9E%90)

Louvain clustering is easy to run once you install the python-louvain package with pip (it is imported as community), so we use this module to cluster the network. I also want to save the clustered result in .graphml format and visualize it neatly with Cytoscape.

First, perform clustering with Louvain.

import community  #provided by the python-louvain package
partition = community.best_partition(G)
size = float(len(set(partition.values())))
pos = nx.spring_layout(G)
count = 0.
for com in set(partition.values()):
    count += 1.
    list_nodes = [nodes for nodes in partition.keys() if partition[nodes] == com]
    #Draw each community in its own shade of gray
    nx.draw_networkx_nodes(G, pos, list_nodes, node_size=20, node_color = str(count/size))

#Reuse the partition computed above (calling best_partition again can give a different result)
#Nodes are keyed by voice actor name, so the label attribute must use the same keys
labels = {node: str(node) for node in G.nodes()}
nx.set_node_attributes(G, labels, 'label')
nx.set_node_attributes(G, partition, 'community')
#nx.write_gml(G, "community.gml") garbles the Japanese characters, so save as GraphML instead
nx.write_graphml(G, "community.graphml", encoding='utf-8')
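
The attribute-and-save step can be tried on a toy graph to confirm that Japanese node names and the community attribute survive a round trip through GraphML. The node names, weights, and partition below are made up; the partition dict has the same shape as the one returned by community.best_partition (node -> community id).

```python
import os
import tempfile

import networkx as nx

G_demo = nx.Graph()
G_demo.add_weighted_edges_from([("声優A", "声優B", 0.8), ("声優B", "声優C", 0.3)])

# Hypothetical partition: node name -> community id
toy_partition = {"声優A": 0, "声優B": 0, "声優C": 1}

# Attribute dicts must be keyed by the graph's own node ids (here, the names)
nx.set_node_attributes(G_demo, toy_partition, "community")
nx.set_node_attributes(G_demo, {node: str(node) for node in G_demo.nodes()}, "label")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.graphml")
    nx.write_graphml(G_demo, path, encoding="utf-8")
    G_loaded = nx.read_graphml(path)

print(G_loaded.nodes(data=True))
print(G_loaded.edges(data=True))
```

The community column that Cytoscape later uses for coloring is exactly this node attribute.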

Before displaying it in cytoscape, let's take a look at the voice actors for each cluster. First, the first cluster is ...

#First cluster
for k,v in partition.items():
    if int(v)==0:
        print(k)
#Output result:
Erii Yamazaki
Kido Ibuki
Ayaka Ohashi
Azusa Tadokoro
Machico
M・A・O

This confirms that the HoriPro voice actors (plus a few others) were neatly grouped together! https://moca-news.net/article/20140226/2014022614590a_/01/

Next, let's look at the second cluster.

#Second cluster
for k,v in partition.items():
    if int(v)==1:
        print(k)
#Output result:
Sumire Morohoshi
Shiina Natsukawa
Akina
Asakura Momo
Sora Amamiya
Aimi
Mikoi Sasaki
Momo Kuraguchi
Ayaka Mori
Sora Tokui

My impression is that this cluster is mainly made up of TrySail and Milky Holmes members. The clustering seems to be working!!
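
Instead of printing one cluster at a time, the whole partition can be grouped in one pass. This is a minimal sketch on a made-up partition dict with the same shape as the result of community.best_partition (node -> community id); the names are placeholders.

```python
from collections import defaultdict

# Hypothetical partition: node name -> community id
toy_partition = {"A": 0, "B": 1, "C": 0, "D": 2, "E": 1}

# Collect the member names of each community
members = defaultdict(list)
for name, community_id in toy_partition.items():
    members[community_id].append(name)

for community_id in sorted(members):
    print(community_id, members[community_id])
```

This makes it easy to see at a glance how many clusters there are and who belongs to each.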

Finally, I would like to use Cytoscape to draw a clean visualization of the categorized voice actor network.

Visualize categorized results with Cytoscape

First, install Cytoscape.

Cytoscape

Cytoscape is a network visualization tool. It was built mainly for bioinformatics, but it is perfectly usable for other purposes as well, and it is very convenient because it gives you a lot of freedom.

After installation, actually open it and go to [File-> Import-> Network-> File] on the toolbar to open the network file you saved earlier.

スクリーンショット 2020-02-08 20.53.06.png

In this state the network is not exactly pretty, so we start by color-coding each cluster. Click the Style tab on the left, then open "Fill Color". Column and Mapping Type will be displayed; select "community" for Column and "Discrete Mapping" for Mapping Type.

スクリーンショット 2020-02-08 20.54.09.png

A screen like the following appears, so specify a color for each community.

スクリーンショット 2020-02-08 20.56.22.png

Once the colors are set, adjust the layout. From Layout -> yFiles Layouts on the toolbar (and the other options there) you can choose from a variety of layouts, so pick the one that suits you best and then drag nodes with the mouse to fine-tune.

By the way, my voice actor network ended up looking like this (I revised it after a comment that the earlier version was hard to read).

fourpath.graphml.png

As a result, when categorized based on their Wikipedia profile text, I think female voice actors in their twenties were roughly divided into the following five clusters.
・ A cluster centered on HoriPro voice actors (Ayaka Ohashi, Azusa Tadokoro, etc.)
・ A cluster centered on TrySail and Milky Holmes (Sora Amamiya, Aimi, etc.)
・ A cluster consisting of the King Records trio and people related to them (Inori Minase, Sumire Uesaka, etc.)
・ A cluster made up of talented voice actors (Saori Hayami, Nao Toyama, etc.)
・ A cluster that seems to have a strong idol flavor? (Aoi Yuki, Miku Ito, etc.)

Summary

This time, based on each voice actor's Wikipedia profile text, we visualized the relationships among female voice actors in their 20s through network analysis. In addition, we clustered and categorized female voice actors in their 20s and visualized the result with Cytoscape. It was my first time actually collecting data and analyzing it as a hobby, so I was a little worried, but I'm glad the results turned out plausible!

I plan to post this kind of just-for-fun analysis on Qiita when I have time, so if you have any requests, please leave a comment!

Also, it would encourage me to keep going, so if you found this interesting, please give it a like lol

Other reference materials

Clustering with python-louvain
Visualize Twitter followers with NetworkX, Cytoscape