[Python] I tried to verify the yin/yang character classification of Hololive members with machine learning

Overview

Let me show the final result first. The "representative" members are those selected as training data for the classifier; the yang-character (extrovert) representatives are framed in red and the yin-character (introvert) representatives in blue (all other members were evaluated as independent data). In short, what I did was **classify each Hololive member as a yin character or a yang character**. See the following sections for details.

結果.png

Why I started this analysis

Are you a yin character, or a yang character?

When I was browsing Hololive clip videos on YouTube during Golden Week, I found something that looked interesting: a digest of a Hololive members' yin/yang character classification that suddenly started in the middle of the night.

Click here for the original video: [[#Holo Midnight Girls' Association] ♡ GW Dubbing Evening Drink Chat Girls' Association ♡ [Kiryu Coco / Sakura Miko / Amane Kanata / Yuzuki Choco]](https://www.youtube.com/watch?v=HytCW6Yi8IM)

9a82f8e7-s.jpg

This was apparently done during the "Hololive Midnight Girls' Association" stream, where the members were sorted into yin and yang characters as shown above. Watching it, a thought suddenly struck me.

Could this classification be reproduced with machine learning?

So I actually tried to **verify the yin/yang character classification of Hololive members**.

Data used

**The data used is the text of the members' tweets.** The tweet text was collected with Tweepy. Based on the classification actually performed in the stream, I selected yin-character and yang-character representatives as follows and collected their tweet data. **This analysis takes the yin/yang classification shown in the stream as ground truth.**

- **Yin character representatives:** Usada Pekora, Sakura Miko, Amane Kanata, Minato Aqua, Inugami Korone
- **Yang character representatives:** Yuzuki Choco, Oozora Subaru, Tokino Sora, Shirakami Fubuki, AZKi, Ookami Mio, Friend A

I collected up to 8,000 of the most recent tweets per member (for members with fewer tweets, I took everything available). Retweets and replies were excluded from collection, so **only the members' own tweets** are used.
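The article doesn't show its collection code, but the step above can be sketched with Tweepy roughly as follows (the credential strings are placeholders, and `fetch_user_tweets` / `is_original_tweet` are my own helper names, not from the original):

```python
def is_original_tweet(status):
    """Keep only the member's own tweets: drop retweets and replies."""
    if getattr(status, "retweeted_status", None) is not None:
        return False  # retweet
    if status.in_reply_to_status_id is not None:
        return False  # reply
    return True

def fetch_user_tweets(screen_name, max_tweets=8000):
    """Collect up to max_tweets recent tweets of one member with Tweepy.

    The four credential strings are placeholders -- supply your own keys.
    """
    import tweepy  # deferred so is_original_tweet is usable without Tweepy installed
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)
    texts = []
    for status in tweepy.Cursor(api.user_timeline,
                                screen_name=screen_name,
                                count=200,
                                tweet_mode="extended").items(max_tweets):
        if is_original_tweet(status):
            texts.append(status.full_text)
    return texts
```

With `tweet_mode="extended"`, retweets carry a `retweeted_status` attribute and the untruncated text lives in `full_text`, which is why the filter checks that attribute.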

The directory structure is built as follows, and everything runs on Google Colab.

図1.png

Analysis method

Preprocessing

Are these really idols... really young women? **The tweet data was full of kaomoji, emoji, and URLs** (I screamed when I first saw the data). So in preprocessing I removed as much of this as possible, then tokenized with MeCab (unfortunately, some noise remains). Also, **training and test data were randomly split at a ratio of 8:2**.
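A rough sketch of this preprocessing (the regex patterns and the 8:2 splitter are my own minimal versions, not the article's code; kaomoji are hard to catch with a simple pattern, which fits the author's remark that some noise remains):

```python
import re
import random

URL_RE = re.compile(r"https?://\S+")
# rough emoji/pictograph ranges -- kaomoji and some symbols will slip through
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")

def clean_tweet(text):
    """Strip URLs, emoji, and extra whitespace before tokenization."""
    text = URL_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Word segmentation (wakati-gaki) with MeCab."""
    import MeCab  # deferred so clean_tweet stays usable without MeCab installed
    tagger = MeCab.Tagger("-Owakati")
    return tagger.parse(text).strip().split()

def split_8_2(samples, seed=0):
    """Random 8:2 train/test split, as described in the article."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    k = int(len(shuffled) * 0.8)
    return shuffled[:k], shuffled[k:]
```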

Model

The model is an LSTM, implemented in PyTorch with support for GPUs and mini-batching.

The vocabulary obtained by MeCab tokenization contained 17,462 words. If a sentence in the validation data contains words outside this vocabulary, an error occurs.
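The article doesn't publish its model code; a minimal PyTorch sketch of such a sentence classifier might look like the following, assuming embedding → single-layer LSTM → linear head (layer sizes are my guesses). Mapping unseen words to an `<unk>` id, as below, is one common way to avoid the out-of-vocabulary error the author mentions:

```python
import torch
import torch.nn as nn

UNK, PAD = "<unk>", "<pad>"

def build_vocab(tokenized_texts):
    """Map each surface form to an id; id 0 = padding, id 1 = unknown."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in tokenized_texts:
        for t in tokens:
            vocab.setdefault(t, len(vocab))
    return vocab

def encode(tokens, vocab):
    """Unseen words fall back to <unk> instead of raising KeyError."""
    return [vocab.get(t, vocab[UNK]) for t in tokens]

class YinYangLSTM(nn.Module):
    """Binary sentence classifier: embedding -> LSTM -> linear head."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)  # class 0 = yin, class 1 = yang

    def forward(self, x):           # x: (batch, seq_len) of token ids
        emb = self.embed(x)
        _, (h, _) = self.lstm(emb)  # h: (num_layers, batch, hidden_dim)
        return self.fc(h[-1])       # logits: (batch, 2)
```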

Training results

After training for 100 epochs, prediction accuracy on the test data was 76%. Given that the input is only Twitter text, that feels fairly high.
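The training loop itself isn't shown in the article; a generic PyTorch loop consistent with the description (100 epochs, accuracy measured on the held-out 20%) might look like this (optimizer, learning rate, and loss function are my assumptions):

```python
import torch
import torch.nn as nn

def train_and_evaluate(model, train_batches, test_batches, epochs=100, lr=1e-3):
    """Train with CrossEntropyLoss + Adam, then return test-set accuracy."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for x, y in train_batches:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # the per-epoch loss plotted in the article would be logged here
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in test_batches:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total
```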

The loss per epoch is shown below.

スクリーンショット 2020-05-12 17.02.13.png

Validate with independent data

Are the other members yin characters or yang characters?

スクリーンショット 2020-05-04 21.33.16.png

Earlier, I explicitly set the yang- and yin-character representatives according to the stream and trained on their tweets. Next, the tweets of the remaining Hololive members are treated as independent data and classified as yin or yang characters. I collected 8,000 tweets per member by the same method and ran them through the classifier to see what happens.

Here is the actual classification procedure. Each member's tweet data is split into sentences at line breaks, and **the classifier, trained to label each sentence as a yin- or yang-character remark, is applied to every sentence**. Then, over all of a member's sentences, the following index is computed to measure how yang-leaning their remarks are:

Classification index = (number of yang-character remarks) / (number of yang-character remarks + number of yin-character remarks)

Note that for some sentences the classifier returns no result, i.e. sentences consisting only of words not seen during training; this happens for every member (still, a classification result is returned for more than half of each member's sentences).
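Assuming per-sentence predictions come back as `"yang"`, `"yin"`, or `None` (when the classifier returns no result), the index above and the 0.5 threshold used for the final verdict can be computed like this (function names are my own):

```python
def yin_yang_index(sentence_labels):
    """
    sentence_labels: per-sentence predictions, "yang", "yin", or None
    (None = the classifier returned no result, e.g. all-unknown words).
    Returns (index, n_classified, n_yang), where
    index = n_yang / (n_yang + n_yin).
    """
    n_yang = sum(1 for label in sentence_labels if label == "yang")
    n_yin = sum(1 for label in sentence_labels if label == "yin")
    classified = n_yang + n_yin
    index = n_yang / classified if classified else 0.0
    return index, classified, n_yang

def verdict(index):
    """Index of 0.5 or greater -> yang character, otherwise yin character."""
    return "yang" if index >= 0.5 else "yin"
```

For example, Tokino Sora's row in the results corresponds to 2146 yang remarks out of 3046 classified sentences, giving 2146 / 3046 ≈ 0.7045 and a yang-character verdict.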

Final result

The results of verification on the independent data are as follows.

結果.png

Below are the actual per-member results, including the evaluation values. A member is judged a yang character if the index value is 0.5 or greater, and a yin character otherwise.

#------------------------------
# Example of the output format
# <member name> is a <yang/yin> character
# (classification index) (sentences the classifier returned a result for) (yang-character remarks) (total sentences after splitting at line breaks)
#------------------------------

Tokino Sora is a yang character
0.7045305318450427 3046 2146 4870
----------------------------
Hoshimachi Suisei is a yin character
0.4129251700680272 2940 1214 4634
----------------------------
Yozora Mel is a yang character
0.5901213171577123 1154 681 1844
----------------------------
Shirakami Fubuki is a yang character
0.5638173302107728 1708 963 3570
----------------------------
Natsuiro Matsuri is a yang character
0.5016304347826087 1840 923 2562
----------------------------
Himemori Luna is a yin character
0.36826524570751923 1689 622 2306
----------------------------
Shirogane Noel is a yin character
0.42934293429342935 3333 1431 4976
----------------------------
Aki Rosenthal is a yin character
0.470281124497992 2490 1171 4158
----------------------------
AZKi is a yang character
0.862909090909091 2750 2373 2821
----------------------------
Shiranui Flare is a yang character
0.5693251533742332 1630 928 2525
----------------------------
Roboco-san is a yang character
0.5026868588177821 2047 1029 3153
----------------------------
Nekomata Okayu is a yin character
0.41079199303742386 2298 944 3219
----------------------------
Kiryu Coco is a yang character
0.5164619164619164 2035 1051 2676
----------------------------
Tokoyami Towa is a yin character
0.41897720271102895 1623 680 2307
----------------------------
Akai Haato is a yang character
0.542777970211292 2887 1567 4144
----------------------------
Murasaki Shion is a yin character
0.3823224468636599 3858 1475 4662
----------------------------
Nakiri Ayame is a yang character
0.6027054108216433 1996 1203 2961
----------------------------
Houshou Marine is a yin character
0.40594059405940597 1515 615 2230
----------------------------
Uruha Rushia is a yin character
0.4146341463414634 861 357 1421
----------------------------

This time, I classified Hololive members as yin or yang characters from their tweet data. About half of the members came out roughly matching the result from the stream, but 7 or 8 were classified differently. Possible causes and speculations:

- **The yin/yang impression may depend more on collaboration frequency and friendships** (for example the 3rd generation; personally I think Captain Marine is a yang character)
- **Newer members simply have fewer tweets** (in particular, the 4th generation members only have about 2,000-3,000 tweets in total including RTs, so data volume affects them more than the other members)
- **Even if yin/yang has an effect, there are very likely other confounding variables** (otaku tendencies, genres of the games they stream, activity hours, etc.)
- **Maybe the yin-character representatives just have a high rate of yin-character remarks to begin with...?** (no comment)

Items considered for future verification:

- **Introduce additional labels, such as "otaku / non-otaku"**
- **Collect listener tweets via hashtags, and also incorporate listener chat from YouTube as training data**
- **Try again once the members have accumulated more tweets**
- Do a proper performance evaluation... (laziness)

Summary

This time, referring to the yin/yang character classification that the Hololive members themselves performed during a stream, I verified the classification using the members' tweet data.

In my previous just-for-fun analysis I visualized a network of voice actors; I intend to keep posting this kind of analysis, so stay tuned.

Click here for the previous analysis: Voice actor network analysis (using word2vec and networkx) (1/2), Voice actor network analysis (using word2vec and networkx) (2/2)
