[Python] I tried to verify the yin/yang character classification of Hololive members with machine learning

Overview

Let me show the final result first. The "representative" members are those selected as training data for the classifier; the yang-character (extrovert) representatives are framed in red and the yin-character (introvert) representatives in blue (all other members were evaluated as independent data). In short, what I did was **classify each Hololive member as a yin character or a yang character**. See the following sections for details.

結果.png

Why I started this analysis

Are you a yin character, or a yang character?

When I was browsing Hololive clip videos on YouTube during Golden Week, I found something that looked interesting: a digest of a Hololive members' yin/yang character classification that suddenly started in the middle of the night.

Click here for the original video: [[#Holo Midnight Girls' Association] ♡ GW Dubbing Evening Drink Chat Girls' Association ♡ [Kiryu Coco / Sakura Miko / Amane Kanata / Yuzuki Choco]](https://www.youtube.com/watch?v=HytCW6Yi8IM)

9a82f8e7-s.jpg

This was apparently done during the "Hololive Midnight Girls' Association" stream, where the members were sorted into yin and yang characters as shown above. Watching it, a thought suddenly struck me.

Could this classification be reproduced with machine learning?

So I actually tried to **verify the yin/yang character classification of Hololive members**.

Data used

**The data used is the text of the members' tweets.** The tweet text was collected with Tweepy. Based on the classification actually performed in the stream, I selected yin-character and yang-character representatives as follows and collected their tweet data. **This analysis takes the yin/yang classification shown in the stream as ground truth.**

- **Yin character representatives:** Usada Pekora, Sakura Miko, Amane Kanata, Minato Aqua, Inugami Korone
- **Yang character representatives:** Yuzuki Choco, Oozora Subaru, Tokino Sora, Shirakami Fubuki, AZKi, Ookami Mio, Friend A

I collected up to 8,000 of the most recent tweets per member (for members with fewer tweets, I took everything available). Retweets and replies were excluded from collection, so **only the members' own tweets** are used.
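The article doesn't show its collection code, but the step above can be sketched with Tweepy roughly as follows (the credential strings are placeholders, and `fetch_user_tweets` / `is_original_tweet` are my own helper names, not from the original):

```python
def is_original_tweet(status):
    """Keep only the member's own tweets: drop retweets and replies."""
    if getattr(status, "retweeted_status", None) is not None:
        return False  # retweet
    if status.in_reply_to_status_id is not None:
        return False  # reply
    return True

def fetch_user_tweets(screen_name, max_tweets=8000):
    """Collect up to max_tweets recent tweets of one member with Tweepy.

    The four credential strings are placeholders -- supply your own keys.
    """
    import tweepy  # deferred so is_original_tweet is usable without Tweepy installed
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)
    texts = []
    for status in tweepy.Cursor(api.user_timeline,
                                screen_name=screen_name,
                                count=200,
                                tweet_mode="extended").items(max_tweets):
        if is_original_tweet(status):
            texts.append(status.full_text)
    return texts
```

With `tweet_mode="extended"`, retweets carry a `retweeted_status` attribute and the untruncated text lives in `full_text`, which is why the filter checks that attribute.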

The directory structure is built as follows, and everything runs on Google Colab.

図1.png

Analysis method

Preprocessing

Are these really idols... really young women? **The tweet data was full of kaomoji, emoji, and URLs** (I screamed when I first saw the data). So in preprocessing I removed as much of this as possible, then tokenized with MeCab (unfortunately, some noise remains). Also, **training and test data were randomly split at a ratio of 8:2**.
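A rough sketch of this preprocessing (the regex patterns and the 8:2 splitter are my own minimal versions, not the article's code; kaomoji are hard to catch with a simple pattern, which fits the author's remark that some noise remains):

```python
import re
import random

URL_RE = re.compile(r"https?://\S+")
# rough emoji/pictograph ranges -- kaomoji and some symbols will slip through
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")

def clean_tweet(text):
    """Strip URLs, emoji, and extra whitespace before tokenization."""
    text = URL_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Word segmentation (wakati-gaki) with MeCab."""
    import MeCab  # deferred so clean_tweet stays usable without MeCab installed
    tagger = MeCab.Tagger("-Owakati")
    return tagger.parse(text).strip().split()

def split_8_2(samples, seed=0):
    """Random 8:2 train/test split, as described in the article."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    k = int(len(shuffled) * 0.8)
    return shuffled[:k], shuffled[k:]
```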

Model

The model is an LSTM, implemented in PyTorch with support for GPUs and mini-batching.

The vocabulary obtained by MeCab tokenization contained 17,462 words. If a sentence in the validation data contains words outside this vocabulary, an error occurs.
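The article doesn't publish its model code; a minimal PyTorch sketch of such a sentence classifier might look like the following, assuming embedding → single-layer LSTM → linear head (layer sizes are my guesses). Mapping unseen words to an `<unk>` id, as below, is one common way to avoid the out-of-vocabulary error the author mentions:

```python
import torch
import torch.nn as nn

UNK, PAD = "<unk>", "<pad>"

def build_vocab(tokenized_texts):
    """Map each surface form to an id; id 0 = padding, id 1 = unknown."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in tokenized_texts:
        for t in tokens:
            vocab.setdefault(t, len(vocab))
    return vocab

def encode(tokens, vocab):
    """Unseen words fall back to <unk> instead of raising KeyError."""
    return [vocab.get(t, vocab[UNK]) for t in tokens]

class YinYangLSTM(nn.Module):
    """Binary sentence classifier: embedding -> LSTM -> linear head."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)  # class 0 = yin, class 1 = yang

    def forward(self, x):           # x: (batch, seq_len) of token ids
        emb = self.embed(x)
        _, (h, _) = self.lstm(emb)  # h: (num_layers, batch, hidden_dim)
        return self.fc(h[-1])       # logits: (batch, 2)
```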

Training results

After training for 100 epochs, prediction accuracy on the test data was 76%. Given that the input is only Twitter text, that feels fairly high.
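The training loop itself isn't shown in the article; a generic PyTorch loop consistent with the description (100 epochs, accuracy measured on the held-out 20%) might look like this (optimizer, learning rate, and loss function are my assumptions):

```python
import torch
import torch.nn as nn

def train_and_evaluate(model, train_batches, test_batches, epochs=100, lr=1e-3):
    """Train with CrossEntropyLoss + Adam, then return test-set accuracy."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for x, y in train_batches:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # the per-epoch loss plotted in the article would be logged here
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in test_batches:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total
```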

The loss per epoch is shown below.

スクリーンショット 2020-05-12 17.02.13.png

Validate with independent data

Are the other members yin characters or yang characters?

スクリーンショット 2020-05-04 21.33.16.png

Earlier, I explicitly set the yang- and yin-character representatives according to the stream and trained on their tweets. Next, the tweets of the remaining Hololive members are treated as independent data and classified as yin or yang characters. I collected 8,000 tweets per member by the same method and ran them through the classifier to see what happens.

Here is the actual classification procedure. Each member's tweet data is split into sentences at line breaks, and **the classifier, trained to label each sentence as a yin- or yang-character remark, is applied to every sentence**. Then, over all of a member's sentences, the following index is computed to measure how yang-leaning their remarks are:

Classification index = (number of yang-character remarks) / (number of yang-character remarks + number of yin-character remarks)

Note that for some sentences the classifier returns no result, i.e. sentences consisting only of words not seen during training; this happens for every member (still, a classification result is returned for more than half of each member's sentences).
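Assuming per-sentence predictions come back as `"yang"`, `"yin"`, or `None` (when the classifier returns no result), the index above and the 0.5 threshold used for the final verdict can be computed like this (function names are my own):

```python
def yin_yang_index(sentence_labels):
    """
    sentence_labels: per-sentence predictions, "yang", "yin", or None
    (None = the classifier returned no result, e.g. all-unknown words).
    Returns (index, n_classified, n_yang), where
    index = n_yang / (n_yang + n_yin).
    """
    n_yang = sum(1 for label in sentence_labels if label == "yang")
    n_yin = sum(1 for label in sentence_labels if label == "yin")
    classified = n_yang + n_yin
    index = n_yang / classified if classified else 0.0
    return index, classified, n_yang

def verdict(index):
    """Index of 0.5 or greater -> yang character, otherwise yin character."""
    return "yang" if index >= 0.5 else "yin"
```

For example, Tokino Sora's row in the results corresponds to 2146 yang remarks out of 3046 classified sentences, giving 2146 / 3046 ≈ 0.7045 and a yang-character verdict.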

Final result

The results of verification on the independent data are as follows.

結果.png

Below are the actual per-member results, including the evaluation values. A member is judged a yang character if the index value is 0.5 or greater, and a yin character otherwise.

#------------------------------
# Example of the output format
# <member name> is a <yang/yin> character
# (classification index) (sentences the classifier returned a result for) (yang-character remarks) (total sentences after splitting at line breaks)
#------------------------------

Tokino Sora is a yang character
0.7045305318450427 3046 2146 4870
----------------------------
Hoshimachi Suisei is a yin character
0.4129251700680272 2940 1214 4634
----------------------------
Yozora Mel is a yang character
0.5901213171577123 1154 681 1844
----------------------------
Shirakami Fubuki is a yang character
0.5638173302107728 1708 963 3570
----------------------------
Natsuiro Matsuri is a yang character
0.5016304347826087 1840 923 2562
----------------------------
Himemori Luna is a yin character
0.36826524570751923 1689 622 2306
----------------------------
Shirogane Noel is a yin character
0.42934293429342935 3333 1431 4976
----------------------------
Aki Rosenthal is a yin character
0.470281124497992 2490 1171 4158
----------------------------
AZKi is a yang character
0.862909090909091 2750 2373 2821
----------------------------
Shiranui Flare is a yang character
0.5693251533742332 1630 928 2525
----------------------------
Roboco-san is a yang character
0.5026868588177821 2047 1029 3153
----------------------------
Nekomata Okayu is a yin character
0.41079199303742386 2298 944 3219
----------------------------
Kiryu Coco is a yang character
0.5164619164619164 2035 1051 2676
----------------------------
Tokoyami Towa is a yin character
0.41897720271102895 1623 680 2307
----------------------------
Akai Haato is a yang character
0.542777970211292 2887 1567 4144
----------------------------
Murasaki Shion is a yin character
0.3823224468636599 3858 1475 4662
----------------------------
Nakiri Ayame is a yang character
0.6027054108216433 1996 1203 2961
----------------------------
Houshou Marine is a yin character
0.40594059405940597 1515 615 2230
----------------------------
Uruha Rushia is a yin character
0.4146341463414634 861 357 1421
----------------------------

This time, I classified Hololive members as yin or yang characters from their tweet data. About half of the members came out roughly matching the result from the stream, but 7 or 8 were classified differently. Possible causes and speculations:

- **The yin/yang impression may depend more on collaboration frequency and friendships** (for example the 3rd generation; personally I think Captain Marine is a yang character)
- **Newer members simply have fewer tweets** (in particular, the 4th generation members only have about 2,000-3,000 tweets in total including RTs, so data volume affects them more than the other members)
- **Even if yin/yang has an effect, there are very likely other confounding variables** (otaku tendencies, genres of the games they stream, activity hours, etc.)
- **Maybe the yin-character representatives just have a high rate of yin-character remarks to begin with...?** (no comment)

Items considered for future verification:

- **Introduce additional labels, such as "otaku / non-otaku"**
- **Collect listener tweets via hashtags, and also incorporate listener chat from YouTube as training data**
- **Try again once the members have accumulated more tweets**
- Do a proper performance evaluation... (laziness)

Summary

This time, referring to the yin/yang character classification that the Hololive members themselves performed during a stream, I verified the classification using the members' tweet data.

In my previous just-for-fun analysis I visualized a network of voice actors; I intend to keep posting this kind of analysis, so stay tuned.

Click here for the previous analysis: Voice actor network analysis (using word2vec and networkx) (1/2), Voice actor network analysis (using word2vec and networkx) (2/2)
