[PYTHON] I tried to visualize viewer overlap between VTuber channels

Summary

- I built a network that connects VTuber channels.
- The weight of each edge is how much one channel's viewers overlap with another channel's.
- The agencies considered are Nijisanji, hololive, 774 inc., upd8, Nori Pro, and independents. (The selection of independents is 100% my own hobbies and tastes...)
- This time I only visualized the data; I haven't analyzed it. There is still a lot I want to do, so I will get to it once Power Pros calms down.
- July 12th is the 3D debut of hololive's Watame Tsunomaki. Let's watch it.

Of the edges connecting streamers, the network below shows only those whose weight is in the top 10%. It shows that viewers overlap heavily among streamers belonging to the same agency, such as Nijisanji or hololive.
graph_author_union_(90, 0).png

Motivation

The VTuber industry currently has many agencies, such as Nijisanji, hololive, 774 inc., Nori Pro, and upd8, and each agency has many streamers. There are also independent streamers who belong to no agency. A streamer typically posts a video of about an hour once every one to a few days. Summed over a single agency, the video time posted per day can easily exceed 24 hours, so it is practically impossible to watch every video from multiple agencies. Below is my subscription list on one particular day; watching all of it is hard...
ある日の登録チャンネル欄.png
Many people are probably in the same situation, so each viewer picks videos and channels according to their own tastes. I thought it would be fun to visualize which channels tend to be watched together.

Method

Many VTubers stream live, and viewers can post comments during the stream. For example, if you play a video (https://www.youtube.com/watch?v=Ypc_xKz--fY) by hololive's Luna Himemori (https://www.youtube.com/channel/UCa9Y57gfeY0Zro_noHRVrnw), the comments posted during the live stream appear at the side, as shown below.
YouTube画面.png
This comment log contains the comment time, viewer name, Super Chat information, and more. This time, I use the viewer names to evaluate the relationships between channels. Let $U_i$ be the set of viewers who commented on channel $i$, and define the viewer overlap $w_{ij}$ between channels $i$ and $j$ as the share of viewers the two channels have in common:
$$ w_{ij} := \frac{|U_i \cap U_j|}{|U_i \cup U_j|} $$

This is used as an edge weight when visualizing as a network.
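
As a quick check of the definition, here is a minimal sketch with two made-up viewer sets. The function name `viewer_overlap` and the viewer names are hypothetical and not part of the original code; the ratio itself is the one commonly called the Jaccard index.

```python
def viewer_overlap(u_i, u_j):
    """Return |U_i & U_j| / |U_i | U_j| for two sets of viewer names."""
    union = u_i | u_j
    if not union:
        # Assumption: two channels without any commenters get weight 0
        return 0.0
    return len(u_i & u_j) / len(union)


# Hypothetical viewers of two channels (made-up names)
channel_a = {"alice", "bob", "carol"}
channel_b = {"bob", "carol", "dave", "erin"}

# 2 shared viewers out of 5 distinct viewers
w_ab = viewer_overlap(channel_a, channel_b)
print(w_ab)
```

The weight is symmetric, 1 when both channels have exactly the same commenters, and 0 when they share none.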

Calculation and visualization of interchannel similarity

The data acquisition period is from January 1, 2020 to June 30, 2020. Note that I have not verified whether the comments were retrieved correctly for every video that has comments.

Get comment data

There is a data acquisition method in the following article, so I will use it almost as it is.

As an example, I save each video in the following format.

AuthorName           BaseDate    ChannelId                 Timestamp            VideoId      VideoLength
Moss Max             2020-05-07  UC--A2dwZW7-M2kID0N6_lfA  2020-05-07 19:53:23  -Alnw7B1GBo  2953
Chocolate cornet     2020-05-07  UC--A2dwZW7-M2kID0N6_lfA  2020-05-07 19:54:58  -Alnw7B1GBo  2953
Black dog            2020-05-07  UC--A2dwZW7-M2kID0N6_lfA  2020-05-07 19:55:08  -Alnw7B1GBo  2953
Oguna                2020-05-07  UC--A2dwZW7-M2kID0N6_lfA  2020-05-07 19:55:56  -Alnw7B1GBo  2953
High tension friday  2020-05-07  UC--A2dwZW7-M2kID0N6_lfA  2020-05-07 19:56:05  -Alnw7B1GBo  2953

Only VideoLength is a rough value. It is not used this time.

Creating a dataset

First, create a list of the viewers who commented on each channel during the data period. This can be done by merging all the comment lists obtained above. The code below also counts the number of comments, but that is only for the convenience of another task and is not needed here.

import pandas as pd
# comment_paths: paths to the per-video comment pickles saved above
df = pd.concat([pd.read_pickle(path) for path in comment_paths])
counts = df.groupby(['AuthorName', 'ChannelId', 'VideoId', 'BaseDate']).size().to_frame('Count').reset_index()

The format is as follows.

AuthorName                ChannelId                 VideoId      BaseDate             Count
chro nicle                UCwrjITPwG4q71HzihV2C7Nw  H7wgvBbxo1U  2020-06-30T00:00:00  1
Fazias                    UChAnqc_AY5_I3Px5dig3X1Q  Q7DS6uaInMA  2020-06-30T00:00:00  26
Dream eating              UCuvk5PilcvDECU7dDZhQiEw  6uiQOEDmD6U  2020-06-30T00:00:00  91
Fatin Thifal              UCOmjciHZ8Au3iKMElKXCF_g  ZrFJpafDKVw  2020-06-30T00:00:00  3
Snail state of the futon  UC6oDys1BGgBsIC3WhG1BovQ  QHTLzahEiX4  2020-06-30T00:00:00  1
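
To make the shape of this table concrete, the following is a small sketch on made-up comments. The names `comments` and `viewers_per_channel` and all values are hypothetical; only the `groupby(...).size()` step mirrors the code above.

```python
import pandas as pd

# Hypothetical comment log: one row per comment (made-up values)
comments = pd.DataFrame({
    'AuthorName': ['viewer1', 'viewer1', 'viewer2', 'viewer1'],
    'ChannelId':  ['chA', 'chA', 'chA', 'chB'],
    'VideoId':    ['v1', 'v1', 'v1', 'v2'],
    'BaseDate':   ['2020-05-07'] * 4,
})

# Same aggregation as above: number of comments per viewer, channel, video and date
counts = (comments
          .groupby(['AuthorName', 'ChannelId', 'VideoId', 'BaseDate'])
          .size().to_frame('Count').reset_index())

# For the overlap only the set of viewers per channel matters, not the counts
viewers_per_channel = counts.groupby('ChannelId')['AuthorName'].apply(set).to_dict()
print(counts)
print(viewers_per_channel)
```

Here viewer1 commented twice on the same video, which becomes a single row with Count 2, and still counts as one viewer of chA.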

Adjacency matrix

As mentioned earlier, this time I calculate the viewer overlap between channels. This is easy to obtain by building the set of users for each channel and applying set operations.

def corr_by_author_set_union(counts, channels):
    """Return a channel-by-channel DataFrame of viewer overlap w_ij."""
    corr = pd.DataFrame().assign(Channel=channels).set_index('Channel')
    # One row per (channel, viewer) pair, regardless of how often they commented
    tmp = counts.loc[:, ['ChannelId', 'AuthorName']].drop_duplicates()
    channelId_to_set = {ch: set(tmp[tmp.ChannelId == ch].AuthorName) for ch in channels}
    for ch1 in channels:
        # size of the intersection divided by size of the union, for every ch2
        corr[ch1] = [len(channelId_to_set[ch1] & channelId_to_set[ch2])
                     / len(channelId_to_set[ch1] | channelId_to_set[ch2])
                     for ch2 in channels]
    return corr
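
The pairwise loop above can also be written as a single matrix product. The following is only a sketch of that alternative, not the code used for the figures: `overlap_matrix` is a hypothetical name, the toy data is made up, and unlike the loop it returns 0 instead of raising an error when two channels both have no commenters.

```python
import numpy as np
import pandas as pd


def overlap_matrix(counts, channels):
    """Viewer overlap between channels, computed with one matrix product."""
    pairs = counts[['ChannelId', 'AuthorName']].drop_duplicates()
    # membership[c, a] is 1 if author a commented on channel c, else 0
    membership = (pd.crosstab(pairs['ChannelId'], pairs['AuthorName'])
                  .reindex(channels, fill_value=0)
                  .to_numpy())
    inter = membership @ membership.T                 # size of each intersection
    sizes = membership.sum(axis=1)                    # number of viewers per channel
    union = sizes[:, None] + sizes[None, :] - inter   # size of each union
    with np.errstate(divide='ignore', invalid='ignore'):
        weights = np.where(union > 0, inter / union, 0.0)
    return pd.DataFrame(weights, index=channels, columns=channels)


# Made-up data: X = {a, b}, Y = {b, c}, Z = {c}
toy_counts = pd.DataFrame({
    'AuthorName': ['a', 'b', 'b', 'c', 'c', 'c'],
    'ChannelId':  ['X', 'X', 'Y', 'Y', 'Z', 'Z'],
})
toy_corr = overlap_matrix(toy_counts, ['X', 'Y', 'Z'])
print(toy_corr)
```

The dense membership matrix has one column per distinct viewer, so with a very large number of viewers a sparse matrix or the set-based loop may be the more practical choice.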

Graph depiction

Now, let's draw the graph. The code is almost the same as on the following site.
- I tried to visualize the national surname network: https://datumstudio.jp/blog/networkx

import itertools

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np


def create_graph(df, threshold=0.5, is_directed=True):
    assert set(df.index) == set(df.columns)

    #Create a graph
    if is_directed:
        graph = nx.DiGraph()
    else:
        graph = nx.Graph()

    #Add node
    for col in df.columns:
        if not graph.has_node(col):
            graph.add_node(col)

    #Add edge
    for a, b in itertools.combinations(df.columns, 2):
        if a == b or graph.has_edge(a, b):
            continue
        val = df.loc[a, b]
        if abs(val) < threshold:
            continue
        graph.add_edge(a, b, weight=val)

    return graph

def draw_char_graph(G, fname, edge_cmap=plt.cm.Greys, figsize=(16, 8)):
    # channel_to_color is assumed to be a dict defined elsewhere (node name -> color)
    plt.figure(figsize=figsize)
    weights = [G[u][v]['weight'] for u, v in G.edges()]
    pos = nx.spring_layout(G, k=16)

    # Shift the nodes of each color toward that color's own point on a circle
    nodes = pos.keys()
    colors = list(set([channel_to_color[n] for n in nodes]))
    color_to_id = {colors[i]: i for i in range(len(colors))}
    angs = np.linspace(0, 2*np.pi, 1+len(colors))
    repos = []
    rad = 3.5
    for ea in angs:
        repos.append(np.array([rad*np.cos(ea), rad*np.sin(ea)]))
    for ea in pos.keys():
        posx = color_to_id[channel_to_color[ea]]
        pos[ea] += repos[posx]

    nx.draw(G,
            pos, 
            node_color=[channel_to_color[n] for n in G.nodes()],
            edge_cmap=edge_cmap,
            edge_vmin=-3e4,
            width=weights,
            with_labels=True,
            font_family='Yu Gothic',
            font_size=8,
            font_color='green')
    plt.savefig(fname, dpi=128)
    plt.show()

Create and draw a graph using these.

Whole network

A thicker line means a higher share of common viewers.

union_corr = corr_by_author_set_union(counts, channels)
# ChannelIds are hard to read, so rename them to channel names
# (rename_ChannelId_to_ChannelName is a helper that is not shown in this article)
union_corr = rename_ChannelId_to_ChannelName(union_corr)
graph = create_graph(union_corr, threshold=0, is_directed=False)
draw_char_graph(graph, 'fig/graph_author_union.png', figsize=(16, 16))

- Overall, the edges point toward Nijisanji: a high share of viewers watch Nijisanji together with another agency.
- Shigure Ui and Tamaki Inuyama have strong edges not only toward Nijisanji but also toward hololive.

graph_author_union.png

Whole network (only the top 10% of edge weights)

The previous graph shows too many edges, so I reduce their number. The 10% cutoff was chosen by feel. Since only the top 10% is plotted, a line drawn here can be read as a very high viewer overlap between the two channels.

# Each pair is (edge-weight percentile, betweenness_centrality); only the percentile is used here
pairs = [(90, 0)]
df = union_corr.copy()
for pair in pairs:
    # Weight at the given percentile over all entries of the matrix
    th = np.percentile(df.fillna(0).values.ravel(), pair[0])
    print(pair, th)
    graph = create_graph(df, threshold=th, is_directed=False)
    draw_char_graph(graph, 'fig/graph_author_union_{}.png'.format(pair), figsize=(16, 16))
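
The percentile above is taken over every entry of the matrix, so the diagonal (always 1) is included and each pair is counted twice. The following sketch shows a variation that uses each unordered pair once. It is an assumption of mine rather than what the figures in this article use; `edge_threshold` and the toy matrix are hypothetical.

```python
import numpy as np
import pandas as pd


def edge_threshold(sim, percent):
    """Percentile of the off-diagonal weights of a symmetric matrix."""
    values = sim.fillna(0).to_numpy()
    rows, cols = np.triu_indices_from(values, k=1)  # each unordered pair once
    return np.percentile(values[rows, cols], percent)


names = ['A', 'B', 'C', 'D']
sim = pd.DataFrame(
    [[1.0, 0.1, 0.2, 0.3],
     [0.1, 1.0, 0.4, 0.5],
     [0.2, 0.4, 1.0, 0.6],
     [0.3, 0.5, 0.6, 1.0]],
    index=names, columns=names)

th_pairs = edge_threshold(sim, 90)                             # pairs only
th_all = np.percentile(sim.fillna(0).to_numpy().ravel(), 90)   # all entries
print(th_pairs, th_all)
```

On this tiny matrix the diagonal dominates the all-entries percentile, so the two thresholds differ a lot. With many channels the diagonal is a small fraction of the entries and the difference becomes much smaller.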

- Viewer overlap is high within the same agency.
- Most of Tamaki Inuyama's connections go to hololive; the connection to hololive is stronger than the one to Nijisanji.
- The same holds for Shigure Ui.

graph_author_union_(90, 0).png

Network within each agency

Here, only the edges in the top 10% are plotted. As a personal impression, when the connections within an agency look weak, the following explanations are possible.

- The whole agency is connected, but the viewer overlap is weak everywhere.
- The connections to channels outside the agency are strong, so the connections look weak when only the inside of the agency is visualized.

Nijisanji network

- Too many lines overlap to read anything.
graph_author_union_btw_Nijisanji Japan_and_Nijisanji Japan.png

Bottom 3 and top 3 channels by mean weight of the edges connected to each node

- The first three rows are the bottom 3 channels by mean weight of the connected edges.
- The last three rows are the top 3 channels.

Channel                                                Mean     Group
Azuchi Momo                                            0.02581  Nijisanji Japan
♥️♠️ Alice Mononobe ♦️♣️                                0.03546  Nijisanji Japan
Gilzaren III Season 2                                  0.04463  Nijisanji Japan
Akina Saegusa / Saegusa Akina                          0.13969  Nijisanji Japan
Amamiya Kokoro/Kokoro Amamiya [Nijisanji affiliation]  0.14043  Nijisanji Japan
Gwelu Os Gar [Nijisanji]                               0.14336  Nijisanji Japan
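
The article does not show how these tables were produced. One way to build such a table is sketched below, assuming the mean is taken over all edges from a channel to the other channels in the group; `mean_edge_weight`, `bottom_and_top` and the toy matrix are hypothetical.

```python
import numpy as np
import pandas as pd


def mean_edge_weight(sim):
    """Mean weight of the edges attached to each node, ignoring self-similarity."""
    values = sim.to_numpy(dtype=float).copy()
    np.fill_diagonal(values, np.nan)  # drop the diagonal (always 1)
    return pd.Series(np.nanmean(values, axis=1), index=sim.index, name='Mean')


def bottom_and_top(sim, n=3):
    """Return the n lowest rows followed by the n highest rows."""
    mean = mean_edge_weight(sim).sort_values()
    return pd.concat([mean.head(n), mean.tail(n)])


names = ['A', 'B', 'C', 'D']
sim = pd.DataFrame(
    [[1.0, 0.1, 0.2, 0.3],
     [0.1, 1.0, 0.4, 0.5],
     [0.2, 0.4, 1.0, 0.6],
     [0.3, 0.5, 0.6, 1.0]],
    index=names, columns=names)

ranking = bottom_and_top(sim, n=1)
print(ranking)
```

When a group has fewer than 2n channels, the head and tail overlap and the same rows appear twice, as happens in the Nori Pro table.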

Network in hololive

- This matches my personal impression.
- Triangles formed by overlapping viewer bases stand out... I think.
- Noefure

graph_author_union_btw_Hololive Japan_and_Hololive Japan.png

Channel                         Mean    Group
Mel Channel Yozora Mel Channel  0.1314  Hololive Japan
SoraCh. Tokino Sora Channel     0.1653  Hololive Japan
Nakiri Ayame Ch. Nakiri Ayame   0.1859  Hololive Japan
Kanata Ch. Amane Kanata         0.2664  Hololive Japan
Watame Ch. Tsunomaki Watame     0.2684  Hololive Japan
Shion Ch. Murasaki Shion        0.2699  Hololive Japan

Network within holostars

- This matches my personal impression.
- Kaoru Tsukishita is great.

graph_author_union_btw_Holostars_and_Holostars.png

Channel                   Mean    Group
Izuru Ch. Kanade Izuru    0.1544  Holostars
Kira Ch. Kagami Kira      0.1688  Holostars
Rikka ch. Rikka           0.1748  Holostars
astel ch. Astel           0.2173  Holostars
Shien Ch. Kageyama Shien  0.2178  Holostars
Temma Ch. Kishido Temma   0.2222  Holostars

Network in 774 inc.

- Are the viewers split among Sugariri, Honeystrap, and AniMare?

graph_author_union_btw_774 inc._and_774 inc..png

Channel                                   Mean    Group
Patra Channel / Suo Patra [Honeystrap]    0.1307  774 inc.
Haneru Channel / Haneru Inaba [AniMare]   0.1335  774 inc.
CAMOMI Camomi Channel [Kamomi Camomi]     0.1369  774 inc.
Izumi Channel / Izumi Yuzuhara [AniMare]  0.1931  774 inc.
Anna Channel / Anna Torajo [Sugariri]     0.1949  774 inc.
Rene Channel / Ryugasaki Rin [Sugariri]   0.2055  774 inc.

Network in upd8

- The lines are thin; there is not much viewer overlap.
- The lines between the babiniku ojisan channels are thick.

graph_author_union_btw_upd8_and_upd8.png

Channel                                    Mean     Group
Engine Kazumi                              0.03281  upd8
Yuuki Channel [Fucking sex education]      0.03323  upd8
Cheri High Homecoming Department           0.03345  upd8
Nora Cat Channel                           0.04661  upd8
Tomari Mari channel / Tomari Mari Channel  0.04728  upd8
Tuna channel                               0.04752  upd8

Nori Pro Network

- With the top 10% cutoff the lines disappear, so only here the top 25% of the edges are drawn.
- Yuzuru Himesaki and Takuma Kumagai haven't posted any videos yet, so only three channels remain and the bottom 3 and top 3 rows are the same.

graph_author_union_btw_Noripuro_and_Noripuro.png

Channel                           Mean    Group
Norio Tsukudani [Tamaki Inuyama]  0.2353  Noripuro
Aimiya Milk Milk Enomiya          0.2453  Noripuro
Shirayuki Mishiro                 0.2591  Noripuro
Norio Tsukudani [Tamaki Inuyama]  0.2353  Noripuro
Aimiya Milk Milk Enomiya          0.2453  Noripuro
Shirayuki Mishiro                 0.2591  Noripuro

Network of independent VTubers

- I noticed only after plotting that Musubime Yui and Sia Minase actually belong to an agency. It is natural that their viewers overlap.

graph_author_union_btw_Other VTubers_and_Other VTubers.png

Channel                                Mean     Group
Kobana                                 0.08867  Other VTubers
Kazenomiya Festival / Matsuri Channel  0.09249  Other VTubers
Heavenly Hiyo                          0.09265  Other VTubers
Makio [Individual]                     0.10765  Other VTubers
Sia Minase [Sia Channel]               0.11933  Other VTubers
Musubime Yui 〖YouTube〗                0.12053  Other VTubers

Network between agencies

- Plotting every combination would produce too many images, so I show only Nijisanji and hololive.
- Is it the influence of the Ozora family that Subaru Ozora and Keisuke Maimoto rank near the top by weight of connected edges?

graph_author_union_btw_Hololive Japan_and_Nijisanji Japan.png

Among the hololive channels, the bottom 3 and top 3 by mean weight of the edges connected to Nijisanji

Channel                         Mean     Group
Mel Channel Yozora Mel Channel  0.02613  Hololive Japan
SoraCh. Tokino Sora Channel     0.03727  Hololive Japan
Towa Ch. Tokoyami Towa          0.03840  Hololive Japan
Kanata Ch. Amane Kanata         0.05254  Hololive Japan
Marine Ch. Houshou Marine       0.05539  Hololive Japan
Subaru Ch. Ozora Subaru         0.05971  Hololive Japan

Among the Nijisanji channels, the bottom 3 and top 3 by mean weight of the edges connected to hololive

Channel                                     Mean      Group
Azuchi Momo                                 0.003585  Nijisanji Japan
Harusaki Air                                0.009090  Nijisanji Japan
Gilzaren III Season 2                       0.009123  Nijisanji Japan
[Year 3, Class 0] Mirei Gundou's classroom  0.085043  Nijisanji Japan
Keisuke Maimoto                             0.087824  Nijisanji Japan
Lulu Suzuhara [Nijisanji affiliation]       0.096265  Nijisanji Japan

Impressions

- The more streamers collaborate, the more their viewers overlap. That makes sense.
- Core extraction and cluster analysis look like interesting next steps.
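
As a rough idea of what those next steps could look like with networkx, here is a sketch on a made-up graph. Nothing here was run on the VTuber data: the node names and weights are hypothetical, and `k_core` and `greedy_modularity_communities` are just one possible choice of method.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph: two tight groups joined by one weak edge, plus a loosely attached node
G = nx.Graph()
G.add_weighted_edges_from([
    ('a1', 'a2', 0.30), ('a1', 'a3', 0.25), ('a2', 'a3', 0.28),
    ('b1', 'b2', 0.27), ('b1', 'b3', 0.26), ('b2', 'b3', 0.29),
    ('a1', 'b1', 0.05),
    ('a3', 'x', 0.02),
])

# Core extraction: keep the nodes that have at least 2 neighbors within the core
# (k_core looks only at the number of edges, not at their weights)
core = nx.k_core(G, k=2)
print(sorted(core.nodes()))

# Cluster analysis: greedy modularity maximization using the edge weights
communities = greedy_modularity_communities(G, weight='weight')
print([sorted(c) for c in communities])
```

Because `k_core` ignores weights, it would make sense to apply it to a graph that has already been thresholded, such as the top 10% network.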
