[PYTHON] Evaluating the clustering coefficient of VTuber channels

Summary

- A brief summary of the clustering coefficient, one of the tools of network analysis.
- I applied the clustering coefficient to the network built last time, in which edges are drawn according to the share of viewers two channels have in common.
- Result
  - I calculated the clustering coefficients of Nijisanji and hololive with Zhang's method.
  - The top 5 channels by clustering coefficient in each office are as follows.

Nijisanji (ascending order; the last row is the highest):

name c_coeff kind
Ibrahim [Nijisanji] 0.4308 Zhang
Himawari Honma- Himawari Honma - 0.4316 Zhang
Kanae Channel 0.4317 Zhang
Ars Almar-ars almal-[Nijisanji] 0.4338 Zhang
Kuzuha Channel 0.4424 Zhang

hololive (ascending order; the last row is the highest):

name c_coeff kind
Fubuki Ch. Shirakami Fubuki 0.6696 Zhang
Aqua Ch.Minato Aqua 0.6708 Zhang
Coco Ch.Kiryu Coco 0.6719 Zhang
Pekora Ch.Usada Pekora 0.6748 Zhang
Korone Ch.Inugami Korone 0.6758 Zhang

Clustering coefficient

What is the clustering coefficient?

The clustering coefficient is one quantitative measure of how central a node is to a cluster in the network. As the definition shows, it encodes the expectation that, inside a cluster, the neighborhood of a node forms a clique. The definition differs depending on whether the edge weights of the graph are taken into account. All definitions introduced below are for undirected graphs.

For undirected graphs without weights

Definition

The clustering coefficient $C_i$ of node $i$ is defined below. If node $i$ has one or fewer adjacent nodes, $C_i$ is set to 0.

\displaystyle{
\begin{aligned}
C_i &:= \frac{\sum_{j, k \in \Pi(i), j\neq k} a_{ij}a_{jk}a_{ki}}{k_i (k_i - 1)}\\
k_i &:= \sum_{j\in \Pi(i)} a_{ij}\\
a_{ij} &: \text{the } (i, j) \text{ component of the adjacency matrix } A\\
\Pi(i) &: \text{the set of nodes adjacent to node } i
\end{aligned}
}

Since we are dealing with an undirected graph without weights, the components of the adjacency matrix have only $ 1 $ or $ 0 $ as values.

Intuitively, the more central a node is to a cluster:

- the more tightly the node and pairs of its neighbors are connected to one another;
- in other words, the more often the node and two of its neighbors form a 3-clique (a triangle).
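The definition above can be sketched in a few lines of plain Python. This is only an illustration, not the code used for the results in this article: the function name `clustering_unweighted` and the small example matrix are made up here, and the adjacency matrix is assumed to be a symmetric list of lists of 0s and 1s.

```python
def clustering_unweighted(a):
    """Clustering coefficient C_i for a symmetric 0/1 adjacency matrix `a`."""
    n = len(a)
    coeffs = []
    for i in range(n):
        neighbors = [j for j in range(n) if j != i and a[i][j]]
        k = len(neighbors)
        if k <= 1:
            # Defined as 0 when the node has one or fewer neighbors.
            coeffs.append(0.0)
            continue
        # Count ordered pairs (j, m) of distinct neighbors that are connected.
        closed = sum(a[j][m] for j in neighbors for m in neighbors if j != m)
        coeffs.append(closed / (k * (k - 1)))
    return coeffs


# Hypothetical example: a triangle 0-1-2 plus node 3 attached only to node 0.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
coeffs = clustering_unweighted(A)
print(coeffs)
```

Nodes 1 and 2 sit in a closed triangle and get 1, node 0 gets 1/3 because only one of its three neighbor pairs is connected, and node 3 gets 0 because it has a single neighbor.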

Concrete example

Omitted.

A quick web search will turn up plenty of examples.

For weighted undirected graphs

There seem to be various definitions of the clustering coefficient for weighted undirected graphs. Here are some of them. The weighted adjacency matrix is written $W$, and the adjacency matrix that holds only connection information ($1$ or $0$) is written $A$. Also, let $s_i := \sum_{j \in \Pi(i)} w_{ij}$.

Note that if every component of the weight matrix is 0 or 1 (that is, an unweighted adjacency matrix), all of these definitions agree with the unweighted clustering coefficient (see the reference link, since the calculation is tedious).

reference

Definition

Zhang

\displaystyle{
\begin{aligned}
C_{i}^Z := \frac{\sum_{j, k \in \Pi(i), j\neq k} w_{ij}w_{jk}w_{ki}}{\left((\sum_{j\in\Pi(i)} w_{ij})^2 - \sum_{j\in\Pi(i)} w_{ij}^2\right) \max(w_{jk})}
\end{aligned}
}

- A straightforward extension of the unweighted coefficient: the numerator simply replaces $a$ with $w$, and the denominator has the same form once it is rearranged.
- Useful when you want to take all of the weights $w_{ij}, w_{jk}, w_{ki}$ into account.
- The weights are normalized by $\max(w_{jk})$.
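As a rough sketch of Zhang's formula (not the code used for the results in this article), the following plain-Python function takes a symmetric weight matrix with a zero diagonal as a list of lists. The name `zhang` and both example matrices are assumptions made for illustration. The second example checks that with 0/1 weights the value falls back to the unweighted coefficient.

```python
def zhang(w):
    """Zhang's weighted clustering coefficient for a symmetric matrix `w`."""
    n = len(w)
    max_w = max(max(row) for row in w)
    coeffs = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Non-neighbors have w[i][j] == 0, so they drop out of both sums.
        num = sum(w[i][j] * w[j][k] * w[k][i]
                  for j in others for k in others if j != k)
        s = sum(w[i][j] for j in others)
        den = (s * s - sum(w[i][j] ** 2 for j in others)) * max_w
        coeffs.append(num / den if den > 0 else 0.0)
    return coeffs


# Hypothetical weighted triangle.
W = [
    [0.0, 0.5, 0.2],
    [0.5, 0.0, 0.4],
    [0.2, 0.4, 0.0],
]
weighted = zhang(W)
print(weighted)

# Hypothetical 0/1 matrix: triangle 0-1-2 plus node 3 attached to node 0.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
unweighted = zhang(A)
print(unweighted)
```

With 0/1 weights, $w^2 = w$, so the denominator becomes $k_i^2 - k_i = k_i(k_i - 1)$ and the maximum weight is 1, which is why the second result matches the unweighted definition.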

Lopez-Fernandez

\displaystyle{
\begin{aligned}
C_{i}^L := \frac{\sum_{j, k \in \Pi(i), j\neq k} w_{jk}}{k_i (k_i-1)}
\end{aligned}
}

- Useful when you want to focus only on the weights between the neighboring nodes $j, k$.
- The weights $w_{ij}$ and $w_{ik}$ are ignored.
- The weights are not normalized.
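A minimal sketch of the Lopez-Fernandez definition follows, again assuming a symmetric list-of-lists weight matrix with a zero diagonal. The function name `lopez_fernandez` and the example weights are hypothetical. The example is a fully connected graph in which node 3 is only weakly tied to the others. On a fully connected graph with $n$ nodes the formula reduces to $(T - 2 s_i) / ((n-1)(n-2))$, where $T$ is the sum of all ordered-pair weights, so the node with the smallest $s_i$ gets the largest value.

```python
def lopez_fernandez(w):
    """Lopez-Fernandez weighted clustering coefficient for a symmetric matrix `w`."""
    n = len(w)
    coeffs = []
    for i in range(n):
        neighbors = [j for j in range(n) if j != i and w[i][j] > 0]
        k = len(neighbors)
        if k <= 1:
            coeffs.append(0.0)
            continue
        # Only the weights between the neighbors enter the sum.
        total = sum(w[j][m] for j in neighbors for m in neighbors if j != m)
        coeffs.append(total / (k * (k - 1)))
    return coeffs


# Hypothetical fully connected graph: node 3 has weak ties (0.1) to everyone.
W = [
    [0.0, 0.5, 0.5, 0.1],
    [0.5, 0.0, 0.5, 0.1],
    [0.5, 0.5, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
lf = lopez_fernandez(W)
print(lf)
```

The weakly attached node 3 ends up with the highest coefficient, because its own weak edges are ignored while the strong edges among its neighbors are counted.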

Onnela

\displaystyle{
\begin{aligned}
C_{i}^O := \frac{\sum_{j, k \in \Pi(i), j\neq k} (w_{ij}w_{jk}w_{ki})^{1/3}}{k_i (k_i-1) \max(w_{jk})}
\end{aligned}
}

- Useful when you want to take all of the weights $w_{ij}, w_{jk}, w_{ki}$ into account.
- Because of the 1/3 power, the effect of each individual weight is weaker than in Zhang's definition.
- The weights are normalized by $\max(w_{jk})$.
- This is the definition implemented by the `clustering` function of NetworkX, a Python library, when edge weights are given.
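Onnela's formula can be sketched the same way. This is an illustrative plain-Python version, not NetworkX's implementation: the name `onnela` and the example matrices are assumptions. The first matrix is a fully connected graph with one weakly attached node, the second has uniform weights.

```python
def onnela(w):
    """Onnela's weighted clustering coefficient for a symmetric matrix `w`."""
    n = len(w)
    max_w = max(max(row) for row in w)
    coeffs = []
    for i in range(n):
        neighbors = [j for j in range(n) if j != i and w[i][j] > 0]
        k = len(neighbors)
        if k <= 1:
            coeffs.append(0.0)
            continue
        # Geometric mean of the three edge weights of each triangle.
        total = sum((w[i][j] * w[j][m] * w[m][i]) ** (1 / 3)
                    for j in neighbors for m in neighbors if j != m)
        coeffs.append(total / (k * (k - 1) * max_w))
    return coeffs


# Hypothetical fully connected graph: node 3 has weak ties (0.1) to everyone.
W = [
    [0.0, 0.5, 0.5, 0.1],
    [0.5, 0.0, 0.5, 0.1],
    [0.5, 0.5, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
on = onnela(W)
print(on)

# Hypothetical uniform triangle: every coefficient should be (close to) 1.
U = [
    [0.0, 0.3, 0.3],
    [0.3, 0.0, 0.3],
    [0.3, 0.3, 0.0],
]
on_uniform = onnela(U)
print(on_uniform)
```

Here the weakly attached node 3 gets the lowest value, because its own edge weights enter every term of the sum.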

Barrat

\displaystyle{
\begin{aligned}
C_{i}^B := \frac{\sum_{j, k \in \Pi(i), j\neq k} (w_{ij} + w_{ki})a_{jk}}{2s_i (k_i-1)}
\end{aligned}
}

- The weights enter as a sum, not as a product.
- The weights between $j$ and $k$ are not considered.
- The opposite of Lopez-Fernandez: useful when you want to consider the connection strengths $w_{ij}, w_{ik}$.
- The weights are normalized by $s_i$.
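The following is a small sketch of Barrat's formula under the same assumptions as before (symmetric list-of-lists weights, zero diagonal). The name `barrat` and the example matrices are hypothetical. The first graph is fully connected, and the second is the same graph with the edge between nodes 1 and 2 removed.

```python
def barrat(w):
    """Barrat's weighted clustering coefficient for a symmetric matrix `w`."""
    n = len(w)
    coeffs = []
    for i in range(n):
        neighbors = [j for j in range(n) if j != i and w[i][j] > 0]
        k = len(neighbors)
        if k <= 1:
            coeffs.append(0.0)
            continue
        s = sum(w[i][j] for j in neighbors)
        # The edge j-m only acts as a 0/1 switch (a_jm); its weight is unused.
        total = sum(w[i][j] + w[i][m]
                    for j in neighbors for m in neighbors
                    if j != m and w[j][m] > 0)
        coeffs.append(total / (2 * s * (k - 1)))
    return coeffs


# Hypothetical fully connected graph.
W_full = [
    [0.0, 0.5, 0.5, 0.1],
    [0.5, 0.0, 0.5, 0.1],
    [0.5, 0.5, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
b_full = barrat(W_full)
print(b_full)

# Same graph with the edge 1-2 removed.
W_cut = [row[:] for row in W_full]
W_cut[1][2] = W_cut[2][1] = 0.0
b_cut = barrat(W_cut)
print(b_cut)
```

On the fully connected graph every pair of neighbors is connected, so the numerator adds up to $2 s_i (k_i - 1)$ and every coefficient is 1 regardless of the weights. Removing one edge lowers the value for the nodes whose neighbor pair is no longer closed.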

Serrano

\displaystyle{
\begin{aligned}
C_{i}^S &:= \frac{\sum_{j, k \in \Pi(i), j\neq k} w_{ij}a_{jk}w_{ki}}{s_i^2 (1-Y_i)}\\
Y_i &:= \sum_{j\in\Pi(i)} \left(\frac{w_{ij}}{s_i}\right)^2
\end{aligned}
}

- The weights between $j$ and $k$ are not considered.
- Same spirit as Barrat.
- The weights are normalized by $s_i$.
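Serrano's formula can be sketched in the same style. The name `serrano` and the example matrices are made up for illustration, and the input is assumed to be a symmetric list-of-lists weight matrix with a zero diagonal.

```python
def serrano(w):
    """Serrano's weighted clustering coefficient for a symmetric matrix `w`."""
    n = len(w)
    coeffs = []
    for i in range(n):
        neighbors = [j for j in range(n) if j != i and w[i][j] > 0]
        if len(neighbors) <= 1:
            coeffs.append(0.0)
            continue
        s = sum(w[i][j] for j in neighbors)
        y = sum((w[i][j] / s) ** 2 for j in neighbors)
        # Product of the two edges at node i, switched on by the edge j-m.
        total = sum(w[i][j] * w[i][m]
                    for j in neighbors for m in neighbors
                    if j != m and w[j][m] > 0)
        coeffs.append(total / (s * s * (1 - y)))
    return coeffs


# Hypothetical fully connected graph with uneven weights.
W_full = [
    [0.0, 0.7, 0.2, 0.4],
    [0.7, 0.0, 0.3, 0.6],
    [0.2, 0.3, 0.0, 0.9],
    [0.4, 0.6, 0.9, 0.0],
]
s_full = serrano(W_full)
print(s_full)

# Same graph with the edge 2-3 removed.
W_cut = [row[:] for row in W_full]
W_cut[2][3] = W_cut[3][2] = 0.0
s_cut = serrano(W_cut)
print(s_cut)
```

As with Barrat, a fully connected graph gives 1 for every node: the numerator becomes $s_i^2 - \sum_j w_{ij}^2$, which is the same quantity as $s_i^2 (1 - Y_i)$.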

Application to VTuber network

The motivation is to find out whether these indicators can pick out the core streamers, and whether they work at all on the network defined last time.

The network defined last time

The edge between two channels is weighted by the fraction of commenting viewers that the two channels have in common.

\displaystyle{
\begin{aligned}
w_{ij} &:= \frac{|U_i \cap U_j|}{|U_i \cup U_j|}\\
U_i &: \text{the set of users who commented on channel } i
\end{aligned}
}
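The weight above is the Jaccard index of the two sets of commenters. A minimal sketch with Python sets, where the channel names and user IDs are invented purely for illustration:

```python
from itertools import combinations


def jaccard(u, v):
    """Share of common users: |u & v| / |u | v| (0 if both sets are empty)."""
    union = u | v
    return len(u & v) / len(union) if union else 0.0


# Hypothetical commenter sets per channel.
commenters = {
    "channel_a": {"u1", "u2", "u3"},
    "channel_b": {"u2", "u3", "u4"},
    "channel_c": {"u5"},
}

# One weight per unordered pair of channels.
weights = {
    (a, b): jaccard(commenters[a], commenters[b])
    for a, b in combinations(sorted(commenters), 2)
}
print(weights)
```

Channels a and b share two of four distinct commenters, so their edge weight is 0.5, while channel c shares nobody with the others and gets weight 0.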

Network features

- Almost fully connected (two channels almost always share at least some viewers).
- Ties are basically strong within the same office and weak across offices.
  - The exceptions are channels with strong ties to another office, such as Tamaki Inuyama and Shigure Ui.
- In a large office, the weights vary little within the office.

Data used, conditions, etc.

- Data: comments obtained from archived YouTube broadcasts
- Period: 2020/1/1 to 2020/6/30

The whole network looks like this (I have added a few channels since last time). Thicker lines indicate a higher share of common viewers.

graph_author_union.png

Calculating the clustering coefficient

Here the calculation is done separately for the Nijisanji-only network and the hololive-only network. I do not calculate a clustering coefficient over a network that mixes several offices because the clusters differ greatly in their number of nodes, and nodes that belong to a large cluster come out with a large clustering coefficient. I do not (yet) know how to calculate an appropriate clustering coefficient for a network whose clusters are this imbalanced in size, so for now I narrow the calculation down to one office at a time.

For each network, I calculate the clustering coefficient and list the bottom 5, the top 5, and the percentiles. In each section below, the first table is the bottom 5, the second is the top 5 (both sorted in ascending order), and the last is the summary statistics. The code used is posted at the end of the article.
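The ranking and the summary row can be reproduced along the following lines. This is a sketch using only the standard library, not the code behind the tables in this article: the function name `summarize` and the coefficient values are hypothetical, and the percentiles use linear interpolation via `statistics.quantiles(..., method="inclusive")`, which is an assumption about how the original percentiles were computed.

```python
import statistics


def summarize(coeffs, n=5):
    """Bottom-n, top-n (ascending) and summary statistics of a name -> value dict."""
    ranked = sorted(coeffs.items(), key=lambda item: item[1])
    values = [value for _, value in ranked]
    # Nine cut points: deciles[0] is the 10% point, deciles[8] the 90% point.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "bottom": ranked[:n],
        "top": ranked[-n:],
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": values[0],
        "10%": deciles[0],
        "50%": deciles[4],
        "90%": deciles[8],
        "max": values[-1],
    }


# Hypothetical coefficients for eleven channels.
coeffs = {f"ch{i}": i / 10 for i in range(11)}
summary = summarize(coeffs)
print(summary["bottom"])
print(summary["top"])
print({key: summary[key] for key in ("count", "mean", "std", "min", "10%", "50%", "90%", "max")})
```

Because both lists are cut from the same ascending sort, the last entry of `top` is the channel with the highest coefficient, which matches the ordering of the tables.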

Application to Nijisanji

The Nijisanji graph looks like this. With so many nodes and edges, the graph is hard to read...

Nijisanji Japan.png

Zhang

name c_coeff kind
Gilzaren III Season 2 0.3171 Zhang
Azuchi peach 0.3174 Zhang
Rine- Rine Yaguruma - 0.3246 Zhang
Naruse Naru 0.3265 Zhang
Amemori Saya 0.3284 Zhang
name c_coeff kind
Ibrahim [Nijisanji] 0.4308 Zhang
Himawari Honma- Himawari Honma - 0.4316 Zhang
Kanae Channel 0.4317 Zhang
Ars Almar-ars almal-[Nijisanji] 0.4338 Zhang
Kuzuha Channel 0.4424 Zhang
count mean std min 10% 50% 90% max
98 0.3907 0.02992 0.3171 0.3458 0.3945 0.4229 0.4424

Lopez-Fernandez

name c_coeff kind
Gweru male girl/Gwelu Os Gar [Nijisanji] 0.1184 Lopez_Fernandez
Rion Takamiya 0.1184 Lopez_Fernandez
Watch at night/yorumi rena [Nijisanji affiliation] 0.1185 Lopez_Fernandez
Debidebi Debi 0.1185 Lopez_Fernandez
Akina Saegusa/ Saegusa Akina 0.1186 Lopez_Fernandez
name c_coeff kind
Amemori Saya 0.1207 Lopez_Fernandez
Rine- Rine Yaguruma - 0.1208 Lopez_Fernandez
Naruse Naru 0.1208 Lopez_Fernandez
Azuchi peach 0.1209 Lopez_Fernandez
Gilzaren III Season 2 0.1211 Lopez_Fernandez
count mean std min 10% 50% 90% max
98 0.1193 0.000656 0.1184 0.1186 0.1191 0.1203 0.1211

Onnela

name c_coeff kind
Gilzaren III Season 2 0.1460 Onnela
Azuchi peach 0.1708 Onnela
Naruse Naru 0.1752 Onnela
Rine- Rine Yaguruma - 0.1798 Onnela
Amemori Saya 0.1865 Onnela
name c_coeff kind
Ryushen channel 0.3967 Onnela
Debidebi Debi 0.3989 Onnela
Watch at night/yorumi rena [Nijisanji affiliation] 0.3998 Onnela
Rion Takamiya 0.4060 Onnela
Gweru male girl/Gwelu Os Gar [Nijisanji] 0.4093 Onnela
count mean std min 10% 50% 90% max
98 0.3295 0.06136 0.1460 0.2361 0.3478 0.3917 0.4093

Barrat

name c_coeff kind
Kou Uzuki 1.0000 Barrat
Mahiro Yukishiro/Yukishiro Mahiro [Nijisanji affiliation] 1.0000 Barrat
Haruka Onomachi ♨ Onomachi Haruka Nijisanji 1.0000 Barrat
Rine- Rine Yaguruma - 1.0000 Barrat
Fren E. Lustario 1.0000 Barrat
name c_coeff kind
Naruse Naru 1.000 Barrat
Ellie Conifer/Eli Conifer [Nijisanji] 1.000 Barrat
Rion Takamiya 1.000 Barrat
Watch at night/yorumi rena [Nijisanji affiliation] 1.000 Barrat
Aiba Uiha 〖Aiba Uiha〗 Nijisanji affiliation 1.000 Barrat
count mean std min 10% 50% 90% max
98 1.0000 0.000000 1.0000 1.0000 1.0000 1.000 1.000

Serrano

name c_coeff kind
Hanasaki Morinaka 1.0000 Serrano
Aiba Uiha 〖Aiba Uiha〗 Nijisanji affiliation 1.0000 Serrano
Amamiya Kokoro/Kokoro Amamiya [Nijisanji affiliation] 1.0000 Serrano
Haru Kaida/Kaida Haru [Nijisanji] 1.0000 Serrano
Gilzaren III Season 2 1.0000 Serrano
name c_coeff kind
Kou Uzuki 1.000 Serrano
Yoko Akabane 1.000 Serrano
Keisuke Maimoto 1.000 Serrano
Quarter moon Fujishiro/Genzuki Tojiro [Nijisanji] 1.000 Serrano
Kana Sukoya [Nijisanji] Kana Sukoya 1.000 Serrano
count mean std min 10% 50% 90% max
98 1.0000 0.000000 1.0000 1.0000 1.000 1.000 1.000

Application to hololive

The hololive network is shown below. It is easier to read because it has fewer nodes than Nijisanji. The Barrat and Serrano coefficients are omitted here because, as with Nijisanji, they are all 1. The reason is described later.

Hololive Japan.png

Zhang

name c_coeff kind
Mel Channel Night sky Mel channel 0.6304 Zhang
hololive hololive- VTuber Group 0.6325 Zhang
SoraCh.Tokino Sora Channel 0.6331 Zhang
Akiroze Ch. Vtuber/Hololive affiliation 0.6366 Zhang
Choco Ch.Choco Heitsuki 0.6373 Zhang
name c_coeff kind
Fubuki Ch. Shirakami Fubuki 0.6696 Zhang
Aqua Ch.Minato Aqua 0.6708 Zhang
Coco Ch.Kiryu Coco 0.6719 Zhang
Pekora Ch.Usada Pekora 0.6748 Zhang
Korone Ch.Inugami Korone 0.6758 Zhang
count mean std min 10% 50% 90% max
28 0.6521 0.01336 0.6304 0.6355 0.6516 0.6711 0.6758

Lopez-Fernandez

name c_coeff kind
Shion Ch.Shisaki Zion 0.2024 Lopez_Fernandez
Watame Ch.For square winding 0.2025 Lopez_Fernandez
Kanata Ch.Amane Kanata 0.2025 Lopez_Fernandez
Mio Channel Ogami Mio 0.2028 Lopez_Fernandez
Flare Ch.Shiranui flare 0.2031 Lopez_Fernandez
name c_coeff kind
Choco Ch.Choco Heitsuki 0.2069 Lopez_Fernandez
hololive hololive- VTuber Group 0.2071 Lopez_Fernandez
Nakiri Ayame Ch.Hyakuki Ayame 0.2092 Lopez_Fernandez
SoraCh.Tokino Sora Channel 0.2108 Lopez_Fernandez
Mel Channel Night sky Mel channel 0.2133 Lopez_Fernandez
count mean std min 10% 50% 90% max
28 0.2052 0.002504 0.2024 0.2027 0.2049 0.2077 0.2133

Onnela

name c_coeff kind
Mel Channel Night sky Mel channel 0.3797 Onnela
SoraCh.Tokino Sora Channel 0.4607 Onnela
Nakiri Ayame Ch.Hyakuki Ayame 0.5097 Onnela
hololive hololive- VTuber Group 0.5653 Onnela
Choco Ch.Choco Heitsuki 0.5694 Onnela
name c_coeff kind
Flare Ch.Shiranui flare 0.6669 Onnela
Mio Channel Ogami Mio 0.6743 Onnela
Kanata Ch.Amane Kanata 0.6800 Onnela
Watame Ch.For square winding 0.6804 Onnela
Shion Ch.Shisaki Zion 0.6832 Onnela
count mean std min 10% 50% 90% max
28 0.6097 0.06734 0.3797 0.5486 0.6185 0.6760 0.6832

Impressions

- The lowest-ranked channels (highest-ranked for Lopez-Fernandez) are mostly channels that stream infrequently, have not streamed for a long time, or have few live streams in the first place
  When there are long gaps between streams, viewers overlap less with other channels. This is to be expected.

- The highest-ranked (lowest-ranked) channels differ greatly depending on the method
  Lopez_Fernandez depends only on the weights $w_{jk}$ between the nodes $j, k$ adjacent to node $i$. In a fully connected network, therefore, the nodes whose own weights $w_{ij}, w_{ik}$ are small come out at the top. In hololive, for example, Yozora Mel ranks high because of a long hiatus, and the official channel ranks high because it has few live streams in the first place. Perhaps the other high-ranking channels (Choco, Tokino Sora, and so on) also form only weak clusters with the other nodes.

- The top (bottom) of Lopez_Fernandez resembles the bottom (top) of Onnela
  The main difference between Zhang and Onnela is whether the cubic term in the weights is raised to the 1/3 power. Taking the 1/3 power dampens fluctuations in the weights. A calculation shows that the first-order contributions of the fluctuations are the same, but the second-order contributions are kept small (relative to Zhang) by the 1/3 power.

- The Barrat and Serrano coefficients are all 1
  In a fully connected network every coefficient is 1 by definition. With $a_{jk} = 1$ for every pair, the Barrat numerator sums to $2 s_i (k_i - 1)$ and the Serrano numerator to $s_i^2 - \sum_j w_{ij}^2 = s_i^2 (1 - Y_i)$, which are exactly the denominators.

- Which clustering coefficient is the right one?
  This network is a weighted, fully connected network, so Lopez_Fernandez returns a large coefficient even for a node that is isolated (in the sense that its weights are small). So Lopez_Fernandez seems unsuitable? Zhang and Onnela differ considerably because of the 1/3 power. Since the 1/3 power evens out the weights, the choice comes down to how sensitive the measure should be to fluctuations in edge weight. This network is fully connected and its weights vary little, so I want a measure that is sensitive to small fluctuations. So Zhang's definition looks like the better fit?

- So which channels are central to their cluster?
  Going by Zhang, the top 5 channels by clustering coefficient in Nijisanji and hololive are listed below. The hololive result matches my intuition, but I still do not know Nijisanji well enough to tell whether its result does. Please let me know.

Nijisanji (ascending order; the last row is the highest):

name c_coeff kind
Ibrahim [Nijisanji] 0.4308 Zhang
Himawari Honma- Himawari Honma - 0.4316 Zhang
Kanae Channel 0.4317 Zhang
Ars Almar-ars almal-[Nijisanji] 0.4338 Zhang
Kuzuha Channel 0.4424 Zhang

hololive (ascending order; the last row is the highest):

name c_coeff kind
Fubuki Ch. Shirakami Fubuki 0.6696 Zhang
Aqua Ch.Minato Aqua 0.6708 Zhang
Coco Ch.Kiryu Coco 0.6719 Zhang
Pekora Ch.Usada Pekora 0.6748 Zhang
Korone Ch.Inugami Korone 0.6758 Zhang

(Crappy) code

Here it is.

import itertools


class clustering_coefficient:
    """Weighted clustering coefficients of every node.

    df: square, symmetric pandas DataFrame of edge weights whose index and
    columns are the same node names. Assumes every node has at least two
    neighbors; otherwise the denominators below are zero.
    """

    def __init__(self, df):
        self.df = df.copy()
        self.names = list(self.df.columns)
        # Zero the diagonal so that a node is never its own neighbor.
        for c in self.names:
            self.df.loc[c, c] = 0
        # k_i: number of neighbors, s_i: sum of the weights of node i's edges.
        self.ki = {name: (self.df[name] > 0).sum() for name in self.names}
        self.si = {name: self.df[name].sum() for name in self.names}
        self.max_w = self.df.values.max()

    def _pairs(self):
        # Ordered pairs (n1, n2) with n1 != n2: the sum over j != k.
        return itertools.permutations(self.names, 2)

    def Zhang(self):
        w = self.df.loc
        return {name: sum(w[n1, n2] * w[n1, name] * w[name, n2] for n1, n2 in self._pairs())
                / sum(w[n1, name] * w[name, n2] for n1, n2 in self._pairs())
                / self.max_w
                for name in self.names}

    def Lopez_Fernandez(self):
        w = self.df.loc
        return {name: sum(w[n1, n2] * (w[n1, name] > 0) * (w[name, n2] > 0) for n1, n2 in self._pairs())
                / (self.ki[name] * (self.ki[name] - 1))
                for name in self.names}

    def Onnela(self):
        w = self.df.loc
        return {name: sum((w[n1, n2] * w[n1, name] * w[name, n2]) ** (1 / 3) for n1, n2 in self._pairs())
                / (self.ki[name] * (self.ki[name] - 1) * self.max_w)
                for name in self.names}

    def Barrat(self):
        w = self.df.loc
        return {name: sum((w[n1, name] + w[name, n2])
                          * (w[name, n1] > 0) * (w[name, n2] > 0) * (w[n1, n2] > 0)
                          for n1, n2 in self._pairs())
                / ((self.ki[name] - 1) * self.si[name] * 2)
                for name in self.names}

    def Serrano(self):
        w = self.df.loc
        return {name: sum((w[n1, n2] > 0) * w[n1, name] * w[name, n2] for n1, n2 in self._pairs())
                / (self.si[name] ** 2 * (1 - ((self.df[name] / self.si[name]) ** 2).sum()))
                for name in self.names}
