[Python] Network analysis of web link structure ②

Introduction

The link structure of the web is a large-scale network that you can easily obtain and play with. In Network analysis of web link structure ①, we crawled the web and built a network; in this article we actually analyze it. The crawl can take more than 3 hours, so please try it when you have time to spare. The source code is available on the author's GitHub.

Program overview

  1. Create a directed graph with networkx from an adjacency matrix
  2. Play with PageRank, centrality measures, and strongly connected component decomposition
  3. Fun!!!

NetworkX is a network analysis module.

- Multiple attributes can be attached to the nodes and edges of a graph.
- Network analyses such as centrality and strongly connected component decomposition are easy to perform.
- Beautiful visualization.

Reference site

~~I couldn't afford to write up the recipes myself~~, so I will list the websites that I referred to.

- NetworkX Official Documentation
- Qiita: [Python] Summary of basic usage of NetworkX 2.0
- PyQ: Graph Theory and NetworkX
- [DATUM STUDIO: I tried to visualize the national surname network with networkx](https://datumstudio.jp/blog/networkx%E3%81%A7%E5%85%A8%E5%9B%BD%E5%90%8D%E5%AD%97%E3%83%8D%E3%83%83%E3%83%88%E3%83%AF%E3%83%BC%E3%82%AF%E3%82%92%E5%8F%AF%E8%A6%96%E5%8C%96%E3%81%97%E3%81%A6%E3%81%BF%E3%81%9F)

Preparation

import.py


from urllib.request import urlopen
from bs4 import BeautifulSoup

import networkx as nx

from tqdm import tqdm_notebook as tqdm
import numpy as np
from scipy import stats
import pandas as pd
pd.options.display.max_colwidth = 500

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import re

The adjacency matrix `adj` used below is the one built in Network analysis of web link structure ①.

Graph creation

make_graph.py


G = nx.DiGraph()

for i in tqdm(range(len(adj))):
    for j in range(len(adj)):
        if adj[i][j]==1:
            G.add_edge(i,j)

`nx.DiGraph()` creates a directed graph.

`G.add_edge(i, j)` adds an edge to the graph `G`. ~~There is a smarter way to add them. Yup.~~
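As a sketch of that smarter way, assuming `adj` is a 0/1 NumPy array (here a tiny toy matrix stands in for the crawled one), NetworkX can build the directed graph directly from the adjacency matrix in one call:

```python
import numpy as np
import networkx as nx

# Toy 0/1 adjacency matrix standing in for the crawled `adj`
adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]])

# One call instead of the double loop over adj[i][j]
G = nx.from_numpy_array(adj, create_using=nx.DiGraph)

print(sorted(G.edges()))  # [(0, 1), (0, 2), (1, 2), (2, 0)]
```

This produces the same edge set as the double loop, with far less Python-level iteration on large matrices.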

Graph feature calculation

cal_centrality.py


degree = nx.degree_centrality(G)
print("degree completed")

closeness = nx.closeness_centrality(G)
print("closeness completed")

betweenness = nx.betweenness_centrality(G)
print("betweenness completed")

pagerank = nx.pagerank(G)
print("pagerank completed")

strong_connection = sorted(nx.strongly_connected_components(G), key=lambda x: len(x), reverse=True)
print("strongly connected components completed")

`nx.degree_centrality(G)` Degree centrality: the sum of the in-degree and out-degree of each node. `nx.degree_centrality(G)` returns this value divided by $n-1$ ($n$: number of vertices in the graph).

`nx.closeness_centrality(G)` Closeness centrality: the reciprocal of the average shortest-path length between each node and the other nodes.

`nx.betweenness_centrality(G)` Betweenness centrality: the fraction of shortest paths between pairs of other nodes that pass through each node.

`nx.strongly_connected_components(G)` Strongly connected component decomposition. `sorted(nx.strongly_connected_components(G), key=lambda x: len(x), reverse=True)` returns an array of the strongly connected components sorted in descending order of their number of vertices.
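To see what these functions return, here is a minimal sketch on a hypothetical 4-node graph (not the crawled network): nodes 0, 1, 2 form a cycle, and node 3 only links in.

```python
import networkx as nx

# Tiny directed graph: 0 -> 1 -> 2 -> 0 is a cycle, 3 -> 0 only links in
G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (3, 0)])

degree = nx.degree_centrality(G)        # (in-degree + out-degree) / (n - 1)
closeness = nx.closeness_centrality(G)
betweenness = nx.betweenness_centrality(G)
pagerank = nx.pagerank(G)

# Node 3 has no incoming edges, so it forms its own strongly connected component
scc = sorted(nx.strongly_connected_components(G), key=lambda x: len(x), reverse=True)
print(scc)  # [{0, 1, 2}, {3}]
```

Node 0 has in-degree 2 and out-degree 1, so its degree centrality is $3/(4-1) = 1.0$, and it also gets the highest PageRank since both node 2 and node 3 link to it.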

Create DataFrame for pandas

make_df.py


df = pd.DataFrame({"ID": range(len(url_list)), "URL": url_list})

df["category"] = df["URL"].apply(lambda s: s[s.index(":")+3:s.index("/", 8)])

df.loc[df['URL'] ==start_url, 'category'] = "start_page"

df["degree_centrality"] = df["ID"].map(degree)
df["closeness_centrality"] = df["ID"].map(closeness)
df["betweenness_centrality"] = df["ID"].map(betweenness)
df["Pagerank"] = df["ID"].map(pagerank)

df = df.assign(Pagerank_rank=len(df)-stats.mstats.rankdata(df["Pagerank"])+1)
df = df.assign(degree_centrality_rank=len(df)-stats.mstats.rankdata(df["degree_centrality"])+1)
df = df.assign(closeness_centrality_rank=len(df)-stats.mstats.rankdata(df["closeness_centrality"])+1)
df = df.assign(betweenness_centrality_rank=len(df)-stats.mstats.rankdata(df["betweenness_centrality"])+1)

df.to_csv(f"./{fname}.csv")

`df["category"]` classifies each page of the network by its top page. The crawler described in [Network analysis of web link structure ①](https://qiita.com/takahiro_hasegawa/items/2d2a979ead0522c112a2#%E3%83%AA%E3%83%B3%E3%82%AF%E6%A7%8B%E9%80%A0%E8%A7%A3%E6%9E%90%E3%81%AE%E9%96%A2%E6%95%B0) stores the URLs that start with `http` in `url_list`. We classify these URLs by their top page and call the result `category`. This is useful for visualization.
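For illustration, the slicing in the `category` lambda extracts the part between `://` and the next `/`, i.e. the domain. A small standalone sketch (the helper name `to_category` is hypothetical, introduced only for this example):

```python
def to_category(s):
    # s.index(":") + 3 skips past "://"; s.index("/", 8) finds the next slash
    # after the scheme, so the slice is exactly the domain part of the URL.
    return s[s.index(":") + 3 : s.index("/", 8)]

print(to_category("https://zozo.jp/brand/"))    # zozo.jp
print(to_category("https://wear.jp/ranking/"))  # wear.jp
```

Note that this assumes every URL has a trailing slash after the domain; a bare `https://example.com` would raise `ValueError`.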

`df.assign(Pagerank_rank=len(df)-stats.mstats.rankdata(df["###"])+1)` adds the ranking of each centrality as a new column. `stats.mstats.rankdata` assigns tied values the average of their ranks and ranks in ascending order starting from 1. Reversing the order with `len(df) - rank + 1` gives a ranking that starts at 1st place for the largest value, which is then added to the DataFrame.
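A minimal check of how this rank-reversal formula behaves, including the average rank assigned to ties:

```python
import numpy as np
from scipy import stats

vals = np.array([0.5, 0.2, 0.9, 0.2])

asc = stats.mstats.rankdata(vals)  # ascending ranks, ties averaged
desc = len(vals) - asc + 1         # reversed: rank 1 = largest value

# 0.9 -> rank 1, 0.5 -> rank 2, and the two 0.2s share rank 3.5
print(asc)
print(desc)
```

The two tied values get the average of ranks 1 and 2 (i.e. 1.5) in ascending order, which becomes a shared rank of 3.5 after reversal.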

Graph visualization function

draw_graph.py


pos = nx.spring_layout(G)

`nx.spring_layout(G)` computes node positions that plot the graph nicely. If you do not define `pos` outside the drawing function, a differently shaped graph will be drawn each time.
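As an alternative sketch: in recent NetworkX versions `spring_layout` also accepts a `seed` argument, so the layout itself can be made reproducible across runs instead of relying on a shared `pos` variable.

```python
import networkx as nx

G = nx.path_graph(5)

# Same seed -> same node positions on every run
pos1 = nx.spring_layout(G, seed=42)
pos2 = nx.spring_layout(G, seed=42)

assert all((pos1[n] == pos2[n]).all() for n in G.nodes())
```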

draw_graph.py


def draw_char_graph(G, pos, title, node_type):
    plt.figure(figsize=(15, 15))

    nx.draw_networkx_edges(G,
                           pos=pos,
                           edge_color="gray",
                           edge_cmap=plt.cm.Greys,
                           edge_vmin=-3e4,
                           width=0.3,
                           alpha=0.2,
                           arrows=False)
    
    if node_type=="centrality":
        node1=nx.draw_networkx_nodes(G,
                                     pos=pos,
                                     node_color="blue",
                                     alpha=0.8,
                                     node_size=[ d["closeness"]*300 for (n,d) in G.nodes(data=True)])

        node2=nx.draw_networkx_nodes(G,
                                     pos=pos,
                                     node_color="green",
                                     alpha=0.8,
                                     node_size=[ d["degree"]*2000 for (n,d) in G.nodes(data=True)])

        node3=nx.draw_networkx_nodes(G,
                                     pos=pos,
                                     node_color="yellow",
                                     alpha=0.8,
                                     node_size=[ d["betweenness"]*5000 for (n,d) in G.nodes(data=True)])

        node4=nx.draw_networkx_nodes(G,
                                     pos=pos,
                                     node_color="red",
                                     alpha=0.8,
                                     node_size=[ d["pagerank"]*10000 for (n,d) in G.nodes(data=True)])
        
        plt.legend([node1, node2, node3,node4], ["closeness", "degree","betweenness","pagerank"],markerscale=1,fontsize=18)
        plt.title(f"centrality: {start_url}\n {nx.number_of_nodes(G)} nodes,{nx.number_of_edges(G)} edges",fontsize=25)
        
    elif node_type=="simple":
        nx.draw_networkx_nodes(G,
                               pos=pos,
                               node_color="blue",
                               node_size=5)
        plt.title(f"{start_url}\n {nx.number_of_nodes(G)} nodes,{nx.number_of_edges(G)} edges",fontsize=25)
        
    elif node_type=="strong_connection":
        nx.draw_networkx_nodes(G,
                               pos=pos,
                               node_color="black",
                               node_size=10)
        
        node1=nx.draw_networkx_nodes(G,
                                     pos=pos,
                                     node_color="blue",
                                     nodelist=strong_connection[0],
                                     node_size=30)
        
        node2=nx.draw_networkx_nodes(G,
                                     pos=pos,
                                     node_color="yellow",
                                     nodelist=strong_connection[1],
                                     node_size=30)
        
        node3=nx.draw_networkx_nodes(G,
                                     pos=pos,
                                     node_color="red",
                                     nodelist=strong_connection[2],
                                     node_size=30)
        
        plt.title(f"strongly connected nodes: {title}\n {nx.number_of_nodes(G)} nodes,{nx.number_of_edges(G)} edges",fontsize=25)
        plt.legend([node1, node2, node3], [f"elements: {len(strong_connection[0])} ({round(len(strong_connection[0])/nx.number_of_nodes(G)*100,2)}%)",
                                           f"elements: {len(strong_connection[1])} ({round(len(strong_connection[1])/nx.number_of_nodes(G)*100,2)}%)",
                                           f"elements: {len(strong_connection[2])} ({round(len(strong_connection[2])/nx.number_of_nodes(G)*100,2)}%)"],markerscale=1,fontsize=18)

    plt.axis('off')
    plt.savefig(f"{title}_graph_{node_type}", dpi=300)
    plt.show()

`draw_char_graph(G, pos, title, node_type)` is a function that draws the graph. `title` sets both the plot title and the file name to save; `node_type` selects which graph to draw:

| node_type | Graph drawn |
|---|---|
| "simple" | Plain graph |
| "centrality" | Graph with each centrality drawn in a different color |
| "strong_connection" | Top 3 strongly connected components |

Visualization

Scatter plot

visualize.py


sns.pairplot(df.drop("ID", axis=1), hue='category')
plt.savefig(f'{fname}_pairplot.png')

zozo_pairplot.png Everyone loves pair plots.

network

visualize.py


draw_char_graph(G, pos, fname, node_type="simple")

zozo_graph_simple.png

visualize.py


draw_char_graph(G, pos, fname, node_type="centrality")

zozo_graph_centrality.png

visualize.py


draw_char_graph(G, pos, fname, node_type="strong_connection")

zozo_graph_strong_connection.png The network is too large to make much out here. A smaller network would be drawn more neatly.

Centrality ranking

visualize.py


df_important = df[(df["Pagerank_rank"]<=10) | (df["degree_centrality_rank"]<=10) | (df["closeness_centrality_rank"]<=10) | (df["betweenness_centrality_rank"]<=10)]
df_important = df_important[["URL", "Pagerank_rank", "degree_centrality_rank", "closeness_centrality_rank", "betweenness_centrality_rank"]]

df_important.to_csv(f"./{fname}_important.csv")
| ID | PageRank | Degree centrality | Closeness centrality | Betweenness centrality | URL |
|---:|---:|---:|---:|---:|---|
| 2167 | 1.0 | 3.0 | 1.0 | 1.0 | https://zozo.jp/ |
| 7043 | 2.0 | 1.0 | 2.0 | 2.0 | https://zozo.jp/brand/ |
| 6492 | 3.0 | 2.0 | 3.0 | 3.0 | https://zozo.jp/shop/ |
| 2612 | 4.0 | 4.0 | 4.0 | 4.0 | https://zozo.jp/category/ |
| 4618 | 5.0 | 39.5 | 14.5 | 16.5 | https://zozo.jp/category/jacket-outerwear/no-collar-jacket/ |
| 801 | 8.0 | 39.5 | 14.5 | 16.5 | https://zozo.jp/category/jacket-outerwear/mods-coat/ |
| 2803 | 8.0 | 39.5 | 14.5 | 16.5 | https://zozo.jp/category/jacket-outerwear/jacket/ |
| 19707 | 8.0 | 39.5 | 14.5 | 16.5 | https://zozo.jp/category/jacket-outerwear/ma1/ |
| 20142 | 8.0 | 39.5 | 14.5 | 16.5 | https://zozo.jp/category/jacket-outerwear/pea-coat/ |
| 27308 | 8.0 | 39.5 | 14.5 | 16.5 | https://zozo.jp/category/jacket-outerwear/riders-jacket/ |
| 15378 | 198.0 | 6.0 | 27029.0 | 5.0 | https://wear.jp/ |
| 3828 | 313.0 | 7.0 | 315.0 | 52.0 | https://zozo.jp/fashionnews/ |
| 23684 | 315.0 | 8.0 | 317.0 | 319.0 | https://zozo.jp/coordinate/ |
| 19611 | 316.0 | 9.0 | 318.0 | 320.0 | https://zozo.jp/welcome/ |
| 21862 | 330.0 | 5.0 | 27030.0 | 6.0 | https://corp.zozo.com/ir/ |
| 11315 | 334.0 | 10.0 | 27033.0 | 315.0 | https://corp.zozo.com/about/profile/ |

It's fun!!!
