The link structure of the web is a large-scale network that is easy to experiment with. By repeatedly retrieving linked URLs from HTML with `urllib` and `BeautifulSoup`, we build an adjacency matrix of web pages.
Crawling can take more than 12 hours, so try it when you have time to spare.
The source code is available on the author's GitHub.
Analysis with NetworkX is described in the follow-up article, Network analysis of the link structure of the web (2).
import.py
from urllib.request import urlopen
from bs4 import BeautifulSoup
import networkx as nx
from tqdm import tqdm_notebook as tqdm
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = 500
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import re
urllib.request
A library for retrieving data from websites.
(There does not seem to be a good reference site ...)
BeautifulSoup
A module that parses HTML files based on tag information.
Reference: Qiita: Beautiful Soup in 10 minutes
networkx
A module for network analysis. It will be explained in the next article.
tqdm
Displays a progress bar for `for` loops.
When using Jupyter Notebook, note that `tqdm_notebook` is imported instead of plain `tqdm`.
Reference: Progress bar on Jupyter Notebook
pd.options.display.max_colwidth = 500
Increases the maximum display width of each pandas column so that very long URLs are not truncated.
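As a minimal illustration of how `BeautifulSoup` extracts links (the same pattern the crawler below uses), the following sketch parses a small invented HTML string instead of a live page, so it runs offline:

```python
from bs4 import BeautifulSoup

# A small invented HTML fragment standing in for a downloaded page.
html = '<html><body><a href="/about/">About</a><a href="https://example.com/">Ext</a></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Collect the href attribute of every <a> tag, as the crawler does.
links = [a.get("href") for a in soup.find_all("a")]
print(links)  # → ['/about/', 'https://example.com/']
```

In the real crawler, `html` comes from `urlopen(url).read().decode(...)` rather than a string literal.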
url_prepare.py
start_url = "https://zozo.jp/"
# the page to begin with
explore_num = 2
# how many times do you explore new links
url_list = [start_url]
# list of the URL of all the pages. The components will be added.
link_list=[]
# list of lists [out_node, in_node]. The components will be added.
#prepare a file name to save figures and csv files
fname = re.split('[/.]', start_url)
if fname[2] == "www":
    fname = fname[3]
else:
    fname = fname[2]
start_url
The page from which to start following links.
explore_num
The number of times to follow links (i.e., the maximum shortest-path distance from the start page that will be explored).
url_list
A list that stores the URLs of all visited pages. Its order corresponds to the indices of the adjacency matrix built later.
link_list
A list that stores all linked URL pairs. Each element is a pair [URL the link goes out from, URL the link points to], and corresponds to an entry of the adjacency matrix built later.
fname
The file name used later when saving seaborn figures and pandas tables.
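For concreteness, here is what the `re.split`-based file-name derivation yields for two start URLs (the second URL is an invented example to show the `www` branch; `derive_fname` is just a helper wrapping the snippet above):

```python
import re

def derive_fname(start_url):
    # Split on "/" and ".", then pick the domain label, skipping a leading "www".
    parts = re.split('[/.]', start_url)
    return parts[3] if parts[2] == "www" else parts[2]

print(derive_fname("https://zozo.jp/"))          # → zozo
print(derive_fname("https://www.example.com/"))  # → example
```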
The following functions actually follow the links.
link_explore searches all links; it takes the list of URLs to explore as an argument.
link_cruise follows links only to pages that are already known; it takes the adjacency matrix as an argument.
link_explore.py
def link_explore(link_list, url_list, now_url_list):
    # link_list: list of lists [out_node, in_node]
    # url_list: list of the URLs of all the pages
    # now_url_list: list of the URLs to explore in this call
    print(f"starting exploring {len(now_url_list)} pages")
    next_url_list = []
    for url in now_url_list:
        try:
            with urlopen(url, timeout=10) as res:
                html = res.read().decode('utf-8', 'ignore')
                soup = BeautifulSoup(html, "html.parser")
        except Exception:
            print("x", end="")
            continue
        else:
            for a in soup.find_all("a"):
                link = a.get("href")
                if link is not None and len(link) > 0:
                    if link[0] == "/":
                        link = url + link[1:]
                    if link[0:4] == "http":
                        if link[-1] == "/":
                            next_url_list.append(link)
                            link_list.append([url, link])
            print("o", end="")
    next_url_list = list(set(next_url_list))
    url_list += next_url_list
    url_list = list(set(url_list))
    return link_list, url_list, next_url_list
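The filtering inside the loop keeps only absolute links that end with a slash, and rewrites root-relative links by appending them to the current URL (which is assumed to end with "/"). A standalone sketch of that filter, using invented hrefs:

```python
def filter_links(url, hrefs):
    # Reproduces the href handling in link_explore: root-relative links are
    # joined to the current URL, then only http(s) links ending in "/" are kept.
    kept = []
    for link in hrefs:
        if link is not None and len(link) > 0:
            if link[0] == "/":
                link = url + link[1:]
            if link[0:4] == "http" and link[-1] == "/":
                kept.append(link)
    return kept

hrefs = ["/about/", "https://example.com/", "mailto:info@example.com",
         "https://example.com/page", None]
print(filter_links("https://zozo.jp/", hrefs))
# → ['https://zozo.jp/about/', 'https://example.com/']
```

Note that links not ending in "/" (such as ".../page" above) are discarded, which keeps the crawl focused on directory-style URLs.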
link_cruise.py
def link_cruise(adj, url_list, now_url_list):
    # adj: adjacency matrix
    # url_list: list of the URLs of all the pages
    # now_url_list: list of the URLs to explore in this call
    next_url_list = []
    for url in tqdm(now_url_list):
        try:
            with urlopen(url, timeout=10) as res:
                html = res.read().decode('utf-8', 'ignore')
                soup = BeautifulSoup(html, "html.parser")
        except Exception:
            continue
        else:
            for a in soup.find_all("a"):
                link = a.get("href")
                if link is not None and len(link) > 0:
                    if link[0] == "/":
                        link = url + link[1:]
                    if link[0:4] == "http":
                        if link[-1] == "/":
                            if link in url_list:
                                if adj[url_list.index(url), url_list.index(link)] == 0:
                                    next_url_list.append(link)
                                    adj[url_list.index(url), url_list.index(link)] = 1
    next_url_list = list(set(next_url_list))
    return adj, next_url_list
Follow the links the number of times given by `explore_num`. An `o` is printed each time a page's HTML is decoded successfully, and an `x` each time it fails.
explore_exe.py
next_url_list = url_list
for i in range(explore_num):
    print(f"\nNo.{i+1} starting")
    link_list, url_list, next_url_list = link_explore(link_list, url_list, next_url_list)
    print(f"\nNo.{i+1} completed\n")
Create an adjacency matrix.
make_adj.py
adj = np.zeros((len(url_list), len(url_list)))
for link in tqdm(link_list):
    try:
        adj[url_list.index(link[0]), url_list.index(link[1])] = 1
    except ValueError:
        # links to pages not in url_list are skipped
        pass
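On a tiny invented link list, the adjacency-matrix construction works like this (the URLs are placeholders, not real crawled pages):

```python
import numpy as np

url_list = ["https://a.example/", "https://b.example/", "https://c.example/"]
link_list = [["https://a.example/", "https://b.example/"],
             ["https://b.example/", "https://c.example/"],
             ["https://a.example/", "https://missing.example/"]]  # target not in url_list

adj = np.zeros((len(url_list), len(url_list)))
for link in link_list:
    try:
        # Row = page the link leaves from, column = page it points to.
        adj[url_list.index(link[0]), url_list.index(link[1])] = 1
    except ValueError:
        # Links to pages outside url_list are simply skipped.
        pass

print(adj)
# → [[0. 1. 0.]
#    [0. 0. 1.]
#    [0. 0. 0.]]
```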
After the `explore_num` rounds, the search is restricted to pages that have already been discovered. The search is repeated until all known pages have been visited.
cruise_exe.py
while len(next_url_list) > 0:
    adj, next_url_list = link_cruise(adj, url_list, next_url_list)
- Network analysis of the link structure of the web (2)