The link structure of the web is a large-scale network that is easy to experiment with. By repeatedly retrieving linked URLs from HTML with `urllib` and `BeautifulSoup`, we build an adjacency matrix of web pages.
Crawling can take more than 12 hours, so try it when you have time to spare.
The source code is available on the author's GitHub.
Analysis with NetworkX is described in the follow-up article, Network analysis of the link structure of the web (2).
import.py
from urllib.request import urlopen
from bs4 import BeautifulSoup
import networkx as nx
from tqdm import tqdm_notebook as tqdm
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = 500
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import re
urllib.request
A library for retrieving data from websites.
(There does not seem to be a good reference site ...)
BeautifulSoup
A module that parses HTML files based on tag information.
Reference: Qiita: Beautiful Soup in 10 minutes
networkx
A module for network analysis. It will be explained in the next article.
tqdm
Displays a progress bar for `for` loops.
When using Jupyter Notebook, note that `tqdm_notebook` is imported instead of plain `tqdm`.
Reference: Progress bar on Jupyter Notebook
pd.options.display.max_colwidth = 500
Increases the maximum display width of each pandas column so that very long URLs are not truncated.
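As a minimal illustration of how `BeautifulSoup` extracts links (the same pattern the crawler below uses), the following sketch parses a small invented HTML string instead of a live page, so it runs offline:

```python
from bs4 import BeautifulSoup

# A small invented HTML fragment standing in for a downloaded page.
html = '<html><body><a href="/about/">About</a><a href="https://example.com/">Ext</a></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Collect the href attribute of every <a> tag, as the crawler does.
links = [a.get("href") for a in soup.find_all("a")]
print(links)  # → ['/about/', 'https://example.com/']
```

In the real crawler, `html` comes from `urlopen(url).read().decode(...)` rather than a string literal.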
url_prepare.py
start_url = "https://zozo.jp/"
# the page to begin with
explore_num = 2
# how many times do you explore new links
url_list = [start_url]
# list of the URL of all the pages. The components will be added.
link_list=[]
# list of lists [out_node, in_node]. The components will be added.
#prepare a file name to save figures and csv files
fname = re.split('[/.]', start_url)
if fname[2] == "www":
    fname = fname[3]
else:
    fname = fname[2]
start_url
The page from which to start following links.
explore_num
The number of times to follow links (i.e., the maximum shortest-path distance from the start page that will be explored).
url_list
A list that stores the URLs of all visited pages. Its order corresponds to the indices of the adjacency matrix built later.
link_list
A list that stores all linked URL pairs. Each element is a pair [URL the link goes out from, URL the link points to], and corresponds to an entry of the adjacency matrix built later.
fname
The file name used later when saving seaborn figures and pandas tables.
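For concreteness, here is what the `re.split`-based file-name derivation yields for two start URLs (the second URL is an invented example to show the `www` branch; `derive_fname` is just a helper wrapping the snippet above):

```python
import re

def derive_fname(start_url):
    # Split on "/" and ".", then pick the domain label, skipping a leading "www".
    parts = re.split('[/.]', start_url)
    return parts[3] if parts[2] == "www" else parts[2]

print(derive_fname("https://zozo.jp/"))          # → zozo
print(derive_fname("https://www.example.com/"))  # → example
```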
The following functions actually follow the links.
link_explore searches all links; it takes the list of URLs to explore as an argument.
link_cruise follows links only to pages that are already known; it takes the adjacency matrix as an argument.
link_explore.py
def link_explore(link_list, url_list, now_url_list):
    # link_list: list of lists [out_node, in_node]
    # url_list: list of the URLs of all the pages
    # now_url_list: list of the URLs to explore in this call
    print(f"starting exploring {len(now_url_list)} pages")
    next_url_list = []
    for url in now_url_list:
        try:
            with urlopen(url, timeout=10) as res:
                html = res.read().decode('utf-8', 'ignore')
                soup = BeautifulSoup(html, "html.parser")
        except Exception:
            print("x", end="")
            continue
        else:
            for a in soup.find_all("a"):
                link = a.get("href")
                if link is not None and len(link) > 0:
                    if link[0] == "/":
                        link = url + link[1:]
                    if link[0:4] == "http":
                        if link[-1] == "/":
                            next_url_list.append(link)
                            link_list.append([url, link])
            print("o", end="")
    next_url_list = list(set(next_url_list))
    url_list += next_url_list
    url_list = list(set(url_list))
    return link_list, url_list, next_url_list
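The filtering inside the loop keeps only absolute links that end with a slash, and rewrites root-relative links by appending them to the current URL (which is assumed to end with "/"). A standalone sketch of that filter, using invented hrefs:

```python
def filter_links(url, hrefs):
    # Reproduces the href handling in link_explore: root-relative links are
    # joined to the current URL, then only http(s) links ending in "/" are kept.
    kept = []
    for link in hrefs:
        if link is not None and len(link) > 0:
            if link[0] == "/":
                link = url + link[1:]
            if link[0:4] == "http" and link[-1] == "/":
                kept.append(link)
    return kept

hrefs = ["/about/", "https://example.com/", "mailto:info@example.com",
         "https://example.com/page", None]
print(filter_links("https://zozo.jp/", hrefs))
# → ['https://zozo.jp/about/', 'https://example.com/']
```

Note that links not ending in "/" (such as ".../page" above) are discarded, which keeps the crawl focused on directory-style URLs.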
link_cruise.py
def link_cruise(adj, url_list, now_url_list):
    # adj: adjacency matrix
    # url_list: list of the URLs of all the pages
    # now_url_list: list of the URLs to explore in this call
    next_url_list = []
    for url in tqdm(now_url_list):
        try:
            with urlopen(url, timeout=10) as res:
                html = res.read().decode('utf-8', 'ignore')
                soup = BeautifulSoup(html, "html.parser")
        except Exception:
            continue
        else:
            for a in soup.find_all("a"):
                link = a.get("href")
                if link is not None and len(link) > 0:
                    if link[0] == "/":
                        link = url + link[1:]
                    if link[0:4] == "http":
                        if link[-1] == "/":
                            if link in url_list:
                                if adj[url_list.index(url), url_list.index(link)] == 0:
                                    next_url_list.append(link)
                                    adj[url_list.index(url), url_list.index(link)] = 1
    next_url_list = list(set(next_url_list))
    return adj, next_url_list
Follow the links the number of times given by `explore_num`. An `o` is printed each time a page's HTML is decoded successfully, and an `x` each time it fails.
explore_exe.py
next_url_list = url_list
for i in range(explore_num):
    print(f"\nNo.{i+1} starting")
    link_list, url_list, next_url_list = link_explore(link_list, url_list, next_url_list)
    print(f"\nNo.{i+1} completed\n")
Create an adjacency matrix.
make_adj.py
adj = np.zeros((len(url_list), len(url_list)))
for link in tqdm(link_list):
    try:
        adj[url_list.index(link[0]), url_list.index(link[1])] = 1
    except ValueError:
        # links to pages not in url_list are skipped
        pass
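On a tiny invented link list, the adjacency-matrix construction works like this (the URLs are placeholders, not real crawled pages):

```python
import numpy as np

url_list = ["https://a.example/", "https://b.example/", "https://c.example/"]
link_list = [["https://a.example/", "https://b.example/"],
             ["https://b.example/", "https://c.example/"],
             ["https://a.example/", "https://missing.example/"]]  # target not in url_list

adj = np.zeros((len(url_list), len(url_list)))
for link in link_list:
    try:
        # Row = page the link leaves from, column = page it points to.
        adj[url_list.index(link[0]), url_list.index(link[1])] = 1
    except ValueError:
        # Links to pages outside url_list are simply skipped.
        pass

print(adj)
# → [[0. 1. 0.]
#    [0. 0. 1.]
#    [0. 0. 0.]]
```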
After the `explore_num` rounds, the search is restricted to pages that have already been discovered. The search is repeated until all known pages have been visited.
cruise_exe.py
while len(next_url_list) > 0:
    adj, next_url_list = link_cruise(adj, url_list, next_url_list)
- Network analysis of the link structure of the web (2)