[PYTHON] Test code to check for broken links in the page

It takes time to check the broken link check by hand because the accuracy is uncertain. However, external tools are heavy to execute and cannot be done in the development environment. So I made it myself. It supports relative links and absolute links.

Operation flow of the broken link check tool

  1. HTTP GET the specified URL and parse it with Beautiful Soup.
  2. Classify links into external site links, relative links, and absolute links
  3. Extract the destination URL from the links of the same domain on the page
  4. Eliminate duplication Make an HTTP request to the link destination URL generated in 5.4 and confirm that the HTTP status is 200.

Python 3.5 async / await version

It runs 60% faster than the Python2 version described below with non-blocking HTTP requests. Python2 version is at the bottom of the page. With this code, 100 links will be confirmed in 1-3 seconds.

tests_url.py


# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals

import random
from urllib.parse import urlparse

from bs4 import BeautifulSoup
import requests
import asyncio
import aiohttp
from module.site.site import Site


def tests_urls():
    urls = [
        "http://www.disney.co.jp/home.html",
        "http://starwars.disney.co.jp/home.html"
    ]

    for test_url in urls:
        parse_and_request(test_url)


def parse_and_request(url):
    """
Download the url and parse the bs4
Check the status of all links
    """
    #parse url
    o = urlparse(url)
    host = o.netloc

    #GET and parse the specified URL
    response = requests.get(url, timeout=2)
    assert response.status_code == 200
    soup = BeautifulSoup(response.text, "lxml")
    test_urls = []
    for a in soup.find_all("a"):
        href = a.get("href")
        if href[0] == '#':
            pass
        elif href[0] == '/':
            #Relative link
            test_url = 'http://{}{}'.format(host, href)
            test_urls.append(test_url)
        elif host in href:
            #Absolute link and same domain
            test_urls.append(href)
        else:
            #Do not test external site links
            print('IGNORE:{}'.format(href))

    #Deduplication
    test_urls = list(set(test_urls))
    for test_url in test_urls:
        print(test_url)

    #Check if the link is alive by running asynchronously
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([check_url(url) for url in test_urls]))


async def check_url(url):
    """
Check the URL asynchronously and check that HTTP STATUS responds 200
    :param url: str
    """
    response = await aiohttp.request('GET', url)
    status_code = response.status
    assert status_code == 200, '{}:{}'.format(str(status_code), url)
    response.close()

Execution method


>>>py.test tests_url.py

Python 2 series version

tests_url.py


# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals

from urllib.parse import urlparse

from bs4 import BeautifulSoup
import requests


def tests_urls():
    urls = [
        "http://www.disney.co.jp/home.html",
        "http://starwars.disney.co.jp/home.html"
    ]

    for test_url in urls:
        parse_and_request(test_url)


def parse_and_request(url):
    """
Download the url and parse the bs4
Check the status of all links
    """
    #parse url
    o = urlparse(url)
    host = o.netloc

    #GET and parse the specified URL
    response = requests.get(url, timeout=2)
    assert response.status_code == 200
    soup = BeautifulSoup(response.text, "lxml")
    test_urls = []
    for a in soup.find_all("a"):
        href = a.get("href")
        if href[0] == '#':
            pass
        elif href[0] == '/':
            #Relative link
            test_url = 'http://{}{}'.format(host, href)
            test_urls.append(test_url)
        elif host in href:
            #Absolute link and same domain
            test_urls.append(href)
        else:
            #Do not test external site links
            print('IGNORE:{}'.format(href))

    #Deduplication
    test_urls = list(set(test_urls))
    for test_url in test_urls:
        print(test_url)

        #Check if the link is alive
        response = requests.get(test_url, timeout=2)
        assert response.status_code == 200

Recommended Posts

Test code to check for broken links in the page
I want to write in Python! (1) Code format check
Check for the existence of BigQuery tables in Java
[For beginners] Web scraping with Python "Access the URL in the page to get the contents"
I tried porting the code written for TensorFlow to Theano
[python] How to check if the Key exists in the dictionary
Check the operation of Python for .NET in each environment
For the G test 2020 # 2 exam
Tool to check code style
Test code for evaluating decorators
I tried to summarize the code often used in Pandas
Let's statically check and format the code of E2E automatic test written in Python [VS Code]
To write a test in Go, first design the interface
I can't log in to the admin page with Django3
Check the code with flake8
Reference reference for those who want to code in Rhinoceros / Grasshopper
Check the Check button in Tkinter to allow Entry to be edited
I made a function to check if the webhook is received in Lambda for the time being
How to set the output resolution for each keyframe in Blender
[For beginners] How to implement O'reilly sample code in Google Colab
How to implement Java code in the background of RedHat (LinuxONE)
How to check the memory size of a variable in Python
Explain in detail the magical code for IQ Bot table items
I wrote the code to write the code of Brainf * ck in python
[Introduction to Python] How to use the in operator in a for statement?
How to check the memory size of a dictionary in Python
[TensorFlow 2] How to check the contents of Tensor in graph mode
LINEbot development, I want to check the operation in the local environment
Check the memory protection of linux kerne with the code for ARM
I want to pass the G test in one month Day 1
Get the IPv4 address assigned to the network interface in code (Linux)
Programming to fight in the world ~ 5-1
Programming to fight in the world ~ 5-5,5-6
Check for memory leaks in Python
Programming to fight in the world 5-3
Check for external commands in python
Programming to fight in the world-Chapter 4
In the python command python points to python3.8
Write selenium test code in python
To maintain code quality in PyCharm
Check the data summary in CASTable
Cython to try in the shortest
How to run TensorFlow 1.0 code in 2.0
Programming to fight in the world ~ 5-2
How to automatically check if the code you wrote in Google Colaboratory corresponds to the python coding standard "pep8"
How to study for the Deep Learning Association G test (for beginners) [2020 version]
Approach commentary for beginners to be in the top 1.5% (0.83732) of Kaggle Titanic_3
Add syntax highlighting for the Kv language to Spyder in the Python IDE
How to check local GAE from iPhone browser in the same LAN
Click the Selenium links in order to get the elements of individual pages
Sample code to get the Twitter API oauth_token and oauth_token_secret in Python 2.7
Approach commentary for beginners to be in the top 1.5% (0.83732) of Kaggle Titanic_1
For the first time in Numpy, I will update it from time to time
Switch the module to be loaded for each execution environment in Python
I made a program to check the size of a file in Python
Tips for Python beginners to use the Scikit-image example for themselves 6 Improve Python code
Approach commentary for beginners to be in the top 1.5% (0.83732) of Kaggle Titanic_2
View the implementation source code in iPython
Check the behavior of destructor in Python
How to check the version of Django
Write the test in a python docstring