Create a tool to check scraping rules (robots.txt) in Python

Introduction

The first thing to keep in mind when you start scraping is to comply with laws and rules, both explicit and implicit ones. Scraping is so convenient that these rules tend to be neglected, especially by beginners. This article is not about scraping rules themselves, so please refer to the following articles for those.

Reference article

About robots.txt

I won't go into the rules themselves, but I will briefly touch on robots.txt, which is the subject of this article. robots.txt is a file containing instructions for crawlers and scraping programs. By convention it is placed directly under the site's root URL, but since this is not an obligation, some sites do not provide one at all. Qiita's, for example, is located here.

Qiita robots.txt


User-agent: *
Disallow: /*/edit$
Disallow: /api/*
Allow:    /api/*/docs$

Let me briefly explain with reference to Qiita's robots.txt above. User-agent indicates the type of crawler the rules target; * means they apply to every crawler. Allow / Disallow then indicate whether crawling of the specified path is permitted or prohibited. In the example above, you can see that crawling https://qiita.com/api/* is prohibited, while https://qiita.com/api/*/docs$ is allowed. Some sites also set Crawl-delay; even when it is not set, it is considered good manners to wait at least 1 second before the next request. If you want to know the robots.txt specification in more detail, please refer to here.
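
Incidentally, the 1-second courtesy mentioned above is easy to respect in code. Below is a minimal sketch (the URLs in the list are placeholders of my own, not from this article) that simply waits one second between consecutive requests with time.sleep.

import time
import urllib.request

# Hypothetical list of pages to fetch (placeholder URLs)
urls = [
    'https://qiita.com/about',
    'https://qiita.com/terms',
]

for url in urls:
    with urllib.request.urlopen(url) as response:
        html = response.read()
    print(f'fetched {url} ({len(html)} bytes)')
    # Wait at least 1 second before the next request, as a courtesy to the server
    time.sleep(1)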

Program creation

Step 1. Read robots.txt

The Python standard library urllib provides urllib.robotparser for reading robots.txt. We will use it to build the program. For details on urllib.robotparser, see here.

import urllib.robotparser

# Read robots.txt
robots_txt_url = 'https://qiita.com/robots.txt'
rp = urllib.robotparser.RobotFileParser()
rp.set_url(robots_txt_url)
rp.read()

# Using the robots.txt information, check whether the User-Agent can crawl the URL in question
user_agent = '*'
url = 'https://qiita.com/api/*'
result = rp.can_fetch(user_agent, url)
print(result)

Execution result


False

The program above first creates a RobotFileParser object, specifies the URL of robots.txt with the set_url method, and then loads it with read. Next, passing the User-Agent and the URL you want to check to the can_fetch method returns a boolean indicating whether access is permitted. As confirmed earlier, crawling https://qiita.com/api/* is not allowed, so False is printed.
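
As a side note, RobotFileParser also provides crawl_delay and request_rate methods (available since Python 3.6), which return the Crawl-delay and Request-rate values for a given User-Agent, or None when the site does not set them. A short sketch, reusing the rp object from above:

# Crawl-delay and Request-rate for the given User-Agent (None when not set)
print(rp.crawl_delay('*'))   # Qiita's robots.txt sets no Crawl-delay, so this prints None
print(rp.request_rate('*'))  # likewise None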

Step 2. Get a link to robots.txt using a regular expression

The basic part of the program was mostly completed in Step 1, but so far it only calls library functions and is not very useful as a tool on its own. So next, let's generate the link to robots.txt automatically using a regular expression. Since that alone may be hard to picture, here is a concrete example: if the URL you want to check is https://qiita.com/api/*, we generate the link https://qiita.com/robots.txt from it. As mentioned earlier, robots.txt is conventionally placed directly under the site root, so if we can extract the https://qiita.com part from https://qiita.com/api/*, we can build the link simply by appending /robots.txt to it. For Python's regular expression module re, refer to here.

import re

# Extract the site's root URL with a regular expression
def get_root_url(url):
    pattern = r'(?P<root>https?://.*?)\/.*'
    result = re.match(pattern, url)
    if result is not None:
        return result.group('root')

# Generate the robots.txt URL from the root URL
def get_robots_txt_path(root_url):
    return root_url + '/robots.txt'

url = 'https://qiita.com/api/*'
root_url = get_root_url(url)
robots_txt_url = get_robots_txt_path(root_url)
print(f'ROOT URL -> {root_url}')
print(f'ROBOTS.TXT URL -> {robots_txt_url}')

Execution result


ROOT URL -> https://qiita.com
ROBOTS.TXT URL -> https://qiita.com/robots.txt
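
As a design note, the same root URL can also be obtained without a regular expression by using urllib.parse from the standard library. The following sketch is an alternative approach, not the method used in this article:

from urllib.parse import urlparse

def get_robots_txt_url(url):
    # Build '<scheme>://<netloc>/robots.txt' from any URL on the site
    parsed = urlparse(url)
    return f'{parsed.scheme}://{parsed.netloc}/robots.txt'

print(get_robots_txt_url('https://qiita.com/api/*'))  # https://qiita.com/robots.txt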

Step 3. Add functions and summarize

Building on Steps 1 and 2, we use the functions of urllib.robotparser to add features such as getting Crawl-delay, and organize everything into a class. I could put the code here, but it is a bit long, so I have put it on GitHub instead. It is only about 60 lines, so please have a look if you want to check the contents.
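
To give a rough idea of what such a class might look like, here is a minimal sketch that combines Steps 1 and 2 and adds a Crawl-delay lookup. The class and method names are my own for illustration and are not necessarily those used in the GitHub repository.

import re
import urllib.robotparser

class RobotsTxtChecker:
    """Check whether a URL may be crawled, based on the site's robots.txt."""

    def __init__(self, url, user_agent='*'):
        self.user_agent = user_agent
        root_url = self._get_root_url(url)
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.set_url(root_url + '/robots.txt')
        self.parser.read()

    @staticmethod
    def _get_root_url(url):
        # Extract 'https://example.com' from the given URL (Step 2)
        result = re.match(r'(?P<root>https?://.*?)/.*', url)
        return result.group('root') if result else url

    def can_fetch(self, url):
        # True if the User-Agent is allowed to crawl the URL (Step 1)
        return self.parser.can_fetch(self.user_agent, url)

    def crawl_delay(self):
        # Crawl-delay for the User-Agent, or None if the site does not set one
        return self.parser.crawl_delay(self.user_agent)

checker = RobotsTxtChecker('https://qiita.com/api/*')
print(checker.can_fetch('https://qiita.com/api/*'))  # False
print(checker.crawl_delay())                         # None (Qiita sets no Crawl-delay)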
