Things to keep in mind when developing crawlers in Python

Crawler characteristics

Crawler with state

HTTP is stateless by design, so to carry state across requests you use cookies. When writing a crawler you do not have to implement sending and receiving cookies yourself: use the Session object of the Requests library. The Referer header can also express state.

This is used for implementing login and similar flows.
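As a minimal sketch of the idea, the following uses a Requests Session to keep cookies between requests. The URLs, the cookie name `sessionid`, and the helper `login_and_fetch` are hypothetical and chosen only for illustration; a real login form will need its own field names. The second half builds a request without sending it, just to show that the session attaches the stored cookie.

```python
import requests


def login_and_fetch(session, login_url, credentials, page_url):
    """Log in with a POST, then fetch a page that needs the login cookie.

    Hypothetical helper: the URLs and form fields depend on the target site.
    """
    response = session.post(login_url, data=credentials, timeout=10)
    response.raise_for_status()
    # Cookies set by the login response are kept in session.cookies and are
    # sent automatically with the next request made through the same session.
    page = session.get(page_url, headers={"Referer": login_url}, timeout=10)
    page.raise_for_status()
    return page.text


# Offline illustration: no request is sent here.
session = requests.Session()
# Pretend the server set this cookie during login (hypothetical name and value).
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

prepared = session.prepare_request(
    requests.Request(
        "GET",
        "https://example.com/mypage",
        headers={"Referer": "https://example.com/login"},
    )
)
print(prepared.headers.get("Cookie"))
print(prepared.headers.get("Referer"))
```

Because the cookie jar lives on the session, every request made through it shares the same state, which is what a login flow needs.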

Crawler interpreting JavaScript

For single-page applications (SPAs) and similar sites, the content is not included in the HTML that the server returns. In that case, the crawler has to interpret JavaScript.

Browser automation tools such as the following are available:

- Selenium (a tool for automating browser operation from a program)
- Puppeteer (a Node.js library for automating Google Chrome)
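A rough sketch of fetching JavaScript-rendered HTML with Selenium might look like this. It assumes Selenium 4 and a local Chrome installation; the function name `fetch_rendered_html` and the `--headless=new` flag are choices made for this example, not something the article prescribes.

```python
def fetch_rendered_html(url):
    """Return the page HTML after the browser has executed its JavaScript.

    Assumes Selenium 4 and Chrome are installed (hypothetical helper).
    """
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # page_source reflects the DOM at this moment, including elements
        # that JavaScript has added so far.
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()
```

Content that appears only after later asynchronous requests may not be present yet when `driver.get` returns, so a real crawler would probably add an explicit wait for the element it needs.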

Crawler for an unspecified number of websites

Googlebot is an example of this type.

Crawlers fall into these three types, but whichever type you build, keep the following points in mind.

Be careful when using the collected data

Notes on crawling load

- Number of simultaneous connections
- Crawl interval

Be aware of the load your crawler places on the target server, and keep both of these within reasonable limits.
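One simple way to enforce a crawl interval is a small throttle that sleeps between requests, sketched below. The class name `Throttle` and the one-second interval are assumptions for illustration; the `clock` and `sleep` parameters exist only so the behaviour can be checked without really waiting. Calling it from a single sequential loop also keeps the number of simultaneous connections at one.

```python
import time


class Throttle:
    """Make sure at least `interval_seconds` pass between two requests."""

    def __init__(self, interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self._interval = interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        if self._last is not None:
            remaining = self._interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


# Demonstration with a fake clock so that nothing actually sleeps.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(1.0, clock=lambda: fake_now[0], sleep=fake_sleep)
throttle.wait()  # first request: no wait needed
throttle.wait()  # second request: waits for the full interval
print(slept)
```

In a crawler, `throttle.wait()` would be called just before each request, with the default clock and sleep functions.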

robots.txt

robots.txt and the robots meta tag are widely used by website administrators to tell crawlers not to crawl particular pages.

- robots.txt: a text file located in the top directory of the website.
- robots meta tag: a meta tag in an HTML page that contains instructions to the crawler.

You can read the rules in robots.txt with urllib.robotparser, a module in the Python standard library.
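A short sketch of urllib.robotparser follows. The robots.txt content, the user agent name `example-crawler`, and the URLs are made up for the example, and the rules are parsed from a string so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a real download.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "example-crawler"  # hypothetical crawler name

allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/public/page.html")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/page.html")
delay = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

Against a live site you would instead call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()`, and then check `can_fetch` before every request.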

XML sitemap

An XML sitemap is an XML file in which website administrators list the URLs they want crawlers to crawl.

Crawling with reference to an XML sitemap is efficient because you only need to crawl the pages that need to be crawled.
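Extracting the URLs from a sitemap could be sketched as below, using the standard library XML parser. The sample sitemap and the function name `extract_urls` are invented for illustration; the namespace is the one defined by the sitemaps.org protocol. Sitemap index files, which point to other sitemaps, are not handled here.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap files that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, used here instead of a real download.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2020-01-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
"""


def extract_urls(sitemap_xml):
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


urls = extract_urls(SAMPLE_SITEMAP)
print(urls)
```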

Clarification of contact information

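The User-Agent header of a request can hold an arbitrary string, so put contact information such as a URL or an email address there. That way the site administrator can reach you if the crawler causes a problem.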
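For instance, a request with a descriptive User-Agent can be built as follows. The crawler name, URL, and email address are placeholders, and the request object is only constructed, not sent.

```python
import urllib.request

# Placeholder identity: replace with your own crawler name and contact details.
USER_AGENT = "example-crawler/1.0 (+https://example.com/crawler; contact@example.com)"

request = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": USER_AGENT},
)

# urllib stores header names with only the first letter capitalised.
print(request.get_header("User-agent"))
```

With the Requests library, the same header can be set once for all requests through `session.headers.update({"User-Agent": USER_AGENT})`.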

Status code and error handling

Vary the error handling according to the status code and the kind of error. For example, you can retry when a network error occurs (such as being unable to connect), and give up immediately on errors that a retry will not fix.
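One possible shape for such handling is sketched below. It is not tied to any HTTP library: `fetch` is a hypothetical callable that returns a `(status_code, body)` pair and raises the built-in `ConnectionError` on network failure. The set of status codes treated as temporary, and the exponential wait between attempts, are assumptions made for this example.

```python
import time

# Assumption: these status codes usually indicate a temporary problem.
TEMPORARY_ERROR_CODES = {408, 429, 500, 502, 503, 504}


def fetch_with_retry(fetch, url, max_retries=3, sleep=time.sleep):
    """Call fetch(url), retrying on network errors and temporary statuses.

    `fetch` is a hypothetical callable returning (status_code, body) and
    raising ConnectionError when the server cannot be reached.
    """
    for attempt in range(max_retries + 1):
        try:
            status, body = fetch(url)
        except ConnectionError:
            status, body = None, None  # network error: worth retrying

        if status is not None and status not in TEMPORARY_ERROR_CODES:
            if 200 <= status < 300:
                return body
            # Other statuses (for example 404) will not improve on retry.
            raise RuntimeError(f"permanent error {status} for {url}")

        if attempt < max_retries:
            sleep(2 ** attempt)  # wait 1, 2, 4, ... seconds between attempts

    raise RuntimeError(f"gave up on {url} after {max_retries + 1} attempts")


# Demonstration with a fake fetch: a network error, then 503, then success.
outcomes = [ConnectionError("cannot connect"), (503, ""), (200, "ok")]
waits = []


def fake_fetch(url):
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = fetch_with_retry(fake_fetch, "https://example.com/", sleep=waits.append)
print(result, waits)
```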
