Python Crawling & Scraping Chapter 4 Summary

Introduction

This is a learning summary of "Python Crawling & Scraping [Enhanced Revised Edition]: A Practical Development Guide for Data Collection and Analysis". Chapter 4 is titled "Methods for Practical Use" and focuses on points to keep in mind when building crawlers.

4.1 Crawler characteristics

4.1.1 Crawler with state

- If you want to crawl a site that requires login, create a crawler that supports cookies.
- Python's Requests library sends cookies to the server automatically when you use a Session object (see the sketch below).
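
As a minimal sketch of a login-aware crawler using a Requests Session (the login URL and form field names are hypothetical and would need to match the real site):

```python
import requests

# A Session keeps cookies received from the server and sends them
# back automatically on subsequent requests.
session = requests.Session()

# Hypothetical login endpoint and form field names.
login_url = 'https://example.com/login'
session.post(login_url, data={'username': 'user', 'password': 'pass'})

# Because the session now holds the login cookie, this page is fetched
# as a logged-in user.
response = session.get('https://example.com/mypage')
print(response.status_code)
```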

4.1.2 Crawler interpreting JavaScript

To crawl sites built as SPAs (Single Page Applications), the crawler needs to interpret JavaScript. To do this, use tools such as Selenium or Puppeteer to drive a browser automatically. Browsers such as Chrome and Firefox also have a headless mode that runs without a GUI, which is useful when building crawlers.
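
A sketch of fetching a JavaScript-rendered page with Selenium driving headless Chrome (this assumes Selenium 4 with Chrome installed; the URL is only an example):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a GUI (headless mode).
options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
try:
    # The page is fetched and its JavaScript executed by a real browser,
    # so page_source contains the rendered HTML.
    driver.get('https://example.com/')
    print(driver.page_source[:200])
finally:
    driver.quit()
```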

4.1.3 Crawler for an unspecified number of websites

This is a crawler like Googlebot. It is harder to build than a crawler targeting a specific site, because it needs a mechanism that does not depend on any particular page structure.

4.2 Precautions regarding the use of collected data

4.2.1 Copyright

Rights to be aware of when creating crawlers → reproduction rights, adaptation rights, and public transmission rights. With the 2009 revision of the Copyright Law, copying for the purpose of information analysis, as well as copying, adaptation, and automatic public transmission for the purpose of providing a search engine service, can be done without the copyright holder's permission.

4.2.2 Terms of use and personal information

This section is about complying with each site's terms of use. Personal information must be handled in accordance with the Personal Information Protection Law.

4.3 Precautions regarding the load at the crawl destination

This section is about how to avoid putting load on the crawl target, so that something like the [Okazaki Municipal Central Library incident](https://ja.wikipedia.org/wiki/岡崎市立中央図書館事件) does not happen.

4.3.1 Number of simultaneous connections and crawl interval

- Number of simultaneous connections
  - Recent browsers allow up to 6 simultaneous connections per host, but a crawler fetches many pages over a long period, so it should use fewer.
- Crawl interval
  - It is customary to leave an interval of 1 second or more between requests. Example: the crawler operated by the National Diet Library. (A small sketch follows below.)
  - If there is a way to get the information other than HTML, such as RSS or XML, use that instead.
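
A minimal sketch of keeping at least a 1-second interval between requests (the URL list is hypothetical):

```python
import time
import requests

urls = [
    'https://example.com/page1',
    'https://example.com/page2',
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Wait at least 1 second before the next request so we do not
    # put unnecessary load on the server.
    time.sleep(1)
```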

4.3.2 Instructions to crawlers by robots.txt

For netkeiba, which I am always scraping, there do not seem to be any particular instructions in robots.txt or in the robots meta tag.
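
Whatever the site, a crawler can check robots.txt with the standard library's urllib.robotparser before fetching. A sketch (the site URL, path, and user agent name are made-up examples):

```python
import urllib.robotparser

# Read the site's robots.txt and check whether our user agent
# is allowed to fetch a given URL.
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

print(rp.can_fetch('MyCrawler', 'https://example.com/some/page'))
```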

4.3.3 XML sitemap

An XML sitemap is an XML file that tells crawlers which URLs you want them to crawl. It is more efficient than having the crawler discover pages by following links. Its location is declared with the Sitemap directive in robots.txt.
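
A sketch of reading the URLs listed in an XML sitemap (this assumes the sitemap uses the standard sitemaps.org namespace; the sitemap URL is hypothetical):

```python
import requests
from xml.etree import ElementTree

# Namespace defined by the sitemaps.org protocol.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

response = requests.get('https://example.com/sitemap.xml')
root = ElementTree.fromstring(response.content)

# Each <url> element has a <loc> child containing a crawlable URL.
for loc in root.findall('sm:url/sm:loc', NS):
    print(loc.text)
```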

4.3.4 Clarification of contact information

Contact information such as an email address or URL can be included in the User-Agent header of the requests sent by the crawler.
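
A minimal sketch of setting such a User-Agent with Requests (the crawler name, contact URL, and email address are made up):

```python
import requests

# Identify the crawler and give the site operator a way to contact us.
headers = {
    'User-Agent': 'MyCrawler/1.0 (+https://example.com/crawler-info; crawler@example.com)',
}

response = requests.get('https://example.com/', headers=headers)
print(response.request.headers['User-Agent'])
```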

4.3.5 Status Code and Error Handling

Error handling is important so that you do not put extra load on the crawl target. When retrying after an error, take measures such as increasing the retry interval exponentially (exponential backoff). Error handling written by hand tends to be boilerplate-heavy, but it can be written concisely using a library called tenacity.
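
A sketch of exponential-backoff retries with tenacity (the retry count, wait bounds, and timeout are arbitrary choices for illustration):

```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential


# Retry up to 3 times, with an exponentially growing wait between
# attempts (capped at 10 seconds), so failures do not hammer the server.
@retry(stop=stop_after_attempt(3),
       wait=wait_exponential(multiplier=1, max=10))
def fetch(url: str) -> requests.Response:
    response = requests.get(url, timeout=30)
    # Treat HTTP error status codes (4xx/5xx) as failures to be retried.
    response.raise_for_status()
    return response


print(fetch('https://example.com/').status_code)
```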

4.4 Designed for repeated execution

4.4.1 Get only updated data

- HTTP cache policy
  - The HTTP server can specify its cache policy in detail by adding cache-related headers to the response.
  - These headers can be divided into two types: "strong cache" and "weak cache".
  - Strong cache → Cache-Control (detailed directives such as whether to cache) and Expires (content expiration date). The client does not send a request while the cache is still valid; it uses the cached response until it expires.
  - Weak cache → Last-Modified (last modified date) and ETag (identifier). The client sends a request every time, but if the content has not been updated, it reuses the cached response.
- In Python, a library called CacheControl handles cache-related processing concisely: `pip install "CacheControl[filecache]"` (see the sketch below).
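
A sketch of wrapping a Requests session with CacheControl so that these headers are honored automatically (the cache directory name and URL are arbitrary):

```python
import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache

# Wrap the session so responses are cached on disk and cache-related
# headers (Cache-Control, Expires, Last-Modified, ETag) are handled
# automatically.
session = CacheControl(requests.Session(), cache=FileCache('.webcache'))

# The first call hits the server; later calls may be served from the
# cache or use conditional requests, depending on the response headers.
response = session.get('https://example.com/')
print(response.status_code)
```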

4.4.2 Detect changes in the crawl destination

- Validate with regular expressions
- Validate with JSON Schema
  - In Python, you can write validation rules in a JSON-based format called JSON Schema by using a library called jsonschema: `pip install jsonschema` (see the sketch below).
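
A sketch of validating scraped data with jsonschema so that a change in the page structure surfaces as a validation error (the field names and rules are made-up examples):

```python
from jsonschema import validate, ValidationError

# Made-up rules for one scraped item: a non-empty name and a price
# that is a non-negative integer.
schema = {
    'type': 'object',
    'properties': {
        'name': {'type': 'string', 'minLength': 1},
        'price': {'type': 'integer', 'minimum': 0},
    },
    'required': ['name', 'price'],
}

item = {'name': 'example item', 'price': 1000}

try:
    validate(instance=item, schema=schema)
except ValidationError as e:
    # A validation error suggests the page structure may have changed.
    print('Unexpected data:', e.message)
```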

If a change is detected in this way, notify the operator by e-mail and stop the crawler.

4.5 Summary

(Omitted.)

Conclusion

I haven't been very motivated and the gap between posts got long, but for now, consider this article proof that I'm still alive (?).
