[PYTHON] Scrapy-Redis is recommended for crawling a large number of domains

Scrapy-Redis

https://github.com/rolando/scrapy-redis

pip install scrapy_redis

Then just replace a few basic settings in settings.py.
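As a rough illustration of what that replacement might look like, here is a minimal settings.py sketch. The setting names come from the scrapy-redis README. The Redis URL and the pipeline priority of 300 are assumptions for a local setup, not values from this article.

```python
# settings.py (sketch): route scheduling, de-duplication and item output through Redis.

# Keep the request queue in Redis instead of in process memory.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share the seen-request fingerprints through Redis so that
# several crawler processes do not fetch the same URL twice.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Do not clear the queue when the spider closes, so a stopped crawl can resume.
SCHEDULER_PERSIST = True

# Optional: push scraped items to a Redis list for later processing.
# The priority 300 is an arbitrary choice for this sketch.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Assumption: a Redis server running locally on the default port.
REDIS_URL = "redis://localhost:6379"
```

Each block is independent: you could enable only the scheduler and the dupefilter and keep your existing pipelines.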

Benefits of introducing it

- You can use Redis for the Scrapy scheduler, the start_urls queue, and the pipeline (each is enabled by its own setting).
- It makes integration with external systems and running multiple crawlers easier.
- Because the scheduler queue is persistent, a crawl can be resumed even if the crawler is stopped partway through.
- You can start the same Spider in multiple processes or on multiple servers and crawl in parallel.
- By sending the pipeline output to Redis, later processing can be handled by a separate worker process, for example:
  - Bulk writes to a DB
  - Tokenization
  - Streaming into machine learning
- The Spider's start_urls can also live in Redis, so an external service can push URLs to the start URL queue.
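The last two points can be sketched as two small helpers: one that an external service might call to queue a start URL, and one that a separate worker might call to collect scraped items for a bulk write. This is only a sketch under assumptions: the function names are hypothetical, the key names assume scrapy-redis's default layouts (`<spider name>:start_urls` and `<spider name>:items`), items are assumed to be stored as JSON strings, and `client` is any object with redis-py style `lpush` and `lpop` methods.

```python
import json


def push_start_url(client, spider_name, url):
    """Queue a start URL for a Redis-backed spider (hypothetical helper)."""
    # Assumed key layout: scrapy-redis's default "<spider name>:start_urls".
    key = f"{spider_name}:start_urls"
    client.lpush(key, url)
    return key


def drain_items(client, spider_name, batch_size=100):
    """Pop up to batch_size scraped items, e.g. for one bulk DB write."""
    # Assumed key layout: the Redis pipeline's default "<spider name>:items",
    # with each item serialized as a JSON string.
    key = f"{spider_name}:items"
    batch = []
    for _ in range(batch_size):
        raw = client.lpop(key)
        if raw is None:  # the list is empty
            break
        batch.append(json.loads(raw))
    return batch
```

With redis-py installed, `client` could be created with `redis.Redis.from_url("redis://localhost:6379")`. Passing the client in as an argument keeps the helpers easy to test without a running Redis server.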

Depending on tuning, Scrapy can crawl about 1,000 pages per minute on a single core (and can use 100% of that core's CPU). With Scrapy-Redis, you can run one crawler process per core and crawl roughly (number of cores) × 1,000 pages per minute.
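To make that estimate concrete, here is a small back-of-the-envelope calculation. It assumes the ideal linear scaling described above, which is an upper bound: the network, Redis itself, or the target sites may become the bottleneck first. The function name and the example numbers are hypothetical.

```python
def estimated_minutes(total_pages, cores, pages_per_minute_per_core=1000):
    """Estimate crawl time assuming throughput scales linearly with cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return total_pages / (cores * pages_per_minute_per_core)


# Example: 1,000,000 pages on 4 cores at about 1,000 pages/minute per core.
print(estimated_minutes(1_000_000, 4))  # 250.0 minutes, a little over 4 hours
```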
