[PYTHON] Upload artifacts scraped on Scrapy Cloud to S3

Introduction

Previously, as a learning exercise, I built a zip code search API with AWS Lambda + API Gateway. The zip code data was scraped with Scrapy and uploaded to S3 for the API to use. This time, I will write about the points I got stuck on when deploying that Scrapy project to Scrapy Cloud and running it on a regular schedule.

For Scrapy Cloud, I referred to the following articles.

- Scrapy + Scrapy Cloud for a comfortable Python crawl + scraping life - Gunosy data analysis blog
- Scraping with Python - Introduction to Scrapy, 2nd step - Qiita

Overall workflow

The figure below shows the overall workflow.

(Figure: jp-zip_scrapy構成図.png)

Deploying to Scrapinghub

From creating a Scrapy project to deploying it to Scrapinghub, the flow is as follows (a rough command-line sketch is shown after the list).

- Create a Scrapy project
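For reference, a minimal sketch of the commands, assuming shub (the Scrapinghub command-line client) and placeholder names for the project:

# Create the Scrapy project and implement the spider
scrapy startproject jp_zip   # "jp_zip" is a placeholder project name

# Install the Scrapinghub client and deploy
pip install shub
shub login                   # prompts for your Scrapinghub API key
shub deploy 12345            # "12345" is a placeholder project ID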

What tripped me up when running a job on Scrapinghub

Everything worked fine up until the deploy to Scrapinghub, but a few things tripped me up when I actually ran the job there.

The pre-installed boto is the 2.x series

boto is used for the AWS operations, but be careful: the boto that comes pre-installed on Scrapy Cloud is boto 2, not boto3.
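For reference, a minimal upload with the pre-installed boto 2 might look roughly like this; the credentials, bucket name, object key, and file name are all placeholders:

import boto
from boto.s3.key import Key

AWS_ACCESS_KEY_ID = 'xxxxxx'       # placeholder credentials
AWS_SECRET_ACCESS_KEY = 'xxxxxx'

# Connect to S3 with the boto 2 API and upload a local file
conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket('my-bucket')           # placeholder bucket name
key = Key(bucket)
key.key = 'jp-zip/latest.json'                  # placeholder object key
key.set_contents_from_filename('latest.json')   # placeholder local file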

Additional installation of boto3 (updated 2016-12-19)

If you specify requirements_file in scrapinghub.yml, the libraries listed there are installed at deploy time, so boto3 can also be used.
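A minimal sketch of the two files, assuming a placeholder project ID:

scrapinghub.yml

projects:
  default: 12345  # placeholder Scrapy Cloud project ID
requirements_file: requirements.txt

requirements.txt

boto3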

If requirements_file is processed successfully at deploy time, you can check the additionally installed libraries under requirements in Code & Deploys.

(Figure: requirements.png)

What to do with AWS credentials

For the AWS credentials, go to Spider Settings -> Spider settings and register the setting values as shown below.

(Figure: scrapy_settings.png)

Access them from the code as follows.

import boto
from scrapy.conf import settings

# Read the credentials registered in Spider settings and connect to S3
s3 = boto.connect_s3(settings['AWS_ACCESS_KEY_ID'], settings['AWS_SECRET_ACCESS_KEY'])
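If you use the additionally installed boto3 instead, the equivalent access might look roughly like this; the bucket name, object key, and file name are placeholders:

import boto3
from scrapy.conf import settings

s3 = boto3.client(
    's3',
    aws_access_key_id=settings['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=settings['AWS_SECRET_ACCESS_KEY'],
)
# Upload the scraped result to S3
s3.upload_file('latest.json', 'my-bucket', 'jp-zip/latest.json')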

If you want to test it locally, write the setting values in settings.py. However, if the credentials already exist in ~/.aws/credentials, there is no need to write them there.

settings.py


AWS_ACCESS_KEY_ID = 'xxxxxx'
AWS_SECRET_ACCESS_KEY = 'xxxxxx'

Remaining tasks

- File upload to S3 is extremely slow, and scraping takes about 20 minutes
  - It is for study purposes, so it does not really matter, but I would like to improve it somehow
- The zip code data is updated once a month
  - When the zip code data is updated, I want the API on the AWS Lambda side to automatically refer to the latest data as well
  - Currently, the update date of the zip code data is managed as version information in stage variables
  - I want to update the stage variable after the upload to S3 is complete (a rough sketch follows the list)
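One possible approach to that last item (not implemented in this post) is to overwrite the stage variable with boto3's API Gateway client once the upload finishes; the REST API ID, stage name, variable name, and value below are placeholders:

import boto3

apigateway = boto3.client('apigateway')

# Replace the stage variable that holds the data version
apigateway.update_stage(
    restApiId='abc123',   # placeholder REST API ID
    stageName='prod',     # placeholder stage name
    patchOperations=[
        {'op': 'replace', 'path': '/variables/version', 'value': '20161219'},
    ],
)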

In conclusion

Initially, I was trying to run the Scrapy project on AWS Lambda as well, just like the API. However, I had to compress the source code into a ZIP file together with its libraries and upload it, and that did not work well, so I gave up and deployed it to Scrapy Cloud instead.

I was also invited to the Bot Crawler Advent Calendar 2016, so I am planning to make some bots with Scrapy Cloud and write articles there as well.

See you soon.
