[PYTHON] Upload artifacts scraped on Scrapy Cloud to S3

Introduction

Previously, as a learning exercise, I built a zip code search API with AWS Lambda + API Gateway. The zip code data was scraped with Scrapy and uploaded to S3 for the API to use. This time, I will write about the points I got stuck on when deploying that Scrapy project to Scrapy Cloud and running it on a regular schedule.

For Scrapy Cloud, I referred to the following articles.

- Scrapy + Scrapy Cloud for a comfortable Python crawl + scraping life - Gunosy data analysis blog
- Scraping with Python - Introduction to Scrapy, 2nd step - Qiita

Overall workflow

The figure below shows the overall workflow.

(Figure: jp-zip_scrapy構成図.png)

Deploying to Scrapinghub

From creating a Scrapy project to deploying it to Scrapinghub, the flow is as follows (a rough command-line sketch is shown after the list).

- Create a Scrapy project
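For reference, a minimal sketch of the commands, assuming shub (the Scrapinghub command-line client) and placeholder names for the project:

# Create the Scrapy project and implement the spider
scrapy startproject jp_zip   # "jp_zip" is a placeholder project name

# Install the Scrapinghub client and deploy
pip install shub
shub login                   # prompts for your Scrapinghub API key
shub deploy 12345            # "12345" is a placeholder project ID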

What tripped me up when running a job on Scrapinghub

Everything worked fine up until the deploy to Scrapinghub, but a few things tripped me up when I actually ran the job there.

The pre-installed boto is the 2.x series

boto is used for the AWS operations, but be careful: the boto that comes pre-installed on Scrapy Cloud is boto 2, not boto3.
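For reference, a minimal upload with the pre-installed boto 2 might look roughly like this; the credentials, bucket name, object key, and file name are all placeholders:

import boto
from boto.s3.key import Key

AWS_ACCESS_KEY_ID = 'xxxxxx'       # placeholder credentials
AWS_SECRET_ACCESS_KEY = 'xxxxxx'

# Connect to S3 with the boto 2 API and upload a local file
conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket('my-bucket')           # placeholder bucket name
key = Key(bucket)
key.key = 'jp-zip/latest.json'                  # placeholder object key
key.set_contents_from_filename('latest.json')   # placeholder local file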

Additional installation of boto3 (updated 2016-12-19)

If you specify requirements_file in scrapinghub.yml, the libraries listed there are installed at deploy time, so boto3 can also be used.
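A minimal sketch of the two files, assuming a placeholder project ID:

scrapinghub.yml

projects:
  default: 12345  # placeholder Scrapy Cloud project ID
requirements_file: requirements.txt

requirements.txt

boto3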

If requirements_file is processed successfully at deploy time, you can check the additionally installed libraries under requirements in Code & Deploys.

(Figure: requirements.png)

What to do with AWS credentials

For the AWS credentials, go to Spider Settings -> Spider settings and register the setting values as shown below.

(Figure: scrapy_settings.png)

Access them from the code as follows.

import boto
from scrapy.conf import settings

# Read the credentials registered in Spider settings and connect to S3
s3 = boto.connect_s3(settings['AWS_ACCESS_KEY_ID'], settings['AWS_SECRET_ACCESS_KEY'])
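If you use the additionally installed boto3 instead, the equivalent access might look roughly like this; the bucket name, object key, and file name are placeholders:

import boto3
from scrapy.conf import settings

s3 = boto3.client(
    's3',
    aws_access_key_id=settings['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=settings['AWS_SECRET_ACCESS_KEY'],
)
# Upload the scraped result to S3
s3.upload_file('latest.json', 'my-bucket', 'jp-zip/latest.json')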

If you want to test it locally, write the setting values in settings.py. However, if the credentials already exist in ~/.aws/credentials, there is no need to write them there.

settings.py


AWS_ACCESS_KEY_ID = 'xxxxxx'
AWS_SECRET_ACCESS_KEY = 'xxxxxx'

Remaining tasks

- File upload to S3 is extremely slow, and scraping takes about 20 minutes
  - It is for study purposes, so it does not really matter, but I would like to improve it somehow
- The zip code data is updated once a month
  - When the zip code data is updated, I want the API on the AWS Lambda side to automatically refer to the latest data as well
  - Currently, the update date of the zip code data is managed as version information in stage variables
  - I want to update the stage variable after the upload to S3 is complete (a rough sketch follows the list)
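One possible approach to that last item (not implemented in this post) is to overwrite the stage variable with boto3's API Gateway client once the upload finishes; the REST API ID, stage name, variable name, and value below are placeholders:

import boto3

apigateway = boto3.client('apigateway')

# Replace the stage variable that holds the data version
apigateway.update_stage(
    restApiId='abc123',   # placeholder REST API ID
    stageName='prod',     # placeholder stage name
    patchOperations=[
        {'op': 'replace', 'path': '/variables/version', 'value': '20161219'},
    ],
)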

In conclusion

Initially, I was trying to run the Scrapy project on AWS Lambda as well, just like the API. However, I had to compress the source code into a ZIP file together with its libraries and upload it, and that did not work well, so I gave up and deployed it to Scrapy Cloud instead.

I was also invited to the Bot Crawler Advent Calendar 2016, so I am planning to make some bots with Scrapy Cloud and write articles there as well.

See you soon.
