First post! I originally wanted to cover the serverless part in the same article, but I couldn't finish it in time, so this time it's just the scraping.
I want to automatically scrape web pages whose information is updated regularly!
Get Yahoo! Weather (Tokyo) data every 6 hours.
Python + Scrapy + AWS Lambda + CloudWatch Events seems like a good fit ...?
Follow the steps below to create the crawling and scraping parts.
$ python3 -V
Python 3.7.4
$ pip3 install scrapy
...
Successfully installed
$ scrapy version
Scrapy 1.8.0
The project folder is created in the directory where you run the command.
$ scrapy startproject yahoo_weather_crawl
New Scrapy project 'yahoo_weather_crawl'
$ ls
yahoo_weather_crawl
This time I will scrape the weekly forecast section of Yahoo! Weather. Let's pick up the announcement date and time, date, weather, temperature, and probability of precipitation.
Scrapy has an interactive shell where you can check whether your selectors actually extract the target data, so let's verify each selector there as we go.
Specify the extraction target with an XPath expression. You can easily copy the XPath from the Google Chrome developer tools (the panel that opens when you press F12).
The XPath for the announcement date and time is as follows:
//*[@id="week"]/p
Let's pull this out of the response.
#Launch scrapy shell
$ scrapy shell https://weather.yahoo.co.jp/weather/jp/13/4410.html
>>> announcement_date = response.xpath('//*[@id="week"]/p/text()').extract_first()
>>> announcement_date
'Announced at 18:00 on November 29, 2019'
If you append text() to the XPath, you get only the text content. See the Scrapy documentation (https://doc.scrapy.org/en/latest/index.html) for more information.
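For comparison, here is a rough sketch of what the same selector returns without text(): the serialized element itself, tags included (the exact markup below is illustrative, not the real page source).
>>> response.xpath('//*[@id="week"]/p').extract_first()
'<p>Announced at 18:00 on November 29, 2019 ...</p>'  # whole <p> element, not just its text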
The announcement date and time are now extracted, so let's get the other fields the same way.
The remaining information is inside a table tag, so first grab the whole table.
>>> table = response.xpath('//*[@id="yjw_week"]/table')
You now have the contents of the table tag under id="yjw_week". We will extract each element from it.
#date
>>> date = table.xpath('//tr[1]/td[2]/small/text()').extract_first()
>>> date
'December 1st'
#weather
>>> weather = table.xpath('//tr[2]/td[2]/small/text()').extract_first()
>>> weather
'Cloudy and sometimes sunny'
#temperature
>>> temperature = table.xpath('//tr[3]/td[2]/small/font/text()').extract()
>>> temperature
['14', '5']
#rainy percent
>>> rainy_percent = table.xpath('//tr[4]/td[2]/small/text()').extract_first()
>>> rainy_percent
'20'
Now that you know how to get each element, let's create a Spider (the main part of the process).
The structure of the project folder created earlier is as follows.
.
├── scrapy.cfg
└── yahoo_weather_crawl
├── __init__.py
├── __pycache__
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
└── spiders
├── __init__.py
└── __pycache__
First, define the items to be acquired.
items.py
import scrapy


class YahooWeatherCrawlItem(scrapy.Item):
    announcement_date = scrapy.Field()  # Announcement date and time
    date = scrapy.Field()               # Date
    weather = scrapy.Field()            # Weather
    temperature = scrapy.Field()        # Temperature
    rainy_percent = scrapy.Field()      # Probability of precipitation
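Incidentally, a scrapy.Item behaves like a dict, so you can poke at the defined fields in the shell (a minimal sketch, assuming the shell is started from the project directory so the import works):
>>> from yahoo_weather_crawl.items import YahooWeatherCrawlItem
>>> item = YahooWeatherCrawlItem(date='December 3rd', rainy_percent='10')
>>> item['date']
'December 3rd'
>>> dict(item)
{'date': 'December 3rd', 'rainy_percent': '10'}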
Next, create the body of the spider in the spiders folder.
spiders/weather_spider.py
# -*- coding: utf-8 -*-
import scrapy
from yahoo_weather_crawl.items import YahooWeatherCrawlItem


# Spider
class YahooWeatherSpider(scrapy.Spider):
    name = "yahoo_weather_crawler"
    allowed_domains = ['weather.yahoo.co.jp']
    start_urls = ["https://weather.yahoo.co.jp/weather/jp/13/4410.html"]

    # Extraction process for the response
    def parse(self, response):
        # Announcement date and time
        yield YahooWeatherCrawlItem(
            announcement_date=response.xpath('//*[@id="week"]/p/text()').extract_first())

        table = response.xpath('//*[@id="yjw_week"]/table')

        # Date loop (columns td[2] through td[6])
        for day in range(2, 7):
            yield YahooWeatherCrawlItem(
                # Data extraction
                date=table.xpath('//tr[1]/td[%d]/small/text()' % day).extract_first(),
                weather=table.xpath('//tr[2]/td[%d]/small/text()' % day).extract_first(),
                temperature=table.xpath('//tr[3]/td[%d]/small/font/text()' % day).extract(),
                rainy_percent=table.xpath('//tr[4]/td[%d]/small/text()' % day).extract_first(),
            )
Now let's run the spider.
$ scrapy crawl yahoo_weather_crawler
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'announcement_date': 'Announced at 17:00 on December 1, 2019'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 3rd',
'rainy_percent': '10',
'temperature': ['17', '10'],
'weather': 'Sunny'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 4th',
'rainy_percent': '0',
'temperature': ['15', '4'],
'weather': 'Sunny'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 5th',
'rainy_percent': '0',
'temperature': ['14', '4'],
'weather': 'Partially cloudy'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 6th',
'rainy_percent': '10',
'temperature': ['11', '4'],
'weather': 'Cloudy'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 7th',
'rainy_percent': '30',
'temperature': ['9', '3'],
'weather': 'Cloudy'}
Looks like everything was scraped correctly! Since we've come this far, let's output it to a file.
When outputting to a file, Japanese characters are garbled by default, so add an encoding setting to settings.py.
settings.py
FEED_EXPORT_ENCODING='utf-8'
$ scrapy crawl yahoo_weather_crawler -o weather_data.json
...
weather_data.json
[
{"announcement_date": "Announced at 17:00 on December 1, 2019"},
{"date": "December 3rd", "weather": "Sunny", "temperature": ["17", "10"], "rainy_percent": "10"},
{"date": "December 4th", "weather": "Sunny", "temperature": ["15", "4"], "rainy_percent": "0"},
{"date": "December 5th", "weather": "Partially cloudy", "temperature": ["14", "4"], "rainy_percent": "0"},
{"date": "December 6th", "weather": "Cloudy", "temperature": ["11", "4"], "rainy_percent": "10"},
{"date": "December 7th", "weather": "Cloudy", "temperature": ["9", "3"], "rainy_percent": "30"}
]
I was able to output!
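As a quick sanity check (a minimal sketch; the file name matches the -o option above), you can load the exported file with the standard json module and print each scraped item:
import json

# Read the exported file and print each scraped item
with open('weather_data.json', encoding='utf-8') as f:
    items = json.load(f)

for item in items:
    print(item)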
Next time, I will combine this process with AWS to run it serverlessly.
References
Scrapy 1.8 documentation: https://doc.scrapy.org/en/latest/index.html
Understand Scrapy in 10 minutes: https://qiita.com/Chanmoro/items/f4df85eb73b18d902739
Web scraping with Scrapy: https://qiita.com/Amtkxa/items/4c1172c932264ae941b4