First post! I originally wanted to cover the serverless part in the same article, but I couldn't finish it in time, so this time it's just the scraping.
I want to automatically scrape web pages whose information is updated regularly!
Get Yahoo! Weather (Tokyo) data every 6 hours.
Python + Scrapy + AWS Lambda + CloudWatch Events seems like a good fit ...?
Follow the steps below to create the crawling and scraping parts.
$ python3 -V
Python 3.7.4
$ pip3 install scrapy
...
Successfully installed
$ scrapy version
Scrapy 1.8.0
The project folder is created in the directory where you run the command.
$ scrapy startproject yahoo_weather_crawl
New Scrapy project 'yahoo_weather_crawl'
$ ls
yahoo_weather_crawl
This time I will scrape the weekly forecast section of Yahoo! Weather. Let's pick up the announcement date and time, date, weather, temperature, and probability of precipitation.
Scrapy has an interactive shell where you can check whether your selectors actually extract the target data, so let's verify each selector there as we go.
Specify the extraction target with an XPath expression. You can easily copy the XPath from the Google Chrome developer tools (the panel that opens when you press F12).
The XPath for the announcement date and time is as follows:
//*[@id="week"]/p
Let's pull this out of the response.
#Launch scrapy shell
$ scrapy shell https://weather.yahoo.co.jp/weather/jp/13/4410.html
>>> announcement_date = response.xpath('//*[@id="week"]/p/text()').extract_first()
>>> announcement_date
'Announced at 18:00 on November 29, 2019'
If you append text() to the XPath, you get only the text content. See the Scrapy documentation (https://doc.scrapy.org/en/latest/index.html) for more information.
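For comparison, here is a rough sketch of what the same selector returns without text(): the serialized element itself, tags included (the exact markup below is illustrative, not the real page source).
>>> response.xpath('//*[@id="week"]/p').extract_first()
'<p>Announced at 18:00 on November 29, 2019 ...</p>'  # whole <p> element, not just its text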
The announcement date and time are now extracted, so let's get the other fields the same way.
The remaining information is inside a table tag, so first grab the whole table.
>>> table = response.xpath('//*[@id="yjw_week"]/table')
You now have the contents of the table tag under id="yjw_week". We will extract each element from it.
#date
>>> date = table.xpath('//tr[1]/td[2]/small/text()').extract_first()
>>> date
'December 1st'
#weather
>>> weather = table.xpath('//tr[2]/td[2]/small/text()').extract_first()
>>> weather
'Cloudy and sometimes sunny'
#temperature
>>> temperature = table.xpath('//tr[3]/td[2]/small/font/text()').extract()
>>> temperature
['14', '5']
#rainy percent
>>> rainy_percent = table.xpath('//tr[4]/td[2]/small/text()').extract_first()
>>> rainy_percent
'20'
Now that you know how to get each element, let's create a Spider (the main part of the process).
The structure of the project folder created earlier is as follows.
.
├── scrapy.cfg
└── yahoo_weather_crawl
├── __init__.py
├── __pycache__
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
└── spiders
├── __init__.py
└── __pycache__
First, define the items to be acquired.
items.py
import scrapy


class YahooWeatherCrawlItem(scrapy.Item):
    announcement_date = scrapy.Field()  # Announcement date and time
    date = scrapy.Field()               # Date
    weather = scrapy.Field()            # Weather
    temperature = scrapy.Field()        # Temperature
    rainy_percent = scrapy.Field()      # Probability of precipitation
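Incidentally, a scrapy.Item behaves like a dict, so you can poke at the defined fields in the shell (a minimal sketch, assuming the shell is started from the project directory so the import works):
>>> from yahoo_weather_crawl.items import YahooWeatherCrawlItem
>>> item = YahooWeatherCrawlItem(date='December 3rd', rainy_percent='10')
>>> item['date']
'December 3rd'
>>> dict(item)
{'date': 'December 3rd', 'rainy_percent': '10'}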
Next, create the body of the spider in the spiders folder.
spiders/weather_spider.py
# -*- coding: utf-8 -*-
import scrapy
from yahoo_weather_crawl.items import YahooWeatherCrawlItem


# Spider
class YahooWeatherSpider(scrapy.Spider):
    name = "yahoo_weather_crawler"
    allowed_domains = ['weather.yahoo.co.jp']
    start_urls = ["https://weather.yahoo.co.jp/weather/jp/13/4410.html"]

    # Extraction process for the response
    def parse(self, response):
        # Announcement date and time
        yield YahooWeatherCrawlItem(
            announcement_date=response.xpath('//*[@id="week"]/p/text()').extract_first())

        table = response.xpath('//*[@id="yjw_week"]/table')

        # Date loop (columns td[2] through td[6])
        for day in range(2, 7):
            yield YahooWeatherCrawlItem(
                # Data extraction
                date=table.xpath('//tr[1]/td[%d]/small/text()' % day).extract_first(),
                weather=table.xpath('//tr[2]/td[%d]/small/text()' % day).extract_first(),
                temperature=table.xpath('//tr[3]/td[%d]/small/font/text()' % day).extract(),
                rainy_percent=table.xpath('//tr[4]/td[%d]/small/text()' % day).extract_first(),
            )
Now let's run the spider.
$ scrapy crawl yahoo_weather_crawler
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'announcement_date': 'Announced at 17:00 on December 1, 2019'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 3rd',
'rainy_percent': '10',
'temperature': ['17', '10'],
'weather': 'Sunny'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 4th',
'rainy_percent': '0',
'temperature': ['15', '4'],
'weather': 'Sunny'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 5th',
'rainy_percent': '0',
'temperature': ['14', '4'],
'weather': 'Partially cloudy'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 6th',
'rainy_percent': '10',
'temperature': ['11', '4'],
'weather': 'Cloudy'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 7th',
'rainy_percent': '30',
'temperature': ['9', '3'],
'weather': 'Cloudy'}
Looks like everything was scraped correctly! Since we've come this far, let's output it to a file.
When outputting to a file, Japanese characters are garbled by default, so add an encoding setting to settings.py.
settings.py
FEED_EXPORT_ENCODING='utf-8'
$ scrapy crawl yahoo_weather_crawler -o weather_data.json
...
weather_data.json
[
{"announcement_date": "Announced at 17:00 on December 1, 2019"},
{"date": "December 3rd", "weather": "Sunny", "temperature": ["17", "10"], "rainy_percent": "10"},
{"date": "December 4th", "weather": "Sunny", "temperature": ["15", "4"], "rainy_percent": "0"},
{"date": "December 5th", "weather": "Partially cloudy", "temperature": ["14", "4"], "rainy_percent": "0"},
{"date": "December 6th", "weather": "Cloudy", "temperature": ["11", "4"], "rainy_percent": "10"},
{"date": "December 7th", "weather": "Cloudy", "temperature": ["9", "3"], "rainy_percent": "30"}
]
I was able to output!
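As a quick sanity check (a minimal sketch; the file name matches the -o option above), you can load the exported file with the standard json module and print each scraped item:
import json

# Read the exported file and print each scraped item
with open('weather_data.json', encoding='utf-8') as f:
    items = json.load(f)

for item in items:
    print(item)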
Next time, I will combine this process with AWS to run it serverlessly.
References
Scrapy 1.8 documentation: https://doc.scrapy.org/en/latest/index.html
Understand Scrapy in 10 minutes: https://qiita.com/Chanmoro/items/f4df85eb73b18d902739
Web scraping with Scrapy: https://qiita.com/Amtkxa/items/4c1172c932264ae941b4