I went to PyCon 2016 on September 20th and 21st.
I had been looking for a subject for studying Python 3 for various reasons, and among the talks I listened to, "Introduction to Web crawlers made with Python" introduced **Scrapy**, which looked interesting and had just become compatible with Python 3, so I decided to build a crawler with Python 3.
Click here for the video of the talk
- I prefer to learn by trying things out rather than from theory, so this post is fairly rough.
- I'm not much of a writer, so it may be hard to follow in places.
- Because of that, there will probably be parts you'll want to push back on, but please bear with me!
- I made a ramen map with Scrapy and Django.
- Scrapy is easy and fun. Be careful about the dosage.
- Python 3 is pretty good!
This time I used Scrapy to fetch the **name**, **score**, and **coordinates** of ramen shops near the office from **Tabelog**, and since just fetching them is no fun, I went ahead and plotted them on a map with the Google Maps JavaScript API. (Just in case, I confirmed that crawling is not prohibited by Tabelog's **robots.txt**.)
The configuration is simple.
The flow is like this:

① Use Scrapy to fetch the **store name**, **score**, and **latitude/longitude** from **Tabelog**.
② Save the store information to a file.
③ Access the server from a browser.
④ Django (+ django-gmapi) reads the store information, embeds it in a Jinja template that calls the JavaScript API, and returns it.
⑤ Display the Google Map with JavaScript.
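The post doesn't show `tabelogcrawl/items.py`, but judging from the fields used in the spider code further down, a minimal sketch of `TabelogcrawlItem` would look something like this (the exact definition is an assumption):

```python
# tabelogcrawl/items.py -- minimal sketch; the actual file isn't shown in the post.
import scrapy


class TabelogcrawlItem(scrapy.Item):
    date = scrapy.Field()       # collection date (JST)
    name = scrapy.Field()       # store name
    score = scrapy.Field()      # Tabelog score
    link = scrapy.Field()       # URL of the store's detail page
    longitude = scrapy.Field()  # taken from the map image on the detail page
    latitude = scrapy.Field()
```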
It turned out like this! (The number of markers has been cut to 1/5 for readability.) Markers are red when the score is 3.5 or higher, white when it is 3.0 or lower, and yellow in between.
As expected, there are a lot of highly rated ramen shops in front of Ogikubo and Koenji stations...! There's also a shop just a short walk from Nakano station.
The crawling and scraping work like this: the "store name" and "score" can be taken from the list page, which shows 20 shops at a time, but the "latitude and longitude" are only available on each store's detail page, so those are fetched from there. Once the 20 shops are done, the crawler moves on to the next 20.
The collection interval is about the same as doing it by hand (roughly 10 seconds): **gentlemanly crawling**. The structure of Tabelog's URLs is simple, so by changing the base URL (start_urls) you should be able to crawl and map all sorts of other patterns: categories other than ramen, places other than Tokyo, and so on.
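The post doesn't show how the 10-second interval is configured; in Scrapy this is typically done with a couple of entries in `settings.py`, roughly like this sketch (the setting names are standard Scrapy; the values are assumptions):

```python
# tabelogcrawl/settings.py -- excerpt sketch; the actual settings aren't shown in the post.
DOWNLOAD_DELAY = 10              # wait about 10 seconds between requests ("gentlemanly crawling")
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay so requests aren't perfectly regular
ROBOTSTXT_OBEY = True            # have Scrapy check robots.txt as well
```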
- Ubuntu 16.04.1
- Python 3.5.2
- Scrapy 1.2.0 (installed with pip)
- Django 1.8
- django-gmapi https://bitbucket.org/dbinit/django-gmapi/
According to the talk's presenter, Makabi, it is apparently better to install it with apt-get on Ubuntu, but I wasn't sure whether the Python 3 version of Scrapy could be installed with apt yet, so I installed Scrapy with pip. So far it runs without any problems.
This is an excerpt of the Spider part, which is the heart of the crawling. I used Beautiful Soup for the parsing.
tabelogcrawl/spiders/tlspider.py
```python
# -*- coding: utf-8 -*-
from urllib.parse import urlparse, parse_qs
from datetime import datetime

import pytz
import scrapy
from scrapy.contrib.spiders import CrawlSpider
from bs4 import BeautifulSoup

from tabelogcrawl.items import TabelogcrawlItem

# How many items to get per page (set to 1 when checking the operation).
LIMIT_GET_PER_PAGE = 20


class TLSpider(CrawlSpider):
    name = "tlspider"
    allowed_domains = ["tabelog.com"]
    start_urls = (
        'https://tabelog.com/tokyo/A1319/rstLst/ramen/1/?Srt=D&SrtT=rt&sort_mode=1',
    )

    def parse(self, response):
        # Extract the store names and scores from the list page.
        soup = BeautifulSoup(response.body, "html.parser")
        summary_list = soup.find_all("a", class_="cpy-rst-name")
        score_list = soup.find_all(
            "span", class_="list-rst__rating-val", limit=LIMIT_GET_PER_PAGE)
        for summary, score in zip(summary_list, score_list):
            # Store the necessary information for each shop in a TabelogcrawlItem.
            jstnow = pytz.timezone(
                'Asia/Tokyo').localize(datetime.now()).strftime('%Y/%m/%d')
            item = TabelogcrawlItem()
            item['date'] = jstnow
            item['name'] = summary.string
            item['score'] = score.string
            href = summary["href"]
            item['link'] = href
            # To get the latitude and longitude of the shop,
            # the detail page is also crawled and stored in the TabelogcrawlItem.
            request = scrapy.Request(
                href, callback=self.parse_child)
            request.meta["item"] = item
            yield request

        # Next page.
        soup = BeautifulSoup(response.body, "html.parser")
        next_page = soup.find(
            'a', class_="page-move__target--next")
        if next_page:
            href = next_page.get('href')
            yield scrapy.Request(href, callback=self.parse)

    def parse_child(self, response):
        # Extract the latitude and longitude of the shop.
        soup = BeautifulSoup(response.body, "html.parser")
        g = soup.find("img", class_="js-map-lazyload")
        longitude, latitude = parse_qs(
            urlparse(g["data-original"]).query)["center"][0].split(",")
        item = response.meta["item"]
        item['longitude'] = longitude
        item['latitude'] = latitude
        return item
```
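Step ② (saving the store information to a file) isn't shown either. Since `views.py` below reads the file line by line with `ast.literal_eval`, one plausible way to produce it is a small item pipeline that writes one dict literal per line; the class name, file path, and settings entry in this sketch are assumptions:

```python
# tabelogcrawl/pipelines.py -- hypothetical sketch of step (2), not shown in the post.
class TabelogFilePipeline(object):

    def open_spider(self, spider):
        self.file = open("tabelog_data.txt", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One Python dict literal per line, so it can be read back with ast.literal_eval.
        self.file.write(repr(dict(item)) + "\n")
        return item
```

It would then be enabled with something like `ITEM_PIPELINES = {'tabelogcrawl.pipelines.TabelogFilePipeline': 300}` in `settings.py`.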
For Django, only the view part is excerpted. The part that decides the marker color from the score is fairly ad hoc, so I'd like to improve the coloring a bit, for example with normalization (a rough sketch of that idea follows the code).
tabelogmap/gmapi/views.py
```python
# -*- coding: utf-8 -*-
import codecs
import ast

from django import forms
from django.shortcuts import render_to_response

from gmapi import maps
from gmapi.forms.widgets import GoogleMap

SAVE_FILE = "../tabelog_data.txt"


class MapForm(forms.Form):
    map = forms.Field(
        widget=GoogleMap(
            attrs={'width': 1850, 'height': 900}))


def index(request):
    json_path = SAVE_FILE
    raw_list = codecs.open(json_path, "r", encoding="utf-8").read().split("\n")
    gmap = maps.Map(opts={
        'center': maps.LatLng(35.70361991852944, 139.64842779766255),
        'mapTypeId': maps.MapTypeId.ROADMAP,
        'zoom': 15,
        'mapTypeControlOptions': {
            'style': maps.MapTypeControlStyle.DROPDOWN_MENU
        },
    })
    info = maps.InfoWindow({
        'content': 'Ramen map',
        'disableAutoPan': True
    })
    for raw_data in raw_list:
        try:
            json_data = ast.literal_eval(raw_data)
        except:
            continue
        # Color the marker according to the score.
        if float(json_data["score"]) > 3.5:
            color = "FF776B"
        elif float(json_data["score"]) > 3.0:
            color = "FFBB00"
        else:
            color = "FFFFFF"
        marker_info = {
            'map': gmap,
            'position': maps.LatLng(
                float(json_data["longitude"]),
                float(json_data["latitude"])),
            "label": "%s(%s)" % (
                json_data["name"],
                json_data["score"]),
            "color": color
        }
        marker = maps.Marker(opts=marker_info)
        maps.event.addListener(marker, 'mouseover', 'myobj.markerOver')
        maps.event.addListener(marker, 'mouseout', 'myobj.markerOut')
        info.open(gmap, marker)
    context = {'form': MapForm(initial={'map': gmap})}
    return render_to_response('index.html', context)
```
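As for the coloring mentioned above, a minimal sketch of the normalization idea would be a helper that maps scores linearly onto a white-to-red gradient; the function name and the score range are assumptions:

```python
def score_to_color(score, low=3.0, high=4.0):
    """Map a Tabelog score onto a hex color from white (low) to red (high)."""
    t = max(0.0, min(1.0, (float(score) - low) / (high - low)))  # normalize to 0..1
    fade = int(round(255 * (1.0 - t)))  # green/blue components fade as the score rises
    return "FF%02X%02X" % (fade, fade)  # 3.0 -> "FFFFFF", 3.5 -> "FF8080", 4.0 -> "FF0000"
```

The threshold block in the view could then be replaced with `color = score_to_color(json_data["score"])`.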
Scrapy is easy to use. It feels like it hides the low-level processing in just the right way. It should get even more fun once I make good use of things like the scheduling features. It was so easy that it didn't turn into much Python 3 study, though, lol.
If anything, django-gmapi didn't support Python 3, so I got it running by converting it with 2to3, and that took more time... I now know that Python 3 isn't too scary, but I still need to study a bit more.