Shizuoka Prefecture sells two types of Go To Eat meal tickets.
| | Red Fuji ticket | Aofuji ticket |
|---|---|---|
| Price | 8,000 yen per book (usable for 10,000 yen) | 10,000 yen per book (usable for 12,500 yen) |
| URL | https://premium-gift.jp/fujinokunigotoeat/ | https://gotoeat-shizuoka.com/ |
| robots meta tag | index, follow | noindex, follow |
The reason I decided to scrape in the first place is that neither site offers a list view of the participating stores: meal ticket purchasers have to narrow stores down by name or click through the pager link many times. Checking the robots meta tag on each official site, the Aofuji ticket site is noindex, so I targeted only the Red Fuji ticket site.
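For reference, a minimal sketch of checking the robots meta tag programmatically with the standard library (I checked it in the browser's developer tools; the regular expression is an assumption and only covers a simple <meta name="robots" content="..."> tag):
#Check the robots meta tag of each site (assumes a simple <meta name="robots" content="..."> tag)
import re
import urllib.request

for url in ['https://premium-gift.jp/fujinokunigotoeat/', 'https://gotoeat-shizuoka.com/']:
    html = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')
    m = re.search(r'<meta[^>]*name=["\']robots["\'][^>]*content=["\']([^"\']+)["\']', html, re.IGNORECASE)
    print(url, m.group(1) if m else 'robots meta tag not found')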
urllib
There are various methods and libraries for web scraping. I first chose the urllib library, since there is plenty of information about it and it looked easy to use.
#Acquisition example
import urllib.request

#Store search screen of the Red Fuji ticket site
url = 'https://premium-gift.jp/fujinokunigotoeat/use_store'
html = urllib.request.urlopen(url).read()
However, I ran into a problem: the same page was returned no matter what, because the search conditions were never applied. Even opening the same URL in a browser shows different content on the first visit and on later visits. Apparently the site decides what to display based on the session.
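For reference, a minimal sketch of carrying cookies across requests with urllib via http.cookiejar, assuming the site keys the displayed content on a session cookie; since the search conditions still have to be submitted through the form, this alone may not be enough when the results are only rendered after the form is posted in the browser:
#Keep the session cookie between requests (sketch)
import http.cookiejar
import urllib.request

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
html = opener.open('https://premium-gift.jp/fujinokunigotoeat/use_store').read()
print([c.name for c in cj])  #cookies received on the first request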
Selenium
After various investigations, I got it working with Selenium. Selenium is a test automation library for web applications that lets you control a browser programmatically.
#Acquisition example
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

#Run the browser headless
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome('chromedriver', options=options)
driver.get(url)
html = driver.page_source.encode('utf-8')
On Google Colaboratory, you can run it by executing the following.
Installation
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
Source
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
#Run browser in background
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
area_nm = 'Izu City'
df_all = pd.DataFrame(columns=['area_nm', 'shop_nm'])
#Launch browser
driver = webdriver.Chrome('chromedriver',options=options)
driver.implicitly_wait(10)
#initial screen
driver.get("https://premium-gift.jp/fujinokunigotoeat/use_store")
driver.implicitly_wait(10)
print(driver.current_url)
#Search execution
driver.find_element_by_id('addr').send_keys(area_nm)
driver.find_element_by_class_name('store-search__submit').click()
driver.implicitly_wait(10)
print(driver.current_url)
shouldLoop = True
while shouldLoop is True:
    #search results
    current_url = driver.current_url
    shop_nm_list = driver.find_elements_by_class_name("store-card__title")
    for idx, shop_item in enumerate(shop_nm_list):
        row = pd.Series([area_nm, shop_item.text], index=df_all.columns)
        df_all = df_all.append(row, ignore_index=True)
        print(shop_item.text)

    #to the next page
    link_list = driver.find_elements_by_class_name('pagenation__item')
    for link_item in link_list:
        if link_item.text == "next":
            link_item.click()
            driver.implicitly_wait(10)
            print(driver.current_url)

    shouldLoop = False
    #If there is no next page to display, exit
    if current_url != driver.current_url:
        shouldLoop = True

driver.close()
#CSV output
df_all.to_csv('shoplist.csv', index=False)
I hope that the Red Fuji ticket and Aofuji ticket sites themselves will eventually be improved. The search above uses only "Izu City" as a keyword, but you can change the conditions, for example as in the sketch below.
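A minimal sketch of searching several areas in one run, assuming the same page structure as above (the area names other than Izu City and the output file name are just examples):
#Search multiple areas in one run (sketch)
import pandas as pd
from selenium import webdriver

#Run browser in background
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

#Areas to search (example values)
area_nm_list = ['Izu City', 'Atami City', 'Numazu City']
df_all = pd.DataFrame(columns=['area_nm', 'shop_nm'])

driver = webdriver.Chrome('chromedriver', options=options)
driver.implicitly_wait(10)
for area_nm in area_nm_list:
    #Run the search from the initial screen for each area
    driver.get("https://premium-gift.jp/fujinokunigotoeat/use_store")
    driver.find_element_by_id('addr').send_keys(area_nm)
    driver.find_element_by_class_name('store-search__submit').click()
    #Collect every result page, same as the while loop above
    shouldLoop = True
    while shouldLoop:
        current_url = driver.current_url
        for shop_item in driver.find_elements_by_class_name("store-card__title"):
            row = pd.Series([area_nm, shop_item.text], index=df_all.columns)
            df_all = df_all.append(row, ignore_index=True)
        for link_item in driver.find_elements_by_class_name('pagenation__item'):
            if link_item.text == "next":
                link_item.click()
        shouldLoop = current_url != driver.current_url
driver.close()
df_all.to_csv('shoplist_all.csv', index=False)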