Shizuoka sells two types of meal tickets.
| | Red Fuji ticket | Aofuji ticket |
|---|---|---|
| Price | 8,000 yen per book (worth 10,000 yen) | 10,000 yen per book (worth 12,500 yen) |
| robots meta tag | index,follow | noindex,follow |
The reason I decided to scrape in the first place is that neither site offers a list view of the participating stores. Meal ticket purchasers have to narrow stores down by name or click through the pager many times. Checking the robots meta tag on the official sites, the Aofuji ticket site is noindex, so I targeted only the Red Fuji ticket site.
## urllib

There are various methods and libraries for web scraping. I first chose the urllib library, which is well documented and looked easy to use.
```python
# Acquisition example
import urllib.request

html = urllib.request.urlopen(url).read()
```
However, the same page was returned no matter what search conditions I specified. Even opening the same URL in a browser, the content differs between the first visit and the second, so the site apparently tracks the search state in a session.
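A likely cause is that the search conditions are held server-side in a cookie-backed session, which a bare `urlopen` call never carries back. As a minimal sketch (not the approach used in this article), the standard library can at least persist cookies across requests; `example.com` below is a placeholder, not the real site:

```python
import http.cookiejar
import urllib.request

# Keep cookies in a jar so every request made through this opener
# sends back whatever session cookie the server set earlier.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# The first request would establish the session; later requests reuse it.
# (example.com is a placeholder -- the real site also expects a POSTed
# search form, which is why cookies alone were not enough here.)
# html = opener.open('https://example.com/search').read()
```

Since the real site also renders results via JavaScript and form posts, this was not sufficient on its own, which led to the next approach.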
## Selenium

After some investigation, Selenium solved the problem. Selenium is a test-automation library for web applications that lets you drive a browser programmatically.
```python
# Acquisition example
from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Chrome('chromedriver', options=options)
driver.get(url)
html = driver.page_source.encode('utf-8')
```
On Google Colaboratory, you can get it running by executing the following:
```shell
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
```
```python
import pandas as pd
from selenium import webdriver

# Run the browser in the background (headless)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

area_nm = 'Izu City'
df_all = pd.DataFrame(columns=['area_nm', 'shop_nm'])

# Launch the browser
driver = webdriver.Chrome('chromedriver', options=options)
driver.implicitly_wait(10)

# Initial screen
driver.get("https://premium-gift.jp/fujinokunigotoeat/use_store")
driver.implicitly_wait(10)
print(driver.current_url)

# Run the search
driver.find_element_by_id('addr').send_keys(area_nm)
driver.find_element_by_class_name('store-search__submit').click()
driver.implicitly_wait(10)
print(driver.current_url)

shouldLoop = True
while shouldLoop:
    # Collect the shop names on the current results page
    current_url = driver.current_url
    shop_nm_list = driver.find_elements_by_class_name("store-card__title")
    for shop_item in shop_nm_list:
        row = pd.Series([area_nm, shop_item.text], index=df_all.columns)
        df_all = df_all.append(row, ignore_index=True)
        print(shop_item.text)

    # Move to the next page
    shouldLoop = False
    link_list = driver.find_elements_by_class_name('pagenation__item')
    for link_item in link_list:
        if link_item.text == "next":
            link_item.click()
            driver.implicitly_wait(10)
            print(driver.current_url)
            # If the URL did not change, there is no next page; exit
            if current_url != driver.current_url:
                shouldLoop = True
            break

driver.close()

# CSV output
df_all.to_csv('shoplist.csv', index=False)
```
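One caveat on the pandas side: `DataFrame.append`, used above row by row, was deprecated in pandas 1.4 and removed in 2.0. On a current pandas, the equivalent is to collect plain dicts and build the frame once at the end. A minimal sketch with a hypothetical row:

```python
import pandas as pd

# DataFrame.append was removed in pandas 2.0; accumulate rows as plain
# dicts and build the DataFrame in one go instead.
rows = []
rows.append({'area_nm': 'Izu City', 'shop_nm': 'Example Shop'})  # hypothetical row
df_all = pd.DataFrame(rows, columns=['area_nm', 'shop_nm'])
df_all.to_csv('shoplist.csv', index=False)
```

Building the frame once is also considerably faster than appending inside a loop, since each `append` copied the whole frame.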
I hope the Red Fuji and Aofuji ticket sites themselves will be improved. The search above uses only "Izu City" as the keyword, but you can change the conditions to suit your needs.