Shizuoka sells two types of meal tickets.
| | Red Fuji ticket | Aofuji ticket |
|---|---|---|
| Price | 8,000 yen per book (worth 10,000 yen) | 10,000 yen per book (worth 12,500 yen) |
| robots meta tag | index,follow | noindex,follow |
The reason I decided to scrape in the first place is that neither site offers a list view of the participating stores. Meal ticket purchasers have to narrow stores down by name or click through the pager many times. Checking the robots meta tag on the official sites, the Aofuji ticket site is noindex, so I targeted only the Red Fuji ticket site.
## urllib

There are various methods and libraries for web scraping. I first chose the urllib library, which is well documented and looked easy to use.
```python
# Acquisition example
import urllib.request

html = urllib.request.urlopen(url).read()
```
However, the same page was returned no matter what search conditions I specified. Even opening the same URL in a browser, the content differs between the first visit and the second, so the site apparently tracks the search state in a session.
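A likely cause is that the search conditions are held server-side in a cookie-backed session, which a bare `urlopen` call never carries back. As a minimal sketch (not the approach used in this article), the standard library can at least persist cookies across requests; `example.com` below is a placeholder, not the real site:

```python
import http.cookiejar
import urllib.request

# Keep cookies in a jar so every request made through this opener
# sends back whatever session cookie the server set earlier.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# The first request would establish the session; later requests reuse it.
# (example.com is a placeholder -- the real site also expects a POSTed
# search form, which is why cookies alone were not enough here.)
# html = opener.open('https://example.com/search').read()
```

Since the real site also renders results via JavaScript and form posts, this was not sufficient on its own, which led to the next approach.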
## Selenium

After some investigation, Selenium solved the problem. Selenium is a test-automation library for web applications that lets you drive a browser programmatically.
```python
# Acquisition example
from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Chrome('chromedriver', options=options)
driver.get(url)
html = driver.page_source.encode('utf-8')
```
On Google Colaboratory, you can get it running by executing the following:
```shell
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
```
```python
import pandas as pd
from selenium import webdriver

# Run the browser in the background (headless)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

area_nm = 'Izu City'
df_all = pd.DataFrame(columns=['area_nm', 'shop_nm'])

# Launch the browser
driver = webdriver.Chrome('chromedriver', options=options)
driver.implicitly_wait(10)

# Initial screen
driver.get("https://premium-gift.jp/fujinokunigotoeat/use_store")
driver.implicitly_wait(10)
print(driver.current_url)

# Run the search
driver.find_element_by_id('addr').send_keys(area_nm)
driver.find_element_by_class_name('store-search__submit').click()
driver.implicitly_wait(10)
print(driver.current_url)

shouldLoop = True
while shouldLoop:
    # Collect the shop names on the current results page
    current_url = driver.current_url
    shop_nm_list = driver.find_elements_by_class_name("store-card__title")
    for shop_item in shop_nm_list:
        row = pd.Series([area_nm, shop_item.text], index=df_all.columns)
        df_all = df_all.append(row, ignore_index=True)
        print(shop_item.text)

    # Move to the next page
    shouldLoop = False
    link_list = driver.find_elements_by_class_name('pagenation__item')
    for link_item in link_list:
        if link_item.text == "next":
            link_item.click()
            driver.implicitly_wait(10)
            print(driver.current_url)
            # If the URL did not change, there is no next page; exit
            if current_url != driver.current_url:
                shouldLoop = True
            break

driver.close()

# CSV output
df_all.to_csv('shoplist.csv', index=False)
```
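One caveat on the pandas side: `DataFrame.append`, used above row by row, was deprecated in pandas 1.4 and removed in 2.0. On a current pandas, the equivalent is to collect plain dicts and build the frame once at the end. A minimal sketch with a hypothetical row:

```python
import pandas as pd

# DataFrame.append was removed in pandas 2.0; accumulate rows as plain
# dicts and build the DataFrame in one go instead.
rows = []
rows.append({'area_nm': 'Izu City', 'shop_nm': 'Example Shop'})  # hypothetical row
df_all = pd.DataFrame(rows, columns=['area_nm', 'shop_nm'])
df_all.to_csv('shoplist.csv', index=False)
```

Building the frame once is also considerably faster than appending inside a loop, since each `append` copied the whole frame.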
I hope the Red Fuji and Aofuji ticket sites themselves will be improved. The search above uses only "Izu City" as the keyword, but you can change the conditions to suit your needs.