In the article, the page is fetched like this:

import urllib.request

html = urllib.request.urlopen(url).read()
However, only the same page comes back because the search conditions are not applied. Even in a browser, the same URL shows different content the first time it is opened and the second time, so the site apparently decides what to return based on the session.
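As a side note, a minimal sketch of my own (not code from the article) of why plain urlopen() always gets the default page: it keeps no cookies, so every call looks like a brand-new visitor. Keeping the session cookie with urllib would need an opener built around a CookieJar, for example:

import urllib.request
import http.cookiejar

# Keep cookies between requests so the server sees the same session
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

url = "https://premium-gift.jp/fujinokunigotoeat/use_store"
html = opener.open(url).read()  # later requests through "opener" reuse the same cookies

The code below switches to requests.Session instead, which does the same thing with less ceremony.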
So I looked it up
Page immediately after searching for "Izu City" https://premium-gift.jp/fujinokunigotoeat/use_store?events=search&id=&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=
Next (2nd page) https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=2&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=
Back (1st page) https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=1&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=
The difference between the page immediately after the search and "Back (1st page)" is that the URL changes from events=search to events=page and from id= (blank) to id=1.
The difference between "Back (1st page)" and "Next (2nd page)" is only id=1 vs. id=2, so id turns out to be the page number.
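Just to double-check the parameters, the query strings of the three URLs above can be compared with urllib.parse (a small sketch of my own, not from the article):

from urllib.parse import urlsplit, parse_qs

urls = {
    "search": "https://premium-gift.jp/fujinokunigotoeat/use_store?events=search&id=&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=",
    "page 1": "https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=1&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=",
    "page 2": "https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=2&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=",
}
for name, u in urls.items():
    query = parse_qs(urlsplit(u).query, keep_blank_values=True)
    # Only "events" (search vs. page) and "id" (blank, 1, 2) differ;
    # "addr" is the URL-encoded search word ("伊豆市" = Izu City)
    print(name, {k: v[0] for k, v in query.items()})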
If you start directly from the "Back (1st page)" URL, the search conditions are not applied, so the displayed content is different.
So it seems you have to access the pages in order: open the search URL (events=search) first, then follow the paged URLs (events=page) within the same session.
As for the URL of the next page, there is a link to it in the head of the page, so use that. If you instead try to build the next page yourself from the page immediately after the search, you get id=2, then id=22, then id=222: a page with another 2 appended is returned each time (lol).
import requests
from bs4 import BeautifulSoup
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
}

with requests.Session() as s:
    # To get every store, no search is needed; start from the plain URL instead
    # url = "https://premium-gift.jp/fujinokunigotoeat/use_store"

    # When searching, open the search page once first so the conditions are stored in the session
    s.get(
        "https://premium-gift.jp/fujinokunigotoeat/use_store?events=search&id=&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry="
    )
    url = "https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=1&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry="

    result = []

    while True:
        r = s.get(url, headers=headers)
        r.raise_for_status()

        soup = BeautifulSoup(r.content, "html.parser")

        # One store per card; each table row is one field (header: value)
        for store in soup.select("div.store-card__item"):
            data = {}
            data["Store name"] = store.h3.get_text(strip=True)

            for tr in store.select("table.store-card__table > tbody > tr"):
                data[tr.th.get_text(strip=True).rstrip(":")] = tr.td.get_text(
                    strip=True
                )

            result.append(data)

        # The next-page URL is given by <link rel="next"> in the head
        tag = soup.select_one("head > link[rel=next]")
        print(tag)

        if tag:
            url = tag.get("href")
        else:
            break

        time.sleep(3)
import pandas as pd

df = pd.DataFrame(result)

# Check how many stores were collected
df.shape

df.to_csv("shizuoka.csv", encoding="utf_8_sig")

# Check for duplicates
df[df.duplicated()]

df
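If the duplicate check above does turn up rows, one simple way to handle them (my own addition, not part of the original post) is to drop them before saving again:

# Drop exact duplicate rows, if any were found above
df = df.drop_duplicates().reset_index(drop=True)
df.to_csv("shizuoka.csv", encoding="utf_8_sig")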