In the article, the page is fetched like this:

import urllib.request

html = urllib.request.urlopen(url).read()
However, only the same page comes back because the search conditions are not applied. Even in a browser, the same URL shows different content the first time it is opened and the second time, so the site apparently decides what to return based on the session.
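As a side note, a minimal sketch of my own (not code from the article) of why plain urlopen() always gets the default page: it keeps no cookies, so every call looks like a brand-new visitor. Keeping the session cookie with urllib would need an opener built around a CookieJar, for example:

import urllib.request
import http.cookiejar

# Keep cookies between requests so the server sees the same session
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

url = "https://premium-gift.jp/fujinokunigotoeat/use_store"
html = opener.open(url).read()  # later requests through "opener" reuse the same cookies

The code below switches to requests.Session instead, which does the same thing with less ceremony.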
So I looked it up
Page immediately after searching for "Izu City" https://premium-gift.jp/fujinokunigotoeat/use_store?events=search&id=&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=
Next (2nd page) https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=2&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=
Back (1st page) https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=1&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=
The difference between the page immediately after the search and "Back (1st page)" is that the URL changes from events=search to events=page and from id= (blank) to id=1.
The difference between "Back (1st page)" and "Next (2nd page)" is only id=1 vs. id=2, so id turns out to be the page number.
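Just to double-check the parameters, the query strings of the three URLs above can be compared with urllib.parse (a small sketch of my own, not from the article):

from urllib.parse import urlsplit, parse_qs

urls = {
    "search": "https://premium-gift.jp/fujinokunigotoeat/use_store?events=search&id=&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=",
    "page 1": "https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=1&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=",
    "page 2": "https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=2&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=",
}
for name, u in urls.items():
    query = parse_qs(urlsplit(u).query, keep_blank_values=True)
    # Only "events" (search vs. page) and "id" (blank, 1, 2) differ;
    # "addr" is the URL-encoded search word ("伊豆市" = Izu City)
    print(name, {k: v[0] for k, v in query.items()})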
If you start directly from the "Back (1st page)" URL, the search conditions are not applied, so the displayed content is different.
So it seems you have to access the pages in order: open the search URL (events=search) first, then follow the paged URLs (events=page) within the same session.
As for the URL of the next page, there is a link to it in the head of the page, so use that. If you instead try to build the next page yourself from the page immediately after the search, you get id=2, then id=22, then id=222: a page with another 2 appended is returned each time (lol).
import requests
from bs4 import BeautifulSoup
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
}

with requests.Session() as s:
    # To get every store, no search is needed; start from the plain URL instead
    # url = "https://premium-gift.jp/fujinokunigotoeat/use_store"

    # When searching, open the search page once first so the conditions are stored in the session
    s.get(
        "https://premium-gift.jp/fujinokunigotoeat/use_store?events=search&id=&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry="
    )
    url = "https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=1&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry="

    result = []

    while True:
        r = s.get(url, headers=headers)
        r.raise_for_status()

        soup = BeautifulSoup(r.content, "html.parser")

        # One store per card; each table row is one field (header: value)
        for store in soup.select("div.store-card__item"):
            data = {}
            data["Store name"] = store.h3.get_text(strip=True)

            for tr in store.select("table.store-card__table > tbody > tr"):
                data[tr.th.get_text(strip=True).rstrip(":")] = tr.td.get_text(
                    strip=True
                )

            result.append(data)

        # The next-page URL is given by <link rel="next"> in the head
        tag = soup.select_one("head > link[rel=next]")
        print(tag)

        if tag:
            url = tag.get("href")
        else:
            break

        time.sleep(3)
import pandas as pd

df = pd.DataFrame(result)

# Check how many stores were collected
df.shape

df.to_csv("shizuoka.csv", encoding="utf_8_sig")

# Check for duplicates
df[df.duplicated()]

df
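If the duplicate check above does turn up rows, one simple way to handle them (my own addition, not part of the original post) is to drop them before saving again:

# Drop exact duplicate rows, if any were found above
df = df.drop_duplicates().reset_index(drop=True)
df.to_csv("shizuoka.csv", encoding="utf_8_sig")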