Looking at the URL of each restaurant, in the case of Tokyo it takes the form "//tabelog.com/tokyo/A..../A....../......../". For example, a restaurant in Shibuya has a URL like "//tabelog.com/tokyo/A1303/A130301/......../", which can be read as "Tabelog / Tokyo / (Shibuya, Ebisu, Daikanyama) / Shibuya / specific restaurant /". On Tabelog's top page, the "Search by area" section holds the data down to the "//tabelog.com/tokyo/A..../" level, so I will get this first. We obviously don't need data for the whole country, so after narrowing things down to a reasonably large area, I will fetch only the data I actually want.
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
root_url = 'https://tabelog.com/'
res = requests.get(root_url)
soup = BeautifulSoup(res.content, 'html.parser')
Parsing the page with BeautifulSoup like this and looking through the result, I found the following:
<h2 class="rsttop-heading1 rsttop-search__title">
Search by area
</h2>
</div>
<ul class="rsttop-area-search__list">
<li class="rsttop-area-search__item">
<a class="rsttop-area-search__target js-area-swicher-target" data-swicher-area-list='[{"areaName":"Ginza / Shinbashi / Yurakucho","url":"/tokyo/A1301/"},{"areaName":"Nihonbashi, Tokyo","url":"/tokyo/A1302/"},{"areaName":"Shibuya / Ebisu / Daikanyama","url":"/tokyo/A1303/"},...
↑ The data we want is right around here!
a = soup.find_all('a', class_='rsttop-area-search__target js-area-swicher-target')
a[0].get('data-swicher-area-list')
Running this returns something like
'[{"areaName":"Ginza / Shinbashi / Yurakucho","url":"/tokyo/A1301/"},{"areaName":"Nihonbashi, Tokyo","url":"/tokyo/A1302/"},{"areaName":"Shibuya / Ebisu / Daikanyama","url":"/tokyo/A1303/"},...
and so on. I had expected a list of dictionaries, but it is actually just one long string. I looked for a clean way to handle it but couldn't find one, so, although it is not pretty, I force it into the form I need below (a possible cleaner alternative using the json module is sketched after the output). If you know a smoother way to handle this part, please let me know!
splitted = a[0].get('data-swicher-area-list').split('"')
area_dict = {}
for i in range(int((len(splitted)-1)/8)):
    area_dict[splitted[i*8+3]] = splitted[i*8+7]
With this, I managed to get the following dictionary.
{'Ueno / Asakusa / Nippori': '/tokyo/A1311/',
'Ryogoku, Kinshicho, Koiwa': '/tokyo/A1312/',
'Nakano-Nishi-Ogikubo': '/tokyo/A1319/',...
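Incidentally, since the value of data-swicher-area-list is itself a JSON array, a cleaner alternative would probably be to parse it with the standard json module instead of splitting the string. A minimal sketch (untested against the live page, reusing the tag list a obtained above):
import json
# data-swicher-area-list is a JSON array of {"areaName": ..., "url": ...} objects
area_list = json.loads(a[0].get('data-swicher-area-list'))
area_dict = {entry['areaName']: entry['url'] for entry in area_list}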
Honestly, Tokyo alone would be enough for me, but if you want to collect every prefecture, it goes as follows.
area_url = {}
for area in a:
    area_dict = {}
    splitted = area.get('data-swicher-area-list').split('"')
    for i in range(int((len(splitted)-1)/8)):
        area_dict[splitted[i*8+3]] = splitted[i*8+7]
    area_url[area.get('data-swicher-city').split('"')[3]] = area_dict
One thing that caught my attention along the way: len(a) = 53 but len(area_url) = 47. Looking into it, the cause was that Tokyo, Kanagawa, Aichi, Osaka, Kyoto, and Fukuoka each appear twice; since the content is identical for the parts I need, I judged that the code above still achieves its purpose. The URLs are obtained in the following form.
area_url
│
├──'Tokyo'
│ ├──'Ueno / Asakusa / Nippori' : '/tokyo/A1311/'
│ ├──'Ryogoku, Kinshicho, Koiwa' : '/tokyo/A1312/'
│ ⋮
│ └──'Ginza / Shinbashi / Yurakucho' : '/tokyo/A1301/'
│
├──'Kanagawa'
│ ├──'Around Odawara' : '/kanagawa/A1409/'
│ ⋮
⋮
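So, for example, pulling out a single sub-area URL works like this:
area_url['Tokyo']['Shibuya / Ebisu / Daikanyama']
# -> '/tokyo/A1303/'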
Now that we have the major area classification, the next step is to get the minor one. In the same way as before:
url = '/tokyo/A1304/'
res = requests.get(root_url + url[1:])
soup = BeautifulSoup(res.content, 'html.parser')
a = soup.find_all('a', class_='c-link-arrow')
area_dict = {}
for area in a:
    href = area['href']
    if href[-21:-8] != url:
        continue
    else:
        area_dict[area.text] = href
Running this gives
{'Yoyogi': 'https://tabelog.com/tokyo/A1304/A130403/',
'Okubo / Shin-Okubo': 'https://tabelog.com/tokyo/A1304/A130404/',
'Shinjuku': 'https://tabelog.com/tokyo/A1304/A130401/',
'Shinjuku Gyoen': 'https://tabelog.com/tokyo/A1304/A130402/'}
Looks good. The if statement was added because soup.find_all('a', class_='c-link-arrow') also picks up some advertisement links that carry class="c-link-arrow"; the check filters those out.
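To make the somewhat cryptic slice concrete: for a genuine sub-area link, href[-21:-8] cuts out exactly the 13-character parent-area portion of the URL, which is what gets compared against url.
href = 'https://tabelog.com/tokyo/A1304/A130403/'
href[-21:-8]
# -> '/tokyo/A1304/'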
Next, specify the areas we actually want to visit and get their URLs.
visit_areas = ['Roppongi / Azabu / Hiroo', 'Harajuku / Omotesando / Aoyama', 'Yotsuya / Ichigaya / Iidabashi', 'Shinjuku / Yoyogi / Okubo',
'Nihonbashi, Tokyo', 'Shibuya / Ebisu / Daikanyama', 'Meguro / Platinum / Gotanda', 'Akasaka / Nagatacho / Tameike', 'Ginza / Shinbashi / Yurakucho']
url_dict = {}
for visit_area in visit_areas:
    url = area_url['Tokyo'][visit_area]
    time.sleep(1)
    res = requests.get(root_url + url[1:])
    soup = BeautifulSoup(res.content, 'html.parser')
    a = soup.find_all('a', class_='c-link-arrow')
    for area in a:
        href = area['href']
        if href[-21:-8] != url:
            continue
        else:
            url_dict[area.text] = href
We succeeded in getting the URLs of 34 sub-areas, in the following form!
{'Marunouchi / Otemachi': 'https://tabelog.com/tokyo/A1302/A130201/',
'Kudanshita': 'https://tabelog.com/tokyo/A1309/A130906/',...
Now that we have the URLs that point to each sub-area ("//tabelog.com/tokyo/A..../A....../"), the next step is to get the URLs of the individual restaurants ("//tabelog.com/tokyo/A..../A....../......../").
url = 'https://tabelog.com/tokyo/A1302/A130201/'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
The above URL shows the first 20 restaurants in this area. The next 20 are at "//tabelog.com/tokyo/A1302/A130201/rstLst/2/", and you can keep going with rstLst/3, 4, 5, .... Since the loop needs the maximum value of rstLst, the total number of restaurants is divided by 20 and rounded up to an integer, as shown below.
import math
count = soup.find_all('span', class_='list-condition__count')
print(math.ceil(int(count[0].text)/20))
90
There are 1,784 restaurants in total, so with 20 per page the last page should be page 90. However, when I tried to display page 90, the message
Unable to display this page Thank you for using Tabelog. It cannot be displayed after page 60. Please narrow down the conditions and search again.
was displayed! So apparently only up to 60 pages can be shown. That means I have to either narrow the search conditions beyond just the area so that each query returns 1,200 restaurants or fewer before looping, or settle for the top 1,200 sorted by newest opening. Either way, let's first check how many restaurants are listed in each area.
counts = {}
for key, value in url_dict.items():
    time.sleep(1)
    res = requests.get(value)
    soup = BeautifulSoup(res.content, 'html.parser')
    counts[key] = int(soup.find_all('span', class_='list-condition__count')[0].text)
print(sorted(counts.items(), key=lambda x: x[1], reverse=True)[:15])
[('Shinjuku', 5756),
('Shibuya', 3420),
('Shimbashi / Shiodome', 2898),
('Ginza', 2858),
('Roppongi / Nogizaka / Nishiazabu', 2402),
('Marunouchi / Otemachi', 1784),
('Iidabashi / Kagurazaka', 1689),
('Ebisu', 1584),
('Nihonbashi / Kyobashi', 1555),
('Akasaka', 1464),
('Ningyocho / Kodenmacho', 1434),
('Gotanda / Takanawadai', 937),
('Yurakucho / Hibiya', 773),
('Tameike Sanno / Kasumigaseki', 756),
('Kayabacho / Hatchobori', 744)]
This made it clear that 11 areas list more than 1,200 restaurants. After some trial and error over what to do, I decided to limit the genre to restaurants (which fits the purpose anyway) and to be content with the top 1,200 entries sorted by newest opening. First, let's get the restaurant information shown on one specific page.
url = 'https://tabelog.com/tokyo/A1301/A130101/rstLst/RC/1/?Srt=D&SrtT=nod'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
rc_list = []
for rc_div in soup.find_all('div', class_='list-rst__wrap js-open-new-window'):
    rc_name = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name').text
    rc_url = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name')['href']
    rc_score = rc_div.find('span', class_='c-rating__val c-rating__val--strong list-rst__rating-val')
    if rc_score is None:
        rc_score = -1.
    else:
        rc_score = float(rc_score.text)
    rc_review_num = rc_div.find('em', class_='list-rst__rvw-count-num cpy-review-count').text
    rc_list.append([rc_name, rc_url, rc_score, rc_review_num])
I will add some explanation. First, in the URL, '/rstLst/RC' restricts the genre to restaurants. The '/1' that follows means the first page, i.e. the first 20 entries. Finally, '/?Srt=D&SrtT=nod' specifies sorting by newest opening. The for statement then processes the 20 restaurants on the page in order. The tabelog score needs a little care: it can be fetched with the find call above, but when a restaurant has no score, the tag itself does not exist. That is why the 'is None' check is there, and a missing score is temporarily set to -1. As for the number of reviews, ' - ' is returned when there are no reviews. With this in place, the URL of every restaurant can be obtained by looping over each area and each page!
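For concreteness, the listing URL for an arbitrary page is assembled from those pieces like this (sub_area_url and page are just illustrative names; the pattern is the same one the full loop uses later):
sub_area_url = 'https://tabelog.com/tokyo/A1301/A130101/'  # a sub-area URL as obtained above
page = 1                                                   # 20 restaurants per page
listing_url = sub_area_url + 'rstLst/RC/' + str(page) + '/?Srt=D&SrtT=nod'
# -> 'https://tabelog.com/tokyo/A1301/A130101/rstLst/RC/1/?Srt=D&SrtT=nod'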
Now that we can get the URL of each restaurant, the goal is to pull the review information from each restaurant's page. Code first, it looks like this:
url = 'https://tabelog.com/tokyo/A1301/A130101/13079232/dtlrvwlst/COND-2/smp0/?smp=0&lc=2&rvw_part=all&PG=1'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
station = soup.find_all('dl', class_='rdheader-subinfo__item')[0].find('span', class_='linktree__parent-target-text').text
genre = '/'.join([genre_.text for genre_ in soup.find_all('dl', class_='rdheader-subinfo__item')[1].find_all('span', class_='linktree__parent-target-text')])
price = soup.find_all('dl', class_='rdheader-subinfo__item')[2].find('p', class_='rdheader-budget__icon rdheader-budget__icon--dinner').find('a', class_='rdheader-budget__price-target').text
score = [score_.next_sibling.next_sibling.text for score_ in soup.find_all('span', class_='c-rating__time c-rating__time--dinner')]
print(station, genre, price, score)
Some commentary. First, in the URL, '/dtlrvwlst' is the review list, '/COND-2' means dinner reviews only, 'smp0' is the simple display, 'lc=2' shows 100 reviews per page, and 'PG=1' is page 1. The nearest station, genre, and budget are picked up in that order because the data sits in 'dl' tags with the class name 'rdheader-subinfo__item'. As for genre, most restaurants have several genres assigned, so all genre names are joined with '/'. The budget and the score of each review are a bit more involved because I only wanted the dinner-time values.
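In other words, a review-list URL is put together from a restaurant URL plus these parameters, which is how the full loop below builds it:
rc_url = 'https://tabelog.com/tokyo/A1301/A130101/13079232/'  # an individual restaurant URL
page = 1
review_url = rc_url + 'dtlrvwlst/COND-2/smp0/?smp=0&lc=2&rvw_part=all&PG=' + str(page)
# dtlrvwlst: review list, COND-2: dinner reviews, smp0 / smp=0: simple display,
# lc=2: 100 reviews per page, PG: page number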
Now that each piece of information can be obtained individually, all that's left is to collect the data we want with a loop!
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import math
import time
from tqdm import tqdm

# get the large-area URLs from the top page
root_url = 'https://tabelog.com/'
res = requests.get(root_url)
soup = BeautifulSoup(res.content, 'html.parser')
a = soup.find_all('a', class_='rsttop-area-search__target js-area-swicher-target')
area_url = {}
for area in a:
    area_dict = {}
    splitted = area.get('data-swicher-area-list').split('"')
    for i in range(int((len(splitted)-1)/8)):
        area_dict[splitted[i*8+3]] = splitted[i*8+7]
    area_url[area.get('data-swicher-city').split('"')[3]] = area_dict

# resolve the sub-areas within the areas we want to visit
visit_areas = ['Shibuya / Ebisu / Daikanyama']
url_dict = {}
for visit_area in visit_areas:
    url = area_url['Tokyo'][visit_area]
    time.sleep(1)
    res = requests.get(root_url + url[1:])
    soup = BeautifulSoup(res.content, 'html.parser')
    a = soup.find_all('a', class_='c-link-arrow')
    for area in a:
        href = area['href']
        if href[-21:-8] != url:
            continue
        else:
            url_dict[area.text] = href

# crawl the restaurant listings and their dinner review pages
max_page = 20
restaurant_data = []
for area, url in url_dict.items():
    time.sleep(1)
    res_area = requests.get(url)
    soup_area = BeautifulSoup(res_area.content, 'html.parser')
    rc_count = int(soup_area.find_all('span', class_='list-condition__count')[0].text)
    print('There are ' + str(rc_count) + ' restaurants in ' + area)
    for i in tqdm(range(1, min(math.ceil(rc_count/20)+1, max_page+1, 61))):
        url_rc = url + 'rstLst/RC/' + str(i) + '/?Srt=D&SrtT=nod'
        res_rc = requests.get(url_rc)
        soup_rc = BeautifulSoup(res_rc.content, 'html.parser')
        for rc_div in soup_rc.find_all('div', class_='list-rst__wrap js-open-new-window'):
            rc_name = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name').text
            rc_url = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name')['href']
            rc_score = rc_div.find('span', class_='c-rating__val c-rating__val--strong list-rst__rating-val')
            if rc_score is None:
                rc_score = -1.
            else:
                rc_score = float(rc_score.text)
            rc_review_num = rc_div.find('em', class_='list-rst__rvw-count-num cpy-review-count').text
            if rc_review_num != ' - ':
                for page in range(1, math.ceil(int(rc_review_num)/100)+1):
                    rc_url_pg = rc_url + 'dtlrvwlst/COND-2/smp0/?smp=0&lc=2&rvw_part=all&PG=' + str(page)
                    time.sleep(1)
                    res_pg = requests.get(rc_url_pg)
                    soup_pg = BeautifulSoup(res_pg.content, 'html.parser')
                    if "I can't find the page I'm looking for" in soup_pg.find('title').text:
                        continue
                    try:
                        station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('span', class_='linktree__parent-target-text').text
                    except:
                        try:
                            station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('dd', class_='rdheader-subinfo__item-text').text.replace('\n','').replace(' ','')
                        except:
                            station = ''
                    genre = '/'.join([genre_.text for genre_ in soup_pg.find_all('dl', class_='rdheader-subinfo__item')[1].find_all('span', class_='linktree__parent-target-text')])
                    price = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[2].find('p', class_='rdheader-budget__icon rdheader-budget__icon--dinner').find('a', class_='rdheader-budget__price-target').text
                    score = [score_.next_sibling.next_sibling.text for score_ in soup_pg.find_all('span', class_='c-rating__time c-rating__time--dinner')]
                    restaurant_data.append([area, rc_count, rc_name, rc_url, rc_score, rc_review_num, station, genre, price, score])
Because of the sheer amount of data, the target was temporarily limited to visit_areas = ['Shibuya / Ebisu / Daikanyama']. Also, since max_page = 20, at most 400 restaurants (20 per page × 20 pages) are collected per sub-area. Reviews are then fetched 100 per page based on the review count obtained from the listing; however, that count includes lunch reviews while the loop only collects dinner reviews, so some requested review pages turned out not to exist. That case is handled by
if "I can't find the page I'm looking for" in soup_pg.find('title').text:
    continue
Also, most restaurants list a nearest station, but a few list a district instead of a station, and in that case the tag name differs. That is handled by
try:
    station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('span', class_='linktree__parent-target-text').text
except:
    try:
        station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('dd', class_='rdheader-subinfo__item-text').text.replace('\n','').replace(' ','')
    except:
        station = ''
As a result, we obtained 895 rows of data, of which 804 had at least one scored review.
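The analysis itself is a separate step, but roughly speaking the scatter plot can be produced along the following lines (a sketch assuming matplotlib is available; the exact aggregation used for the original plot may differ):
import matplotlib.pyplot as plt
import pandas as pd

# column order follows the order in which restaurant_data is appended above
df = pd.DataFrame(restaurant_data,
                  columns=['area', 'rc_count', 'rc_name', 'rc_url', 'rc_score',
                           'rc_review_num', 'station', 'genre', 'price', 'score'])
# several rows (one per review page) can belong to the same restaurant, so aggregate by URL
mean_review = (df.explode('score')
                 .assign(score=lambda d: pd.to_numeric(d['score'], errors='coerce'))
                 .groupby('rc_url')['score'].mean())
tabelog_score = df.groupby('rc_url')['rc_score'].first().replace(-1.0, 2.9)  # missing score -> 2.9
plt.scatter(mean_review, tabelog_score.loc[mean_review.index])
plt.xlabel('mean dinner review score')
plt.ylabel('tabelog score')
plt.show()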
The scatter plot above shows the relationship between the average review score and the tabelog score. Restaurants without a tabelog score are plotted as 2.9 points. Overall, restaurants with a higher average review score tend to have a higher tabelog score. At the same time, there are restaurants whose average review score is high but whose tabelog score is low, in other words restaurants that appear to be underrated.
So next time, I'm going to visit one of these underrated restaurants!