Looking at the URL of each restaurant, in the case of Tokyo it takes the form "//tabelog.com/tokyo/A..../A....../......../". For example, a restaurant in Shibuya has a URL like "//tabelog.com/tokyo/A1303/A130301/......../", which can be read as "Tabelog / Tokyo / (Shibuya, Ebisu, Daikanyama) / Shibuya / specific restaurant /". On Tabelog's top page, the "Search by area" section holds the data down to the "//tabelog.com/tokyo/A..../" level, so I will get this first. We obviously don't need data for the whole country, so after narrowing things down to a reasonably large area, I will fetch only the data I actually want.
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
root_url = 'https://tabelog.com/'
res = requests.get(root_url)
soup = BeautifulSoup(res.content, 'html.parser')
Parsing the page with BeautifulSoup like this and looking through the result, I found the following:
<h2 class="rsttop-heading1 rsttop-search__title">
Search by area
</h2>
</div>
<ul class="rsttop-area-search__list">
<li class="rsttop-area-search__item">
<a class="rsttop-area-search__target js-area-swicher-target" data-swicher-area-list='[{"areaName":"Ginza / Shinbashi / Yurakucho","url":"/tokyo/A1301/"},{"areaName":"Nihonbashi, Tokyo","url":"/tokyo/A1302/"},{"areaName":"Shibuya / Ebisu / Daikanyama","url":"/tokyo/A1303/"},...
↑ The data we want is right around here!
a = soup.find_all('a', class_='rsttop-area-search__target js-area-swicher-target')
a[0].get('data-swicher-area-list')
Running this returns something like
'[{"areaName":"Ginza / Shinbashi / Yurakucho","url":"/tokyo/A1301/"},{"areaName":"Nihonbashi, Tokyo","url":"/tokyo/A1302/"},{"areaName":"Shibuya / Ebisu / Daikanyama","url":"/tokyo/A1303/"},...
and so on. I had expected a list of dictionaries, but it is actually just one long string. I looked for a clean way to handle it but couldn't find one, so, although it is not pretty, I force it into the form I need below (a possible cleaner alternative using the json module is sketched after the output). If you know a smoother way to handle this part, please let me know!
splitted = a[0].get('data-swicher-area-list').split('"')
area_dict = {}
for i in range(int((len(splitted)-1)/8)):
    area_dict[splitted[i*8+3]] = splitted[i*8+7]
With this, I managed to get the following dictionary.
{'Ueno / Asakusa / Nippori': '/tokyo/A1311/',
'Ryogoku, Kinshicho, Koiwa': '/tokyo/A1312/',
'Nakano-Nishi-Ogikubo': '/tokyo/A1319/',...
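Incidentally, since the value of data-swicher-area-list is itself a JSON array, a cleaner alternative would probably be to parse it with the standard json module instead of splitting the string. A minimal sketch (untested against the live page, reusing the tag list a obtained above):
import json
# data-swicher-area-list is a JSON array of {"areaName": ..., "url": ...} objects
area_list = json.loads(a[0].get('data-swicher-area-list'))
area_dict = {entry['areaName']: entry['url'] for entry in area_list}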
Honestly, Tokyo alone would be enough for me, but if you want to collect every prefecture, it goes as follows.
area_url = {}
for area in a:
    area_dict = {}
    splitted = area.get('data-swicher-area-list').split('"')
    for i in range(int((len(splitted)-1)/8)):
        area_dict[splitted[i*8+3]] = splitted[i*8+7]
    area_url[area.get('data-swicher-city').split('"')[3]] = area_dict
One thing that caught my attention along the way: len(a) = 53 but len(area_url) = 47. Looking into it, the cause was that Tokyo, Kanagawa, Aichi, Osaka, Kyoto, and Fukuoka each appear twice; since the content is identical for the parts I need, I judged that the code above still achieves its purpose. The URLs are obtained in the following form.
area_url
│
├──'Tokyo'
│ ├──'Ueno / Asakusa / Nippori' : '/tokyo/A1311/'
│ ├──'Ryogoku, Kinshicho, Koiwa' : '/tokyo/A1312/'
│ ⋮
│ └──'Ginza / Shinbashi / Yurakucho' : '/tokyo/A1301/'
│
├──'Kanagawa'
│ ├──'Around Odawara' : '/kanagawa/A1409/'
│ ⋮
⋮
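So, for example, pulling out a single sub-area URL works like this:
area_url['Tokyo']['Shibuya / Ebisu / Daikanyama']
# -> '/tokyo/A1303/'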
Now that we have the major area classification, the next step is to get the minor one. In the same way as before:
url = '/tokyo/A1304/'
res = requests.get(root_url + url[1:])
soup = BeautifulSoup(res.content, 'html.parser')
a = soup.find_all('a', class_='c-link-arrow')
area_dict = {}
for area in a:
    href = area['href']
    if href[-21:-8] != url:
        continue
    else:
        area_dict[area.text] = href
Running this gives
{'Yoyogi': 'https://tabelog.com/tokyo/A1304/A130403/',
'Okubo / Shin-Okubo': 'https://tabelog.com/tokyo/A1304/A130404/',
'Shinjuku': 'https://tabelog.com/tokyo/A1304/A130401/',
'Shinjuku Gyoen': 'https://tabelog.com/tokyo/A1304/A130402/'}
Looks good. The if statement was added because soup.find_all('a', class_='c-link-arrow') also picks up some advertisement links that carry class="c-link-arrow"; the check filters those out.
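To make the somewhat cryptic slice concrete: for a genuine sub-area link, href[-21:-8] cuts out exactly the 13-character parent-area portion of the URL, which is what gets compared against url.
href = 'https://tabelog.com/tokyo/A1304/A130403/'
href[-21:-8]
# -> '/tokyo/A1304/'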
Next, specify the areas we actually want to visit and get their URLs.
visit_areas = ['Roppongi / Azabu / Hiroo', 'Harajuku / Omotesando / Aoyama', 'Yotsuya / Ichigaya / Iidabashi', 'Shinjuku / Yoyogi / Okubo',
'Nihonbashi, Tokyo', 'Shibuya / Ebisu / Daikanyama', 'Meguro / Platinum / Gotanda', 'Akasaka / Nagatacho / Tameike', 'Ginza / Shinbashi / Yurakucho']
url_dict = {}
for visit_area in visit_areas:
    url = area_url['Tokyo'][visit_area]
    time.sleep(1)
    res = requests.get(root_url + url[1:])
    soup = BeautifulSoup(res.content, 'html.parser')
    a = soup.find_all('a', class_='c-link-arrow')
    for area in a:
        href = area['href']
        if href[-21:-8] != url:
            continue
        else:
            url_dict[area.text] = href
We succeeded in getting the URLs of 34 sub-areas, in the following form!
{'Marunouchi / Otemachi': 'https://tabelog.com/tokyo/A1302/A130201/',
'Kudanshita': 'https://tabelog.com/tokyo/A1309/A130906/',...
Now that we have the URLs that point to each sub-area ("//tabelog.com/tokyo/A..../A....../"), the next step is to get the URLs of the individual restaurants ("//tabelog.com/tokyo/A..../A....../......../").
url = 'https://tabelog.com/tokyo/A1302/A130201/'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
The above URL shows the first 20 restaurants in this area. The next 20 are at "//tabelog.com/tokyo/A1302/A130201/rstLst/2/", and you can keep going with rstLst/3, 4, 5, .... Since the loop needs the maximum value of rstLst, the total number of restaurants is divided by 20 and rounded up to an integer, as shown below.
import math
count = soup.find_all('span', class_='list-condition__count')
print(math.ceil(int(count[0].text)/20))
90
There are 1,784 restaurants in total, so with 20 per page the last page should be page 90. However, when I tried to display page 90, the message
Unable to display this page Thank you for using Tabelog. It cannot be displayed after page 60. Please narrow down the conditions and search again.
was displayed! So apparently only up to 60 pages can be shown. That means I have to either narrow the search conditions beyond just the area so that each query returns 1,200 restaurants or fewer before looping, or settle for the top 1,200 sorted by newest opening. Either way, let's first check how many restaurants are listed in each area.
counts = {}
for key, value in url_dict.items():
    time.sleep(1)
    res = requests.get(value)
    soup = BeautifulSoup(res.content, 'html.parser')
    counts[key] = int(soup.find_all('span', class_='list-condition__count')[0].text)
print(sorted(counts.items(), key=lambda x: x[1], reverse=True)[:15])
[('Shinjuku', 5756),
('Shibuya', 3420),
('Shimbashi / Shiodome', 2898),
('Ginza', 2858),
('Roppongi / Nogizaka / Nishiazabu', 2402),
('Marunouchi / Otemachi', 1784),
('Iidabashi / Kagurazaka', 1689),
('Ebisu', 1584),
('Nihonbashi / Kyobashi', 1555),
('Akasaka', 1464),
('Ningyocho / Kodenmacho', 1434),
('Gotanda / Takanawadai', 937),
('Yurakucho / Hibiya', 773),
('Tameike Sanno / Kasumigaseki', 756),
('Kayabacho / Hatchobori', 744)]
This made it clear that 11 areas list more than 1,200 restaurants. After some trial and error over what to do, I decided to limit the genre to restaurants (which fits the purpose anyway) and to be content with the top 1,200 entries sorted by newest opening. First, let's get the restaurant information shown on one specific page.
url = 'https://tabelog.com/tokyo/A1301/A130101/rstLst/RC/1/?Srt=D&SrtT=nod'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
rc_list = []
for rc_div in soup.find_all('div', class_='list-rst__wrap js-open-new-window'):
    rc_name = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name').text
    rc_url = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name')['href']
    rc_score = rc_div.find('span', class_='c-rating__val c-rating__val--strong list-rst__rating-val')
    if rc_score is None:
        rc_score = -1.
    else:
        rc_score = float(rc_score.text)
    rc_review_num = rc_div.find('em', class_='list-rst__rvw-count-num cpy-review-count').text
    rc_list.append([rc_name, rc_url, rc_score, rc_review_num])
I will add some explanation. First, in the URL, '/rstLst/RC' restricts the genre to restaurants. The '/1' that follows means the first page, i.e. the first 20 entries. Finally, '/?Srt=D&SrtT=nod' specifies sorting by newest opening. The for statement then processes the 20 restaurants on the page in order. The tabelog score needs a little care: it can be fetched with the find call above, but when a restaurant has no score, the tag itself does not exist. That is why the 'is None' check is there, and a missing score is temporarily set to -1. As for the number of reviews, ' - ' is returned when there are no reviews. With this in place, the URL of every restaurant can be obtained by looping over each area and each page!
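For concreteness, the listing URL for an arbitrary page is assembled from those pieces like this (sub_area_url and page are just illustrative names; the pattern is the same one the full loop uses later):
sub_area_url = 'https://tabelog.com/tokyo/A1301/A130101/'  # a sub-area URL as obtained above
page = 1                                                   # 20 restaurants per page
listing_url = sub_area_url + 'rstLst/RC/' + str(page) + '/?Srt=D&SrtT=nod'
# -> 'https://tabelog.com/tokyo/A1301/A130101/rstLst/RC/1/?Srt=D&SrtT=nod'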
Now that we can get the URL of each restaurant, the goal is to pull the review information from each restaurant's page. Code first, it looks like this:
url = 'https://tabelog.com/tokyo/A1301/A130101/13079232/dtlrvwlst/COND-2/smp0/?smp=0&lc=2&rvw_part=all&PG=1'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
station = soup.find_all('dl', class_='rdheader-subinfo__item')[0].find('span', class_='linktree__parent-target-text').text
genre = '/'.join([genre_.text for genre_ in soup.find_all('dl', class_='rdheader-subinfo__item')[1].find_all('span', class_='linktree__parent-target-text')])
price = soup.find_all('dl', class_='rdheader-subinfo__item')[2].find('p', class_='rdheader-budget__icon rdheader-budget__icon--dinner').find('a', class_='rdheader-budget__price-target').text
score = [score_.next_sibling.next_sibling.text for score_ in soup.find_all('span', class_='c-rating__time c-rating__time--dinner')]
print(station, genre, price, score)
Some commentary. First, in the URL, '/dtlrvwlst' is the review list, '/COND-2' means dinner reviews only, 'smp0' is the simple display, 'lc=2' shows 100 reviews per page, and 'PG=1' is page 1. The nearest station, genre, and budget are picked up in that order because the data sits in 'dl' tags with the class name 'rdheader-subinfo__item'. As for genre, most restaurants have several genres assigned, so all genre names are joined with '/'. The budget and the score of each review are a bit more involved because I only wanted the dinner-time values.
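In other words, a review-list URL is put together from a restaurant URL plus these parameters, which is how the full loop below builds it:
rc_url = 'https://tabelog.com/tokyo/A1301/A130101/13079232/'  # an individual restaurant URL
page = 1
review_url = rc_url + 'dtlrvwlst/COND-2/smp0/?smp=0&lc=2&rvw_part=all&PG=' + str(page)
# dtlrvwlst: review list, COND-2: dinner reviews, smp0 / smp=0: simple display,
# lc=2: 100 reviews per page, PG: page number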
Now that each piece of information can be obtained individually, all that's left is to collect the data we want with a loop!
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import math
import time
from tqdm import tqdm

# get the large-area URLs from the top page
root_url = 'https://tabelog.com/'
res = requests.get(root_url)
soup = BeautifulSoup(res.content, 'html.parser')
a = soup.find_all('a', class_='rsttop-area-search__target js-area-swicher-target')
area_url = {}
for area in a:
    area_dict = {}
    splitted = area.get('data-swicher-area-list').split('"')
    for i in range(int((len(splitted)-1)/8)):
        area_dict[splitted[i*8+3]] = splitted[i*8+7]
    area_url[area.get('data-swicher-city').split('"')[3]] = area_dict

# resolve the sub-areas within the areas we want to visit
visit_areas = ['Shibuya / Ebisu / Daikanyama']
url_dict = {}
for visit_area in visit_areas:
    url = area_url['Tokyo'][visit_area]
    time.sleep(1)
    res = requests.get(root_url + url[1:])
    soup = BeautifulSoup(res.content, 'html.parser')
    a = soup.find_all('a', class_='c-link-arrow')
    for area in a:
        href = area['href']
        if href[-21:-8] != url:
            continue
        else:
            url_dict[area.text] = href

# crawl the restaurant listings and their dinner review pages
max_page = 20
restaurant_data = []
for area, url in url_dict.items():
    time.sleep(1)
    res_area = requests.get(url)
    soup_area = BeautifulSoup(res_area.content, 'html.parser')
    rc_count = int(soup_area.find_all('span', class_='list-condition__count')[0].text)
    print('There are ' + str(rc_count) + ' restaurants in ' + area)
    for i in tqdm(range(1, min(math.ceil(rc_count/20)+1, max_page+1, 61))):
        url_rc = url + 'rstLst/RC/' + str(i) + '/?Srt=D&SrtT=nod'
        res_rc = requests.get(url_rc)
        soup_rc = BeautifulSoup(res_rc.content, 'html.parser')
        for rc_div in soup_rc.find_all('div', class_='list-rst__wrap js-open-new-window'):
            rc_name = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name').text
            rc_url = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name')['href']
            rc_score = rc_div.find('span', class_='c-rating__val c-rating__val--strong list-rst__rating-val')
            if rc_score is None:
                rc_score = -1.
            else:
                rc_score = float(rc_score.text)
            rc_review_num = rc_div.find('em', class_='list-rst__rvw-count-num cpy-review-count').text
            if rc_review_num != ' - ':
                for page in range(1, math.ceil(int(rc_review_num)/100)+1):
                    rc_url_pg = rc_url + 'dtlrvwlst/COND-2/smp0/?smp=0&lc=2&rvw_part=all&PG=' + str(page)
                    time.sleep(1)
                    res_pg = requests.get(rc_url_pg)
                    soup_pg = BeautifulSoup(res_pg.content, 'html.parser')
                    if "I can't find the page I'm looking for" in soup_pg.find('title').text:
                        continue
                    try:
                        station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('span', class_='linktree__parent-target-text').text
                    except:
                        try:
                            station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('dd', class_='rdheader-subinfo__item-text').text.replace('\n','').replace(' ','')
                        except:
                            station = ''
                    genre = '/'.join([genre_.text for genre_ in soup_pg.find_all('dl', class_='rdheader-subinfo__item')[1].find_all('span', class_='linktree__parent-target-text')])
                    price = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[2].find('p', class_='rdheader-budget__icon rdheader-budget__icon--dinner').find('a', class_='rdheader-budget__price-target').text
                    score = [score_.next_sibling.next_sibling.text for score_ in soup_pg.find_all('span', class_='c-rating__time c-rating__time--dinner')]
                    restaurant_data.append([area, rc_count, rc_name, rc_url, rc_score, rc_review_num, station, genre, price, score])
Because of the sheer amount of data, the target was temporarily limited to visit_areas = ['Shibuya / Ebisu / Daikanyama']. Also, since max_page = 20, at most 400 restaurants (20 per page × 20 pages) are collected per sub-area. Reviews are then fetched 100 per page based on the review count obtained from the listing; however, that count includes lunch reviews while the loop only collects dinner reviews, so some requested review pages turned out not to exist. That case is handled by
if "I can't find the page I'm looking for" in soup_pg.find('title').text:
    continue
Also, most restaurants list a nearest station, but a few list a district instead of a station, and in that case the tag name differs. That is handled by
try:
    station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('span', class_='linktree__parent-target-text').text
except:
    try:
        station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('dd', class_='rdheader-subinfo__item-text').text.replace('\n','').replace(' ','')
    except:
        station = ''
As a result, we obtained 895 rows of data, of which 804 had at least one scored review.
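The analysis itself is a separate step, but roughly speaking the scatter plot can be produced along the following lines (a sketch assuming matplotlib is available; the exact aggregation used for the original plot may differ):
import matplotlib.pyplot as plt
import pandas as pd

# column order follows the order in which restaurant_data is appended above
df = pd.DataFrame(restaurant_data,
                  columns=['area', 'rc_count', 'rc_name', 'rc_url', 'rc_score',
                           'rc_review_num', 'station', 'genre', 'price', 'score'])
# several rows (one per review page) can belong to the same restaurant, so aggregate by URL
mean_review = (df.explode('score')
                 .assign(score=lambda d: pd.to_numeric(d['score'], errors='coerce'))
                 .groupby('rc_url')['score'].mean())
tabelog_score = df.groupby('rc_url')['rc_score'].first().replace(-1.0, 2.9)  # missing score -> 2.9
plt.scatter(mean_review, tabelog_score.loc[mean_review.index])
plt.xlabel('mean dinner review score')
plt.ylabel('tabelog score')
plt.show()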
The scatter plot above shows the relationship between the average review score and the tabelog score. Restaurants without a tabelog score are plotted as 2.9 points. Overall, restaurants with a higher average review score tend to have a higher tabelog score. At the same time, there are restaurants whose average review score is high but whose tabelog score is low, in other words restaurants that appear to be underrated.
So next time, I'm going to visit one of these underrated restaurants!