[PYTHON] Scraping and tabelog ~ I want to find a good restaurant! ~ (Work)

1. Motivation

2. About tabelog

3. Scraping: acquiring area URLs

Looking at the URL of each restaurant, for Tokyo it takes the form "//tabelog.com/tokyo/A..../A....../......../". For example, the URL of a restaurant in Shibuya is "//tabelog.com/tokyo/A1303/A130301/......../", which can be read as "Tabelog / Tokyo / Shibuya-Ebisu-Daikanyama / Shibuya / specific restaurant /". On the Tabelog top page, the "Search by area" section holds data down to the "//tabelog.com/tokyo/A..../" level, so I will get that first. Of course I don't need data for the whole country, so after narrowing down to a reasonably large area I will only fetch the data I actually want.
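For illustration, here is how a restaurant URL decomposes into those parts (the restaurant ID 13001234 in this example is made up):

# Illustration only: splitting a restaurant URL into its parts (the ID 13001234 is a made-up example)
url = 'https://tabelog.com/tokyo/A1303/A130301/13001234/'
pref, major_area, minor_area, rst_id = url.split('/')[3:7]
print(pref, major_area, minor_area, rst_id)   # tokyo A1303 A130301 13001234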

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

root_url = 'https://tabelog.com/'
res = requests.get(root_url)
soup = BeautifulSoup(res.content, 'html.parser')

I parsed the page with BeautifulSoup like this, and when I looked inside I found:

<h2 class="rsttop-heading1 rsttop-search__title">
Search by area
                  </h2>
</div>
<ul class="rsttop-area-search__list">
<li class="rsttop-area-search__item">
<a class="rsttop-area-search__target js-area-swicher-target" data-swicher-area-list='[{"areaName":"Ginza / Shinbashi / Yurakucho","url":"/tokyo/A1301/"},{"areaName":"Nihonbashi, Tokyo","url":"/tokyo/A1302/"},{"areaName":"Shibuya / Ebisu / Daikanyama","url":"/tokyo/A1303/"},...

↑ This is the data I want!

a = soup.find_all('a', class_='rsttop-area-search__target js-area-swicher-target')
a[0].get('data-swicher-area-list')

Running this gives:

'[{"areaName":"Ginza / Shinbashi / Yurakucho","url":"/tokyo/A1301/"},{"areaName":"Nihonbashi, Tokyo","url":"/tokyo/A1302/"},{"areaName":"Shibuya / Ebisu / Daikanyama","url":"/tokyo/A1303/"},...

I expected this to be a list of dictionaries, but it turned out to be one plain string. I looked for a clean way to handle it but couldn't find one, so I will force it into the shape I need, even though it's not pretty. If you know a smoother way to handle this step, please let me know!

# Split the attribute string on '"': in every group of 8 tokens, the area name
# sits at offset 3 and its URL at offset 7.
splitted = a[0].get('data-swicher-area-list').split('"')
area_dict = {}
for i in range(int((len(splitted)-1)/8)):
    area_dict[splitted[i*8+3]] = splitted[i*8+7]

With this, I managed to get the following dictionary.

{'Ueno / Asakusa / Nippori': '/tokyo/A1311/',
 'Ryogoku, Kinshicho, Koiwa': '/tokyo/A1312/',
 'Nakano-Nishi-Ogikubo': '/tokyo/A1319/',...
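As an aside, the attribute value looks like JSON, so it might be possible to parse it directly instead of splitting on quotes. A minimal sketch, under the (untested) assumption that the string really is valid JSON:

import json

# Assumes data-swicher-area-list is valid JSON; if so, this replaces the manual split above
area_list = json.loads(a[0].get('data-swicher-area-list'))
area_dict = {d['areaName']: d['url'] for d in area_list}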

Honestly, Tokyo alone would be enough for me, but to collect everything comprehensively it looks like this:

area_url = {}
for area in a:
    area_dict = {}
    splitted = area.get('data-swicher-area-list').split('"')
    for i in range(int((len(splitted)-1)/8)):
        area_dict[splitted[i*8+3]] = splitted[i*8+7]
    # 'data-swicher-city' holds the prefecture name in the same quoted format
    area_url[area.get('data-swicher-city').split('"')[3]] = area_dict

One thing that caught my attention along the way: len(a) = 53 but len(area_url) = 47. Looking into it, the cause was that Tokyo, Kanagawa, Aichi, Osaka, Kyoto, and Fukuoka each appear twice; since the duplicated entries were identical for the parts I need, I judged that the code above still achieves its purpose. The URLs are obtained in the following form:

area_url
  │
  ├──'Tokyo'
  │    ├──'Ueno / Asakusa / Nippori' : '/tokyo/A1311/'
  │    ├──'Ryogoku, Kinshicho, Koiwa' : '/tokyo/A1312/'
  │          ⋮
  │    └──'Ginza / Shinbashi / Yurakucho' : '/tokyo/A1301/'
  │
  ├──'Kanagawa'
  │    ├──'Around Odawara' : '/kanagawa/A1409/'
  │          ⋮
  ⋮          
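A sub-area URL can then be looked up directly, for example:

print(area_url['Tokyo']['Shibuya / Ebisu / Daikanyama'])   # '/tokyo/A1303/'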

Now that the major area classification has been obtained, the next step is to get the minor classification. In the same way as for the major classification:

url = '/tokyo/A1304/'
res = requests.get(root_url + url[1:])
soup = BeautifulSoup(res.content, 'html.parser')
a = soup.find_all('a', class_='c-link-arrow')
area_dict = {}
for area in a:
    href = area['href']
    # keep only links whose parent area path matches url (this filters out ad links; see below)
    if href[-21:-8]!=url:
        continue
    else:
        area_dict[area.text] = href

Running this gives:

{'Yoyogi': 'https://tabelog.com/tokyo/A1304/A130403/',
 'Okubo / Shin-Okubo': 'https://tabelog.com/tokyo/A1304/A130404/',
 'Shinjuku': 'https://tabelog.com/tokyo/A1304/A130401/',
 'Shinjuku Gyoen': 'https://tabelog.com/tokyo/A1304/A130402/'}

Looks good. The if statement is there because some advertisement links also carry class="c-link-arrow" and get picked up by soup.find_all('a', class_='c-link-arrow'); the slice comparison filters those out.
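To see why the slice works: for a genuine sub-area link the last 21 characters are the '/tokyo/A1304/A130403/' part, so href[-21:-8] recovers the 13-character parent path, which must equal url. For example:

href = 'https://tabelog.com/tokyo/A1304/A130403/'
print(href[-21:-8])   # '/tokyo/A1304/' -> equals url, so this link is a sub-area rather than an ad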

Next, specify the areas I want to visit and get the URLs of the sub-areas within them.

visit_areas = ['Roppongi / Azabu / Hiroo', 'Harajuku / Omotesando / Aoyama', 'Yotsuya / Ichigaya / Iidabashi', 'Shinjuku / Yoyogi / Okubo', 
               'Nihonbashi, Tokyo', 'Shibuya / Ebisu / Daikanyama', 'Meguro / Platinum / Gotanda', 'Akasaka / Nagatacho / Tameike', 'Ginza / Shinbashi / Yurakucho']
url_dict = {}
for visit_area in visit_areas:
    url = area_url['Tokyo'][visit_area]
    time.sleep(1)
    res = requests.get(root_url + url[1:])
    soup = BeautifulSoup(res.content, 'html.parser')
    a = soup.find_all('a', class_='c-link-arrow')
    for area in a:
        href = area['href']
        if href[-21:-8]!=url:
            continue
        else:
            url_dict[area.text] = href

I succeeded in getting the URLs of 34 areas in the following form!

{'Marunouchi / Otemachi': 'https://tabelog.com/tokyo/A1302/A130201/',
 'Kudanshita': 'https://tabelog.com/tokyo/A1309/A130906/',...

4. Scraping: acquiring individual restaurant URLs

Now that we have the URLs pointing to each area ("//tabelog.com/tokyo/A..../A....../"), the next step is to get the URLs of individual restaurants ("//tabelog.com/tokyo/A..../A....../......../").

url = 'https://tabelog.com/tokyo/A1302/A130201/'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

The URL above shows the first 20 restaurants in this area. The next 20 are at "//tabelog.com/tokyo/A1302/A130201/rstLst/2/", and you can keep going with rstLst/3, 4, 5, and so on. Since the maximum value of rstLst is needed for the loop, I divide the total number of restaurants by 20 and round up to an integer, as below.

import math
count = soup.find_all('span', class_='list-condition__count')
print(math.ceil(int(count[0].text)/20))
90

There are 1,784 restaurants in total, so at 20 per page the last page should be page 90. However, when I tried to display page 90...

Unable to display this page. Thank you for using Tabelog. Pages beyond page 60 cannot be displayed. Please narrow down your conditions and search again.

this message was displayed! Apparently only the first 60 pages can be shown. So I had to choose: either narrow the search conditions beyond the area so that each list contains at most 1,200 restaurants before looping, or settle for the top 1,200 sorted by newest opening. Either way, let's first check how many restaurants are listed in each area.

counts = {}
for key,value in url_dict.items():
    time.sleep(1)
    res = requests.get(value)
    soup = BeautifulSoup(res.content, 'html.parser')
    counts[key] = int(soup.find_all('span', class_='list-condition__count')[0].text)
print(sorted(counts.items(), key=lambda x:x[1], reverse=True)[:15])
[('Shinjuku', 5756),
 ('Shibuya', 3420),
 ('Shimbashi / Shiodome', 2898),
 ('Ginza', 2858),
 ('Roppongi / Nogizaka / Nishiazabu', 2402),
 ('Marunouchi / Otemachi', 1784),
 ('Iidabashi / Kagurazaka', 1689),
 ('Ebisu', 1584),
 ('Nihonbashi / Kyobashi', 1555),
 ('Akasaka', 1464),
 ('Ningyocho / Kodenmacho', 1434),
 ('Gotanda / Takanawadai', 937),
 ('Yurakucho / Hibiya', 773),
 ('Tameike Sanno / Kasumigaseki', 756),
 ('Kayabacho / Hatchobori', 744)]

This makes it clear that 11 areas have more than 1,200 listings. After some trial and error on what to do, I decided to limit the genre to restaurants (which fits my purpose anyway) and settle for the top 1,200 per area sorted by newest opening. First, let's get the restaurant information displayed on a specific page.

url = 'https://tabelog.com/tokyo/A1301/A130101/rstLst/RC/1/?Srt=D&SrtT=nod'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

rc_list = []
for rc_div in soup.find_all('div', class_='list-rst__wrap js-open-new-window'):
    rc_name = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name').text
    rc_url = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name')['href']
    rc_score = rc_div.find('span', class_='c-rating__val c-rating__val--strong list-rst__rating-val')
    if rc_score is None:
        rc_score = -1.
    else:
        rc_score = float(rc_score.text)
    rc_review_num = rc_div.find('em', class_='list-rst__rvw-count-num cpy-review-count').text
    rc_list.append([rc_name, rc_url, rc_score, rc_review_num])

A few notes. First, in the URL, '/rstLst/RC' restricts the genre to restaurants, the '/1' after it means the first page (i.e. the first 20 listings), and '?Srt=D&SrtT=nod' sorts by newest opening. The for statement then processes the 20 restaurant entries in order. The Tabelog score needs some care: it can be fetched with the find call above, but when a restaurant has no score the tag itself does not exist, so an `is None` check separates the two cases and a missing score is temporarily set to -1. For the review count, '-' is returned when there are no reviews. With this, the restaurant URLs can be collected by looping over every area and every page!
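For reference, the listing URL for a given area and page can be built with a small helper like this (a hypothetical helper, not part of the original code, just a sketch of the loop to come):

# Hypothetical helper: build the restaurant-only listing URL for one page of an area,
# sorted by newest opening
def listing_url(area_url, page):
    return area_url + 'rstLst/RC/' + str(page) + '/?Srt=D&SrtT=nod'

print(listing_url('https://tabelog.com/tokyo/A1301/A130101/', 1))
# https://tabelog.com/tokyo/A1301/A130101/rstLst/RC/1/?Srt=D&SrtT=nod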

5. Scraping: acquiring reviews and ratings

Now that the URL of each restaurant can be obtained, the goal is to get the review information from each restaurant's page. Code first, it looks like this:

url = 'https://tabelog.com/tokyo/A1301/A130101/13079232/dtlrvwlst/COND-2/smp0/?smp=0&lc=2&rvw_part=all&PG=1'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

station = soup.find_all('dl', class_='rdheader-subinfo__item')[0].find('span', class_='linktree__parent-target-text').text
genre = '/'.join([genre_.text for genre_ in soup.find_all('dl', class_='rdheader-subinfo__item')[1].find_all('span', class_='linktree__parent-target-text')])
price = soup.find_all('dl', class_='rdheader-subinfo__item')[2].find('p', class_='rdheader-budget__icon rdheader-budget__icon--dinner').find('a', class_='rdheader-budget__price-target').text
score = [score_.next_sibling.next_sibling.text for score_ in soup.find_all('span', class_='c-rating__time c-rating__time--dinner')]
print(station, genre, price, score)

Some commentary. In the URL, '/dtlrvwlst' is the review list, '/COND-2' selects nighttime (dinner) reviews, 'smp0' is the simple display, 'lc=2' shows 100 reviews per page, and 'PG=1' means page 1. The nearest station, genre, and budget are picked out in that order because they all sit in 'dl' tags with the class name 'rdheader-subinfo__item'. For the genre, most restaurants have several genres assigned, so all genre names are joined with '/'. The budget and the per-review scores are a little more involved because I only wanted the dinner values.
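For example, the per-review dinner scores collected in score could be reduced to a single page average like this (a sketch; it assumes unscored reviews show up as '-'):

# Sketch: average the dinner scores on this page (assumes unscored entries appear as '-')
dinner_scores = [float(s) for s in score if s.strip() not in ('', '-')]
avg_dinner = sum(dinner_scores) / len(dinner_scores) if dinner_scores else None
print(avg_dinner)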

6. Run it!!

Now that each piece of information can be obtained on its own, all that remains is to loop and collect the data we want!

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import math
import time
from tqdm import tqdm

root_url = 'https://tabelog.com/'
res = requests.get(root_url)
soup = BeautifulSoup(res.content, 'html.parser')
a = soup.find_all('a', class_='rsttop-area-search__target js-area-swicher-target')

area_url = {}
for area in a:
    area_dict = {}
    splitted = area.get('data-swicher-area-list').split('"')
    for i in range(int((len(splitted)-1)/8)):
        area_dict[splitted[i*8+3]] = splitted[i*8+7]
    area_url[area.get('data-swicher-city').split('"')[3]] = area_dict

visit_areas = ['Shibuya / Ebisu / Daikanyama']
url_dict = {}
for visit_area in visit_areas:
    url = area_url['Tokyo'][visit_area]
    time.sleep(1)
    res = requests.get(root_url + url[1:])
    soup = BeautifulSoup(res.content, 'html.parser')
    a = soup.find_all('a', class_='c-link-arrow')
    for area in a:
        href = area['href']
        if href[-21:-8]!=url:
            continue
        else:
            url_dict[area.text] = href

max_page = 20
restaurant_data = []
for area, url in url_dict.items():
    time.sleep(1)
    res_area = requests.get(url)
    soup_area = BeautifulSoup(res_area.content, 'html.parser')
    rc_count = int(soup_area.find_all('span', class_='list-condition__count')[0].text)
    print('There are ' + str(rc_count) + ' restaurants in ' + area)
    for i in tqdm(range(1,min(math.ceil(rc_count/20)+1,max_page+1,61))):
        url_rc = url + 'rstLst/RC/' + str(i) + '/?Srt=D&SrtT=nod'
        res_rc = requests.get(url_rc)
        soup_rc = BeautifulSoup(res_rc.content, 'html.parser')
        for rc_div in soup_rc.find_all('div', class_='list-rst__wrap js-open-new-window'):
            rc_name = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name').text
            rc_url = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name')['href']
            rc_score = rc_div.find('span', class_='c-rating__val c-rating__val--strong list-rst__rating-val')
            if rc_score is None:
                rc_score = -1.
            else:
                rc_score = float(rc_score.text)
            rc_review_num = rc_div.find('em', class_='list-rst__rvw-count-num cpy-review-count').text
            if rc_review_num != ' - ':
                for page in range(1,math.ceil(int(rc_review_num)/100)+1):
                    rc_url_pg = rc_url + 'dtlrvwlst/COND-2/smp0/?smp=0&lc=2&rvw_part=all&PG=' + str(page)
                    time.sleep(1)
                    res_pg = requests.get(rc_url_pg)
                    soup_pg = BeautifulSoup(res_pg.content, 'html.parser')
                    if "I can't find the page I'm looking for" in soup_pg.find('title').text:
                        continue
                    try:
                        station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('span', class_='linktree__parent-target-text').text
                    except:
                        try:
                            station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('dd', class_='rdheader-subinfo__item-text').text.replace('\n','').replace(' ','')
                        except:
                            station = ''
                    genre = '/'.join([genre_.text for genre_ in soup_pg.find_all('dl', class_='rdheader-subinfo__item')[1].find_all('span', class_='linktree__parent-target-text')])
                    price = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[2].find('p', class_='rdheader-budget__icon rdheader-budget__icon--dinner').find('a', class_='rdheader-budget__price-target').text
                    score = [score_.next_sibling.next_sibling.text for score_ in soup_pg.find_all('span', class_='c-rating__time c-rating__time--dinner')]
                    restaurant_data.append([area, rc_count, rc_name, rc_url, rc_score, rc_review_num, station, genre, price, score])

Because of the large amount of data, the areas were temporarily limited to visit_areas = ['Shibuya / Ebisu / Daikanyama']. Also, since max_page = 20, at most 400 restaurants (20 per page × 20 pages) are collected per sub-area. In addition, reviews are fetched 100 per page based on the review count obtained earlier, but that count includes daytime (lunch) reviews while the loop only targets nighttime (dinner) ones, so sometimes there are fewer review pages than the count suggests. Such pages are handled by:

if "I can't find the page I'm looking for" in soup_pg.find('title').text:
    continue

which simply skips them. Also, most restaurants list their nearest station, but a few list an area name instead, and in that case the tag names differ. This is handled by:

try:
    station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('span', class_='linktree__parent-target-text').text
except:
    try:
        station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('dd', class_='rdheader-subinfo__item-text').text.replace('\n','').replace(' ','')
    except:
        station = ''

In the end, I obtained 895 records, of which 804 had at least one scored review.

[Figure: save.png — scatter plot of average review score vs. Tabelog score]

The scatter plot above shows the relationship between the average review score and the Tabelog score. Restaurants that have no Tabelog score are plotted at 2.9 points. Overall, restaurants with a high average review score tend to have a high Tabelog score as well. It also turns out that there are underrated restaurants whose average review score is high but whose Tabelog score is low.
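As a rough sketch of how the plot data could be assembled from restaurant_data (this assumes the column layout of the loop above and that matplotlib is installed; it is not the exact plotting code used):

# Rough sketch, not the exact code used: aggregate one average review score and one
# Tabelog score per restaurant, then plot them against each other.
df = pd.DataFrame(restaurant_data,
                  columns=['area', 'rc_count', 'rc_name', 'rc_url', 'rc_score',
                           'rc_review_num', 'station', 'genre', 'price', 'scores'])
df = df.explode('scores')                                    # one row per individual review score
df['scores'] = pd.to_numeric(df['scores'], errors='coerce')  # '-' and other non-numbers become NaN
per_rc = (df.groupby('rc_name')
            .agg(avg_review=('scores', 'mean'), tabelog_score=('rc_score', 'first')))
per_rc['tabelog_score'] = per_rc['tabelog_score'].replace(-1.0, 2.9)  # missing scores plotted at 2.9
per_rc.plot.scatter(x='avg_review', y='tabelog_score')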

So next time, I'm going to visit one of those underrated restaurants!
