[Python] I want to become a gourmet [data-driven approach]: choosing a restaurant for the year-end and New Year holidays

1. Motivation

**"Use scraping to find genuinely good restaurants, regardless of their Tabelog score!"**

The Tabelog score is an index that reflects the voices of users: it rises as more high ratings are collected from influential users. For example, if the users' influence is the same, a restaurant with 100 five-point ratings gets a higher score than a restaurant with only 2 five-point ratings.
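
The exact formula is not given here, so the following is only an illustrative sketch of how a score could rise with the number of high ratings. It assumes a simple shrinkage average in which every restaurant starts from a hypothetical prior of 3.0 that counts as 10 ratings. `shrunk_score`, `prior_mean`, and `prior_weight` are made-up names and values, not Tabelog's actual method.

```python
def shrunk_score(ratings, prior_mean=3.0, prior_weight=10.0):
    """Hypothetical score: the plain average, pulled towards prior_mean
    when only a few ratings exist (NOT Tabelog's real algorithm)."""
    total = prior_mean * prior_weight + sum(ratings)
    return total / (prior_weight + len(ratings))


# Two restaurants that received nothing but 5-point ratings.
few = shrunk_score([5.0] * 2)     # only 2 ratings
many = shrunk_score([5.0] * 100)  # 100 ratings

print(round(few, 2), round(many, 2))  # → 3.33 4.82
```

With equal influence per user, the restaurant with 100 ratings ends up far closer to 5.0 than the one with 2, which is the behaviour described above.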

2. Implemented code

Python code to collect information from Tabelog:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import math
import time

root_url = 'https://tabelog.com/'
res = requests.get(root_url)
soup = BeautifulSoup(res.content, 'html.parser')
a = soup.find_all('a', class_='rsttop-area-search__target js-area-swicher-target')

area_url = {}
for area in a:
    area_dict = {}
    splitted = area.get('data-swicher-area-list').split('"')
    for i in range(int((len(splitted)-1)/8)):
        area_dict[splitted[i*8+3]] = splitted[i*8+7]
    area_url[area.get('data-swicher-city').split('"')[3]] = area_dict

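# area_url is keyed by the labels shown on tabelog.com, which are Japanese.
# The strings below are the presumed original labels for
# "Shibuya / Ebisu / Daikanyama" and "Tokyo"; adjust them if the site differs.
visit_areas = ['渋谷・恵比寿・代官山']
url_dict = {}
for visit_area in visit_areas:
    url = area_url['東京'][visit_area]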
    time.sleep(1)
    res = requests.get(root_url + url[1:])
    soup = BeautifulSoup(res.content, 'html.parser')
    a = soup.find_all('a', class_='c-link-arrow')
    for area in a:
        href = area['href']
        if href[-21:-8]!=url:
            continue
        else:
            url_dict[area.text] = href

max_page = 20
restaurant_data = []
for area, url in url_dict.items():
    time.sleep(1)
    res_area = requests.get(url)
    soup_area = BeautifulSoup(res_area.content, 'html.parser')
    rc_count = int(soup_area.find_all('span', class_='list-condition__count')[0].text)
    print('There are ' + str(rc_count) + ' restaurants in ' + area)
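    # 20 restaurants per list page; the loop is capped at max_page and at 60 pages.
    last_page = min(math.ceil(rc_count/20), max_page, 60)
    for i in range(1, last_page+1):
        print('Processing...  ' + str(i) + '/' + str(last_page))
        url_rc = url + 'rstLst/RC/' + str(i) + '/?Srt=D&SrtT=nod'
        time.sleep(1)  # wait between requests, as elsewhere in this script
        res_rc = requests.get(url_rc)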
        soup_rc = BeautifulSoup(res_rc.content, 'html.parser')
        for rc_div in soup_rc.find_all('div', class_='list-rst__wrap js-open-new-window'):
            rc_name = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name').text
            rc_url = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name')['href']
            rc_score = rc_div.find('span', class_='c-rating__val c-rating__val--strong list-rst__rating-val')
            if rc_score is None:
                rc_score = -1.
            else:
                rc_score = float(rc_score.text)
            rc_review_num = rc_div.find('em', class_='list-rst__rvw-count-num cpy-review-count').text
            if rc_review_num != ' - ':
                page = 1
                score = []
                while True:
                    rc_url_pg = rc_url + 'dtlrvwlst/COND-2/smp0/?smp=0&lc=2&rvw_part=all&PG=' + str(page)
                    time.sleep(1)
                    res_pg = requests.get(rc_url_pg)
                    soup_pg = BeautifulSoup(res_pg.content, 'html.parser')
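                    # The "page not found" title on the live site is Japanese; the string
                    # below is the presumed original text, which was garbled by translation.
                    if 'お探しのページが見つかりません' in soup_pg.find('title').text: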
                        break
                    # Nearest station: try the linked form first, then the plain-text form.
                    try:
                        station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('span', class_='linktree__parent-target-text').text
                    except (IndexError, AttributeError):
                        try:
                            station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('dd', class_='rdheader-subinfo__item-text').text.replace('\n','').replace(' ','')
                        except (IndexError, AttributeError):
                            station = ''
                    genre = '/'.join([genre_.text for genre_ in soup_pg.find_all('dl', class_='rdheader-subinfo__item')[1].find_all('span', class_='linktree__parent-target-text')])
                    price = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[2].find('p', class_='rdheader-budget__icon rdheader-budget__icon--dinner').find('a', class_='rdheader-budget__price-target').text
                    score = score + [score_.next_sibling.next_sibling.text for score_ in soup_pg.find_all('span', class_='c-rating__time c-rating__time--dinner')]
                    page += 1
                    if page == math.ceil(int(rc_review_num)/100)+1:
                        break
                restaurant_data.append([area, rc_count, rc_name, rc_url, rc_score, rc_review_num, station, genre, price, score])
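
Each entry of `restaurant_data` ends with a list of raw review-score strings. As a minimal sketch of how such a row could be reduced to a single mean review score, the helpers below parse the strings and skip anything that is not a number. The names `parse_scores` and `summarize`, the dict keys, and the sample row are hypothetical and for illustration only.

```python
from statistics import mean


def parse_scores(raw_scores):
    """Convert scraped score strings to floats, skipping placeholders such as '-'."""
    parsed = []
    for raw in raw_scores:
        try:
            parsed.append(float(raw))
        except (TypeError, ValueError):
            continue
    return parsed


def summarize(row):
    """Turn one restaurant_data row into a dict with the mean review score."""
    (area, rc_count, name, url, tabelog_score,
     review_num, station, genre, price, scores) = row
    values = parse_scores(scores)
    return {
        'name': name,
        'tabelog_score': tabelog_score,
        'review_mean': mean(values) if values else None,
        'n_scored_reviews': len(values),
    }


# Made-up row, shaped like the entries appended to restaurant_data.
sample_row = ['Ebisu', 1200, 'Example Bistro', 'https://example.com/', 3.25,
              '3', 'Ebisu Station', 'Bistro', '4,000-4,999 yen',
              ['4.0', '-', '4.5']]
summary = summarize(sample_row)
print(summary)
```

Keeping the parsing separate from the scraping loop means the scores can be re-analysed later without fetching the pages again.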

For an explanation of the code and further details, see the article "Scraping and Tabelog: I want to find a good restaurant! (Work)" if you are interested.

3. So, what were the results?

Plot and visualize data

The results are shown below. The targets are the 400 most recently opened restaurants in Shibuya, Ebisu, and Daikanyama.

[Figure: save.png]
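
The plotting code is not shown in this article, so here is one way such a scatter plot could be drawn. It assumes the Tabelog score goes on the x-axis and the mean review score on the y-axis, and that each restaurant is a dict with the hypothetical keys `tabelog_score` and `review_mean`. The function names and the output file name are also assumptions.

```python
def scatter_points(summaries):
    """Return (xs, ys): Tabelog score vs. mean review score, dropping missing values."""
    xs, ys = [], []
    for s in summaries:
        score, review_mean = s['tabelog_score'], s['review_mean']
        # The scraper stores -1. when a restaurant has no Tabelog score yet.
        if score is None or score < 0 or review_mean is None:
            continue
        xs.append(score)
        ys.append(review_mean)
    return xs, ys


def plot_scatter(summaries, path='scatter.png'):
    """Save a scatter plot to `path` (matplotlib is imported only when plotting)."""
    import matplotlib.pyplot as plt

    xs, ys = scatter_points(summaries)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys, alpha=0.5)
    ax.set_xlabel('Tabelog score')
    ax.set_ylabel('Mean review score')
    fig.savefig(path)
    plt.close(fig)


# Made-up data for illustration.
sample = [
    {'tabelog_score': 3.1, 'review_mean': 4.2},
    {'tabelog_score': -1.0, 'review_mean': 3.9},   # no Tabelog score yet
    {'tabelog_score': 3.5, 'review_mean': 3.4},
    {'tabelog_score': 3.3, 'review_mean': None},   # no scored reviews
]
print(scatter_points(sample))
```

Calling `plot_scatter(sample)` would write the figure to `scatter.png`.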

Implication

What you can read from this scatter plot is ...

4. Deeper consideration

From here on, I dig a little deeper, following my own curiosity.

Score distribution

Looking at the scatter plot above, I noticed that the distribution of Tabelog scores is skewed, with restaurants concentrated at particular scores. So I looked at the distribution itself.

[Figure: save.png]

Restaurants do cluster at particular scores, but what also stands out is that there are extremely few restaurants at certain other scores.
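
To check for clusters and gaps like these numerically, the scores can be counted per bin. The sketch below is a minimal standard-library version. The bin width of 0.05 and the name `score_histogram` are assumptions, and the sample scores are made up.

```python
from collections import Counter


def score_histogram(scores, bin_hundredths=5):
    """Count scores per bin; each bin is labelled by its lower edge.

    Scores are converted to integer hundredths first, so that values such as
    3.05 are not pushed into the wrong bin by floating-point error.
    """
    counts = Counter()
    for score in scores:
        if score is None or score < 0:  # -1. marks a missing score
            continue
        hundredths = round(score * 100)
        lower = hundredths // bin_hundredths * bin_hundredths
        counts[lower / 100] += 1
    return dict(sorted(counts.items()))


# Made-up scores for illustration.
histogram = score_histogram([3.02, 3.04, 3.05, 3.09, 3.58, -1.0])
for lower_edge, count in histogram.items():
    print(f'{lower_edge:.2f}: {"#" * count}')
```

Bins that stay empty or nearly empty between crowded neighbours are the "extremely few restaurants" scores mentioned above.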

I looked at the relationship with the number of reviews, but could not find anything that explains this. Perhaps the key lies in data I did not collect this time, such as the number of days since opening. There is also a rumour that the annual membership fee a restaurant pays to Tabelog caps its Tabelog score, which may have an effect too. Looked at another way, though, if such a cap keeps the Tabelog score low for restaurants whose review scores are high, then with the approach introduced here **you can find restaurants that have not become more popular than they deserve**.
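
As a sketch of that idea, the filter below keeps restaurants whose Tabelog score is low while their mean review score is high. All thresholds, the function name `find_hidden_gems`, and the dict keys are hypothetical choices for illustration, not values taken from the analysis above.

```python
def find_hidden_gems(summaries, max_tabelog=3.2, min_review_mean=3.8, min_reviews=5):
    """Return restaurants with a low Tabelog score but a high mean review score,
    sorted by the gap between the two (largest gap first)."""
    gems = [
        s for s in summaries
        if s['review_mean'] is not None
        and 0 <= s['tabelog_score'] <= max_tabelog
        and s['review_mean'] >= min_review_mean
        and s['n_reviews'] >= min_reviews  # ignore averages built on very few reviews
    ]
    return sorted(gems, key=lambda s: s['review_mean'] - s['tabelog_score'], reverse=True)


# Made-up data for illustration.
sample = [
    {'name': 'A', 'tabelog_score': 3.05, 'review_mean': 4.3, 'n_reviews': 12},
    {'name': 'B', 'tabelog_score': 3.60, 'review_mean': 4.4, 'n_reviews': 80},  # already rated highly
    {'name': 'C', 'tabelog_score': 3.10, 'review_mean': 4.0, 'n_reviews': 6},
    {'name': 'D', 'tabelog_score': 3.00, 'review_mean': 4.8, 'n_reviews': 1},   # too few reviews
    {'name': 'E', 'tabelog_score': -1.0, 'review_mean': None, 'n_reviews': 0},  # no data
]
gems = find_hidden_gems(sample)
print([g['name'] for g in gems])  # → ['A', 'C']
```

The minimum review count is there because a high average based on one or two reviews says little about the restaurant.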

5. Summary
