[PYTHON] [First data science ⑥] I tried to visualize the market price of restaurants in Tokyo

Nice to meet you. I'm N.D., a fourth-year university student in the Department of Physics. My Python experience is limited to a bit of self-study, and this was my first time scraping and crawling.

I am currently an intern in the Data Science Division of Cacco Inc. During the trial period, interns are given the task of building a crawler to collect, process, and visualize data, and then briefly discussing what they learned. This post summarizes that work.

Task

Theme

Visualize and examine the market prices of restaurants throughout Tokyo. In addition, collect other variables that can be obtained and analyze them in comparison with the budget.

Sub-theme

Since the theme is abstract, I set up the following concrete situation: use data to objectively show a friend coming to Tokyo "what the market price of restaurants in Tokyo is, and which genre is the most popular at that price."

Other findings

In addition to the sub-theme, visualize and present what can be learned by comparing the budget with the other variables.

Policy

  1. Crawl the gourmet site "Hot Pepper Gourmet" and collect the URL of each shop's detail page.
  2. Save the HTML of each shop's detail page from its URL (number of shops = number of saved HTML files).
  3. Scrape each variable from the saved HTML files.
  4. Visualize and analyze the data.
  5. Present the answer to the theme.

Crawling

This time, I crawled the search results for "shops that accept online reservations throughout Tokyo" on the Hot Pepper Gourmet site. 16,475 shops were acquired on Wednesday, October 16, 2019.

**Crawling procedure**

  1. Before crawling, get the number of shops from the first page so the acceptance result can be checked later.
  2. [1st page] Read the URL of each shop's detail page (hereinafter, shop URL) and save it to a Python list.
  3. [Transition to the next page] Get the URL of the next page from the pagination and move to it.
  4. Read the shop URLs from that page and append them to the list.
  5. Repeat steps 3-4 until there are no more pages.
  6. Visit each shop URL saved in the list and save its HTML file one by one.
  7. Finally, get the number of shops by scraping the last page in the same way as step 1.

The crawling code looks like this:

crawling.py


from bs4 import BeautifulSoup
import requests
import time
import os
# timer
t1 = time.time()

# function
# get number of shop
def get_num(soup):
    num = soup.find('p', {'class':'sercheResult fl'}).find('span', {'class':'fcLRed bold fs18 padLR3'}).text
    print('num:{}'.format(num))

# get url of shop
def get_shop_urls(tags):
    shop_urls = []
    # ignore the first shop because it is PR
    tags = tags[1:]
    for tag in tags:
        shop_url = tag.a.get('href')
        shop_urls.append(shop_url)
    return shop_urls

def save_shop_urls(shop_urls, dir_path=None, test=False):
    # make directory
    if test:
        if dir_path is None:
            dir_path = './html_dir_test'
    elif dir_path is None:
        dir_path = './html_dir'

    if not os.path.isdir(dir_path):
        os.mkdir(dir_path)

    for i, shop_url in enumerate(shop_urls):
        time.sleep(1)
        shop_url = 'https://www.hotpepper.jp' + shop_url
        r = requests.get(shop_url).text
        file_path = 'shop{:0>5}_url.html'.format(i)
        with open(dir_path + '/' + file_path, 'w') as f:
            f.write(r)
    # return last shop number
    return len(shop_urls)


start_url = 'https://www.hotpepper.jp/yoyaku/SA11/'
response = requests.get(start_url).text
soup = BeautifulSoup(response, 'html.parser')
tags = soup.find_all('h3', {'class':'detailShopNameTitle'})

# get last page number
last_page = soup.find('li', {'class':'lh27'}).text.replace('1/', '').replace('page', '')
last_page = int(last_page)
print('last page num:{}'.format(last_page))

# get the number of shops before crawling
get_num(soup)

# first page crawling
start_shop_urls = get_shop_urls(tags)

# from 2nd page
shop_urls = []
# for a test run, limit the number of pages (comment out for the full crawl)
# last_page = 10
for p in range(last_page-1):
    time.sleep(1)
    url = start_url + 'bgn' + str(p+2) + '/'
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    tags = soup.find_all('h3', {'class':'detailShopNameTitle'})
    shop_urls.extend(get_shop_urls(tags))
    # how speed
    if p % 100 == 0:
        percent = p/last_page*100
        print('{:.2f}% Done'.format(percent))

start_shop_urls.extend(shop_urls)
shop_urls = start_shop_urls

t2 = time.time()
elapsed_time = t2 - t1
print('time(get_page):{:.2f}s'.format(elapsed_time))
print('num(shop_num):{}'.format(len(shop_urls)))

# get the url of shop
last_num = save_shop_urls(shop_urls) # html_dir

# get the number of shops after crawling
get_num(soup)

t3 = time.time()
elapsed_time = t3 - t1
print('time(get_html):{:.2f}s'.format(elapsed_time))
print('num(shop_num):{}'.format(last_num))

Scraping

The following are the variables scraped this time. (figure: scraping_var.png)

Procedure

  1. Scrape the above 9 variables for each shop.
  2. Once all the variables for a shop have been obtained, add them as a record to a pandas DataFrame.
  3. Check that the number of records matches the number of shops acquired by crawling (a sketch of this check follows the scraping code below).

The scraping code looks like this:

scraping.py


from bs4 import BeautifulSoup
import glob
import requests
import time
import os
import pandas as pd
from tqdm import tqdm
import numpy as np


def get_shopinfo(category, soup):
    shopinfo_th = soup.find('div', {'class':'shopInfoDetail'}).find_all('th')
    # get 'category' from 'shopinfo_th'
    category_value = list(filter(lambda x: category in x , shopinfo_th))
    if not category_value:
        category_value = None
    else:
        category_value = category_value[0]
        category_index = shopinfo_th.index(category_value)
        shopinfo_td = soup.find('div', {'class':'shopInfoDetail'}).find_all('td')
        category_value = shopinfo_td[category_index].text.replace('\n', '').replace('\t', '')
    return category_value

# return the tag's text if it exists, otherwise NaN
def judge(category):
    if category is not None:
        category = category.text.replace('\n', '').replace('\t', '')
    else:
        category = np.nan
    return category

# return the text of the tag's <a> child if the tag exists, otherwise NaN
def judge_atag(category):
    if category is not None:
        category = category.a.text.replace('\n', '').replace('\t', '')
    else:
        category = np.nan
    return category

# return the text of the tag's <p> child if the tag exists, otherwise NaN
def judge_ptag(category):
    if category is not None:
        category = category.p.text.replace('\n', '').replace('\t', '')
    else:
        category = np.nan
    return category

# return the text of the tag's <span> child if the tag exists, otherwise 0
def judge_spantag(category):
    if category is not None:
        category = category.span.text.replace('\n', '').replace('\t', '')
    else:
        category = 0
    return category

# available=1, not=0
def available(strlist):
    available_flg = 0
    if strlist is not None and 'available' in strlist:
        available_flg = 1
    return available_flg

# categorize money
def category2index(category, price_range):
    if category in price_range:
        category = price_range.index(category)
    return category

def scraping(html, df, price_range):
    soup = BeautifulSoup(html, 'html.parser')
    dinner = soup.find('span', {'class':'shopInfoBudgetDinner'})
    dinner = judge(dinner)
    dinner = category2index(dinner, price_range)
    lunch = soup.find('span', {'class':'shopInfoBudgetLunch'})
    lunch = judge(lunch)
    lunch = category2index(lunch, price_range)
    genre_tag = soup.find_all('dl', {'class':'shopInfoInnerSectionBlock cf'})[1]
    genre = genre_tag.find('p', {'class':'shopInfoInnerItemTitle'})
    genre = judge_atag(genre)
    area_tag = soup.find_all('dl', {'class':'shopInfoInnerSectionBlock cf'})[2]
    area = area_tag.find('p', {'class':'shopInfoInnerItemTitle'})
    area = judge_atag(area)
    rating = soup.find('div', {'class':'ratingInfo'})
    rating = judge_ptag(rating)
    review = soup.find('p', {'class':'review'})
    review = judge_spantag(review)
    f_meter = soup.find_all('dl', {'class':'featureMeter cf'})
    # if the feature meters are missing, set size/customer/people/peek to NaN
    if f_meter == []:
        size = np.nan
        customer = np.nan
        people = np.nan
        peek = np.nan
    else:
        meterActive = f_meter[0].find('span', {'class':'meterActive'})
        size = f_meter[0].find_all('span').index(meterActive)
        meterActive = f_meter[1].find('span', {'class':'meterActive'})
        customer = f_meter[1].find_all('span').index(meterActive)
        meterActive = f_meter[2].find('span', {'class':'meterActive'})
        people = f_meter[2].find_all('span').index(meterActive)
        meterActive = f_meter[3].find('span', {'class':'meterActive'})
        peek = f_meter[3].find_all('span').index(meterActive)
    credits = get_shopinfo('credit card', soup)
    credits = available(credits)
    emoney = get_shopinfo('Electronic money', soup)
    emoney = available(emoney)
    data = [lunch, dinner, genre, area, float(rating), review, size, customer, people, peek, credits, emoney]
    s = pd.Series(data=data, index=df.columns, name=str(i))
    df = df.append(s)
    return df

columns = ['budget(Noon)', 'budget(Night)', "Genre", "area", 'Evaluation', 'Number of reviews', 'Shop size'
           , 'Customer base', 'Number of people/set', 'Peak hours', 'credit card', 'Electronic money']
base_url = 'https://www.hotpepper.jp/SA11/'
response = requests.get(base_url).text
soup = BeautifulSoup(response, 'html.parser')
# get the list of price ranges from the search page
price_range = soup.find('ul', {'class':'samaColumnList'}).find_all('a')
price_range = [p.text for p in price_range]
# price_range = ['~500 yen', '501-1000 yen', '1001-1500 yen', '1501-2000 yen', '2001-3000 yen', '3001-4000 yen', '4001-5000 yen'
#             , '5001 to 7000 yen', '7001-10000 yen', '10001-15000 yen', '15001 ~ 20000 yen', '20001-30000 yen', '30001 yen ~']

num = 16475  # number of data
# num = 1000 # test
df = pd.DataFrame(data=None, columns=columns)

for i in range(num):
# for i in tqdm(range(num)):  # alternative: show a progress bar
    html = './html_dir/shop{:0>5}_url.html'.format(i)
    with open(html,"r", encoding='utf-8') as f:
        shop_html = f.read()

    df = scraping(shop_html, df, price_range)
    if i % 1600 == 0:
        percent = i/num*100
        print('{:.3f}% Done'.format(percent))

df.to_csv('shop_info.csv', encoding='shift_jis')
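
Step 3 of the procedure (checking that the number of records matches the number of shops acquired by crawling) is not part of the script above. A minimal sketch of that check, assuming the ./html_dir directory written by crawling.py and the shop_info.csv written above, could look like this:

import glob
import pandas as pd

# shops saved by crawling.py (one HTML file per shop)
n_html = len(glob.glob('./html_dir/shop*_url.html'))
# records scraped into shop_info.csv by scraping.py
n_records = len(pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0))

print('saved html files: {}'.format(n_html))
print('scraped records : {}'.format(n_records))
assert n_html == n_records, 'record count does not match the number of crawled shops'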

Acceptance result

The acceptance results are as follows. (figure: スクリーンショット 2019-11-26 11.41.30.png)

Crawling took a little under an hour, and the site was updated during that time, so there is a difference between the number of shops at the start and the number after crawling.

Results for the sub-theme

Confirming the sub-theme

"Visualize the market prices of restaurants in Tokyo, Clarify which genre of shops are the most popular in that price range. "

Conclusion for sub-theme

- The market price for dinner is "**2,000-4,000 yen**".
- The market price for lunch is "**500-1,000 yen**".
- The genre with the highest share in both the dinner and lunch price ranges is "**izakaya**".
- Also, at lunch, the "500-1,000 yen izakaya" is likely a **"double-cropping" shop** (an izakaya at night that also serves lunch during the day).

Here, the market price of the budget is defined as the mode, not the mean.

The underlying data are shown below in order.

Budget market price

We visualized the market price of the budget separately for dinner and lunch. (figure: スクリーンショット 2019-11-26 11.56.37.png)
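
As a minimal sketch of how this distribution (and its mode, which we use as the "market price") could be drawn from the scraped data, assuming the column names and the price-range index encoding from scraping.py:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)

for col in ['budget(Night)', 'budget(Noon)']:
    # keep only budgets that were mapped to a numeric price-range index
    vals = pd.to_numeric(df[col], errors='coerce').dropna().astype(int)
    counts = vals.value_counts().sort_index()
    print('{}: mode = price-range index {}'.format(col, counts.idxmax()))
    counts.plot(kind='bar', title=col)
    plt.xlabel('price-range index')
    plt.ylabel('number of shops')
    plt.show()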

Genre by price range

From the above results we now have a rough market price for restaurants in Tokyo, so let's visualize the genres by price range. (figures: スクリーンショット 2019-11-27 15.07.47.png, スクリーンショット 2019-11-27 15.09.30.png)

**Genres included in "Other"**: for both dinner and lunch, the following genres with small total counts are grouped into "Other": okonomiyaki/monja, cafe/sweets, ramen, Korean, international, Western, creative, other gourmet.
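
A sketch of how such a genre-by-price-range breakdown could be produced from the scraped DataFrame (the column names come from scraping.py; the stacked-bar presentation is an assumption, not the original plotting code):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)

# share of each genre within each dinner price range
ct = pd.crosstab(df['budget(Night)'], df['Genre'], normalize='index')
ct.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.xlabel('dinner price-range index')
plt.ylabel('share of shops')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()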

An "izakaya" in the "500-1,000 yen" price range seemed unusually cheap for lunch, so I dug into this a little deeper.

What is a "500-1,000 yen" izakaya?

As shown below, these shops call themselves "izakaya" but offer a lunch menu during the day. (figure: スクリーンショット 2019-11-26 12.10.38.png)

Other findings

Conclusion

- For dinner, shops in the "**7,000 yen and up**" price range tend to have more male than female customers, while for both dinner and lunch, shops in the "**1,000-3,000 yen**" price range tend to have more female than male customers.

- For both dinner and lunch, the **higher the price range**, the **higher the rating** tends to be.

- In the **high price range**, many shops accept **credit cards**.

- For dinner, shops in the "**2,000-4,000 yen**" price range tend to have a large **capacity**.

The supporting data are shown below.

Customer base by price range

We compared the customer base across price ranges. (figure: スクリーンショット 2019-11-26 12.17.30.png)

From this, for dinner, shops in the "7,000 yen and up" price range tend to have more male than female customers, while for both dinner and lunch, shops in the "1,000-3,000 yen" price range tend to have more female than male customers.

Evaluation by price range

We plotted the evaluation for each dinner and lunch price range. Since many shops in the same price range share the same evaluation, we used jittering to deliberately offset the plotted points. The results of a t-test are shown below the graphs.
**Definition of the t-test groups**
Dinner: shops under 4,000 yen vs. shops of 4,000 yen and over
Lunch: shops under 2,000 yen vs. shops of 2,000 yen and over
(figures: スクリーンショット 2019-11-26 12.31.33.png, スクリーンショット 2019-11-26 12.32.18.png)
For both dinner and lunch, the higher the price range, the higher the evaluation tends to be. From the t-test results, we can say there is a difference in **evaluation** between the high and low price ranges.
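
A minimal sketch of the jittered scatter plot and the two-group t-test for dinner, assuming 'budget(Night)' holds price-range indices and that indices 0-5 correspond to "under 4,000 yen" per the price_range list commented out in scraping.py (that mapping, and the use of Welch's variant of the t-test, are assumptions):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)
price = pd.to_numeric(df['budget(Night)'], errors='coerce')
rating = pd.to_numeric(df['Evaluation'], errors='coerce')
mask = price.notna() & rating.notna()
price, rating = price[mask], rating[mask]

# jittering: offset x slightly so shops with identical ratings in a range do not overlap
jitter = np.random.uniform(-0.2, 0.2, size=len(price))
plt.scatter(price + jitter, rating, s=5, alpha=0.3)
plt.xlabel('dinner price-range index')
plt.ylabel('evaluation')
plt.show()

# Welch's t-test: under 4,000 yen (index < 6) vs. 4,000 yen and over (index >= 6)
low, high = rating[price < 6], rating[price >= 6]
t, p = stats.ttest_ind(low, high, equal_var=False)
print('t = {:.2f}, p = {:.3g}'.format(t, p))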

Credit card usage by price range

We compared the usage status of credit cards by price range. (figure: スクリーンショット 2019-11-27 15.22.17.png)

As one would intuitively expect, a large percentage of shops in the **high price range** accept **credit cards**. The "10,000 yen and up" lunch range is not shown because only 4 cases were obtained, which is not enough for evaluation.

Store size by price range

We compared shop size, rated on a 5-point scale, by price range. A conclusion could only be drawn for dinner, so only that is shown. The darker the blue, the larger the shop. (figure: スクリーンショット 2019-11-26 13.04.13.png) For dinner, shops in the "2,000-4,000 yen" price range tend to have a large capacity. Since izakaya make up a large share of this price range, this is probably because many izakaya have large capacities.

In conclusion

Reflections

I realized first-hand how difficult it is to take the information obtained by scraping and visualize it so that a conclusion can be conveyed to someone else.
**If I were to do it again**: set a clear purpose for the analysis before writing any code, and plan the process backwards from that purpose.

What I learned from feedback

**Code review**: I received the following points and would like to improve on them going forward.

- Write code with the Python style guide PEP 8 in mind.
- Clean up unnecessary blank lines and commented-out code before submitting.

**Presentation review**: the discussion was about how to present graphs so they are easier to understand. I received feedback that it is important to create "intuitive graphs", for example arranging the plot so that higher price ranges read naturally as higher, and expressing density with jittering. I also learned that presenting conclusions as a story helps the audience understand. Going forward, I will keep in mind how to connect the results obtained to real problems.
