Nice to meet you. I'm N.D., a fourth-year university student in the Department of Physics. My Python experience amounts to a little self-study, and this was my first time scraping and crawling.

I am currently an intern in the Data Science Division of Cacco Inc., where the trial period includes a task of building a crawler to collect, process, and visualize data. This post briefly describes what I learned from it.
**Theme**
Visualize and examine the market prices of restaurants throughout Tokyo. In addition, collect other variables that look obtainable and analyze them in comparison with the budget.

Since this theme is abstract, I set up the following concrete situation.

**Situation**
Use data to objectively show a friend visiting Tokyo "what the market price of restaurants in Tokyo is, and which genre is most common at that price."

As sub-themes, visualize and present what can be learned by comparing the budget with the other variables.
This time, I crawled the search results for "shops that accept online reservations throughout Tokyo" on the Hot Pepper Gourmet site, and acquired 16,475 shops on Wednesday, October 16, 2019.
**Crawling procedure**
The crawling code looks like this:
crawling.py

```python
from bs4 import BeautifulSoup
import requests
import time
import os

# timer
t1 = time.time()

# print the number of shops shown on the search-result page
def get_num(soup):
    num = soup.find('p', {'class': 'sercheResult fl'}).find('span', {'class': 'fcLRed bold fs18 padLR3'}).text
    print('num:{}'.format(num))

# collect the shop URLs on one result page
def get_shop_urls(tags):
    shop_urls = []
    # skip the first shop because it is a PR slot
    tags = tags[1:]
    for tag in tags:
        shop_url = tag.a.get('href')
        shop_urls.append(shop_url)
    return shop_urls

def save_shop_urls(shop_urls, dir_path=None, test=False):
    # make directory
    if test:
        if dir_path is None:
            dir_path = './html_dir_test'
    elif dir_path is None:
        dir_path = './html_dir'
    if not os.path.isdir(dir_path):
        os.mkdir(dir_path)
    for i, shop_url in enumerate(shop_urls):
        time.sleep(1)
        shop_url = 'https://www.hotpepper.jp' + shop_url
        r = requests.get(shop_url).text
        file_path = 'shop{:0>5}_url.html'.format(i)
        with open(dir_path + '/' + file_path, 'w') as f:
            f.write(r)
    # return the number of saved shops
    return len(shop_urls)

start_url = 'https://www.hotpepper.jp/yoyaku/SA11/'
response = requests.get(start_url).text
soup = BeautifulSoup(response, 'html.parser')
tags = soup.find_all('h3', {'class': 'detailShopNameTitle'})

# get the last page number
last_page = soup.find('li', {'class': 'lh27'}).text.replace('1/', '').replace('page', '')
last_page = int(last_page)
print('last page num:{}'.format(last_page))

# get the number of shops before crawling
get_num(soup)

# first page
start_shop_urls = get_shop_urls(tags)

# from the 2nd page on
shop_urls = []
# last_page = 10  # uncomment to test on only the first 10 pages
for p in range(last_page - 1):
    time.sleep(1)
    url = start_url + 'bgn' + str(p + 2) + '/'
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    tags = soup.find_all('h3', {'class': 'detailShopNameTitle'})
    shop_urls.extend(get_shop_urls(tags))
    # progress
    if p % 100 == 0:
        percent = p / last_page * 100
        print('{:.2f}% Done'.format(percent))

start_shop_urls.extend(shop_urls)
shop_urls = start_shop_urls

t2 = time.time()
elapsed_time = t2 - t1
print('time(get_page):{:.2f}s'.format(elapsed_time))
print('num(shop_num):{}'.format(len(shop_urls)))

# download each shop page into html_dir
last_num = save_shop_urls(shop_urls)

# get the number of shops after crawling
get_num(soup)

t3 = time.time()
elapsed_time = t3 - t1
print('time(get_html):{:.2f}s'.format(elapsed_time))
print('num(shop_num):{}'.format(last_num))
```
The variables scraped this time are the ones listed in `columns` in the code below.
**Scraping procedure**
The scraping code looks like this:
scraping.py

```python
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

def get_shopinfo(category, soup):
    shopinfo_th = soup.find('div', {'class': 'shopInfoDetail'}).find_all('th')
    # pick out the <th> whose text matches 'category'
    category_value = list(filter(lambda x: category in x, shopinfo_th))
    if not category_value:
        category_value = None
    else:
        category_value = category_value[0]
        category_index = shopinfo_th.index(category_value)
        shopinfo_td = soup.find('div', {'class': 'shopInfoDetail'}).find_all('td')
        category_value = shopinfo_td[category_index].text.replace('\n', '').replace('\t', '')
    return category_value

# return the tag's text, or NaN if the tag is missing
def judge(category):
    if category is not None:
        category = category.text.replace('\n', '').replace('\t', '')
    else:
        category = np.nan
    return category

# same, but read the text from the child <a> tag
def judge_atag(category):
    if category is not None:
        category = category.a.text.replace('\n', '').replace('\t', '')
    else:
        category = np.nan
    return category

# same, but read the text from the child <p> tag
def judge_ptag(category):
    if category is not None:
        category = category.p.text.replace('\n', '').replace('\t', '')
    else:
        category = np.nan
    return category

# same, but read the child <span> tag; a missing tag means 0 reviews
def judge_spantag(category):
    if category is not None:
        category = category.span.text.replace('\n', '').replace('\t', '')
    else:
        category = 0
    return category

# available=1, not available (or unknown)=0
def available(strlist):
    available_flg = 0
    if strlist and 'available' in strlist:
        available_flg = 1
    return available_flg

# map a price-range label to its index in price_list
def category2index(category, price_list):
    if category in price_list:
        category = price_list.index(category)
    return category

def scraping(html, df, price_range):
    soup = BeautifulSoup(html, 'html.parser')
    dinner = soup.find('span', {'class': 'shopInfoBudgetDinner'})
    dinner = judge(dinner)
    dinner = category2index(dinner, price_range)
    lunch = soup.find('span', {'class': 'shopInfoBudgetLunch'})
    lunch = judge(lunch)
    lunch = category2index(lunch, price_range)
    genre_tag = soup.find_all('dl', {'class': 'shopInfoInnerSectionBlock cf'})[1]
    genre = genre_tag.find('p', {'class': 'shopInfoInnerItemTitle'})
    genre = judge_atag(genre)
    area_tag = soup.find_all('dl', {'class': 'shopInfoInnerSectionBlock cf'})[2]
    area = area_tag.find('p', {'class': 'shopInfoInnerItemTitle'})
    area = judge_atag(area)
    rating = soup.find('div', {'class': 'ratingInfo'})
    rating = judge_ptag(rating)
    review = soup.find('p', {'class': 'review'})
    review = judge_spantag(review)
    f_meter = soup.find_all('dl', {'class': 'featureMeter cf'})
    # if the feature meters are missing, size/customer/people/peek are all NaN
    if f_meter == []:
        size = np.nan
        customer = np.nan
        people = np.nan
        peek = np.nan
    else:
        meterActive = f_meter[0].find('span', {'class': 'meterActive'})
        size = f_meter[0].find_all('span').index(meterActive)
        meterActive = f_meter[1].find('span', {'class': 'meterActive'})
        customer = f_meter[1].find_all('span').index(meterActive)
        meterActive = f_meter[2].find('span', {'class': 'meterActive'})
        people = f_meter[2].find_all('span').index(meterActive)
        meterActive = f_meter[3].find('span', {'class': 'meterActive'})
        peek = f_meter[3].find_all('span').index(meterActive)
    credits = get_shopinfo('credit card', soup)
    credits = available(credits)
    emoney = get_shopinfo('Electronic money', soup)
    emoney = available(emoney)
    data = [lunch, dinner, genre, area, float(rating), review, size, customer, people, peek, credits, emoney]
    s = pd.Series(data=data, index=df.columns, name=str(i))  # i is the loop counter below
    df = df.append(s)  # pandas < 2.0; use pd.concat on newer versions
    return df

columns = ['budget(Noon)', 'budget(Night)', 'Genre', 'area', 'Evaluation', 'Number of reviews',
           'Shop size', 'Customer base', 'Number of people/set', 'Peak hours',
           'credit card', 'Electronic money']

base_url = 'https://www.hotpepper.jp/SA11/'
response = requests.get(base_url).text
soup = BeautifulSoup(response, 'html.parser')

# get the price-range labels
price_range = soup.find('ul', {'class': 'samaColumnList'}).find_all('a')
price_range = [p.text for p in price_range]
# price_range = ['~500 yen', '501-1000 yen', '1001-1500 yen', '1501-2000 yen',
#                '2001-3000 yen', '3001-4000 yen', '4001-5000 yen', '5001-7000 yen',
#                '7001-10000 yen', '10001-15000 yen', '15001-20000 yen',
#                '20001-30000 yen', '30001 yen~']

num = 16475  # number of saved shop pages
# num = 1000  # test
df = pd.DataFrame(data=None, columns=columns)
for i in range(num):
    html = './html_dir/shop{:0>5}_url.html'.format(i)
    with open(html, 'r', encoding='utf-8') as f:
        shop_html = f.read()
    df = scraping(shop_html, df, price_range)
    if i % 1600 == 0:
        percent = i / num * 100
        print('{:.3f}% Done'.format(percent))
df.to_csv('shop_info.csv', encoding='shift_jis')
```
The execution results are as follows.

Crawling took a little under an hour, and the site was updated during that time, so you can see a difference between the number of shops counted before crawling and the number after.
"Visualize the market prices of restaurants in Tokyo, Clarify which genre of shops are the most popular in that price range. "
--The market price for dinner is "** 2000-4000 yen ". --The market price for lunch is " 500-1000 yen ". ――The genre with the highest percentage in each of the dinner and lunch prices is " Izakaya **". —— Also, at lunch, the “500-1000 yen izakaya” would be a ** double cropping shop **. Here, the market price of the budget is defined as "mode value, not average value".
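As a reference, here is a minimal sketch of reading the mode off the scraped data. It assumes the `shop_info.csv` written by `scraping.py` above, and that the budget columns hold the indices assigned from the `price_range` list in that script:

```python
import pandas as pd

# price-range labels in the order they appear on the site (from scraping.py)
price_range = ['~500 yen', '501-1000 yen', '1001-1500 yen', '1501-2000 yen',
               '2001-3000 yen', '3001-4000 yen', '4001-5000 yen', '5001-7000 yen',
               '7001-10000 yen', '10001-15000 yen', '15001-20000 yen',
               '20001-30000 yen', '30001 yen~']

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)
for col in ['budget(Night)', 'budget(Noon)']:
    mode_index = df[col].mode()[0]  # most frequent price-range index, NaN ignored
    print('{}: {}'.format(col, price_range[int(mode_index)]))
```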
The underlying data are shown below in order.
We visualized the budget distribution separately for dinner and lunch.
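A hedged sketch of one way to draw this distribution, under the same assumptions about `shop_info.csv` (the x-axis is the price-range index, not the yen label):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)

# count shops per price-range index for dinner and lunch, then plot side by side
counts = pd.DataFrame({
    'dinner': df['budget(Night)'].value_counts().sort_index(),
    'lunch': df['budget(Noon)'].value_counts().sort_index(),
})
counts.plot.bar(figsize=(10, 4))
plt.xlabel('price-range index (see price_range in scraping.py)')
plt.ylabel('number of shops')
plt.tight_layout()
plt.show()
```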
From the above results, we have found a rough market price for restaurants in Tokyo, so let's visualize the genres by price range.
**Genres included in "Other"**
For both dinner and lunch, the following genres, whose total counts are small, are grouped into "Other": [Okonomiyaki/Monja, Cafe/Sweets, Ramen, Korean, International, Western, Creative, Other Gourmet]. A sketch of this grouping follows.
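A sketch of the grouping and of the genre share per price range; the labels in `minor` mirror the translated names above and are assumptions about the values actually stored in the CSV:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)

# fold low-count genres into 'Other' (labels assumed to match the text above)
minor = ['Okonomiyaki/Monja', 'Cafe/Sweets', 'Ramen', 'Korean',
         'International', 'Western', 'Creative', 'Other Gourmet']
genre = df['Genre'].where(~df['Genre'].isin(minor), 'Other')

# share of each genre within every dinner price-range index, as a stacked bar
share = pd.crosstab(df['budget(Night)'], genre, normalize='index')
share.plot.bar(stacked=True, figsize=(10, 4))
plt.xlabel('dinner price-range index')
plt.ylabel('share of shops')
plt.tight_layout()
plt.show()
```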
The "izakaya" at "500-1000 yen" seemed too cheap for lunch, so I dug deeper here.

As shown below, these shops call themselves izakaya while offering a lunch menu during the day.
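One way to pull these shops out for inspection; the stored genre label 'Izakaya' and the mapping of index 1 to '501-1000 yen' are assumptions based on the code above:

```python
import pandas as pd

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)

# index 1 = '501-1000 yen' under the index assignment in scraping.py;
# the genre label 'Izakaya' is assumed to match the stored value
izakaya_lunch = df[(df['Genre'] == 'Izakaya') & (df['budget(Noon)'] == 1)]
print('{} izakaya serve lunch in the 501-1000 yen range'.format(len(izakaya_lunch)))
print(izakaya_lunch[['budget(Noon)', 'budget(Night)', 'Evaluation']].head())
```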
The findings for the sub-themes are as follows.

- The customer base of shops in the **7000 yen and up** dinner price range tends to be more male than female, while for both dinner and lunch the customer base in the **1000-3000 yen** range tends to be more female than male.
- For both dinner and lunch, **ratings tend to be higher** in the **higher price ranges**.
- In the **high price ranges**, many shops accept **credit cards**.
- Shops in the **2000-4000 yen** dinner price range tend to have a large **capacity**.
The supporting data are shown below.
We compared the customer base across price ranges.

From this, at dinner the customer base of shops in the "7000 yen and up" price range tends to be more male than female, while for both dinner and lunch the customer base of shops in the "1000-3000 yen" range tends to be more female than male.
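A minimal sketch of this comparison. That a lower meter position means a more male-heavy clientele is an assumption about how the site renders its customer-base meter:

```python
import pandas as pd

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)

# 'Customer base' is the position of the active segment on the site's
# male-female meter; assuming low = mostly male and high = mostly female,
# the mean per price-range index shows which way each range skews
print(df.groupby('budget(Night)')['Customer base'].mean().round(2))
print(df.groupby('budget(Noon)')['Customer base'].mean().round(2))
```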
We plot the ratings for each dinner and lunch price range. Since many shops in the same price range share the same rating, we adopted jittering and intentionally shifted each plotted point. The results of a t-test are shown below the graph.

**Definition of the t-test groups**
Dinner: shops under 4000 yen vs. shops of 4000 yen and over
Lunch: shops under 2000 yen vs. shops of 2000 yen and over

You can see that for both dinner and lunch, the higher the price range, the higher the rating tends to be. From the t-test results, it can be said that there is a difference in **rating** between the high and the low price ranges.
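A hedged sketch of the jittered plot and of the test; I use Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`), and the cutoff at index 5 assumes the indices follow the `price_range` list above:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)
d = df[['budget(Night)', 'Evaluation']].copy()
d['budget(Night)'] = pd.to_numeric(d['budget(Night)'], errors='coerce')
d = d.dropna()

# jittering: shift each point slightly in x so overlapping ratings stay visible
rng = np.random.default_rng(0)
x = d['budget(Night)'] + rng.uniform(-0.3, 0.3, len(d))
plt.scatter(x, d['Evaluation'], s=5, alpha=0.3)
plt.xlabel('dinner price-range index')
plt.ylabel('rating')
plt.show()

# two-sample test: indices 0-5 cover up to 4000 yen (assumed index assignment)
low = d.loc[d['budget(Night)'] <= 5, 'Evaluation']
high = d.loc[d['budget(Night)'] >= 6, 'Evaluation']
t, p = stats.ttest_ind(low, high, equal_var=False)  # Welch's t-test
print('t = {:.2f}, p = {:.3g}'.format(t, p))
```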
We compared credit card acceptance across price ranges.

Here too, matching intuition, a large share of shops in the **high price ranges** accept **credit cards**. The "10,000 yen and up" price range for lunch is not shown because only 4 cases were obtained, too few to evaluate.
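Because `credit card` is the 0/1 flag produced by `available()` above, the acceptance share per price range can be computed directly:

```python
import pandas as pd

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)

# mean of the 0/1 flag per price-range index = share of shops accepting cards
share = df.groupby('budget(Night)')['credit card'].mean() * 100
print(share.round(1))
```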
We compared shop size, rated on a five-point scale, across price ranges. Only dinner allowed a conclusion, so only that chart is shown. The darker the blue, the larger the shop. Shops in the "2000-4000 yen" dinner price range tend to have a large capacity. Since izakaya account for a large share of this price range, this is probably because many izakaya have large seating capacity.
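A sketch of one way to draw this breakdown, assuming `Shop size` holds the meter position scraped above:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)

# share of each shop-size meter position within every dinner price range;
# higher positions (darker in the original figure) mean roomier shops
size_share = pd.crosstab(df['budget(Night)'], df['Shop size'], normalize='index')
size_share.plot.bar(stacked=True, colormap='Blues', figsize=(10, 4))
plt.xlabel('dinner price-range index')
plt.ylabel('share of shops')
plt.tight_layout()
plt.show()
```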
I experienced for myself how difficult it is to take information gathered by scraping and visualize it so that the conclusion gets across to the audience.

**If I were to do it again**
Set a clear purpose for the analysis before writing any code, and work backwards from it to plan the process.
**Code review feedback**
I received the following points and will improve on them going forward.

- Write code in line with PEP 8, the Python style convention.
- Tidy up unnecessary blank lines and commented-out code before submitting.

**Feedback after the presentation**
The review centered on how to present graphs so that they communicate more readily. We received feedback that it is important to create "intuitive" graphs, for example ordering the axis so that higher price ranges appear further along, and expressing density by jittering. I also learned that presenting the conclusion as a story helps the audience follow. Going forward, I will do my analytical work while staying conscious of how to connect the results I obtain to real problems.