[Python] I made a system that recommends "recipes I really want to make" from a recipe site!

Introduction

Hello, this is HanBei.

The previous article covered data collection for machine learning; this one continues from there.

As before, we will use Google Colaboratory.

If you find this interesting, please **comment** or **LGTM**!

1-1. Purpose

I use recipe sites myself, but I kept thinking **"There are so many recipes"** and **"Are the officially recommended dishes really that good? (Sorry!)"**, so I decided to find the recipes I actually want to make.

1-2. Target (reason to read this article)

I hope this is useful for people who want to collect data from the Web for machine learning.

1-3. Attention

Scraping can be a **crime** if it is not done with proper care and moderation.

If you are thinking "I just want to scrape, so I won't worry about the details!", whether out of optimism or anxiety, I recommend reading at least the two articles below first.

miyabisun: "Don't ask questions on Q&A sites about how to scrape"
nezuq: "List of precautions for web scraping"

1-4. Items to check

This article explains how to scrape, but **we take no responsibility** for how you use it.

Think for yourself and use it with the right ethics.
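As one concrete precaution, Python's standard library can check a site's robots.txt before you fetch a page. This is a minimal sketch; the rules fed to the parser below are made up so it runs offline, not Rakuten's actual robots.txt (normally you would call `set_url(...)` and `read()`).

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
#Hypothetical rules fed directly so the sketch runs offline;
#in real use: rp.set_url('https://example.com/robots.txt'); rp.read()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

#Check whether a given path may be fetched
print(rp.can_fetch('*', 'https://example.com/recipe/'))    # True
print(rp.can_fetch('*', 'https://example.com/private/x'))  # False
```

Combine this with a `time.sleep(1)` between requests, as done later in this article.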

2. Preparation

2-1. Examination of recipe site

There are several well-known recipe sites such as Cookpad, Nadia, and White rice.com, but this time we will use **Rakuten Recipe**.

The reasons are:

・ It has quantitative data such as the "I want to repeat", "It was easy", and "I saved" stamps.
・ It has many recipes.

2-2. Make Google Colaboratory available

If you haven't created a Google account to use Google Colaboratory, create one.

How to create a new notebook:

  1. Click Google Colaboratory to get started
  2. Create a new one from Google Drive

Reference: shoji9x9, "Summary of how to use Google Colab"

3. Practice

From here on, I will walk through the implementation.

3-1. Introduction

First, import the library.

from bs4 import BeautifulSoup
from google.colab import drive
from google.colab import files
import urllib.parse
import urllib.request as req
import csv
import random
import pandas as pd
import numpy as np
import time
import datetime

Decide on the name of the dish you want to look up!

#The name of the dish you want to look up
food_name = 'curry'

3-2. Get the recipe URL

Create a function to get the URL of the recipe.

#Store the url for each recipe
recipe_url_lists = []

def GetRecipeURL(url):
  res = req.urlopen(url)
  soup = BeautifulSoup(res, 'html.parser')

  #Select the range of the recipe list
  recipe_text = str(soup.find_all('li', class_= 'clearfix'))
  #Split the acquired text line by line and store it in a list
  recipe_text_list = recipe_text.split('\n')

  recipe_url_list = ''

  #Read the list line by line and extract only the lines that match the dish name
  for text in recipe_text_list:
    #Get the url of each recipe
    if 'a href="/recipe/' in text:
      #Cut the recipe id out of a fixed position in the line
      recipe_url_id = text[16:27]
      #Join the url
      recipe_url_list = 'https://recipe.rakuten.co.jp/recipe/' + recipe_url_id + '/?l-id=recipe_list_detail_recipe'
      #Store the url
      recipe_url_lists.append(recipe_url_list)

    #Get the title of each recipe (printed with the url stored just above)
    if 'h3' in text:
      print(text + ", " + recipe_url_list)
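The fixed slice `text[16:27]` breaks as soon as Rakuten changes its markup. A more robust sketch extracts the numeric id with a regular expression; the sample `href` line below is a hypothetical stand-in shaped like the anchor tags the scraper looks for, not actual Rakuten markup.

```python
import re

def extract_recipe_id(line):
    #Match hrefs of the form /recipe/<digits>/ and return the digits, or None
    m = re.search(r'a href="/recipe/(\d+)/', line)
    return m.group(1) if m else None

#Hypothetical example line
sample = '<a href="/recipe/1234567890/?l-id=recipe_list_detail_recipe">'
print(extract_recipe_id(sample))  # 1234567890
```

Unlike the fixed slice, this keeps working if the length of the id or the surrounding attributes change.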

Check recipes in order of popularity

#Number of pages you want to look up
page_count = 2

#Encode the dish name so it can be put in the url
name_quote = urllib.parse.quote(food_name)

#Base url (the url of page 1 only)
#In order of popularity
base_url = 'https://recipe.rakuten.co.jp/search/' + name_quote
#In order of newest first
# base_url = 'https://recipe.rakuten.co.jp/search/' + name_quote + '/?s=0&v=0&t=2'

for num in range(page_count):
  #Page numbers start at 1
  page = num + 1
  #To start from a later page instead, shift the page number here
  # page = page + 50

  if page == 1:
    #Page 1 uses the base url as is
    GetRecipeURL(base_url)
  else:
    #Urls from page 2 onwards include the page number
    #In order of popularity
    base_url_other = 'https://recipe.rakuten.co.jp/search/' + name_quote + '/' + str(page) + '/?s=4&v=0&t=2'
    #In order of newest first
    # base_url_other = 'https://recipe.rakuten.co.jp/search/' + name_quote + '/' + str(page) + '/?s=0&v=0&t=2'
    GetRecipeURL(base_url_other)

  #Wait 1 second between requests (scraping etiquette)
  time.sleep(1)

When you run this, the title and URL of each recipe are displayed.

Let's check the number of recipes obtained here!


#Number of recipes acquired
len(recipe_url_lists)

When executed, 17 items are displayed.

Next, get the necessary data from each recipe.


data_count = []
recipe_data_set = []

def SearchRecipeInfo(url, tag, id_name):
  res = req.urlopen(url)
  soup = BeautifulSoup(res, 'html.parser')

  #Defaults in case an element is missing
  recipe_id = ''
  recipe_date = ''
  report = str(0)

  for all_text in soup.find_all(tag, id= id_name):
    #Recipe ID
    for text in all_text.find_all('p', class_= 'rcpId'):
      recipe_id = text.get_text()[7:17]

    #Release date
    for text in all_text.find_all('p', class_= 'openDate'):
      recipe_date = text.get_text()[4:14]

    #The 3 types of stamps: "I want to repeat", "It was easy", "I saved"
    for text in all_text.find_all('div', class_= 'stampHead'):
      for stamp in text.find_all('span', class_= 'stampCount'):
        data_count.append(stamp.get_text())

    #Number of "I made it" reports
    for text in all_text.find_all('div', class_= 'recipeRepoBox'):
      for h2_tag in text.find_all('h2'):
        #When the number of reports is 0, there is no span element
        if h2_tag.find('span') == None:
          report = str(0)
        else:
          for el in h2_tag.find('span'):
            report = el.strip().replace('Case', '')

  print("ID: " + recipe_id + ", DATE: " + recipe_date + ", Number made: " + report +
        ", I want to repeat: " + data_count[0] +
        ", It was easy: " + data_count[1] +
        ", I saved: " + data_count[2] +
        ", url: " + url)

  #Store the row so it can be written to the csv file later
  recipe_data_set.append([recipe_id, recipe_date, data_count[0], data_count[1], data_count[2], report, url])

  #Empty the list holding the stamp counts
  data_count.clear()

  #Wait 1 second between requests (scraping etiquette)
  time.sleep(1)

Here, check the acquired data.


for num in range(len(recipe_url_lists)):
  SearchRecipeInfo(recipe_url_lists[num], 'div', 'detailContents')

When you run it, you can see that the data has been acquired properly.

3-3. Output csv file to Google Drive

Create a csv file on Google Drive and output the data to it.

#Mount the directory you want to use
drive.mount('/content/drive')

Choose any folder in Google Drive and any file name, and put them where "〇〇〇" appears below.

#Create a folder on google drive and specify the save destination
save_dir = "./drive/My Drive/Colab Notebooks/〇〇〇/"
#Select a file name
data_name = '〇〇〇.csv'
#Save csv file in folder
data_dir = save_dir + data_name

#Add items to csv file
with open(data_dir, 'w', newline='') as file:
  writer = csv.writer(file, lineterminator='\n')
  writer.writerow(['ID','Release Date','Repeat','Easy','Economy','Report','URL'])

  for num in range(len(recipe_url_lists)):
    writer.writerow(recipe_data_set[num])

#Save the created file
with open(data_dir, 'r') as file:
  sheet_info = file.read()

When executed, 〇〇〇.csv is output to the specified directory. The steps so far are also summarized in slides (images omitted).

3-4. Weighting of recipe data

Check the output csv file with Pandas.


#Load csv
rakuten_recipes = pd.read_csv(data_dir, encoding="UTF-8")

#Ready to add to column
df = pd.DataFrame(rakuten_recipes)

df

The image of the output is omitted.

Next, calculate the number of days elapsed from the publication date of the recipe to today.


#Extract the Release Date column from rakuten_recipes
date = np.array(rakuten_recipes['Release Date'])
#Get the current date
today = datetime.date.today()

#Align the types
df['Release Date'] = pd.to_datetime(df['Release Date'], format='%Y-%m-%d')
today = pd.to_datetime(today, format='%Y-%m-%d')

#Number of days elapsed since publication, as an integer
df['Elapsed Days'] = (today - df['Release Date']).dt.days

#Check only the top 5 rows
df.head()

Then, the number of elapsed days will appear next to the URL column.

Next, using the number of elapsed days as a **weight**, weight the three types of stamps (Repeat, Easy, and Economy) and add the weighted values as new columns.



#Correction factor so the values do not become too small
weighting = 1000000

#Extract the 3 types of stamps and the report counts
repeat_stamp = np.array(rakuten_recipes['Repeat'])
easy_stamp = np.array(rakuten_recipes['Easy'])
economy_stamp = np.array(rakuten_recipes['Economy'])
report_stamp = np.array(rakuten_recipes['Report'])

#Totals of each stamp and of the reports
repeat_stamp_sum = sum(repeat_stamp)
easy_stamp_sum = sum(easy_stamp)
economy_stamp_sum = sum(economy_stamp)
report_stamp_sum = sum(report_stamp)

#Add columns of weighted values
'''
Repeat WT = (number of repeat stamps ÷ total repeat stamps) × (correction factor ÷ days elapsed since publication)
'''
df['Repeat WT'] = (df['Repeat'] / repeat_stamp_sum) * (weighting / df['Elapsed Days'])
df['Easy WT'] = (df['Easy'] / easy_stamp_sum) * (weighting / df['Elapsed Days'])
df['Economy WT'] = (df['Economy'] / economy_stamp_sum) * (weighting / df['Elapsed Days'])

#Importance of the reports (range 0 to 1)
proportions_rate = 0.5

#Mix the report count into each weighted value
'''
Repeat WT = (Repeat WT × (1 - importance)) × ((number of reports ÷ total reports) × importance)
'''
df['Repeat WT'] = (df['Repeat WT'] * (1 - proportions_rate)) * ((df['Report'] / report_stamp_sum) * proportions_rate)
df['Easy WT'] = (df['Easy WT'] * (1 - proportions_rate)) * ((df['Report'] / report_stamp_sum) * proportions_rate)
df['Economy WT'] = (df['Economy WT'] * (1 - proportions_rate)) * ((df['Report'] / report_stamp_sum) * proportions_rate)

About the weighting: suppose there are two recipes, one published a month ago and one a year ago, each with the same 100 stamps. The more recommendable one is the recipe from a month ago. Therefore, recipes with more elapsed days get a lower score.
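As a minimal numeric sketch of this idea (the stamp counts and the total below are made up for illustration), the same stamp count scores higher the fewer days have passed:

```python
#Hypothetical values: two recipes with the same 100 stamps
weighting = 1000000   #correction factor from the article
stamp_total = 200     #assumed total stamps across all recipes

def repeat_wt(stamps, elapsed_days):
    #Repeat WT = (stamps ÷ total stamps) × (correction factor ÷ elapsed days)
    return (stamps / stamp_total) * (weighting / elapsed_days)

recent = repeat_wt(100, 30)   #published one month ago
old = repeat_wt(100, 365)     #published one year ago

print(recent > old)  # True: the newer recipe scores higher
```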

Rescale the weighted values so that they range from 0 to 1. Here is the page I used as a reference.

QUANON: "Convert a number in one range to a number in another range"


df['Repeat WT'] = (df['Repeat WT'] - np.min(df['Repeat WT'])) / (np.max(df['Repeat WT']) - np.min(df['Repeat WT']))
df['Easy WT'] = (df['Easy WT'] - np.min(df['Easy WT'])) / (np.max(df['Easy WT']) - np.min(df['Easy WT']))
df['Economy WT'] = (df['Economy WT'] - np.min(df['Economy WT'])) / (np.max(df['Economy WT']) - np.min(df['Economy WT']))

df.head()

This is the execution result. df.head() displays only the top 5 rows.

3-5. Display recommended recipes

The scores given by the user are taken on a 5-point scale, and the search is performed based on them.


#Used to map a score to a range (1: 0-0.2, 2: 0.2-0.4, 3: 0.4-0.6, 4: 0.6-0.8, 5: 0.8-1)
condition_num = 0.2

def PlugInScore(repeat, easy, economy):
  #Clamp the arguments to the range 1 to 5
  if repeat <= 1:
    repeat = 1
  if repeat >= 5:
    repeat = 5
  if easy <= 1:
    easy = 1
  if easy >= 5:
    easy = 5
  if economy <= 1:
    economy = 1
  if economy >= 5:
    economy = 5

  #Narrow down the recipes from the 3 types of scores
  df_result = df[((repeat*condition_num) - condition_num <= df['Repeat WT']) & (repeat*condition_num >= df['Repeat WT']) &
                 ((easy*condition_num) - condition_num <= df['Easy WT']) & (easy*condition_num >= df['Easy WT']) &
                 ((economy*condition_num) - condition_num <= df['Economy WT']) & (economy*condition_num >= df['Economy WT'])]
  # print(df_result)

  CsvOutput(df_result)

Output the search result to a csv file. Replace 〇〇〇 with any name you like!


#Select a file name
data_name = '〇〇〇_result.csv'
#Save destination of the csv file
data_dir_result = save_dir + data_name

#Output csv
def CsvOutput(df_result):
  #Write the narrowed-down result to a csv file
  with open(data_dir_result, 'w', newline='') as file:
    writer = csv.writer(file, lineterminator='\n')
    #Header row (column names)
    writer.writerow(df_result)
    #Each row of values
    for num in range(len(df_result)):
      writer.writerow(df_result.values[num])

  #Read the created file back
  with open(data_dir_result, 'r') as file:
    sheet_info = file.read()

  AdviceRecipe()

Declare a function to display the result.


def AdviceRecipe():
  #Load the csv
  rakuten_recipes_result = pd.read_csv(data_dir_result, encoding="UTF-8")

  df_recipes_res = pd.DataFrame(rakuten_recipes_result)

  print(df_recipes_res)

  print("Recommended 「" + food_name + "」 for you")
  print("Entry No.1: " + df_recipes_res['URL'][random.randint(0, len(df_recipes_res) - 1)])
  print("Entry No.2: " + df_recipes_res['URL'][random.randint(0, len(df_recipes_res) - 1)])
  print("Entry No.3: " + df_recipes_res['URL'][random.randint(0, len(df_recipes_res) - 1)])
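Note that random.randint can pick the same recipe more than once. A sketch that picks three distinct entries (assuming the result has at least three rows; the `urls` list below is a hypothetical stand-in for the URL column) uses random.sample instead:

```python
import random

#Hypothetical list standing in for the URL column of the result
urls = ['url_a', 'url_b', 'url_c', 'url_d', 'url_e']

#Pick 3 distinct indices in one call
picks = random.sample(range(len(urls)), 3)
for n, idx in enumerate(picks, start=1):
    print("Entry No." + str(n) + ": " + urls[idx])
```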

Finally, give a score to the recipe you want to make and display the recommendations.


'''
Call PlugInScore(repeat, easy, economy) with your scores.

  repeat :Do you want to make it again?
  easy   :Is it easy to make?
  economy:Can you make it cheaply?

Give your subjective evaluation of each as an integer from 1 to 5.

1 is negative, 5 is positive.
'''

PlugInScore(1,1,1)

The 3 scores here are: **1**: Do you want to make it again? **1**: Is it easy to make? **1**: Can you make it cheaply?

And here is the execution result (image omitted).

4. Issues / Problems

・ The evaluation method needs more work: the three weighted values are pulled toward the recipe with the highest score, producing extreme results. As a consequence, scores are biased toward either 1 or 5.

・ Few recipes have stamps: roughly **10**% of the recipes have stamps or reports, and roughly **90**% are all zeros, so scoring recipes this way may be somewhat meaningless. There must be great recipes buried among the all-zero ones.
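To see how sparse the stamps actually are, a quick check over the columns used in this article counts the rows where every stamp and report is zero. The miniature DataFrame below is made-up illustration data, not the scraped results:

```python
import pandas as pd

#Hypothetical miniature of the scraped data
df_check = pd.DataFrame({
    'Repeat':  [3, 0, 0, 1],
    'Easy':    [0, 0, 0, 2],
    'Economy': [1, 0, 0, 0],
    'Report':  [0, 0, 0, 5],
})

#Rows where every stamp and the report count are zero
all_zero = (df_check[['Repeat', 'Easy', 'Economy', 'Report']] == 0).all(axis=1)
print(all_zero.sum())         # 2
print(all_zero.mean() * 100)  # 50.0 (percent of the rows)
```

Running the same check on the real csv would quantify the "90% all zeros" observation exactly.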

5. Consideration

I used this system to search for "pork kimchi" and actually made it. Since it was a recommended recipe, it was delicious ^^

It was interesting because I could discover the recipes that were buried.

Thank you to everyone who has read this far. I would be grateful for your comments and advice ^^
