I tried scraping food recall information with Python to create a pandas data frame

I started posting on Qiita to keep a record of my Python learning. I work at a non-IT company, so Python is really just a hobby... or rather, something I'm learning out of interest.

My current work involves food hygiene, so I wondered whether I could do something related with Python, and decided to try analyzing food recall data.

As the first step, I tried to create a data frame of food recall information by scraping. The source of the data is a site called Recall Plus.

`food_recall_info.py`


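```python
from bs4 import BeautifulSoup
import requests
import re
import time
import pandas as pd

# Response labels shown on each row. Longer labels are listed before their
# shorter prefixes because the regex engine takes the first alternative that matches.
# NOTE: these labels are English renderings; the site itself is in Japanese, so the
# pattern presumably needs the site's own label text to match anything.
ACTION_PATTERN = re.compile(
    r'Recovery＆Refund/Exchange|Recovery＆Refund|Recovery＆Exchange|Recovery'
    r'|Refund/Exchange|Refund|Inspection＆Exchange|Exchange|Notice'
    r'|Violation of the prize labeling law|Apology|Send')

def recalls(url):
    """Return company names, recall details, responses and dates for one listing page."""
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Each recall is a table row whose class is one of these three values
    recall_soup = soup.find_all("tr", class_=["return", "info", "apology"])

    company_list = []
    recall_list = []
    action_list = []
    recall_date = []

    for row in recall_soup:
        # Company name
        company_list.append(row.find("a", href=re.compile("/company/")).get_text())
        # Recall details
        recall_list.append(row.find("a", {"style": "float:left"}).get_text())
        # Response taken (None if no known label is found in the row)
        match = ACTION_PATTERN.search(str(row))
        action_list.append(match.group() if match else None)
        # Date of occurrence: the leading whitespace in the cell is replaced
        # with "20" so that the year becomes four digits
        recall_date.append(row.find("td", {"class": "day"}).get_text().replace("\n        ", "20"))
    return company_list, recall_list, action_list, recall_date

company_lists = []
recall_lists = []
action_lists = []
recall_dates = []

for i in range(1, 20):  # pages 1 to 19
    companies, details, actions, dates = recalls("https://www.recall-plus.jp/category/1?page={}".format(i))
    company_lists.extend(companies)
    recall_lists.extend(details)
    action_lists.extend(actions)
    recall_dates.extend(dates)
    time.sleep(1)  # pause between requests to go easy on the server

recall_df = pd.DataFrame({'company name': company_lists, 'Recall details': recall_lists,
                          'Correspondence': action_lists, 'Accrual date': recall_dates})
```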

Execution result


`recall_df.head()`

| | company name | Recall details | Correspondence | Accrual date |
|---|---|---|---|---|
| 0 | Kobe Bussan | Business Supermarket Imo Stick Some products are mixed with resin pieces | Recovery | 2020/03/17 |
| 1 | Marubun | Marubun Domestic soybeans used Yose tofu Expiration date mislabeled | Recovery | 2020/03/17 |
| 2 | AEON | Hitachiomiya store...Allergen for pork loin seasoned tonteki(milk)Missing display | Apology | 2020/03/18 |
| 3 | Tsuruya | Karuizawa store Delicious white fish Fry Allergen Milk ingredient display missing | Recovery | 2020/03/16 |
| 4 | Hatanaka Koiya | Hatanaka Koiya Koiya Sweet boiled allergen "wheat" missing display | Recovery | 2020/03/13 |

It looks like the data was stored in the pandas data frame without problems.
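One detail in the response-label pattern deserves care: Python's `re` module tries the alternatives from left to right and stops at the first one that matches, so a short label such as `Recovery` listed ahead of `Recovery＆Refund` would always win. Here is a minimal sketch of the effect, using a shortened label list and a made-up row text (both are assumptions for illustration, not taken from the site):

```python
import re

# Hypothetical row text; the real page markup is not reproduced here.
row_text = "Some product Recovery＆Refund 2020/03/17"

# Short label first: the longer alternative can never be reached.
short_first = re.compile(r"Recovery|Recovery＆Refund")
# Longer label first: the full label is captured.
long_first = re.compile(r"Recovery＆Refund|Recovery")

short_result = short_first.search(row_text).group()
long_result = long_first.search(row_text).group()

print(short_result)  # Recovery
print(long_result)   # Recovery＆Refund
```

Listing longer labels before their shorter prefixes is one simple way to keep the combined labels from being cut short.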

At first, I thought about writing each page's results into the data frame as I went, but I gave up because I didn't know how to do it. Instead, I first built one list per column and then loaded those lists into pandas.
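For reference, the page-by-page idea can be sketched roughly as follows: build one small DataFrame per page and join them once at the end with `pd.concat`. The `fake_page` function and its values are hypothetical stand-ins for `recalls(url)`, so the sketch runs without network access.

```python
import pandas as pd

def fake_page(page):
    """Hypothetical stand-in for recalls(url): returns four equal-length lists."""
    companies = ["Company {}-{}".format(page, n) for n in range(2)]
    details = ["Detail {}-{}".format(page, n) for n in range(2)]
    actions = ["Recovery", "Refund"]
    dates = ["2020/03/1{}".format(page)] * 2
    return companies, details, actions, dates

columns = ["company name", "Recall details", "Correspondence", "Accrual date"]

page_frames = []
for page in range(1, 4):
    # zip(*...) turns the four column lists into one tuple per recall row
    rows = list(zip(*fake_page(page)))
    page_frames.append(pd.DataFrame(rows, columns=columns))

# Join the per-page frames once; ignore_index renumbers the rows from 0
recall_df = pd.concat(page_frames, ignore_index=True)
print(recall_df.shape)  # (6, 4)
```

Joining once at the end avoids copying the growing data frame on every iteration of the loop.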

Being a beginner, append() was the only method I knew, so at first I added each page's values with it. However, each page's list was added as a single nested element, and I couldn't load the result into pandas the way I wanted.

After a lot of research, I found that extend() adds only the values inside a list, rather than the list itself.
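The difference is easy to see with two small lists. The company names come from the output above, but splitting them into "pages" like this is just for illustration:

```python
# Two pretend pages of results (the grouping is made up for this example)
page1 = ["Kobe Bussan", "Marubun"]
page2 = ["AEON", "Tsuruya"]

# append() adds each list as one element, producing a list of lists
appended = []
appended.append(page1)
appended.append(page2)
print(appended)  # [['Kobe Bussan', 'Marubun'], ['AEON', 'Tsuruya']]

# extend() adds the items of each list, producing one flat list
extended = []
extended.extend(page1)
extended.extend(page2)
print(extended)  # ['Kobe Bussan', 'Marubun', 'AEON', 'Tsuruya']
```

The flat list has one entry per recall, which is the shape a DataFrame column needs.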

I learned one thing.

Now that the data has been created successfully, I'd like to analyze it. With the data as it stands, I think I could look at (1) what share of recalls end in recovery or a refund, and (2) whether there are times of year when recalls are more likely.

I'd like to try various things through trial and error.
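As a starting point for those two questions, here is a rough sketch. The small DataFrame is hand-made sample data (only loosely modelled on the output above), so the numbers mean nothing. With the real `recall_df`, the same calls should apply, assuming the date column is in `YYYY/MM/DD` form.

```python
import pandas as pd

# Hand-made sample rows, not real scraped data
sample_df = pd.DataFrame({
    "Correspondence": ["Recovery", "Recovery", "Apology", "Recovery＆Refund", "Recovery"],
    "Accrual date": ["2020/03/17", "2020/03/17", "2020/03/18", "2020/02/05", "2020/02/20"],
})

# (1) Share of each response type, as a percentage
action_share = sample_df["Correspondence"].value_counts(normalize=True) * 100
print(action_share)

# (2) Number of recalls per month
dates = pd.to_datetime(sample_df["Accrual date"], format="%Y/%m/%d")
recalls_per_month = dates.dt.to_period("M").value_counts().sort_index()
print(recalls_per_month)
```

Grouping by month is only one choice; the same parsed dates could just as well be grouped by weekday or by season.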
