Collect product information and process data using Rakuten product search API [Python]

Introduction

Last time, I extracted 30 items (1 page) of data using the Rakuten product search API and saved it to a csv file.

This time, in order to gather more practical data for analysis and for connecting it to decision making, I first extended the previous script to increase the number of items acquired, and then, as a preliminary step toward aggregation and analysis, processed the product data so that it can be classified by weight (5 kg, 10 kg, ...).

Policy

(1) Acquire product information

- Get 300 items (30 items x 10 pages) of product information for the search keyword "May Queen" (with NG words specified)
- Create a dict containing only the necessary product information
- Store it in a pandas DataFrame and organize the columns

The flow is the same as last time, but this time the number of items to acquire and the NG words are also specified. These should be adjusted according to the purpose of the data collection and the amount and content of the information you need.

In my case, I specified NG words that do not fit the aim of this survey and that are likely to affect the product price. Here, "organically grown / organic / pesticide-reduced / pesticide-free", "set / assortment", and "hometown tax payment" are excluded. There is no deep meaning to the figure of 300: when I searched with this keyword and these NG words, the number of hits was about 320, so it was simply a convenient round number.

(2) Data processing procedure

- Extract only the necessary data from the collected data
- Extract the weight (5 kg, 10 kg, ...) from the product name and create a new "Quantity" column

Strictly speaking, it would be more accurate to also split the quantities by vegetable variety, but this time I decided not to: subdividing too finely would leave too few data points.

(1) Script to acquire product information

1. Import required libraries and prepare input parameters

This is almost the same as last time; only the page, NGKeyword, and postageFlag parts of the input parameters sent with the API request have changed.

The maximum number of items that a single request to the Rakuten product search API can return is 30 (the value of the hits input parameter). Specifying page = 2 returns items from the 31st onward, so the idea is to step through the values of page with a for loop.

The input parameter postageFlag specifies whether the item price includes shipping (1 means shipping included or free shipping).

import requests
import numpy as np
import pandas as pd

REQUEST_URL = "https://app.rakuten.co.jp/services/api/IchibaItem/Search/20170706"
APP_ID="<Enter the Rakuten API ID>"

# Input parameters
search_keyword = 'Make-in'  # the "May Queen" potato variety (the original script uses the Japanese name)
ng_keyword = 'Hometown tax payment Organically grown Organic Pesticide-reduced Pesticide-free Set Assortment'  # space-separated NG words (originally Japanese terms)
page = 1
search_params = {
    "format" : "json",
    "keyword" : search_keyword,
    "NGKeyword" : ng_keyword,
    "applicationId" : [APP_ID],
    "availability" : 0,
    "hits" : 30,
    "page" : page,
    "sort" : "standard",
    "postageFlag" : 1
}
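Before looping over pages, it can help to confirm how many items the search actually matches. Below is a minimal sketch, assuming the count and pageCount fields that the API reference documents for the response JSON:

# One-off request to check how many items and pages this search yields
response = requests.get(REQUEST_URL, search_params)
result = response.json()
print(result['count'], 'items over', result['pageCount'], 'pages')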

2. Create a dict type that contains only the necessary product information

Use a for loop to get the product information that spans multiple pages.

Extracting only the necessary items from the product information, storing them in a dict named `tmp_item`, and appending them to a list named `item_list` is the same flow as last time.

# Get product information into a list
item_list = []  # dicts of product information: 30 items x 10 pages
max_page = 10

# Keys to extract from each product entry
item_key = ['itemName', 'itemPrice', 'itemCaption', 'shopName', 'shopUrl', 'itemUrl']

for page in range(1, max_page+1):
    search_params['page'] = page
    # Send a request to the API and parse the JSON result
    response = requests.get(REQUEST_URL, search_params)
    result = response.json()

    # Create a dict holding only the necessary information for each item
    for i in range(0, len(result['Items'])):
        tmp_item = {}
        item = result['Items'][i]['Item']
        for key, value in item.items():
            if key in item_key:
                tmp_item[key] = value
        item_list.append(tmp_item.copy())

Incidentally, even when I entered a keyword with fewer than 10 pages of products, the requests completed without errors, so it seems the page value can be set larger than the actual number of result pages (according to the API reference, the upper limit of page is 100). However, the loop takes time, and in rare cases a request fails in my environment, so it is better to keep the page count to the minimum necessary.
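If you do run into the occasional failed request, a lightly defensive variant of the loop is possible. This is a minimal sketch of my own, not part of the original script, assuming a status check and a short pause between requests are enough:

import time

for page in range(1, max_page+1):
    search_params['page'] = page
    response = requests.get(REQUEST_URL, search_params)
    response.raise_for_status()  # fail fast if the API returns an error status
    result = response.json()
    # ... extract the items as shown above ...
    time.sleep(1)  # pause between page requests to go easy on the API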

3. Create a DataFrame

This, too, creates a pandas DataFrame from the list of product-information dicts, as before. I modified it slightly so that the index starts from 1 instead of 0.

# Create the DataFrame
df = pd.DataFrame(item_list)
df = df.reindex(columns=['itemName', 'itemPrice', 'itemCaption', 'itemUrl', 'shopName', 'shopUrl'])
df.columns = ['Product name', 'Product price', 'Product description', 'Product URL', 'Store name', 'Store URL']
df.index = df.index + 1  # re-index from 1

Check the number of acquired rows with df.count() and the first 5 rows with df.head(). If it contains 300 rows as expected, everything is fine.
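For reference, those checks look like this:

print(df.count())  # non-null count per column; 300 is expected here
df.head()          # first 5 rows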

4. csv output

Since it is tedious to run this script every time the data is needed, output it to csv so it can be reused whenever you want.

df.to_csv('20200914_rakuten_mayqueen.csv')

(2) Data processing / aggregation script

1. Read csv

Now that we have more data, let's process it into a form suitable for analysis. You could carry on with the DataFrame as it is, but here we first reload it from the csv file and then proceed.

df = pd.read_csv('20200914_rakuten_mayqueen.csv')
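One caveat: because to_csv() wrote the DataFrame together with its index, reading it back as above leaves that index as an extra unnamed column. A small adjustment, assuming pandas's default behavior, keeps the 1-based numbering instead:

df = pd.read_csv('20200914_rakuten_mayqueen.csv', index_col=0)  # reuse the saved index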

2. Extract only the required data

We will now process this DataFrame. First, a quick look at the contents in a spreadsheet showed that extra data other than vegetables, which does not fit the purpose of this price survey, was mixed in. Since we need weight data this time, the policy is to keep only products whose name contains a weight.

In pandas, str.contains() checks each row for a partial match against the specified string and returns the results as boolean values.

# Keep only product data with "kg" in the product name
kg_flg = df['Product name'].str.contains('kg')
df = df[kg_flg]

kg_flg is a Series of booleans: True for rows whose product name contains "kg" and False for rows that do not. The True rows are the data we want to keep, so indexing with df[kg_flg] extracts a DataFrame containing only those rows.

Checking the count with df.count(), the data decreased to 116 rows. If you want to secure more data, this filtering step would need a little more scrutiny.
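As a side note, str.contains() returns NaN for rows where the product name itself is missing, and boolean indexing with NaN raises an error in recent pandas versions. If that ever comes up with this data, a hedged variant is to treat missing names as non-matches:

# Treat rows with a missing product name as not containing "kg"
kg_flg = df['Product name'].str.contains('kg', na=False)
df = df[kg_flg]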

3. Extract the weight from the product name to create a new column

The remaining rows all have "kg" in the product name, and next I want to cut this weight out into a separate column. The product name should contain the weight in the form "number + kg", so we take out that number and create a new column called "Quantity".

Let's use a regular expression here (I won't explain regular expressions in detail, but "number + kg" can be expressed as ([0-9]+)kg). Pass this regular expression as the argument to pandas's str.extract(). This method extracts, for each row, the first substring matching the capture group and returns it as a new column, which is exactly what we want to do here.

# Cut out the weight from the product name into a separate column
df['Quantity'] = df['Product name'].str.extract('([0-9]+)kg')
df = df.reindex(columns=['Product name', 'Quantity', 'Product price', 'Product description', 'Product URL', 'Store name', 'Store URL'])

df.to_csv('20200914_rakuten_mayqueen_2.csv')

Reorder the columns for readability and output the csv at the end. The resulting DataFrame looks like this:

[Image: the processed DataFrame with the new "Quantity" column]

In the image, after converting the quantity to a numeric type for later analysis, I also added a price-per-kg column (the product price divided by the quantity). The numbers vary quite a bit, though, so I will dig a little deeper next time.
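The conversion and unit-price column described above could look like the following sketch (the column name 'Price per kg' is my own choice, not from the original):

# Convert the extracted quantity from string to a numeric type
df['Quantity'] = pd.to_numeric(df['Quantity'])

# Price per kg = product price / quantity
df['Price per kg'] = df['Product price'] / df['Quantity']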

In conclusion

Next time, I will aggregate the data to calculate statistics such as the average product price, and try a simple analysis that could lead to decision making, such as visualizing the price per kg and fitting a linear approximation.

Recommended Posts

Collect product information and process data using Rakuten product search API [Python]
Output product information to csv using Rakuten product search API [Python]
Aggregate and analyze product prices using Rakuten Product Search API [Python]
Data acquisition using python googlemap api
Let's automatically collect company information (XBRL data) using the EDINET API (4/10)
[Python] I tried to get various information using YouTube Data API!
Get Youtube data in Python using Youtube Data API
Collect data using scrapy and populate mongoDB
Creating Google Spreadsheet using Python / Google Data API
[Python] Get all comments using Youtube Data API
Try using ChatWork API and Qiita API in Python
Data analysis using Python 0
Get LEAD data using Marketo's REST API in Python
Data cleaning using Python
[Python] Get insight data using Google My Business API
Send and receive Gmail via the Gmail API using Python
Crawling with Python and Twitter API 1-Simple search function
Process csv data with python (count processing using pandas)
Search Twitter using Python
Speech transcription procedure using Python and Google Cloud Speech API
Using the National Diet Library Search API in Python
[Python] Get user information and article information with Qiita API
I tried to search videos using Youtube Data API (beginner)
Visualize plant activity from space using satellite data and Python
Get data using Ministry of Internal Affairs and Communications API
Collect large numbers of images using Bing's image search API
Collect tweets using tweepy in Python and save them in MongoDB
Predict gender from name using Gender API and Pykakasi in Python
Graph time series data in Python using pandas and matplotlib
[Python] I tried collecting data using the API of wikipedia
[Go language] Collect and save Vtuber images using Twitter API
Get product name and lowest price using Amazon Product Advertising API
[Python] Random data extraction / combination from DataFrame using random and pandas
I made a Chatbot using LINE Messaging API and Python
Play with YouTube Data API v3 using Google API Python Client
Search algorithm using word2vec [python]
[Python] Depth-first search and breadth-first search
Data analysis using python pandas
Automation of a research on geographical information such as store network using Python and Web API
Process Splunk execution results using Python and save to a file
[Python] LINE notification of the latest information using Twitter automatic search
Basic map information using Python Geotiff conversion of numerical elevation data
Crawling with Python and Twitter API 2-Implementation of user search function
[Python] Conversation using OpenJTalk and Talk API (up to voice output)
[Python] I wrote a REST API using AWS API Gateway and Lambda.
I made a Chatbot using LINE Messaging API and Python (2) ~ Server ~
Dump, restore and query search for Python class instances using mongodb
Get Salesforce data using REST API
Authentication using tweepy-User authentication and application authentication (Python)
[Python3] Google translate google translate without using api
Cooperation between python module and API
[python] Read information with Redmine API
Get Amazon data using Keep API # 1 Get data
Hashing data in R and Python
Clustering and visualization using Python and CytoScape
Data acquisition memo using Backlog API
Infinite product in Python (using functools)
Try using Python argparse's action API
Run Ansible from Python using API
Process Pubmed .xml data with python
Backtrace using DWARF information and pyelftools