This is a continuation of the previous article, and it will be the last in the series. Now that we have preprocessed the product data collected from Rakuten Ichiba, we can finally try a simple price analysis.
-- What is the going price of the vegetables (May Queen potatoes in this case) sold on Rakuten Ichiba?
-- How much is a reasonable amount per kilogram?
I would like to answer questions like these using only rudimentary analysis skills.
To restate the purpose: this is a price survey for selling agricultural products. The goal this time is to get some grounds for deciding what price is appropriate, by aggregating and graphing the prices of agricultural products per kilogram. To that end, I start by examining the going prices on Rakuten Ichiba, which seems to be an easy market to investigate.
The analysis follows the steps below.
1. Compute summary statistics (mean, median, and so on) of the product price for each quantity and put them into a table. In other words, the first step is to grasp the product price for each weight (1 kg, 2 kg, ...) as concrete numbers.
2. Draw a histogram and a scatter plot from the per-quantity data to see how the quantity relates to the number of listings and to the product price. Focusing on the median product price and visualizing it also leads into step 3 below.
3. Treat the relationship between quantity and median price as a linear function (a straight line) and find the parameters of that equation. Visualizing the median price for each quantity suggests that quantity and product price are in a roughly linear relationship (a straight line on the graph). If that is the case, the product price can be interpreted as a "base amount for shipping, box, and packing labor" plus an "amount per kilogram". By finding the parameters of the relational expression, we can estimate both the base amount and the amount per kilogram.
Before starting the aggregation, prepare the data. Read the processed data saved at the end of the previous article from its CSV file into a DataFrame.
import pandas as pd
df = pd.read_csv('20200914_rakuten_mayqueen_2.csv')
Once the data is ready, let's start by looking at the summary statistics of the product price for each quantity. Use the groupby() method to group the data by quantity, and the describe() method to compute the statistics.
# Check the statistics for each quantity with groupby()
df_by_kg = df.groupby('Quantity').describe()['Product price']
df_by_kg
Looking at the count column of this table, you can see that some rows have only one or two data points while others have many. For example, there are few listings of "potatoes sold in 1 kg or 2 kg units" to begin with, and many listings of "potatoes sold in 5 kg or 10 kg units". Since shipping boxes come in more or less standard sizes, this matches what you would expect in practice, but here we could confirm it from the data.
Next, extract only the required rows from this table. This time we will use only the rows with a large number of listings. You could specify the rows to keep directly, but considering that this script may be reused later, here we keep only the quantities with more than 5 listings (3 kg, 5 kg, 10 kg, and 20 kg qualify).
# Extract only the necessary rows (keep only quantities with more than 5 listings)
df_by_kg = df_by_kg.loc[df_by_kg['count'] > 5, :]
df_by_kg
From here, keep only the columns you need. This time I decided to keep count (number of listings), mean, min (minimum), 50% (median), and max (maximum) as the columns to compare.
# Extract only the required columns
df_by_kg = df_by_kg.loc[:, ['count', 'mean', 'min', '50%', 'max']]
df_by_kg
You now have a clean summary table. Based on this table, let's look at the product price of "10 kg of potatoes" as an example. The statistics were as follows:
-- Minimum (min): 1,399 yen
-- Maximum (max): 7,560 yen
-- Mean (mean): 3,820 yen
-- Median (50%): 3,600 yen
All of these amounts include shipping and tax. To get a sense of the going rate, the mean or the median seems like the value to refer to, but the wide range between the minimum and the maximum suggests that the overall spread is large. For that reason, I will use the **median** rather than the mean as the reference value (this spread is visualized in the next step).
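As a quick check, the statistics for a single quantity can be pulled out of the aggregated table. This is a minimal sketch and assumes that the index of df_by_kg is the numeric quantity in kilograms, as produced by the groupby above.
# Statistics for the 10 kg listings only (assumes a numeric kg index)
print(df_by_kg.loc[10])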
Incidentally, the cheapest listings were unusually cheap, and when I got curious and looked into them, it turned out that very small potatoes were being sold as so-called "wake-ari" (imperfect, discounted) items. Conversely, the most expensive listing was not some special luxury brand but an ordinary regional product; however, from June to early August 2020 the market price of potatoes was unusually high nationwide, so I suspect that is why it was listed at such a price (just a guess).
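If you want to inspect those outliers yourself, something like the following works. This is only a sketch and assumes that the pre-aggregation df still holds one row per listing with the 'Quantity' and 'Product price' columns used above.
# Cheapest and most expensive 10 kg listings
ten_kg = df[df['Quantity'] == 10]
print(ten_kg.sort_values('Product price').head(3))  # cheapest listings
print(ten_kg.sort_values('Product price').tail(3))  # most expensive listings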
Next, let's dig a little deeper into the data by drawing graphs and visualizing it.
First, let's draw a histogram. It shows the quantity (kg) against the frequency (how many listings there are). To draw the histogram, use the hist() function of the matplotlib library, and pass the DataFrame before aggregation (df) as the data. The bins option controls the bin width; here the bins are set in 1 kg steps over the range from 1 to 30 kg.
import matplotlib.pyplot as plt
import numpy as np

# Histogram (relationship between quantity and number of listings)
plt.hist(df['Quantity'], bins=np.arange(1, 30))
plt.xlabel('Quantity (kg)')
plt.ylabel('Count')
plt.grid(True)
plt.show()
As the numbers already suggested, the number of listings varies with the quantity. There are many products sold in 5 kg and 10 kg units, and a fair number in 3 kg and 20 kg units.
Next, let's draw a scatter plot. To draw a scatter plot, use the scatter() function of matplotlib. Again, pass the DataFrame before aggregation (df).
# Scatter plot (relationship between quantity and product price)
plt.scatter(df['Quantity'], df['Product price'])
plt.xlabel('Quantity (kg)')
plt.ylabel('Price (yen)')
plt.grid(True)
plt.show()
Each dot represents a product at a given quantity, and you can clearly see the large price variation that showed up earlier in the summary statistics. With this much spread, the raw scatter alone does not look like a usable basis for deciding a price.
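To put a rough number on that spread, one option is to look at the standard deviation of the price for each quantity. This is just a sketch; std is also part of the describe() output that was dropped earlier.
# Standard deviation of the product price for each quantity
print(df.groupby('Quantity')['Product price'].std())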
So let's plot only the median of the product price. Unlike the scatter plot above, a single point is plotted for each quantity: the median of the corresponding product prices. Here we use the aggregated DataFrame (df_by_kg).
# Plot points (relationship between quantity and median product price)
plt.plot(df_by_kg.index, df_by_kg['50%'], 'o')
plt.xlabel('Quantity (kg)')
plt.ylabel('Price (yen)')
plt.grid(True)
plt.show()
The plotted points appear to line up on a remarkably straight line. It looks like we can capture the relationship between quantity and product price this way.
The value at each point is the median product price for that quantity.
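If you want the exact figures as a list, they can be printed straight from the aggregated table used above (a minimal sketch).
# Median product price for each quantity (kg)
print(df_by_kg['50%'])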
When setting a selling price for vegetables, pricing around these amounts should give a reasonable figure that does not deviate much from the going rate in online sales. Visualizing the data like this makes it intuitively easier to grasp than just reading the numbers. Also, when I looked at the whole data set vaguely, the variation was so large that I could not tell how to read it, but once I narrowed it down to a single statistic, the median, a clear regularity emerged.
Let's dig a little deeper into this result. The median product price for each quantity plotted above appears to follow a linear relationship (the points lie almost on a straight line on the graph). So let's try a linear approximation. By finding the parameters of the fitted line, its slope and intercept, we can interpret the product price in more detail: the "amount per kilogram" corresponds to the slope of the line, and the "base amount for shipping, box, and packing labor" corresponds to its intercept.
NumPy is used for the linear approximation.
# Linear approximation
linear = np.polyfit(df_by_kg.index, df_by_kg['50%'], 1)  # fit a first-degree polynomial to get the slope and intercept
func = np.poly1d(linear)  # build a linear function from the slope and intercept
x = df_by_kg.index
y = func(x)
# Draw the fitted straight line
plt.plot(x, y)
# Also plot the median points
plt.plot(df_by_kg.index, df_by_kg['50%'], 'o')
plt.xlabel('Quantity (kg)')
plt.ylabel('Price (yen)')
plt.grid(True)
plt.show()
As you can see, the plotted points lie roughly on the fitted straight line. The linear variable holds the parameters of this linear equation.
print('Straight line parameters [slope, intercept] = ', linear)
# -> Straight line parameters [slope, intercept] =  [ 142.94797688 2071.99421965]
In other words:
-- Base amount (corresponding to the intercept): approximately 2,072 yen
-- Amount per kilogram (corresponding to the slope): approximately 143 yen
Using these values, if you want to sell a 4 kg product, for example, you can calculate that a price of around 2,072 yen + 4 kg x 143 yen/kg = 2,644 yen would be appropriate.
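The same estimate can also be read directly off the fitted line, since func defined above is the linear function itself (the exact value differs slightly from the rounded hand calculation).
# Estimated price for a hypothetical 4 kg product from the fitted line
print(func(4))  # roughly 2,644 yen (intercept + 4 * slope)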
It was interesting that the median product price could be approximated so cleanly by a linear function. The amount of data used this time was probably too small for a rigorous analysis, but I was still able to obtain figures that are intuitively convincing.
Even with a basic analysis like this one, we obtained reasonable figures that can actually be used. Making decisions based on such analysis results is, I think, the key to putting data analysis to work. And it is important not to stop after one pass, but to repeat the cycle of trial and error so that both the analysis and the decisions based on it become more accurate and more persuasive.
One important caveat is that the validity of these results still needs more verification, for example by collecting more data or scrutinizing the contents of the data more carefully. In this case, I suspect that splitting the data by variety would give results that better reflect the character of each variety.
After the analysis comes the decision-making phase of actually setting a price, and that requires a different perspective. At the very least, I need to settle the following points for myself.
-- How do I position the value of the product? That is, whether to price it below or above the going rate.
-- Who is the target customer? This time I analyzed Rakuten Ichiba data, but it may be necessary to analyze a marketplace closer to the actual target.
It would be nice to be able to make these decisions a little faster, but it is difficult.
For now, I would like to use these figures to produce a concrete result in the near future!
I have written three articles on data analysis in this series, and this is where I will stop for now. Since this was an exploratory analysis based on what I have learned, there may be mistakes in my reasoning or smarter ways to write the code. If you have read this far and noticed anything, I would be very happy to hear your comments!