[PYTHON] Kaggle Summary: Instacart Market Basket Analysis

1. Introduction

This series summarizes past Kaggle competitions. This article covers the data of Instacart Market Basket Analysis and the notable discussions in the forum. The winner's approach is covered in a separate article. Because notebook code is quoted as is, adjust lines such as %matplotlib inline to suit your environment.

2. Background

スクリーンショット 2017-07-20 16.54.52.png

Instacart's data science team builds a variety of predictive models, for example models that predict which products a user will buy again, try for the first time, or add to the shopping cart next. More recently, Instacart has made detailed transaction data publicly available, and in this competition participants compete on the accuracy of prediction models built on this open data. The task is to predict which previously purchased products will be in a user's next order. Although not directly related to the competition, Instacart is also looking for talented machine learning engineers. If you are interested, please check it out.

Instacart uses XGBoost, word2vec, and Annoy to predict what products users will buy next. Products similar to ones a user has purchased before are recommended as "buy again" items.

1-LNpbMMzWBsKqKyNvNH2APA.png

In addition, it recommends the products predicted by the trained model.

1-tf40kqB8rRajbRn6A_0Jcw.png

Using this publicly available data, Instacart offers consumers the potential to discover new food products.

The features of this time are as follows.

3. Evaluation index

The evaluation metric is the F1 score, a standard metric for binary classification problems.

スクリーンショット 2017-07-20 17.23.05.png
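As a rough illustration of the metric, the sketch below computes F1 for a single order by comparing the set of predicted products with the set of products actually bought. The helper name f1_score_for_order and the product ids are made up for this example, and scoring each order as a set of products is an assumption about how the metric is applied.

```python
def f1_score_for_order(predicted, actual):
    """F1 between the predicted and actual product sets of one order."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that were right
    recall = true_positives / len(actual)        # share of actual items that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical product ids, not taken from the competition data.
score = f1_score_for_order(
    predicted=[196, 12427, 10258],
    actual=[196, 10258, 25133, 46149],
)
print(round(score, 4))
```

Here two of three predictions are correct (precision 2/3) and two of four purchased items are found (recall 1/2), so F1 is the harmonic mean of the two.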

4. Introduction of data

Download the data from this link.

There are seven data files. Jordan Tremoureux visualized the relationships among them.

instacart_Files.png

aisles.csv and departments.csv hold product category information, such as vegetables, meat, and sweets. products.csv links each product name to these categories. order_products__{prior, train}.csv is the main training data: prior holds the past orders and train holds the most recent order. Both include a 0/1 flag called reordered; a value of 1 means the user has bought that item before. This part is a little complicated, so we look at it in detail in the EDA below. sample_submission.csv is a sample file showing the submission format. Finally, orders.csv holds the time series information associated with order_products__{prior, train}.csv. Although it is a time series, it contains no timestamps, only time differences such as the number of days since the previous order.

The data as a whole is neatly organized and covers the purchase history of more than 200,000 users. Because time is expressed as a time difference, the complex time handling seen in other competitions can be omitted.
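To show how a time difference can still give a per-user timeline, here is a minimal sketch that accumulates days_since_prior_order into days elapsed since each user's first order. The toy rows are made up, the column days_since_first_order is a name introduced for this example, and treating the first order's missing value as 0 is an assumption.

```python
import pandas as pd

# Toy rows shaped like orders.csv; the values are made up for illustration.
toy_orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [float("nan"), 7.0, 14.0, float("nan"), 30.0],
})

toy_orders = toy_orders.sort_values(["user_id", "order_number"])

# Assumption: the first order of each user has no prior order, so count it as day 0.
gaps = toy_orders["days_since_prior_order"].fillna(0)

# Running total of the gaps within each user gives a relative timeline.
toy_orders["days_since_first_order"] = gaps.groupby(toy_orders["user_id"]).cumsum()
print(toy_orders)
```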

5. Data analysis information

I will introduce two sources of analysis:

  1. SRK's notebook
  2. The analysis published by Instacart on its blog

Although not covered here, Philipp Spachtholz's analysis written in R is also excellent. Its content is almost the same as the analysis in 1., but the information is organized in easy-to-read treemaps, so take a look if you are interested.

unnamed-chunk-26-1.png

5.1. SRK notebook

5.1.1. Data organization

First, import the libraries.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()

%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'

The last option setting turns off a warning that would otherwise appear many times. Adjust this to your preference. Next, let's check which data files are present.

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

#aisles.csv
#departments.csv
#order_products__prior.csv
#order_products__train.csv
#orders.csv
#products.csv
#sample_submission.csv

The data is assumed to be in the input folder one level above the directory where the notebook is open. If the file names shown in the # comments above are listed, everything is in order.

Then import the data.

order_products_train_df = pd.read_csv("../input/order_products__train.csv")
order_products_prior_df = pd.read_csv("../input/order_products__prior.csv")
orders_df = pd.read_csv("../input/orders.csv")
products_df = pd.read_csv("../input/products.csv")
aisles_df = pd.read_csv("../input/aisles.csv")
departments_df = pd.read_csv("../input/departments.csv")

Let's check the loaded contents.

orders_df.head()
スクリーンショット 2017-07-20 18.26.37.png
order_products_prior_df.head()
スクリーンショット 2017-07-20 18.26.43.png
order_products_train_df.head()
スクリーンショット 2017-07-20 18.26.49.png

orders_df seems to hold almost all of the order-level information. order_products_prior_df and order_products_train_df have exactly the same columns, so what is the difference between the two files? In this competition we predict the next purchase from the past purchase history. In other words, the prior file holds the purchase history of both train and test users, and order_products_train_df is the ground-truth data for training. DBEN explains this in an easy-to-understand diagram.

train_user.png test_user.png

For example, in orders_df, user_id = 1 has 10 rows with eval_set = prior, 1 with eval_set = train, and 0 with eval_set = test. In contrast, user_id = 4 has 5 rows with eval_set = prior, 0 with eval_set = train, and 1 with eval_set = test. The "reordered" items of that eval_set = test order are the prediction target: we estimate which of the items user_id = 4 purchased in prior were bought again.
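The split described above can be checked with a small table of counts per user and eval_set. The frame below is a toy stand-in that only mirrors the counts quoted for user_id = 1 and user_id = 4; it is not the real orders.csv, and it assumes the rows of each user are already in order.

```python
import pandas as pd

# Toy version of orders.csv mirroring the counts described above (not the real rows).
toy_orders = pd.DataFrame({
    "user_id": [1] * 11 + [4] * 6,
    "eval_set": ["prior"] * 10 + ["train"] + ["prior"] * 5 + ["test"],
})

# Rows: users, columns: eval_set, cells: number of orders.
counts = pd.crosstab(toy_orders["user_id"], toy_orders["eval_set"])
print(counts)

# The eval_set of each user's last row tells whether the user is a train or test user.
last_set = toy_orders.groupby("user_id")["eval_set"].last()
print(last_set)
```

Running the same crosstab on the real orders_df should show that every user has exactly one train or test order on top of the prior orders.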

To confirm, let's count the prior, train, and test rows in orders_df.

cnt_srs = orders_df.eval_set.value_counts()

plt.figure(figsize=(12,8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[1])  # recent seaborn requires x= and y= keywords
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Eval set type', fontsize=12)
plt.title('Count of rows in each dataset', fontsize=15)
plt.xticks(rotation='vertical')
plt.show()

__results___10_0.png

Because each train or test order is predicted from several prior orders, prior rows are overwhelmingly the most numerous.

5.1.2. Visualization of orders, order_products__ {prior, train}

From 5.1.1, it looks like visualizing the data by eval_set and reordered should reveal something.

First, let's see how many users are in each of prior, train, and test. Group orders_df by eval_set and count the distinct user_ids within each group.

def get_unique_count(x):
    return len(np.unique(x))

cnt_srs = orders_df.groupby("eval_set")["user_id"].aggregate(get_unique_count)
cnt_srs

aggregate is a method that applies a given function to each group.

eval_set
prior    206209
test      75000
train    131209
Name: user_id, dtype: int64
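As a side note on the aggregate call above, pandas also has a built-in nunique for counting distinct values per group. The sketch below compares the two on a made-up frame; it is only an alternative, not what SRK's notebook uses.

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like orders.csv.
toy_orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "eval_set": ["prior", "prior", "train", "prior", "test", "prior"],
})

def get_unique_count(x):
    # Same helper as in the notebook: number of distinct values in the group.
    return len(np.unique(x))

via_custom = toy_orders.groupby("eval_set")["user_id"].aggregate(get_unique_count)
via_builtin = toy_orders.groupby("eval_set")["user_id"].nunique()
print(via_custom)
print(via_builtin)
```

The two agree as long as user_id has no missing values; nunique skips missing values by default.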

You can see that there are 206,209 users in total, of which 131,209 are in train and 75,000 are in test. How many orders does each user place?

cnt_srs = orders_df.groupby("user_id")["order_number"].aggregate("max").reset_index()  # largest order_number = number of orders per user
cnt_srs = cnt_srs.order_number.value_counts()

plt.figure(figsize=(12,8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[2])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Maximum order number', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

__results___13_0.png

Each user has placed a minimum of 4 and a maximum of 100 orders.

There are probably many orders on weekends. Let's break down the number of orders by day of the week, which is stored in 'order_dow'.

plt.figure(figsize=(12,8))
sns.countplot(x="order_dow", data=orders_df, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of order by week day", fontsize=15)
plt.show()

__results___15_0.png

It seems that day of week = 0 and 1 correspond to Saturday and Sunday, respectively. The number of orders differs only slightly among the weekdays. Similarly, let's look at the number of orders by hour of the day.

plt.figure(figsize=(12,8))
sns.countplot(x="order_hour_of_day", data=orders_df, color=color[1])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of order by hour of day", fontsize=15)
plt.show()

__results___17_0.png

Orders peak during the daytime (around 10:00 and 15:00), probably reflecting the orders placed on Saturdays and Sundays. There are few orders early in the morning, and the number of orders declines as the evening gets late.

The number of orders seems closely tied to both the day of the week and the hour. Let's visualize the two together with a heat map.

grouped_df = orders_df.groupby(["order_dow", "order_hour_of_day"])["order_number"].aggregate("count").reset_index()
grouped_df = grouped_df.pivot(index='order_dow', columns='order_hour_of_day', values='order_number')  # keyword arguments are required in pandas 2.x

plt.figure(figsize=(12,6))
sns.heatmap(grouped_df)
plt.title("Frequency of Day of week Vs Hour of day")
plt.show()

__results___19_0.png

pivot reshapes the DataFrame into a new frame: one column becomes the row index, another becomes the column labels, and a third supplies the cell values.
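To make the reshaping concrete, here is a minimal sketch of pivot on a made-up long table with one row per (day, hour) pair. The column order_count and its values are illustrative only.

```python
import pandas as pd

# Long format: one row per (day, hour) pair with a made-up order count.
long_df = pd.DataFrame({
    "order_dow": [0, 0, 1, 1],
    "order_hour_of_day": [9, 10, 9, 10],
    "order_count": [5, 8, 3, 6],
})

# Wide format: days as rows, hours as columns, counts as cells.
wide_df = long_df.pivot(index="order_dow", columns="order_hour_of_day", values="order_count")
print(wide_df)
```

The wide frame is exactly the shape sns.heatmap expects, which is why the notebook pivots before plotting.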

There is a clear difference between weekdays and weekends. On weekdays there is little difference from day to day, and orders are high from 9 am to 4 pm. The two weekend days differ markedly from each other: on Saturdays most orders come after noon and continue until late, while on Sundays most orders come in the morning.

Next, let's look at the time interval of the order.

plt.figure(figsize=(12,8))
sns.countplot(x="days_since_prior_order", data=orders_df, color=color[3])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Days since prior order', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency distribution by days since prior order", fontsize=15)
plt.show()

__results___21_0.png

The number of days between an order and the previous one is stored in 'days_since_prior_order'. The peaks are at 7 and 30 days, and beyond day 7 there is a weekly periodicity with smaller peaks at 14, 21, and 28 days.

Now that the relationship between time and order count is clear, let's take a look at the repurchases that are the prediction target in this competition.

print(order_products_prior_df.reordered.sum() / order_products_prior_df.shape[0])
print(order_products_train_df.reordered.sum() / order_products_train_df.shape[0])
0.589697466792
0.598594412751

The share of repurchased items is around 59% in both prior and train. The remaining roughly 40% of rows are first-time purchases, so there are probably some orders (order_id) that contain no repurchased item at all.

grouped_df = order_products_prior_df.groupby("order_id")["reordered"].aggregate("sum").reset_index()
grouped_df.loc[grouped_df["reordered"] > 1, "reordered"] = 1  # .ix was removed from pandas; cap the per-order sum at 1
grouped_df.reordered.value_counts() / grouped_df.shape[0]
1    0.879151
0    0.120849
Name: reordered, dtype: float64

In prior, 12% of orders contain no repurchased item.
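The order-level flag computed above can also be obtained with a group-wise max instead of summing and then capping at 1. This is a swapped-in alternative shown on made-up rows, not the notebook's own code.

```python
import pandas as pd

# Made-up rows shaped like order_products__prior.csv.
toy_order_products = pd.DataFrame({
    "order_id": [10, 10, 10, 11, 11, 12],
    "reordered": [1, 1, 0, 0, 0, 1],
})

# max() of a 0/1 column is 1 exactly when at least one row in the group is 1.
has_reorder = toy_order_products.groupby("order_id")["reordered"].max()

# Share of orders with and without any repurchased item.
share = has_reorder.value_counts(normalize=True)
print(share)
```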

grouped_df = order_products_train_df.groupby("order_id")["reordered"].aggregate("sum").reset_index()
grouped_df.loc[grouped_df["reordered"] > 1, "reordered"] = 1  # .ix was removed from pandas; cap the per-order sum at 1
grouped_df.reordered.value_counts() / grouped_df.shape[0]
1    0.93444
0    0.06556
Name: reordered, dtype: float64

In train, about 6.6% of orders contain no repurchased item, roughly half the rate in prior.

How many products are in a single order? Let's make a histogram.

grouped_df = order_products_train_df.groupby("order_id")["add_to_cart_order"].aggregate("max").reset_index()
cnt_srs = grouped_df.add_to_cart_order.value_counts()

plt.figure(figsize=(12,8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Number of products in the given order', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

__results___29_0.png

Rather than counting the rows per order_id, SRK takes the maximum value of add_to_cart_order as the number of products in each order. Orders with 5 products are the most common, and the distribution has a long right tail, similar to a Poisson distribution.
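The two ways of measuring basket size mentioned above give the same result when add_to_cart_order runs 1, 2, 3, ... without gaps inside each order. The sketch below checks this on made-up rows; whether the real data has no gaps is an assumption worth verifying.

```python
import pandas as pd

# Made-up rows shaped like order_products__train.csv.
toy_order_products = pd.DataFrame({
    "order_id": [10, 10, 10, 11, 11],
    "product_id": [196, 12427, 10258, 196, 25133],
    "add_to_cart_order": [1, 2, 3, 1, 2],
})

grouped = toy_order_products.groupby("order_id")["add_to_cart_order"]
size_by_max = grouped.max()     # SRK's approach: largest cart position
size_by_count = grouped.size()  # alternative: number of rows per order

print(size_by_max.tolist())
print(size_by_count.tolist())
```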

5.1.3. Detailed analysis

Data about products is distributed among products, aisles, and departments. First, we will combine this information.

order_products_prior_df = pd.merge(order_products_prior_df, products_df, on='product_id', how='left')
order_products_prior_df = pd.merge(order_products_prior_df, aisles_df, on='aisle_id', how='left')
order_products_prior_df = pd.merge(order_products_prior_df, departments_df, on='department_id', how='left')
order_products_prior_df.head()
スクリーンショット 2017-07-21 12.16.03.png

These data were combined using product_id, aisle_id, and department_id as keys.
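The chain of left merges can be tried on a tiny example first. The frames below are made up (the ids and names are illustrative), and the validate="many_to_one" argument is an optional safety check added for this sketch, not something the notebook uses.

```python
import pandas as pd

# Made-up stand-ins for order_products, products, and aisles.
toy_order_products = pd.DataFrame({"order_id": [10, 10, 11], "product_id": [1, 2, 1]})
toy_products = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Banana", "Organic Strawberries"],
    "aisle_id": [24, 24],
})
toy_aisles = pd.DataFrame({"aisle_id": [24], "aisle": ["fresh fruits"]})

# how="left" keeps every order row; validate raises MergeError if a key
# is duplicated on the right side, which would silently multiply rows.
merged = toy_order_products.merge(toy_products, on="product_id", how="left", validate="many_to_one")
merged = merged.merge(toy_aisles, on="aisle_id", how="left", validate="many_to_one")
print(merged)
```

Because each lookup table has one row per key, the merged frame keeps the same number of rows as the order rows it started from.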

**Analysis by product name**

Which product names are ordered most often?

cnt_srs = order_products_prior_df['product_name'].value_counts().reset_index().head(20)
cnt_srs.columns = ['product_name', 'frequency_count']
cnt_srs
スクリーンショット 2017-07-21 12.18.15.png

Bananas are at the top, followed by a variety of food items, mostly fruits. In addition, almost all of them are organic products.

**Analysis by product category (aisle)**

What kinds of products are ordered? The product category is stored in aisle.

cnt_srs = order_products_prior_df['aisle'].value_counts().head(20)
plt.figure(figsize=(12,8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[5])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Aisle', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

__results___38_0.png

The two most common are fresh fruits and fresh vegetables.

**Analysis by department**

Let's take a look at the product department.

plt.figure(figsize=(10,10))
temp_series = order_products_prior_df['department'].value_counts()
labels = (np.array(temp_series.index))
sizes = (np.array((temp_series / temp_series.sum())*100))
plt.pie(sizes, labels=labels, 
        autopct='%1.1f%%', startangle=200)
plt.title("Departments distribution", fontsize=15)
plt.show()

__results___40_0.png

Produce is the most common department.

**Department and repurchase ratio**

Which departments are likely to be repurchased?

grouped_df = order_products_prior_df.groupby(["department"])["reordered"].aggregate("mean").reset_index()

plt.figure(figsize=(12,8))
sns.pointplot(x=grouped_df['department'].values, y=grouped_df['reordered'].values, alpha=0.8, color=color[2])
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Department', fontsize=12)
plt.title("Department wise reorder ratio", fontsize=15)
plt.xticks(rotation='vertical')
plt.show()

__results___42_0.png

Personal care has the lowest repurchase ratio and dairy eggs has the highest.

**Product category and repurchase ratio**

Let's look at the repurchase ratio by department and product category (aisle).

grouped_df = order_products_prior_df.groupby(["department_id", "aisle"])["reordered"].aggregate("mean").reset_index()

fig, ax = plt.subplots(figsize=(12,20))
ax.scatter(grouped_df.reordered.values, grouped_df.department_id.values)
for i, txt in enumerate(grouped_df.aisle.values):
    ax.annotate(txt, (grouped_df.reordered.values[i], grouped_df.department_id.values[i]), rotation=45, ha='center', va='center', color='green')
plt.xlabel('Reorder Ratio')
plt.ylabel('department_id')
plt.title("Reorder ratio of different aisles", fontsize=15)
plt.show()

__results___44_0.png

The vertical axis is the department and the horizontal axis is the repurchase ratio. Points at the same height are different aisles within the same department.

**Add-to-cart order and repurchase ratio**

order_products_prior_df["add_to_cart_order_mod"] = order_products_prior_df["add_to_cart_order"].copy()
order_products_prior_df.loc[order_products_prior_df["add_to_cart_order_mod"] > 70, "add_to_cart_order_mod"] = 70  # .ix was removed from pandas; cap positions at 70
grouped_df = order_products_prior_df.groupby(["add_to_cart_order_mod"])["reordered"].aggregate("mean").reset_index()

plt.figure(figsize=(12,8))
sns.pointplot(x=grouped_df['add_to_cart_order_mod'].values, y=grouped_df['reordered'].values, alpha=0.8, color=color[2])
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Add to cart order', fontsize=12)
plt.title("Add to cart order - Reorder ratio", fontsize=15)
plt.xticks(rotation='vertical')
plt.show()

__results___46_0.png

Unsurprisingly, products added to the cart earlier are more likely to be repurchases. Items that are bought regularly go into the cart first.
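The capping step used before this plot (cart positions above 70 are grouped into 70) can also be written with clip. The sketch below uses made-up rows and is an alternative to the notebook's assignment, not its original code.

```python
import pandas as pd

# Made-up rows shaped like order_products__prior.csv.
toy = pd.DataFrame({
    "add_to_cart_order": [1, 2, 70, 71, 120],
    "reordered": [1, 1, 0, 1, 0],
})

# clip(upper=70) leaves values up to 70 unchanged and replaces larger ones with 70.
toy["add_to_cart_order_mod"] = toy["add_to_cart_order"].clip(upper=70)

# Mean of the 0/1 flag per cart position = repurchase ratio at that position.
ratio = toy.groupby("add_to_cart_order_mod")["reordered"].mean()
print(ratio)
```

Grouping the rare, very late cart positions into one bucket keeps the tail of the plot from being dominated by a handful of rows.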

**Time and repurchase ratio**

Let's look at the repurchase ratio by day of the week and by hour of the day. We recreate the three graphs from 5.1.2, this time using the mean of reordered as the metric.

order_products_train_df = pd.merge(order_products_train_df, orders_df, on='order_id', how='left')
grouped_df = order_products_train_df.groupby(["order_dow"])["reordered"].aggregate("mean").reset_index()

plt.figure(figsize=(12,8))
sns.barplot(x=grouped_df['order_dow'].values, y=grouped_df['reordered'].values, alpha=0.8, color=color[3])
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.title("Reorder ratio across day of week", fontsize=15)
plt.xticks(rotation='vertical')
plt.ylim(0.5, 0.7)
plt.show()

__results___48_0.png

The repurchase rate does not seem to increase or decrease on any particular day of the week.

grouped_df = order_products_train_df.groupby(["order_hour_of_day"])["reordered"].aggregate("mean").reset_index()

plt.figure(figsize=(12,8))
sns.barplot(x=grouped_df['order_hour_of_day'].values, y=grouped_df['reordered'].values, alpha=0.8, color=color[4])
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.title("Reorder ratio across hour of day", fontsize=15)
plt.xticks(rotation='vertical')
plt.ylim(0.5, 0.7)
plt.show()

__results___49_0.png

Items ordered in the morning are likely to be repurchased.

grouped_df = order_products_train_df.groupby(["order_dow", "order_hour_of_day"])["reordered"].aggregate("mean").reset_index()
grouped_df = grouped_df.pivot(index='order_dow', columns='order_hour_of_day', values='reordered')  # keyword arguments are required in pandas 2.x

plt.figure(figsize=(12,6))
sns.heatmap(grouped_df)
plt.title("Reorder ratio of Day of week Vs Hour of day")
plt.show()

__results___50_0.png

Overall, the repurchase ratio is high in the morning. It is especially high at 6-8 a.m. on weekends and at 5-6 a.m. on Tuesdays and Wednesdays. In 5.1.2, the number of orders was highest on Saturday afternoons and Sunday mornings. In terms of repurchases, however, the ratio is high in the early morning on weekdays and weekends alike.

5.2. Analysis information published by Instacart on the blog

In the Medium article, Instacart introduces open data. The data used in the article seems to be the same as the data handled in this competition. In addition to introducing the data, some analysis results are also posted. Here, I will explain the two figures introduced in the article.

1-jwDcKJTXV8D1DK0KOlUJAQ.png

This figure is a scatter plot of the number of purchases against the repurchase ratio, plotted by aisle. Fresh fruits have a higher repurchase ratio than fresh vegetables. Vegetables are bought for specific recipes more often than fruits, which may make their purchases more intermittent. Pantry items such as soups and baking ingredients have the lowest repurchase ratios, probably because they are needed less frequently in the first place.

1-wKfV6OV-_1Ipwrl7AjjSuw (1).png

Healthy snacks and staples are often purchased in the morning, and ice cream (especially Half Baked and The Tonight Dough) is often purchased in the evening. Of the top 25 items ordered latest in the day (evening), 24 were ice cream and the 25th was a frozen pizza.

Chippy reproduces the latter figure. The code is written in R and runs to about 300 lines. If you are interested, please check it out.

5.3. Benchmark using LightGBM

(I will add it later.)
