[PYTHON] <Pandas> How to handle time series data in a pivot table

Introduction

The author is currently participating in an AI human resources development program sponsored by the Ministry of Economy, Trade and Industry called AI Quest, where the first term competition ended at 8:00 today. I'm not glad that the result was 18th place (295 participants), but I think that it was a very practical content because it was a demand forecast assuming actual data, so I wrote down this experience. I will write it down.

Data overview

The outline of this competition is that daily sales data for each product store of a company for about two years is given, and based on that, sales for the next month are predicted, and the data is as follows. It was given in the form of.

Date Store ID Product ID Product price Quantity sold
0        2018-01-01     9  1000001   420   1.0
1        2018-01-01     6  1000001   420   1.0
2        2018-01-01    10  1000001   420   1.0
3        2018-01-01     0  1000017   250   1.0

Preprocessing

Set the date to datetime type before handling it with pivot.


df = pd.read_csv("data/sales_history.csv", parse_dates=["date"])

When reading in this way, the date column could be read with datetime. If you display it as df.info (), you can see that it is of datetime type as shown below.

RangeIndex: 1119570 entries, 0 to 1119569
Data columns (total 5 columns):
 #   Column  Non-Null Count    Dtype         
---  ------  --------------    -----         
0 date 1119570 non-null  datetime64[ns]
1 Store ID 1119570 non-null  int64         
2 Product ID 1119570 non-null  int64         
3 Product price 1119570 non-null  int64         
4 Number of units sold 1119570 non-null  float64       
dtypes: datetime64[ns](1),float64(1),int64(3)
memory usage: 42.7 MB

Next, I want to aggregate this by month, so I will create a column that matches the year and month.


df["Year month"] = df["date"].dt.strftime("%y%m")

If you try it as df ["year / month "] .head (), you can see that it is created as follows. By the way, if you set it to % Y, it will be 4 digits in the Christian era, and if you set it to% y, it will be the last 2 digits.

0    1801
1    1801
2    1801
3    1801
4    1801
Name:Year month, dtype: object

Let's summarize it by month. The total number of units sold is fine, but since we want to use the average price instead of the total (this time we set the median value that is not affected by abnormal values), we will describe them together in agg.


df = df.groupby(["Year month", "Product ID", "Store ID"]).agg({"Number of units sold":"sum", "Product price":"median"}).reset_index()

It is summarized by month as follows. Here, if you use .reset_index (), the column used for grouping will become MultiIndex (the index has a multi-level hierarchy) and it will be difficult to handle, so solve it. I am doing it.

Year / Month Product ID Store ID Number of units sold Product price
0       1801  1000001     0   6.0   420
1       1801  1000001     1   2.0   325
2       1801  1000001     2   1.0   420

pivot table

It's finally the main subject. What we should do this time is that the month with no sales is not 0 and there is no data itself, and if we make a prediction as it is, the information that the sales was 0 will be omitted, so the month without data is 0 I want to fill it with. Also, I would like to add the lag feature (sales of last month or the previous month). The pivot table makes it easy to do these things.


df_pivot = df.pivot_table(index=["Product ID", "Store ID"], columns="Year month", values="Number of units sold")

If you specify the index and column and the value to be aggregated in this way, it will make a nice feeling.

Year 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810\
Product ID Store ID
1000001 0      6.0   3.0   1.0   2.0   NaN   4.0   3.0   2.0   1.0   NaN   
        1      2.0   1.0   NaN   2.0   NaN   2.0   1.0   NaN   NaN   1.0   
        2      1.0   NaN   1.0   NaN   1.0   2.0   3.0   4.0   2.0   1.0 

As you can see, when there is no data, it can be expressed as a missing value, so if you fill this with 0, you can express sales as 0.


df_pivot = df_pivot.fillna(0)

Next, let's look at the lag features. The writing method is the same shift as shifting one column in time series data.


sold_lag_1 = df_pivot.shift(1, axis=1)

If you do this, it will shift one to the right as shown below.

Year 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810\
Product ID Store ID
1000001 0      NaN   6.0   3.0   1.0   2.0   0.0   4.0   3.0   2.0   1.0   
        1      NaN   2.0   1.0   0.0   2.0   0.0   2.0   1.0   0.0   0.0   
        2      NaN   1.0   0.0   1.0   0.0   1.0   2.0   3.0   4.0   2.0    

In the same way, we were able to add a few months ago lag as a feature.


sold_lag_2 = df_pivot.shift(2, axis=1)
sold_lag_3 = df_pivot.shift(3, axis=1)

Finally, you need to add these to the original data frame. For example, the data of the number of units sold in the corresponding month filled with 0 is processed as follows.


sold_lag_0 = df_pivot.stack().reset_index().rename(columns={0:"Filled with 0 units sold"})

First, you can eliminate what was pivoted by .stack () and express it with a vertical multi-index.

Product ID Store ID Year / Month
1000001  0     1801    6.0
               1802    3.0
               1803    1.0
               1804    2.0
               1806    4.0

By doing so, I was able to eliminate the multi-index and arrange it so that it can be combined by tweaking the column names as needed. All you have to do is merge the column names together and you're done.


df_merge = pd.merge(df, sold_lag_0, how="right", on=['Year month', 'Store ID', 'Product ID'])

At this time, if you set how =" right ", you can join right outer join (join based on the table on the right side) and it is completed successfully.

Year / Month Product ID Store ID Number of units sold Product price Number of units sold 0
0       1801  1000001     0   6.0   420    6.0
1       1801  1000001     1   2.0   325    2.0
2       1801  1000001     2   1.0   420    1.0

After that, the lag features can be combined in the same way, and the price can be linearly interpolated with df_pivot.interpolate ("nearest", limit_direction ='both', axis = 1) . I can do it. In other words, more information can be extracted by performing four arithmetic operations on the lag features.

Based on what I learned this time, I would like to devote myself to achieving better results in the next competition.

Recommended Posts

<Pandas> How to handle time series data in a pivot table
How to handle time series data (implementation)
How to read time series data in PyTorch
Ingenuity to handle data with Pandas in a memory-saving manner
How to generate exponential pulse time series data in python
How to format a table using Pandas apply, pivot and swaplevel
How to display DataFrame as a table in Markdown
How to handle data frames
[Pandas] How to check duplicates and delete duplicates in a table (equivalent to deleting duplicates in Excel)
Graph time series data in Python using pandas and matplotlib
[Python] How to output a pandas table to an excel file
How to get an overview of your data in Pandas
Data science companion in python, how to specify elements in pandas
How to handle session in SQLAlchemy
How to write soberly in pandas
[Python] How to use Pandas Series
How to handle Japanese in Python
How to calculate the sum or average of time series csv data in an instant
How to extract features of time series data with PySpark Basics
[Python] How to add rows and columns to a table (pandas DataFrame)
How to create a large amount of test data in MySQL? ??
Example of how to aggregate a large amount of time series data using Python at a reasonable speed in a small memory environment
How to compare time series data-Derivative DTW, DTW-
How to get a stacktrace in python
How to display multiplication table in python
How to reassign index in pandas dataframe
How to handle consecutive values in MySQL
How to read CSV files in Pandas
Adding Series to columns in python pandas
How to change multiple columns of csv in Pandas (Unixtime-> Japan Time)
How to send a visualization image of data created in Python to Typetalk
How to stop a program in python until a specific date and time
Convenient time series aggregation with TimeGrouper in pandas
Get time series data from k-db.com in Python
How to clear tuples in a list (Python)
How to use Python Image Library in python3 series
Kaggle Kernel Method Summary [Table Time Series Data]
How to create a JSON file in Python
[Memo] I tried a pivot table in Python
A clever way to time processing in Python
How to implement a gradient picker in Houdini
[Introduction to Python] How to handle JSON format data
How to create data to put in CNN (Chainer)
How to notify a Discord channel in Python
[Python] How to draw a histogram in Matplotlib
How to create a Rest Api in Django
Data cleaning How to handle missing and outliers
How to write a named tuple document in 2020
How to count numbers in a specific range
How to handle datetime type in python sqlite3
How to read a file in a different directory
How to Mock a Public function in Pytest
Extract periods that match a particular pattern from pandas time series qualitative data
The first step to log analysis (how to format and put log data in Pandas)
How to specify a schema in Django's database settings
How to convert / restore a string with [] in python
How to apply markers only to specific data in matplotlib
I want to give a group_id to a pandas data frame
How to measure processing time in Python or Java
How to scrape horse racing data using pandas read_html
How to convert horizontally held data to vertically held data with pandas