[PYTHON] Winning at Kaggle by Practicing "Data Analysis Techniques That Win at Kaggle" -- Kaggle M5 Forecasting Accuracy 59th Place (of 5,558) Solution Summary

Introduction

A summary of my 59th-place (of 5,558) solution for the Kaggle M5 Forecasting - Accuracy competition.

My training environment was an ordinary PC with limited memory: 16 GB of RAM and CPU only (Intel i5-3470, 3.2 GHz). The model is a single LightGBM model with no ensemble, and I did not tune the hyperparameters either. I did nothing special; I simply did what is written in "Data Analysis Techniques That Win at Kaggle". I am still surprised that this was enough to place near the top of a tabular competition crowded with competitors.

[Figure: Kaggle Expert (KaggleExpert2020.PNG)]

Because the model is very simple, I thought it was worth sharing. Here I record what I thought and what I actually did in this competition, in the hope that it helps people who are about to take part in data analysis competitions such as Kaggle.

Overview of the competition

The task is to forecast unit sales of Walmart products up to 28 days ahead. The target stores are in California (CA), Texas (TX), and Wisconsin (WI), 10 stores in total. The evaluation metric is not a plain RMSE between actual and forecast sales but a special metric called WRMSSE.

Features of WRMSSE

- If a series' daily sales swing widely, its forecast error counts for less; if the series barely changes, the same error counts for more (the error is scaled by the historical day-to-day variation).
- From a business point of view, series with larger sales over the last 28 days get more weight (working hard to improve the forecast of an item whose sales stay at 0 contributes almost nothing to the score).
- Not just the simple per-product forecast error: the actual-vs-forecast errors of 12 levels of aggregated series (total sales per store, total sales per product category, and so on) are all taken into account.
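To make the scaling and weighting concrete, here is a minimal sketch of the metric as I understand it from the official evaluation description. It is my own simplification: the real definition starts the scale term only after an item's first recorded sale, and the weights, derived from dollar sales over the last 28 training days, are treated here as pre-computed inputs.

```python
import numpy as np

def rmsse(y_train, y_true, y_pred):
    """RMSSE for one series: the 28-day forecast error scaled by the
    series' historical day-to-day variation."""
    scale = np.mean(np.diff(y_train) ** 2)                          # training-history variation
    mse = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)   # error over the horizon
    return np.sqrt(mse / scale)

def wrmsse(series):
    """series: iterable of (weight, y_train, y_true, y_pred), one entry per
    aggregated series across the 12 levels; the weights sum to 1 overall."""
    return sum(w * rmsse(tr, yt, yp) for w, tr, yt, yp in series)
```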

Data features

As mentioned above, there are 10 stores, products in 3 categories (foods, hobbies, household), 3,049 products in total, and about 5 years of past daily sales data. The given features are the date, the day of the week, event dates, the selling price of each product, and whether SNAP, a US food-assistance coupon, can be used on that day. You do not know what any product actually is; the names are anonymized (so you cannot make pinpoint adjustments such as "turkeys sell well around Thanksgiving"). Personally I felt the amount of information given as features was small (is prediction even possible with only this? It seems you can only capture periodic patterns).
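For reference, the raw inputs look roughly like this (file names are from the competition's data page; the shapes in the comments are approximate):

```python
import pandas as pd

# Wide sales history: one row per item x store, one column per day (d_1, d_2, ...)
sales = pd.read_csv("sales_train_evaluation.csv")   # 3,049 items x 10 stores = 30,490 rows
# Calendar: date, weekday, event names/types, snap_CA / snap_TX / snap_WI flags
calendar = pd.read_csv("calendar.csv")
# Weekly selling price per store and item
prices = pd.read_csv("sell_prices.csv")

print(sales.shape, calendar.shape, prices.shape)
```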

Public notebook that was useful

Back to (predict) the future - Interactive M5 EDA: I started by reading this notebook to get an overview of the data.

M5 - custom validation: I started my study by running this handy, publicly available model; it was the first model I got working. It is supposed to be relatively lightweight, but I was disappointed that training it on all products took eight hours on my PC.

M5 - Simple FE: The original data is fairly large as-is, so a compressed, easy-to-use dataset in pickle format was published here. I added my own customized features on top of this pickle-format base.
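With 16 GB of RAM, keeping the long-format grid this compact was essential for me. The published pickle already takes care of it, but the rough idea is to downcast numeric columns and turn strings into categories; the helper below is my own paraphrase, not the notebook's exact code.

```python
import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns and convert strings to categories to cut memory use."""
    for col in df.columns:
        if df[col].dtype == "float64":
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif df[col].dtype == "int64":
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif df[col].dtype == "object":
            df[col] = df[col].astype("category")
    return df

# grid = shrink(grid)
# grid.to_pickle("grid_compact.pkl")   # reload later instead of re-deriving everything
```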

M5 - WRMSSE Evaluation Dashboard: This was very helpful for understanding WRMSSE. Based on this notebook, I computed and checked the WRMSSE of my model's predictions, and with a slight modification I also visualized the sales in the (evaluation) data I was about to submit. That said, an official PDF describing the WRMSSE calculation is posted on the competition site separately from this notebook, and I think reading it carefully was necessary.

In addition, volunteers published a notebook summarizing the competition in Japanese, which helped me read and understand it. Thank you very much.

Strategy and milestone

First, as personal background, I had a rough goal of becoming a Kaggle Expert in 2019. In reality, though, I kept entering competitions on an ad hoc basis, submitting a model close to the medal range built from a public notebook with a little ingenuity added, only to end up outside the medals. Reflecting on that, my goal for this year was not a ranking: the plan was to participate properly in about three competitions a year as a regular practice, make it a habit to learn from each competition, and take something concrete away from every one I enter.

As a plan for this competition, I first laid out the following strategy.

- Study similar past competitions.
- Pursue a model built on simple, essential features that can stand on its own.
- Take a near-medal model from a public notebook, search its predictions for parts with room for improvement (a specific category whose distribution looks strange, many outliers), and replace those parts with the predictions of a specialized model (would I really do this? Well, it is a last resort).

Based on this strategy, I set the following milestones. (In practice I did not do all of them; only about half got done.)

- Survey of a past similar competition: Recruit Restaurant Visitor Forecasting
- Survey of a past similar competition: Favorita Grocery Sales Forecasting (1st place solution)
- Survey of a past similar competition: Instacart Market Basket Analysis

- Golden Week: a short, intensive hands-on period (decide how much time to spend, understand the data, build a minimal dataset, select and evaluate features, create a baseline, build a model, create the 1st submission, hold a retrospective)

- Survey and understand data statistics and distributions (statistics of every column, uniqueness, top-10 values, correlations across products, regions, dates, ...)
- Outlier survey (specific products, specific dates, ...)
- Investigate how much of the data is shared between train and test

- Build a simple, high-performing model on my own
- Survey additional features
- Verify prediction accuracy: quantify and visualize how far the predictions deviate per store / per product

- Create individual models for each of the 28 forecast days
- Create a rule-based out-of-stock forecast (post-processing that fills in sales = 0)

- One month before the deadline: survey the top public baselines (mainly staring at places that could be improved)
- One week before the deadline: survey the top public baselines again (same)
- Regularly read the discussions and notebooks, and try, incorporate, and improve any ideas that look worth adopting

Features

M5 - Custom features: I did not use the features from this notebook.

M5 - Simple FE: I added the following features on top of this notebook's set (a code sketch follows the list).

- Day before a holiday
- Day after a holiday
- Sales 28 days before the forecast target date (lag feature)
- Sum, min, max, and mean of sales over 7-, 14-, 30-, and 60-day windows counted back from the forecast target date
- Average sales over the previous 4 weeks, stepping back 7 days at a time from the forecast target date (captures the day-of-week trend)
- The same average over the previous 8 weeks
- The same average over the previous 12 weeks
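As a rough sketch, the lag and rolling features can be built on the long-format grid like this. The column names ('id', 'sales') follow the public pickle, and this is a paraphrase rather than my exact code; everything is shifted by 28 days so no feature looks past the last known day, and the holiday flags (not shown) come from joining the calendar.

```python
import pandas as pd

def add_sales_features(df: pd.DataFrame) -> pd.DataFrame:
    grp = df.groupby("id")["sales"]

    # Lag feature: sales 28 days before the forecast target date
    df["lag_28"] = grp.transform(lambda s: s.shift(28))

    # Rolling sum / min / max / mean over the 7, 14, 30 and 60 days before that
    for w in (7, 14, 30, 60):
        for stat in ("sum", "min", "max", "mean"):
            df[f"roll_{stat}_{w}"] = grp.transform(
                lambda s, w=w, stat=stat: getattr(s.shift(28).rolling(w), stat)()
            )

    # Same-weekday averages: mean of sales 4, 8 or 12 weeks back, stepping 7 days at a time
    for weeks in (4, 8, 12):
        df[f"wday_mean_{weeks}"] = grp.transform(
            lambda s, n=weeks: sum(s.shift(28 + 7 * k) for k in range(n)) / n
        )
    return df
```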

I am not even using log1p. "Eh? That's all?" ... yes, that is all.

It is curious that max ranks so high in the feature importance.

[Figure: LightGBM feature importance (feature_importance.png)]

Data and model

For the simple reason that it does not fit in memory, only the March-to-June data and the data from 2016 are used as training data. Since even that does not fit in memory at once, training is split by category into foods, hobbies, and household data. On top of that, a separate model is built for each of the 28 forecast days, so 28 x 3 = 84 models in total. For the final submission, the models are trained on all of this data up to the last available day.
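Schematically, the 84-model setup looks like the sketch below. The names and parameters are illustrative placeholders rather than my actual code: FEATURES stands in for the Simple FE columns plus the features listed above, and the LightGBM parameters are untuned defaults.

```python
import lightgbm as lgb
import pandas as pd

FEATURES = ["lag_28", "roll_mean_7", "roll_mean_30", "wday_mean_4", "sell_price"]  # illustrative
PARAMS = {"objective": "regression", "learning_rate": 0.05, "verbosity": -1}       # untuned

def train_84_models(grid: pd.DataFrame) -> dict:
    """One LightGBM model per (category, forecast day): 3 x 28 = 84 models.
    `grid` is assumed to carry a 'horizon' column saying how many days ahead
    each row's target is, so every model sees only the rows for its own day."""
    models = {}
    for cat in ("FOODS", "HOBBIES", "HOUSEHOLD"):
        cat_rows = grid[grid["cat_id"] == cat]
        for day in range(1, 29):
            rows = cat_rows[cat_rows["horizon"] == day].dropna(subset=FEATURES)
            dtrain = lgb.Dataset(rows[FEATURES], label=rows["sales"])
            models[(cat, day)] = lgb.train(PARAMS, dtrain, num_boost_round=300)
    return models
```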

Individual improvements

Handling the gradual increase in sales

Macro-level sales increase little by little, probably due to growth in sales floor area and population, so it was written in some discussion that the final forecast should be multiplied by a factor of 1.02 to 1.05. Since my model's features are mainly lag features, applying this did not seem to make much sense. Just in case, I visualized the sales forecast I was actually going to submit, connected to the past data, and the correction did not appear to be needed in particular, so I did not apply it.

[Figure: past sales connected to the forecast (trend.png)]

[Figure: submitted sales forecast (predict.png)]

TX flood information

There was a discussion about the pros and cons of adding weather data as features (the rules say external data covering the forecast period must not be used). Texas had severe storms and flooding from 5/22 to 6/25; the forecast period is 5/23 to 6/19, and someone wrote that events already known as of 5/22 may be used. There had also been flooding in Texas during 4/17 to 5/1 (d1909-d1923), so I trained a model excluding the Texas data for 4/17 to 5/1, predicted the Texas data for that period, computed the RMSE against actual sales, and visualized the items with large errors. I saw nothing that looked like an effect of the flood, so I decided that the three Texas stores in this competition happen to be in locations unaffected by floods and did not add flood information to the features.

[Figure: Texas items with large prediction errors for 4/17-5/1 (flood_tx.png)]
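The check itself was simple: retrain without the Texas rows for d1909-d1923, predict those days for the three TX stores, and rank the items by error. A minimal sketch, assuming wide frames of actuals and predictions indexed by item-store id with one column per held-out day:

```python
import pandas as pd

def worst_items_by_rmse(actual: pd.DataFrame, pred: pd.DataFrame, top: int = 20) -> pd.Series:
    """Per-series RMSE over the held-out days, largest errors first."""
    rmse = (((actual - pred) ** 2).mean(axis=1)) ** 0.5
    return rmse.sort_values(ascending=False).head(top)
```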

Out-of-stock forecast

Items that are out of stock naturally sell 0 units, so I created a rule-based process that extracts products whose sales were 0 at all 10 stores, or at 9 of them, as of the day before the forecast period. However, a property of WRMSSE is that a series whose sales stay at 0 gets a lighter weight, so blindly filling in 0 actually lowers the score, and you cannot tell up to which day a product will really stay out of stock, so I could not use this as a trump card.

There were several items that looked out of stock, but in the end I judged that FOODS_2_242 really was out of stock, filled its sales with 0, and created a submission. After this post-processing, the score improved slightly, from 0.58026 to 0.57987.

[Figure: sales of FOODS_2_242 showing the apparent out-of-stock period (FOODS_2_242_lack.png)]
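A minimal sketch of the rule and the fill, with illustrative column names (as described above, in the end I only applied the fill to FOODS_2_242):

```python
import pandas as pd

def suspected_out_of_stock(last_day: pd.DataFrame, min_zero_stores: int = 9) -> list:
    """last_day: one row per (item_id, store_id) with that item's sales on the day
    before the forecast period; flag items at zero in 9 or 10 of the 10 stores."""
    zero_stores = (last_day["sales"] == 0).groupby(last_day["item_id"]).sum()
    return zero_stores[zero_stores >= min_zero_stores].index.tolist()

def fill_zero(submission: pd.DataFrame, item_ids: list) -> pd.DataFrame:
    """Set F1..F28 to 0 for every store row of the given items in the submission."""
    f_cols = [f"F{i}" for i in range(1, 29)]
    mask = submission["id"].str.startswith(tuple(item_ids))
    submission.loc[mask, f_cols] = 0
    return submission
```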

Looking back and impressions

That is really all I did. As I wrote at the beginning, the model is a single LightGBM model with no ensemble, I did not tune the hyperparameters, I did not remove outliers, and I did no other preprocessing of the data. Machine learning competitions involve a certain amount of luck and can be compared to mahjong: when the hand comes together, it is not three concealed triplets but four. Take that, amateurs!! ...Well, I did add a few simple touches of my own, didn't I? Sorry.

What went well

- During Golden Week I took two days off from work and, while also making time for my family, ran a short, intensive hands-on session. The goal was to go from rough data analysis through model training to the 1st submission, but it turned out training would take 8 hours, so I did not get as far as the 1st submission. Still, I keenly felt the importance of spending real time on understanding a competition. Golden Week, summer vacation, and the New Year break are the chances. Looking back, when I won a prize two years ago, I had also spent Golden Week grinding away on my PC.
- In time series forecasting, I built up my own knowledge, thin in places, across the whole flow: data analysis, building an easy-to-use dataset, extracting a minimal dataset and creating a baseline, training, and visualizing the forecasts.
- Just following "Data Analysis Techniques That Win at Kaggle" will eventually stop being enough, because everyone is doing it... but before that happens, I managed to win a medal while the book's effect was still working.
- In any case, I have now reached the Kaggle Expert rank.

Things to improve / things to work on next

- Above all, I kept running out of memory and the PC kept freezing. Every time the PC locks up, it wears me down mentally. I need to get better at building minimal datasets so I can evaluate things faster.
- I do not know what features to add. I realized I need to study more broadly, or step up somehow.
- I want to understand DFT (Discrete Fourier Transform), starting from what it even is, and be able to judge whether it can be used as a feature to capture periodic patterns...

Other

If this kind of content is acceptable, I would like to talk about it somewhere as a roughly 10-minute lightning talk. I have never spoken outside my company, so I would like to practice and gain the experience. I will try to find a venue myself, but please feel free to contact me if there is any demand. (^^/
