[PYTHON] Kaggle Summary: BOSCH (winner)

Introduction

This is part of a series summarizing past Kaggle competitions. Here we summarize the information published by the top finishers of the BOSCH competition. The competition outline and the kernels are covered in Kaggle Summary: BOSCH (intro + forum discussion) and [Kaggle Summary: BOSCH (kernels)](http://qiita.com/TomHortons/items/359f8e39b9cd424c2360); this article focuses on discussion of the analysis results and techniques.


This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. The code has been confirmed to work in a Jupyter notebook (adjust `%matplotlib inline` as appropriate for your environment).

1st place solution: Beluga describes the approach. The raw data did not contain a sufficient amount of information on its own, so fairly in-depth feature generation was performed based on the date information (using XGBoost as the yardstick for evaluating features). Ensemble learning was stacked up to a third layer (L3). Deep learning did not work, probably because of overfitting, and the commonly used ExtraTreesRegressor was not used in the end either. It was a competition where XGBoost played the leading role from beginning to end.

This competition came down to how to generate new features from the state transitions between stations, the correlations between station data, and the anonymized time stamps. The data itself is certainly large, but a large-scale computing environment is not essential.

Data analysis

(Figure: Screen Shot 2016-11-21 at 2.02.33.png)

First, roughly two weeks were spent simply looking at the data. Through plots of the numerical data, statistical analysis, transition probabilities between stations, and correlation matrices for each station, an overall picture of the anonymized data was built up. (Figure: StationFeaturSimilarity.png)

The data can be handled with an ordinary 16 GB of memory, but it takes some effort: floating-point values are converted to integers by multiplying them according to their number of decimal places, the date information is reduced to the minimum value for each station, and missing values themselves are kept as features.
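
A minimal sketch of that kind of memory reduction, assuming the standard competition files (the chunk size, sentinel value, and x1000 scaling are illustrative choices, not the winner's published code):

```python
import numpy as np
import pandas as pd

SENTINEL = -999999  # assumed sentinel for "missing"

chunks, labels = [], []
for chunk in pd.read_csv('train_numeric.csv', index_col='Id', chunksize=100000):
    labels.append(chunk.pop('Response'))
    # Values have only a few decimal places, so scaling by 1000 lets them be stored
    # as int32; NaN is mapped to a sentinel so "missing" stays usable as a feature.
    chunks.append((chunk * 1000).round().fillna(SENTINEL).astype(np.int32))
numeric = pd.concat(chunks)
y = pd.concat(labels)

# Date information: keep only the minimum time stamp per station.
date = pd.read_csv('train_date.csv', index_col='Id')
station = [c.split('_')[1] for c in date.columns]      # e.g. 'L0_S0_D1' -> 'S0'
date_min = date.groupby(station, axis=1).min()

# Missing values themselves as a feature: count of non-missing cells per row.
numeric['non_missing_count'] = (numeric != SENTINEL).sum(axis=1)
```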

Feature generation

(Figure: Screen Shot 2016-11-21 at 2.07.50.png)

The so-called magic features were used. Ash noticed that consecutive rows contained duplicated values and features that correlated with Response. Features were therefore generated focusing on the order of Id together with StartStation and StartTime. For more information, please refer to Kaggle Summary: BOSCH (kernels).
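
A minimal sketch of such Id-order features (the sort keys and fill values are assumptions; the original implementation is covered in the kernels summary):

```python
import pandas as pd

# Build a one-row-per-part frame with Id and the earliest time stamp.
date = pd.read_csv('train_date.csv', index_col='Id')
df = pd.DataFrame({'StartTime': date.min(axis=1)}).reset_index()   # columns: Id, StartTime

# Rows that are adjacent when sorted by StartTime and Id often share duplicated
# values and the same Response, so the Id gaps to the neighbours are informative.
df = df.sort_values(by=['StartTime', 'Id'])
df['id_diff_prev'] = df['Id'].diff().fillna(9999999)
df['id_diff_next'] = df['Id'].diff(-1).fillna(9999999)
df['same_time_as_prev'] = (df['StartTime'].diff() == 0).astype(int)
df = df.sort_values(by='Id')          # restore the original order
```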

(Figure: Screen Shot 2016-11-21 at 2.10.33.png)

By examining the autocorrelation coefficient, as in the kernels, it turned out that 0.01 units of the anonymized time correspond to 6 minutes; in other words, the data covers about two years of manufacturing. This fact did not directly improve accuracy, but it did help in generating some date-based features.
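
A rough sketch of that autocorrelation check (file layout as in the competition data; the location of the peak is what suggests the 6-minute interpretation):

```python
import pandas as pd

date = pd.read_csv('train_date.csv', index_col='Id')
start = date.min(axis=1).dropna()
ticks = (start * 100).round().astype(int)          # work in units of 0.01

# Count how many parts start at each tick, filling empty ticks with zero.
counts = ticks.value_counts().sort_index()
counts = counts.reindex(range(ticks.min(), ticks.max() + 1), fill_value=0)

# Look for a periodic peak in the autocorrelation of the per-tick counts.
# A peak near lag 1680 corresponds to one week if one tick is 6 minutes
# (1680 ticks * 6 min = 7 days).
acf = pd.Series({lag: counts.autocorr(lag) for lag in range(1200, 2200, 10)})
print(acf.idxmax())
```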

The date information was not fully exploited here, but as far as one can tell it seems quite useful. Thinking about information such as time differences, the counts of non-missing and missing values, and the beginning and end of the week, it ought to affect the occurrence of defective products, as sketched below.
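
As a hedged illustration (the 16.8-units-per-week and 2.4-units-per-day conversions follow from the 0.01 = 6 minutes observation; the winner's actual feature set is not published in this form):

```python
import pandas as pd

date = pd.read_csv('train_date.csv', index_col='Id')
start = date.min(axis=1)                     # first time stamp of each part
end = date.max(axis=1)                       # last time stamp of each part

feats = pd.DataFrame(index=date.index)
feats['duration'] = end - start              # time spent flowing through the line
feats['n_date_values'] = date.notnull().sum(axis=1)   # non-missing count
feats['n_date_missing'] = date.isnull().sum(axis=1)   # missing count

# If 0.01 of the anonymized time is 6 minutes, then one day is 2.4 and one week
# is 16.8 time units, so rough week / day-of-week features can be derived.
feats['week'] = start // 16.8
feats['day_of_week'] = (start % 16.8) // 2.4
```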

(Figure: Screen Shot 2016-11-21 at 2.11.05.png)

The numerical information used is as follows.

The figure below is a directed graph drawn with R. By summarizing the relationship between the time stamps and the non-missing values, the flow from the production line to defective products becomes visible; whether this flow could be exploited for feature generation seems to have been a key point. (Figure: __results___1_6.png) A noticeable similarity was found between the transition probability matrices of S0-S11 and S12-S23. Because the numerical data at these similar stations are correlated, the features of each pair were combined (e.g. L0_S0_F0 + L0_S12_F330).
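
A minimal sketch of combining the features of such similar stations (only the pair quoted above is used here; the full pairing is an assumption):

```python
import pandas as pd

numeric = pd.read_csv('train_numeric.csv', index_col='Id')

# Pairs of columns from stations whose transition/correlation patterns look alike.
pairs = [('L0_S0_F0', 'L0_S12_F330')]

for a, b in pairs:
    # Treat a missing value as contributing nothing to the combined feature.
    numeric[a + '_plus_' + b] = numeric[a].fillna(0) + numeric[b].fillna(0)
```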

It would be nice if the raw data could be used as-is, but that would be difficult on an ordinary PC. However, this can largely be managed by adding steps such as selecting the features that are actually active in the XGBoost model and deleting duplicate data. Incidentally, arithmetic combinations (sums, differences, and so on) of correlated columns were not useful for improving accuracy. Z-scaling is usually applied to all samples at once, but by applying it week by week one might be able to express "the degree of abnormality within a week", which is an interesting idea.
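
The weekly z-scaling idea might look like the following sketch (whether it actually helps is the open question raised above; the week assignment reuses the 0.01 = 6 minutes assumption):

```python
import pandas as pd

numeric = pd.read_csv('train_numeric.csv', index_col='Id').drop('Response', axis=1)
date = pd.read_csv('train_date.csv', index_col='Id')

# Week index per part (16.8 time units per week if 0.01 = 6 minutes).
week = date.min(axis=1) // 16.8

def zscore(col):
    # z-score within one group: "how abnormal is this value relative to its week"
    return (col - col.mean()) / (col.std() + 1e-9)

z_weekly = numeric.groupby(week).transform(zscore)
```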

(Figure: Screen Shot 2016-11-21 at 2.11.44.png)

It seems the winner did not make much use of the categorical information either. However, for stations [24, 25, 26, 27, 28, 29, 32, 44, 47] only, a feature indicating whether the value was 'T1' or not was used. If BOSCH had disclosed more about what the categorical data meant, it might have been used a little more.
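
A hedged sketch of that T1 flag (the station list is taken from the text above; selecting columns by the '_S{n}_' naming pattern, and reading the whole file at once, are simplifying assumptions):

```python
import pandas as pd

cat = pd.read_csv('train_categorical.csv', index_col='Id', dtype=str)

stations = [24, 25, 26, 27, 28, 29, 32, 44, 47]
t1_flags = pd.DataFrame(index=cat.index)
for s in stations:
    # All categorical columns belonging to station s, e.g. 'L1_S24_F1559'.
    cols = [c for c in cat.columns if '_S%d_' % s in c]
    if cols:
        # Flag: does any value at this station equal 'T1'?
        t1_flags['S%d_has_T1' % s] = (cat[cols] == 'T1').any(axis=1).astype(int)
```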

Evaluation, balance adjustment, results

(Figure: Screen Shot 2016-11-21 at 2.12.11.png)

Cross-validation started with 5 folds and eventually settled on 4 folds. Layer 1 of the ensemble was trained with three different seeds. Duplicates and gaps in the data were used as-is without compression; instead, downsampling of 50% to 90% of the data was used to reduce computational cost. The history of accuracy improvements is shown below.
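
A minimal sketch of the downsampling plus 4-fold setup with three seeds (the downsampling fraction and the use of out-of-fold predictions for the next layer are assumptions about details the write-up leaves open):

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

# Raw numeric features only, for brevity; the engineered features above would be added.
numeric = pd.read_csv('train_numeric.csv', index_col='Id')
y = numeric.pop('Response').values
X = numeric.fillna(-1).values

# Downsample the negative class (here: drop half of the negatives, keep all positives).
rng = np.random.RandomState(0)
pos = np.where(y == 1)[0]
neg = np.where(y == 0)[0]
keep = np.sort(np.concatenate([pos, rng.choice(neg, size=len(neg) // 2, replace=False)]))
X_ds, y_ds = X[keep], y[keep]

# XGBoost settings: see the parameter table further below for a fuller example.
params = {'objective': 'binary:logistic', 'eval_metric': 'auc',
          'max_depth': 14, 'eta': 0.03, 'subsample': 0.9}

# 4-fold CV, repeated with three different seeds, producing out-of-fold
# predictions that layer 2 / layer 3 of the stack could be trained on.
oof = np.zeros((len(y_ds), 3))
for s, seed in enumerate([0, 1, 2]):
    folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=seed)
    for tr, va in folds.split(X_ds, y_ds):
        dtr = xgb.DMatrix(X_ds[tr], label=y_ds[tr])
        dva = xgb.DMatrix(X_ds[va], label=y_ds[va])
        bst = xgb.train(params, dtr, num_boost_round=500,
                        evals=[(dva, 'valid')], early_stopping_rounds=30,
                        verbose_eval=False)
        oof[va, s] = bst.predict(dva)
```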

(Figure: results.png)

Other comments

About the ensemble used: he apparently tried to incorporate ExtraTreesRegressor, but settled on an ensemble consisting only of multiple XGBoost models with different parameters. An example of the parameters is as follows.

| early_stopping | alpha | booster | colsample_bytree | min_child_weight | subsample | eta | objective | max_depth | lambda |
|---|---|---|---|---|---|---|---|---|---|
| auc | 0 | gbtree | 0.6 | 5 | 0.9 | 0.03 | binary:logistic | 14 | 4 |
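
Read as an XGBoost parameter dictionary, the table corresponds to roughly the following (the early_stopping entry maps to the eval metric plus the early_stopping_rounds argument of xgb.train, whose value is not given in the table):

```python
# Parameters from the table above, in xgboost's native naming.
params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'eval_metric': 'auc',        # metric monitored for early stopping
    'eta': 0.03,
    'max_depth': 14,
    'min_child_weight': 5,
    'subsample': 0.9,
    'colsample_bytree': 0.6,
    'alpha': 0,
    'lambda': 4,
}
```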

Among these, eta is the step-size shrinkage of boosting, and reducing it usually improves boosting performance. Strangely, however, making eta smaller did not contribute to any accuracy improvement this time.

(Figure: Screen Shot 2016-11-21 at 4.09.53.png)

Finally, Ash summarizes the ensemble in detail. Since this was already explained above, it is only touched on briefly here.
