[PYTHON] Kaggle Summary: BOSCH (winner)

Introduction

This is part of a series summarizing past Kaggle competitions. Here we summarize the information published by the top finishers of the BOSCH competition. The competition outline and the kernels are covered in Kaggle Summary: BOSCH (intro + forum discussion) and [Kaggle Summary: BOSCH (kernels)](http://qiita.com/TomHortons/items/359f8e39b9cd424c2360); this article focuses on discussion of the analysis results and techniques.


This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. The code has been confirmed to work in a Jupyter notebook (adjust `%matplotlib inline` as appropriate for your environment).

1st place solution: Beluga describes the approach. The raw data did not contain a sufficient amount of information on its own, so fairly in-depth feature generation was performed based on the date information (using XGBoost as the yardstick for evaluating features). Ensemble learning was stacked up to a third layer (L3). Deep learning did not work, probably because of overfitting, and the commonly used ExtraTreesRegressor was not used in the end either. It was a competition where XGBoost played the leading role from beginning to end.

This competition came down to how to generate new features from the state transitions between stations, the correlations between station data, and the anonymized time stamps. The data itself is certainly large, but a large-scale computing environment is not essential.

Data analysis

(Figure: Screen Shot 2016-11-21 at 2.02.33.png)

First, roughly two weeks were spent simply looking at the data. Through plots of the numerical data, statistical analysis, transition probabilities between stations, and correlation matrices for each station, an overall picture of the anonymized data was built up. (Figure: StationFeaturSimilarity.png)

The data can be handled with an ordinary 16 GB of memory, but it takes some effort: floating-point values are converted to integers by multiplying them according to their number of decimal places, the date information is reduced to the minimum value for each station, and missing values themselves are kept as features.
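
A minimal sketch of that kind of memory reduction, assuming the standard competition files (the chunk size, sentinel value, and x1000 scaling are illustrative choices, not the winner's published code):

```python
import numpy as np
import pandas as pd

SENTINEL = -999999  # assumed sentinel for "missing"

chunks, labels = [], []
for chunk in pd.read_csv('train_numeric.csv', index_col='Id', chunksize=100000):
    labels.append(chunk.pop('Response'))
    # Values have only a few decimal places, so scaling by 1000 lets them be stored
    # as int32; NaN is mapped to a sentinel so "missing" stays usable as a feature.
    chunks.append((chunk * 1000).round().fillna(SENTINEL).astype(np.int32))
numeric = pd.concat(chunks)
y = pd.concat(labels)

# Date information: keep only the minimum time stamp per station.
date = pd.read_csv('train_date.csv', index_col='Id')
station = [c.split('_')[1] for c in date.columns]      # e.g. 'L0_S0_D1' -> 'S0'
date_min = date.groupby(station, axis=1).min()

# Missing values themselves as a feature: count of non-missing cells per row.
numeric['non_missing_count'] = (numeric != SENTINEL).sum(axis=1)
```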

Feature generation

(Figure: Screen Shot 2016-11-21 at 2.07.50.png)

The so-called magic features were used. Ash noticed that consecutive rows contained duplicated values and features that correlated with Response. Features were therefore generated focusing on the order of Id together with StartStation and StartTime. For more information, please refer to Kaggle Summary: BOSCH (kernels).
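
A minimal sketch of such Id-order features (the sort keys and fill values are assumptions; the original implementation is covered in the kernels summary):

```python
import pandas as pd

# Build a one-row-per-part frame with Id and the earliest time stamp.
date = pd.read_csv('train_date.csv', index_col='Id')
df = pd.DataFrame({'StartTime': date.min(axis=1)}).reset_index()   # columns: Id, StartTime

# Rows that are adjacent when sorted by StartTime and Id often share duplicated
# values and the same Response, so the Id gaps to the neighbours are informative.
df = df.sort_values(by=['StartTime', 'Id'])
df['id_diff_prev'] = df['Id'].diff().fillna(9999999)
df['id_diff_next'] = df['Id'].diff(-1).fillna(9999999)
df['same_time_as_prev'] = (df['StartTime'].diff() == 0).astype(int)
df = df.sort_values(by='Id')          # restore the original order
```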

(Figure: Screen Shot 2016-11-21 at 2.10.33.png)

By examining the autocorrelation coefficient, as in the kernels, it turned out that 0.01 units of the anonymized time correspond to 6 minutes; in other words, the data covers about two years of manufacturing. This fact did not directly improve accuracy, but it did help in generating some date-based features.
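
A rough sketch of that autocorrelation check (file layout as in the competition data; the location of the peak is what suggests the 6-minute interpretation):

```python
import pandas as pd

date = pd.read_csv('train_date.csv', index_col='Id')
start = date.min(axis=1).dropna()
ticks = (start * 100).round().astype(int)          # work in units of 0.01

# Count how many parts start at each tick, filling empty ticks with zero.
counts = ticks.value_counts().sort_index()
counts = counts.reindex(range(ticks.min(), ticks.max() + 1), fill_value=0)

# Look for a periodic peak in the autocorrelation of the per-tick counts.
# A peak near lag 1680 corresponds to one week if one tick is 6 minutes
# (1680 ticks * 6 min = 7 days).
acf = pd.Series({lag: counts.autocorr(lag) for lag in range(1200, 2200, 10)})
print(acf.idxmax())
```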

The date information was not fully exploited here, but as far as one can tell it seems quite useful. Thinking about information such as time differences, the counts of non-missing and missing values, and the beginning and end of the week, it ought to affect the occurrence of defective products, as sketched below.
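
As a hedged illustration (the 16.8-units-per-week and 2.4-units-per-day conversions follow from the 0.01 = 6 minutes observation; the winner's actual feature set is not published in this form):

```python
import pandas as pd

date = pd.read_csv('train_date.csv', index_col='Id')
start = date.min(axis=1)                     # first time stamp of each part
end = date.max(axis=1)                       # last time stamp of each part

feats = pd.DataFrame(index=date.index)
feats['duration'] = end - start              # time spent flowing through the line
feats['n_date_values'] = date.notnull().sum(axis=1)   # non-missing count
feats['n_date_missing'] = date.isnull().sum(axis=1)   # missing count

# If 0.01 of the anonymized time is 6 minutes, then one day is 2.4 and one week
# is 16.8 time units, so rough week / day-of-week features can be derived.
feats['week'] = start // 16.8
feats['day_of_week'] = (start % 16.8) // 2.4
```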

(Figure: Screen Shot 2016-11-21 at 2.11.05.png)

The numerical information used is as follows.

The figure below is a directed graph drawn with R. By summarizing the relationship between the time stamps and the non-missing values, the flow from the production line to defective products becomes visible; whether this flow could be exploited for feature generation seems to have been a key point. (Figure: __results___1_6.png) A noticeable similarity was found between the transition probability matrices of S0-S11 and S12-S23. Because the numerical data at these similar stations are correlated, the features of each pair were combined (e.g. L0_S0_F0 + L0_S12_F330).
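
A minimal sketch of combining the features of such similar stations (only the pair quoted above is used here; the full pairing is an assumption):

```python
import pandas as pd

numeric = pd.read_csv('train_numeric.csv', index_col='Id')

# Pairs of columns from stations whose transition/correlation patterns look alike.
pairs = [('L0_S0_F0', 'L0_S12_F330')]

for a, b in pairs:
    # Treat a missing value as contributing nothing to the combined feature.
    numeric[a + '_plus_' + b] = numeric[a].fillna(0) + numeric[b].fillna(0)
```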

It would be nice if the raw data could be used as-is, but that would be difficult on an ordinary PC. However, this can largely be managed by adding steps such as selecting the features that are actually active in the XGBoost model and deleting duplicate data. Incidentally, arithmetic combinations (sums, differences, and so on) of correlated columns were not useful for improving accuracy. Z-scaling is usually applied to all samples at once, but by applying it week by week one might be able to express "the degree of abnormality within a week", which is an interesting idea.
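
The weekly z-scaling idea might look like the following sketch (whether it actually helps is the open question raised above; the week assignment reuses the 0.01 = 6 minutes assumption):

```python
import pandas as pd

numeric = pd.read_csv('train_numeric.csv', index_col='Id').drop('Response', axis=1)
date = pd.read_csv('train_date.csv', index_col='Id')

# Week index per part (16.8 time units per week if 0.01 = 6 minutes).
week = date.min(axis=1) // 16.8

def zscore(col):
    # z-score within one group: "how abnormal is this value relative to its week"
    return (col - col.mean()) / (col.std() + 1e-9)

z_weekly = numeric.groupby(week).transform(zscore)
```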

(Figure: Screen Shot 2016-11-21 at 2.11.44.png)

It seems the winner did not make much use of the categorical information either. However, for stations [24, 25, 26, 27, 28, 29, 32, 44, 47] only, a feature indicating whether the value was 'T1' or not was used. If BOSCH had disclosed more about what the categorical data meant, it might have been used a little more.
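
A hedged sketch of that T1 flag (the station list is taken from the text above; selecting columns by the '_S{n}_' naming pattern, and reading the whole file at once, are simplifying assumptions):

```python
import pandas as pd

cat = pd.read_csv('train_categorical.csv', index_col='Id', dtype=str)

stations = [24, 25, 26, 27, 28, 29, 32, 44, 47]
t1_flags = pd.DataFrame(index=cat.index)
for s in stations:
    # All categorical columns belonging to station s, e.g. 'L1_S24_F1559'.
    cols = [c for c in cat.columns if '_S%d_' % s in c]
    if cols:
        # Flag: does any value at this station equal 'T1'?
        t1_flags['S%d_has_T1' % s] = (cat[cols] == 'T1').any(axis=1).astype(int)
```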

Evaluation, balance adjustment, results

(Figure: Screen Shot 2016-11-21 at 2.12.11.png)

Cross-validation started with 5 folds and eventually settled on 4 folds. Layer 1 of the ensemble was trained with three different seeds. Duplicates and gaps in the data were used as-is without compression; instead, downsampling of 50% to 90% of the data was used to reduce computational cost. The history of accuracy improvements is shown below.
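
A minimal sketch of the downsampling plus 4-fold setup with three seeds (the downsampling fraction and the use of out-of-fold predictions for the next layer are assumptions about details the write-up leaves open):

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

# Raw numeric features only, for brevity; the engineered features above would be added.
numeric = pd.read_csv('train_numeric.csv', index_col='Id')
y = numeric.pop('Response').values
X = numeric.fillna(-1).values

# Downsample the negative class (here: drop half of the negatives, keep all positives).
rng = np.random.RandomState(0)
pos = np.where(y == 1)[0]
neg = np.where(y == 0)[0]
keep = np.sort(np.concatenate([pos, rng.choice(neg, size=len(neg) // 2, replace=False)]))
X_ds, y_ds = X[keep], y[keep]

# XGBoost settings: see the parameter table further below for a fuller example.
params = {'objective': 'binary:logistic', 'eval_metric': 'auc',
          'max_depth': 14, 'eta': 0.03, 'subsample': 0.9}

# 4-fold CV, repeated with three different seeds, producing out-of-fold
# predictions that layer 2 / layer 3 of the stack could be trained on.
oof = np.zeros((len(y_ds), 3))
for s, seed in enumerate([0, 1, 2]):
    folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=seed)
    for tr, va in folds.split(X_ds, y_ds):
        dtr = xgb.DMatrix(X_ds[tr], label=y_ds[tr])
        dva = xgb.DMatrix(X_ds[va], label=y_ds[va])
        bst = xgb.train(params, dtr, num_boost_round=500,
                        evals=[(dva, 'valid')], early_stopping_rounds=30,
                        verbose_eval=False)
        oof[va, s] = bst.predict(dva)
```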

(Figure: results.png)

Other comments

About the ensemble used: he apparently tried to incorporate ExtraTreesRegressor, but settled on an ensemble consisting only of multiple XGBoost models with different parameters. An example of the parameters is as follows.

| early_stopping | alpha | booster | colsample_bytree | min_child_weight | subsample | eta | objective | max_depth | lambda |
|---|---|---|---|---|---|---|---|---|---|
| auc | 0 | gbtree | 0.6 | 5 | 0.9 | 0.03 | binary:logistic | 14 | 4 |
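
Read as an XGBoost parameter dictionary, the table corresponds to roughly the following (the early_stopping entry maps to the eval metric plus the early_stopping_rounds argument of xgb.train, whose value is not given in the table):

```python
# Parameters from the table above, in xgboost's native naming.
params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'eval_metric': 'auc',        # metric monitored for early stopping
    'eta': 0.03,
    'max_depth': 14,
    'min_child_weight': 5,
    'subsample': 0.9,
    'colsample_bytree': 0.6,
    'alpha': 0,
    'lambda': 4,
}
```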

Among these, eta is the step-size shrinkage of boosting, and reducing it usually improves boosting performance. Strangely, however, making eta smaller did not contribute to any accuracy improvement this time.

(Figure: Screen Shot 2016-11-21 at 4.09.53.png)

Finally, Ash summarizes the ensemble in detail. Since this was already explained above, it is only touched on briefly here.
