[PYTHON] Kaggle Summary: Redhat (Part 2)

Introduction

This is an update in my series of summaries of past Kaggle competitions. Here I cover the Predicting Red Hat Business Value competition and the prominent discussions in its forum. For the data introduction and basic analysis, see Kaggle Summary: Red Hat (Part 1); this Part 2 is a practical code summary.

This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. It has been confirmed to work in a Jupyter notebook (adjust %matplotlib inline as appropriate for your environment). If you find any errors when running the sample scripts, a comment would be appreciated.

Overview

(Figure: competition front page)

As explained in Part 1, the original data already contains strong features. Even without elaborate methods you can reach a score of 95% or more, so the competition instead turned on how to squeeze out the remaining few percent, and a wide variety of solutions were tried. Code using general-purpose methods and deep learning for classification also appeared, so this article introduces the following four topics.

  1. 1st place solution
  2. 2nd place solution
  3. Prediction using XGBoost
  4. Deep learning with Keras

There are no sample scripts for 1 and 2; 3 and 4 come with benchmark sample code.

  1. 1st place solution

This is a summary of the winner's own description of the solution. Please refer to the linked posts for visualizations of the data structure and distribution, and to the related sites; they are easy to follow.

As you can see from reading the original post, it is a long story and no concrete code appears. The translation gets harder to follow toward the end, so if anything is unclear, please check the original English text or the referenced sites. For much of the competition the gap with the top three teams would not close, and it is clear that Radder, the eventual winner, went through some fairly stressful days.


The model itself is very simple. What matters most is building an appropriate cross-validation set, which is done as follows (a rough sketch in code follows the list).

  1. Remove group_1 = 17304 from the train and test data. This group accounts for 30% of the training data, and every row has outcome = 0.
  2. Use a separate estimator for this group, because it contains more than 3,000 records. (This is important.)
  3. Randomly create an unstratified 5-fold CV split over the people file.
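A minimal sketch of that CV construction, assuming the raw competition CSVs and the old-style scikit-learn API used throughout this article; the handling inside the fold loop is my assumption, not the winner's actual code.

import pandas as pd
from sklearn.cross_validation import KFold  # old-style API, matching this article's sklearn version

# Assumption: raw CSVs as provided by the competition (group values still look like 'group 17304').
act_train = pd.read_csv('act_train.csv')
people = pd.read_csv('people.csv')
train = pd.merge(act_train, people, how='left', on='people_id')

# Step 1: set aside group_1 == 'group 17304' (about 30% of the training data, all outcome = 0);
# step 2 handles this group with a separate estimator.
rest = train[train['group_1'] != 'group 17304']

# Step 3: unstratified 5-fold CV over people, so that no person appears in both folds.
people_ids = rest['people_id'].unique()
kf = KFold(len(people_ids), n_folds=5, shuffle=True, random_state=0)
for tr_idx, va_idx in kf:
    cv_train = rest[rest['people_id'].isin(people_ids[tr_idx])]
    cv_valid = rest[rest['people_id'].isin(people_ids[va_idx])]
    # ... fit and evaluate a first-level model on each fold ...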


My modeling concept is rather simple: reduce the original problem to a few smaller problems and combine them in a second-level model. I built models along the following lines.

a) Select activities from each group_1 timeline (here I used the first / last activity of the timeline).
b) Collect all activities from groups with similar outcomes.
c) Compress the features; tf-idf was especially useful. Per-variable statistics were computed for people with the same attributes, within each group and across all data (see the sketch after this list).
d) Add simple (and some less simple) hand-made features: the group_1 id value, the activities in the group, the people in the group, the maximum / minimum date, and so on. Interactions between features and probability information were not used this time.
e) Build a classifier. **Only XGBoost was used here.** On its own it reached 0.84 AUC (without any leak data).
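As one way to picture point c), here is a minimal sketch of compressing each group's activity history with tf-idf and SVD. The column names follow the scripts later in this article, and the choice of TruncatedSVD is my assumption, not the winner's actual code.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Assumption: 'train' is the activity table merged with people.csv, so it has
# 'group_1' and 'activity_category' columns. Treat each group's sequence of
# activity categories as one "document".
docs = (train.groupby('group_1')['activity_category']
             .apply(lambda s: ' '.join(s.astype(str).str.replace(' ', '_'))))

tfidf = TfidfVectorizer()
group_tfidf = tfidf.fit_transform(docs)            # one row per group

# Compress the sparse tf-idf matrix into a handful of dense group-level features.
svd = TruncatedSVD(n_components=10, random_state=0)
group_features = pd.DataFrame(svd.fit_transform(group_tfidf), index=docs.index)

# Merge the group-level features back onto the activity rows.
train_with_group_feats = train.merge(group_features, left_on='group_1',
                                     right_index=True, how='left')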


To make the above work properly, a less common CV approach is needed: the split is based on people_id and on the compressed group-level data, and the CV is built from several aggregated split schemes. The technique works fine, but it requires building as many as 15 XGBoost models across two levels. This way of thinking about CV is important, but I will omit the details here.

Here, four well-working first-level models are introduced (two of them do well on the public LB, the other two give the best CV scores in the second-level model). [Building the second-level model requires very careful scripting](http://qiita.com/TomHortons/items/2a05b72be180eb83a204). Also, since the leaked information is incorporated into the CV, the question is whether the predictions within an outcome group are really being learned by the model. The second-level model therefore solves two problems: the prediction probability related to the leak information and the prediction probability unrelated to it. The model itself is simple, but some clever features were introduced to capture how groups and their populations change over time.
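To make the "second-level model" idea concrete, here is a minimal sketch of stacking with out-of-fold predictions. It uses generic scikit-learn-style models and hypothetical arrays X, y, X_test; it is not the winner's actual pipeline.

import numpy as np
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Hypothetical inputs: dense feature matrix X, labels y, and test matrix X_test.
level1_models = [XGBClassifier(n_estimators=100, max_depth=6),
                 RandomForestClassifier(n_estimators=100)]

kf = KFold(len(y), n_folds=5, shuffle=True, random_state=0)
oof = np.zeros((len(y), len(level1_models)))            # out-of-fold predictions
test_preds = np.zeros((X_test.shape[0], len(level1_models)))

for m, model in enumerate(level1_models):
    for tr_idx, va_idx in kf:
        model.fit(X[tr_idx], y[tr_idx])
        oof[va_idx, m] = model.predict_proba(X[va_idx])[:, 1]
    model.fit(X, y)                                      # refit on all data for the test set
    test_preds[:, m] = model.predict_proba(X_test)[:, 1]

# Second-level model: combine the first-level probabilities.
stacker = LogisticRegression()
stacker.fit(oof, y)
final_pred = stacker.predict_proba(test_preds)[:, 1]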

Under the above conditions (with a decent prediction model) I was ranked 4th in the middle of the competition, and I watched the participants in 3rd place and above improve their scores day by day. So I thought about what they were doing by hand against the public LB to exploit the leak. The public / private split of the test data is random. Some people hand-crafted test submissions and used groups unaffected by the leak to probe the score, and from that determined the outcome probability of an entire group. To achieve this, they made a large number of submissions targeting the largest possible group_1 groups and found several groups where the machine learning model's predictions were poor. With that in mind, I also created several refined models. (In other words, they repeatedly tweaked the model and the CV scheme, creating new models one after another and submitting repeatedly so that the convenient groups ended up covered in the test data.)

For the final submission, we simply averaged the models that had performed best on the LB and in CV so far. Surprisingly, this gave the highest score yet.
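For reference, a simple average of submission files looks like this; the file names are hypothetical.

import pandas as pd

# Hypothetical file names: two submissions with 'activity_id' and 'outcome' columns.
sub_a = pd.read_csv('best_lb_model.csv')
sub_b = pd.read_csv('best_cv_model.csv')

# Align on activity_id before averaging the predicted probabilities.
merged = sub_a.merge(sub_b, on='activity_id', suffixes=('_a', '_b'))
merged['outcome'] = (merged['outcome_a'] + merged['outcome_b']) / 2.0
merged[['activity_id', 'outcome']].to_csv('blend_submission.csv', index=False)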

  2. 2nd place solution

Perhaps because Radder (the winner) won after so many twists and turns, the forum interviews with the second-place team and below, who proposed smarter solutions, attracted a lot of attention. Here I describe the 2nd place interview.

Step 1: We created probability interpolation models for the groups that appear in the training sample. The figure below plots the predicted probabilities for group 7 in chronological order.

(Figure: predicted probabilities for group 7, plotted in chronological order)

Within the training sample the probability is 1, and on the other days the probability decays. The next figure shows similar predictions, but plotted over the entire data range.

(Figure: the same predictions plotted over the entire data range)

Step 2: There are 34,224 distinct groups in the data. Since, statistically speaking, a group is really a single object, this number is effectively the actual sample size. The problem is the features that take different values within a group. We therefore computed histograms over every group and every feature, and the histogram bins become new features; you can think of this as a "fuzzy" version of one-hot encoding (a sketch follows the model list below). Three prediction models were used:

A) Logistic regression
B) KNN
C) XGBoost-based public scripts
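A minimal sketch of the step 2 idea: compute within-group histograms of a categorical feature and use the bins as group-level features. The column names follow the scripts later in this article, and the row normalization is my assumption.

import pandas as pd

# Assumption: 'train' is the activity table merged with people.csv, so it has
# 'group_1' and categorical columns such as 'activity_category'.
# For each group, count how often each category value occurs...
hist = pd.crosstab(train['group_1'], train['activity_category'])

# ...and normalize to relative frequencies so that groups of different sizes are comparable.
hist = hist.div(hist.sum(axis=1), axis=0)

# Each bin (column) is now a group-level feature: a "fuzzy" version of one-hot
# encoding, since a group gets fractional membership in every category.
group_hist_features = hist.add_prefix('hist_activity_category_')

# Merge back onto the rows so each activity carries its group's histogram.
train_feats = train.merge(group_hist_features, left_on='group_1',
                          right_index=True, how='left')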

Step 3: Based on the above, improve the models using feedback from the LB.

3. Prediction using XGBoost

This section introduces the most effective and general approach, using XGBoost. Please refer to the author's (Abriosi) code. It is very simple: encode the categorical data and predict with XGBoost.

Only the main points are explained here. After loading the data with pd.read_csv(), drop char_10 (presumably because it has many missing values).

act_train_data=act_train_data.drop('char_10',axis=1)
act_test_data=act_test_data.drop('char_10',axis=1)

Run the act_data_treatment function on act_train_data and the other inputs from which char_10 has been removed.

act_train_data  = act_data_treatment(act_train_data)
act_test_data   = act_data_treatment(act_test_data)
people_data = act_data_treatment(people_data)

The act_data_treatment function is defined as follows.

def act_data_treatment(dsname):
    dataset = dsname
    
    # Encode every categorical / boolean column except the ones listed below.
    for col in list(dataset.columns):
        if col not in ['people_id', 'activity_id', 'date', 'char_38', 'outcome']:
            if dataset[col].dtype == 'object':
                # Fill missing values with 'type 0', then keep only the numeric part of 'type N'.
                dataset[col].fillna('type 0', inplace=True)
                dataset[col] = dataset[col].apply(lambda x: x.split(' ')[1]).astype(np.int32)
            elif dataset[col].dtype == 'bool':
                dataset[col] = dataset[col].astype(np.int8)
    
    # Split 'date' (must be parsed as datetime, e.g. parse_dates=['date'] in read_csv)
    # into year / month / day plus a weekend flag, then drop the original column.
    dataset['year'] = dataset['date'].dt.year
    dataset['month'] = dataset['date'].dt.month
    dataset['day'] = dataset['date'].dt.day
    dataset['isweekend'] = (dataset['date'].dt.weekday >= 5).astype(int)
    dataset = dataset.drop('date', axis = 1)
    
    return dataset

In short, every categorical column except ['people_id', 'activity_id', 'date', 'char_38', 'outcome'] is encoded as an integer, and ['date'] is split into year, month, day, and a weekend flag. Next, run the reduce_dimen function on the concatenation of train and test.

whole=pd.concat([train,test],ignore_index=True)
categorical=['group_1','activity_category','char_1_x','char_2_x','char_3_x','char_4_x','char_5_x','char_6_x','char_7_x','char_8_x','char_9_x','char_2_y','char_3_y','char_4_y','char_5_y','char_6_y','char_7_y','char_8_y','char_9_y']
for category in categorical:
    whole=reduce_dimen(whole,category,9999999)
    
X=whole[:len(train)]
X_test=whole[len(train):]

del train
del whole

The reduce_dimen function is as follows.

def reduce_dimen(dataset, column, toreplace):
    # Values that appear only once in the column (duplicated(keep=False) == False)
    # are replaced with the placeholder 'toreplace', merging all rare categories into one.
    for index, i in dataset[column].duplicated(keep=False).iteritems():
        if i == False:
            dataset.set_value(index, column, toreplace)
    return dataset
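In other words, reduce_dimen lumps every category value that appears only once in the column into a single placeholder value (9999999 here), which keeps the later one-hot encoding from exploding. A small illustration with made-up data:

import pandas as pd

df = pd.DataFrame({'char_1_x': ['type 1', 'type 1', 'type 2', 'type 3']})
# 'type 2' and 'type 3' each occur only once, so both are replaced by the placeholder.
print(reduce_dimen(df, 'char_1_x', 9999999)['char_1_x'].tolist())
# ['type 1', 'type 1', 9999999, 9999999]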

Encode the categorical columns of X with OneHotEncoder. This produces the 0/1 sparse matrix X_sparse; the remaining numeric columns (not_categorical, defined in the original kernel) are stacked alongside it with hstack.

enc = OneHotEncoder(handle_unknown='ignore')
enc=enc.fit(pd.concat([X[categorical],X_test[categorical]]))
X_cat_sparse=enc.transform(X[categorical])
X_test_cat_sparse=enc.transform(X_test[categorical])
from scipy.sparse import hstack
X_sparse=hstack((X[not_categorical], X_cat_sparse))
X_test_sparse=hstack((X_test[not_categorical], X_test_cat_sparse))

Convert X_sparse to a DMatrix and set the parameters. Many sites explain each parameter in detail, so explanations are omitted here. The four main ones are {'max_depth': 10, 'eta': 0.02, 'silent': 1, 'objective': 'binary:logistic'}.

print("Training data: " + format(X_sparse.shape))
print("Test data: " + format(X_test_sparse.shape))
print("###########")
print("One Hot enconded Test Dataset Script")

dtrain = xgb.DMatrix(X_sparse,label=y)
dtest = xgb.DMatrix(X_test_sparse)

param = {'max_depth':10, 'eta':0.02, 'silent':1, 'objective':'binary:logistic' }
param['nthread'] = 4
param['eval_metric'] = 'auc'
param['subsample'] = 0.7
param['colsample_bytree']= 0.7
param['min_child_weight'] = 0
param['booster'] = "gblinear"

Learn with XGBoost, make predictions, and you're done.

watchlist  = [(dtrain,'train')]
num_round = 300
early_stopping_rounds=10
bst = xgb.train(param, dtrain, num_round, watchlist,early_stopping_rounds=early_stopping_rounds)

ypred = bst.predict(dtest)
output = pd.DataFrame({ 'activity_id' : test['activity_id'], 'outcome': ypred })
output.head()
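The original kernel stops at output.head(); to actually submit, one would write the DataFrame to CSV (the file name here is arbitrary).

# Write the submission file expected by Kaggle: activity_id plus the predicted outcome.
output.to_csv('redhat_xgb_submission.csv', index=False)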

This gives a score (AUC) close to 0.98.

4. Deep learning with Keras

Quite a few people tried a neural network approach. Deep learning is rarely the single best model outside of image data, but because its predictions are only weakly correlated with those of XGBoost and random forests, it is often used to reinforce XGBoost in ensemble learning.
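As a concrete picture of that "reinforcement", ensembling here usually just means blending the two models' predicted probabilities. Below is a minimal sketch; pred_xgb and pred_nn are hypothetical arrays of predicted probabilities for the same test rows, and the weight is an assumption.

from scipy.stats import rankdata

# Hypothetical arrays: pred_xgb from the XGBoost script in section 3,
# pred_nn from the Keras model below, both aligned to the same test rows.
w = 0.7  # in practice the weight is chosen on a validation set
pred_blend = w * pred_xgb + (1.0 - w) * pred_nn

# Rank averaging is a common alternative when the two probability scales differ;
# for AUC only the ordering of the predictions matters.
pred_rank_blend = (rankdata(pred_xgb) + rankdata(pred_nn)) / (2.0 * len(pred_xgb))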

On Kaggle in 2016, deep learning mostly meant Keras. Here, classification is done with Keras.

4.1. Up to the model definition

First, import the library.

import pandas as pd
import numpy as np
from scipy import sparse as ssp
import pylab as plt
from sklearn.preprocessing import LabelEncoder,LabelBinarizer,MinMaxScaler,OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.decomposition import TruncatedSVD,NMF,PCA,FactorAnalysis
from sklearn.feature_selection import SelectFromModel,SelectPercentile,f_classif
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import log_loss,roc_auc_score
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.cross_validation import StratifiedKFold,KFold
from keras.preprocessing import sequence
from keras.callbacks import ModelCheckpoint,Callback
from keras import backend as K
from keras.layers import Input, Embedding, LSTM, Dense,Flatten, Dropout, merge,Convolution1D,MaxPooling1D,Lambda,AveragePooling1D
from keras.layers.normalization import BatchNormalization
from keras.optimizers import SGD
from keras.layers.advanced_activations import PReLU,LeakyReLU,ELU,SReLU
from keras.models import Model

Please install Keras and TensorFlow (or Theano) in advance. There are also many Japanese-language guides.

Next is the data path and seed settings.

seed = 1
np.random.seed(seed)
dim = 32
hidden=64

path = "../input/"

Read the train, test, and people data and let data be their concatenation. data is used to build the input, intermediate, and output layers, and is deleted once modeling is complete.

train = pd.read_csv(path+'act_train.csv')
test = pd.read_csv(path+'act_test.csv')
people = pd.read_csv(path+'people.csv')
columns = people.columns
test['outcome'] = np.nan
data = pd.concat([train,test])
    
data = pd.merge(data,people,how='left',on='people_id').fillna('missing')
train = data[:train.shape[0]]
test = data[train.shape[0]:]

At this point data has shape (2695978, 55); it is a DataFrame that includes people_id and activity_id.

Next, store the column names in columns and encode the values with sklearn's LabelEncoder. Since data contains the string 'missing', it is encoded just like ordinary values. The original program is written quite carefully and is therefore long; only the essential part is shown below.

columns = train.columns.tolist()
columns.remove('activity_id')
columns.remove('outcome')
data = pd.concat([train,test])
for c in columns:
    data[c] = LabelEncoder().fit_transform(data[c].values)

train = data[:train.shape[0]]
test = data[train.shape[0]:]
data = pd.concat([train,test])
columns = train.columns.tolist()
columns.remove('activity_id')
columns.remove('outcome')

Once the data is ready, define the layers: prepare a Keras Input layer and an Embedding layer for each of the created columns. The details of each layer are described in the Keras case study "Input (Embedding + Flatten) + Layer + Dropout + Output", so they are omitted here. Below is a visualization of the completed model.

(Figure: visualization of the completed Keras model)
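Since the model definition itself is omitted in this article, here is a rough sketch of the "Input (Embedding + Flatten) + Dense + Dropout + Output" pattern in the Keras 1.x API imported above. The layer sizes and the optimizer are my assumptions, not necessarily the original author's choices.

# A minimal sketch of the embedding-based model, Keras 1.x style.
# Assumes `data` and `columns` from the preprocessing above; dim and hidden were set earlier.
flatten_layers = []
inputs = []
for c in columns:
    input_c = Input(shape=(1,), dtype='int32')
    num_c = data[c].max() + 1                      # vocabulary size of this label-encoded column
    embed_c = Embedding(num_c, dim, input_length=1)(input_c)
    flatten_c = Flatten()(embed_c)
    inputs.append(input_c)
    flatten_layers.append(flatten_c)

flatten = merge(flatten_layers, mode='concat')     # Keras 1.x functional merge
fc1 = Dense(hidden, activation='relu')(flatten)
fc1 = Dropout(0.5)(fc1)
outputs = Dense(1, activation='sigmoid')(fc1)

model = Model(input=inputs, output=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')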

4.2. From input data preparation to model.fit

Since the data used so far was for building the model, now prepare X and y as inputs. Split training and validation data with skf and create X_train and X_test.

X = train[columns].values
X_t = test[columns].values
y = train["outcome"].values
people_id = train["people_id"].values
activity_id = test['activity_id']
del train
del test

skf = StratifiedKFold(y, n_folds=4, shuffle=True, random_state=seed)
for ind_tr, ind_te in skf:
    X_train = X[ind_tr]
    X_test = X[ind_te]

    y_train = y[ind_tr]
    y_test = y[ind_te]
    break

X_train = [X_train[:,i] for i in range(X.shape[1])]
X_test = [X_test[:,i] for i in range(X.shape[1])]

del X

In the original code, the model is saved during training using ModelCheckpoint, together with an AUC callback; both are passed to model.fit below, but a detailed explanation is omitted here (a minimal sketch follows).
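For reference, a minimal sketch of what the model_checkpoint and auc_callback objects passed to model.fit might look like. The file name, the epoch/batch values, and the callback implementation are my assumptions, not the original author's code; ModelCheckpoint, Callback, and roc_auc_score were imported above.

# Hypothetical training settings used by model.fit below.
batch_size = 128
nb_epoch = 10

# Save the weights whenever the validation loss improves (file name is arbitrary).
model_checkpoint = ModelCheckpoint('keras_redhat_weights.h5',
                                   monitor='val_loss',
                                   save_best_only=True)

class AucCallback(Callback):
    # Print the validation AUC at the end of every epoch.
    def __init__(self, validation_data):
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs={}):
        y_pred = self.model.predict(self.X_val, verbose=0).ravel()
        print('epoch %d  val AUC: %.5f' % (epoch, roc_auc_score(self.y_val, y_pred)))

auc_callback = AucCallback(validation_data=(X_test, y_test))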

Finally, do model.fit and model.predict and you're done.

model.fit(
    X_train, 
    y_train,
    batch_size=batch_size, 
    nb_epoch=nb_epoch, 
    verbose=1, 
    shuffle=True,
    validation_data=[X_test,y_test],
    callbacks = [
        model_checkpoint,
        auc_callback,
        ],
    )
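Finally, a sketch of the prediction step that follows training; the batch size and submission file name are arbitrary.

# Split the test matrix into one array per column, the same format the model expects.
X_t_list = [X_t[:, i] for i in range(len(columns))]

ypred = model.predict(X_t_list, batch_size=1024, verbose=1)
submission = pd.DataFrame({'activity_id': activity_id, 'outcome': ypred.ravel()})
submission.to_csv('keras_redhat_submission.csv', index=False)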

The best score is around 0.98. Deep learning does not stand out here, simply because the original data is easy to predict.
