Pipelines are convenient because they let you write preprocessing and model code concisely, but this time I found **a way to bundle many pipelines together and run them all at once** so convenient that I am leaving it here as a memo.
Download the demo dataset from Kaggle's **HR Analytics**.
Create an **input folder, output folder, and model folder** in the current directory, and save the downloaded dataset **HR_comma_sep.csv** in the **input folder**.
HR_comma_sep.csv is a dataset for predicting whether or not an employee will leave the company (the `left` column) from 9 feature columns, and it has 14,999 rows in total.
As in a Kaggle competition, let's treat 10,000 rows as the training data and the remaining 4,999 rows as the test data, build a model on the training data, and predict the test results.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# ------------- Creating the dataset ------------------
# Read the dataset
df = pd.read_csv('./input/HR_comma_sep.csv')

# Shuffle the rows, reset the index, add an ID column
df = df.sample(frac=1, random_state=1)
df = df.reset_index(drop=True)
df = df.reset_index()
df = df.rename(columns={'index': 'ID'})

# Split into train and valid by row count
train = df[0:10000]
valid = df[10000:]

# One-hot encode the categorical variables
# (note: encoding the two splits separately only works because every
#  category appears in both splits; otherwise the columns can differ)
df_train = pd.get_dummies(train)
df_valid = pd.get_dummies(valid)

# Separate into correct labels and features
y = df_train['left']
X = df_train.drop(['ID', 'left'], axis=1)
y_valid = df_valid['left']
X_valid = df_valid.drop(['ID', 'left'], axis=1)

# Split into train and test for model fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print('X_train.shape = ', X_train.shape)
print('y_train.shape = ', y_train.shape)
print('X_test.shape = ', X_test.shape)
print('y_test.shape = ', y_test.shape)
print('X_valid.shape = ', X_valid.shape)
print('y_valid.shape = ', y_valid.shape)
print()
After shuffling the rows of the dataset, it is split into train and valid, the categorical variables are one-hot encoded, and the data is separated into correct labels (y, y_valid) and features (X, X_valid). Furthermore, X and y, which are used to build the training model, are divided by **train_test_split** into training (X_train, y_train) and evaluation (X_test, y_test) sets. This completes the preparation.
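One caveat worth noting (not in the original post): because get_dummies is applied to train and valid separately, a category that appears in only one split would produce mismatched columns. A minimal defensive alignment, as a sketch, could look like this:

# Align the validation columns to the training columns (hypothetical safeguard);
# any dummy column missing from X_valid is added and filled with 0
X_valid = X_valid.reindex(columns=X.columns, fill_value=0)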
This time, we **prepare 8 model pipelines, each with its own preprocessing, and bundle them into one dictionary**. By doing this, you can run the eight pipelines one after another.
# -------- Pipeline settings --------
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

pipelines = {
    'KNN':
        Pipeline([('scl', StandardScaler()),
                  ('est', KNeighborsClassifier())]),
    'Logistic':
        Pipeline([('scl', StandardScaler()),
                  ('est', LogisticRegression(solver='lbfgs', random_state=1))]),
    'SVM':
        Pipeline([('scl', StandardScaler()),
                  ('est', SVC(C=1.0, kernel='linear', class_weight='balanced',
                              random_state=1, probability=True))]),
    'K-SVM':
        Pipeline([('scl', StandardScaler()),
                  ('est', SVC(C=1.0, kernel='rbf', class_weight='balanced',
                              random_state=1, probability=True))]),
    'Tree':
        Pipeline([('scl', StandardScaler()),
                  ('est', DecisionTreeClassifier(random_state=1))]),
    'RandomF':
        Pipeline([('scl', StandardScaler()),
                  ('est', RandomForestClassifier(n_estimators=100, random_state=1))]),
    'GBoost':
        Pipeline([('scl', StandardScaler()),
                  ('est', GradientBoostingClassifier(random_state=1))]),
    'MLP':
        Pipeline([('scl', StandardScaler()),
                  ('est', MLPClassifier(hidden_layer_sizes=(3, 3),
                                        max_iter=1000,
                                        random_state=1))]),
}
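Each value in this dictionary is an ordinary scikit-learn Pipeline, so a single entry can also be used on its own. A minimal sketch (not in the original post; it assumes the dataset code above has been run):

# Fit just the 'KNN' pipeline and check its mean accuracy on the held-out test split
pipe = pipelines['KNN']
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))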
After that, with **for pipe_name, pipeline in pipelines.items():**, the key of each pipeline (for example 'KNN') goes into **pipe_name** and the pipeline instance itself goes into **pipeline**, one pair at a time. In other words, you can use them like this:

- **pipeline.fit(X_train, y_train)** trains the model
- **pipeline.predict(X_test)** predicts with the trained model
- **pickle.dump(pipeline, open(file_name, 'wb'))** saves the trained model

which is very convenient.
# ------- Pipeline processing -------
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
import pickle

scores = {}
for pipe_name, pipeline in pipelines.items():
    # Training
    pipeline.fit(X_train, y_train)
    # Metric calculation
    scores[(pipe_name, 'test_log')] = log_loss(y_test, pipeline.predict_proba(X_test))
    scores[(pipe_name, 'valid_log')] = log_loss(y_valid, pipeline.predict_proba(X_valid))
    scores[(pipe_name, 'test_acc')] = accuracy_score(y_test, pipeline.predict(X_test))
    scores[(pipe_name, 'valid_acc')] = accuracy_score(y_valid, pipeline.predict(X_valid))
    # Save the submission (output folder)
    ID = df_valid['ID']
    preds = pipeline.predict_proba(X_valid)  # predicted probabilities
    submission = pd.DataFrame({'ID': ID, 'left': preds[:, 1]})
    submission.to_csv('./output/' + pipe_name + '.csv', index=False)
    # Save the model (model folder)
    file_name = './model/' + pipe_name + '.pkl'
    pickle.dump(pipeline, open(file_name, 'wb'))

# Display the metrics
df = pd.Series(scores).unstack()
df = df.sort_values('test_acc', ascending=False)
print(df)
Here, **training, metric calculation (accuracy and logloss), submission saving (predicted probabilities), and model saving** are performed for each of the eight pipelines. A dictionary of **pipelines** is super convenient when you want to run the same processing over many models at once.
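As a quick check that the saved models are reusable, a pickle can be loaded back and used for prediction. A minimal sketch (assuming the loop above has already written ./model/KNN.pkl):

import pickle

# Load a saved pipeline back from the model folder and reuse it
with open('./model/KNN.pkl', 'rb') as f:
    loaded = pickle.load(f)
print(loaded.predict_proba(X_valid)[:5])  # first five predicted probabilities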
By the way, in an actual Kaggle competition y_valid would be kept secret (that is the whole point of Kaggle), so valid_acc and valid_log could not be calculated; but since we know it this time, I included them. ^^