In Basic machine learning procedure: ① Classification model, I organized the procedure for building a basic classification model. This time I want to focus on training the classifiers themselves, covering classifier selection and ensemble learning.
- Basic machine learning procedure: ① Classification model
- Basic machine learning procedure: ② Prepare data
- Basic machine learning procedure: ③ Compare and examine feature selection methods
- Google BigQuery
- Google Colaboratory
As in ① Classification model, the purchase data is stored with the following table structure.
| id | result | product1 | product2 | product3 | product4 | product5 |
|---|---|---|---|---|---|---|
| 001 | 1 | 2500 | 1200 | 1890 | 530 | null |
| 002 | 0 | 750 | 3300 | null | 1250 | 2000 |
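The article assumes that `train_features`, `train_target`, `test_features`, and `test_target` were already prepared as described in ② Prepare data. For readers joining from this article, here is a minimal sketch of pulling such a table from BigQuery in Colaboratory and splitting it; the project, dataset, and table names (`myproject.mydataset.purchase`), the `fillna(0)` handling, and the split ratio are assumptions for illustration, not the author's actual preparation code.

```python
# A minimal sketch, assuming a placeholder table `myproject.mydataset.purchase`.
from google.colab import auth
import pandas as pd
from sklearn.model_selection import train_test_split

auth.authenticate_user()  # authenticate the Colaboratory session for BigQuery

query = """
SELECT id, result, product1, product2, product3, product4, product5
FROM `myproject.mydataset.purchase`
"""
df = pd.read_gbq(query, project_id='myproject', dialect='standard')

# result is the 0/1 target; the product columns are the features (nulls filled with 0 here)
features = df.drop(columns=['id', 'result']).fillna(0)
target = df['result']
train_features, test_features, train_target, test_target = train_test_split(
    features, target, test_size=0.2, random_state=1
)
```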
At first I tried to compare the performance of different classifiers, but it is hard to declare any single one the absolute best. Each classifier has its own characteristics, and what matters is making the best use of them, so for now I train the following four classifiers.
- RandomForestClassifier: Random forest (light, fast, and reasonably accurate)
- LogisticRegression: Logistic regression (included because this is a 0/1 binary classification)
- KNeighborsClassifier: k-nearest neighbors (a simple, easy-to-understand model)
- LGBMClassifier: LightGBM (a recent favorite; improves accuracy)
Prepare the classifiers listed above. The lists are defined in the order of name, classifier, and parameter grid. `weights`, which we will use later, holds the weights for ensemble learning.
```python
# Classifiers
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier

model_names = ["RandomForestClassifier", "LogisticRegression", "KNeighborsClassifier", "LGBMClassifier"]
estimators = [RandomForestClassifier(), LogisticRegression(), KNeighborsClassifier(), LGBMClassifier()]
parameters = [
    {
        'n_estimators': [5, 10, 50, 100, 300],
        'max_depth': [5, 10, 20],
    },
    {
        'C': list(np.logspace(0, 4, 10))
    },
    {
        'weights': ['uniform', 'distance'],
        'n_neighbors': [3, 5, 10, 20, 30, 50],
    },
    {
        'objective': ['binary'],
        'learning_rate': [0.01, 0.03, 0.05],
        'n_estimators': [100, 150, 200],
        'max_depth': [4, 6, 8]
    }
]
weights = [1, 1, 1, 1]
```
Now we train the individual models defined above. Running them serially one by one takes a long time, so I run them in parallel, referring to "Parallel processing with Parallel of scikit-learn".
```python
from sklearn.model_selection import GridSearchCV
from joblib import Parallel, delayed  # sklearn.externals.joblib is deprecated in recent scikit-learn

models = []

def tuneParams(n):
    # Grid-search the n-th estimator over its parameter grid with 5-fold CV
    estimator = estimators[n]
    param = parameters[n]
    clf = GridSearchCV(
        estimator,
        param_grid=param,
        cv=5
    )
    clf = clf.fit(train_features, train_target)
    model = clf.best_estimator_
    return model

# Tune the four classifiers in parallel
model = Parallel(n_jobs=-1)(delayed(tuneParams)(n) for n in range(len(estimators)))
models.append(model)
```
The `models` list now holds the best estimator, with its best parameters, for each classifier.
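If you want to confirm what the grid search selected, the tuned estimators can be inspected directly. This is just an optional check, not part of the original procedure.

```python
# Optional check: print the parameters chosen for each tuned estimator
# (models[0] is the list returned by Parallel above)
for name, m in zip(model_names, models[0]):
    print(name)
    print(m.get_params())
```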
Ensemble learning, which combines multiple classifiers trained with these parameters, is carried out with reference to "Touch scikit-learn's ensemble learning "VotingClassifier"".
The idea is to pass the models created above into the VotingClassifier. We then evaluate both the ensemble and the individual classifiers.
```python
from collections import defaultdict
from sklearn.ensemble import VotingClassifier
import sklearn.metrics as metrics

def modelingEnsembleLearning(train_features, test_features, train_target, test_target, models):
    mss = defaultdict(list)

    # Soft-voting ensemble built from the tuned classifiers
    voting = VotingClassifier(list(zip(model_names, models[0])), voting='soft', weights=weights)
    voting.fit(train_features, train_target)

    # Estimation by the ensemble
    pred_target = voting.predict(test_features)
    ms = metrics.confusion_matrix(test_target.astype(int), pred_target.astype(int))
    mss['voting'].append(ms)

    # Estimation by the individual classifiers
    for name, estimator in voting.named_estimators_.items():
        pred_target = estimator.predict(test_features)
        ms = metrics.confusion_matrix(test_target.astype(int), pred_target.astype(int))
        mss[name].append(ms)

    return voting, mss

voting, mss = modelingEnsembleLearning(train_features, test_features, train_target, test_target, models)
```
And finally, the evaluation of the model. This part of the program is unchanged from the first article.
```python
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

pred_target = voting.predict(test_features)
accuracy = accuracy_score(test_target, pred_target)
precision = precision_score(test_target.astype(int), pred_target.astype(int))
recall = recall_score(test_target.astype(int), pred_target.astype(int))

print("Voting")
print("Accuracy : ", accuracy*100, "%")
print("Precision : ", precision*100, "%")
print("Recall : ", recall*100, "%")
```
Compared with the ensemble, LGBMClassifier alone gives higher Accuracy and RandomForestClassifier alone gives higher Recall. However, ensemble learning may give the best overall balance, and in that case we adopt the ensemble as our model.
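For reference, the per-classifier figures behind this comparison can be derived from the confusion matrices collected in `mss`. The sketch below assumes the positive class is 1 and that each entry holds a single 2x2 matrix, as produced by the code above.

```python
# Derive Accuracy / Precision / Recall from each stored 2x2 confusion matrix
for name, matrices in mss.items():
    tn, fp, fn, tp = matrices[0].ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # assumes at least one positive prediction
    recall = tp / (tp + fn)      # assumes at least one positive example
    print(name)
    print("Accuracy : ", accuracy*100, "%")
    print("Precision : ", precision*100, "%")
    print("Recall : ", recall*100, "%")
```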
Starting with Basic machine learning procedure: ① Classification model, I have worked through the whole procedure plus deep dives into individual steps, as far as my understanding of classification models goes. For now, I will close this chapter on classification models and move on to the next step.