In "Basic machine learning procedure: ① Classification model", I organized the procedure for creating a basic classification model. This time, I would like to focus on training the classifiers, covering classifier selection and ensemble learning.
- Basic machine learning procedure: ① Classification model
- Basic machine learning procedure: ② Prepare data
- Basic machine learning procedure: ③ Compare and examine feature selection methods
Analysis environment: Google BigQuery and Google Colaboratory.
As in ① Classification model, the purchase data is stored in the following table structure.
| id | result | product1 | product2 | product3 | product4 | product5 | 
|---|---|---|---|---|---|---|
| 001 | 1 | 2500 | 1200 | 1890 | 530 | null | 
| 002 | 0 | 750 | 3300 | null | 1250 | 2000 | 
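As a rough illustration of how a table like this could be turned into the feature and target inputs used later, here is a minimal sketch with pandas. The two rows are copied from the table above. Filling null with 0 (meaning "not purchased") and the names `features` and `target` are my assumptions, not something the earlier posts are confirmed to do.

```python
import numpy as np
import pandas as pd

# The two sample rows from the table above; null is represented as NaN.
df = pd.DataFrame(
    {
        "id": ["001", "002"],
        "result": [1, 0],
        "product1": [2500, 750],
        "product2": [1200, 3300],
        "product3": [1890, np.nan],
        "product4": [530, 1250],
        "product5": [np.nan, 2000],
    }
)

# Assumption: a missing purchase amount is treated as 0.
product_columns = [c for c in df.columns if c.startswith("product")]
features = df[product_columns].fillna(0)

# The label to predict is the "result" column.
target = df["result"]

print(features)
print(target.tolist())
```

The classifiers used below (other than LightGBM) do not accept missing values by default, which is why some form of null handling is needed before training.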
At first I tried to compare the performance of the classifiers, but it is difficult to declare any single one the absolute best. Each has its own characteristics, and I think it is important to make the best use of them, so for the time being I train the following four classifiers.
- RandomForestClassifier: random forest (lightweight, fast, and reasonably accurate)
- LogisticRegression: logistic regression (included because this is a binary 0/1 classification)
- KNeighborsClassifier: k-nearest neighbors (a simple model that is easy to understand)
- LGBMClassifier: LightGBM (a recent trend, with higher accuracy)
Prepare the classifiers listed above. Three parallel lists hold the names, the classifier instances, and the parameter grids, in the same order. The weights list, which we will use later, holds each classifier's weight in ensemble learning.
# Classifiers
import numpy as np  # needed for np.logspace in the parameter grid below
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
model_names=["RandomForestClassifier", "LogisticRegression", "KNeighborsClassifier", "LGBMClassifier"]
estimators=[RandomForestClassifier(), LogisticRegression(), KNeighborsClassifier(), LGBMClassifier()]
parameters=[
  {
    'n_estimators': [5, 10, 50, 100, 300],
    'max_depth': [5, 10, 20],
  },
  {
    'C': list(np.logspace(0, 4, 10))
  },
  {
    'weights': ['uniform','distance'],
    'n_neighbors': [3,5,10,20,30,50],
  },
  {
    'objective': ['binary'],
    'learning_rate': [0.01, 0.03, 0.05], 
    'n_estimators': [100, 150, 200], 
    'max_depth':[4, 6, 8]      
  }
]
weights=[1,1,1,1]
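To get a feel for how much work a grid search over these grids involves, the sketch below counts the parameter combinations with scikit-learn's `ParameterGrid`. The grids are copied from the listing above; the dictionary `grids` and the variable `cv_folds` are names I introduce only for this illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grids as above, keyed by classifier name for readability.
grids = {
    "RandomForestClassifier": {
        "n_estimators": [5, 10, 50, 100, 300],
        "max_depth": [5, 10, 20],
    },
    "LogisticRegression": {"C": list(np.logspace(0, 4, 10))},
    "KNeighborsClassifier": {
        "weights": ["uniform", "distance"],
        "n_neighbors": [3, 5, 10, 20, 30, 50],
    },
    "LGBMClassifier": {
        "objective": ["binary"],
        "learning_rate": [0.01, 0.03, 0.05],
        "n_estimators": [100, 150, 200],
        "max_depth": [4, 6, 8],
    },
}

cv_folds = 5  # number of cross-validation folds assumed for the search

# Number of parameter combinations per classifier.
n_candidates = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}

for name, n in n_candidates.items():
    # Each candidate is fitted once per fold during the search.
    print(name, n, "candidates,", n * cv_folds, "fits")
```

The counts come to 15, 10, 12, and 27 combinations respectively, so with 5-fold cross-validation LightGBM alone needs 135 fits. This is the main reason the searches are worth running in parallel.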
Next, we run and train the individual models defined above. Running them serially, one by one, takes a long time, so I run them in parallel, referring to "Parallel processing with Parallel of scikit-learn".
from sklearn.model_selection import GridSearchCV
# sklearn.externals.joblib was removed from scikit-learn; import joblib directly
from joblib import Parallel, delayed
models = []
def tuneParams(n):
  estimator = estimators[n]
  param = parameters[n]
  
  clf = GridSearchCV(
      estimator,
      param_grid=param,
      cv=5
      )
  
  clf = clf.fit(train_features, train_target)
  model = clf.best_estimator_
  return model
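# Tune every classifier in parallel; the result is a list of best estimators
model = Parallel(n_jobs=-1)( delayed(tuneParams)(n) for n in range(len(estimators)) )
# models[0] is that list (it is read as models[0] in the ensemble step)
models.append(model)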
The models list now holds, in models[0], the best estimator found for each classifier, refit with its best parameters.
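The tuning step can be tried end to end on synthetic data. The following is a minimal sketch, not the code used in this post: the data from `make_classification`, the two small grids, and the names `demo_estimators`, `demo_parameters`, and `tune` are all assumptions for illustration. It returns the whole search object so that the winning parameters can be inspected, and it uses `prefer="threads"` only to keep the demo lightweight.

```python
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for train_features / train_target (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

demo_estimators = [LogisticRegression(max_iter=1000), KNeighborsClassifier()]
demo_parameters = [{"C": [0.1, 1.0, 10.0]}, {"n_neighbors": [3, 5, 10]}]


def tune(n):
    """Grid-search the n-th estimator and return the fitted search object."""
    search = GridSearchCV(demo_estimators[n], param_grid=demo_parameters[n], cv=5)
    return search.fit(X, y)


# One search per estimator, run concurrently.
searches = Parallel(n_jobs=2, prefer="threads")(
    delayed(tune)(n) for n in range(len(demo_estimators))
)

for search in searches:
    # best_params_ is the winning combination; best_estimator_ is refit on all of X, y.
    print(
        type(search.best_estimator_).__name__,
        search.best_params_,
        round(search.best_score_, 3),
    )
```

Keeping the search object rather than only `best_estimator_` is a design choice for this sketch: it makes it easy to check which parameters were chosen and what the cross-validated score was.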
Ensemble learning, which combines multiple classifiers using those parameters, is carried out with reference to "Touch the scikit-learn ensemble learning 'Voting Classifier'".
The idea is to put the models created in step 1 into the VotingClassifier and then loop over them. We make predictions with both the ensemble and the individual classifiers.
from collections import defaultdict
from sklearn.ensemble import VotingClassifier
import sklearn.metrics as metrics
def modelingEnsembleLearning(train_features, test_features, train_target, test_target, models):
  mss = defaultdict(list)
  voting = VotingClassifier(list(zip(model_names, models[0])), voting='soft', weights=weights)
  voting.fit(train_features,train_target)
  #Estimated by ensemble
  pred_target = voting.predict(test_features)
  ms = metrics.confusion_matrix(test_target.astype(int), pred_target.astype(int))
  mss['voting'].append(ms)
  #Estimated by individual classifiers
  for name, estimator in voting.named_estimators_.items():
      pred_target = estimator.predict(test_features)
      ms = metrics.confusion_matrix(test_target.astype(int), pred_target.astype(int))
      mss[name].append(ms)
      
  return voting, mss
voting, mss = modelingEnsembleLearning(train_features, test_features, train_target, test_target, models)
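To make clear what `voting='soft'` with `weights` does, the sketch below reproduces the ensemble's output by hand: a weighted average of each member's class probabilities, followed by picking the most probable class. The synthetic data, the three member classifiers, and the names `demo_voting` and `demo_weights` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

demo_weights = [2, 1, 1]
demo_voting = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",
    weights=demo_weights,
)
demo_voting.fit(X, y)

# Class probabilities from each fitted member, in the order given above.
member_probas = [est.predict_proba(X) for est in demo_voting.estimators_]

# Soft voting: weighted average of the members' probabilities ...
manual_proba = np.average(member_probas, axis=0, weights=demo_weights)
# ... followed by picking the class with the highest averaged probability.
manual_pred = demo_voting.classes_[np.argmax(manual_proba, axis=1)]

# Compare the manual calculation with the ensemble's own output.
print(np.allclose(manual_proba, demo_voting.predict_proba(X)))
print(np.array_equal(manual_pred, demo_voting.predict(X)))
```

With equal weights such as [1,1,1,1], this reduces to a plain average of the probabilities. Raising one weight lets a classifier you trust more pull the averaged probability toward its own estimate.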
Finally, we evaluate the model. The program here has not changed from the first post.
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
# Predict once and reuse the result for every metric
pred_target = voting.predict(test_features)
accuracy=accuracy_score(test_target, pred_target)
precision=precision_score(test_target.astype(int), pred_target.astype(int))
recall=recall_score(test_target.astype(int), pred_target.astype(int))
print("Voting")
print("Accuracy : ", accuracy*100, "%")
print("Precision : ", precision*100, "%")
print("Recall : ", recall*100, "%")
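The confusion matrices collected in `mss` carry the same information as these three metrics. As a small worked example, the sketch below derives accuracy, precision, and recall directly from a confusion matrix. The labels `y_true` and `y_pred` are made-up values for illustration, not results from the purchase data.

```python
import numpy as np
import sklearn.metrics as metrics

# Hypothetical labels and predictions, only for illustration.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels 0/1, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
cm = metrics.confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of actual positives that are found

print("Accuracy : ", accuracy * 100, "%")
print("Precision : ", precision * 100, "%")
print("Recall : ", recall * 100, "%")
```

The same arithmetic can be applied to each matrix stored under `mss[name]` to compare the individual classifiers with the ensemble on equal terms.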
Comparing the ensemble with the individual classifiers, LGBMClassifier on its own may score higher in Accuracy, and RandomForestClassifier may score higher in Recall. However, the ensemble may turn out to have the best overall balance. When that happens, we adopt the ensemble as the model.
Starting with "Basic machine learning procedure: ① Classification model", I have worked through the whole procedure and then dug deeper into the individual steps, within the limits of my understanding of classification models. For the time being, I would like to treat this as a stopping point for classification models and move on to the next step.