[PYTHON] Basic machine learning procedure: ③ Comparing and examining feature selection methods

Introduction

Basic machine learning procedure: ① Classification model organized the procedure for creating a basic classification model. This time, I would like to focus on feature selection and compare several different selection methods.

Procedure so far

- Basic machine learning procedure: ① Classification model
- Basic machine learning procedure: ② Prepare data

Analytical environment

- Google BigQuery
- Google Colaboratory

Target data

As with ① Classification model, the purchase data is stored in the following table structure.

id  result product1 product2 product3 product4 product5
001 1      2500     1200     1890     530      null
002 0      750      3300     null     1250     2000

Since the purpose here is feature selection, the table is wide, with roughly 300 feature columns.
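
Throughout the article, the code assumes the data is already in a DataFrame df, with label_col and feature_cols defined. As a minimal sketch of how it could be pulled from BigQuery in Colaboratory (the project, dataset, and table names below are hypothetical, and filling nulls with 0 is just one simple choice; the actual preparation is covered in ② Prepare data):

import pandas as pd
from google.colab import auth

# Authenticate this Colab session for BigQuery access
auth.authenticate_user()

# Hypothetical project/dataset/table names, replace with your own
query = 'SELECT * FROM `myproject.mydataset.purchase_data`'
df = pd.read_gbq(query, project_id='myproject', dialect='standard')

# Fill nulls with 0 and split columns into the label and the ~300 features
df = df.fillna(0)
label_col = 'result'
feature_cols = [c for c in df.columns if c.startswith('product')]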

0. Choosing the feature selection methods

Referring to Feature selection, I chose the methods compared below.

In addition, although it is not part of scikit-learn, Boruta was introduced in Feature selection method Boruta using random forest and test, so I would also like to try Boruta, which is one of the Wrapper Methods.

To compare under the same conditions, I will use RandomForestClassifier as the classifier for feature selection in every method.

1. Embedded Method (SelectFromModel)

First, I use the Embedded Method that also appeared in Basic machine learning procedure: ① Classification model. The Embedded Method selects features as a by-product of training a specific model, based on the importance that model assigns to each feature.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Convert to NumPy arrays
label = np.array(df.loc[0:, label_col])
features = np.array(df.loc[0:, feature_cols])

# Classifier used for feature selection
clf = RandomForestClassifier(max_depth=7)

# Select variables using the Embedded Method
feat_selector = SelectFromModel(clf)
feat_selector.fit(features, label)

# Keep only the selected columns
df_feat_selected = df.loc[0:, feature_cols].loc[0:, feat_selector.get_support()]
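
By default, SelectFromModel keeps the features whose importance exceeds the mean importance reported by the fitted model. The fitted selector exposes the estimator and the threshold it actually used, and the threshold can be tightened with a scaling factor (the 1.5*mean below is just an illustration):

# The fitted estimator and the importance threshold actually applied
print(feat_selector.threshold_)                       # defaults to the mean importance
print(feat_selector.estimator_.feature_importances_)  # importance per feature

# Stricter variant: keep only features above 1.5x the mean importance
feat_selector_strict = SelectFromModel(clf, threshold='1.5*mean')
feat_selector_strict.fit(features, label)
print(feat_selector_strict.get_support().sum(), 'features selected')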

36 variables were selected. The accuracy obtained with these variables is quite high, but I would like to improve Recall a little.
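
As one sketch of how such Accuracy and Recall figures can be obtained (the 70/30 split and random_state are arbitrary choices, not from the original evaluation):

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Evaluate the selected features on a held-out 30% split
X_train, X_test, y_train, y_test = train_test_split(
    df_feat_selected.values, label, test_size=0.3, random_state=42)

clf_eval = RandomForestClassifier(max_depth=7)
clf_eval.fit(X_train, y_train)
print(classification_report(y_test, clf_eval.predict(X_test)))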

2. Wrapper Method (RFE)

Next, I use the Wrapper Method. This approach searches for the optimal subset by repeatedly running the prediction model on subsets of the features; RFE (Recursive Feature Elimination) does this by fitting the model and pruning the weakest features at each step.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Convert to NumPy arrays
label = np.array(df.loc[0:, label_col])
features = np.array(df.loc[0:, feature_cols])

# Classifier used for feature selection
clf = RandomForestClassifier(max_depth=7)

# Select variables using the Wrapper Method (RFE)
feat_selector = RFE(clf)
feat_selector.fit(features, label)

# Keep only the selected columns
df_feat_selected = df.loc[0:, feature_cols].loc[0:, feat_selector.get_support()]

146 variables were selected, quite a lot compared to the Embedded Method. The accuracy obtained with these variables differs only after the decimal point and is almost the same as with the Embedded Method.
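
One reason so many remained is that RFE keeps half of the features by default. If a smaller subset is wanted, the count can be set explicitly; the sketch below uses 36 to match the Embedded Method (an arbitrary choice) and a larger elimination step to shorten the run time:

# Keep the top 36 features, removing 10 features per iteration
feat_selector = RFE(clf, n_features_to_select=36, step=10)
feat_selector.fit(features, label)
df_feat_selected = df.loc[0:, feature_cols].loc[0:, feat_selector.get_support()]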

3. Wrapper Method (Boruta)

The last method is Boruta. Boruta is not installed by default in Colaboratory, so pip install it first.

pip install boruta

This is also a Wrapper Method, so it searches for the optimal subset. However, it takes much longer than the earlier RFE. Progress is displayed while it runs, so wait patiently.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# Convert to NumPy arrays
label = np.array(df.loc[0:, label_col])
features = np.array(df.loc[0:, feature_cols])

# Since this is a classification task, use RandomForestClassifier
clf = RandomForestClassifier(max_depth=7)

# Select variables using Boruta
feat_selector = BorutaPy(clf, n_estimators='auto', two_step=False, verbose=2, random_state=42)
feat_selector.fit(features, label)

# Keep only the selected columns
df_feat_selected = df.loc[0:, feature_cols].loc[0:, feat_selector.support_]

97 variables were selected. The accuracy obtained with these variables, again, hardly changes at all.
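
Beyond support_, the fitted BorutaPy selector also records tentatively confirmed features and a per-feature ranking, which can be inspected like this:

import numpy as np

# Confirmed and tentatively confirmed features, plus a ranking
# (1 = confirmed, 2 = tentative, 3+ = rejected earlier)
cols = np.array(feature_cols)
print('confirmed:', cols[feat_selector.support_])
print('tentative:', cols[feat_selector.support_weak_])
print('ranking  :', feat_selector.ranking_)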

In conclusion

I had hoped to conclude that accuracy changes considerably depending on how you select the variables, but unfortunately the results were all about the same. (Perhaps the sample data was a poor fit for this comparison.)

~~This time I compared only three methods, but the Summary of feature selection I referred to earlier includes methods I have not tried, such as the Step Forward and Step Backward variants of the Wrapper Method, so I would like to try them in the future.~~

Postscript (2/26)

Referring to the Summary of feature selection, I tried the Step Forward and Step Backward variants of the Wrapper Method, but they are slow. Or rather, they never finish.

It may be because there are as many as 300 features, or it may be the computing power of Colab, but methods that add or remove features one at a time seem hard to use in practice.
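
For reference, a sketch of Step Forward selection using mlxtend's SequentialFeatureSelector, which I assume is close to what the Summary of feature selection uses (k_features=30 is an arbitrary target). Each iteration refits the model once per remaining candidate feature, which with roughly 300 features is exactly why it takes so long:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Step Forward: starting from an empty set, add the single best feature
# per iteration until k_features are selected (forward=False would be
# Step Backward). Every iteration fits clf once per candidate feature.
feat_selector = SFS(clf, k_features=30, forward=True, floating=False,
                    scoring='accuracy', cv=3, n_jobs=-1)
feat_selector = feat_selector.fit(features, label)
print(feat_selector.k_feature_idx_)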

Apart from these, there also seem to be frameworks that automate feature selection, such as Optuna, so although we casually say "feature selection", there is a wide variety of methods and plenty left to study.
