[PYTHON] Organizing feature selection with sklearn

Introduction

Feature engineering is an important factor in building regression and classification models. Features are often selected using domain knowledge, but scikit-learn can also lend a hand here, so I tried out its feature selection utilities and organized them in this post.

Feature selection by RFE

RFE (Recursive Feature Elimination) is a method that eliminates features recursively. It builds a model starting from all features and removes the feature that is least important in that model, then builds the model again and removes the least important feature. This procedure is repeated until the specified number of features remains.

The python code is below.

#Import required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingRegressor

#Data set reading
boston = load_boston()
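#Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this article assumes an older version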

#Creating a data frame
#Storage of explanatory variables
df = pd.DataFrame(boston.data, columns = boston.feature_names)

#Add objective variable
df['MEDV'] = boston.target

#Use GBDT as an estimator. Select 5 features
selector = RFE(GradientBoostingRegressor(n_estimators=100, random_state=10), n_features_to_select=5)
selector.fit(df.iloc[:, 0:13], df.iloc[:, 13])
mask = selector.get_support()
print(boston.feature_names)
print(mask)

#Get only the selected feature column
X_selected = selector.transform(df.iloc[:, 0:13])
print("X.shape={}, X_selected.shape={}".format(df.iloc[:, 0:13].shape, X_selected.shape))

The execution result is as follows.

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
[False False False False  True  True False  True False False  True False
  True]
X.shape=(506, 13), X_selected.shape=(506, 5)
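
RFE also exposes a ranking_ attribute: selected features are assigned rank 1, and larger numbers mean the feature was eliminated earlier. A minimal sketch, reusing the selector fitted above:

#Elimination ranking for each feature (1 = selected)
for name, rank in zip(boston.feature_names, selector.ranking_):
    print(name, rank)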

Feature selection by SelectFromModel

This method selects features using feature_importances_, the importance that the fitted model assigns to each feature; features whose importance falls below a given threshold (here, the median of all importances) are discarded.

The python code is below.

from sklearn.feature_selection import SelectFromModel

#Use GBDT as an estimator.
selector = SelectFromModel(GradientBoostingRegressor(n_estimators=100, random_state=10), threshold="median")    
selector.fit(df.iloc[:, 0:13], df.iloc[:, 13])
mask = selector.get_support()
print(boston.feature_names)
print(mask)

#Get only the selected feature column
X_selected = selector.transform(df.iloc[:, 0:13])
print("X.shape={}, X_selected.shape={}".format(df.iloc[:, 0:13].shape, X_selected.shape))

The execution result is as follows.

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
[ True False False False  True  True False  True False False  True  True
  True]
X.shape=(506, 13), X_selected.shape=(506, 7)
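
After fitting, SelectFromModel keeps the trained estimator and the threshold it computed, so you can check which importance cutoff was actually applied. A minimal sketch, reusing the selector fitted above:

#Feature importances from the fitted GBDT and the median threshold that was applied
print(selector.estimator_.feature_importances_)
print(selector.threshold_)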

Feature selection by SelectKBest

This is a method to select the top k explanatory variables, ranked by a univariate statistical test (here, f_regression).

The python code is below.

from sklearn.feature_selection import SelectKBest, f_regression

#Select 5 features
selector = SelectKBest(score_func=f_regression, k=5) 
selector.fit(df.iloc[:, 0:13], df.iloc[:, 13])
mask = selector.get_support()    #Get the mask of whether or not each feature is selected
print(boston.feature_names)
print(mask)

#Get only the selected feature column
X_selected = selector.transform(df.iloc[:, 0:13])
print("X.shape={}, X_selected.shape={}".format(df.iloc[:, 0:13].shape, X_selected.shape))

The execution result is as follows.

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
[False False  True False False  True False False False  True  True False
  True]
X.shape=(506, 13), X_selected.shape=(506, 5)
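
SelectKBest also stores the scores behind the ranking, which helps when deciding what k should be. A minimal sketch printing the F-statistic and p-value that f_regression computed for each feature:

#F-statistic and p-value for each feature
for name, score, p in zip(boston.feature_names, selector.scores_, selector.pvalues_):
    print("{}: F={:.1f}, p={:.3g}".format(name, score, p))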

Feature selection by SelectPercentile

This is a method to select the top k percent of the explanatory variables, again ranked by a univariate statistical test.

The python code is below.

from sklearn.feature_selection import SelectPercentile, f_regression

#Select the top 50% of the features
selector = SelectPercentile(score_func=f_regression, percentile=50) 
selector.fit(df.iloc[:, 0:13], df.iloc[:, 13])
mask = selector.get_support()
print(boston.feature_names)
print(mask)

#Get only the selected feature column
X_selected = selector.transform(df.iloc[:, 0:13])
print("X.shape={}, X_selected.shape={}".format(df.iloc[:, 0:13].shape, X_selected.shape))

The execution result is as follows.

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
[False False  True False  True  True False False False  True  True False
  True]
X.shape=(506, 13), X_selected.shape=(506, 6)
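
The boolean mask can also be mapped back to the column names to list the surviving features directly. A minimal sketch:

#Names of the selected features
print(np.array(boston.feature_names)[mask])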

In closing

Thank you for reading to the end. In this post, I have organized feature selection methods using sklearn. In actual work, I think it is important to combine these library tools with domain knowledge to carry out appropriate feature engineering.

If you notice anything that should be corrected, I would appreciate it if you could let me know.
