[PYTHON] Feature selection by sklearn.feature_selection

Univariate statistics

For each explanatory variable, compute a statistical measure of its relationship with the target (objective) variable, and keep the features whose association is strongest.
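As a minimal sketch of this idea, f_regression can be called directly to see the per-feature F-statistics and p-values that the selectors below rank features by. (Note: the examples in this article use load_boston, which was removed in scikit-learn 1.2, so they require an older version or a substitute dataset.)

from sklearn.datasets import load_boston
from sklearn.feature_selection import f_regression

boston = load_boston()    # removed in scikit-learn 1.2; needs an older version
F, p_values = f_regression(boston.data, boston.target)

# One score per explanatory variable; a higher F (lower p) means a stronger association
for name, f, p in zip(boston.feature_names, F, p_values):
    print("{:8s} F={:8.2f} p={:.3g}".format(name, f, p))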

SelectKBest

Selects the top k explanatory variables. The score_func argument is normally f_classif (the default) for classification and f_regression for regression. The number of features to keep is given by the argument k.

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectKBest, f_regression

boston = load_boston()
X = boston.data
y = boston.target

# Select 5 features
selector = SelectKBest(score_func=f_regression, k=5)
selector.fit(X, y)
mask = selector.get_support()    # Boolean mask showing whether each feature was selected
print(boston.feature_names)
print(mask)

# Keep only the selected feature columns
X_selected = selector.transform(X)
print("X.shape={}, X_selected.shape={}".format(X.shape, X_selected.shape))

output

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
[False False  True False False  True False False False  True  True False
  True]
X.shape=(506, 13), X_selected.shape=(506, 5)

SelectPercentile

Selects the top percentile% of the explanatory variables. The score_func argument is normally f_classif (the default) for classification and f_regression for regression. The percentage of features to keep (0 to 100) is given by the argument percentile.

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectPercentile, f_regression

boston = load_boston()
X = boston.data
y = boston.target

# Select 40% of the features
selector = SelectPercentile(score_func=f_regression, percentile=40)
selector.fit(X, y)
mask = selector.get_support()
print(boston.feature_names)
print(mask)

# Keep only the selected feature columns
X_selected = selector.transform(X)
print("X.shape={}, X_selected.shape={}".format(X.shape, X_selected.shape))

output

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
[False False  True False False  True False False False  True  True False
  True]
X.shape=(506, 13), X_selected.shape=(506, 5)

GenericUnivariateSelect

Choose the selection strategy with the mode argument ('percentile', 'k_best', 'fpr', 'fdr', 'fwe') and pass that strategy's parameter with param. For example,

selector = GenericUnivariateSelect(mode='percentile', score_func=f_regression, param=40)

and

selector = SelectPercentile(score_func=f_regression, percentile=40)

are equivalent.
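A minimal runnable sketch of the 'percentile' form, reusing the Boston data from above; it should reproduce the SelectPercentile mask:

from sklearn.datasets import load_boston
from sklearn.feature_selection import GenericUnivariateSelect, f_regression

boston = load_boston()
selector = GenericUnivariateSelect(mode='percentile', score_func=f_regression, param=40)
selector.fit(boston.data, boston.target)
print(selector.get_support())    # same mask as the SelectPercentile example above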

Model-based feature selection

Select features according to the importance the model assigns to each feature, obtained from the fitted model's feature_importances_ attribute.
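As a short illustration of where these importances come from, fit a random forest and print its feature_importances_ directly:

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor

boston = load_boston()
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(boston.data, boston.target)

# One importance per feature; for a random forest these values sum to 1
for name, imp in zip(boston.feature_names, model.feature_importances_):
    print("{:8s} {:.3f}".format(name, imp))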

SelectFromModel

Specify the estimator and the importance threshold (argument threshold).

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

boston = load_boston()
X = boston.data
y = boston.target

# Use RandomForestRegressor as the estimator; keep features whose importance is at least the median
selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42), threshold="median")
selector.fit(X, y)
mask = selector.get_support()
print(boston.feature_names)
print(mask)

# Keep only the selected feature columns
X_selected = selector.transform(X)
print("X.shape={}, X_selected.shape={}".format(X.shape, X_selected.shape))

output

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
[ True False False False  True  True False  True False  True  True False
  True]
X.shape=(506, 13), X_selected.shape=(506, 7)
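Besides "median", threshold also accepts a plain float, "mean", or a scaled string such as "1.25*mean". A brief sketch of the scaled form:

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

boston = load_boston()

# Keep only features whose importance is at least 1.25 times the mean importance
selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42),
                           threshold="1.25*mean")
selector.fit(boston.data, boston.target)
print(selector.get_support())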

Repeated feature selection

Features are selected iteratively: either start from an empty set and add features one at a time until some criterion is satisfied, or start with all features and remove them one at a time.

RFE

RFE (Recursive Feature Elimination) starts with all features, builds a model, and removes the feature the model considers least important. A model is then built again and the least important feature removed, and this is repeated until the specified number of features remains.

Specify the estimator and the target number of features n_features_to_select as arguments. Because the build-model => remove-feature cycle runs (number of features - n_features_to_select) times, this can take a long time.

from sklearn.datasets import load_boston
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

boston = load_boston()
X = boston.data
y = boston.target

# Use RandomForestRegressor as the estimator; select 5 features
selector = RFE(RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=5)
selector.fit(X, y)
mask = selector.get_support()
print(boston.feature_names)
print(mask)

# Keep only the selected feature columns
X_selected = selector.transform(X)
print("X.shape={}, X_selected.shape={}".format(X.shape, X_selected.shape))

output

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
[ True False False False  True  True False  True False False False False
  True]
X.shape=(506, 13), X_selected.shape=(506, 5)
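Beyond the boolean mask, RFE also records the elimination order in its ranking_ attribute (rank 1 for selected features; larger ranks were eliminated earlier). A short sketch with the same setup:

from sklearn.datasets import load_boston
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

boston = load_boston()
selector = RFE(RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=5)
selector.fit(boston.data, boston.target)

# ranking_: rank 1 = selected; larger ranks were eliminated earlier
for name, rank in zip(boston.feature_names, selector.ranking_):
    print("{:8s} rank={}".format(name, rank))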
