[PYTHON] Feature selection by Null importances

Introduction

This is a memorandum on feature selection using null importances, which I recently looked into. Please point out anything that seems off. Reference: Feature Selection with Null Importances

Overview

The goal is to remove features that are effectively noise and to keep only the features that are genuinely important. To build a baseline for comparison, the importance of each feature is also measured on training data in which the objective (target) variable has been randomly shuffled.

Procedure

  1. Train the model many times on training data with a shuffled objective variable to build a null importance distribution for each feature
  2. Train the model on the original training data and record the actual importance of each feature
  3. Compute an importance score by comparing each actual importance against its null importance distribution
  4. Set an appropriate threshold and select features

0. Preparation

Import required libraries

import pandas as pd
import numpy as np
np.random.seed(123)

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
import time
import lightgbm as lgb

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns

import warnings
warnings.simplefilter('ignore', UserWarning)

import gc
gc.enable()

Prepare the data. This time we will use the data from the Kaggle House Prices tutorial: House Prices: Advanced Regression Techniques.

#Data read
data = pd.read_csv("./House_Price/train.csv")
target = data['SalePrice']

#Get categorical variables
cat_features = [
    f for f in data.columns if data[f].dtype == 'object'
]

for feature in cat_features:
    #Convert categorical variables to numbers
    data[feature], _ = pd.factorize(data[feature])
    #Convert type to category
    data[feature] = data[feature].astype('category')

#For now, columns that contain missing values are simply dropped.
drop_cols = [f for f in data.columns if data[f].isnull().any()]
# drop_cols.append('SalePrice') #Delete the objective variable
data = data.drop(drop_cols, axis=1)
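Note that pd.factorize encodes missing categorical values as -1, so the missing-value filter above only removes numeric columns. As a quick sanity check (my own addition, not in the reference article), you can confirm what remains:

#Check the remaining shape and that no categorical columns were dropped
print(data.shape)
print(len([f for f in cat_features if f in data.columns]), 'of', len(cat_features), 'categorical features remain')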

1. Create a distribution of null importance

Prepare a function that returns the feature importances. As in the reference article, I used LightGBM.

def get_feature_importances(data, cat_features, shuffle, seed=None):
    #Get features
    train_features = [f for f in data.columns if f != 'SalePrice']
    
    #Shuffle objective variable if necessary
    y = data['SalePrice'].copy()
    if shuffle:
        y = data['SalePrice'].copy().sample(frac=1.0)
    
    #Training with LightGBM
    dtrain = lgb.Dataset(data[train_features], y, free_raw_data=False)
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': {'l2'},
        'num_leaves': 128,
        'learning_rate': 0.01,
        'num_iterations':100,
        'feature_fraction': 0.38,
        'bagging_fraction': 0.68,
        'bagging_freq': 5,
        'verbose': 0
    }
    #The number of boosting rounds is taken from 'num_iterations' in params
    clf = lgb.train(params=params, train_set=dtrain, categorical_feature=cat_features)

    #Get the importance of features
    imp_df = pd.DataFrame()
    imp_df["feature"] = list(train_features)
    imp_df["importance"] = clf.feature_importance()
    
    return imp_df

Create a distribution of Null Importance.

null_imp_df = pd.DataFrame()
nb_runs = 80
start = time.time()
for i in range(nb_runs):
    imp_df = get_feature_importances(data=data, cat_features=cat_features, shuffle=True)
    imp_df['run'] = i + 1
    null_imp_df = pd.concat([null_imp_df, imp_df], axis=0)
print('Done in {:.1f} s'.format(time.time() - start))

2. Get the actual feature importances

actual_imp_df = get_feature_importances(data=data, cat_features=cat_features, shuffle=False)
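To get a feel for the result, it helps to plot a feature's actual importance against its null importance distribution. Below is a minimal sketch of my own using plain matplotlib (the choice of 'OverallQual' as the example feature is arbitrary; any column works):

def plot_null_distribution(feature):
    #Null importances collected over the shuffled runs for this feature
    null_imps = null_imp_df.loc[null_imp_df['feature'] == feature, 'importance'].values
    #Importance measured on the unshuffled objective variable
    act_imp = actual_imp_df.loc[actual_imp_df['feature'] == feature, 'importance'].mean()

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.hist(null_imps, bins=20, label='null importances')
    ax.axvline(act_imp, color='r', linestyle='--', label='actual importance')
    ax.set_title('Null importance distribution: {}'.format(feature))
    ax.legend()
    plt.show()

plot_null_distribution('OverallQual')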

3. Calculate importance score

The score for each feature is the log of the actual importance divided by the 75th percentile of its null importance distribution (small constants are added to avoid dividing by zero and taking the log of zero).

feature_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance'].values
    f_act_imps = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance'].mean()
    imp_score = np.log(1e-10 + f_act_imps / (1 + np.percentile(f_null_imps, 75)))
    feature_scores.append((_f, imp_score))

scores_df = pd.DataFrame(feature_scores, columns=['feature', 'imp_score'])

4. Select features

Set an appropriate threshold and select the features. This time, I decided to keep the features with a score of 0.5 or higher.

sorted_features = scores_df.sort_values(by=['imp_score'], ascending=False).reset_index(drop=True)
new_features = sorted_features.loc[sorted_features.imp_score >= 0.5, 'feature'].values
print(new_features)

# ['CentralAir' 'GarageCars' 'OverallQual' 'HalfBath' 'OverallCond' 'BsmtFullBath']
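As a follow-up (my own sketch, not part of the reference article), the selected features can then be used to train a reduced model. Columns with the pandas 'category' dtype are detected as categorical automatically, so categorical_feature does not need to be passed again:

#Train a model on only the selected features (hyperparameters roughly follow the ones above)
params_sel = {'objective': 'regression', 'metric': 'l2', 'learning_rate': 0.01, 'verbose': -1}
dtrain_sel = lgb.Dataset(data[list(new_features)], target, free_raw_data=False)
clf_sel = lgb.train(params=params_sel, train_set=dtrain_sel, num_boost_round=100)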

Finally

A top prizewinner used this technique in a competition I took part in recently, so I looked into it. There are various other feature selection methods as well, and I would like to investigate those too.
