[PYTHON] [Machine learning] Feature selection of categorical variables using chi-square test

Introduction

Recently I obtained Grade 2 of the statistics certification exam, and as part of preparing for it I studied a statistical method called the "chi-square test". I wondered whether this test could also be useful for feature selection in machine learning, and found that it is already implemented in scikit-learn's SelectKBest, so this article is a note on how to use it.

What is a chi-square test?

Also known as the "test of independence", it tests whether event A and event B are independent.

Assuming that events A and B are independent, it checks how unlikely the observed values would be under that assumption; if they are sufficiently unlikely, we conclude that A and B are not independent and have some relationship. That is my understanding of the test.

The following site gives a somewhat more detailed explanation.

Test of Independence-Most Popular Chi-square Test (Statistics WEB)
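
As a concrete illustration (a minimal sketch of my own, not from the original article), the test can be run on a small contingency table with scipy.stats.chi2_contingency; the numbers below are made up for the example.

from scipy.stats import chi2_contingency
import numpy as np

# Hypothetical contingency table: rows = gender, columns = income class (<=50K, >50K)
observed = np.array([[120,  30],
                     [ 80,  70]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(chi2_stat, p_value)  # a small p-value suggests the two variables are not independent
print(expected)            # frequencies expected under the independence assumption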

Implementation example in Python and comparison of accuracy

Data used this time

We use the following dataset, whose explanatory variables include categorical data. It records whether a person's annual income exceeds 50K, along with attributes such as age, occupation, and gender.

Data name: Adult Data Set
URL: https://archive.ics.uci.edu/ml/datasets/adult
Number of records: 32561

The code to read the data is shown below. Since the chi-square test checks how much the observed frequency of a feature deviates from its expected value, the numerical (quantitative) columns are excluded and only the categorical columns are used for training.

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Read the data, shuffle it, drop missing rows, and assign column names
columns=['age','workclass','fnlwgt','education','education-num',
 'marital-status','occupation','relationship','race','sex','capital-gain',
 'capital-loss','hours-per-week','native-country','Income']
data = pd.read_csv('adult.data.csv', header=None).sample(frac=1).dropna()
data.columns=columns
print(data.shape)
data = data.replace({' <=50K':0,' >50K':1})
data.head()


# Separate the explanatory variables (categorical data) from the target variable
data_x = data[['workclass','education','marital-status','occupation','relationship','race','sex','native-country']]
data_y = data['Income']

# Convert the categorical data to dummy variables
data_dm = pd.get_dummies(data_x)

# Split into training and test data
X_train, X_test, Y_train, Y_test = train_test_split(data_dm, data_y, train_size=0.7)

Without feature selection

First, let's check the result without feature selection, that is, when the model is trained on all the features.

The model used here is gradient boosting, trained with its default parameters and no tuning.

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.pipeline import Pipeline 

scaler = StandardScaler()
clf = GradientBoostingClassifier()

# Define the pipeline: standardization → classifier training
estimator = [
    ('scaler',scaler),
    ('clf',clf)
]

pipe = Pipeline(estimator)
pipe.fit(X_train, Y_train)
print('Accuracy: ' + str(pipe.score(X_test, Y_test)))

Accuracy: 0.828743986078

Even without any parameter tuning, the accuracy is reasonably good. Next, let's look at the results when the features are selected with the chi-square test.

Feature selection using chi-square test

Feature selection with the chi-square test can be implemented easily using scikit-learn's SelectKBest.

This class narrows down the features based on the scoring function specified by the score_func parameter. It is used through the usual fit/transform methods, and the number of features to keep is specified by the parameter k.

Several scoring functions can be chosen; since we use the chi-square test here, chi2 is specified.
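
Before running the selection loop, the chi-square statistic and p-value that chi2 computes for each dummy-encoded feature can also be inspected directly. This is a minimal sketch of my own (not from the original article) using the scores_ and pvalues_ attributes of a fitted SelectKBest.

from sklearn.feature_selection import SelectKBest, chi2
import pandas as pd

# Score every feature without dropping any (k='all')
select_all = SelectKBest(score_func=chi2, k='all')
select_all.fit(X_train, Y_train)

scores = pd.DataFrame({
    'feature': data_dm.columns,
    'chi2': select_all.scores_,
    'p_value': select_all.pvalues_,
}).sort_values('chi2', ascending=False)
print(scores.head(10))  # the features with the largest chi-square statistics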

We first train on all the features, then remove them one at a time, and stop the iteration when the model's accuracy starts to drop. (This approach is apparently called "backward feature elimination".)

from sklearn.feature_selection import SelectKBest, chi2
max_score = 0
for k in range(len(data_dm.columns)):
    print(len(data_dm.columns)-k)  # number of features kept in this iteration
    select = SelectKBest(score_func=chi2, k=len(data_dm.columns)-k)
    scaler = StandardScaler()
    clf = GradientBoostingClassifier()

    # Define the pipeline: feature selection → standardization → classifier training
    estimator = [
        ('select',select),
        ('scaler',scaler),
        ('clf',clf)
    ]

    pipe_select = Pipeline(estimator)
    pipe_select.fit(X_train, Y_train)
    score = pipe_select.score(X_test, Y_test)
    if score < max_score:
        break
    else:
        max_score = score
        pipe_fix = pipe_select

print('Accuracy: ' + str(pipe_fix.score(X_test, Y_test)))

Accuracy: 0.828948715324

The accuracy improved slightly. It seems that the original data contained some unnecessary features, and that they could be removed by feature selection with the chi-square test.

Incidentally, the excluded features can be checked as follows.

mask = ~pipe_fix.steps[0][1].get_support()  # invert the mask to get the dropped features
data_dm.iloc[:,mask].columns

Index(['native-country_ Greece', 'native-country_ Holand-Netherlands','native-country_ Thailand'],dtype='object')

Model-based feature selection

For comparison, here is another feature selection method. Some machine learning models keep track of how much each feature contributes to the prediction, for example Lasso regression and ridge regression. Here, let's look at an implementation example of feature selection with a random forest and its result.

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

select = SelectFromModel(RandomForestClassifier(n_estimators=100, n_jobs=-1))
scaler = StandardScaler()
clf = GradientBoostingClassifier()

# Define the pipeline: feature selection → standardization → classifier training
estimator = [
    ('select',select),
    ('scaler',scaler),
    ('clf',clf)
]

pipe_rf = Pipeline(estimator)
pipe_rf.fit(X_train, Y_train)
print('Accuracy: ' + str(pipe_rf.score(X_test, Y_test)))

Accuracy: 0.826287235132

The accuracy is lower than when all the features are used.

I suspect the accuracy suffers because the parameters of the model used for feature selection are not tuned, but since that is not the main topic, I will not dig deeper in this article.
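
Incidentally, as with SelectKBest, the features kept by SelectFromModel can be inspected after fitting via get_support(); the following is a minimal sketch of my own, reusing pipe_rf from above.

# Which features did the random-forest-based selector keep?
rf_mask = pipe_rf.steps[0][1].get_support()
print(data_dm.columns[rf_mask])   # selected features
print(data_dm.columns[~rf_mask])  # dropped features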

Summary

In this article, I gave a brief explanation of the chi-square test and an implementation example using SelectKBest. My current understanding is that feature selection with the chi-square test is valid only for classification problems, because the features must be categorical. For features that include numerical data, or for regression problems, use feature selection based on criteria such as Pearson's correlation coefficient or ANOVA. (Personally, I often use feature selection with random forests, which is easy to apply to both categorical and numerical data.)
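
As an aside, a minimal sketch of the ANOVA-based alternative mentioned above, using scikit-learn's f_classif on the numerical columns of the same dataset (the choice of columns and k here is my own illustration, not from the original article):

from sklearn.feature_selection import SelectKBest, f_classif

# The ANOVA F-test handles numerical features with a categorical target
num_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
num_x = data[num_cols]

select_num = SelectKBest(score_func=f_classif, k=3)
select_num.fit(num_x, data_y)
print(num_x.columns[select_num.get_support()])  # the 3 numerical features with the highest F-scores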

Finally

If you have any questions or concerns about the content, please leave a comment.

[Appendix 1] Why dummy variables are created before fitting SelectKBest

Looking at the documentation of chi2, the function specified as score_func of SelectKBest (sklearn.feature_selection.chi2), you will find the following description of its input.

X : {array-like, sparse matrix}, shape = (n_samples, n_features_in)

The data passed as this parameter is described as a sparse matrix, that is, an array in which most elements are 0.

Reading the source, you can see that the following line computes the sum of each column, that is, the frequency of each feature. Because of this calculation, an error occurs if you pass data that still contains strings, i.e. data before conversion to dummy variables.

feature_count = X.sum(axis=0).reshape(1, -1)

Also, even if the features contain numerical (quantitative) data, no error occurs, but the calculation becomes meaningless: for example, for a feature such as age, the sum of the ages over all rows is computed and treated as a frequency.
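
A small sketch (my own illustration, not from the original article) of the behavior described above: for dummy-encoded 0/1 columns the column sums are genuine frequencies, while a numerical column such as age is summed just the same and treated as one.

import numpy as np
from sklearn.feature_selection import chi2

y = np.array([0, 1, 0, 1])

# Dummy-encoded (0/1) features: the column sums are genuine frequencies
X_dummy = np.array([[1, 0],
                    [0, 1],
                    [1, 0],
                    [0, 1]])
print(chi2(X_dummy, y))  # chi-square statistics and p-values per column

# A numerical column such as age: no error, but the ages are summed as a "frequency"
X_age = np.array([[23], [45], [31], [52]])
print(chi2(X_age, y))    # runs, but the result is not a meaningful chi-square test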

[Appendix 2] Checking the features selected by SelectKBest

After fitting, the SelectKBest instance exposes, via the get_support() method, a boolean mask indicating which features were selected. When the transform method is executed, the features whose mask value is True are selected from the input data and returned.

The following code is an implementation example where the number of features is narrowed down to 10.

from sklearn.feature_selection import SelectKBest, chi2
select = SelectKBest(score_func=chi2, k=10) 
select.fit(X_train, Y_train)

mask = select.get_support()
mask

array([ True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False], dtype=bool)

You can check which feature is selected with the following code, for example.

data_dm.iloc[:,mask].columns

Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss','hours-per-week', 'marital-status_ Married-civ-spouse','marital-status_ Never-married', 'relationship_ Husband','relationship_ Own-child'],dtype='object')
