[PYTHON] Predictive Power Score for feature selection

Feature selection

In this article, we introduce the Predictive Power Score, an index for feature selection released in April of this year, along with ppscore, the library that implements it.

When building a prediction model, choosing which features (explanatory variables) to use is called feature selection. It is done in order to:

- reduce noise from data with no predictive value
- reduce redundant data and thereby lower computational cost

Feature selection methods can be roughly classified as follows. The Predictive Power Score corresponds to the Wrapper method, but as we will see from how it is implemented, it also offers the advantages of the Filter method.

| Method | Overview | Characteristics |
| --- | --- | --- |
| Filter | Selects features by computing statistics of the data itself and cutting off at a threshold | The least computationally expensive; suited to large datasets |
| Embedded | Performs feature selection and model building at the same time, e.g. via regularization | Intermediate between the Filter and Wrapper methods |
| Wrapper | Selects features useful for prediction by repeatedly building models and trying feature subsets | Because models are actually built, selection can be precise, but the computational cost is high |

Predictive Power Score

Major features

Predictive Power Score (hereafter PPS) is a concept developed by 8080 Labs, a software company based in Germany. It can be used in much the same way as the Pearson correlation matrix often seen in EDA, and it seems to have been developed with the motivation of providing a more universally applicable index of that kind.

PPS has the following features.

- It can be applied to both categorical and numeric variables.
- PPS takes values between 0 and 1: 0 means the feature x has no power to predict the target y, and 1 means x predicts y perfectly.
- Whereas the Pearson correlation coefficient captures only linear relationships between x and y, PPS can also evaluate non-linear relationships.
- However, interactions between variables are not considered.
- As described later, a simple model is built for each score, so PPS is slower to compute than Filter methods, but faster than typical Wrapper approaches such as variable importance or permutation importance.
- Although PPS values lie between 0 and 1, comparing PPS values computed for different targets has no strict mathematical meaning.
- The PPS calculation is implemented only for MAE and F1; to try other metrics, you need to implement them yourself.

How to calculate PPS

  1. Decide whether the relationship between two features, or between a feature and the target variable, should be treated as a regression problem or a classification problem, based on the data type and the number of levels (cardinality).
  2. Preprocess the data: exclude missing values, one-hot encode categorical features, label encode the target, and so on.
  3. Calculate PPS using a different definition formula for regression and for classification.
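As a rough illustration of step 1, the task type might be decided from the pandas dtype and the cardinality. This is a minimal sketch with an arbitrary threshold; `infer_task` is my own illustrative function, not ppscore's actual logic:

```python
import pandas as pd

def infer_task(series: pd.Series, cardinality_threshold: int = 15) -> str:
    """Guess whether predicting this column is a regression or a
    classification problem (illustrative only, not ppscore's real rules)."""
    if pd.api.types.is_numeric_dtype(series) and series.nunique() > cardinality_threshold:
        return "regression"
    return "classification"

print(infer_task(pd.Series(range(100))))        # regression
print(infer_task(pd.Series(["yes", "no"] * 5)))  # classification
```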

The definition formulas used in step 3 are shown in the table below.

| Task | PPS definition |
| --- | --- |
| Regression | PPS = 1 - (MAE_model / MAE_naive) |
| Classification | PPS = (F1_model - F1_naive) / (1 - F1_naive) |

MAE_model is the MAE when predicting y from x, and MAE_naive is the MAE when always predicting the median of y. The _naive baselines exist to normalize PPS into the 0-to-1 range. Likewise, F1_naive is the weighted F1 obtained by always predicting the most frequent class. You may be wondering at this point: how exactly is this "prediction" made?
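To make the normalization concrete, here is a tiny worked example that plugs hypothetical error values into both formulas (the numbers are made up for illustration):

```python
# Hypothetical errors: the model's MAE and the naive median-predictor's MAE.
mae_model = 2.0   # MAE when predicting y from x
mae_naive = 8.0   # MAE when always predicting the median of y

# Regression PPS: 1 when the model is perfect (MAE_model = 0),
# 0 when it is no better than the naive baseline.
pps_regression = 1 - (mae_model / mae_naive)
print(pps_regression)  # 0.75

# Classification PPS works the same way, with weighted F1 scores.
f1_model = 0.9
f1_naive = 0.6
pps_classification = (f1_model - f1_naive) / (1 - f1_naive)
print(round(pps_classification, 2))  # 0.75
```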

Equivalent to the Wrapper method, but with the advantages of the Filter method

As mentioned above, calculating PPS falls under the Wrapper method in the feature-selection taxonomy: the prediction is made by building a decision tree model (with cross-validation), and the score is computed via that model. However, PPS only has to be computed once; after that, you can feed the narrowed-down feature set into a more complex model, so in practice it can also be used like a Filter method. This is possible because, as the developers note, the model built for each score is a simple decision tree on a single variable, and decision trees are fast to train compared with SVMs, GBDTs, NNs, and so on. Decision trees were also chosen because they can capture non-linear relationships and have relatively robust predictive performance.
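The idea can be sketched as follows: fit a one-variable decision tree under cross-validation and turn its MAE into a PPS-style score. This is a minimal sketch of the concept on synthetic data, not ppscore's actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)
y = x ** 2 + rng.normal(0, 0.1, 500)  # a non-linear relationship

# MAE of a single-variable decision tree, estimated by cross-validation
mae_model = -cross_val_score(
    DecisionTreeRegressor(max_depth=4), x.reshape(-1, 1), y,
    scoring="neg_mean_absolute_error", cv=4,
).mean()

# Naive baseline: always predict the median of y
mae_naive = np.abs(y - np.median(y)).mean()

# PPS-style score: well above 0 here, since the tree captures the
# quadratic relationship that a linear correlation would largely miss
pps_like = max(0, 1 - mae_model / mae_naive)
print(round(pps_like, 2))
```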

Usage when selecting features

- Exclude features with a low PPS, just as you would with low variable importance.
- Build a PPS matrix between the features: pairs with a high mutual PPS may contain redundant information, much like finding multicollinearity in a correlation matrix, so keep only the more important of the two.
- If you want to select features more precisely, you can try greedily adding or removing variables, or adding and removing them at random.
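The first two bullets could be mechanized roughly like this; the PPS values and the 0.05 threshold below are made-up placeholders:

```python
import pandas as pd

# Hypothetical PPS of each feature against the target (placeholder values)
pps_to_target = pd.Series({
    "tenure": 0.25, "MonthlyCharges": 0.18, "gender": 0.00, "customerID": 0.01,
})

# 1) Drop features with a low PPS against the target
threshold = 0.05
selected = pps_to_target[pps_to_target >= threshold].index.tolist()
print(selected)  # ['tenure', 'MonthlyCharges']

# 2) Among the survivors, flag redundant pairs with a high mutual PPS
#    (again, hypothetical values between features)
pps_between = {("tenure", "MonthlyCharges"): 0.02}
redundant = [pair for pair, s in pps_between.items() if s > 0.5]
print(redundant)  # [] -> nothing redundant here, keep both
```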

Usage other than feature selection

  1. Finding patterns in the data: since PPS is computed for both categorical and numeric variables, it is a convenient way to discover relationships, including non-linear ones, between all kinds of variables.

  2. Detecting data leakage: if a feature's PPS is conspicuously higher than the others', you can suspect that it contributes to leakage, i.e. that it contains information that would not actually be available at prediction time.
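As a toy illustration of the leakage case: a feature that secretly copies the target maxes out a cross-validated F1 score, while an unrelated feature stays near the baseline. This uses synthetic data and is my own sketch of the idea, not ppscore's internals:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
y = rng.integers(0, 2, 400)
leaked = y.copy()                # a "feature" that leaks the target
noise = rng.integers(0, 2, 400)  # an unrelated feature

def f1_cv(x, y):
    """Weighted F1 of a one-variable decision tree, via cross-validation."""
    return cross_val_score(
        DecisionTreeClassifier(), x.reshape(-1, 1), y,
        scoring="f1_weighted", cv=4,
    ).mean()

print(round(f1_cv(leaked, y), 2))  # 1.0 -> suspiciously perfect: likely leakage
print(round(f1_cv(noise, y), 2))   # stays near the naive baseline
```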

Introducing the library

Installation

pip install ppscore

Import and score calculation

import ppscore as pps
pps.score(df, "feature_column", "target_column")

Output of PPS matrix

pps.matrix(df)

Applying it to real data

Usage data and environment

Telco Customer Churn is a dataset of customer attributes and churn (cancellation) information for an internet service. I used a Kaggle notebook as the environment: on the dataset page there is a blue "New Notebook" button, and clicking it launches a notebook with immediate access to the dataset.

Preparation

Install it.

!pip install ppscore

Import the library and load the data.

import numpy as np 
import pandas as pd 
import ppscore as pps
import seaborn as sns
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
PATH = '/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv'
df = pd.read_csv(PATH)
df.shape

Check the column name.

list(df.columns)

Churn is the target.

['customerID',
 'gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'tenure',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'MonthlyCharges',
 'TotalCharges',
 'Churn']

Check the data type.

df.dtypes
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

Calculation of PPS

pps.score(df, 'InternetService', 'Churn')

The results are returned in dictionary format.

{'x': 'InternetService',
 'y': 'Churn',
 'task': 'classification',
 'ppscore': 1.625853361551631e-07,
 'metric': 'weighted F1',
 'baseline_score': 0.6235392486748098,
 'model_score': 0.6235393098818076,
 'model': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                        max_depth=None, max_features=None, max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=None, splitter='best')}

Visualization of PPS matrix

Looking at the Churn row of the matrix, the PPS of each feature against the target is shown as a heatmap, and tenure, MonthlyCharges, and TotalCharges stand out in particular. These represent the length of service use, the monthly fee, and the cumulative fee, all features closely tied to cancellation. Looking at the MonthlyCharges row, the PPS values from InternetService through StreamingTV are notably high. As the dtypes above show, these features are categorical variables, and it is convenient to be able to see their relationship with the numeric variable MonthlyCharges in the same view. The interpretation is straightforward: which optional internet services a customer subscribes to is strongly related to the usage fee. Also, looking at the PPS between the features InternetService and StreamingTV themselves, the strong contrast suggests they carry similar information, so dimensionality reduction could be considered.

Computing the matrix for the (7043, 21) dataframe took 1 minute 55 seconds. That is not especially fast, but for data on the order of 10,000 rows it is an acceptable wait, and as the data grows you can sample rows to get the overall trend.

df_matrix = pps.matrix(df)
plt.figure(figsize=(18,18))
sns.heatmap(df_matrix, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)
plt.show()

[Figure: PPS matrix heatmap]
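One caveat for readers trying this today: if I understand the changes correctly, newer versions of ppscore (1.x) return pps.matrix as a long-format DataFrame with x, y, and ppscore columns rather than a square matrix, so it needs a pivot before being passed to sns.heatmap. The reshaping itself is plain pandas (the values below are made up):

```python
import pandas as pd

# Long-format output in the shape returned by newer ppscore versions
# (the ppscore values here are invented for illustration)
long_df = pd.DataFrame({
    "x": ["tenure", "tenure", "Churn", "Churn"],
    "y": ["tenure", "Churn", "tenure", "Churn"],
    "ppscore": [1.0, 0.25, 0.20, 1.0],
})

# Reshape into the square matrix that sns.heatmap expects
matrix_df = long_df.pivot(columns="x", index="y", values="ppscore")
print(matrix_df)
```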

Finally

We have now introduced the Predictive Power Score (PPS). Because it is easy to apply to data and makes the relationships between variables easy to visualize, it can be used both for EDA and for feature selection. The ppscore implementation computes PPS with MAE and F1, but you could also try other metrics while keeping the underlying PPS concept.
