Probability prediction of imbalanced data

Introduction

When performing classification with machine learning, you may want not only the predicted classes but also the probability of belonging to each class. If the number of positive examples is extremely small compared to the number of negative examples (such data is called imbalanced data), a model built on all of the data tends to predict almost everything as negative, making it difficult to classify the positive examples correctly. A common remedy is therefore to build the model on undersampled data, in which the number of negative examples has been reduced to match the number of positive examples. This makes it possible to classify positive examples with high accuracy, but because the class balance now differs from the original data, the predicted probabilities are biased by the undersampling.
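
As a concrete illustration (the counts here are made up): suppose the original data contains 100 positive and 1,000 negative examples. Undersampling keeps all 100 positives and a random sample of 100 negatives, so the model is trained on data in which positives are ten times more frequent than in reality, and its predicted probabilities are inflated accordingly.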

How to deal with this problem has already been summarized in the blog posts below, but as a memorandum I will summarize how to remove the bias from the probabilities output by a model built on undersampled data. In this article, we simply use a logistic regression model for probability prediction.

- [Correct the bias of the prediction probability when dealing with imbalanced data with Undersampling + bagging and visualize the result](https://tjo.hatenablog.com/entry/2019/08/04/150431)
- Bias of prediction probability due to downsampling

Method for correcting the bias due to undersampling

The method for correcting the bias due to undersampling was proposed in the paper [Calibrating Probability with Undersampling for Unbalanced Classification](https://www3.nd.edu/~dial/publications/dalpozzolo2015calibrating.pdf).

Now consider a binary classification task that predicts the objective variable $Y$ ($Y$ takes either 0 or 1) from the explanatory variable $X$. Let $(X, Y)$ be the original dataset, which is imbalanced with an extremely small number of positive examples, and let $(X_s, Y_s)$ be the dataset obtained by undersampling so that the number of negative examples equals the number of positive examples. We also introduce a sampling variable $s$ that takes the value 1 if a sample contained in $(X, Y)$ is also contained in $(X_s, Y_s)$, and 0 if it is not.

When an explanatory variable is given to a model built on the original dataset $(X, Y)$, the predicted probability of the positive class is written $p(y=1|x)$. Likewise, when an explanatory variable is given to a model built on the undersampled dataset $(X_s, Y_s)$, the predicted probability of the positive class is written $p(y=1|x, s=1)$. Writing $p = p(y=1|x)$ and $p_s = p(y=1|x, s=1)$, the relationship between $p$ and $p_s$ is as follows:

p=\frac{\beta p_s}{\beta p_s-p_s+1}

Here, $\beta = N^+/N^-$, where $N^+$ is the number of positive examples and $N^-$ is the number of negative examples.
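
As a minimal sketch of this correction (the helper name correct_probability and the example numbers below are my own, not from the paper), it can be implemented in a few lines:

import numpy as np

def correct_probability(p_s, n_pos, n_neg):
    # Correct probabilities predicted by a model trained on undersampled data.
    # p_s: predicted probabilities from the model trained on the balanced data
    # n_pos, n_neg: numbers of positive / negative examples in the original data
    p_s = np.asarray(p_s, dtype=float)
    beta = n_pos / n_neg # sampling rate of the negative class
    return beta * p_s / (beta * p_s - p_s + 1)

# With beta = 250/1000 = 0.25, a raw prediction of 0.5 is corrected to 0.2
print(correct_probability([0.5], n_pos=250, n_neg=1000)) # => [0.2]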

Derivation

The following is a detailed derivation of this formula, so skip ahead if you are not interested.

By Bayes' theorem and the fact that $p(s|y,x) = p(s|y)$ (whether a sample is kept depends only on its class label), the probability predicted by a model built on the undersampled dataset $(X_s, Y_s)$ can be written as follows:

p(y=1|x,s=1)=\frac{p(s=1|y=1)p(y=1|x)}{p(s=1|y=1)p(y=1|x)+p(s=1|y=0)p(y=0|x)}

Since the number of positive examples is extremely small and all of the data with $y=1$ is sampled, we can set $p(s=1|y=1)=1$, which gives

p(y=1|x,s=1)=\frac{p(y=1|x)}{p(y=1|x)+p(s=1|y=0)p(y=0|x)}

Furthermore, writing $p = p(y=1|x)$, $p_s = p(y=1|x, s=1)$, and $\beta = p(s=1|y=0)$, this becomes

p_s=\frac{p}{p+\beta(1-p)}

Finally, solving this equation for $p$ gives

p=\frac{\beta p_s}{\beta p_s-p_s+1}

This last equation means that the unbiased probability $p$, which a model built on the original data would predict, can be recovered by correcting the probability $p_s$ predicted by the model built on the undersampled data.

Here, $\beta = p(s=1|y=0)$ is the probability that a negative example is sampled. Since the negative examples are sampled so that their number equals the number of positive examples, $\beta$ can be approximated by $N^+/N^-$, where $N^+$ is the number of positive examples and $N^-$ is the number of negative examples.
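
As an illustration with made-up numbers: if the training data contains $N^+ = 500$ positive and $N^- = 1000$ negative examples, then $\beta = 0.5$, and a raw prediction of $p_s = 0.8$ is corrected to $p = (0.5 \times 0.8)/(0.5 \times 0.8 - 0.8 + 1) = 0.4/0.6 \approx 0.67$, pulling the inflated probability back down.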

Code example

In the following, we run an experiment to correct the predicted probabilities, showing the actual code along the way. (The environment for the code below is Python 3.7.3, pandas 0.24.2, scikit-learn 0.20.3.)

The experiment is performed according to the following flow.

  1. Build a model using the imbalanced data as it is and verify the classification accuracy.
  2. Build a model using the undersampled data and verify that the classification accuracy improves but the predicted probabilities are biased.
  3. Verify whether the probability prediction accuracy improves by applying the correction that removes the bias due to undersampling.

Here we use the [Adult Dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/) published in the UCI Machine Learning Repository. This dataset is for classifying whether an individual's annual income exceeds $50,000 based on attributes such as gender and age.

First, load the data. Save adult.data and adult.test from the Adult Dataset locally as CSV files (adult_data.csv and adult_test.csv below), and use the former as training data and the latter as validation data.

import numpy as np
import pandas as pd

# Load the data
train_data = pd.read_csv('./adult_data.csv', names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 
                                         'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                                         'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'obj'])
test_data = pd.read_csv('./adult_test.csv', names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 
                                         'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                                         'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'obj'],
                       skiprows=1)
data = pd.concat([train_data, test_data])

# Process the explanatory variables X and the objective variable Y
X = pd.get_dummies(data.drop('obj', axis=1))
Y = data['obj'].map(lambda x: 1 if x==' >50K' or x==' >50K.' else 0) # Objective variable: 1 if income >50K, else 0

# Split into training and validation data
train_size = len(train_data)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:] 
Y_train, Y_test = Y.iloc[:train_size], Y.iloc[train_size:]

Looking at the proportion of positive examples in the training data, it is about 24%, far fewer than the negative examples, so this can be considered imbalanced data.

print('positive ratio = {:.2f}%'.format((len(Y_train[Y_train==1])/len(Y_train))*100))
#output=> positive ratio = 24.08%

If you build a model on this training data as it is, the classification accuracy is low: AUC = 0.57 and recall = 0.26. Because negative examples dominate the training data, the model tends to predict negative, so the recall (the rate at which positive examples are correctly classified as positive) is low.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, recall_score

# Build the model
lr = LogisticRegression(random_state=0)
lr.fit(X_train, Y_train)

# Evaluate classification accuracy
prob = lr.predict_proba(X_test)[:, 1] # Predict the probability that the objective variable is 1
pred = lr.predict(X_test) # Predict the class (1 or 0)
auc = roc_auc_score(y_true=Y_test, y_score=prob)
print('AUC = {:.2f}'.format(auc))
recall = recall_score(y_true=Y_test, y_pred=pred)
print('recall = {:.2f}'.format(recall))

#output=> AUC = 0.57
#output=> recall = 0.26

Next, we undersample the training data so that the number of negative examples equals the number of positive examples. Building a model on this data greatly improves the classification accuracy, to AUC = 0.90 and recall = 0.86.

# Undersample: keep all positives and an equal number of randomly chosen negatives
pos_idx = Y_train[Y_train==1].index
neg_idx = Y_train[Y_train==0].sample(n=len(Y_train[Y_train==1]), replace=False, random_state=0).index
idx = np.concatenate([pos_idx, neg_idx])
X_train_sampled = X_train.loc[idx] # select by index label
Y_train_sampled = Y_train.loc[idx]

# Build the model
lr = LogisticRegression(random_state=0)
lr.fit(X_train_sampled, Y_train_sampled)

# Evaluate classification accuracy
prob = lr.predict_proba(X_test)[:, 1]
pred = lr.predict(X_test)
auc = roc_auc_score(y_true=Y_test, y_score=prob)
print('AUC = {:.2f}'.format(auc))
recall = recall_score(y_true=Y_test, y_pred=pred)
print('recall = {:.2f}'.format(recall))

#output=> AUC = 0.90
#output=> recall = 0.86

Now let's look at the accuracy of the predicted probabilities. The log loss is 0.41, and the calibration plot passes below the 45-degree line, which means the predicted probabilities are larger than the actual probabilities. Because the model was built on undersampled data in which the number of negative examples equals the number of positive examples, it was trained with a larger proportion of positive examples than in the original data, and is therefore thought to output correspondingly higher probabilities.

import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.calibration import calibration_curve
from sklearn.metrics import log_loss

def calibration_plot(y_true, y_prob):
    prob_true, prob_pred = calibration_curve(y_true=y_true, y_prob=y_prob, n_bins=20)

    fig, ax1 = plt.subplots()
    ax1.plot(prob_pred, prob_true, marker='s', label='calibration plot', color='skyblue') # Calibration plot
    ax1.plot([0, 1], [0, 1], linestyle='--', label='ideal', color='limegreen') # 45-degree line
    ax1.legend(bbox_to_anchor=(1.12, 1), loc='upper left')
    plt.xlabel('predicted probability')
    plt.ylabel('actual probability')

    ax2 = ax1.twinx() # Add a second y-axis
    ax2.hist(y_prob, bins=20, histtype='step', color='orangered') # Histogram of the predicted probabilities
    plt.ylabel('frequency')
    plt.show()

prob = lr.predict_proba(X_test)[:, 1]
loss = log_loss(y_true=Y_test, y_pred=prob)
print('log loss = {:.2f}'.format(loss))
calibration_plot(y_true=Y_test, y_prob=prob)

#output=> log loss = 0.41

[Figure: calibration plot of the model built on undersampled data; the curve lies below the 45-degree line]

Now, let's remove the bias due to undersampling and correct the probabilities. Computing $\beta$ and correcting the probabilities according to $p = \beta p_s / (\beta p_s - p_s + 1)$, the log loss improves to 0.32 and the calibration plot lies almost on the 45-degree line. Note that $\beta$ is computed from the numbers of positive and negative examples in the training data (those of the validation data are unknown).

# beta = N+/N-, computed from the training data
beta = len(Y_train[Y_train==1]) / len(Y_train[Y_train==0])
prob_corrected = beta*prob / (beta*prob - prob + 1)

loss = log_loss(y_true=Y_test, y_pred=prob_corrected)
print('log loss = {:.2f}'.format(loss))
calibration_plot(y_true=Y_test, y_prob=prob_corrected)

#output=> log loss = 0.32

[Figure: calibration plot after correction; the curve lies almost on the 45-degree line]

This confirms that the bias due to undersampling can be removed and the probabilities corrected. That concludes the experiment.

Conclusion

In this article, we have briefly summarized how to correct the probabilities predicted by a model built using undersampled data. I would appreciate it if you could point out any mistakes.

