LightGBM is a go-to option for training binary classification models these days. However, it has always bothered me that the score it outputs does not necessarily represent a probability. Wanting to do something about this, I came up with a method that seems to work reasonably well, so I'm writing it up here.
Binary classification is the machine learning task of guessing whether something is a zero or a one. In machine learning the terms "positive" and "negative" are more common, so that is how I will write it below.
A typical example is disease diagnosis: determining from test results whether a person is affected (positive) or not (negative). Given the examinee's sex, age, and various test values, a binary classification model outputs a score representing how likely the person is to be affected, and to turn that into one of the two labels, positive or negative, you cut the score at an appropriate threshold. For this purpose the "likelihood of being affected" does not have to be a probability, because all you need is to sort by score and split at a threshold.
However, there are times when you want an actual probability value, not just a relative ordering of "likelihood". For example, suppose you have service A and service B and you want to promote to each user only the one they are more likely to join. Each of these is again a binary classification problem, but to pick the one with the higher participation probability, the scores of the A model and the B model must be comparable, which means their scale and bias must match.
I built a model by throwing the Santander dataset from Kaggle into LightGBM. It is a dataset of whether customers purchase a financial product, and positives make up about 10%, a moderate imbalance. The target variable is 1 or 0, but for readability I will call 1 positive and 0 negative.
The parameters are as follows; a minimal training sketch follows the table. Setting is_unbalance for imbalanced data makes training treat positives and negatives as if they each made up half the data, which spreads the predicted values out nicely.
| Parameter | Value |
|---|---|
| objective | binary |
| num_leaves | 15 |
| is_unbalance | True |
| num_boost_round | 100 |
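Just to make the setup concrete, here is a minimal training sketch under these parameters. The file name, the `ID_code` and `target` columns, and the validation split are my assumptions, not something fixed by the post.

```python
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Santander data from Kaggle (file/column names are assumptions)
df = pd.read_csv("train.csv")
X = df.drop(columns=["ID_code", "target"])
y = df["target"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

params = {
    "objective": "binary",
    "num_leaves": 15,
    "is_unbalance": True,
}
model = lgb.train(
    params,
    lgb.Dataset(X_train, label=y_train),
    num_boost_round=100,
)

# Raw scores in [0, 1]; with is_unbalance they are not calibrated probabilities
scores = model.predict(X_valid)
```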
Let's bin the scores of the data points into increments of 0.01 and take the fraction of positive examples in each bin. Doing so lets you draw a calibration plot of score versus positive-example fraction. The rmse shown in the figures is the error from the diagonal.
When this curve lies on the diagonal from (0,0) to (1,1), the predicted value directly matches the positive-example fraction, which is exactly what I want for the applications above.
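As a rough sketch of how such a plot can be drawn, assuming the `scores` and `y_valid` from the snippet above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Bin the scores in 0.01 steps and compute the positive fraction per bin
edges = np.arange(0.0, 1.01, 0.01)
idx = np.digitize(scores, edges) - 1
centers, pos_rate = [], []
for i in range(len(edges) - 1):
    mask = idx == i
    if mask.any():
        centers.append((edges[i] + edges[i + 1]) / 2)
        pos_rate.append(y_valid.values[mask].mean())
centers, pos_rate = np.array(centers), np.array(pos_rate)

# rmse from the diagonal, i.e. how far we are from perfect calibration
rmse = np.sqrt(np.mean((pos_rate - centers) ** 2))

plt.plot(centers, pos_rate, marker="o", label=f"rmse={rmse:.3f}")
plt.plot([0, 1], [0, 1], "--", color="gray")
plt.xlabel("LightGBM score")
plt.ylabel("positive fraction")
plt.legend()
plt.show()
```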
Looking at the score distributions separately for positives and negatives, it looks like this.
In the range of predicted values where positives and negatives overlap, the positive fraction takes values between 0 and 1. In the non-overlapping ranges, the positive fraction is 0 where there are only negatives and 1 where there are only positives. That is why the calibration plot above looks like a sigmoid.
Incidentally, these shapes look like beta distributions. Conveniently, the scores live in the range 0 to 1, which is exactly the support of a beta distribution. So let's estimate alpha and beta from the mean and variance (the method of moments).
e = \frac{a}{a+b}
v = \frac{ab}{ (a+b)^2 (a+b+1)}
If you solve this,
a = \frac{e^2 (1-e)}{v}-e
b = \frac{(1-e)}{e}a
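As a sketch, this moment matching takes only a few lines of NumPy; the function name is my own, and `scores` and `y_valid` are reused from the earlier snippets.

```python
import numpy as np

def fit_beta_by_moments(x):
    """Estimate beta parameters (a, b) from the sample mean and variance."""
    e, v = np.mean(x), np.var(x)
    a = e ** 2 * (1 - e) / v - e
    b = (1 - e) / e * a
    return a, b

# One beta distribution for the positive scores, one for the negative scores
pos_scores = scores[y_valid.values == 1]
neg_scores = scores[y_valid.values == 0]
a_p, b_p = fit_beta_by_moments(pos_scores)
a_n, b_n = fit_beta_by_moments(neg_scores)
```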
If you draw random samples from beta distributions with the parameters fitted this way, it looks like this.
The shapes differ a little around 0.1 and 0.9, but it's close enough.
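Drawing the samples for such a comparison might look like this, reusing the fitted parameters from above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample_p = rng.beta(a_p, b_p, size=len(pos_scores))
sample_n = rng.beta(a_n, b_n, size=len(neg_scores))

# Overlay the actual score histograms and the beta samples
bins = np.linspace(0, 1, 101)
plt.hist(pos_scores, bins=bins, alpha=0.4, label="positive scores")
plt.hist(sample_p, bins=bins, histtype="step", label="beta sample (pos)")
plt.hist(neg_scores, bins=bins, alpha=0.4, label="negative scores")
plt.hist(sample_n, bins=bins, histtype="step", label="beta sample (neg)")
plt.legend()
plt.show()
```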
Once you have alpha and beta for the positives and for the negatives, you can evaluate the positive and negative probability densities at any value between 0 and 1. Combined with the counts of positives and negatives in the original data, you can then compute the positive ratio at any score value.
Y = \frac{N_p\times Beta_p}{N_p\times Beta_p + N_n\times Beta_n}
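A minimal sketch of this calculation with scipy's beta density, using the parameters and counts from the snippets above (the function name is mine):

```python
from scipy.stats import beta

def beta_calibrate(score, a_p, b_p, a_n, b_n, n_pos, n_neg):
    """Turn a raw score into an estimated positive probability using the
    fitted beta densities weighted by the positive/negative counts."""
    dens_p = beta.pdf(score, a_p, b_p)
    dens_n = beta.pdf(score, a_n, b_n)
    return n_pos * dens_p / (n_pos * dens_p + n_neg * dens_n)

calibrated = beta_calibrate(scores, a_p, b_p, a_n, b_n,
                            len(pos_scores), len(neg_scores))
```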
Plotting the predicted value obtained from the beta distributions against the original LightGBM prediction looks like this.
Furthermore, the calibration plot of the beta-distribution prediction against the positive-example fraction from earlier looks like this.
You can see that it is quite close to the diagonal.
So, with the two-stage approach of LightGBM followed by beta distributions, we were able to bring the predicted values close to the positive-example fraction, that is, close to the diagonal.
I was planning to publish a book at Technical Book Fest 8 on 3/1, but the event was canceled because of that virus that has been in the news.