[PYTHON] I tried building a simple credit score with logistic regression.

Motivation

In October 2019, SAS Japan invited us to a conference called "SAS Analytics Experience @ Milano", where I attended Dr. Terisa's lecture on credit scoring. [Photo taken at SAS Analytics Experience @ Milano in October 2019]

The lecture covered the following three topics, and I learned that the mechanism is simple yet has a large business impact.

  1. What is a credit score in the first place?
  2. What are the current mainstream and state-of-the-art algorithms? (Using machine learning, deep learning, reinforcement learning)
  3. In what situations can a credit score be used other than credit decisions for loans?

I haven't been involved in creating credit scores in practice, but the technical background was very intuitive and easy to understand, so I'll summarize it in this article.

However, the specific method presented in the lecture cannot be shared for confidentiality reasons, so instead I will explain a simplified version of the method described in the book Dr. Terisa recommended. [Reference: https://www.wiley.com/en-us/Credit+Risk+Scorecards%3A+Developing+and+Implementing+Intelligent+Credit+Scoring-p-9781119201731]

The method in this book is not much different from the (state-of-the-art) method Dr. Terisa explained in the lecture, and many companies and startups in Europe and the United States also seem to refer to it, so I consider it a credible method.

What is a credit score? Let's sort it out first.

When someone wants to borrow money, a credit score quantifies whether that person can be trusted (that is, whether they are capable of repayment). In other words, it quantifies and visualizes social creditworthiness.

[Figure: Illustration of a credit score]

The ScoreCard Point in the figure above is the number that makes up your credit score. The rest of this article uses Python to compute these ScoreCard Points.

It may seem difficult at first glance, but the method is very simple. You can create a credit score in just the three steps below.

  1. Binning the columns (features) used to create the credit score
  2. WoE calculation for each bin of each column
  3. Calculation of ScoreCard Points for each bin of each column

Let's take a closer look at the contents of each step!

An overview of the dataset used to create a credit score

Kaggle is an online data analysis competition site where students and practitioners from all over the world compete. A dataset called "Give Me Some Credit" is shared there, so we will use it. It was used in a competition in 2011. The table below summarizes the data.

[Figure: Summary table of the features]

[Dataset: https://www.kaggle.com/brycecf/give-me-some-credit-dataset]

Using this dataset, we create a logistic regression model with "SeriousDlqin2yrs" as the target (dependent) variable, and use the coefficients assigned to the explanatory variables to calculate the credit score. (I will explain this in more detail later.)

The important point is that the credit score is created by training a model, so the dataset must of course record whether each borrower repaid or was late in repaying.

Explanation and implementation of Steps 1 and 2

Steps 1 and 2 are essentially preprocessing for creating the credit score.

[Figure: Slide for Steps 1 and 2]

Let's start with a closer look at Step 1

The work here is quite tedious and unglamorous. The features used to create the credit score (for example, monthly income) must be binned, for example into 100,000 to 200,000 yen, 200,000 to 300,000 yen, and so on.

**Why do we have to bin?** As explained in the "What is a credit score?" section above, a credit score is computed like this: people with a monthly income of 100,000 to 200,000 yen get 50 points, people with 200,000 to 300,000 yen get 55 points, and so on. From the data preparation side, this means the features have to be binned as a preprocessing step.
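
As a minimal sketch of what binning looks like in code, the snippet below assigns a few toy monthly incomes to bins with `pandas.cut`. The income values and bin edges are made up for illustration. `right=False` makes each bin include its lower edge and exclude its upper edge, which matches the `>=` / `<` convention of the implementation shown later in this article.

```python
import pandas as pd

# Toy monthly incomes (hypothetical values, not taken from the Kaggle dataset)
income = pd.Series([1500, 2500, 3999, 4000, 7200])

# Bin edges chosen by hand; right=False gives half-open intervals [lower, upper)
edges = [0, 2000, 4000, 6000, 8000]
income_bin = pd.cut(income, bins=edges, right=False)

# Each value now carries the label of the bin it falls into
print(pd.DataFrame({"income": income, "bin": income_bin}))
```

Note that 3999 and 4000 land in different bins, so where the edges are placed directly decides which customers share a score.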

**The natural next question is: "Then how should we choose the bins?"**

As described later in Step 2, the textbook criterion for binning is said to be "choose the bins so as to maximize the difference between DOG and DOB." [Reference: https://www.wiley.com/en-us/Credit+Risk+Scorecards%3A+Developing+and+Implementing+Intelligent+Credit+Scoring-p-9781119201731]

However, it seems that many practitioners of credit scoring decide by **trial and error**. Dr. Terisa said so.

Deciding by trial and error means: **"specify the bins yourself, compute the credit scores, and check them against domain knowledge and common sense to see whether the scores make sense."**

I would like to say "formulate it as an optimization problem that maximizes the difference between DOG and DOB", but this **trial-and-error approach** is quite convincing.

**That is because I think people who do credit scoring in practice have a good feel for binning. From experience, they surely have a sense of where to split income or age so that the credit score works well.**

I don't have that feel yet, but I went with trial and error anyway. For now, I wanted to experience the workflow of a credit scoring practitioner.
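
For reference, the optimization idea mentioned above could be sketched roughly as follows. This is only one possible reading of "maximize the difference between DOG and DOB", not necessarily the criterion from the book: it scores a candidate binning by the sum over bins of |DOG - DOB|, where DOG / DOB are the shares of all good / bad customers that fall into a bin (explained in Step 2). The function name `dog_dob_gap` and the toy data are hypothetical.

```python
import numpy as np

def dog_dob_gap(values, is_bad, edges):
    """Sum over bins [edges[k], edges[k+1]) of |DOG - DOB| (assumed objective)."""
    values = np.asarray(values)
    is_bad = np.asarray(is_bad).astype(bool)
    gap = 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        in_bin = (values >= lower) & (values < upper)
        dog = (in_bin & ~is_bad).sum() / (~is_bad).sum()  # share of all goods in this bin
        dob = (in_bin & is_bad).sum() / is_bad.sum()      # share of all bads in this bin
        gap += abs(dog - dob)
    return gap

# Toy data (hypothetical): age and whether the person was a bad customer
age    = [22, 25, 28, 35, 41, 47, 53, 60]
is_bad = [1,  1,  0,  1,  0,  0,  0,  0]

# Two candidate ways of splitting age into two bins
gap_split_30 = dog_dob_gap(age, is_bad, [0, 30, 100])
gap_split_40 = dog_dob_gap(age, is_bad, [0, 40, 100])
print(gap_split_30, gap_split_40)
```

On this toy data the split at 40 separates good and bad customers more sharply than the split at 30. Such a comparison is only meaningful between binnings with the same number of bins, because splitting a bin further can never decrease this sum.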

Next, let's take a closer look at Step 2

To understand Step 2, you need to understand the following two concepts.

1. What is WoE (Weight of Evidence)?

- An indicator of whether a given bin is more useful for predicting Good Customers or Bad Customers

In this definition, a Good Customer is a person who has never been late on a debt repayment or defaulted in the past, and a Bad Customer is a person who has.

A positive value helps predict a Good Customer, and a negative value helps predict a Bad Customer. This is easier to understand by looking at the actual WoE values calculated below.

[Figure: Actual WoE calculation results]

DebtRatio is an index showing how much debt you have relative to your assets (DebtRatio = liabilities / assets).

Looking at the figure above, the higher the DebtRatio, the smaller the WoE value. In other words, you can read off the message **"the more debt you have relative to your assets, the more likely you are to default."** I think this is intuitive and straightforward.
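
To make the sign of WoE concrete, here is a small sketch with made-up numbers. `dist_good` and `dist_bad` are hypothetical shares of all good / bad customers that fall into three bins. The formula, including the factor of 100, is the same one used in the implementation later in this article.

```python
import numpy as np

# Hypothetical shares of all good / bad customers per bin (each array sums to 1)
dist_good = np.array([0.50, 0.30, 0.20])
dist_bad  = np.array([0.20, 0.30, 0.50])

# WoE per bin: log of the ratio of the two shares, scaled by 100
woe = np.log(dist_good / dist_bad) * 100
print(woe.round(1))
```

The first bin holds relatively more good customers, so its WoE is positive. The second holds equal shares, so its WoE is zero. The third holds relatively more bad customers, so its WoE is negative.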

2. What is DOG / DOB?

Let's look again at the slide for Steps 1 and 2 above, and notice the DOG and DOB formulas in Step 2.

[Figure: Slide for Steps 1 and 2 (shown again)]

In these formulas, "the number of goods / bads in each category" means, for example, "how many good / bad customers are there in the category with a monthly income of 200,000 to 300,000 yen?"

"Total goods / bads" means "how many good / bad customers are there in the entire dataset?"

In the dataset used here, these are the numbers of rows where "SeriousDlqin2yrs" = 0 (good) and 1 (bad). It is hard to picture from words alone, so please see the figure below as well.

[Figure: image.png]

How was that? Did it make sense? If anything is unclear, feel free to ask a question :)
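
Written out as formulas, the definitions look like this. The symbols $G_i$ and $B_i$ (the number of good and bad customers in bin $i$) are notation introduced here for illustration. The factor of 100 follows the implementation in this article.

```latex
\mathrm{DOG}_i = \frac{G_i}{\sum_j G_j}, \qquad
\mathrm{DOB}_i = \frac{B_i}{\sum_j B_j}, \qquad
\mathrm{WoE}_i = \ln\!\left(\frac{\mathrm{DOG}_i}{\mathrm{DOB}_i}\right) \times 100
```

One caveat: in the code, the sums in the denominators run over the bins that were defined, so rows that fall outside every bin (or have missing values) are not counted.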

Finally, here is the implementation (Python) of Steps 1 and 2.


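import numpy as np
import pandas as pd

#Assumption: df holds the "Give Me Some Credit" training data as a DataFrame,
#e.g. df = pd.read_csv("cs-training.csv")  (the file name is an assumption)

#Define binning: one sub-DataFrame per bin [lower, upper)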
def binning(col, list_bins_func):
    binned_df_list = []
    for each in list_bins_func:
        binned_df_list.append(df[(df[col]>=each[0])&(df[col]<each[1])])
    return binned_df_list

#Define a function that performs Binning and WoE calculations
def calc_woe_runner(col, list_bins):
    #Actually execute binning
    list_binned_df  = binning(col, list_bins)
    each_num         = np.zeros(len(list_binned_df))
    dist_good         = np.zeros(len(list_binned_df))
    dist_bad           = np.zeros(len(list_binned_df))
    good_number  = np.zeros(len(list_binned_df))
    bad_number    = np.zeros(len(list_binned_df))
    
    #Calculate DOG and DOB
    for i, each in enumerate(list_binned_df):
        each_num[i]        = len(each)
        good_number[i]  = len(each[each["SeriousDlqin2yrs"] == 0])
        bad_number[i]    = len(each[each["SeriousDlqin2yrs"] == 1])
    
    dist_good   = good_number/good_number.sum()
    dist_bad    = bad_number/bad_number.sum()
    dist_total  =  (good_number + bad_number)/len(df)
    
    # Calculate WoE (Weight of Evidence)
    woe = np.log(dist_good/dist_bad)*100
    
    return col,woe,dist_total, good_number.sum(), good_number, bad_number.sum(),bad_number, dist_good, dist_bad

#Do the above
#Definition of variables used in binning function
col_list = ["age", "DebtRatio", 'MonthlyIncome']

age_bin_list = [[0,30], [30,40], [40,50],
                [50,60], [60,70], [70,80], 
                [80,90], [90,130]]

deptRatio_bin_list = [[0,0.2], [0.2,0.4], [0.4,0.6],
                      [0.6,0.8], [0.8,1.0], [1.0,1.2], 
                      [1.2,1.4], [1.4,1.6]]

monthlyIncome_bin_list = [[0,2000], [2000,4000], [4000,6000],
                          [6000,8000], [8000,10000], [10000,12000], 
                          [12000,14000], [14000,160000]]

list_combined = [age_bin_list, deptRatio_bin_list, monthlyIncome_bin_list]

# Actually calculate woe
col_list_for_df = []
woe_list_for_df = []
iv_list_for_df  = []
df_woe_list     = []
good_list_sum   = []
good_list_each  = []
bad_list_sum    = []
bad_list_each   = []
dist_good_list  = []
dist_bad_list   = []
total_dist_list = []
df_woe_concat   = pd.DataFrame()
for col, each_bin_for_col in zip(col_list,list_combined):
    #Run binning and WoE calculation for this column
    (col_list_for_df, woe_list_for_df, total_dist_list, good_list_sum, good_list_each,
     bad_list_sum, bad_list_each, dist_good_list, dist_bad_list) = calc_woe_runner(col, each_bin_for_col)
    #One row per bin of the current column
    col_df = pd.DataFrame(data=[col_list_for_df]*len(each_bin_for_col), columns=["col"])
    
    woe_list_for_df = pd.DataFrame(data=woe_list_for_df, columns=["WoE"])
    good_list_df    = pd.DataFrame(data=good_list_each, columns=["Num_good"], dtype=int)
    bad_list_df     = pd.DataFrame(data=bad_list_each, columns=["Num_bad"], dtype=int)
    dist_good_df    = pd.DataFrame(data=dist_good_list, columns=["Distr_good"])
    dist_bad_df     = pd.DataFrame(data=dist_bad_list, columns=["Distr_bad"])
    total_dist_df   = pd.DataFrame(data=total_dist_list, columns=["Distr_total"])
    #Label each bin as "lower-upper"
    l = []
    for e in np.array(each_bin_for_col):
        l.append(str(e[0]) + "-" + str(e[1]))
    bin_value_df = pd.DataFrame(data=l, columns=["Attribute"])
    
    df_woe_concat = pd.concat([col_df, bin_value_df,good_list_df, 
                               bad_list_df,dist_good_df, dist_bad_df, 
                               woe_list_for_df, total_dist_df], axis=1)
    df_woe_list.append(df_woe_concat)
df_woe = pd.concat(df_woe_list, axis=0)

When you run the code above, you should see output like the one below.

[Figure: Final result]

Explanation and implementation of Step 3

In Step 3, we actually calculate the Scorecard Points. The formula is as follows.

Scorecard Point = (β×WoE+ α/n)×Factor + Offset/n

The following terms in this formula are already known and need no further calculation.

  1. WoE has been calculated in Step 2.
  2. Factor is a scaling factor, so it is a constant.
  3. Offset is also a scaling constant.
  4. n is a constant because it is the number of features used to predict SeriousDlqin2yrs.

**Therefore, the only terms left to calculate are β and α.** These β and α are obtained after fitting a logistic regression model.
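
As a sketch of where the two scaling constants can come from: the code later in this article sets Factor = 20/ln(2) and Offset = 600 - Factor × ln(20). Under the common scorecard scaling Score = Offset + Factor × ln(odds), these values correspond to the targets below. The variable names `pdo`, `target_score` and `target_odds` are introduced here for illustration and do not appear in the original code.

```python
import numpy as np

# Assumed scaling targets (read off the constants used later in this article)
pdo          = 20    # points needed to double the odds
target_score = 600   # score assigned at the target odds
target_odds  = 20    # good:bad odds of 20:1

factor = pdo / np.log(2)
offset = target_score - factor * np.log(target_odds)

def score_at(odds):
    """Total score implied by this scaling for the given good:bad odds."""
    return offset + factor * np.log(odds)

# Doubling the odds from 20:1 to 40:1 adds exactly `pdo` points
print(score_at(20), score_at(40))
```

Splitting Offset and α evenly across the n features, as the formula above does, only distributes these constants over the individual Scorecard Points. The total over all features is unchanged.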

In other words, taking the dataset used this time as an example,

  1. Fit a logistic regression model with "SeriousDlqin2yrs" as the target variable
  2. After fitting, obtain the coefficients and the intercept for the variables used in the Scorecard Points (in this case age, DebtRatio and MonthlyIncome)
  3. Each coefficient becomes β and the intercept becomes α

**In other words, Step 3 is just fitting a logistic regression model and reading off its coefficients and intercept.**
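
The article does not show how the training data is prepared, so here is a minimal, self-contained sketch of the fitting step on synthetic data. The data, the split ratio and the random seeds are all assumptions for illustration. Only the use of `LogisticRegression`, `coef_` and `intercept_` mirrors the real implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 3 features and a binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_logit = 1.0 * X[:, 0] - 0.5 * X[:, 1] + 0.2 * X[:, 2]
y = (rng.random(500) < 1 / (1 + np.exp(-true_logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

lr = LogisticRegression()
lr.fit(X_train, y_train)

alpha = lr.intercept_[0]   # intercept term (α)
betas = lr.coef_[0]        # one coefficient (β) per feature, in column order
print(alpha, betas)
```

The order of `betas` follows the column order of `X_train`, which is why the real code indexes `lr.coef_[0][0]`, `[1]` and `[2]` for age, DebtRatio and MonthlyIncome.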

Finally, I will share the implementation method of Step 3 below.


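from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

#Train a logistic regression model
#(X_train, X_test, y_train, y_test are assumed to be prepared beforehand)
lr = LogisticRegression()
lr.fit(X_train, y_train)
#roc_auc_score takes (true labels, scores), so pass the predicted probability of class 1
print("AUC:{}".format(roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])))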

#Actually calculate the credit score: Score = (β×WoE+ α/n)×Factor + Offset/n
df_woe_with_score = df_woe.reset_index().copy()
df_woe_with_score.iloc[:, 3:-1] = df_woe_with_score.iloc[:, 3:-1].astype(float)

#Number of features used in the model (default_features_list is assumed to be defined beforehand)
n = len(default_features_list)

#Intercept and coefficients of the fitted model
alpha = lr.intercept_[0]
beta_age    = lr.coef_[0][0]    #age coefficient
beta_dept   = lr.coef_[0][1]    #DebtRatio coefficient
beta_income = lr.coef_[0][2]    #MonthlyIncome coefficient

#Scaling constants: 20 points double the odds, and log-odds of ln(20) map to 600 points
factor      = 20/np.log(2)
offset      = 600-factor*np.log(20)

print("factor:{0}, offset:{1}".format(factor, offset))

#Scorecard Point calculation
#Map each column name to its coefficient so that one formula covers all columns
beta_by_col = {"age": beta_age, "DebtRatio": beta_dept, "MonthlyIncome": beta_income}
df_woe_with_score["score"] = None
for i in range(len(df_woe_with_score)):
    col_name = df_woe_with_score.loc[i, "col"]
    woe = df_woe_with_score.loc[i, "WoE"]
    if col_name in beta_by_col:
        score = (beta_by_col[col_name]*woe+(alpha/n))*factor + (offset/n)
        #Assign with .loc to avoid chained assignment, which may silently write to a copy
        df_woe_with_score.loc[i, "score"] = round(score, 1)

And with that, the credit score was created successfully!

[Figure: Completed credit score]
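
As a rough sketch of how such a table could be used afterwards: an applicant's total score is the sum of the Scorecard Points of the bins their feature values fall into. The scorecard below is a hypothetical miniature with made-up bins and points, and `total_score` is a helper name introduced here for illustration.

```python
import pandas as pd

# Hypothetical scorecard: points per (feature, bin); all values are made up
scorecard = pd.DataFrame({
    "col":   ["age", "age", "MonthlyIncome", "MonthlyIncome"],
    "lower": [0, 40, 0, 4000],
    "upper": [40, 130, 4000, 160000],
    "score": [180.0, 210.0, 190.0, 205.0],
})

def total_score(applicant, scorecard):
    """Sum the points of the bin [lower, upper) that each feature value falls into."""
    total = 0.0
    for col, value in applicant.items():
        rows = scorecard[(scorecard["col"] == col)
                         & (scorecard["lower"] <= value)
                         & (value < scorecard["upper"])]
        # Raises IndexError if the value falls outside every bin
        total += rows["score"].iloc[0]
    return total

applicant = {"age": 45, "MonthlyIncome": 3500}
print(total_score(applicant, scorecard))  # 210.0 + 190.0 = 400.0
```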

Thank you for reading this far.
