[PYTHON] Predicting Credit Card Defaults: Feature Engineering

Previous article

Please read the previous article first! https://qiita.com/lindq_yu/items/4f8e3e1d28df0c693d4f

Feature engineering

Before engineering features, confirm the properties of the data. In this dataset, PAY_AMT, BILL_AMT, AGE, and LIMIT_BAL are numerical data. SEX, MARRIAGE, and EDUCATION are categorical data. PAY is mixed: it has categorical values (revolving payment, paid successfully, no payment) as well as numerical values that give the delay in months (1 month, 2 months, 3 months, and so on).

I want to process each of these types carefully.

Category data

There are several ways to engineer features from categorical variables. Here I use the most typical one, one-hot encoding.

The dataset description lists no "0" value for EDUCATION or MARRIAGE, but "0" does appear in this dataset.

EDUCATION

Codes 5 and 6 are both labeled unknown. Originally, these two probably had distinct meanings. In a business setting (I am a student, so I do not really know...), I could ask the person who entered the data, the questionnaire author, or the dataset creator, but that is not possible this time. So I merge the unknown codes and "0" into others (there are only 14 rows with "0").

MARRIAGE

This item also has 3 as others, a category that may carry some meaning of its own. Normally this too should be checked with the person in charge, but unfortunately that is not possible, so "0" is merged into others.

Process the dataset based on the above.

↓ Data set description image.png

python


#Data extraction (copy so that later assignments do not write to a view of dataset)
category=dataset.loc[:,["SEX","MARRIAGE","EDUCATION"]].copy()

#Counting the number of SEX appearances
#print("SEX value count")
#print(category["SEX"].value_counts())
#print("")

#Counting the number of appearances of MARRIAGE
#print("MARRIAGE value count")
#print(category["MARRIAGE"].value_counts())
#print("")

#Counting the number of occurrences of EDUCATION
#print("EDUCATION")
#print(category["EDUCATION"].value_counts())

#Convert MARRIAGE "0" (unknown) to "3" (others)
category["MARRIAGE"] = category["MARRIAGE"].replace(0,3)

#Convert EDUCATION "0", "5", "6" (all unknown) to "4" (others)
category["EDUCATION"] = category["EDUCATION"].replace([0,5,6],4)

#Confirmation of category
category

The converted result ↓ image.png
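The same recoding can also be written as one mapping per column, which keeps the rule and its comment in one place. This is a minimal sketch on a small made-up frame (`toy` and `recode` are hypothetical names, not part of the original code), assuming the code meanings described above.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "MARRIAGE": [1, 2, 3, 0, 2],
    "EDUCATION": [1, 2, 0, 5, 6],
})

# Codes folded into "others": MARRIAGE 0 -> 3, EDUCATION 0/5/6 -> 4
recode = {
    "MARRIAGE": {0: 3},
    "EDUCATION": {0: 4, 5: 4, 6: 4},
}

# DataFrame.replace accepts a nested dict of the form {column: {old: new}}
cleaned = toy.replace(recode)

print(cleaned["MARRIAGE"].tolist())   # [1, 2, 3, 3, 2]
print(cleaned["EDUCATION"].tolist())  # [1, 2, 4, 4, 4]
```

Running value_counts() on each column afterwards is a quick way to confirm that no "0", "5", or "6" remains.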

Convert these categorical variables with one-hot encoding and store the result in onehot_category.

python


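import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

#toarray() gives a dense array regardless of the scikit-learn version
enc = OneHotEncoder(categories="auto", dtype=np.float32)
onehot_X = enc.fit_transform(category).toarray()
#Names follow the column order of category: SEX, MARRIAGE, EDUCATION
onehot_category= pd.DataFrame(data = onehot_X, index = category.index, columns = ["male","female","married","single","MARR-others","graduate school","university","high school","EDU-others"])
onehot_category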

Conversion completed! image.png
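OneHotEncoder emits its output columns in the order of the input columns, so hand-written column names are easy to get out of step. One way to guard against that is to build the names from the fitted encoder's `categories_` attribute. The sketch below uses made-up data; `toy`, `labels`, and `names` are hypothetical, and the label text assumes the code meanings from the dataset description.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; column order matches the article's `category`
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "MARRIAGE": [1, 2, 3, 2],
    "EDUCATION": [1, 2, 3, 4],
})

# Readable label for each code (assumed from the dataset description)
labels = {
    "SEX": {1: "male", 2: "female"},
    "MARRIAGE": {1: "married", 2: "single", 3: "MARR-others"},
    "EDUCATION": {1: "graduate school", 2: "university",
                  3: "high school", 4: "EDU-others"},
}

enc = OneHotEncoder(categories="auto", dtype=np.float32)
# fit_transform returns a sparse matrix by default; toarray() makes it dense
onehot = enc.fit_transform(toy).toarray()

# categories_ holds the codes found in each input column, in column order
names = [labels[col][code]
         for col, codes in zip(toy.columns, enc.categories_)
         for code in codes]

onehot_toy = pd.DataFrame(onehot, columns=names, index=toy.index)
print(onehot_toy.columns.tolist())
```

Because the names are derived from the encoder itself, reordering the input columns reorders the names with them.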

Numerical data

I leave the numerical data untouched. Numerical data can also be transformed, but some models are more accurate without any transformation, so I leave it as it is for now.
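For reference, one common conversion for skewed, non-negative amounts is a log transform. The sketch below only illustrates what such a conversion could look like; it is not applied in this article, and the values in `toy` are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the real columns come from the dataset
toy = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 500000],
    "PAY_AMT1": [0, 1500, 36000],
})

# log1p(x) = log(1 + x) is defined at 0 and compresses large amounts
logged = np.log1p(toy)

# expm1 is the inverse of log1p, so the original scale can be recovered
restored = np.expm1(logged)

print(logged.round(2))
```

log1p is only defined for values greater than -1, so any column that can hold negative amounts would need different handling.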

PAY

Category

--Number of revolving payments (Revo)
--Number of successful payments (Could)
--Number of times payment could not be made (Could not)
--Number of times no payment was made (Not)

Continuous value

--Number of months the payment was delayed

Create count variables for the number of revolving payments (Revo), the number of successful payments (Could), the number of times payment could not be made (Could not), and the number of times no payment was made (Not). For the continuous value, months that were paid successfully, paid by revolving credit, or had no payment are assumed to have no delay and are set to 0, so that the values that remain are the number of months of delay.

python


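#Column names PAY_1 ... PAY_6
l = ["PAY_" + str(i) for i in range(1,7)]

#Copy so that new columns are not written to a view of dataset
PAY=dataset.loc[:,l].copy()

#Count each status over the six months, looking only at the original PAY columns
PAY["Revo"] = (PAY[l] == 0).sum(axis=1)
PAY["Could"] = (PAY[l] == -1).sum(axis=1)
PAY["Not"] = (PAY[l] == -2).sum(axis=1)
#The remaining months are the ones with a delay
PAY["Could not"] =6-PAY["Not"]-PAY["Could"]-PAY["Revo"]

#Treat -1 and -2 as no delay so that PAY_1 ... PAY_6 hold delay months only
PAY[l] = PAY[l].replace([-1,-2],0)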

image.png

Complete!
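To check the design on a single case, here is a worked example for one made-up customer. The row values and the name `cols` are hypothetical; the status codes follow the meaning used above (0 revolving, -1 paid, -2 no payment, positive values are delay months).

```python
import pandas as pd

cols = ["PAY_" + str(i) for i in range(1, 7)]

# Hypothetical customer: two revolving months, one paid month,
# one month with no payment, and delays of 2 and 3 months
PAY = pd.DataFrame([[0, 0, -1, -2, 2, 3]], columns=cols)

# Count each status across the six original PAY columns
PAY["Revo"] = (PAY[cols] == 0).sum(axis=1)
PAY["Could"] = (PAY[cols] == -1).sum(axis=1)
PAY["Not"] = (PAY[cols] == -2).sum(axis=1)
PAY["Could not"] = 6 - PAY["Not"] - PAY["Could"] - PAY["Revo"]

# After counting, -1 and -2 become 0 so only delay months remain
PAY[cols] = PAY[cols].replace([-1, -2], 0)

print(PAY.iloc[0].to_dict())
```

For this row the counts come out as Revo 2, Could 1, Not 1, and Could not 2, and the six PAY columns become 0, 0, 0, 0, 2, 3.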

Dataset merge

Merge the newly created variables and the adjusted variables into one dataset.

python


#Numerical data
l = ["AGE", "LIMIT_BAL"]
l += ["PAY_AMT" + str(i) for i in range(1,7)]
l += ["BILL_AMT" + str(i) for i in range(1,7)]

merge_data = dataset.loc[:,l]

#Category data
merge_data = merge_data.join(onehot_category)
#PAY
merge_data = merge_data.join(PAY)

merge_data
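One thing worth checking after the merge: `join` aligns rows by index, so a frame that was rebuilt with a fresh default index can end up misaligned if the dataset's index is not 0, 1, 2, and so on. The sketch below shows the effect on made-up frames; `left`, `right`, `bad`, and `good` are hypothetical names, and whether this applies depends on how dataset was loaded.

```python
import pandas as pd

# Hypothetical frames: `left` keeps a non-default index,
# `right` was rebuilt from an array and has the default index 0, 1, 2
left = pd.DataFrame({"AGE": [24, 35, 41]}, index=[10, 11, 12])
right = pd.DataFrame({"male": [1.0, 0.0, 1.0]})

# join aligns on the index, so mismatched indexes produce NaN
bad = left.join(right)

# Reusing the left index restores row-by-row alignment
good = left.join(right.set_index(left.index))

print(bad["male"].isna().sum(), good["male"].isna().sum())
```

A quick merge_data.isna().sum().sum() after the joins would reveal this kind of misalignment.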

That completes the feature engineering. Next time, I would like to feed this data into a machine learning model!
