After attending school, I participated in SIGNATE's Beginner-limited competition for the first time.

Introduction

I wrote this article in the hope that posting my unfinished code would reach many people and bring back suggestions for improvement: what I did wrong and what I should have done instead. Honestly, there will be plenty of places where you wonder why I did something a certain way, so I would appreciate it if you could read on with a forgiving eye.

Self-introduction and the competition

I participated in the competition that ran from October 1st: https://signate.jp/competitions/295

To briefly introduce myself: I started attending an AI programming school in April of this year. I am currently in the middle of a career change, had no prior programming experience, and come from a humanities background.

I joined a little late, starting slowly on October 13th. For the first week I just looked at the data and wrote code by referring back to what I had learned in class. Even so, I couldn't manage a single submission because of repeated errors...

Then, one week before the end of the competition, I finally got the Kaggle Start Book and decided to copy its approach and build something similar. For the EDA, SIGNATE had opened up QUEST for free, so I referred to that as well.

Libraries used this time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import optuna
import optuna.integration.lightgbm as lgb  # LightGBM via Optuna's integration module
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

Feature extraction

For the time being, load the data and look at the contents. (I used a Kaggle Notebook as my development environment.)

train = pd.read_csv("../input/signatecomp/train.csv",header=0)
test = pd.read_csv("../input/signatecomp/test.csv",header=0)

print(train.info())
print(train.head())
print(test.info())
print(test.head())

Let's look at the characteristics of the data from here. First look at the numeric variables.

# Histograms of the numeric columns (drawn here on the test set)
test.hist(figsize=(20, 20), color='r')

Next, let's look at categorical variables.

# Frequency of each employment_length value
emplength_var = train['employment_length'].value_counts()

# Bar chart with a graph title
emplength_var.plot.bar(title="Frequency of employment_length")

# x-axis label
plt.xlabel('employment_length')

# y-axis label
plt.ylabel('count')

# Display the created graph
plt.show()

# Bar chart of purpose
purpose_var = train['purpose'].value_counts()
purpose_var.plot.bar()

# Display the purpose bar chart
plt.show()

# Bar chart of application_type
application_var = train['application_type'].value_counts()
application_var.plot.bar()

# Display the application_type bar chart
plt.show()

# Bar chart of grade
grade_var = train['grade'].value_counts()
grade_var.plot.bar()

# Display the grade bar chart
plt.show()

Next, let's look at the relationship between the objective variable and the categorical variable.

# Cross-tabulate the term column (rows) against the loan_status column (columns)
cross_term = pd.crosstab(train['term'], train['loan_status'], margins=True)

# Divide the ChargedOff column by the All column and assign the result to c_rate
c_rate = cross_term['ChargedOff'] / cross_term['All']

# Divide the FullyPaid column by the All column and assign the result to f_rate
f_rate = cross_term['FullyPaid'] / cross_term['All']

# Store c_rate and f_rate as new columns of cross_term
cross_term['c_rate'] = c_rate
cross_term['f_rate'] = f_rate

# Drop the All row from cross_term and reassign
cross_term = cross_term.drop(index=["All"])

# Show the cross tabulation
print(cross_term)
# Create a DataFrame with only the columns needed for the stacked bar chart
df_bar = cross_term[['c_rate', 'f_rate']]

# Draw a stacked bar chart
df_bar.plot.bar(stacked=True)

# Graph title
plt.title('Charge-off rate and repayment rate by repayment term')

# x-axis label
plt.xlabel('term')

# y-axis label
plt.ylabel('percentage')

# Display the graph
plt.show()

I applied the same procedure to all of the categorical variables. (I could paste all of the code, but since it is almost identical each time, I will omit it; a sketch of how it could be looped is shown below.)
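As a rough sketch of how that repetition could be avoided (this is not the code I actually ran, and the column list is my assumption based on the columns appearing in this post), the same cross-tab plot can be wrapped in a loop:

# Sketch: loop the cross-tab rate plot over several categorical columns.
# The column list is an assumption based on the columns used in this post.
cat_cols = ['term', 'grade', 'purpose', 'application_type', 'employment_length']

for col in cat_cols:
    ct = pd.crosstab(train[col], train['loan_status'], margins=True)
    ct['c_rate'] = ct['ChargedOff'] / ct['All']  # charge-off rate per category
    ct['f_rate'] = ct['FullyPaid'] / ct['All']   # fully-paid rate per category
    ct = ct.drop(index=['All'])
    ct[['c_rate', 'f_rate']].plot.bar(stacked=True, title=col)
    plt.xlabel(col)
    plt.ylabel('percentage')
    plt.show()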

Feature addition

For the time being, I added features that take the difference from the median, and log-transformed credit_score because its distribution was skewed.

# Feature addition
# Log-transform credit_score (shifted so the minimum maps to 1) to reduce skew
train["log_cre"] = np.log(train.credit_score - train.credit_score.min() + 1)
test["log_cre"] = np.log(test.credit_score - test.credit_score.min() + 1)
# Differences from the median of loan_amnt and interest_rate
train['loam_median'] = train['loan_amnt'] - train['loan_amnt'].median()
train['inter_median'] = train['interest_rate'] - train['interest_rate'].median()
test['loam_median'] = test['loan_amnt'] - test['loan_amnt'].median()
test['inter_median'] = test['interest_rate'] - test['interest_rate'].median()

Data preprocessing

This time I did label encoding.

# Convert the train data
Label_Enc_list = ['term', 'grade', 'purpose', 'application_type', 'employment_length', 'loan_status']

# Label encoding with category_encoders
import category_encoders as ce

ce_oe = ce.OrdinalEncoder(cols=Label_Enc_list, handle_unknown='impute')
# Convert the strings to ordinal integers
train = ce_oe.fit_transform(train)
# Shift the values so they start at 0 instead of 1
for i in Label_Enc_list:
    train[i] = train[i] - 1

# Convert the test data
from sklearn.preprocessing import LabelEncoder

category = test.select_dtypes(include='object')

for col in list(category):
    le = LabelEncoder()
    le.fit(test[col])
    test[col] = le.transform(test[col])

print(train.head())
print(test.head())
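Looking back, encoding train with category_encoders' OrdinalEncoder but test with a freshly fitted LabelEncoder means the same category string can end up with different integers in the two files. If I were redoing the preprocessing from the raw data, one safer pattern (a sketch, not what I actually submitted) would be to fit a single encoder on train and reuse it on test:

# Sketch: fit one encoder on the raw train data and reuse the identical
# mapping on test. loan_status is excluded because test has no target column.
feature_cols = ['term', 'grade', 'purpose', 'application_type', 'employment_length']

ce_shared = ce.OrdinalEncoder(cols=feature_cols, handle_unknown='impute')
train = ce_shared.fit_transform(train)  # learn the category -> integer mapping
test = ce_shared.transform(test)        # apply the same mapping to test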

Modeling

# Objective variable and explanatory variables of train
target = train['loan_status'].values
features = train.drop(['id', 'loan_status'], axis=1).values

# Test data
test_X = test.drop(['id'], axis=1).values

# Split train into training data and validation data
# (note: without random_state, this split differs on each run)
(features, val_X, target, val_y) = train_test_split(features, target, test_size=0.2)

def objective(trial):
    lgb_params = {'objective': 'binary',
                  'max_bin': trial.suggest_int("max_bin", 255, 500),
                  "learning_rate": 0.05,
                  "num_leaves": trial.suggest_int("num_leaves", 32, 128)
                 }
    lgb_train = lgb.Dataset(features, target)  # for training

    lgb_eval = lgb.Dataset(val_X, val_y, reference=lgb_train)  # for validation

    # Training
    model = lgb.train(lgb_params, lgb_train,
                      valid_sets=[lgb_train, lgb_eval],
                      num_boost_round=1000,
                      early_stopping_rounds=10,
                      verbose_eval=10)

    y_pred = model.predict(val_X,
                           num_iteration=model.best_iteration)
    score = log_loss(val_y, y_pred)
    return score

study = optuna.create_study(sampler=optuna.samplers.RandomSampler(seed=0))
study.optimize(objective, n_trials=20)
# Check the best parameters found
study.best_params
lgb_params = {'boosting_type': 'gbdt',
              'objective': 'binary',
              'max_bin': study.best_params["max_bin"],
              "learning_rate": 0.05,
              "num_leaves": study.best_params["num_leaves"]
             }
lgb_train = lgb.Dataset(features, target)  # for training

lgb_eval = lgb.Dataset(val_X, val_y, reference=lgb_train)  # for validation

# Training with the best parameters
model = lgb.train(lgb_params, lgb_train, valid_sets=[lgb_train, lgb_eval],
                  num_boost_round=1000,
                  early_stopping_rounds=10,
                  verbose_eval=10)

# Predict on the test data
pred = model.predict(test_X, num_iteration=model.best_iteration)

File submission

Following the Start Book, I originally classified the predictions as binary with a threshold of 0.5, but after printing the first 50 or so rows I found I got a better score with a threshold of 0.1, so I changed the condition. I still didn't know how to handle the submission file properly, though. I meant to assign the predictions directly, but I couldn't figure out how to specify the column, so I did the silly thing of adding a row, opening the CSV file, and deleting it by hand (;_;). (Is it okay to just specify the column normally with header=0?)

# Binarize the predictions with a threshold of 0.1
pred1 = (pred > 0.1).astype(int)
submit = pd.read_csv("../input/signatecomp/submit.csv")
# Write the prediction results into the submission file
submit.loc[:, 0] = pred1[1:]
submit.to_csv("submit1.csv", index=False)
print("Your submission was successfully saved!")

Result

Promotion required an F1 score above 0.355, but mine was 0.3275218, so I was not promoted.

Reflections

First, I wasted a lot of time trying to analyze the data from nearly zero knowledge.

Next, although I had picked up a certain amount of knowledge and implementation practice while studying at the programming school for the E certification, I regret that I neglected to actually solve problems and review them; the result reflected that.

Lastly, I was very bad at finding similar competitions on Kaggle and learning from them how to implement the code, do the data engineering, and so on.

What I learned

By entering a competition for the first time, I gained a lot: I learned how weak I still am, what I should do from now on, how to deal with errors, and so on.

Finally

Thank you to everyone who has read this far. I am still inexperienced, so I will keep working hard.
