After attending school, I participated in SIGNATE's Beginner-limited competition for the first time.

Introduction

I wrote this article in the hope that posting my unfinished code would reach many people and bring back suggestions for improvement: what I did wrong and what I should have done instead. Honestly, there will be plenty of places where you wonder why I did something a certain way, so I would appreciate it if you could read on with a forgiving eye.

Self-introduction and the competition

I participated in the competition that ran from October 1st: https://signate.jp/competitions/295

To briefly introduce myself: I started attending an AI programming school in April of this year. I am currently in the middle of a career change, had no prior programming experience, and come from a humanities background.

I joined a little late, starting slowly on October 13th. For the first week I just looked at the data and wrote code by referring back to what I had learned in class. Even so, I couldn't manage a single submission because of repeated errors...

Then, one week before the end of the competition, I finally got the Kaggle Start Book and decided to copy its approach and build something similar. For the EDA, SIGNATE had opened up QUEST for free, so I referred to that as well.

Libraries used this time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import optuna
import optuna.integration.lightgbm as lgb  # LightGBM via Optuna's integration module
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

Feature extraction

For the time being, load the data and look at the contents. (I used a Kaggle Notebook as my development environment.)

train = pd.read_csv("../input/signatecomp/train.csv",header=0)
test = pd.read_csv("../input/signatecomp/test.csv",header=0)

print(train.info())
print(train.head())
print(test.info())
print(test.head())

Let's look at the characteristics of the data from here. First look at the numeric variables.

# Histograms of the numeric columns (drawn here on the test set)
test.hist(figsize=(20, 20), color='r')

Next, let's look at categorical variables.

# Frequency of each employment_length value
emplength_var = train['employment_length'].value_counts()

# Bar chart with a graph title
emplength_var.plot.bar(title="Frequency of employment_length")

# x-axis label
plt.xlabel('employment_length')

# y-axis label
plt.ylabel('count')

# Display the created graph
plt.show()

# Bar chart of purpose
purpose_var = train['purpose'].value_counts()
purpose_var.plot.bar()

# Display the purpose bar chart
plt.show()

# Bar chart of application_type
application_var = train['application_type'].value_counts()
application_var.plot.bar()

# Display the application_type bar chart
plt.show()

# Bar chart of grade
grade_var = train['grade'].value_counts()
grade_var.plot.bar()

# Display the grade bar chart
plt.show()

Next, let's look at the relationship between the objective variable and the categorical variable.

# Cross-tabulate the term column (rows) against the loan_status column (columns)
cross_term = pd.crosstab(train['term'], train['loan_status'], margins=True)

# Divide the ChargedOff column by the All column and assign the result to c_rate
c_rate = cross_term['ChargedOff'] / cross_term['All']

# Divide the FullyPaid column by the All column and assign the result to f_rate
f_rate = cross_term['FullyPaid'] / cross_term['All']

# Store c_rate and f_rate as new columns of cross_term
cross_term['c_rate'] = c_rate
cross_term['f_rate'] = f_rate

# Drop the All row from cross_term and reassign
cross_term = cross_term.drop(index=["All"])

# Show the cross tabulation
print(cross_term)
# Create a DataFrame with only the columns needed for the stacked bar chart
df_bar = cross_term[['c_rate', 'f_rate']]

# Draw a stacked bar chart
df_bar.plot.bar(stacked=True)

# Graph title
plt.title('Charge-off rate and repayment rate by repayment term')

# x-axis label
plt.xlabel('term')

# y-axis label
plt.ylabel('percentage')

# Display the graph
plt.show()

I applied the same procedure to all of the categorical variables. (I could paste all of the code, but since it is almost identical each time, I will omit it; a sketch of how it could be looped is shown below.)
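As a rough sketch of how that repetition could be avoided (this is not the code I actually ran, and the column list is my assumption based on the columns appearing in this post), the same cross-tab plot can be wrapped in a loop:

# Sketch: loop the cross-tab rate plot over several categorical columns.
# The column list is an assumption based on the columns used in this post.
cat_cols = ['term', 'grade', 'purpose', 'application_type', 'employment_length']

for col in cat_cols:
    ct = pd.crosstab(train[col], train['loan_status'], margins=True)
    ct['c_rate'] = ct['ChargedOff'] / ct['All']  # charge-off rate per category
    ct['f_rate'] = ct['FullyPaid'] / ct['All']   # fully-paid rate per category
    ct = ct.drop(index=['All'])
    ct[['c_rate', 'f_rate']].plot.bar(stacked=True, title=col)
    plt.xlabel(col)
    plt.ylabel('percentage')
    plt.show()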

Feature addition

For the time being, I added features that take the difference from the median, and log-transformed credit_score because its distribution was skewed.

# Feature addition
# Log-transform credit_score (shifted so the minimum maps to 1) to reduce skew
train["log_cre"] = np.log(train.credit_score - train.credit_score.min() + 1)
test["log_cre"] = np.log(test.credit_score - test.credit_score.min() + 1)
# Differences from the median of loan_amnt and interest_rate
train['loam_median'] = train['loan_amnt'] - train['loan_amnt'].median()
train['inter_median'] = train['interest_rate'] - train['interest_rate'].median()
test['loam_median'] = test['loan_amnt'] - test['loan_amnt'].median()
test['inter_median'] = test['interest_rate'] - test['interest_rate'].median()

Data preprocessing

This time I did label encoding.

# Convert the train data
Label_Enc_list = ['term', 'grade', 'purpose', 'application_type', 'employment_length', 'loan_status']

# Label encoding with category_encoders
import category_encoders as ce

ce_oe = ce.OrdinalEncoder(cols=Label_Enc_list, handle_unknown='impute')
# Convert the strings to ordinal integers
train = ce_oe.fit_transform(train)
# Shift the values so they start at 0 instead of 1
for i in Label_Enc_list:
    train[i] = train[i] - 1

# Convert the test data
from sklearn.preprocessing import LabelEncoder

category = test.select_dtypes(include='object')

for col in list(category):
    le = LabelEncoder()
    le.fit(test[col])
    test[col] = le.transform(test[col])

print(train.head())
print(test.head())
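Looking back, encoding train with category_encoders' OrdinalEncoder but test with a freshly fitted LabelEncoder means the same category string can end up with different integers in the two files. If I were redoing the preprocessing from the raw data, one safer pattern (a sketch, not what I actually submitted) would be to fit a single encoder on train and reuse it on test:

# Sketch: fit one encoder on the raw train data and reuse the identical
# mapping on test. loan_status is excluded because test has no target column.
feature_cols = ['term', 'grade', 'purpose', 'application_type', 'employment_length']

ce_shared = ce.OrdinalEncoder(cols=feature_cols, handle_unknown='impute')
train = ce_shared.fit_transform(train)  # learn the category -> integer mapping
test = ce_shared.transform(test)        # apply the same mapping to test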

Modeling

# Objective variable and explanatory variables of train
target = train['loan_status'].values
features = train.drop(['id', 'loan_status'], axis=1).values

# Test data
test_X = test.drop(['id'], axis=1).values

# Split train into training data and validation data
# (note: without random_state, this split differs on each run)
(features, val_X, target, val_y) = train_test_split(features, target, test_size=0.2)

def objective(trial):
    lgb_params = {'objective': 'binary',
                  'max_bin': trial.suggest_int("max_bin", 255, 500),
                  "learning_rate": 0.05,
                  "num_leaves": trial.suggest_int("num_leaves", 32, 128)
                 }
    lgb_train = lgb.Dataset(features, target)  # for training

    lgb_eval = lgb.Dataset(val_X, val_y, reference=lgb_train)  # for validation

    # Training
    model = lgb.train(lgb_params, lgb_train,
                      valid_sets=[lgb_train, lgb_eval],
                      num_boost_round=1000,
                      early_stopping_rounds=10,
                      verbose_eval=10)

    y_pred = model.predict(val_X,
                           num_iteration=model.best_iteration)
    score = log_loss(val_y, y_pred)
    return score

study = optuna.create_study(sampler=optuna.samplers.RandomSampler(seed=0))
study.optimize(objective, n_trials=20)
# Check the best parameters found
study.best_params
lgb_params = {'boosting_type': 'gbdt',
              'objective': 'binary',
              'max_bin': study.best_params["max_bin"],
              "learning_rate": 0.05,
              "num_leaves": study.best_params["num_leaves"]
             }
lgb_train = lgb.Dataset(features, target)  # for training

lgb_eval = lgb.Dataset(val_X, val_y, reference=lgb_train)  # for validation

# Training with the best parameters
model = lgb.train(lgb_params, lgb_train, valid_sets=[lgb_train, lgb_eval],
                  num_boost_round=1000,
                  early_stopping_rounds=10,
                  verbose_eval=10)

# Predict on the test data
pred = model.predict(test_X, num_iteration=model.best_iteration)

File submission

Following the Start Book, I originally classified the predictions as binary with a threshold of 0.5, but after printing the first 50 or so rows I found I got a better score with a threshold of 0.1, so I changed the condition. I still didn't know how to handle the submission file properly, though. I meant to assign the predictions directly, but I couldn't figure out how to specify the column, so I did the silly thing of adding a row, opening the CSV file, and deleting it by hand (;_;). (Is it okay to just specify the column normally with header=0?)

# Binarize the predictions with a threshold of 0.1
pred1 = (pred > 0.1).astype(int)
submit = pd.read_csv("../input/signatecomp/submit.csv")
# Write the prediction results into the submission file
submit.loc[:, 0] = pred1[1:]
submit.to_csv("submit1.csv", index=False)
print("Your submission was successfully saved!")

Result

Promotion required an F1 score above 0.355, but mine was 0.3275218, so I was not promoted.

Reflections

First, I wasted a lot of time trying to analyze the data from nearly zero knowledge.

Next, although I had picked up a certain amount of knowledge and implementation practice while studying at the programming school for the E certification, I regret that I neglected to actually solve problems and review them; the result reflected that.

Lastly, I was very bad at finding similar competitions on Kaggle and learning from them how to implement the code, do the data engineering, and so on.

What I learned

By entering a competition for the first time, I gained a lot: I learned how weak I still am, what I should do from now on, how to deal with errors, and so on.

Finally

Thank you to everyone who has read this far. I am still inexperienced, so I will keep working hard.
