In a previous post, I used Kaggle's Kickstarter Projects dataset to compare the accuracy of several models: [I tried to compare the accuracy of machine learning models using Kaggle as a theme](https://qiita.com/Hawaii/items/4f0dd4d9cfabc4f6bb38)
This time, building on that post, **the purpose is to share the process of improving accuracy using the Kaggle Start Book released in March**. I learned a lot of new things along the way, so I have split this into a first and a second part; today's post is the first part.
I will mainly focus on the things I did not know before. Specifically, there are three: **Pandas-Profiling, LightGBM, and ensemble learning**.
→ This time, we will work on Pandas-Profiling and LightGBM.
I also describe the problems I ran into with each topic, the errors, and how I solved them, so if things did not go well for you either, please read on.
Pandas-Profiling is introduced in the Kaggle Start Book; I didn't know it at all, so I tried it out.
I referred to the following site. https://qiita.com/h_kobayashi1125/items/02039e57a656abe8c48f
Pandas-Profiling needs to be installed with pip or similar, so I installed it first.
pip install pandas-profiling
Basically this should be all you need, but I actually hit an error here, and even after trying fixes from various sites I couldn't get past it; it cost me a whole day. So, for anyone who has followed the exact same path as me, I'll describe how I was able to solve it.
Here is the reference. https://gammasoft.jp/support/pip-install-error/
My case matched cause 2 on that page, so I tried installing with:
pip install pandas-profiling --user
There was no error this time, but a warning appeared in red, and once I closed Jupyter Notebook and started it up again, it would no longer launch...
The error it showed was `AttributeError: module 'attr' has no attribute 's'`, so I looked into that as well; running the following finally let Jupyter launch again!
pip uninstall attr
pip install attrs
pip install pandas-profiling
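For reference, here is how I would check that the fix actually took effect (a minimal sketch; it assumes the reinstall above succeeded and that you restarted the notebook kernel first):

```python
# Quick sanity check after restarting the kernel:
# if both imports succeed, the attr/attrs conflict is resolved
import attr                # this module is provided by the attrs package
import pandas_profiling    # should no longer raise AttributeError

print(pandas_profiling.__version__)
```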
I don't really have the background knowledge to explain why this works, so apologies for the vague write-up... I hope it helps you a little.
Some of these imports are not strictly needed this time, but I import everything at once.
# Import numpy and pandas
import numpy as np
import pandas as pd
# Import for processing the date columns
import datetime
# Import for splitting into training and test data
from sklearn.model_selection import train_test_split
# Import for standardization
from sklearn.preprocessing import StandardScaler
# Import for accuracy verification
from sklearn.model_selection import cross_val_score
# pandas-profiling
import pandas_profiling as pdp
df = pd.read_csv(r"~\ks-projects-201801.csv")
As the Start Book notes, pandas-profiling takes a long time on a huge dataset, so we sample the data first.
# Sample 30% of the whole dataset
df_sample = df.sample(frac=0.3, random_state=1234)
Originally, `df_sample.profile_report()` should be enough on its own, but for some reason I got no error and yet no result was displayed, so instead I had the report written out as an HTML file, as shown below (the file is created in the same directory you are working in).
report = pdp.ProfileReport(df_sample)
report.to_file('profile_report.html')
I was able to get it working! The report certainly seems to give you a rough idea of what the data looks like, and I'd like to keep using it where appropriate.
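Incidentally, if you are working in a Jupyter notebook, recent versions of pandas-profiling can also render the report inline instead of exporting HTML. A minimal sketch, assuming pandas-profiling 2.x (the `minimal=True` option skips the more expensive calculations on large data):

```python
# Render the profiling report directly inside the notebook (pandas-profiling 2.x)
report = pdp.ProfileReport(df_sample, title="Kickstarter Projects", minimal=True)
report.to_notebook_iframe()
```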
Next, let's build a model with LightGBM. I had heard the name, but none of the books I own cover it in much detail, so I had never implemented it; let's give it a try.
◆ Reference site
In addition to the Kaggle Start Book, I also referred to the following site.
https://blog.amedama.jp/entry/2018/05/01/081842#scikit-learn-%E3%82%A4%E3%83%B3%E3%82%BF%E3%83%BC%E3%83%95%E3%82%A7%E3%83%BC%E3%82%B9
# Import numpy and pandas
import numpy as np
import pandas as pd
# Import for processing the date columns
import datetime
# Import for splitting into training and test data
from sklearn.model_selection import train_test_split
# Import for standardization
from sklearn.preprocessing import StandardScaler
# Import for accuracy verification
from sklearn.model_selection import cross_val_score
# pandas-profiling
import pandas_profiling as pdp
# Import LightGBM
import lightgbm as lgb
df = pd.read_csv(r"C:\\ks-projects-201801.csv")
I won't go into the details; the code comments describe what each step is doing.
# Narrow the objective variable (state) down to success or failure only (rows in other categories, such as projects canceled midway, are dropped)
df = df[(df["state"] == "successful") | (df["state"] == "failed")]
# Then encode success as 1 and failure as 0
df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)
# Process the date columns: since we have a start date (launched) and an end date (deadline), take the difference to get the campaign period in days
df['deadline'] = pd.to_datetime(df["deadline"])
df["launched"] = pd.to_datetime(df["launched"])
df["days"] = (df["deadline"] - df["launched"]).dt.days
# Although the analysis itself is omitted here, explanatory variables judged unnecessary from it are dropped
df = df.drop(["ID","name","deadline","launched","backers","pledged","usd pledged","usd_pledged_real","usd_goal_real"], axis=1)
# Encode the categorical variables as dummies
df = pd.get_dummies(df,drop_first = True)
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
lgb_train = lgb.Dataset(X_train,y_train)
lgb_eval = lgb.Dataset(X_test,y_test)
params = {"objective":"binary"}
model = lgb.train(params,lgb_train,valid_sets=[lgb_train,lgb_eval],verbose_eval=10,num_boost_round=1000,early_stopping_rounds=10)
# Predict on the test data and store the results in y_pred
y_pred = model.predict(X_test,num_iteration=model.best_iteration)
# Convert y_pred to integer 1 where it is greater than 0.5, otherwise 0
y_pred = (y_pred > 0.5).astype(int)
# Calculate accuracy by comparing the predictions with the test labels
accuracy = sum(y_test == y_pred) / len(y_test)
print(accuracy)
With this, the accuracy came out to **0.597469**, and LightGBM was successfully implemented!
What I stumbled on this time was that I initially didn't write `y_pred = (y_pred > 0.5).astype(int)`, so the accuracy came out as 0.
→ The book explains this clearly, but I skipped over it because I was writing the code while also referring to other sites.
LightGBM outputs its predictions as continuous values between 0 and 1, whereas y_test is 0 or 1 because I encoded failure as 0 and success as 1. Comparing the two directly, nothing ever matched, so the accuracy was 0 at first; replacing every value larger than 0.5 with 1 gave a proper accuracy.
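Incidentally, the blog linked above also covers LightGBM's scikit-learn interface. With `LGBMClassifier`, `predict()` returns class labels (0 or 1) directly, so the thresholding step that tripped me up is handled for you. A minimal sketch reusing `X_train`, `X_test`, `y_train`, and `y_test` from above (it assumes an older lightgbm 3.x, where `fit()` still accepts `early_stopping_rounds`):

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

# scikit-learn interface: predict() returns class labels, not probabilities
clf = LGBMClassifier(objective="binary", n_estimators=1000)
clf.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    early_stopping_rounds=10,
    verbose=10,
)
y_pred_sk = clf.predict(X_test, num_iteration=clf.best_iteration_)
print(accuracy_score(y_test, y_pred_sk))
```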
So, what did you think? I hadn't known about pandas-profiling at all, and it seems like a handy tool for data analysis. I was also able to implement LightGBM, which I had been curious about, for the first time, so I hope this is helpful for fellow beginners as well.
Next time, I will try ensemble learning.