In a previous post, I used Kaggle's Kickstarter Projects dataset to compare the accuracy of several models: [I tried to compare the accuracy of machine learning models using Kaggle as a theme](https://qiita.com/Hawaii/items/4f0dd4d9cfabc4f6bb38)
This time, building on that post, **the purpose is to share the process of improving accuracy using the Kaggle Start Book released in March**. I learned a lot of new things along the way, so I have split this into a first and a second part; today's post is the first part.
I will mainly focus on the things I did not know before. Specifically, there are three: **Pandas-Profiling, LightGBM, and ensemble learning**.
→ This time, we will work on Pandas-Profiling and LightGBM.
I also describe the problems I ran into with each topic, the errors, and how I solved them, so if things did not go well for you either, please read on.
Pandas-Profiling is introduced in the Kaggle Start Book; I didn't know it at all, so I tried it out.
I referred to the following site. https://qiita.com/h_kobayashi1125/items/02039e57a656abe8c48f
Pandas-Profiling needs to be installed with pip or similar, so I installed it first.
pip install pandas-profiling
Basically this should be all you need, but I actually hit an error here, and even after trying fixes from various sites I couldn't get past it; it cost me a whole day. So, for anyone who has followed the exact same path as me, I'll describe how I was able to solve it.
Here is the reference. https://gammasoft.jp/support/pip-install-error/
My case matched cause 2 on that page, so I tried installing with:
pip install pandas-profiling --user
There was no error this time, but a warning appeared in red, and once I closed Jupyter Notebook and started it up again, it would no longer launch...
The error it showed was `AttributeError: module 'attr' has no attribute 's'`, so I looked into that as well; running the following finally let Jupyter launch again!
pip uninstall attr
pip install attrs
pip install pandas-profiling
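For reference, here is how I would check that the fix actually took effect (a minimal sketch; it assumes the reinstall above succeeded and that you restarted the notebook kernel first):

```python
# Quick sanity check after restarting the kernel:
# if both imports succeed, the attr/attrs conflict is resolved
import attr                # this module is provided by the attrs package
import pandas_profiling    # should no longer raise AttributeError

print(pandas_profiling.__version__)
```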
I don't really have the background knowledge to explain why this works, so apologies for the vague write-up... I hope it helps you a little.
Some of these imports are not strictly needed this time, but I import everything at once.
# Import numpy and pandas
import numpy as np
import pandas as pd
# Import for processing the date columns
import datetime
# Import for splitting into training and test data
from sklearn.model_selection import train_test_split
# Import for standardization
from sklearn.preprocessing import StandardScaler
# Import for accuracy verification
from sklearn.model_selection import cross_val_score
# pandas-profiling
import pandas_profiling as pdp
df = pd.read_csv(r"~\ks-projects-201801.csv")
As the Start Book notes, pandas-profiling takes a long time on a huge dataset, so we sample the data first.
# Sample 30% of the whole dataset
df_sample = df.sample(frac=0.3, random_state=1234)
Originally, `df_sample.profile_report()` should be enough on its own, but for some reason I got no error and yet no result was displayed, so instead I had the report written out as an HTML file, as shown below (the file is created in the same directory you are working in).
report = pdp.ProfileReport(df_sample)
report.to_file('profile_report.html')
I was able to get it working! The report certainly seems to give you a rough idea of what the data looks like, and I'd like to keep using it where appropriate.
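Incidentally, if you are working in a Jupyter notebook, recent versions of pandas-profiling can also render the report inline instead of exporting HTML. A minimal sketch, assuming pandas-profiling 2.x (the `minimal=True` option skips the more expensive calculations on large data):

```python
# Render the profiling report directly inside the notebook (pandas-profiling 2.x)
report = pdp.ProfileReport(df_sample, title="Kickstarter Projects", minimal=True)
report.to_notebook_iframe()
```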
Next, let's build a model with LightGBM. I had heard the name, but none of the books I own cover it in much detail, so I had never implemented it; let's give it a try.
◆ Reference site
In addition to the Kaggle Start Book, I also referred to the following site.
https://blog.amedama.jp/entry/2018/05/01/081842#scikit-learn-%E3%82%A4%E3%83%B3%E3%82%BF%E3%83%BC%E3%83%95%E3%82%A7%E3%83%BC%E3%82%B9
# Import numpy and pandas
import numpy as np
import pandas as pd
# Import for processing the date columns
import datetime
# Import for splitting into training and test data
from sklearn.model_selection import train_test_split
# Import for standardization
from sklearn.preprocessing import StandardScaler
# Import for accuracy verification
from sklearn.model_selection import cross_val_score
# pandas-profiling
import pandas_profiling as pdp
# Import LightGBM
import lightgbm as lgb
df = pd.read_csv(r"C:\\ks-projects-201801.csv")
I won't go into the details; the code comments describe what each step is doing.
# Narrow the objective variable (state) down to success or failure only (rows in other categories, such as projects canceled midway, are dropped)
df = df[(df["state"] == "successful") | (df["state"] == "failed")]
# Then encode success as 1 and failure as 0
df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)
# Process the date columns: since we have a start date (launched) and an end date (deadline), take the difference to get the campaign period in days
df['deadline'] = pd.to_datetime(df["deadline"])
df["launched"] = pd.to_datetime(df["launched"])
df["days"] = (df["deadline"] - df["launched"]).dt.days
# Although the analysis itself is omitted here, explanatory variables judged unnecessary from it are dropped
df = df.drop(["ID","name","deadline","launched","backers","pledged","usd pledged","usd_pledged_real","usd_goal_real"], axis=1)
# Encode the categorical variables as dummies
df = pd.get_dummies(df,drop_first = True)
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
lgb_train = lgb.Dataset(X_train,y_train)
lgb_eval = lgb.Dataset(X_test,y_test)
params = {"objective":"binary"}
model = lgb.train(params,lgb_train,valid_sets=[lgb_train,lgb_eval],verbose_eval=10,num_boost_round=1000,early_stopping_rounds=10)
# Predict on the test data and store the results in y_pred
y_pred = model.predict(X_test,num_iteration=model.best_iteration)
# Convert y_pred to integer 1 where it is greater than 0.5, otherwise 0
y_pred = (y_pred > 0.5).astype(int)
# Calculate accuracy by comparing the predictions with the test labels
accuracy = sum(y_test == y_pred) / len(y_test)
print(accuracy)
With this, the accuracy came out to **0.597469**, and LightGBM was successfully implemented!
What I stumbled on this time was that I initially didn't write `y_pred = (y_pred > 0.5).astype(int)`, so the accuracy came out as 0.
→ The book explains this clearly, but I skipped over it because I was writing the code while also referring to other sites.
LightGBM outputs its predictions as continuous values between 0 and 1, whereas y_test is 0 or 1 because I encoded failure as 0 and success as 1. Comparing the two directly, nothing ever matched, so the accuracy was 0 at first; replacing every value larger than 0.5 with 1 gave a proper accuracy.
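Incidentally, the blog linked above also covers LightGBM's scikit-learn interface. With `LGBMClassifier`, `predict()` returns class labels (0 or 1) directly, so the thresholding step that tripped me up is handled for you. A minimal sketch reusing `X_train`, `X_test`, `y_train`, and `y_test` from above (it assumes an older lightgbm 3.x, where `fit()` still accepts `early_stopping_rounds`):

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

# scikit-learn interface: predict() returns class labels, not probabilities
clf = LGBMClassifier(objective="binary", n_estimators=1000)
clf.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    early_stopping_rounds=10,
    verbose=10,
)
y_pred_sk = clf.predict(X_test, num_iteration=clf.best_iteration_)
print(accuracy_score(y_test, y_pred_sk))
```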
So, what did you think? I hadn't known about pandas-profiling at all, and it seems like a handy tool for data analysis. I was also able to implement LightGBM, which I had been curious about, for the first time, so I hope this is helpful for fellow beginners as well.
Next time, I will try ensemble learning.