[Python] I tried comparing the accuracy of machine learning models, using Kaggle as the theme.

1. Purpose

Once you've learned basic Python programming, the standard advice is: work on Kaggle, and refer to the public kernels while you do! Many articles and books say so, and I do think it is an extremely effective way to get stronger.

From a true beginner's point of view, however, I often felt **"the kernels are too difficult to read and I can't follow what they mean"** and **"I don't need such advanced techniques yet; I just want to build a basic machine learning model"**, and that made things hard for me.

So in this article I verify how much the accuracy of various machine learning models changes as the level is raised step by step, from a **"super-basic approach"** to a **"slightly more elaborate approach"**. Along the way I share what I learned, so that you can see **"I see, this is how you build a super-basic machine learning model"** and **"this is how you take it one level higher"**. That is the purpose of this article.

2. Introduction

(1) Kaggle to use (classification problem)

As in my other Qiita articles, I use Kaggle's Kickstarter Projects dataset, which you see used quite often. https://www.kaggle.com/kemical/kickstarter-projects

(2) Machine learning model to be compared this time

I picked standard, orthodox models:

・Logistic regression
・SVM
・Decision tree
・Random forest
・AdaBoost

(3) Stage of verification

A: No adjustment (default)
B: Regularization (only for models that need it)
C: Standardization (only for models that need it)
D: Hyperparameter tuning
E: Feature selection

Here is a summary of (2) and (3), together with the accuracy results of all the patterns verified below.

◆ How to read the table

Reading from the top: pattern 1 is logistic regression in its default form, with an accuracy of 0.52958; pattern 2 is logistic regression with only regularization added, at 0.59815; pattern 3 is logistic regression with regularization and standardization, at 0.66181; and so on.

◆ Notes

Basically you would expect the accuracy to improve as the patterns progress within each model, but this time the feature selection in E uses the embedded method. Note that because the embedded method here is based on a linear model, the accuracy does not necessarily improve (in fact, for several models the result was better without E).

◆ For super beginners

I think it's a good idea to look at the A version of each model first, and then gradually work through the patterns for each model.

| | Model | Pattern | Accuracy |
|:---|:---|:---|:---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | 0.63727 |
| Pattern 11 | Decision tree | D | 0.66376 |
| Pattern 12 | Decision tree | D+E | 0.65732 |
| Pattern 13 | Random forest | A | 0.64522 |
| Pattern 14 | Random forest | D | 0.67762 |
| Pattern 15 | Random forest | D+E | 0.66308 |
| Pattern 16 | AdaBoost | A | 0.63947 |
| Pattern 17 | AdaBoost | D | 0.67426 |
| Pattern 18 | AdaBoost | D+E | 0.659367 |

(4) Reference

For each machine learning model, this article only covers the implementation, but I have written a series of articles that work through the background mathematics as well, and I hope you will refer to those too.

・[[Machine learning] Understanding logistic regression from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/ee2a0687ca451fe213be)
・[[Machine learning] Understanding SVM from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/4688a50cffb2140f297d)
・[Machine learning] Understanding decision trees from both scikit-learn and mathematics
・[[Machine learning] Understanding Random Forest](https://qiita.com/Hawaii/items/5831e667723b66b46fba)

3. Finally, build a machine learning model

(1) Before that

Here we do the preprocessing that is common to all models.

(i) Imports

Let's import everything at once. Think of this common preprocessing as being run at the beginning of each pattern, with the pattern-specific code then following it.

#Import numpy and pandas
import numpy as np
import pandas as pd

#Import to perform some processing on date data
import datetime

#Import for training and test data split
from sklearn.model_selection import train_test_split

#Import for standardization
from sklearn.preprocessing import StandardScaler

#Import for accuracy verification
from sklearn.model_selection import cross_val_score

#Import for hyperparameter tuning
from sklearn.model_selection import train_test_split, GridSearchCV

#Import for feature selection
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

#Import for logistic regression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix

#Import for SVM
from sklearn.svm import SVC

#Import for decision tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz

#Import for Random Forest
from sklearn.ensemble import RandomForestClassifier

#Import for AdaBoost
from sklearn.ensemble import AdaBoostClassifier

(ii) Reading the data


df = pd.read_csv(r"C:~~\ks-projects-201801.csv")

(iii) A look at the data

From the following you can see that the dataset has shape (378661, 15). Since the amount of data is quite large, the models that take a long time to train are trained on only part of it.

df.shape

Let's also take a quick look at the data with .head().

df.head()

(iv) Data shaping

◆ Number of recruitment days

I will omit the details, but since the data contains the start time and end time of each crowdfunding campaign, I convert these into the "number of campaign days".

df['deadline'] = pd.to_datetime(df["deadline"])
df["launched"] = pd.to_datetime(df["launched"])
df["days"] = (df["deadline"] - df["launched"]).dt.days

◆ The objective variable

I'll omit the details here as well, but the objective variable "state" has categories other than success ("successful") and failure ("failed"); this time I only use the success and failure rows.

df = df[(df["state"] == "successful") | (df["state"] == "failed")]

Then replace success with 1 and failure with 0.

df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)

◆ Drop unnecessary columns

Before building the models, drop ID and name, which we don't need here (name might actually be worth keeping, but this time it goes), as well as the variables that are only known after the crowdfunding has finished.

df = df.drop(["ID","name","deadline","launched","backers","pledged","usd pledged","usd_pledged_real","usd_goal_real"], axis=1)

◆ Category variable processing

Perform categorical variable processing with pd.get_dummies.

df = pd.get_dummies(df,drop_first = True)

(2) Patterns 1-5 [Logistic regression]

(i) Pattern 1 ~ Default ~

Let's implement logistic regression without any adjustments.

First, split into training data and test data.

y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

Next, build the logistic regression model. The reason arguments are passed to SGDClassifier even in this "no adjustments" pattern is that it does not become logistic regression unless loss is "log", and since we verify the effect of regularization right after this, penalty is explicitly set to "none" here. (Note: in newer scikit-learn versions, loss="log" has been renamed loss="log_loss"; this article uses the older name.)

clf = SGDClassifier(loss = "log", penalty = "none",random_state=1234)
clf.fit(X_train,y_train)

Finally, let's verify the accuracy with test data.

clf.score(X_test, y_test)

Then the accuracy is **0.52958**.
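As a side note, the imports above already include accuracy_score and confusion_matrix, which this article never uses; here is a minimal sketch of taking a closer look at the same predictions (an optional extra, not part of the original comparison):

#Optional sketch: inspect the predictions of the clf trained above in more detail
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))    #the same number that clf.score returns
print(confusion_matrix(y_test, y_pred))  #rows = true 0/1, columns = predicted 0/1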

(ii) Pattern 2 ~ Regularization only ~

I will omit what regularization is, but let's verify whether accuracy improves with regularization alone, trying both L1 and L2 regularization.

First, divide it into training data and test data.

y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

◆ L1 regularization: set penalty to "l1" and check the accuracy.

clf_L1 = SGDClassifier(loss = "log", penalty = "l1", random_state=1234)
clf_L1.fit(X_train,y_train)

clf_L1.score(X_test, y_test)

The accuracy was **0.52958**, the same as before.

◆ L2 regularization: similarly, set penalty to "l2" and check the accuracy.

clf_L2 = SGDClassifier(loss = "log", penalty = "l2", random_state=1234)
clf_L2.fit(X_train,y_train)
clf_L2.score(X_test, y_test)

The accuracy was **0.59815**, higher than pattern 1. L2 regularization may suit this data better.

(iii) Pattern 3 ~ Regularization + Standardization ~

Let's verify what accuracy we get by adding standardization on top of regularization. We standardize first, then apply L1 and L2 regularization.

First, standardization. Since there are only a few numeric features this time, only goal and days, which actually need it, are standardized; the dummy columns produced by get_dummies are left alone. That said, standardizing the entire dataset apparently causes no problem either (I asked the teacher of the machine learning course I am taking).

stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)

Next, split into training and test data. (The standardization has to happen before this split; otherwise X_train and X_test would not reflect it.)

y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

Now apply L1 and L2 regularization and verify the accuracy.

#L1 regularization and accuracy
clf_L1 = SGDClassifier(loss = "log", penalty = "l1", random_state=1234)
clf_L1.fit(X_train,y_train)
clf_L1.score(X_test, y_test)

#L2 regularization and accuracy

clf_L2 = SGDClassifier(loss = "log", penalty = "l2", random_state=1234)
clf_L2.fit(X_train,y_train)
clf_L2.score(X_test, y_test)

The accuracy with L1 regularization is **0.66181** and with L2 regularization **0.65750**, a big improvement in one step. After standardization, L1 regularization seems to be the better match. I will record L1 regularization's **0.66181** as this pattern's result.

(iv) Pattern 4 ~ Regularization + Standardization + Hyperparameter tuning ~

Let's add hyperparameter tuning on top of pattern 3. Hyperparameter tuning means searching for and settling on the values that we have to set ourselves when building a machine learning model.

Here we use grid search (GridSearchCV).

Ideally every parameter would be searched over every range, but that would take far too long, so this time we tune penalty and alpha, which seem to matter most for SGDClassifier.

First is standardization processing + division of training data and test data.

#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)

#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
parameters = {'penalty':['l1', 'l2'], 'alpha':[0.0001,0.001, 0.01, 0.1, 1, 10, 100],'loss':['log']} #Edit here
model = SGDClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train, y_train)
print(clf.best_params_)

This time, since I wanted to compare accuracy across the patterns in the table at the beginning, I took care to **include the default values in the grid search**. If the defaults were not included, the accuracy comparison would break down.

penalty covers l1 and l2, and alpha spans a fairly wide range that includes the default 0.0001. As before, unless loss is set to "log" it is not logistic regression in the first place, so it is fixed to "log".

Running this printed {'alpha': 0.0001, 'loss': 'log', 'penalty': 'l1'}: the best parameters have been found.

Note that this best alpha is **exactly the same value as the default** listed on scikit-learn's site. Moreover, pattern 3's L1 regularization already used loss="log" and penalty="l1", so in theory this search has landed on **the same parameters as pattern 3's L1 regularization**.

In other words, the accuracy should come out to 0.66181, the same as pattern 3's L1 regularization. Let's continue with that expectation.
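As an aside, you don't have to look up the defaults on the documentation site; they can be read straight off an unconfigured estimator:

#Check scikit-learn's default values directly
print(SGDClassifier().get_params()["alpha"])    #0.0001 (the default)
print(SGDClassifier().get_params()["penalty"])  #'l2' (the default)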

Now let's rebuild the model with these best parameters. "**clf.best_params_" below means training SGDClassifier with the best parameters we just found.

clf_2 = SGDClassifier(**clf.best_params_,random_state=1234)
clf_2.fit(X_train,y_train)

Finally, check the accuracy with the test data.

clf_2.score(X_test, y_test)

It came out to **0.66181**, the same value as pattern 3, just as hypothesized.

(v) Pattern 5 ~ Regularization + Standardization + Hyperparameter tuning + Feature selection ~

Up to pattern 4, I built models with the features I had chosen myself; here, we instead select features using the so-called embedded method.

First, standardize and divide the data.

#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)

#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

Next, select the features with the embedded method.

estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)  #note: normalize was removed in newer scikit-learn; scale the data beforehand there
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)

The training and test data are overwritten so that they contain only the selected features.

X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
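If you are curious which columns survived, SelectFromModel can tell you (a quick sketch; it assumes the dummied DataFrame df from the preprocessing step is still available for the column names):

#Sketch: inspect which features the embedded method kept
feature_names = df.drop("state", axis=1).columns
mask = sfm.get_support()           #boolean mask over the input features
print(feature_names[mask])         #names of the selected features
print(X_train_selected.shape[1], "features kept")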

Check the accuracy of SGDClassifier on the overwritten data (at this point the hyperparameters are not yet tuned).

classifier = SGDClassifier(random_state=1234)
classifier.fit(X_train_selected, y_train)
classifier.score(X_test_selected, y_test)

From here, perform hyperparameter tuning. The procedure is basically the same as before, except that .fit now takes the overwritten training data (X_train_selected).

parameters = {'penalty':['l1', 'l2'], 'alpha':[0.0001,0.001, 0.01, 0.1, 1, 10, 100],'loss':['log']} 
model = SGDClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train_selected, y_train)
print(clf.best_params_)

Train SGDClassifier again with the best parameters found.

clf_2 = SGDClassifier(**clf.best_params_,random_state=1234)
clf_2.fit(X_train_selected,y_train)

Finally, check the accuracy.

clf_2.score(X_test_selected, y_test)

At **0.66185**, this is the best accuracy so far.

This is the end of logistic regression. Let's summarize the accuracy once.

| | Model | Pattern | Accuracy |
|:---|:---|:---|:---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | |
| Pattern 7 | SVM | C | |
| Pattern 8 | SVM | C+D | |
| Pattern 9 | SVM | C+D+E | |
| Pattern 10 | Decision tree | A | |
| Pattern 11 | Decision tree | D | |
| Pattern 12 | Decision tree | D+E | |
| Pattern 13 | Random forest | A | |
| Pattern 14 | Random forest | D | |
| Pattern 15 | Random forest | D+E | |
| Pattern 16 | AdaBoost | A | |
| Pattern 17 | AdaBoost | D | |
| Pattern 18 | AdaBoost | D+E | |

Now let's move on to SVM.

(3) Patterns 6-9 [SVM]

I tried SVM as well, and one thing became painfully clear: it takes a very long time to train. Building the model and tuning the hyperparameters on all the training data, as with logistic regression, would not have finished in time, so I had to be careful to shrink the data.

*Supplement: estimating how long processing will take*

At first I trained on all the data right away, and many times the job did not finish even after hours. So now I first run with a very small amount of data (and a small parameter grid) and note how long it takes. Since the real run is some known multiple of that trial size, I recommend forming a rough estimate of the total time before starting the real job.
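A minimal sketch of that habit (the trial size here is just an illustrative assumption):

import time

#Time a fit on a small slice first, to extrapolate the full run time
start = time.perf_counter()
trial_clf = SVC(random_state=1234)
trial_clf.fit(X_train[:2000], y_train[:2000])  #tiny subset, only for timing
print(f"trial fit took {time.perf_counter() - start:.1f} s")
#SVC training scales worse than linearly in the number of rows, so treat
#any extrapolation from this as a lower bound rather than an exact estimate.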

(i) Pattern 6 ~ Default ~

Let's implement SVM without any adjustments. First, split into training data and test data.

y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

The big change compared with how the other models are handled is the following: training an SVM on all the training data, as with logistic regression, takes far too long.

So, from the training split (70% of the whole), we sample 15% (= 10.5% of all data) and treat that as the training data.

Even at 15%, the training took more than three hours to finish.

#Sample 15% from the training data
X_train_sample = pd.DataFrame(X_train).sample(frac = 0.15, random_state=1234)
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.15, random_state=1234)

#Model building
clf = SVC(random_state=1234)
clf.fit(X_train_sample, y_train_sample) 

Now that we have a model, let's check the accuracy on the test data. One point to note: the test data is not reduced from its 30% share; accuracy is checked on all of it. If the test set were shrunk only for SVM, its accuracy could not be compared with the other models.

clf.score(X_test,y_test)

The accuracy was **0.61935**.

(ii) Pattern 7 ~ Standardization only ~

Since SVM has no separate regularization step here, we start from standardization.

First standardize and split the data. After that, it is the same as pattern 6.

#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)

#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

#Sample 15% from the training data
X_train_sample = pd.DataFrame(X_train).sample(frac = 0.15, random_state=1234)
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.15, random_state=1234)

#Model building
clf = SVC(random_state=1234)
clf.fit(X_train_sample, y_train_sample)

#Accuracy evaluation
clf.score(X_test, y_test)

The accuracy is **0.64871**, which is better than pattern 6.

(iii) Pattern 8 ~ Standardization + Hyperparameter tuning ~

Next, let's implement standardization + hyperparameter tuning.

First is standardization and data partitioning.

stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)

y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

Next, hyperparameter tuning. Here too, searching with all the training data takes too long, so 3% of the 70% training split (= 2.1% of all data) is used for the tuning.

I would have liked to use more data for this, but the processing simply would not finish, so I settled on this value.

#Sample 3% of the training data
X_train_grid = pd.DataFrame(X_train).sample(frac = 0.03,random_state=1234)
y_train_grid = pd.DataFrame(y_train).sample(frac = 0.03,random_state=1234)

Then, perform hyperparameter tuning.

parameters = {'kernel':['linear', 'rbf'], 'C':[0.001, 0.01,0.1,1,10]} #Edit here
model = SVC(random_state=1234)
clf = GridSearchCV(model, parameters, cv=2,return_train_score=False)
clf.fit(X_train_grid, y_train_grid)
print(clf.best_params_, clf.best_score_)

What differs from logistic regression is that GridSearchCV's cv argument is set to 2 and return_train_score is set to False. I originally used cv=3 and left return_train_score unset, but the search would never finish, so I looked it up online and changed these settings.
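Another knob that often helps when a grid search drags on (not used for the numbers in this article, so they were obtained without it) is parallelizing across CPU cores with n_jobs:

#Sketch: the same search spread over all CPU cores with n_jobs=-1
clf = GridSearchCV(model, parameters, cv=2, return_train_score=False, n_jobs=-1)
clf.fit(X_train_grid, y_train_grid)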

At this point we have found the "best parameters"! We train on the training data with these parameters and verify the accuracy on the test data.

#Sample 15% from the training data
X_train_sample = pd.DataFrame(X_train).sample(frac = 0.15, random_state=1234)
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.15, random_state=1234)

#Model training
clf = SVC(**clf.best_params_,random_state=1234)
clf.fit(X_train_sample, y_train_sample) 

#Check accuracy with test data
clf.score(X_test, y_test)

The accuracy was **0.65393**.

(iv) Pattern 9 ~ Standardization + Hyperparameter tuning + Feature selection ~

Finally, add feature selection.

Here too, the training data is subdivided even further than in pattern 8 because of the feature selection, so be careful not to lose track of which training data is used for what.

#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)

#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

#Feature selection
estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)

#Overwrite training data with selected features
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)

This is the feature selection. Next, hyperparameter tuning is performed based on the selected features.

#Sample 3% of the feature-selected training data for hyperparameter tuning
X_train_grid = pd.DataFrame(X_train_selected).sample(frac = 0.03,random_state=1234)
y_train_grid = pd.DataFrame(y_train).sample(frac = 0.03,random_state=1234)

#Hyperparameter tuning
parameters = {'kernel':['linear', 'rbf'], 'C':[0.001, 0.01,0.1,1,10]} #Edit here
model = SVC(random_state=1234)
clf = GridSearchCV(model, parameters, cv=2,return_train_score=False)
clf.fit(X_train_grid, y_train_grid)
print(clf.best_params_, clf.best_score_)

At this point, hyperparameter tuning using the features selected by the embedded method is complete, and the best parameters have been determined.

Let's train the SVM with these best parameters on 15% of the feature-selected training data (X_train_selected), that is, 15% of the 70% training split.

#Sample 15% from the feature-selected training data
X_train_sample = pd.DataFrame(X_train_selected).sample(frac = 0.15, random_state=1234)
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.15, random_state=1234)

#Build the model on the 15% sample with the best parameters
clf = SVC(**clf.best_params_,random_state=1234)
clf.fit(X_train_sample, y_train_sample) 

#Accuracy evaluation with test data
clf.score(X_test_selected, y_test)

The accuracy is **0.65066**.

Here, let's summarize the accuracy again.

| | Model | Pattern | Accuracy |
|:---|:---|:---|:---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | |
| Pattern 11 | Decision tree | D | |
| Pattern 12 | Decision tree | D+E | |
| Pattern 13 | Random forest | A | |
| Pattern 14 | Random forest | D | |
| Pattern 15 | Random forest | D+E | |
| Pattern 16 | AdaBoost | A | |
| Pattern 17 | AdaBoost | D | |
| Pattern 18 | AdaBoost | D+E | |

(4) Patterns 10-12 [Decision tree]

Next is the decision tree.

(i) Pattern 10 ~ Default ~

Split the data.

y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

Build the decision tree model and verify the accuracy.

clf = DecisionTreeClassifier(random_state=1234)
clf = clf.fit(X_train, y_train)

clf.score(X_test, y_test)

The accuracy is now **0.63727**.

(ii) Pattern 11 ~ Hyperparameter tuning ~

The decision tree requires neither regularization nor standardization, so we start with hyperparameter tuning.

First is data division.

y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

Next, run a grid search for the hyperparameter tuning. After that, build the model with the best parameters and verify the accuracy on the test data.

#GridSearch
parameters = {'criterion':['gini', 'entropy'], 'max_depth':[i for i in range(1, 11)],'max_features':['auto','sqrt','log2'], 'min_samples_leaf':[i for i in range(1, 11)]} #random_state is already set on the model; keeping it out of the grid avoids passing it twice when unpacking best_params_
model = DecisionTreeClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)

#Build a model with the best parameters
clf = DecisionTreeClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train, y_train)

#Accuracy verification
clf.score(X_test,y_test)

The accuracy was **0.66376**.

(iii) Pattern 12 ~ Hyperparameter tuning + Feature selection ~

As always, start with data splitting.

#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

#Feature selection
estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)

#Overwrite training data with selected features
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)

#Hyperparameter tuning
parameters = {'criterion':['gini', 'entropy'], 'max_depth':[i for i in range(1, 11)],'max_features':['auto','sqrt','log2'], 'min_samples_leaf':[i for i in range(1, 11)]} #random_state again left out of the grid (it is set on the model)
model = DecisionTreeClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train_selected, y_train)
print(clf.best_params_, clf.best_score_)

#Train learners with optimal parameters
clf_2 = DecisionTreeClassifier(**clf.best_params_,random_state=1234)
clf_2.fit(X_train_selected,y_train)

#Check accuracy with test data
clf_2.score(X_test_selected, y_test)

The accuracy is **0.65732**.

| | Model | Pattern | Accuracy |
|:---|:---|:---|:---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | 0.63727 |
| Pattern 11 | Decision tree | D | 0.66376 |
| Pattern 12 | Decision tree | D+E | 0.65732 |
| Pattern 13 | Random forest | A | |
| Pattern 14 | Random forest | D | |
| Pattern 15 | Random forest | D+E | |
| Pattern 16 | AdaBoost | A | |
| Pattern 17 | AdaBoost | D | |
| Pattern 18 | AdaBoost | D+E | |

(5) Patterns 13 to 15 [Random Forest]

(i) Pattern 13 ~ Default ~

Split the data.

y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

Build the random forest model and verify the accuracy.

clf = RandomForestClassifier(random_state=1234)
clf.fit(X_train, y_train)

clf.score(X_test, y_test)

The accuracy is now **0.64522**.

(ii) Pattern 14 ~ Hyperparameter tuning ~

Like the decision tree, the random forest requires neither regularization nor standardization.

As before, hyperparameter tuning follows the data split. The difference from the previous models is that the search range is somewhat narrower (only a handful of candidate values per parameter). Even this took about 35 minutes, and widening the range further seemed impractical given my time, so I kept it narrow.

#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

#Hyperparameter tuning
parameters = {'max_depth':[2,4,6,None],  'min_samples_leaf':[1,3,5],'min_samples_split':[2,4,6]}
model = RandomForestClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3,)
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)

#Train learners with optimal parameters
clf = RandomForestClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train, y_train)

#Check accuracy with test data
clf.score(X_test, y_test)

Because narrowing down the candidate values took so long, I switched to searching only alternating (even- or odd-numbered) values while making sure the default values were included. The accuracy came out to **0.67762**.

(iii) Pattern 15 ~ Hyperparameter tuning + Feature selection ~

#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

#Feature selection
estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)

#Overwrite training data with selected features
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)

#Hyperparameter tuning
parameters = {'max_depth':[2,4,6,None],  'min_samples_leaf':[1,3,5],'min_samples_split':[2,4,6]}
model = RandomForestClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3,)
clf.fit(X_train_selected, y_train)
print(clf.best_params_, clf.best_score_)

#Train learners with optimal parameters
clf = RandomForestClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train_selected, y_train)

#Check accuracy with test data
clf.score(X_test_selected, y_test)

The accuracy is now **0.66308**.

Let's check the accuracy again.

| | Model | Pattern | Accuracy |
|:---|:---|:---|:---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | 0.63727 |
| Pattern 11 | Decision tree | D | 0.66376 |
| Pattern 12 | Decision tree | D+E | 0.65732 |
| Pattern 13 | Random forest | A | 0.64522 |
| Pattern 14 | Random forest | D | 0.67762 |
| Pattern 15 | Random forest | D+E | 0.66308 |
| Pattern 16 | AdaBoost | A | |
| Pattern 17 | AdaBoost | D | |
| Pattern 18 | AdaBoost | D+E | |

(6) Patterns 16-18 [AdaBoost]

(i) Pattern 16 ~ Default ~

Split the data.

#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

Build the AdaBoost model and verify the accuracy.

clf = AdaBoostClassifier(DecisionTreeClassifier(random_state=1234))
clf.fit(X_train, y_train)

clf.score(X_test, y_test)

The accuracy is now **0.63947**.
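One thing worth knowing: AdaBoostClassifier's default base learner is a depth-1 decision stump, so passing a DecisionTreeClassifier with no max_depth, as above, boosts full-depth trees instead. A sketch of the default variant for comparison (not one of the 18 numbered patterns, so no accuracy is recorded for it):

#Sketch: AdaBoost with its default base learner (a depth-1 decision stump)
clf_stump = AdaBoostClassifier(random_state=1234)
clf_stump.fit(X_train, y_train)
clf_stump.score(X_test, y_test)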

(ii) Pattern 17 ~ Hyperparameter tuning ~

#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

#Hyperparameter tuning
parameters = {'learning_rate':[0.1,0.5,1.0]}
model = AdaBoostClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3,)
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)

#Train learners with optimal parameters
clf = AdaBoostClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train, y_train)

#Check accuracy with test data
clf.score(X_test, y_test)

The accuracy was **0.67426**.

(iii) Pattern 18 ~ Hyperparameter tuning + Feature selection ~

#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

#Feature selection
estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)

#Overwrite training data with selected features
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)

#Hyperparameter tuning
parameters = {'learning_rate':[0.1,0.5,1.0]}
model = AdaBoostClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train_selected, y_train)
print(clf.best_params_, clf.best_score_)

#Train learners with optimal parameters
clf = AdaBoostClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train_selected, y_train)

#Check accuracy with test data
clf.score(X_test_selected, y_test)

The accuracy was **0.659367**.

| | Model | Pattern | Accuracy |
|:---|:---|:---|:---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | 0.63727 |
| Pattern 11 | Decision tree | D | 0.66376 |
| Pattern 12 | Decision tree | D+E | 0.65732 |
| Pattern 13 | Random forest | A | 0.64522 |
| Pattern 14 | Random forest | D | 0.67762 |
| Pattern 15 | Random forest | D+E | 0.66308 |
| Pattern 16 | AdaBoost | A | 0.63947 |
| Pattern 17 | AdaBoost | D | 0.67426 |
| Pattern 18 | AdaBoost | D+E | 0.659367 |

4. Conclusion

What did you think?

Surprisingly few sites seem to introduce truly basic model-building methods, and I had often thought, "I don't need anything advanced; I just want to build a model once!"

This article was written around my own stumbling points, so I hope it helps deepen your understanding too.
