Model, prediction, solution

Now you are ready to train a model on the training data and make predictions. There are more than 60 machine learning algorithms you can use for modeling.

In practice, you need to understand the type of problem and the requirements of the solution, then choose a suitable algorithm. The publicly available chart below shows which kind of algorithm to select according to the type of problem to be solved and the number of samples. Let's make good use of it.

scikit-learn algorithm cheat sheet

Prediction problems can be divided into two categories: classification and regression.

1, Classification

Classification predicts which class a piece of data belongs to, while regression predicts a numeric value from the data. For example, on the left side of the figure below, a classification line is drawn from the data, dividing it into class A and class B.

Since the new data point ☆ falls on the class A side, it is predicted to be class A. On the right side of the figure below, a regression line is drawn from the data to obtain the predicted value for the new data point ☆. These techniques are already applied all around us, for example to identify whether an image shows a dog or a cat, to predict whether equipment will break down, and to predict sales.

Below are some typical classification and regression algorithms.

Logistic regression
k-NN or k-Nearest Neighbors
Support vector machine
Naive Bayes classifier
Decision tree
Random forest
Perceptron
Neural network

Here is an example of data preparation.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


#Machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier


# 1. Assign train_df without the Survived column to X_train.
X_train = train_df.drop("Survived", axis=1)

# 2. Assign only the Survived column of train_df to Y_train.
Y_train = train_df["Survived"]

# 3. Assign test_df without the PassengerId column to X_test.
X_test  = test_df.drop("PassengerId", axis=1).copy()

# 4. Print the number of rows and columns of X_train, Y_train, and X_test.
print(X_train.shape, Y_train.shape, X_test.shape)

Logistic regression

Logistic regression uses the logistic function (sigmoid function) to perform binary classification. In other words, it can classify whether the objective variable Survived is 0 or 1 and use that for prediction. The model is built from the explanatory variables other than Survived, such as Pclass and Age. The sigmoid function has the S shape shown in the figure below and takes values between 0 and 1.
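The shape of the sigmoid function can be checked with a few lines of Python. This is a minimal sketch for illustration only: the helper names sigmoid and to_class and the 0.5 threshold are assumptions made here, not part of the original notebook.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_class(z, threshold=0.5):
    """Turn the sigmoid output into a 0/1 label (the threshold is an assumption)."""
    return 1 if sigmoid(z) >= threshold else 0

# Print the input, the sigmoid value, and the resulting class.
for z in (-4, -1, 0, 1, 4):
    print(z, round(sigmoid(z), 3), to_class(z))
```

Large negative inputs give values close to 0, large positive inputs give values close to 1, and an input of 0 gives exactly 0.5.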

Logistic regression with python sklearn. fit and predict, regularize with argument C

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train_lr, X_test_lr, Y_train_lr, Y_test_lr = train_test_split(X_train, Y_train, test_size=0.2)


# 1. Create a classification model by logistic regression using the training data X_train_lr and Y_train_lr.
logreg = LogisticRegression()
logreg.fit(X_train_lr, Y_train_lr)

# 2. Apply the model to the held-out data X_test_lr and compute its accuracy (%).
acc_log = round(logreg.score(X_test_lr, Y_test_lr) * 100, 2)

print(acc_log)

# 1. List the partial regression coefficient of each feature.
#    X_train.columns has the same column order the model was fitted on.
coeff_df = pd.DataFrame(X_train.columns)
coeff_df.columns = ['Feature']
coeff_df["Partial regression coefficient"] = pd.Series(logreg.coef_[0])

# 2. Sort from the largest positive coefficient to the largest negative one.
coeff_df.sort_values(by="Partial regression coefficient", ascending=False)

Support vector machine

Support Vector Machine (SVM) is an algorithm for classification and regression. It uses a technique called the kernel method to handle data that is not linearly separable as if it were a linear problem, which keeps the processing time needed for training short.

The wider the margin around the separating line, the less likely overfitting becomes and the better the model generalizes. Conversely, the narrower the margin, the more likely the model is to overfit and match only specific data. With SVM, you build a model by tuning several parameters, including the one that controls the margin width. The resulting models tend to have relatively high classification accuracy.
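The relationship between a parameter and the margin width can be sketched as follows. In scikit-learn's SVC the margin is controlled indirectly through the regularization parameter C; the toy data, the linear kernel, and the two C values are assumptions chosen for illustration and are unrelated to the Titanic data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy data: two overlapping clusters (not the Titanic data).
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                  cluster_std=1.5, random_state=0)

margin_widths = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # For a linear SVM the margin width is 2 / ||w||.
    margin_widths[C] = 2.0 / np.linalg.norm(clf.coef_)
    print(f"C={C}: margin width = {margin_widths[C]:.3f}")
```

A small C tolerates more points inside the margin and so yields a wider margin, while a large C penalizes them heavily and narrows it.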

# 1. Create a classification model with SVC and check its accuracy on the training data.
svc = SVC()
svc.fit(X_train, Y_train)

Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)

print(acc_svc)

# 2. Create a classification model with LinearSVC and check its accuracy on the training data.
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)

print(acc_linear_svc)

k-Nearest Neighbors (k-NN)

The k-Nearest Neighbors method (k-NN for short) is a classification algorithm.

New data is classified by a majority vote among the nearest data points that have already been classified. For example, in the figure below, if the 3 nearest points vote (k = 3), the new data is classified as red. If the 5 nearest points vote (k = 5), the new data is classified as blue.
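The majority vote can be sketched in plain Python. The points, labels, and the helper knn_predict below are hypothetical and were chosen so that the result changes between k = 3 and k = 5, as in the figure; this is not how scikit-learn implements the method internally.

```python
import math
from collections import Counter

# Hypothetical labeled points around the origin (toy data).
points = [((1.0, 0.0), "red"), ((-1.1, 0.0), "blue"), ((0.0, 1.2), "red"),
          ((0.0, -1.5), "blue"), ((1.6, 0.0), "blue")]

def knn_predict(query, k):
    """Classify query by majority vote among its k nearest labeled points."""
    nearest = sorted(points, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((0.0, 0.0), k=3))
print(knn_predict((0.0, 0.0), k=5))
```

With this toy data the three nearest points contain two reds, so k = 3 yields red, while all five points contain three blues, so k = 5 yields blue.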

# 1. Create a classification model with the k-NN method and check its accuracy on the training data.
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)

Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)

print(acc_knn)

Naive Bayes classifier

Naive Bayes is a classification algorithm. It is used mainly for document classification, and its application to spam filters is well known. As the name "naive" (meaning simple) suggests, it is a simple method based on conditional probabilities and can be implemented concisely. Now consider sorting articles into categories A and B.

Suppose you have a labeled word-document matrix. From this matrix, create a word occurrence distribution for each of category A and category B. The distributions have different shapes, which shows that each category has its own characteristics.

Create a distribution for an unlabeled article in the same way. You can then assign a category according to how close it is to the distributions (the model) created earlier.
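The idea of per-category word distributions can be sketched in plain Python. The four toy articles, the add-one smoothing, and the helper names are assumptions made for illustration. Note that the notebook code that follows uses GaussianNB on numeric features instead, which is a different variant of Naive Bayes.

```python
import math
from collections import Counter

# Hypothetical labeled articles (toy data).
train_docs = [
    ("ball game team score", "A"),
    ("team win game", "A"),
    ("election vote government", "B"),
    ("government policy vote", "B"),
]

# Word occurrence counts for each category.
word_counts = {"A": Counter(), "B": Counter()}
doc_counts = Counter()
for text, label in train_docs:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocabulary = {word for counts in word_counts.values() for word in counts}

def log_score(text, label):
    """Log prior plus summed log likelihoods, with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in text.split():
        if word not in vocabulary:
            continue  # ignore words never seen during training
        count = word_counts[label][word]
        score += math.log((count + 1) / (total_words + len(vocabulary)))
    return score

def classify(text):
    """Pick the category whose word distribution fits the text best."""
    return max(word_counts, key=lambda label: log_score(text, label))

predicted = classify("the team won the game")
print(predicted)
```

The unlabeled article shares the words "team" and "game" with category A, so it is assigned to A.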

# 1. Create a classification model with Naive Bayes and check its accuracy on the training data.
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)

Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)

print(acc_gaussian)

Perceptron

Perceptron is a classification algorithm. It was created by imitating the function of neurons (nerve cells) in the human brain. The inputs are multiplied by weights and summed, and an activation function converts the sum into the output.

A step function is often used as the activation function.

The step function outputs 1 if the input value is 0 or more, and 0 if the input value is less than 0. In the figure, when the node value is ●, the step function converts it to the output value ▲, that is, 1.

You can create a neural network by combining multiple perceptrons.
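The weighted sum and the step function described above can be sketched as a single perceptron. The weights and the bias term below are hypothetical values picked so that the unit behaves like a logical AND; the bias is an addition not mentioned in the text.

```python
def step(x):
    """Step activation: 1 if the input is 0 or more, otherwise 0."""
    return 1 if x >= 0 else 0

def perceptron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the step function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(total)

# Hypothetical weights that make the perceptron behave like a logical AND.
weights, bias = (0.5, 0.5), -0.7
for inputs in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(inputs, perceptron_output(inputs, weights, bias))
```

Only the input (1, 1) pushes the weighted sum to 0 or above, so it is the only case where the output is 1.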

# 1. Create a classification model with Perceptron and check its accuracy on the training data.
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)

Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)

print(acc_perceptron)

Stochastic gradient descent

While a model is being created with Perceptron, it is trained so that the error between the output and the objective variable (the correct answer) becomes small. The weights are updated using gradient descent to reduce this error.

The bottom of the error curve is the optimum value (optimal solution) of the weight, where the error is minimized. Starting from the initial weight value ×, repeatedly increasing and updating the weight leads to the optimum weight value ●.

The size of each weight update is obtained from the slope in the area enclosed by the square in the figure. Draw a tangent line to the error curve and compute its slope (the derivative) from the change in error relative to the change in weight.

The weight updates stop either when the error falls below a certain value or after a specified number of iterations. Gradient descent searches for the optimum value while descending toward the bottom of the curve. In particular, when the training data is split into several parts and the calculation is repeated over them, the method is called stochastic gradient descent.
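The update loop and its two stopping conditions can be sketched for a model with a single weight. The toy data, learning rate, tolerance, and iteration limit are assumptions made for illustration. This sketch uses all the data in every step (plain gradient descent); a stochastic variant would compute the gradient on one part of the data at a time.

```python
# Toy data generated from y = 2 * x, so the optimum weight is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def mean_squared_error(w):
    """Error between the outputs w * x and the correct answers y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w):
    """Slope of the error curve at w (derivative of the mean squared error)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0                 # initial weight
learning_rate = 0.05    # step size (assumed value)
tolerance = 1e-8        # stop when the error is below this value
max_iterations = 1000   # or after this many updates

for iteration in range(max_iterations):
    if mean_squared_error(w) < tolerance:
        break
    # Move the weight against the slope, toward the bottom of the curve.
    w -= learning_rate * gradient(w)

print(iteration, round(w, 4))
```

Because the initial weight lies to the left of the optimum, the slope is negative and each update increases the weight until the error is small enough.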

# 1. Create a classification model with stochastic gradient descent and check its accuracy on the training data.
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)

Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)

print(acc_sgd)

Decision tree

Decision trees are algorithms that perform classification and regression. With a decision tree, the rules extracted from the data are represented in a tree structure. It is known as an easy-to-use algorithm because it is intuitive and easy to explain to people.

The objective variable in the figure is label, and the explanatory variables are a1 to a4. A decision tree is made up of nodes.

The top node is called the "root node" and the terminal nodes are called "leaf nodes". Each branch of the decision tree is a condition of a rule. For example, if the value of a3 exceeds 2.45 and the value of a4 exceeds 1.75, the iris flower is classified as "Iris-virginica".

In addition, the variables that affect the classification are arranged in order from the root node to the leaf nodes. The closer a leaf node is to a single color (a single class), the higher its purity. Pursuing higher purity makes the rules more complex and improves accuracy on the training data, but it can lead to overfitting.
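One common way to put a number on purity is the Gini impurity, which is the default splitting criterion of scikit-learn's DecisionTreeClassifier. The sketch below is only an illustration: the helper gini_impurity and the two toy leaves are assumptions, and the text above does not say which measure the figure is based on.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 0.0 for a pure node, larger for a more mixed node."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# Hypothetical leaf contents (toy data).
pure_leaf = ["Iris-virginica"] * 5
mixed_leaf = ["Iris-virginica"] * 3 + ["Iris-versicolor"] * 2

print(gini_impurity(pure_leaf))
print(round(gini_impurity(mixed_leaf), 2))
```

A leaf containing a single class has an impurity of 0, so a tree that keeps splitting until every leaf is pure fits the training data exactly, which is where the risk of overfitting comes from.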

# 1. Create a classification model with a decision tree and check its accuracy on the training data.
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

print(acc_decision_tree)

Random forest

Random forest is an algorithm for classification and regression. It is one of the ensemble learning methods (a classifier composed of multiple classifiers). Because it builds a large number of decision trees (n_estimators = 100 here), the collection of trees is called a forest.
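The fact that the forest really consists of many individual trees can be checked on a fitted model. This is a minimal sketch on synthetic data, which stands in for the real features as an assumption; the variable names are chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the real features (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# The fitted forest keeps every individual decision tree.
n_trees = len(forest.estimators_)

# Ask each tree separately about the first sample and count the votes for class 1.
tree_votes = [int(tree.predict(X[:1])[0]) for tree in forest.estimators_]
votes_for_class_1 = sum(tree_votes)

print(n_trees, votes_for_class_1, forest.predict(X[:1])[0])
```

scikit-learn combines the trees by averaging their predicted class probabilities rather than by counting hard votes, so the vote count here is only an approximation of how the final prediction is formed.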

# 1. Create a classification model with a random forest and check its accuracy on the training data.
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

print(acc_random_forest)

Model evaluation

Rank the accuracy of all models and choose the best model for problem solving.

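# List the accuracy of each model.
# Note: acc_log was measured on held-out data, while the other scores were measured on the training data.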
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest',  
              'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest,  
              acc_linear_svc, acc_decision_tree]})

models.sort_values(by='Score', ascending=False)
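Most of the scores above are measured on the same data the models were fitted on, which tends to favor models that memorize the training set. As a hedged alternative, the sketch below ranks several models on a held-out split, in the same way the logistic regression section does. The synthetic data, the choice of models, and the parameter values are assumptions made so the example runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    # Score on the held-out split, not on the data used for fitting.
    rows.append({"Model": name, "Score": round(model.score(X_te, y_te) * 100, 2)})

ranking = pd.DataFrame(rows).sort_values(by="Score", ascending=False)
print(ranking)
```

Scoring every model on the same held-out split makes the numbers in the ranking directly comparable.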

# Save the random forest predictions (Y_pred) to a CSV file.
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })

submission.to_csv('./8010_titanic_data/submission.csv', index=False)

Python: Ship Survival Prediction Part 3