Python: Supervised Learning (Classification)

What is supervised learning (classification)?

What is "classification"?

Machine learning is divided into three main areas. Classification corresponds to item 1-1 below.

1, Supervised learning: a model is learned from training data with correct labels, and then makes predictions for unknown data. Supervised learning is divided into the following two categories.

1-1, Classification problems: the model learns from data divided into categories and predicts the category (a discrete value) of unknown data. This content will help you understand algorithms for classification problems and implement simple examples. A practical application is email spam detection.

1-2, Regression problems (supervised learning (regression)): unlike classification, these predict continuous values. Stock price forecasting falls into this category.

2, Unsupervised learning: for data without correct labels, or data whose structure is unknown, the machine finds the structure and relationships in the data. Examples include analyzing retail customer trends and clustering.

3, Reinforcement learning: the purpose is to improve performance through interaction with the environment. Rewards are set for actions, and the agent is trained to take actions that benefit its goal depending on its state. An example is game-playing AI such as for Go.

Binary and multinomial classifications

Classification problems can be broadly divided into binary and multinomial classification problems.

Binary classification (also called two-class classification)

A binary classification problem has two categories (called classes) to classify into, so each sample can be identified simply by whether or not it belongs to one of the two groups. If a straight line can separate the classes, the problem is called linear classification; otherwise it is called non-linear classification.

Multinomial classification (also called multiclass classification)

This is a classification problem with three or more classes. It cannot be solved simply by deciding whether a sample belongs to a single group, and in many cases the classes cannot be separated by a straight line.

Classification flow

Machine learning follows the series of steps shown below. In step 2, "Algorithm selection", you choose among the various classification algorithms.

In the "supervised learning (classification)" model, the optimum classification algorithm is selected and the model is created according to the purpose. Tuning is required for maximum performance.

  1. Data preprocessing: shaping and manipulating the data

  2. Algorithm selection: select an algorithm and build a model

  3. Model learning: choose the hyperparameters to tune, then tune them

  4. Model prediction (inference): verify the model's accuracy on unknown data, then incorporate the model into web services and put it into practice

[Figure: the machine learning workflow described above]

How to prepare data (1)

To actually run the code and learn about the various classification methods, you need data that can be classified. In practice, you would obtain real measurements and shape them, but here that part is omitted: we will create fictitious classification data for practice ourselves. This section introduces how to generate sample data.

To create fictitious data suitable for classification, use make_classification() from the datasets module of scikit-learn.

Supervised classification requires data together with labels indicating which class each sample belongs to. With make_classification(), you can set the number of samples and the kinds of labels through its arguments.

#Module import
from sklearn.datasets import make_classification
#Generate data X and labels y
X, y = make_classification(n_samples=XX, n_classes=XX, n_features=XX,
                           n_redundant=XX, random_state=XX)

Each argument of the above function is as follows.

n_samples
#Number of data to prepare
n_classes
#Number of classes. If not specified, the value will be 2.
n_features
#Number of data features
n_redundant
#Number of features (extra features) unnecessary for classification
random_state
#Random number seed (element that determines the pattern of random numbers)

There are other arguments, but this content uses only the ones defined above. A label y is prepared to indicate which class each sample belongs to; labels are basically integer values. For example, in binary classification, each sample's label will be 0 or 1.
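
For example, the following runnable sketch uses concrete values (the numbers are illustrative, not prescribed by this content):

#Module import
from sklearn.datasets import make_classification

#Generate 100 two-feature samples for a binary classification problem
X, y = make_classification(n_samples=100, n_classes=2, n_features=2,
                           n_redundant=0, random_state=42)

print(X.shape)  # (100, 2)
print(y[:10])   # integer labels, here 0 or 1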

How to prepare data (2)

The scikit-learn library is mainly used to implement the classification algorithms.

However, scikit-learn includes not only classification algorithms but also functions for data preprocessing, model tuning, and evaluation.

There are also several datasets available for algorithm experimentation and testing, which can be loaded by specifying the module. Here we introduce one of them: how to load the Iris dataset.

What is the Iris dataset? It stores four feature values (unit: cm) for 150 samples of iris (a kind of flower): "sepal length", "sepal width", "petal length", and "petal width". It also stores 3 varieties as labels (0 to 2).

Here we will use only two of the features, "sepal length" and "petal length", to visualize the data.

[Figure: scatter plot of the Iris samples by sepal length and petal length]

#Import the datasets module from the scikit-learn library
from sklearn import datasets
import numpy as np

#Get data
iris = datasets.load_iris()
#Stores 0th and 2nd columns of iris
X = iris.data[:, [0, 2]]
#Stores iris class labels
y = iris.target
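
The scatter plot in the figure above can be reproduced with a short matplotlib sketch; the plotting choices below are assumptions, not part of the original code:

import matplotlib.pyplot as plt

#Color each sample by its variety label (0 to 2)
plt.scatter(X[:, 0], X[:, 1], c=y, marker=".")
plt.xlabel("Sepal length")
plt.ylabel("Petal length")
plt.show()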

Also, to evaluate the performance of the trained model on unknown data, the dataset is divided into training data and test data.

Using train_test_split() from scikit-learn's model_selection module as follows, the X and y arrays are randomly split into 30% test data and 70% training data.

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=42)

Learning and prediction

In machine learning there are multiple algorithms. The outline of the series of steps, from learning from teacher data according to an algorithm to predicting labels, is called a model.

Implementing every machine learning model yourself would be hard, but Python has many libraries specialized for machine learning. Among them, scikit-learn is a library in which machine learning models are already prepared.

First, let's look at how to use a fictitious model, Classifier(), as an example.

#Module import
#Each model is imported from a different module (examples below)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

#Model building
model = Classifier()
#Model learning
model.fit(train_X, train_y)
#Forecasting data by model
model.predict(test_X)

#Model accuracy
#Accuracy = (number of samples whose predicted class matches the actual class) ÷ (total number of samples)
model.score(test_X, test_y)

When writing actual machine learning code

In actual machine learning code, the Classifier() part is replaced with a concrete model class.

An attraction of scikit-learn is that machine learning can be practiced quite simply, as described above.
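
As an illustrative sketch (the dataset and split are assumptions for demonstration), the classifiers imported above can each be substituted for Classifier() and compared on the same data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=2,
                           n_redundant=0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

#Fit each model and print its accuracy on the test data
for model in [LogisticRegression(), SVC(), DecisionTreeClassifier(),
              RandomForestClassifier(), KNeighborsClassifier()]:
    model.fit(train_X, train_y)
    print(model.__class__.__name__, model.score(test_X, test_y))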

Main methods

Logistic regression

Logistic regression is a method that learns the boundary of linearly separable data and uses it to classify the data.

[Figure: two-class data separable by a straight line]

In the graph above, it looks as if a straight line could be drawn to separate the colors. Data whose category groups can be divided by a straight line like this is called linearly separable data.

A feature of logistic regression is that the boundary becomes a straight line, so it is used for data with few classes, such as binary classification. It can also compute the probability that a sample is classified into each class, so it is mainly used when you want to know the probability of a classification, as in "the precipitation probability in a weather forecast".

A drawback is that the training data must be linearly separable for classification to work. It is also not suitable for high-dimensional sparse data.

In addition, the boundary learned from the training data passes right next to the samples at the edge of each class, so it does not easily become a well-generalized boundary (low generalization ability).

Logistic regression model

LogisticRegression() is defined in the linear_model submodule of the scikit-learn library.

When training with a logistic regression model, write code similar to the following to call the model.

#Call the model from the package
from sklearn.linear_model import LogisticRegression

#Build a model
model = LogisticRegression()

#Train the model
# train_data_detail is a collection of information used to predict the category of data
# train_data_label is the label of the class to which the data belongs
model.fit(train_data_detail, train_data_label)

#Let the model predict
model.predict(data_detail)

#Accuracy of the model's predictions
model.score(data_detail, data_true_label)
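
The class-membership probabilities mentioned earlier are available through predict_proba(). A minimal sketch on generated data (the dataset parameters are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=2,
                           n_redundant=0, random_state=42)
model = LogisticRegression()
model.fit(X, y)

#Each row holds the probabilities of class 0 and class 1 for one sample
print(model.predict_proba(X[:5]))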

For visualization, the trained model makes a prediction for every point on a fine grid over the graph, and how the model divides the data can then be shown by color.

Use the matplotlib library to visualize the graph. The other learning models are visualized and compared in a similar way.


import matplotlib
import matplotlib.pyplot as plt
import numpy as np

#Plot all data on a scatter plot, coloring by label
plt.scatter(X[:, 0], X[:, 1], c=y, marker=".",
            cmap=matplotlib.cm.get_cmap(name="cool"), alpha=1.0)
#Determine the range of the graph
x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
#Store the grid coordinates at intervals of 0.02
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.02),
                       np.arange(x2_min, x2_max, 0.02))
#Predict with the trained model for every (xx1, xx2) pair
Z = model.predict(np.array([xx1.ravel(), xx2.ravel()]).T).reshape(xx1.shape)
#Draw Z on the coordinates (xx1, xx2)
plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=matplotlib.cm.get_cmap(name="Wistia"))
#Specify range, label, title, grid
plt.xlim(xx1.min(), xx1.max())
plt.ylim(xx2.min(), xx2.max())
plt.title("classification data using LogisticRegression")
plt.xlabel("Sepal length")
plt.ylabel("Petal length")
plt.grid(True)
plt.show()

np.meshgrid creates coordinate matrices from coordinate vectors such as x and y.

import numpy as np

#X in the figure below
x = np.array([1, 2, 3])

#Y in the figure below
y = np.array([4, 5])

x1, y1 = np.meshgrid(x, y)

print(x1)
print()
print(y1)
#output
[[1 2 3]
 [1 2 3]]

[[4 4 4]
 [5 5 5]]

[Figure: how np.meshgrid combines x and y into coordinate matrices]

Linear SVM

SVM (Support Vector Machine), like logistic regression, is a method that classifies data by finding a boundary. Its greatest feature is the use of vectors called support vectors.

Support vectors are the data points that lie closest to the other class. The boundary is drawn at the position where the distance to the support vectors is largest, that is, so as to maximize the distance between one class and the other (margin maximization).

[Figure: SVM boundary drawn to maximize the margin around the support vectors]

Compared to logistic regression, SVMs draw the classification boundary at the point farthest from both classes, so the boundary tends to generalize and classification accuracy on new data tends to improve. Another feature is that, since only the support vectors need to be considered to determine the boundary, the decision process is straightforward.

Its drawbacks: (1) as the amount of data increases, the amount of computation grows, so learning and prediction tend to be slower than other methods; (2) as with logistic regression, unless the input data is linearly separable (a state in which a straight boundary can be drawn), it cannot classify correctly.

An SVM that classifies by drawing a straight line is called a linear SVM. A linear SVM can be implemented with scikit-learn's LinearSVC().

from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

#Data generation
X, y = make_classification(n_samples=100, n_features=2,
                           n_redundant=0, random_state=42)

#Divide the data into teacher data and the data you want to predict
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

#Model building
model = LinearSVC()
#Model learning
model.fit(train_X, train_y)

#Output correct answer rate
model.score(test_X, test_y)

Note that this accuracy is computed on test_X and test_y; the accuracy on train_X and train_y is not calculated. So even if the printed accuracy is 100%, some points in the graph (training data) may still be misclassified.
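
To see the gap described above, the training accuracy can be printed alongside the test accuracy as a small addition to the code above:

#Accuracy on the training data vs. accuracy on the test data
print("train:", model.score(train_X, train_y))
print("test: ", model.score(test_X, test_y))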

Boundary visualization can be performed with the following code, similar to the logistic regression method.

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

#Below is the visualization work
plt.scatter(X[:, 0], X[:, 1], c=y, marker=".",
            cmap=matplotlib.cm.get_cmap(name="cool"), alpha=1.0)

x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.02),
                       np.arange(x2_min, x2_max, 0.02))
Z = model.predict(np.array([xx1.ravel(), xx2.ravel()]).T).reshape(xx1.shape)
plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=matplotlib.cm.get_cmap(name="Wistia"))
plt.xlim(xx1.min(), xx1.max())
plt.ylim(xx2.min(), xx2.max())
plt.title("classification data using LinearSVC")
plt.grid(True)
plt.show()

[Figure: decision boundary learned by LinearSVC]

Non-linear SVM

A linear SVM is an excellent model that is easy to interpret and generalizes well, but it has the disadvantage that it cannot be used unless the input data is linearly separable.

The non-linear SVM is a model developed to remove this shortcoming of the linear SVM.

[Figure: kernel transformation making the input data linearly separable]

As the figure above shows, by transforming the data with mathematical processing according to a conversion formula called a kernel function, the input data may become linearly separable.

Non-linear SVM is a model that performs such processing and uses SVM.

The operations performed by kernel functions are complicated, but they do not actually have to be computed: classification is possible as long as the inner products of the transformed data can be obtained. This is why the approach is also called the kernel trick.

SVC() in the svm submodule of scikit-learn is used.

import matplotlib
from sklearn.svm import SVC
from sklearn.datasets import make_gaussian_quantiles
import matplotlib.pyplot as plt

#Data generation
#Since this data is not linearly separable, prepare other data
data, label = make_gaussian_quantiles(n_samples=1000, n_classes=2, n_features=2, random_state=42)

#Model building
#Use SVC instead of LinearSVC to classify non-linearly separable data
model = SVC()
#Model learning
model.fit(data, label)

#Calculation of the accuracy (here computed on the training data itself)
model.score(data, label)
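
SVC selects the kernel function through its kernel argument (the default is the RBF kernel). A sketch, continuing from the data generated above, that compares several built-in kernels:

#Compare built-in kernels on the same (training) data
for kernel in ["linear", "poly", "rbf"]:
    m = SVC(kernel=kernel)
    m.fit(data, label)
    print(kernel, m.score(data, label))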

Visualizing this in the same way as before gives the following.

[Figure: decision boundary learned by SVC]

Below is the contrast between the non-linear and linear models.

[Figure: comparison of the non-linear (SVC) and linear (LinearSVC) boundaries]

Decision tree

The decision tree differs from logistic regression and SVM introduced so far: it focuses on each element of the data (each explanatory variable) and tries to determine the class to which the data belongs by splitting the data at some value of that element.

With a decision tree, you can see how much each explanatory variable affects the objective variable. The tree branches by repeated splitting, and a variable used in an earlier split can be considered to have a larger influence.

Its drawbacks are that it does not handle linearly separable data well (for example, with two-dimensional data the boundary cannot be drawn diagonally) and that the learning fits too closely to the teacher data (it does not generalize well).

DecisionTreeClassifier() in the tree submodule of scikit-learn is used.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, random_state=42)

#Divide into training data and test data
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

#Model building
model = DecisionTreeClassifier()

#Model learning
model.fit(train_X, train_y)

#Calculation of correct answer rate
model.score(test_X, test_y)
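
The influence of each explanatory variable mentioned earlier can be inspected through the trained tree's feature_importances_ attribute, a small addition to the code above:

#Importance of each explanatory variable (non-negative values that sum to 1)
print(model.feature_importances_)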

The result of visualizing the boundary in the same way is as follows.

[Figure: decision boundary learned by DecisionTreeClassifier]

Random forest

This is a method that builds multiple simplified versions of the decision tree described above and decides the classification result by majority vote. Because it trains multiple simple classifiers inside one classifier, it is a type of learning called ensemble learning.

Whereas a decision tree uses all the explanatory variables, each decision tree in a random forest tries to determine the class of the data using only a small, randomly chosen subset of the explanatory variables. The most common class among the classes output by the multiple simple decision trees is then output as the result.

Like decision trees, random forests are not very sensitive to outliers. They can also be used to classify datasets that are not linearly separable and have complex decision boundaries.

The drawback is that, as with decision trees, if the number of samples is small relative to the number of explanatory variables, the trees cannot be split properly and prediction accuracy decreases.

Use RandomForestClassifier() in the ensemble submodule of scikit-learn.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, random_state=42)

#Divide into training data and test data
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

#Model building
model = RandomForestClassifier()

#Model learning
model.fit(train_X, train_y)

#Calculation of correct answer rate
model.score(test_X, test_y)
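
The number of trees and the number of randomly chosen features per split are exposed as hyperparameters of RandomForestClassifier. A sketch with explicit values (the values are illustrative), reusing the split above:

#100 trees; each split considers a random subset of the features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=42)
rf.fit(train_X, train_y)
print(rf.score(test_X, test_y))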

Visualizing the boundary for the above gives the following result. You can see that the boundary resembles that of a decision tree. This is because a random forest performs ensemble learning of decision tree algorithms (combining multiple individually trained learners to improve generalization performance).

[Figure: decision boundary learned by RandomForestClassifier]

k-NN

k-NN, also known as the k-nearest neighbor method, is a method that finds some data similar to the data to be predicted and decides the classification result by majority vote among them.

It is a type of learning called lazy learning, and is characterized by a learning cost (the amount of computation required for learning) of 0.

Unlike the methods introduced so far, k-NN does not learn from the teacher data; instead, at prediction time it refers directly to the teacher data to predict labels. The prediction proceeds as follows.

  1. Sort the teacher data by similarity to the data used for prediction.

  2. Refer to the k most similar data points, in descending order of similarity, where k is set in the classifier.

  3. Output the most common class among the classes of the referenced teacher data as the prediction result.

The features of k-NN are that the learning cost is 0 as mentioned above, that the algorithm is relatively simple yet tends to achieve high prediction accuracy, and that it can easily express boundaries of complicated shapes. Its drawbacks are as follows.

1, If the natural number k specified in the classifier is made too large, the identification range is averaged out and prediction accuracy decreases. 2, Since a computation runs every time a prediction is made, the amount of computation grows with the amount of teacher data and prediction data, making the algorithm slow.

The image below shows how the classification result changes with the value of k. When k = 3, the gray point is predicted to be light blue, because light blue points are the majority around it; but when k = 7, green points are the majority, so the prediction changes to green.

[Figure: k-NN predictions for the same point with k = 3 and k = 7]

KNeighborsClassifier() in the neighbors submodule of scikit-learn is used.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, random_state=42)

#Divide into training data and test data
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

#Model building
model = KNeighborsClassifier()
#Model learning
model.fit(train_X, train_y)

#Calculation of correct answer rate
model.score(test_X, test_y)
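
The number k of referenced neighbors corresponds to the n_neighbors argument (the default is 5). A sketch, reusing the split above, that shows how the test accuracy changes with k:

#Vary k and compare the accuracy on the test data
for k in [1, 3, 5, 7, 15]:
    m = KNeighborsClassifier(n_neighbors=k)
    m.fit(train_X, train_y)
    print(k, m.score(test_X, test_y))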

A visualization of the boundaries of this model is, for example:

[Figure: decision boundary learned by KNeighborsClassifier]
