Aidemy 2020/9/24
Hello, this is Yope! I'm a liberal arts student, but I was interested in the possibilities of AI, so I went to the AI-specialized school "Aidemy" to study. I'd like to share the knowledge I gained there, so I'm summarizing it on Qiita. I'm very happy that so many people read my previous summary article. Thank you! This is my first post on supervised learning. Nice to meet you.
What to learn this time
・Overview of supervised learning
・Types of supervised learning (classification)
・As a premise, supervised learning is a method of giving the model training data together with correct-answer (teacher) data and having it learn until it produces the right answers; the goal is to use the trained model to predict unknown data.
・Supervised learning can be divided into "classification problems" and "regression problems". This time we will look at classification problems.
・A classification problem is __learning from data divided into categories and predicting the category (a discrete value) of unknown data__. Examples include recognizing handwritten digits from 0 to 9, identifying what appears in an image, predicting the author of a sentence, and identifying gender in a face photograph.
-Classification problems are divided into __binary classification__ and __multiclass classification__.
-Binary classification classifies data by whether or not it belongs to a single group, as in gender recognition. In some cases the classes can be separated by a straight line (linear classification).
-Multiclass classification has many possible classes, as in digit recognition.
・ Data preprocessing → Algorithm selection → Model learning → Model prediction
-In supervised learning (classification), a "classification algorithm" is chosen at the algorithm-selection step.
-To create data suited for classification, import and use the __make_classification()__ method.
-__X, y = make_classification(n_samples=number of samples, n_classes=number of classes (default: 2), n_features=number of features, n_redundant=number of redundant features, random_state=random seed)__
from sklearn.datasets import make_classification
#Create data with 50 samples, 3 classes, and 2 features (seed 0)
#make_classification requires n_features >= n_informative + n_redundant and
#n_classes * n_clusters_per_class <= 2**n_informative, so the extra arguments below satisfy that
X,y=make_classification(n_samples=50,n_classes=3,n_features=2,n_informative=2,n_redundant=0,n_clusters_per_class=1,random_state=0)
-You can also call the sample datasets provided in the scikit-learn library (sklearn).
#Call the Iris dataset, a sample of iris flower measurements.
#Module imports (datasets to get the Iris data, train_test_split from sklearn for the holdout method)
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
#Get Iris data
iris=datasets.load_iris()
#Divide into training data and test data (holdout method, test ratio 30%)
X=iris.data[:,[0,2]] #Columns 0 and 2 of the Iris features ("sepal length" and "petal length")
#(= learning data)
y=iris.target #Iris class labels (= teacher data giving the correct variety)
train_X,test_X,train_y,test_y = train_test_split(X,y,test_size=0.3,random_state=0)
・The thing that learns and makes predictions is called a model. With scikit-learn you can call a ready-made model, much as Ruby on Rails provides ready-made components, and have it learn and predict.
-Creating a model: __Model()__
・Learning: __model.fit(training data, teacher data)__
・Prediction: __model.predict(data)__
#Import a model called LogisticRegression.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
#Create data (50 samples, 2 classes, 3 features) and split into train and test
X,y = make_classification(n_samples=50,n_classes=2,n_features=3,n_redundant=0,random_state=0)
train_X,test_X,train_y,test_y = train_test_split(X,y,random_state=0)
#Model creation, learning, prediction
model = LogisticRegression(random_state=0)
model.fit(train_X,train_y)
pred_y = model.predict(test_X)
print(pred_y) #[1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 1 1 1]
print(test_y) #[1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1]
-__Logistic regression__: The boundary line is straight → linear classification only. Low generalization ability.
-__Linear SVM__: The boundary is straight → linear classification only. High generalization ability. Learning/prediction is slow.
-__Non-linear SVM__: Converts a non-linear classification problem into a linear one and processes it as a linear SVM.
-__Decision tree__: Determines the class by branching on each element (feature) of the data. Not easily affected by outliers. Linear classification only. Low generalization ability.
-__Random forest__: Determines the class using decision trees built on random subsets of the data. Non-linear classification is also possible.
-__k-NN__: Extracts teacher data similar to the prediction data and outputs the most common class as the prediction result. Learning cost is 0. High prediction accuracy. It slows down as the amount of data increases.
-__Logistic regression__: Since the boundary line is straight, it can handle only linear classification. Generalization ability is low because the boundary line passes close to the data.
-Create the model with __LogisticRegression()__, learn with fit(), and predict with predict(). See the previous section for details. If you want to know the accuracy, use __model.score(test_X, test_y)__ (it takes the test data and the correct labels).
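-For example, continuing from the code block above (a minimal sketch reusing the model, test_X, and test_y defined there):

#Proportion of test samples whose predicted class matches the correct label
print(model.score(test_X,test_y))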
-Here, the model's prediction result is shown on a graph (scatter plot) and visualized by color coding.
・(Review) Creating a scatter plot: plt.scatter(x-axis data, y-axis data, c=[list], marker="marker type", cmap="color scheme")
-__np.meshgrid(x, y)__, which appears below, expands the coordinate vectors x and y into coordinate matrices covering the grid.
-(The code below assumes that X and y are the two-feature Iris data from the earlier split and that model has been fitted on them, e.g. model = LogisticRegression().fit(train_X, train_y).)
#Import plt to make a graph and np to get coordinates
import matplotlib.pyplot as plt
import numpy as np
#Create a scatter plot with plt.scatter (column 0 of X on the x-axis, column 1 on the y-axis)
plt.scatter(X[:,0],X[:,1],c=y,marker=".",cmap="cool",alpha=1.0)
#Specify the range of the x-axis (x1) and y-axis (x2) used below
x1_min,x1_max = X[:,0].min()-1, X[:,0].max()+1
x2_min,x2_max = X[:,1].min()-1, X[:,1].max()+1
#With np.meshgrid, store the x-coordinates of grid points spaced 0.02 apart in xx1 and the y-coordinates in xx2 (np.arange(min, max, step))
xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,0.02),np.arange(x2_min,x2_max,0.02))
#Predict with the model over the grid of coordinates (xx1, xx2) and draw the result with plt.contourf
Z=model.predict(np.array([xx1.ravel(),xx2.ravel()]).T).reshape(xx1.shape)
plt.contourf(xx1,xx2,Z,alpha=0.4,cmap="Wistia")
#Set the graph range, title, label name, and grid and output
plt.xlim(xx1.min(), xx1.max())
plt.ylim(xx2.min(), xx2.max())
plt.title("classification data using LogisticRegression")
plt.xlabel("Sepal length")
plt.ylabel("Petal length")
plt.grid(True)
plt.show()
-__Linear SVM__: The boundary line is straight → linear classification only. High generalization ability. Learning/prediction is slow.
-SVM stands for "support vector machine". Support vectors are the data points closest to the other class, and because the boundary is drawn so that the distance (margin) to these points is as large as possible, the model generalizes well.
-A linear SVM can be implemented with __LinearSVC()__, imported with from sklearn.svm import LinearSVC. Otherwise it can be used in the same way as logistic regression.
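-A minimal sketch (the data creation just reuses the make_classification pattern from earlier; the parameter values are examples):

from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

#Create 2-class data and split it with the holdout method
X,y = make_classification(n_samples=50,n_classes=2,n_features=2,n_redundant=0,random_state=0)
train_X,test_X,train_y,test_y = train_test_split(X,y,random_state=0)

#Model creation, learning, and prediction are the same as with LogisticRegression
model = LinearSVC(random_state=0)
model.fit(train_X,train_y)
print(model.score(test_X,test_y))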
-__Non-linear SVM__: Converts a non-linear classification problem into a linear one and processes it as a linear SVM.
-A "kernel function" is used for this conversion.
-For a non-linear SVM, use __SVC()__, imported with from sklearn.svm import SVC. Otherwise it is the same as logistic regression.
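-A minimal sketch; make_circles is used here (just as an example) to produce data that a straight line cannot separate:

from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

#Concentric-circle data: not linearly separable
X,y = make_circles(n_samples=50,noise=0.1,factor=0.5,random_state=0)
train_X,test_X,train_y,test_y = train_test_split(X,y,random_state=0)

#SVC uses the RBF kernel by default, so it can learn a non-linear boundary
model = SVC(random_state=0)
model.fit(train_X,train_y)
print(model.score(test_X,test_y))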
-__Decision tree__: Determines the class by branching on each element (feature) of the data. It is not easily affected by outliers. Linear classification only. Low generalization ability.
-Use __DecisionTreeClassifier()__, imported with from sklearn.tree import DecisionTreeClassifier.
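-A minimal sketch in the same pattern (parameter values are examples):

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X,y = make_classification(n_samples=50,n_classes=2,n_features=3,n_redundant=0,random_state=0)
train_X,test_X,train_y,test_y = train_test_split(X,y,random_state=0)

#The tree branches on feature values to decide the class
model = DecisionTreeClassifier(random_state=0)
model.fit(train_X,train_y)
print(model.score(test_X,test_y))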
-__Random forest__: Creates multiple decision trees from random subsets of the data and outputs the class predicted by the largest number of trees. It is one form of ensemble learning. Non-linear classification is also possible.
-Use __RandomForestClassifier()__, imported with from sklearn.ensemble import RandomForestClassifier.
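-Only the model line changes from the decision-tree sketch above (train_X and the other variables are reused from it; n_estimators=100 is just an example value, and is also scikit-learn's default):

from sklearn.ensemble import RandomForestClassifier

#An ensemble of decision trees; the majority class among the trees becomes the prediction
model = RandomForestClassifier(n_estimators=100,random_state=0)
model.fit(train_X,train_y)
print(model.score(test_X,test_y))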
-__k-NN__: Extracts the k teacher data points most similar to the prediction data and outputs the most common class among them as the prediction result. Learning cost is 0. High prediction accuracy. As the amount of data increases, accuracy can drop and prediction slows down.
-Use __KNeighborsClassifier()__, imported with from sklearn.neighbors import KNeighborsClassifier.
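-A minimal sketch, again reusing the data split from the decision-tree example (n_neighbors=5 is scikit-learn's default, shown explicitly here):

from sklearn.neighbors import KNeighborsClassifier

#Classify by majority vote among the k nearest training samples; fit() essentially just stores the data
model = KNeighborsClassifier(n_neighbors=5)
model.fit(train_X,train_y)
print(model.score(test_X,test_y))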
-Supervised learning (classification) learns from labeled data and predicts the classification of unknown data based on it.
-To create data suited for classification, import and use the make_classification() method.
-You can also call and use the sample datasets provided in the scikit-learn library (sklearn) without creating data yourself (e.g. the Iris dataset).
-Models that learn and predict boundaries include logistic regression, linear SVM, non-linear SVM, decision tree, random forest, and k-NN, each with its own characteristics.
That's all for this time. Thank you for reading this far.