[PYTHON] Supervised learning 1 Basics of supervised learning (classification)

Aidemy 2020/9/24

Introduction

Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I am studying at the AI-specialized school "Aidemy". I am summarizing what I learn there on Qiita to share it with you. I am very happy that so many people read my previous summary articles. Thank you! This is the first post on supervised learning. Thank you for reading.

What to learn this time
・ Overview of supervised learning
・ Types of supervised learning (classification)

About supervised learning (classification)

What is supervised learning (classification)?

・ As a premise, supervised learning is a method in which the model is given training data together with the correct answers (teacher data) and is trained until its answers match them. The purpose is to predict unknown data with the trained model.
・ Supervised learning can be divided into "classification problems" and "regression problems". This time we look at classification problems.
・ A classification problem means __"learning from data that has been divided into categories and predicting the category (a discrete value) of unknown data"__. Examples include recognizing handwritten digits from 0 to 9, identifying what appears in an image, predicting the author of a sentence, and identifying gender from a face photograph.

・ Classification problems are divided into __"binary classification" and "multiclass classification"__.
・ Binary classification decides whether or not the data belongs to one group, as in gender recognition. In some cases the two classes can be separated by a straight line (linear classification).
・ Multiclass classification has three or more classes to choose from, as in digit recognition.

Machine learning flow

・ Data preprocessing → Algorithm selection → Model learning → Model prediction

・ In supervised learning (classification), a classification algorithm is chosen in the "Algorithm selection" step.
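The four steps of the flow can be sketched roughly as follows. This is only a minimal sketch, assuming the Iris dataset, StandardScaler as the preprocessing step, and LogisticRegression as the chosen algorithm; the flow itself does not prescribe any of these choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Data preprocessing: split the data, then standardize the features
X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(train_X)  # fit the scaler on the training data only
train_X_std = scaler.transform(train_X)
test_X_std = scaler.transform(test_X)

# 2. Algorithm selection: pick a classification algorithm
model = LogisticRegression(max_iter=1000)

# 3. Model learning
model.fit(train_X_std, train_y)

# 4. Model prediction
pred_y = model.predict(test_X_std)
print(pred_y[:10])
```

The later sections go through the individual steps (creating data, building a model, and the available algorithms) one at a time.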

Create data

・ To create data suitable for classification, import and use the make_classification() function.
・ __X, y = make_classification(n_samples=number of data, n_classes=number of classes (default: 2), n_features=number of features, n_redundant=number of redundant features, random_state=random seed)__
・ Note that the number of informative features (n_informative, default: 2) plus n_redundant must not exceed n_features.

from sklearn.datasets import make_classification
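# Create 50 samples with 3 classes and 2 features (both informative, no redundant features), seed 0.
# With only 2 informative features, 3 classes need n_clusters_per_class=1; otherwise scikit-learn raises a ValueError.
X,y=make_classification(n_samples=50,n_classes=3,n_features=2,n_informative=2,n_redundant=0,n_clusters_per_class=1,random_state=0)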
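To see what make_classification returns, the generated arrays can be inspected as in the sketch below. The argument combination here (n_informative=2, n_redundant=0, n_clusters_per_class=1) is an assumption chosen because scikit-learn accepts it for 3 classes with only 2 features.

```python
import numpy as np
from sklearn.datasets import make_classification

# 50 samples, 3 classes, 2 informative features, no redundant features
X, y = make_classification(n_samples=50, n_classes=3, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)

print(X.shape)       # (50, 2): one row per sample, one column per feature
print(y.shape)       # (50,): one class label per sample
print(np.unique(y))  # [0 1 2]: the three class labels
```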

Get sample data

・ You can also load the sample datasets provided with the scikit-learn library (sklearn).

# Load the Iris dataset, a sample dataset of iris flowers.

# Import modules (datasets to get the Iris data, and train_test_split from sklearn.model_selection to use the holdout method)
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np

#Get Iris data
iris=datasets.load_iris()

# Split into training data and test data (holdout method, test ratio 30%)
X=iris.data[:,[0,2]]  # Columns 0 and 2 of the Iris features ("sepal length" and "petal length")
                      # (= learning data)
y=iris.target         # Iris class labels (= teacher data giving the correct variety)

train_X,test_X,train_y,test_y = train_test_split(X,y,test_size=0.3,random_state=0)
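As a quick check of the holdout split, the sketch below prints the feature names, the class names, and the sizes of the resulting arrays. It repeats the same loading and splitting steps so that it runs on its own.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]  # sepal length and petal length
y = iris.target

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

print(iris.feature_names)           # names of the four feature columns
print(iris.target_names)            # names of the three iris varieties
print(train_X.shape, test_X.shape)  # (105, 2) (45, 2): 70% / 30% of 150 samples
```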

Model building

・ The thing that learns and predicts is called a model. With scikit-learn you can call a ready-made model, much as you would use ready-made parts in Ruby on Rails, and have it learn and predict.
・ Creating a model: __model = Model()__
・ Learning: __model.fit(training data, training teacher data)__
・ Prediction: __model.predict(data)__

# Import a model called LogisticRegression.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
# Create data (50 samples, 2 classes, 3 features) and split it into train and test
X,y = make_classification(n_samples=50,n_classes=2,n_features=3,n_redundant=0,random_state=0)
train_X,test_X,train_y,test_y = train_test_split(X,y,random_state=0)
# Create the model, train it, and predict
model = LogisticRegression(random_state=0)
model.fit(train_X,train_y)
pred_y = model.predict(test_X)
# Output from the author's run; the length and values depend on n_samples and the split settings
print(pred_y) #[1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 1 1 1]
print(test_y) #[1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1]

Classification method (make boundaries between classes)

Six types to handle this time

・ __Logistic regression__: The boundary is a straight line, so it handles linear classification only. Generalization ability tends to be low.
・ __Linear SVM__: The boundary is a straight line, so it handles linear classification only. High generalization ability. Learning and prediction tend to be slow.
・ __Non-linear SVM__: Converts a non-linear classification problem into a linear one and handles it as a linear SVM.
・ __Decision tree__: Decides the class by examining the data one feature at a time. It is not easily affected by outliers. Because each split is parallel to a feature axis, it cannot draw a diagonal boundary and struggles with data that needs one. It also tends to fit the training data too closely (poor generalization).
・ __Random forest__: Builds decision trees on randomly sampled data and decides the class from their combined results. Non-linear classification is also possible.
・ __k-NN__: Extracts the teacher data most similar to the data to be predicted and outputs the most common class among them as the prediction. Learning cost is essentially 0. Prediction accuracy is often high. It slows down as the amount of data increases.

Logistic regression

・ Since the boundary is a straight line, only linear classification can be handled. The generalization ability is low because the boundary tends to lie close to the data.
・ The model is created with __LogisticRegression()__, trained with fit(), and used for prediction with predict(). See the previous section for details. If you want to know the accuracy (correct answer rate), use __model.score(test_X, test_y)__, which takes the test data and its correct labels.

・ Here, the prediction result of the model is drawn as a graph (scatter plot) and visualized by color.
・ (Review) Creating a scatter plot: plt.scatter(x-axis data, y-axis data, c=[list], marker="marker type", cmap="color map")
・ __np.meshgrid(x, y)__, which appears below, is a function that takes the x-coordinates and y-coordinates and returns coordinate matrices covering every grid point formed by them.
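The accuracy can be obtained in two equivalent ways, as the minimal sketch below shows. It assumes the same generated data as in the model building section; accuracy_score from sklearn.metrics is used here only for comparison.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50, n_classes=2, n_features=3, n_redundant=0, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

model = LogisticRegression(random_state=0)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)

# score() predicts internally, so it takes the test features and the correct labels
acc_from_score = model.score(test_X, test_y)
# accuracy_score() compares the correct labels with predictions made beforehand
acc_from_metric = accuracy_score(test_y, pred_y)
print(acc_from_score, acc_from_metric)  # the two values are the same
```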

# Import plt to draw the graph and np to build the coordinate grid
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
# Prepare the data and the model to plot (assumption: the two Iris features from "Get sample data", matching the axis labels)
iris = datasets.load_iris()
X = iris.data[:,[0,2]]
y = iris.target
model = LogisticRegression(random_state=0)
model.fit(X,y)
# Create a scatter plot with plt.scatter (column 0 of X on the x-axis, column 1 on the y-axis)
plt.scatter(X[:,0],X[:,1],c=y,marker=".",cmap="cool",alpha=1.0)
# Set the ranges of the x-axis (x1) and y-axis (x2) used for the grid below
x1_min,x1_max = X[:,0].min()-1, X[:,0].max()+1
x2_min,x2_max = X[:,1].min()-1, X[:,1].max()+1
# With np.meshgrid, store the x-coordinates of the grid points spaced 0.02 apart in xx1 and the y-coordinates in xx2 (np.arange(minimum, maximum, step))
xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,0.02),np.arange(x2_min,x2_max,0.02))
# Predict with the model for every grid point (xx1, xx2) and draw the result with plt.contourf
Z=model.predict(np.array([xx1.ravel(),xx2.ravel()]).T).reshape(xx1.shape)
plt.contourf(xx1,xx2,Z,alpha=0.4,cmap="Wistia")
# Set the graph range, title, axis labels, and grid, then show the graph
plt.xlim(xx1.min(), xx1.max())
plt.ylim(xx2.min(), xx2.max())
plt.title("classification data using LogisticRegression")
plt.xlabel("Sepal length")
plt.ylabel("Petal length")
plt.grid(True)
plt.show()
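The np.meshgrid call used in the plot can be hard to picture, so here is a miniature version with three x values and two y values. The input values are arbitrary and chosen only for illustration.

```python
import numpy as np

xx1, xx2 = np.meshgrid(np.array([0, 1, 2]), np.array([10, 20]))

print(xx1)
# [[0 1 2]
#  [0 1 2]]
print(xx2)
# [[10 10 10]
#  [20 20 20]]

# ravel() flattens each matrix; stacking and transposing gives one (x, y) pair per row,
# which is the shape that model.predict() expects
grid_points = np.array([xx1.ravel(), xx2.ravel()]).T
print(grid_points.shape)  # (6, 2)
```

Each row of grid_points is one grid point, so predicting on it and reshaping the result back to xx1.shape gives a class label for every position in the plot area.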

Linear SVM

・ The boundary is a straight line, so only linear classification can be handled. High generalization ability. Learning and prediction tend to be slow.
・ SVM stands for "support vector machine". Support vectors are the data points of each class that lie closest to the other class. The boundary is drawn so that the distance to these points is as large as possible, which makes the model generalize easily.

・ Linear SVM can be implemented with __LinearSVC()__, imported with from sklearn.svm import LinearSVC. Other than that, it is used in the same way as logistic regression.
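A minimal sketch of a linear SVM follows, assuming the two Iris features used earlier. LinearSVC itself does not expose the support vectors, so the sketch also fits SVC(kernel="linear"), a different implementation of a linear SVM, purely to look at them; max_iter=10000 is an arbitrary value chosen to give the solver room to converge.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]  # sepal length and petal length
y = iris.target
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

# Same create / fit / score steps as logistic regression
model = LinearSVC(max_iter=10000, random_state=0)
model.fit(train_X, train_y)
accuracy = model.score(test_X, test_y)
print(accuracy)

# SVC with a linear kernel keeps the support vectors it found
svc = SVC(kernel="linear").fit(train_X, train_y)
print(svc.support_vectors_.shape)  # (number of support vectors, 2)
```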

Non-linear SVM

・ Converts a non-linear classification problem into a linear one and handles it as a linear SVM.
・ A "kernel function" is used for this conversion.
・ For non-linear SVM, use __SVC()__, imported with from sklearn.svm import SVC. Other than that, it is used in the same way as logistic regression.
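The difference from a linear SVM shows up on data that a straight line cannot separate. The sketch below assumes make_circles (two concentric rings) as such a dataset and the RBF kernel, which is the default kernel of SVC; both choices are for illustration only.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Two concentric rings: no straight line separates the classes
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

linear_model = LinearSVC(max_iter=10000, random_state=0).fit(train_X, train_y)
rbf_model = SVC(kernel="rbf").fit(train_X, train_y)

linear_acc = linear_model.score(test_X, test_y)
rbf_acc = rbf_model.score(test_X, test_y)
print("Linear SVM:", linear_acc)
print("Non-linear SVM (RBF kernel):", rbf_acc)
```

On this kind of data the kernel SVM should score clearly higher than the linear one.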

Decision tree

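・ Decides the class by examining the data one feature at a time. It is not easily affected by outliers. Because each split is parallel to a feature axis, it cannot draw a diagonal boundary and struggles with data that needs one. It also tends to fit the training data too closely (poor generalization).
・ Use __DecisionTreeClassifier()__, imported with from sklearn.tree import DecisionTreeClassifier.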
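The tendency to fit the training data too closely can be observed by comparing training and test accuracy. This is a rough sketch on generated data; max_depth=2 is an arbitrary value used only to show how the depth of the tree can be limited.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_classes=2, n_features=2, n_redundant=0, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# Without a depth limit the tree keeps splitting until every training sample is classified correctly
deep_tree = DecisionTreeClassifier(random_state=0).fit(train_X, train_y)
deep_train_acc = deep_tree.score(train_X, train_y)
deep_test_acc = deep_tree.score(test_X, test_y)
print("deep tree:", deep_train_acc, deep_test_acc)

# Limiting the depth gives a simpler set of boundaries
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(train_X, train_y)
print("shallow tree:", shallow_tree.score(train_X, train_y), shallow_tree.score(test_X, test_y))
print("depth:", shallow_tree.get_depth())
```

A gap between the training accuracy and the test accuracy of the unlimited tree is the sign of poor generalization mentioned above.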

Random forest

・ Creates multiple decision trees from randomly sampled data and outputs the class chosen by the most trees as the result. It is a form of ensemble learning. Non-linear classification is also possible.
・ Use __RandomForestClassifier()__, imported with from sklearn.ensemble import RandomForestClassifier.
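A minimal sketch of a random forest follows, assuming make_moons as a non-linear dataset and 50 trees (n_estimators=50, an arbitrary choice). Note that scikit-learn combines the trees by averaging their predicted class probabilities rather than by counting hard votes, which usually leads to the same kind of result as the majority vote described above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Two interleaving half circles: not separable by a straight line
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train_X, train_y)

print(len(model.estimators_))  # 50: the individual decision trees
accuracy = model.score(test_X, test_y)
print(accuracy)
```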

k-NN

・ Extracts the k teacher data points most similar to the data to be predicted and outputs the most common class among them as the prediction. Learning cost is essentially 0 because the model only stores the teacher data. Prediction accuracy is often high. As the amount of data increases, prediction becomes slower.
・ Use __KNeighborsClassifier()__, imported with from sklearn.neighbors import KNeighborsClassifier.
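How k-NN picks a class can be followed by hand on a tiny dataset. The one-dimensional toy data below is made up for illustration, and k=3 is an arbitrary choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy teacher data: small values belong to class 0, large values to class 1
train_X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])

model = KNeighborsClassifier(n_neighbors=3)
model.fit(train_X, train_y)  # "learning" only stores the teacher data

pred = model.predict([[1.5], [10.5]])
print(pred)  # [0 1]

# The 3 teacher data points nearest to 1.5 are the ones at 0.0, 1.0 and 2.0, all of class 0
distances, indices = model.kneighbors([[1.5]])
print(indices)
```

Because every prediction searches the stored teacher data for neighbors, prediction time grows with the amount of data.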

Summary

・ Supervised learning (classification) learns from labeled data and predicts the class of new data based on it.
・ To create data suitable for classification, import and use the make_classification() function.
・ Instead of creating data, you can load and use the sample datasets provided with the scikit-learn library (sklearn), for example the Iris dataset.
・ Models that learn and predict class boundaries include logistic regression, linear SVM, non-linear SVM, decision tree, random forest, and k-NN, each of which has its own characteristics.

That's all for this time. Thank you for reading this far.
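Since all six models share the same fit / predict / score interface, they can be compared in one loop. The sketch below assumes a small generated dataset and mostly default settings, so the printed accuracies are only indicative and will differ on other data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_classes=2, n_features=2, n_redundant=0, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

models = {
    "Logistic regression": LogisticRegression(random_state=0),
    "Linear SVM": LinearSVC(max_iter=10000, random_state=0),
    "Non-linear SVM": SVC(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(),
}

# Train each model on the same split and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(train_X, train_y)
    scores[name] = model.score(test_X, test_y)
    print(f"{name}: {scores[name]:.2f}")
```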
