2. Multivariate analysis spelled out in Python 8-1. K-nearest neighbor method (scikit-learn)

2_8_1_01.PNG

The k-nearest neighbor method can also be used for regression, but here we work through a classification case.
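To make the idea concrete before turning to scikit-learn, here is a minimal from-scratch sketch of k-NN classification: measure the distance from a new point to every training point, take the k closest, and let them vote. The function name `knn_predict` and the toy data are made up for illustration and are not part of the original article or of scikit-learn.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those k neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two small clusters labeled 0 and 1
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [4.0, 4.0], [4.2, 3.9], [3.8, 4.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))
print(knn_predict(X_toy, y_toy, np.array([4.1, 4.0]), k=3))
```

scikit-learn's KNeighborsClassifier, used in the rest of this article, applies the same majority-vote idea with faster neighbor searches.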

⑴ Import libraries

import numpy as np
import pandas as pd

from sklearn import datasets
# k-NN classifier from the sklearn.neighbors module
from sklearn.neighbors import KNeighborsClassifier
# Utility for splitting data into training and test sets
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
# Class for building a color map from a list of colors
from matplotlib.colors import ListedColormap

# Japanese font support for matplotlib (the "!" prefix is Jupyter/Colab notebook syntax)
!pip install japanize-matplotlib
import japanize_matplotlib

Prepare the data

   Variable name  Meaning       Note                                  Data type
0  species        Species       Setosa=0, Versicolour=1, Virginica=2  int64
1  sepal length   Sepal length  Continuous quantity (cm)              float64
2  sepal width    Sepal width   Continuous quantity (cm)              float64
3  petal length   Petal length  Continuous quantity (cm)              float64
4  petal width    Petal width   Continuous quantity (cm)              float64
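As a quick way to check this layout, the following sketch loads the iris data into a pandas DataFrame and prints the column types. The shortened column names are chosen here to match the table and are an assumption; scikit-learn's own `feature_names` include the unit, for example "sepal length (cm)".

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Shortened column names chosen for illustration (assumption, not the dataset's own labels)
columns = ['sepal length', 'sepal width', 'petal length', 'petal width']
df = pd.DataFrame(iris.data, columns=columns)

# Put the class label (0, 1, 2) in the first column, as in the table
df.insert(0, 'species', iris.target)

print(df.head())
print(df.dtypes)
```

The exact integer width of the species column can depend on the platform, so treat the int64 in the table as the typical case.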

⑵ Load the data

iris = datasets.load_iris()
# Explanatory variables (features)
print("label:\n", iris.feature_names)
print("shape:\n", iris.data.shape)
print("First 10 rows:\n", iris.data[0:10, :])

# Objective variable (species)
print("label:\n", iris.target_names)
print("shape:\n", iris.target.shape)
print("Full display:\n", iris.target)

2_8_1_02.PNG

⑶ Split the data

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, 
    iris.target,
    stratify = iris.target, #Stratified sampling
    random_state = 0)
print("shape:", y_train.shape)

# Count the samples of each class in the training set
np.unique(y_train, return_counts=True)

2_8_1_03.PNG
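To see what stratified sampling changes, the sketch below repeats the split with and without the stratify argument and compares the per-class counts. The helper `class_counts` is a hypothetical name introduced only for this comparison.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

def class_counts(stratify):
    """Split the iris data and return per-class counts for train and test."""
    _, _, y_tr, y_te = train_test_split(
        iris.data, iris.target, stratify=stratify, random_state=0)
    return np.bincount(y_tr, minlength=3), np.bincount(y_te, minlength=3)

# With stratification the class proportions of the full data are preserved
train_strat, test_strat = class_counts(stratify=iris.target)
# Without it the counts are left to chance
train_plain, test_plain = class_counts(stratify=None)

print("stratified   train:", train_strat, "test:", test_strat)
print("unstratified train:", train_plain, "test:", test_plain)
```

Because test_size is left at its default of 0.25, the 150 samples are divided into 112 for training and 38 for testing in both cases; only the balance between classes differs.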

Determine the value of k

⑷ Run k-NN while varying k

# Lists to store the accuracy for each k
training_accuracy = []
test_accuracy = []

# Run k-NN for each k from 3 to 20 and record the accuracy
for k in range(3,21):
    # Create an instance with the given k and fit it to the training data
    kNN = KNeighborsClassifier(n_neighbors = k)
    kNN.fit(X_train, y_train)
    # score returns the accuracy; append it for the training and test sets
    training_accuracy.append(kNN.score(X_train, y_train))
    test_accuracy.append(kNN.score(X_test, y_test))

# Convert the accuracy lists to numpy arrays
training_accuracy = np.array(training_accuracy)
test_accuracy = np.array(test_accuracy)

⑸ Select the optimal k

# Training and test accuracy as k varies
plt.figure(figsize=(6, 4))

plt.plot(range(3,21), training_accuracy, label='Training')
plt.plot(range(3,21), test_accuracy, label='Test')

plt.xticks(np.arange(2, 21, 1)) # x-axis ticks
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.title('Accuracy as k varies')

plt.grid()
plt.legend()

# Absolute difference between training and test accuracy
plt.figure(figsize=(6, 4))

difference = np.abs(training_accuracy - test_accuracy) # Calculate the absolute difference
plt.plot(range(3,21), difference, label='Difference')

plt.xticks(np.arange(2, 21, 1)) # x-axis ticks
plt.xlabel('k')
plt.ylabel('|train - test|')
plt.title('Gap between training and test accuracy')

plt.grid()
plt.legend()

plt.show()

2_8_1_04.PNG
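Reading k off the plots can also be done programmatically. The sketch below is one possible rule, not the one used in this article: prefer the highest test accuracy, break ties by the smallest gap between training and test accuracy, then by the smallest k. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=0)

ks = np.arange(3, 21)
train_acc = np.empty(len(ks))
test_acc = np.empty(len(ks))
for i, k in enumerate(ks):
    knn = KNeighborsClassifier(n_neighbors=int(k)).fit(X_train, y_train)
    train_acc[i] = knn.score(X_train, y_train)
    test_acc[i] = knn.score(X_test, y_test)

gap = np.abs(train_acc - test_acc)

# Hypothetical selection rule (an assumption, not the article's method).
# np.lexsort uses the LAST key as the primary sort key:
# highest test accuracy first, then smallest gap, then smallest k.
order = np.lexsort((ks, gap, -test_acc))
best_k = int(ks[order[0]])
print("candidate k:", best_k)
```

Note that choosing k by looking at the test set tends to make the reported test accuracy optimistic, so a rule like this is best treated as a rough guide.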

Run and visualize k-NN

⑹ Re-run k-NN with the selected k

# Value of k chosen in step ⑸
k = 15

# Explanatory variables X and objective variable y.
# Only the first two features (sepal length and sepal width) are used so the result can be drawn in two dimensions.
X = iris.data[:, :2]
y = iris.target

# Create an instance and fit it to the data
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X, y)
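Once the model is fitted, it can classify a flower it has not seen. The sketch below rebuilds the same two-feature model and asks it about one made-up sample; the measurements are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = iris.data[:, :2]   # sepal length, sepal width
y = iris.target

model = KNeighborsClassifier(n_neighbors=15)
model.fit(X, y)

# Hypothetical new flower (made-up values): sepal length 5.0 cm, sepal width 3.5 cm
new_sample = np.array([[5.0, 3.5]])

pred = model.predict(new_sample)         # predicted class index
proba = model.predict_proba(new_sample)  # share of the 15 neighbors in each class

print("predicted class:", iris.target_names[pred[0]])
print("neighbor vote shares:", proba[0])
```

With the default uniform weights, predict_proba simply reports what fraction of the k neighbors belongs to each class, which is a handy way to see how confident the vote was.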

⑺ Plot the decision regions on a color-filled mesh

# Mesh spacing
h = 0.02

# Create the color maps
cmap_surface = ListedColormap(['darkseagreen', 'mediumpurple', 'gold']) # For the filled regions
cmap_dot = ListedColormap(['darkgreen', 'darkslateblue', 'olive']) # For the scatter plot

# Minimum and maximum values for the x and y axes, with a margin of 1
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
# Generate grid points at the specified mesh spacing
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict the class at every grid point
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape) # Reshape back to the grid shape

2_8_1_06.PNG
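The meshgrid, ravel, and reshape steps are easier to follow on a tiny grid. The sketch below uses made-up coordinates and a stand-in for the model's prediction (the sum of the two coordinates) purely to show how the shapes change.

```python
import numpy as np

# Tiny made-up axes: 3 x-values and 2 y-values
xs = np.arange(0, 3)      # [0, 1, 2]
ys = np.arange(10, 12)    # [10, 11]

# meshgrid returns two 2D arrays holding the x and y coordinate of every grid point
xx, yy = np.meshgrid(xs, ys)
print(xx.shape)   # (2, 3): one row per y-value, one column per x-value

# ravel flattens each array; np.c_ pairs them into one (x, y) row per grid point
grid = np.c_[xx.ravel(), yy.ravel()]
print(grid.shape)  # (6, 2)
print(grid)

# Stand-in for model.predict: any function returning one value per row
fake_prediction = grid[:, 0] + grid[:, 1]

# Reshape the flat result back to the grid so it can be drawn as an image
Z_demo = fake_prediction.reshape(xx.shape)
print(Z_demo)
```

The real code does the same thing with a much finer grid, which is why the prediction has to be flattened for the model and then reshaped for plotting.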

plt.figure(figsize=(6,5))

# Filled decision regions
plt.pcolormesh(xx, yy, Z, cmap=cmap_surface)
#Scatter plot
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_dot, s=30)

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xlabel('sepal length')
plt.ylabel('sepal width')

plt.show()

2_8_1_05.PNG

Afterword
