Classification by the k-nearest neighbor method (kNN) in Python ([High School Information Department Information II] teaching materials for teacher training)

Introduction

The k-nearest neighbor method (kNN) is a simple machine learning algorithm: it predicts the value of a new point by taking a majority vote among the k training samples closest to it. The following figure gives an intuitive picture.

In the figure, the smiley mark is the point whose value we want to predict, and the circle shows the neighborhood when k = 3. In this case ◆ wins the majority vote, so ◆ is the predicted value for the smiley mark.

[Figure: kNN neighborhood around the point to predict, with k = 3]
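The idea itself fits in a few lines. As a minimal from-scratch sketch (my own addition, not part of the teaching material), kNN is just a distance computation plus a majority vote:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from x to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest training samples
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]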

In this article, I rewrite the classification-by-kNN part of the teaching material, which is implemented in R, in Python.

Teaching materials

[High School Information Department "Information II" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/mext_00742.html) Chapter 3 Information and Data Science, Second Half (PDF: 7.6 MB)

Environment

Part of the teaching materials covered

Learning 15 "Prediction by classification": "3. Classification by the k-nearest neighbor method"

Data handled this time

As in the teaching material, download the digit-recognizer data from Kaggle. The file used here is "train.csv".

https://www.kaggle.com/c/digit-recognizer/data

Implementation example and results in Python

Reading training data and test data

train.csv stores 42,000 handwritten digits. Each row holds one digit: the first column (label) is the correct label (the digit that was written), and the remaining 784 columns (pixel) hold the 256-level grayscale values (0-255) of the 28 x 28 pixels.

Here we use the first 1,000 rows as training data and the next 100 rows as test data.

import numpy as np
import pandas as pd
from IPython.display import display

mnist = pd.read_csv('/content/train.csv')

mnist_light = mnist.iloc[:1000, :]           # first 1,000 rows for training
mnist_light_test = mnist.iloc[1000:1100, :]  # next 100 rows for testing

# Training data
Y_mnist_light = mnist_light[['label']].values.ravel()
#display(Y_mnist_light)
X_mnist_light = mnist_light.drop('label', axis=1)
#display(X_mnist_light)

# Test data
Y_mnist_light_test = mnist_light_test[['label']].values.ravel()
#display(Y_mnist_light_test)
X_mnist_light_test = mnist_light_test.drop('label', axis=1)
#display(X_mnist_light_test)
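As a quick sanity check (my own addition, not in the teaching material), the shapes of the splits can be confirmed:

# Expect (1000, 784) / (1000,) for training and (100, 784) / (100,) for test
print(X_mnist_light.shape, Y_mnist_light.shape)
print(X_mnist_light_test.shape, Y_mnist_light_test.shape)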

Training the model and predicting on the test data

We train a k-nearest neighbor classifier with k = 3 on the training data, then obtain predictions for the 100 test samples. Comparing the predictions with the labels (correct values) of the test data gives the accuracy.

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score

# Use sklearn.neighbors.KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_mnist_light, Y_mnist_light)

# Run the prediction
pred_y = knn.predict(X_mnist_light_test)
display(pred_y)

# Compare with the correct labels
result_compare = pred_y == Y_mnist_light_test
display(result_compare)

mnist_accuracy_score = accuracy_score(Y_mnist_light_test, pred_y)

# Accuracy
print(mnist_accuracy_score)

The execution result is as follows.

array([1, 5, 1, 7, 9, 8, 9, 5, 7, 4, 7, 2, 8, 1, 4, 3, 8, 6, 2, 7, 2, 6,
       7, 8, 1, 8, 8, 1, 9, 0, 9, 4, 6, 6, 8, 2, 3, 5, 4, 5, 4, 1, 3, 7,
       1, 5, 0, 0, 9, 5, 5, 7, 6, 8, 2, 8, 4, 2, 3, 6, 2, 8, 0, 2, 4, 7,
       3, 4, 4, 5, 4, 3, 3, 1, 5, 1, 0, 2, 2, 2, 9, 5, 1, 6, 6, 9, 4, 1,
       7, 2, 2, 0, 7, 0, 6, 8, 0, 5, 7, 4])
array([ True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True, False,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])
0.89

The accuracy is 0.89.
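Beyond the single accuracy figure, scikit-learn can also break the result down per digit. This is an optional addition of mine, not part of the teaching material:

# Per-digit precision, recall, and F1 score
print(metrics.classification_report(Y_mnist_light_test, pred_y))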

Checking the misrecognized handwritten digits

Among the test data, the fifth sample was misrecognized. Let us look at the actual digit images to check.

import matplotlib.pyplot as plt

# Display the first 10 test digit images (2 rows x 5 columns)
fig, axes = plt.subplots(2, 5)
fig.subplots_adjust(left=0, right=1, bottom=0, top=1.0, hspace=0.1, wspace=0.1)
for i in range(2):
    for j in range(5):
        # Reshape the 784 pixel values back into a 28 x 28 image
        axes[i, j].imshow(X_mnist_light_test.values[i*5+j].reshape((28, 28)), cmap='gray')
        axes[i, j].set_xticks([])
        axes[i, j].set_yticks([])
plt.show()

The execution result is as follows.

[Figure: the first 10 test digit images]

The handwritten digit at the far right of the top row has the label (correct value) 4, but its predicted value was misrecognized as 9. Visually, it is indeed a digit that could be taken for a 9.
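Rather than spotting the mistakes by eye, all misclassified samples can be listed programmatically. A small sketch (my own addition), reusing result_compare from above:

# Indices where the prediction disagrees with the correct label
for idx in np.where(~result_compare)[0]:
    print(f'index {idx}: label {Y_mnist_light_test[idx]} -> predicted {pred_y[idx]}')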

Confusion matrix and change in accuracy when varying k

First, the confusion matrix is displayed: a table of counts whose rows are the correct labels and whose columns are the predicted values.

from sklearn.metrics import confusion_matrix

# Rows: correct labels, columns: predicted values
cfm = confusion_matrix(Y_mnist_light_test, pred_y)

print(cfm)

The execution result is as follows.

[[ 7  0  0  0  0  0  0  0  0  0]
 [ 0 10  0  0  0  0  0  0  0  0]
 [ 0  0 13  0  0  0  0  1  1  0]
 [ 0  0  0  5  0  1  0  0  0  0]
 [ 0  0  0  0 11  0  0  0  0  1]
 [ 0  0  0  0  0 10  0  0  0  0]
 [ 0  0  0  0  0  0  9  0  0  0]
 [ 1  0  0  0  0  0  0 10  0  1]
 [ 0  0  0  2  0  0  0  0 10  1]
 [ 0  1  0  0  1  0  0  0  0  4]]
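The raw counts are easier to scan as a heatmap. One possible way to draw one (my own addition, not in the teaching material):

# Visualize the confusion matrix as a heatmap
plt.imshow(cfm, cmap='Blues')
plt.colorbar()
plt.xlabel('predicted label')
plt.ylabel('correct label')
plt.xticks(range(10))
plt.yticks(range(10))
plt.show()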

Next, to see which value of k is suitable, we vary k and plot the accuracy for each value.

n_neighbors_chg_list = []

# Accuracy as n_neighbors (k) is varied from 1 to 99
for i in range(1, 100):
    # Use sklearn.neighbors.KNeighborsClassifier
    knn_temp = KNeighborsClassifier(n_neighbors=i)
    knn_temp.fit(X_mnist_light, Y_mnist_light)

    # Run the prediction
    pred_y_temp = knn_temp.predict(X_mnist_light_test)

    # Accuracy
    mnist_accuracy_score_temp = accuracy_score(Y_mnist_light_test, pred_y_temp)

    # Store in the list
    n_neighbors_chg_list.append(mnist_accuracy_score_temp)

# Plot accuracy against k (k = 1, ..., 99)
plt.plot(range(1, 100), n_neighbors_chg_list)
plt.show()

The execution result is as follows.

[Figure: accuracy plotted against k]

In general, larger values of k make the result less sensitive to individual outliers, so the effect of noise is reduced, but class boundaries tend to become less sharp. The appropriate value of k depends on the amount of training data and other factors; in this trial, the accuracy tended to decrease as k increased.
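If you want to read the best value off this run, the list collected above already holds it. Note that this reflects only this particular train/test split (my own addition):

# Index i in the list corresponds to k = i + 1
best_k = int(np.argmax(n_neighbors_chg_list)) + 1
print(best_k, n_neighbors_chg_list[best_k - 1])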

Source code

https://gist.github.com/ereyester/01237a69f6b8ae73c55ccca33c931ade
