The k-nearest neighbor (k-NN) algorithm is a supervised machine learning method used for classification problems. It has a very similar name to k-means, but k-means is used for clustering in unsupervised learning. The k-nearest neighbor algorithm itself is very simple: calculate the distance between the point you want to classify and the existing data, and determine the class by majority vote among the k closest data points. For example, when k = 1, the point is simply assigned the class of the single nearest data point. It is easier to understand with a figure.

Figure (quoted from Wikipedia): an example of the k-nearest neighbor method. The test sample (green circle) is classified into either the first class (blue squares) or the second class (red triangles). If k = 3, the objects in the inner circle are its neighbors, so it is classified into the second class (more red triangles); but if k = 5, the result reverses.
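As a toy illustration of this decision rule (with made-up points and labels, not the data used later in this article), the whole idea fits in a few lines:

import numpy as np
from collections import Counter

train = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
labels = ['blue', 'blue', 'red', 'red']
query = np.array([1.8, 2.1])

dist = np.linalg.norm(train - query, axis=1)  # Euclidean distance to each training point
nearest = np.argsort(dist)[:3]                # indices of the k=3 closest points
vote = Counter(labels[i] for i in nearest)    # majority vote among the neighbors
print(vote.most_common(1)[0][0])              # -> 'red'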
From the above, the features of the k-nearest neighbor method are as follows.
Easy to implement
import numpy as np
import scipy.spatial.distance as distance
import scipy.stats as stats
class knn:
    def __init__(self, k):
        self.k = k
        self._fit_X = None   # stores the training data
        self.classes = None  # unique class labels
        self._y = None       # not used in this example
    def fit(self, X, label):
        # X holds the original data points, shape (data_num, feature_num)
        print("original data:\n", X)
        print("label:\n", label)
        self._fit_X = X
        # Extract the classes from the label data and build an array of
        # indices into them. self.classes[self.label_indices] restores the
        # original labels, hence the name return_inverse.
        self.classes, self.label_indices = np.unique(label, return_inverse=True)
        print("classes:\n", self.classes)
        print("label_indices:\n", self.label_indices)
        print("check that classes[label_indices] restores the labels:\n", self.classes[self.label_indices])
    def neighbors(self, Y):
        # Y holds the points to predict (possibly several), shape (test_num, feature_num)
        # The distance between every test point and every training point is
        # needed, i.e. test_num * data_num distances
        dist = distance.cdist(Y, self._fit_X)
        print("Distance between test data and original data:\n", dist)
        # dist has shape (test_num, data_num):
        # [[1.41421356 1.11803399 2.6925824  2.23606798]   distances from test point 1 to each training point
        #  [3.         2.6925824  1.80277564 1.41421356]   distances from test point 2 to each training point
        #  [3.31662479 3.20156212 1.11803399 1.41421356]]  distances from test point 3 to each training point
        # After measuring the distances, find the indices of the k closest points.
        # argpartition splits the data into the k smallest elements and the rest.
        # argsort would give the full distance ranking, but k-NN does not need
        # the ranking, so argpartition is enough (see the short demo after the
        # class definition).
        neigh_ind = np.argpartition(dist, self.k)
        # neigh_ind has shape (test_num, data_num).
        # Result of argpartition on the dist above with k = 2:
        # e.g. in the first row, indices 1 and 0 are the two smallest elements,
        # which correspond to the distances 1.118 and 1.414 above; in the second
        # row, indices 3 and 2 are the two smallest, i.e. 1.414 and 1.803.
        # [[1 0 3 2]
        #  [3 2 1 0]
        #  [2 3 1 0]]
        # Keep only the first k columns
        neigh_ind = neigh_ind[:, :self.k]
        # neigh_ind now has shape (test_num, self.k):
        # [[1 0]   indices of the training points closest to test point 1
        #  [3 2]   indices of the training points closest to test point 2
        #  [2 3]]  indices of the training points closest to test point 3
        return neigh_ind
    def predict(self, Y):
        print("test data:\n", Y)
        # neigh_ind holds the indices of the k nearest points, shape (test_num, self.k)
        neigh_ind = self.neighbors(Y)
        # self.label_indices is [0 0 1 1]: the label index of each training point.
        # neigh_ind lists the training-point indices near each test point,
        # shape (test_num, k), so self.label_indices[neigh_ind] gives the
        # labels near each test point:
        # [[0 0]   labels of the training points closest to test point 1
        #  [1 1]   labels of the training points closest to test point 2
        #  [1 1]]  labels of the training points closest to test point 3
        # Take the mode along the rows (axis=1) and use it as the predicted
        # label index of each test point; _ receives the counts.
        mode, _ = stats.mode(self.label_indices[neigh_ind], axis=1)
        # Because the mode was taken along axis=1, mode has shape (test_num, 1)
        # (or (test_num,) in newer SciPy versions), so flatten it with ravel():
        # [[0]
        #  [1]
        #  [1]]
        # np.intp is the integer type NumPy uses for indexing
        mode = np.asarray(mode.ravel(), dtype=np.intp)
        print("Mode of each label index in test data:\n", mode)
        # Convert from label indices back to label names; same as self.classes[mode]
        result = self.classes.take(mode)
        return result
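The class above leans on three library helpers: np.unique with return_inverse, np.argpartition, and stats.mode. A short standalone demo of each (on toy arrays unrelated to the data above) makes their behavior concrete:

import numpy as np
import scipy.stats as stats

# np.unique with return_inverse=True: the sorted unique classes, plus an
# index array that restores the original labels via classes[indices]
classes, indices = np.unique(['a', 'a', 'b', 'b'], return_inverse=True)
print(classes)           # ['a' 'b']
print(indices)           # [0 0 1 1]
print(classes[indices])  # ['a' 'a' 'b' 'b']

# np.argsort fully ranks the values; np.argpartition only guarantees that
# the k smallest come first (in unspecified order), which is all k-NN needs
d = np.array([3.0, 1.0, 2.0, 0.5])
print(np.argsort(d))              # [3 1 2 0] -- full ranking
print(np.argpartition(d, 2)[:2])  # indices of the two smallest, in some order

# stats.mode along axis=1: the most frequent value in each row
m, _ = stats.mode(np.array([[0, 0], [1, 1], [0, 1]]), axis=1)
print(m.ravel())  # [0 1 0] (ties resolve to the smallest value)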
Try a prediction
K = knn(k=2)
#Set the original data and label
samples = [[0., 0., 0.], [0., .5, 0.], [1., 2., -2.5],[1., 2., -2.]]
label = ['a','a','b', 'b']
K.fit(samples, label)
#Data you want to predict
Y = [[1., 1., 0.],[2, 2, -1],[1, 1, -3]]
p = K.predict(Y)
print("result:\n", p)
Execution result
original data:
[[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [1.0, 2.0, -2.5], [1.0, 2.0, -2.0]]
label:
['a', 'a', 'b', 'b']
classes:
['a' 'b']
label_indices:
[0 0 1 1]
check that classes[label_indices] restores the labels:
['a' 'a' 'b' 'b']
test data:
[[1.0, 1.0, 0.0], [2, 2, -1], [1, 1, -3]]
Distance between test data and original data:
[[1.41421356 1.11803399 2.6925824 2.23606798]
[3. 2.6925824 1.80277564 1.41421356]
[3.31662479 3.20156212 1.11803399 1.41421356]]
Mode of each label index in test data:
[0 1 1]
result:
['a' 'b' 'b']
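As a sanity check (assuming scikit-learn is installed; it is not used anywhere else in this article), the same data can be fed to sklearn.neighbors.KNeighborsClassifier with n_neighbors=2, and it should produce the same predictions:

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=2)
clf.fit(samples, label)
print(clf.predict(Y))  # expected: ['a' 'b' 'b'], matching the result above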