A simple Python implementation of the k-nearest neighbor method (k-NN)

Overview of the k-nearest neighbor method

The k-nearest neighbor method (k-NN) is a supervised machine learning algorithm used for classification problems. Its name is easily confused with k-means, but k-means is used for clustering in unsupervised learning. The algorithm itself is very simple: compute the distance between the data point you want to classify and the existing data points, then decide the class by a majority vote among the k closest points. For example, when k = 1, the point is simply assigned the class of its single nearest neighbor. The idea is easier to understand with a figure.

[Figure: KnnClassification.svg] An example of the k-nearest neighbor method. The sample (green circle) is classified into either the first class (blue squares) or the second class (red triangles). If k = 3, the neighbors are the objects inside the inner circle, so the sample is assigned to the second class (more red triangles). If k = 5, the result is reversed. Source: Wikipedia

From the above, the main features of the k-nearest neighbor method are:

- It is a supervised learning algorithm used for classification (unlike k-means, which does unsupervised clustering).
- Training amounts to storing the data: all of the work happens at prediction time, when distances to the stored points are computed.
- A query point is classified by a majority vote among its k nearest neighbors, so the result depends on the choice of k.
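
To make the idea concrete before the full implementation, here is a minimal sketch of the k = 1 case using made-up points (a toy example, not part of the implementation below): the query simply receives the label of its single closest training point.

import numpy as np

# Toy data (hypothetical): two training points with known labels
X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array(['a', 'b'])
query = np.array([0.9, 0.8])

# k = 1: the predicted label is that of the nearest training point
nearest = np.argmin(np.linalg.norm(X - query, axis=1))
print(y[nearest])  # -> 'b'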

Implementation of k-nearest neighbor method (Python)

Easy implementation

import numpy as np
import scipy.spatial.distance as distance
import scipy.stats as stats


class knn:
    def __init__(self, k):
        self.k = k
        self._fit_X = None   # stores the training data points
        self.classes = None  # unique class labels
        self._y = None
    def fit(self, X, label):
        # X holds the training data points, shape (data_num, feature_num)
        print("original data:\n", X)
        print("label:\n", label)
        self._fit_X = X
        # Extract the unique classes from the labels and express each label as an index into self.classes.
        # self.classes[self.label_indices] reconstructs the original labels, which is why the option is called return_inverse.
        self.classes, self.label_indices = np.unique(label, return_inverse=True)
        print("classes:\n", self.classes)
        print("label_indices:\n", self.label_indices)
        print("classes[label_indices]Check if it is restored with:\n", self.classes[self.label_indices])
    def neighbors(self, Y):
        # Y holds the points to predict (there can be several), shape (test_num, feature_num)
        # Compute the distance from every test point to every training point: test_num * data_num distances in total.
        dist = distance.cdist(Y, self._fit_X)
        print("Distance between test data and original data:\n", dist)
        # dist has shape (test_num, data_num):
        # [[1.41421356 1.11803399 2.6925824  2.23606798]   distances from test point 1 to each training point
        #  [3.         2.6925824  1.80277564 1.41421356]   distances from test point 2 to each training point
        #  [3.31662479 3.20156212 1.11803399 1.41421356]]  distances from test point 3 to each training point
        
        # After measuring the distances, find the indices of the k nearest points.
        # argpartition splits the array into the k smallest elements and the rest.
        # argsort would give the full distance ranking, but k-NN does not need the ordering
        # within the neighbors, so argpartition is enough and cheaper
        # (a standalone demonstration of argpartition follows after the class definition).
        neigh_ind = np.argpartition(dist, self.k)
        # neigh_ind has shape (test_num, data_num).
        # Result of argpartition on the dist above with k = 2:
        # In the first row, indices 1 and 0 are the two nearest, corresponding to distances 1.118 and 1.414.
        # In the second row, indices 3 and 2 are the two nearest, corresponding to distances 1.414 and 1.803.
        #[[1 0 3 2]
        # [3 2 1 0]
        # [2 3 1 0]]
        # Keep only the first k indices.
        neigh_ind = neigh_ind[:, :self.k]
        # neigh_ind now has shape (test_num, self.k):
        # [[1 0]   indices of the training points closest to test point 1
        #  [3 2]   indices of the training points closest to test point 2
        #  [2 3]]  indices of the training points closest to test point 3
        return neigh_ind
    def predict(self, Y):
        # neighbors() returns the indices of the k nearest training points, shape (test_num, self.k)
        print("test data:\n",Y)
        neigh_ind = self.neighbors(Y)
        # stats.mode finds the most frequent value; the second return value (ignored here) is its count.
        # self.label_indices is [0 0 1 1], the label index of each training point.
        # neigh_ind lists, for each test point, the indices of its nearest training points, shape (test_num, k).
        # self.label_indices[neigh_ind] therefore gives the labels of the neighbors of each test point:
        # [[0 0]   labels of the training points closest to test point 1
        #  [1 1]   labels of the training points closest to test point 2
        #  [1 1]]  labels of the training points closest to test point 3
        # Taking the mode along the rows (axis=1) gives the label index each test point belongs to.
        mode, _ = stats.mode(self.label_indices[neigh_ind], axis=1)
        # Because the mode is taken along axis=1, mode has shape (test_num, 1), so flatten it with ravel()
        # (newer SciPy versions may return shape (test_num,) by default; ravel handles both):
        # [[0]
        #  [1]
        #  [1]]
        # Note that np.intp is the integer type NumPy uses for indexing.
        mode = np.asarray(mode.ravel(), dtype=np.intp)
        print("Mode of each label index in test data:\n",mode)
        # Convert from label indices back to label names; self.classes.take(mode) is the same as self.classes[mode].
        result = self.classes.take(mode)
        return result
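
The class above leans on two NumPy functions, np.unique(..., return_inverse=True) and np.argpartition, which can be checked in isolation. The following is a small standalone demonstration with made-up values (separate from the class, just to see what the calls return):

import numpy as np

# np.unique with return_inverse=True returns the unique classes plus, for every
# label, its index into the unique array, so classes[indices] rebuilds the labels.
classes, indices = np.unique(['a', 'a', 'b', 'b'], return_inverse=True)
print(classes)           # ['a' 'b']
print(indices)           # [0 0 1 1]
print(classes[indices])  # ['a' 'a' 'b' 'b']

# np.argsort gives the full ordering by distance; np.argpartition only guarantees
# that the k smallest elements come first, which is cheaper and is all k-NN needs.
d = np.array([1.41, 1.12, 2.69, 2.24])
k = 2
print(np.argsort(d))              # [1 0 3 2] (full ranking)
print(np.argpartition(d, k)[:k])  # indices of the two smallest distances (their order is not guaranteed)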

Running a prediction

K = knn(k=2)
# Set the training data and labels
samples = [[0., 0., 0.], [0., .5, 0.], [1., 2., -2.5],[1., 2., -2.]]
label = ['a','a','b', 'b']
K.fit(samples, label)
# Data points to predict
Y = [[1., 1., 0.],[2, 2, -1],[1, 1, -3]]
p = K.predict(Y)
print("result:\n", p)

Execution result

>>result
original data:
 [[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [1.0, 2.0, -2.5], [1.0, 2.0, -2.0]]
label:
 ['a', 'a', 'b', 'b']
classes:
 ['a' 'b']
label_indices:
 [0 0 1 1]
check that classes[label_indices] restores the labels:
 ['a' 'a' 'b' 'b']
test data:
 [[1.0, 1.0, 0.0], [2, 2, -1], [1, 1, -3]]
Distance between test data and original data:
 [[1.41421356 1.11803399 2.6925824  2.23606798]
 [3.         2.6925824  1.80277564 1.41421356]
 [3.31662479 3.20156212 1.11803399 1.41421356]]
Mode of each label index in test data:
 [0 1 1]
result:
 ['a' 'b' 'b']
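
As a sanity check (not part of the original post), the same data can be passed to scikit-learn's KNeighborsClassifier, which should give the same labels, assuming scikit-learn is installed and its default settings are used:

from sklearn.neighbors import KNeighborsClassifier

samples = [[0., 0., 0.], [0., .5, 0.], [1., 2., -2.5], [1., 2., -2.]]
label = ['a', 'a', 'b', 'b']
Y = [[1., 1., 0.], [2, 2, -1], [1, 1, -3]]

clf = KNeighborsClassifier(n_neighbors=2)
clf.fit(samples, label)
print(clf.predict(Y))  # expected: ['a' 'b' 'b'], matching the result above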

References

k-nearest neighbors algorithm - Wikipedia (source of the figure above)
