The k-nearest neighbor (k-NN) algorithm is a supervised machine learning method used for classification problems. It has a very similar name to k-means, but k-means is used for clustering in unsupervised learning. The k-nearest neighbor algorithm itself is very simple: calculate the distance between the point you want to classify and the existing data, and determine the class by majority vote among the k closest data points. For example, when k = 1, the point is simply assigned the class of the single nearest data point. It is easier to understand with a figure.

Figure (quoted from Wikipedia): an example of the k-nearest neighbor method. The test sample (green circle) is classified into either the first class (blue squares) or the second class (red triangles). If k = 3, the objects in the inner circle are its neighbors, so it is classified into the second class (more red triangles); but if k = 5, the result reverses.
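As a toy illustration of this decision rule (with made-up points and labels, not the data used later in this article), the whole idea fits in a few lines:

import numpy as np
from collections import Counter

train = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
labels = ['blue', 'blue', 'red', 'red']
query = np.array([1.8, 2.1])

dist = np.linalg.norm(train - query, axis=1)  # Euclidean distance to each training point
nearest = np.argsort(dist)[:3]                # indices of the k=3 closest points
vote = Counter(labels[i] for i in nearest)    # majority vote among the neighbors
print(vote.most_common(1)[0][0])              # -> 'red'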
From the above, the features of the k-nearest neighbor method are as follows.
Easy to implement
import numpy as np
import scipy.spatial.distance as distance
import scipy.stats as stats
class knn:
    def __init__(self, k):
        self.k = k
        self._fit_X = None   # stores the training data
        self.classes = None  # unique class labels
        self._y = None       # not used in this example
    def fit(self, X, label):
        # X holds the original data points, shape (data_num, feature_num)
        print("original data:\n", X)
        print("label:\n", label)
        self._fit_X = X
        # Extract the classes from the label data and build an array of
        # indices into them. self.classes[self.label_indices] restores the
        # original labels, hence the name return_inverse.
        self.classes, self.label_indices = np.unique(label, return_inverse=True)
        print("classes:\n", self.classes)
        print("label_indices:\n", self.label_indices)
        print("check that classes[label_indices] restores the labels:\n", self.classes[self.label_indices])
    def neighbors(self, Y):
        # Y holds the points to predict (possibly several), shape (test_num, feature_num)
        # The distance between every test point and every training point is
        # needed, i.e. test_num * data_num distances
        dist = distance.cdist(Y, self._fit_X)
        print("Distance between test data and original data:\n", dist)
        # dist has shape (test_num, data_num):
        # [[1.41421356 1.11803399 2.6925824  2.23606798]   distances from test point 1 to each training point
        #  [3.         2.6925824  1.80277564 1.41421356]   distances from test point 2 to each training point
        #  [3.31662479 3.20156212 1.11803399 1.41421356]]  distances from test point 3 to each training point
        # After measuring the distances, find the indices of the k closest points.
        # argpartition splits the data into the k smallest elements and the rest.
        # argsort would give the full distance ranking, but k-NN does not need
        # the ranking, so argpartition is enough (see the short demo after the
        # class definition).
        neigh_ind = np.argpartition(dist, self.k)
        # neigh_ind has shape (test_num, data_num).
        # Result of argpartition on the dist above with k = 2:
        # e.g. in the first row, indices 1 and 0 are the two smallest elements,
        # which correspond to the distances 1.118 and 1.414 above; in the second
        # row, indices 3 and 2 are the two smallest, i.e. 1.414 and 1.803.
        # [[1 0 3 2]
        #  [3 2 1 0]
        #  [2 3 1 0]]
        # Keep only the first k columns
        neigh_ind = neigh_ind[:, :self.k]
        # neigh_ind now has shape (test_num, self.k):
        # [[1 0]   indices of the training points closest to test point 1
        #  [3 2]   indices of the training points closest to test point 2
        #  [2 3]]  indices of the training points closest to test point 3
        return neigh_ind
    def predict(self, Y):
        print("test data:\n", Y)
        # neigh_ind holds the indices of the k nearest points, shape (test_num, self.k)
        neigh_ind = self.neighbors(Y)
        # self.label_indices is [0 0 1 1]: the label index of each training point.
        # neigh_ind lists the training-point indices near each test point,
        # shape (test_num, k), so self.label_indices[neigh_ind] gives the
        # labels near each test point:
        # [[0 0]   labels of the training points closest to test point 1
        #  [1 1]   labels of the training points closest to test point 2
        #  [1 1]]  labels of the training points closest to test point 3
        # Take the mode along the rows (axis=1) and use it as the predicted
        # label index of each test point; _ receives the counts.
        mode, _ = stats.mode(self.label_indices[neigh_ind], axis=1)
        # Because the mode was taken along axis=1, mode has shape (test_num, 1)
        # (or (test_num,) in newer SciPy versions), so flatten it with ravel():
        # [[0]
        #  [1]
        #  [1]]
        # np.intp is the integer type NumPy uses for indexing
        mode = np.asarray(mode.ravel(), dtype=np.intp)
        print("Mode of each label index in test data:\n", mode)
        # Convert from label indices back to label names; same as self.classes[mode]
        result = self.classes.take(mode)
        return result
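The class above leans on three library helpers: np.unique with return_inverse, np.argpartition, and stats.mode. A short standalone demo of each (on toy arrays unrelated to the data above) makes their behavior concrete:

import numpy as np
import scipy.stats as stats

# np.unique with return_inverse=True: the sorted unique classes, plus an
# index array that restores the original labels via classes[indices]
classes, indices = np.unique(['a', 'a', 'b', 'b'], return_inverse=True)
print(classes)           # ['a' 'b']
print(indices)           # [0 0 1 1]
print(classes[indices])  # ['a' 'a' 'b' 'b']

# np.argsort fully ranks the values; np.argpartition only guarantees that
# the k smallest come first (in unspecified order), which is all k-NN needs
d = np.array([3.0, 1.0, 2.0, 0.5])
print(np.argsort(d))              # [3 1 2 0] -- full ranking
print(np.argpartition(d, 2)[:2])  # indices of the two smallest, in some order

# stats.mode along axis=1: the most frequent value in each row
m, _ = stats.mode(np.array([[0, 0], [1, 1], [0, 1]]), axis=1)
print(m.ravel())  # [0 1 0] (ties resolve to the smallest value)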
Try a prediction
K = knn(k=2)
#Set the original data and label
samples = [[0., 0., 0.], [0., .5, 0.], [1., 2., -2.5],[1., 2., -2.]]
label = ['a','a','b', 'b']
K.fit(samples, label)
#Data you want to predict
Y = [[1., 1., 0.],[2, 2, -1],[1, 1, -3]]
p = K.predict(Y)
print("result:\n", p)
Execution result
original data:
[[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [1.0, 2.0, -2.5], [1.0, 2.0, -2.0]]
label:
['a', 'a', 'b', 'b']
classes:
['a' 'b']
label_indices:
[0 0 1 1]
check that classes[label_indices] restores the labels:
['a' 'a' 'b' 'b']
test data:
[[1.0, 1.0, 0.0], [2, 2, -1], [1, 1, -3]]
Distance between test data and original data:
[[1.41421356 1.11803399 2.6925824 2.23606798]
[3. 2.6925824 1.80277564 1.41421356]
[3.31662479 3.20156212 1.11803399 1.41421356]]
Mode of each label index in test data:
[0 1 1]
result:
['a' 'b' 'b']
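As a sanity check (assuming scikit-learn is installed; it is not used anywhere else in this article), the same data can be fed to sklearn.neighbors.KNeighborsClassifier with n_neighbors=2, and it should produce the same predictions:

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=2)
clf.fit(samples, label)
print(clf.predict(Y))  # expected: ['a' 'b' 'b'], matching the result above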