[PYTHON] Try implementing k-NN on your own

Motivation

While brushing up on my machine learning knowledge, I revisited one of the most basic methods, k-NN (the k-nearest neighbor method), and implemented it myself.

What is the k-NN method?

First, let's review the outline of k-NN. It is a very simple, intuitive algorithm, and the figure below captures all of it.

(figure: knn.jpg)

The point of interest is the central star. For this point, first compute its distance to every point in the training data. When k = 3, look at the labels of the 3 closest points: here there are two of class 1 and one of class 2, so the point is assigned to class 1, the mode. On the other hand, when k = 6, there are two of class 1 and four of class 2, so the point is classified as class 2. In short, a new point is labeled by a majority vote over the labels of its most similar neighbors. Incidentally, the case k = 1 is also called the nearest neighbor method.

A similarly named technique is the k-means method (https://ja.wikipedia.org/wiki/K%E5%B9%B3%E5%9D%87%E6%B3%95), but that is an unsupervised clustering method, unlike k-NN, which is a supervised classification method.
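To make the majority vote concrete, here is a tiny toy illustration (my own sketch, separate from the implementation below):

from collections import Counter

def majority_vote(nearest_labels):
    # the most common label among the k nearest neighbours wins
    return Counter(nearest_labels).most_common(1)[0][0]

print(majority_vote([1, 1, 2]))            # k = 3 case -> 1
print(majority_vote([1, 1, 2, 2, 2, 2]))   # k = 6 case -> 2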

Implementation

Now, let's actually implement it in Python.

Dataset

We will use the famous Iris Data Set. Download iris.data and load it.

knn.py


# load the four numeric features and the class labels separately
data = np.loadtxt('iris.data', delimiter=',', dtype='float64', usecols=[0, 1, 2, 3])
labels = np.loadtxt('iris.data', delimiter=',', dtype='str', usecols=[4])
print(data.shape)
print(labels.shape)

(150, 4)
(150,)

This dataset consists of 150 samples, each with 4 features and a label (3 classes).
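As a quick sanity check (my addition, not part of the original code), np.unique lists the three classes:

print(np.unique(labels))

['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']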

Nearest neighbor search

This time, we will use the following libraries.

knn.py


import numpy as np
from sklearn import model_selection
from scipy.spatial import distance
from sklearn.metrics import accuracy_score
from statistics import mean, median, variance

First, compute the distance from each test point to every training point.

knn.py


# distance from every test point to every training point
# (test_data and train_data come from the split shown in the Run section)
distance_matrix = distance.cdist(test_data, train_data)
# training-point indices sorted by increasing distance, one row per test point
indexes = np.argsort(distance_matrix, axis=1)

Next, for each test point, count how many of its k nearest neighbors belong to each class.

knn.py


# template dict holding a zero count for every class
class_dict = {}
for label in labels:
    class_dict[label] = 0
class_dict_list = []
# tally the labels of the k nearest training points for each test point
# (self.k and self._labels belong to the classifier class; see the sketch below)
for data_num, index in enumerate(indexes[:, :self.k]):
    class_dict_list.append(class_dict.copy())
    for i in index:
        class_dict_list[data_num][self._labels[i]] += 1

Finally, take the most common class as the predicted label.

knn.py


predict_class = []
for d in class_dict_list:
    # majority vote: the class with the highest count wins
    max_class = max(d, key=d.get)
    predict_class.append(max_class)

That's all for the k-NN algorithm itself.
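The snippets above reference self.k and self._labels, which suggests they live inside a classifier class. Below is a minimal self-contained sketch of how they might fit together; the class name KNN and its fit/predict interface are my own choices, not necessarily the author's (the full original code is on GitHub, linked at the end).

import numpy as np
from scipy.spatial import distance

class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, train_data, train_labels):
        # k-NN is a lazy learner: "fitting" just stores the training set
        self._data = train_data
        self._labels = train_labels

    def predict(self, test_data):
        # distance from every test point to every stored training point
        distance_matrix = distance.cdist(test_data, self._data)
        # training-point indices sorted by increasing distance
        indexes = np.argsort(distance_matrix, axis=1)
        predict_class = []
        for index in indexes[:, :self.k]:
            # count the labels of the k nearest neighbours and take the mode
            counts = {}
            for i in index:
                counts[self._labels[i]] = counts.get(self._labels[i], 0) + 1
            predict_class.append(max(counts, key=counts.get))
        return np.array(predict_class)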

Run

Here, each class is randomly split in half into training and test data, and the train-and-evaluate cycle is repeated 20 times. A sketch of this loop and the results of an actual run follow. (Since the split is random, the accuracy will vary slightly from run to run.)
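The driver loop itself isn't shown in the article, so the following is a hedged sketch of it: model_selection.train_test_split with stratify= is my assumption for "each class is randomly divided in half", the value k = 3 is also an assumption, and KNN is the class sketched above.

import numpy as np
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from statistics import mean, variance

accuracies = []
for n in range(20):
    print('training number', n + 1, '...')
    # stratify=labels splits every class 50/50 between train and test
    train_data, test_data, train_labels, test_labels = \
        model_selection.train_test_split(data, labels, test_size=0.5,
                                         stratify=labels)
    knn = KNN(k=3)  # k = 3 is an assumption; the article doesn't state its k
    knn.fit(train_data, train_labels)
    predictions = knn.predict(test_data)
    acc = accuracy_score(test_labels, predictions)
    print('knn accuracy :', acc)
    accuracies.append(acc)

print('=================================================')
print('knn accuracy mean :', mean(accuracies))
print('knn accuracy variance :', variance(accuracies))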

training number 1 ...
knn accuracy : 0.9466666666666667
training number 2 ...
knn accuracy : 0.9333333333333333
training number 3 ...
knn accuracy : 0.9466666666666667
training number 4 ...
knn accuracy : 0.9466666666666667
training number 5 ...
knn accuracy : 0.9333333333333333
training number 6 ...
knn accuracy : 0.92
training number 7 ...
knn accuracy : 0.9466666666666667
training number 8 ...
knn accuracy : 0.9466666666666667
training number 9 ...
knn accuracy : 0.8933333333333333
training number 10 ...
knn accuracy : 0.9466666666666667
training number 11 ...
knn accuracy : 0.96
training number 12 ...
knn accuracy : 0.96
training number 13 ...
knn accuracy : 0.96
training number 14 ...
knn accuracy : 0.96
training number 15 ...
knn accuracy : 0.92
training number 16 ...
knn accuracy : 0.96
training number 17 ...
knn accuracy : 0.92
training number 18 ...
knn accuracy : 0.9866666666666667
training number 19 ...
knn accuracy : 0.9333333333333333
training number 20 ...
knn accuracy : 0.96
=================================================
knn accuracy mean : 0.944
knn accuracy variance : 0.00042292397660818664

As you can see, the method achieves high accuracy.

Impressions

The algorithm is simple and was very easy to implement. It works well when, as with this dataset, the dimensionality is low and the number of samples is small. In practice, however, the curse of dimensionality (https://ja.wikipedia.org/wiki/%E6%AC%A1%E5%85%83%E3%81%AE%E5%91%AA%E3%81%84) makes the Euclidean distance nearly meaningless for high-dimensional data, and the method is impractical in both time and space complexity for large datasets. To address these problems, methods that hash and quantize vectors have been proposed to reduce memory consumption and computational cost. You can check the actual source code on GitHub: https://github.com/kotaYkw/machine_learning
