[PYTHON] Machine learning: the k-nearest neighbor method, its implementation, and more

Preface

This article is a memo I took while studying the videos of a great programmer, sentdex: https://www.youtube.com/user/sentdex

First of all, what is K-NN? According to Wikipedia: https://ja.wikipedia.org/wiki/K%E8%BF%91%E5%82%8D%E6%B3%95

The k-nearest neighbor method is one of the simplest of all machine learning algorithms. An object is classified by a majority vote of its neighbors (i.e., it is assigned the most common class among its k nearest neighbors). k is a positive integer, generally small. If k = 1, the object is simply assigned the same class as its nearest neighbor. In binary classification, choosing an odd k avoids the problem of tied votes.

Yeah, a very democratic method. First, look at the graph below. There are 6 data points. What if you want to group them?

Maybe it ends up like this. Why? Because the points are close to each other. What we just did is called clustering.

All right. This time there are already two groups, + and −.

Now I add a red dot. Which group does this red dot belong to? Probably the + group. But why? What went through your head when you decided that?

The next red dot will probably be in the − group. Again, why?

And the last red dot: if it lands here, which group does it belong to? That depends on the algorithm you choose.

Let's get back to the story. Probably (at least for me) you decided that this red dot is closer to the + group than to the − group, so the red dot must belong to the + group. This "idea" is called K-nearest neighbors (K-NN); in Japanese it is the k-nearest neighbor method. All K-NN has to do is measure the distance between the point we want to classify and the other points, and decide which group it joins. In the example above there are only 2 groups, but what if there are 100 points? 1000 points? Yeah, I'd lose my mind ...

But think twice. Do I really have to measure the distance to every single point? In most cases that isn't necessary. This is where the "K" in K-NN comes in. For now, let's set K = 2.

Yeah, I see. The two points circled in orange are the closest to the red dot, and both are +. Therefore, I judge that this red dot belongs to the + group.

This time a black dot appears. With K = 2, which two points are closest? Probably these two. But now there's a problem: it's a draw! I can't decide. This is why K-NN basically doesn't allow k to be even.

So let's set K = 3. With K = 3 we need one more point. Maybe this one is the next closest? So this black dot belongs to the − group.

So, in the vote above, there are two − and one +. 2/3 ≈ 66%, and this 66% is the confidence of the decision. The distance we measured is called the Euclidean distance.
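Just to make this concrete, here is a minimal sketch (variable names are my own, not from the original article) of how such a vote and its confidence could be computed in Python:

from collections import Counter

#labels of the k=3 nearest neighbors found above
votes = ['-', '-', '+']

#most_common(1) returns e.g. [('-', 2)]: the winning label and its count
winner, count = Counter(votes).most_common(1)[0]
confidence = count / len(votes)
print(winner, confidence)  #-> '-' 0.666...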

Implementation

Let's decide the flow first: prepare the data, preprocess it, split it into train and test sets, train the classifier, and verify with the test data.

Data preparation

Let's implement it using sklearn.

First of all, I need a dataset, and this is the one I will use this time: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29 (breast cancer data).

Data analysis

The downloaded files should look like this.

Open breast-cancer-wisconsin.names for a description of the data. The meaning of each column is described in section 7: columns 1 to 10 are the attributes (column 1 is just the sample id) and column 11 is the class label. The actual data is in breast-cancer-wisconsin.data, but the all-important header row is not included. You can add it programmatically, but for now I entered it manually ...
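By the way, if you would rather add the header programmatically instead of editing the file by hand, a minimal sketch would look like this (the shortened column names are my own, chosen to match the code below; section 7 of breast-cancer-wisconsin.names lists the full attribute names):

import pandas as pd

#shortened column names, in the order given in breast-cancer-wisconsin.names
columns = ['id', 'clump_thickness', 'unif_cell_size', 'unif_cell_shape',
           'marg_adhesion', 'single_epith_cell_size', 'bare_nuclei',
           'bland_chromatin', 'normal_nucleoli', 'mitoses', 'class']

#the raw .data file has no header row, so pass the names explicitly
df = pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data',
                 names=columns)
print(df.head())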

Library Import


import numpy as np
from sklearn import preprocessing,model_selection,neighbors
import pandas as pd

Pre-process with pandas

The id column doesn't seem to have anything to do with classification, so drop it. The .names file says that missing values are marked with "?". Since NaN values are basically a no-go, we replace them with -99999 here (a large outlier value).

df=pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data')
df.replace('?',-99999,inplace=True)
df.drop(['id'],axis=1,inplace=True)

Prepare X and Y data

Simply put, x is the unlabeled data (the features) and y is the label data. So x gets the DataFrame with the "class" column dropped, and y gets only the "class" column.

#features
x=np.array(df.drop(['class'],axis=1))
#label
y=np.array(df['class'])

Separate Test and Train data

Split the data into 80% training data and 20% test data.

x_train,x_test,y_train,y_test=model_selection.train_test_split(x,y,test_size=0.2)

Prepare Classifier

Since we're using K-NN, the classifier is KNeighborsClassifier.

clf=neighbors.KNeighborsClassifier()

Train

clf.fit(x_train,y_train)

Verify with Test data

Yeah, it's about 96%. That's pretty good accuracy.

accuracy=clf.score(x_test,y_test)
print(accuracy)

>0.9642857142857143
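As a small aside, once the classifier is fitted you can also classify a brand-new sample with clf.predict. The feature values below are made up purely for illustration (9 values, one per feature column):

#hypothetical measurements for one new sample (made-up numbers)
example_measures=np.array([[4,2,1,1,1,2,3,2,1]])

prediction=clf.predict(example_measures)
print(prediction)  #e.g. [2] (benign) or [4] (malignant)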

Overall Code

import numpy as np
from sklearn import preprocessing,model_selection,neighbors
import pandas as pd

df=pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data')
df.replace('?',-99999,inplace=True)
df.drop(['id'],axis=1,inplace=True)

#features
x=np.array(df.drop(['class'],axis=1))
#label
y=np.array(df['class'])

x_train,x_test,y_train,y_test=model_selection.train_test_split(x,y,test_size=0.2)
clf=neighbors.KNeighborsClassifier()
clf.fit(x_train,y_train)
accuracy=clf.score(x_test,y_test)
print(accuracy)

Let's dig deep

From here on, I will write KNN myself without relying on sklearn. Let's look at the picture again first. What actually decides which group the black dot joins is distance, and this distance is called the "Euclidean distance".

The formula

The formula is as follows, where $n$ is the number of dimensions and $q$ and $p$ are the two points:

$$d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

For example, with q = (1, 3) and p = (2, 5): $\sqrt{(1-2)^2 + (3-5)^2} = \sqrt{5} \approx 2.236$.

Implementation

Let's write this formula in python first.

from math import sqrt

plot1=[1,3]
plot2=[2,5]

euclidean_distance=sqrt( (plot1[0]-plot2[0])**2 + (plot1[1]-plot2[1])**2 )

print(euclidean_distance)

Create a Function

Before creating the function, let's think through the idea. The code below first plots the dataset and the new data point. The function's arguments are as follows. data: the training data. predict: the point we want to classify (the test data). k: how many nearest neighbors to look at (sklearn's KNeighborsClassifier defaults to n_neighbors=5). And at the end, the function works some magic and produces the result!

Implementation

from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings

style.use('fivethirtyeight')

data_set={'k':[[1,2],[2,3],[3,1]]
          ,'r':[[6,5],[7,7],[8,6]]} 

new_features=[5,7]

[[plt.scatter(ii[0],ii[1],s=100,color=i) for ii in data_set[i]] for i in data_set]
plt.scatter(new_features[0],new_features[1])

def k_nearest_neighbors(data,predict,k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    
    #TODO: the "magic" happens here (filled in below)
    
    return vote_result


Think of a magical device

So what is this function really looking for? How do we compare one data point with the others and find the closest points? KNN has to compare the predict point with every single point in the dataset, and that is KNN's main weakness. You can certainly save computation time by setting a radius and only looking at the points inside it, as sketched below.
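Just to illustrate that radius idea (this is my own sketch, not part of the code that follows), you could pre-filter the candidates before voting:

import numpy as np

def neighbors_within_radius(data, predict, radius):
    #collect [distance, group] pairs only for points inside the given radius
    candidates = []
    for group in data:
        for features in data[group]:
            d = np.linalg.norm(np.array(features) - np.array(predict))
            if d <= radius:
                candidates.append([d, group])
    return candidates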

Implementation

  1. Calculate the Euclidean distance to every point in a for loop and append it, together with its group label, to a list.
  2. For example, [2.44,'r'], [1.4,'r'], [2.14,'k'], [4.4,'k'], [5.44,'k'] ...
  3. Sort that list by distance from smallest to largest.
  4. Keep the labels of the first k entries in a list called votes.
  5. Use Counter(votes).most_common(1) to extract the most frequent label; this returns something like [('r', 3)].
  6. Counter(votes).most_common(1)[0][0] then retrieves just 'r'.
  7. Finally, return the result.

from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings

style.use('fivethirtyeight')

data_set={'k':[[1,2],[2,3],[3,1]]
          ,'r':[[6,5],[7,7],[8,6]]} 

new_features=[5,7]

[[plt.scatter(ii[0],ii[1],s=100,color=i) for ii in data_set[i]] for i in data_set]
plt.scatter(new_features[0],new_features[1])

def k_nearest_neighbors(data,predict,k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    
    distances=[]
    for group in data:
        #data_set={'k':[[1,2],[2,3],[3,1]],'r':[[6,5],[7,7],[8,6]]} 
        #iterating  ^                       ^
        for features in data[group]:
            #data_set={'k':[[1,2],[2,3],[3,1]],'r':[[6,5],[7,7],[8,6]]} 
            #iterating      ^^^^^^^^^^^^^^^^^      ^^^^^^^^^^^^^^^^^
            
            #not fast and only works for 2 dimensions:
            #euclidean_distance=sqrt( (features[0]-predict[0])**2 + (features[1]-predict[1])**2 )
            
            #better (works for any number of dimensions):
            #euclidean_distance=np.sqrt( np.sum((np.array(features)-np.array(predict))**2) )
            
            euclidean_distance=np.linalg.norm( np.array(features)-np.array(predict) )
            distances.append([euclidean_distance,group])
    
    #labels of the k nearest points
    votes=[i[1] for i in sorted(distances)[:k]]
    
    print(Counter(votes).most_common(1))
    
    vote_result=Counter(votes).most_common(1)[0][0]
            
    return vote_result

result=k_nearest_neighbors(data_set,new_features,k=3)
print(result)
[[plt.scatter(ii[0],ii[1],s=100,color=i) for ii in data_set[i]] for i in data_set]
plt.scatter(new_features[0],new_features[1],color=result)
plt.show()

Yeah, it's the red group!

Let's go a little further

Now let's use real data and compare our implementation with scikit-learn to see which is more accurate.

  1. Convert the DataFrame to a list of lists with full_data = df.astype(float).values.tolist(), then shuffle it with random.shuffle(full_data) to randomize the order.
  2. Prepare empty train_set and test_set dictionaries. The keys 2 and 4 match the Class values in the breast cancer data (2 = benign, 4 = malignant).
  3. train_data = full_data[:-int(test_size*len(full_data))]: for example, if full_data has 100 rows and test_size is 0.2, then 100 * 0.2 = 20, so train_data = full_data[:-20] keeps the first 80 rows as training data.
  4. test_data = full_data[-int(test_size*len(full_data)):]: with the same example, test_data = full_data[-20:] keeps the last 20 rows as test data.
  5. The last for loop classifies every sample in test_set using train_set as the training data. If the sample's true group matches the vote returned by k_nearest_neighbors, correct is incremented by 1; total counts every sample. correct / total is then the accuracy.

Self-made function

from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random

style.use('fivethirtyeight')

def k_nearest_neighbors(data,predict,k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    
    distances=[]
    for group in data:
        #data_set={'k':[[1,2],[2,3],[3,1]],'r':[[6,5],[7,7],[8,6]]} 
        #iterating  ^                       ^
        for features in data[group]:
            #data_set={'k':[[1,2],[2,3],[3,1]],'r':[[6,5],[7,7],[8,6]]} 
            #iterating      ^^^^^^^^^^^^^^^^^      ^^^^^^^^^^^^^^^^^
            
            #not fast and only works for 2 dimensions:
            #euclidean_distance=sqrt( (features[0]-predict[0])**2 + (features[1]-predict[1])**2 )
            
            #better (works for any number of dimensions):
            #euclidean_distance=np.sqrt( np.sum((np.array(features)-np.array(predict))**2) )
            
            euclidean_distance=np.linalg.norm( np.array(features)-np.array(predict) )
            distances.append([euclidean_distance,group])
    
    #labels of the k nearest points
    votes=[i[1] for i in sorted(distances)[:k]]
    
    #print(Counter(votes).most_common(1))  #noisy when called for every test sample
    
    vote_result=Counter(votes).most_common(1)[0][0]
            
    return vote_result

df=pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data')
df.replace('?',-99999,inplace=True)
df.drop(['id'],1,inplace=True)
full_data=df.astype(float).values.tolist()

#random sort your data
random.shuffle(full_data)

test_size=0.2
train_set={2:[],4:[]}
test_set={2:[],4:[]}

train_data=full_data[:-int(test_size*len(full_data))]#first 80%
test_data=full_data[-int(test_size*len(full_data)):] #last 20%

for i in train_data:
    #train_set[i[-1]] is the class of this row: 2 or 4
    #.append(i[:-1]) appends every column except the class to that group's list
    train_set[i[-1]].append(i[:-1])

for i in test_data:
    test_set[i[-1]].append(i[:-1])
correct=0
total=0

for group in test_set:
    #loop the class'2' and '4'
    for data in test_set[group]:
        #take the data from set'2' and '4'
        vote=k_nearest_neighbors(train_set,data,k=5)
        if group == vote:
            correct+=1
        total+=1
        

print('Accuracy:',correct/total)

>Accuracy: 0.9640287769784173

sklearn Function

The sklearn code, as before.

import numpy as np
from sklearn import preprocessing,model_selection,neighbors
import pandas as pd

df=pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data')
df.replace('?',-99999,inplace=True)

df.drop(['id'],axis=1,inplace=True)

#features
x=np.array(df.drop(['class'],axis=1))
#label
y=np.array(df['class'])

x_train,x_test,y_train,y_test=model_selection.train_test_split(x,y,test_size=0.4)

clf=neighbors.KNeighborsClassifier()
clf.fit(x_train,y_train)

accuracy=clf.score(x_test,y_test)
print(accuracy)

>Accuracy: 0.9671428571428571

Conclusion

… The accuracy doesn't change much, but sklearn is clearly faster in computation time.
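If you want to check that speed claim yourself, one rough (and unscientific) way is to wrap both approaches in a timer, assuming the variables from the two scripts above are still in the same session:

import time

#time the sklearn classifier
start = time.time()
clf = neighbors.KNeighborsClassifier()
clf.fit(x_train, y_train)
clf.score(x_test, y_test)
print('sklearn:', time.time() - start, 'seconds')

#time the self-made function over the same kind of test loop
start = time.time()
correct, total = 0, 0
for group in test_set:
    for data in test_set[group]:
        if k_nearest_neighbors(train_set, data, k=5) == group:
            correct += 1
        total += 1
print('self-made:', time.time() - start, 'seconds')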

Finally

Well, there are still many other ways to classify data without using KNN. For example, suppose you have linearly separable data like this. You could draw a separating line first, and then a new red dot comes along. Could you classify it with some squared-error calculation against that line? Now suppose you have non-linear data like this instead. If you force a line onto it, it might look like this, which doesn't fit well. In that case, KNN can still tell you the group immediately.

It was hard going, but thank you for staying with me until the end! I look forward to seeing you again!
