When you plot the training data on a plane and want to classify some test point t, you label t with the most common label (the mode) among the K points closest to t. That is the K-nearest neighbor method. (The method is not limited to planes, but I will use a plane here because it is easy to visualize.) It is a little hard to explain in words, so I will borrow a diagram from Wikipedia.
Extracted from Wikipedia
The green point is labeled based on the K points closest to it. What I want you to pay attention to here is the variable K, the number of neighbors. In the figure above, if K = 3, the neighbors are 2 red points and 1 blue point, so the green point is labeled red. If K = 5, the neighbors are 2 red points and 3 blue points, so the green point is labeled blue.
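To make this concrete, here is a minimal runnable sketch using scikit-learn's KNeighborsClassifier. The coordinates below are made up to mimic the figure (they are not taken from it), but they reproduce the same flip in the label as K changes:

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2-D points mimicking the Wikipedia figure:
# 'red' and 'blue' training points, one 'green' test point
X_train = [[2.0, 3.0], [0.9, 2.0], [0.5, 2.0], [3.6, 2.0],   # red
           [2.0, 0.8], [3.3, 2.0], [2.0, 3.4]]               # blue
y_train = ['red', 'red', 'red', 'red', 'blue', 'blue', 'blue']
green = [[2.0, 2.0]]  # the test point to classify

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.predict(green))  # K=3 -> ['red'], K=5 -> ['blue']
```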
- In a case like the one above, where the red and blue points are reasonably well separated, the K-nearest neighbor method works well. Conversely, if the red and blue points are not well separated and the data is heavily mixed, the K-nearest neighbor method is not a good choice.
- Furthermore, if you choose an even number for K, the two labels can receive the same number of votes and the point cannot be classified cleanly, so be sure to make K odd.
- Also, as you increase K, if for example there are far more red points than blue points, the probability of being classified as red becomes disproportionately large. So you need to pay attention to the ratio of red and blue points (see the sketch after this list).
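Here is a small made-up sketch of that last caveat: a tight cluster of 3 blue points is surrounded by 30 red points, and once K exceeds the size of the blue cluster, the red majority outvotes it even for a test point sitting right inside the cluster:

```python
import math
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: a small 'blue' cluster surrounded by many 'red' points
X = [[5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]   # 3 blue points
y = ['blue', 'blue', 'blue']
for i in range(30):                         # 30 red points on a ring of radius 2
    angle = 2 * math.pi * i / 30
    X.append([5 + 2 * math.cos(angle), 5 + 2 * math.sin(angle)])
    y.append('red')

# A small K respects the local blue cluster; a large K lets the
# red majority outvote it even right inside the blue cluster
for k in (3, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.predict([[5.0, 5.05]]))    # -> 3 ['blue'], 15 ['red']
```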
```python
from sklearn.neighbors import KNeighborsClassifier

KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto',
                     leaf_size=30, p=2, metric='minkowski',
                     metric_params=None, n_jobs=1, **kwargs)
```
Perhaps the most important parameter is the n_neighbors mentioned above, and the code below searches for the optimal value of K.
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Make a list of candidate values for K
myList = list(range(1, 50))

# Drop the even numbers so the list contains only odd values of K
neighbors = list(filter(lambda x: x % 2 != 0, myList))

# Make an empty list for the cross-validation scores
cv_scores = []

# Cross-validate for each K and append the mean score to the list above
# (assumes X_train and y_train are already defined)
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
```
This code is also an excerpt from kevinzakka's blog.
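As a possible follow-up (this part is my own addition, not from the excerpt): since neighbors is a list, you can read off the K with the best mean score once the loop finishes.

```python
# Pick the K whose mean cross-validation accuracy is highest
optimal_k = neighbors[cv_scores.index(max(cv_scores))]
print("The optimal number of neighbors is", optimal_k)
```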
weights: a weighting parameter. The default is 'uniform', which treats all K neighbors equally. If you set it to 'distance', nearby points are weighted more heavily than distant points. You can also define a function yourself and pass it in.
algorithm: I'm sorry, I did not fully understand this one because I lack the mathematical background for terms like Euclidean space. For people like me, though, you can specify 'auto' and the most suitable algorithm will be chosen for you. For reference, the other options are 'ball_tree', 'kd_tree', and 'brute', so if you are interested, please look them up. Also, if you find an easy-to-understand explanation, please let me know in the comments.
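For example, both parameters can be passed directly to the constructor (a minimal sketch using only the documented option values):

```python
from sklearn.neighbors import KNeighborsClassifier

# Closer neighbors count more than distant ones
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')

# Force a specific search structure instead of letting 'auto' decide
knn_kdtree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
```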
Those are the main parameters. I will add more as soon as I understand them better. If you already understand them, edit requests are also welcome. Thank you in advance.
- Good points
The more data you have, the more accurate the classification is likely to be. It is a simple and easy-to-understand model.
- Bad points
I mentioned two weaknesses in the "What is the K-nearest neighbor method" section above, but to summarize them again in one sentence: if multiple classes are heavily mixed together, or if the class ratio is heavily skewed, the classification may not work well.
That is the outline of the K-nearest neighbor method as far as I understand it. I will keep updating this article, so if you have something to add or fix, I would appreciate your comments.