When you plot the training data on a plane and want to classify some test point t, you label t with the most common label (the mode) among the K points closest to t. That is the K-nearest neighbor method. (The method is not limited to planes, but I will use a plane here because it is easy to visualize.) It is a little hard to explain in words, so I will borrow a diagram from Wikipedia.
Extracted from Wikipedia
The green point is labeled based on the K points closest to it. What I want you to pay attention to here is the variable K, the number of neighbors. In the figure above, if K = 3, the neighbors are 2 red points and 1 blue point, so the green point is labeled red. If K = 5, the neighbors are 2 red points and 3 blue points, so the green point is labeled blue.
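To make this concrete, here is a minimal runnable sketch using scikit-learn's KNeighborsClassifier. The coordinates below are made up to mimic the figure (they are not taken from it), but they reproduce the same flip in the label as K changes:

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2-D points mimicking the Wikipedia figure:
# 'red' and 'blue' training points, one 'green' test point
X_train = [[2.0, 3.0], [0.9, 2.0], [0.5, 2.0], [3.6, 2.0],   # red
           [2.0, 0.8], [3.3, 2.0], [2.0, 3.4]]               # blue
y_train = ['red', 'red', 'red', 'red', 'blue', 'blue', 'blue']
green = [[2.0, 2.0]]  # the test point to classify

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.predict(green))  # K=3 -> ['red'], K=5 -> ['blue']
```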
- In a case like the one above, where the red and blue points are reasonably well separated, the K-nearest neighbor method works well. Conversely, if the red and blue points are not well separated and the data is heavily mixed, the K-nearest neighbor method is not a good choice.
- Furthermore, if you choose an even number for K, the two labels can receive the same number of votes and the point cannot be classified cleanly, so be sure to make K odd.
- Also, as you increase K, if for example there are far more red points than blue points, the probability of being classified as red becomes disproportionately large. So you need to pay attention to the ratio of red and blue points (see the sketch after this list).
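Here is a small made-up sketch of that last caveat: a tight cluster of 3 blue points is surrounded by 30 red points, and once K exceeds the size of the blue cluster, the red majority outvotes it even for a test point sitting right inside the cluster:

```python
import math
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: a small 'blue' cluster surrounded by many 'red' points
X = [[5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]   # 3 blue points
y = ['blue', 'blue', 'blue']
for i in range(30):                         # 30 red points on a ring of radius 2
    angle = 2 * math.pi * i / 30
    X.append([5 + 2 * math.cos(angle), 5 + 2 * math.sin(angle)])
    y.append('red')

# A small K respects the local blue cluster; a large K lets the
# red majority outvote it even right inside the blue cluster
for k in (3, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.predict([[5.0, 5.05]]))    # -> 3 ['blue'], 15 ['red']
```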
```python
from sklearn.neighbors import KNeighborsClassifier

KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto',
                     leaf_size=30, p=2, metric='minkowski',
                     metric_params=None, n_jobs=1, **kwargs)
```
Perhaps the most important parameter is the n_neighbors mentioned above, and the code below searches for the optimal value of K.
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Make a list of candidate values for K
myList = list(range(1, 50))

# Drop the even numbers so the list contains only odd values of K
neighbors = list(filter(lambda x: x % 2 != 0, myList))

# Make an empty list for the cross-validation scores
cv_scores = []

# Cross-validate for each K and append the mean score to the list above
# (assumes X_train and y_train are already defined)
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
```
This code is also an excerpt from kevinzakka's blog.
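As a possible follow-up (this part is my own addition, not from the excerpt): since neighbors is a list, you can read off the K with the best mean score once the loop finishes.

```python
# Pick the K whose mean cross-validation accuracy is highest
optimal_k = neighbors[cv_scores.index(max(cv_scores))]
print("The optimal number of neighbors is", optimal_k)
```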
weights: a weighting parameter. The default is 'uniform', which treats all K neighbors equally. If you set it to 'distance', nearby points are weighted more heavily than distant points. You can also define a function yourself and pass it in.
algorithm: I'm sorry, I did not fully understand this one because I lack the mathematical background for terms like Euclidean space. For people like me, though, you can specify 'auto' and the most suitable algorithm will be chosen for you. For reference, the other options are 'ball_tree', 'kd_tree', and 'brute', so if you are interested, please look them up. Also, if you find an easy-to-understand explanation, please let me know in the comments.
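For example, both parameters can be passed directly to the constructor (a minimal sketch using only the documented option values):

```python
from sklearn.neighbors import KNeighborsClassifier

# Closer neighbors count more than distant ones
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')

# Force a specific search structure instead of letting 'auto' decide
knn_kdtree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
```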
Those are the main parameters. I will add more as soon as I understand them better. If you already understand them, edit requests are also welcome. Thank you in advance.
- Good points
The more data you have, the more accurate the classification is likely to be. It is a simple and easy-to-understand model.
- Bad points
I mentioned two weaknesses in the "What is the K-nearest neighbor method" section above, but to summarize them again in one sentence: if multiple classes are heavily mixed together, or if the class ratio is heavily skewed, the classification may not work well.
That is the outline of the K-nearest neighbor method as far as I understand it. I will keep updating this article, so if you have something to add or fix, I would appreciate your comments.