[PYTHON] Machine learning imbalanced data sklearn with k-NN

(Added on 2020/02/25) TODO: My own weight function sums the raw distances instead of the reciprocals of the distances. I will correct it to use the reciprocal of the distance. (KNeighborsClassifier itself calculates weights correctly; the mistake is in my own function.)

Conclusion

--With sklearn's KNeighborsClassifier, I was able to give a heavier weight to the class with fewer samples in an imbalanced data set.

--Result: Recall on the class with fewer samples went up.
- before: confusion matrix
[[2641 67]
[ 167 125]]
- after: confusion matrix
[[2252 456]
[ 80 212]]
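To make the recall claim concrete, here is a minimal sketch that computes per-class recall from the two confusion matrices quoted above. It assumes the usual sklearn layout (rows are true classes, columns are predicted classes); the helper name `per_class_recall` is made up for this illustration.

```python
import numpy as np

# Confusion matrices quoted above (rows: true class, columns: predicted class).
before = np.array([[2641, 67], [167, 125]])
after = np.array([[2252, 456], [80, 212]])

def per_class_recall(cm):
    # Recall of class i = correctly predicted samples of class i / all true samples of class i.
    return cm.diagonal() / cm.sum(axis=1)

recall_before = per_class_recall(before)
recall_after = per_class_recall(after)
print("before:", recall_before.round(3))
print("after: ", recall_after.round(3))
```

Recall on class 1 rises from 125/292 (about 0.43) to 212/292 (about 0.73), while recall on class 0 drops from 2641/2708 (about 0.98) to 2252/2708 (about 0.83).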


Image diagram (before): スクリーンショット 2020-02-24 12.39.05.png

Image diagram (after): スクリーンショット 2020-02-24 12.38.55.png

Background / Issues

--The behavior of the weights argument of sklearn.neighbors.KNeighborsClassifier was unclear, so I checked it.

Method

Without handling for imbalanced data

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

%matplotlib inline
from sklearn.datasets import make_classification
data_base = make_classification(
    n_samples = 10000, n_features = 2, n_informative = 2, n_redundant = 0, 
    n_repeated = 0, n_classes = 2, n_clusters_per_class = 2, weights = [0.9, 0.1], 
    flip_y = 0, class_sep = 0.5, hypercube = True, shift = 0.0, 
    scale = 1.0, shuffle = True, random_state =5)

df = pd.DataFrame(data_base[0], columns = ['f1', 'f2'])
df['class'] = data_base[1]

fig = plt.figure()
ax = fig.add_subplot()
# Plot each class in its own color.
for _, cls in df.groupby('class'):
    ax.plot(cls['f1'], cls['f2'], 'o', ms=2)

plt.show()

image.png

X = df[["f1","f2"]]
y = df["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

print("train", X_train.shape, y_train.shape)
print("test", X_test.shape, y_test.shape)

train (7000, 2) (7000,)
test (3000, 2) (3000,)


from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
    
result   = confusion_matrix(y_test, pred)
result2 = accuracy_score(y_test, pred)

print("confusion matrix \n",result) 
print("accuracy \n", result2 )

confusion matrix
[[2641 67]
[ 167 125]]
accuracy
0.922

With handling for imbalanced data

--First, calculate the weight of each class as the reciprocal of its share of the training samples.

size_and_weight = pd.DataFrame({
                'class0': [sum(clf._y == 0),1/ (sum(clf._y == 0)/ len(clf._y))],
                'class1': [sum(clf._y == 1),1/ (sum(clf._y == 1)/ len(clf._y))]}).T
size_and_weight.columns = ['sample_size', 'weight']
size_and_weight
        sample_size    weight
class0       6292.0  1.112524
class1        708.0  9.887006
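The same weights can be computed without touching the private attribute clf._y. This is only a sketch: the label vector `y` below is a hypothetical stand-in with the same class counts as the training data above, and `class_weight` is a name chosen for this example.

```python
import numpy as np

# Hypothetical label vector with the same class counts as the training data above.
y = np.array([0] * 6292 + [1] * 708)

counts = np.bincount(y)          # number of samples per class
class_weight = len(y) / counts   # reciprocal of each class's share of the data
print(counts, class_weight)
```

As far as I know, sklearn's "balanced" class weights follow the same idea but additionally divide by the number of classes, so the ratio between the two classes is the same.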

--Fit the model on the training data, then calculate the distances from each test point to its nearest training points.



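# Map each training label (0 or 1) to the weight of its class.
# rename_categories is used because recent pandas versions no longer allow assigning to .categories.
weights_array = pd.Categorical(clf._y).rename_categories(
    [size_and_weight.loc['class0', 'weight'],
     size_and_weight.loc['class1', 'weight']])

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
neigh_dist, neigh_ind = clf.kneighbors(X_test)  # These arrays are described in detail later.

# Look up the class weight of every neighbor: shape (n_test, n_neighbors).
weights_array = np.array(weights_array).reshape((-1, 1))[neigh_ind, 0]
pd.DataFrame(weights_array).head()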
0 1 2 3 4
0 1.112524 1.112524 1.112524 1.112524 1.112524
1 1.112524 1.112524 1.112524 1.112524 1.112524
2 1.112524 9.887006 1.112524 1.112524 1.112524
3 1.112524 1.112524 1.112524 1.112524 1.112524
4 1.112524 1.112524 1.112524 1.112524 1.112524

--↑ The weights for handling the imbalanced data are now ready.

--Pass a function to the weights argument so that these weights are taken into account, then run everything up to prediction.

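def tmp(array_):
    # array_ holds the neighbor distances, shape (n_queries, n_neighbors).
    # Note: this multiplies the raw distances, not their reciprocals (see the TODO at the top).
    # weights_array was built from X_test, so this function only works when predicting X_test.
    return array_ * weights_array

clf = KNeighborsClassifier(n_neighbors=5, weights=tmp)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
result = confusion_matrix(y_test, pred)
result2 = accuracy_score(y_test, pred)

print("confusion matrix \n", result)
print("accuracy \n", result2)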

confusion matrix
[[2252 456]
[ 80 212]]
accuracy
0.8213333333333334
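The TODO at the top of this article says the weights should use the reciprocal of the distance. Below is a rough sketch of what that could look like; it is not the code that produced the results above. A function passed to weights only receives the distance array, so this sketch does the voting by hand from kneighbors instead of going through predict. The data set is a smaller synthetic one, and the helper `vote` and the names `pred_plain` and `pred_weighted` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic imbalanced data set (parameters chosen only for illustration).
X, y = make_classification(n_samples=2000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], flip_y=0,
                           class_sep=0.5, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
neigh_dist, neigh_ind = clf.kneighbors(X_test)
neigh_class = y_train[neigh_ind]                    # class of each neighbor

# Inverse-frequency class weights, computed the same way as in the article.
class_weight = len(y_train) / np.bincount(y_train)

def vote(neigh_dist, neigh_class, class_weight):
    """Hypothetical helper: sum class_weight / distance per class, pick the largest."""
    inv_dist = 1.0 / np.maximum(neigh_dist, 1e-12)  # guard against zero distance
    contrib = inv_dist * class_weight[neigh_class]
    votes = np.stack([(contrib * (neigh_class == c)).sum(axis=1)
                      for c in range(len(class_weight))], axis=1)
    return votes.argmax(axis=1)

pred_plain = vote(neigh_dist, neigh_class, np.ones(2))       # inverse distance only
pred_weighted = vote(neigh_dist, neigh_class, class_weight)  # with class weights
print("predicted as class 1:", pred_plain.sum(), "->", pred_weighted.sum())
```

Because the minority class gets the larger weight, every test point that the plain inverse-distance vote assigns to class 1 is also assigned to class 1 by the weighted vote, so recall on class 1 cannot go down in this sketch.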

Conclusion

--When a function is passed to weights, KNeighborsClassifier adds up the weights of the nearest neighbors per class and assigns the test data to the class with the largest total. With my function the weights are the distances themselves, so the class with the largest total distance wins. (Choosing the more distant class is slightly counterintuitive. However, the distances are only calculated for the n_neighbors closest points, so without the class weights the result stays closer to a majority vote than to a pure comparison of distance sums.)
--By replacing that sum with the sum of the element-wise product of the distances and weights_array, I customized the method to deal with imbalanced data.
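To check my understanding of how a function passed to weights is used, here is a small sketch that redoes the vote by hand and compares it with clf.predict. It uses a hypothetical weight function `inv_dist` (reciprocal of the distance) and a small synthetic data set, not the data from this article, and it reflects my reading of the behavior rather than the actual sklearn implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

def inv_dist(dist):
    # Example weight function: closer neighbors get larger weights.
    return 1.0 / np.maximum(dist, 1e-12)

clf = KNeighborsClassifier(n_neighbors=5, weights=inv_dist).fit(X_train, y_train)

# Manual vote: add each neighbor's weight to the tally of that neighbor's class.
neigh_dist, neigh_ind = clf.kneighbors(X_test)
w = inv_dist(neigh_dist)
neigh_class = y_train[neigh_ind]
votes = np.zeros((len(X_test), 2))
for c in (0, 1):
    votes[:, c] = (w * (neigh_class == c)).sum(axis=1)
manual_pred = votes.argmax(axis=1)

agreement = (manual_pred == clf.predict(X_test)).mean()
print("agreement with clf.predict:", agreement)
```

The function receives only the distances, with shape (n_queries, n_neighbors), and must return an array of the same shape. That is why the class weights in this article have to be prepared in advance as an array aligned with neigh_ind.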

Details: About customized arguments

Replacing the sum with the sum of the element-wise product with weights_array

In order to do so, it is necessary to understand the correspondence between the following data frames.


Understand the correspondence of neigh_dist, neigh_ind, y_train (clf._y) from the table.

neigh_dist, neigh_ind = clf.kneighbors(X_test)
pd.DataFrame(neigh_dist).tail(5)
pd.DataFrame(neigh_ind).tail(5)
pd.DataFrame(clf._y.reshape((-1, 1))[neigh_ind,0]).tail(5)

image.png

From left to right, the tables in ↑ are neigh_dist, neigh_ind, and "neigh_ind class".

--Observation 1: The three tables above have the same shape.
--Observation 2: Number of rows of the three tables above = number of rows of the test data.
--Observation 3: Number of columns = n_neighbors = 5.
--Observation 4: In neigh_dist, the values increase from left to right. In other words, the five training points closest to each test point were extracted, in order of distance.
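The four points above can be checked in code. This is a minimal sketch on hypothetical random data (the arrays below are generated only for this example and are not the data of this article).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))      # hypothetical training features
y_train = rng.integers(0, 2, size=200)   # hypothetical binary labels
X_test = rng.normal(size=(50, 2))        # hypothetical test features

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
neigh_dist, neigh_ind = clf.kneighbors(X_test)
neigh_class = y_train[neigh_ind]         # class of each neighbor

# First three points: same shape, one row per test sample, one column per neighbor.
print(neigh_dist.shape, neigh_ind.shape, neigh_class.shape)
# Fourth point: distances never decrease from left to right within a row.
print(np.all(np.diff(neigh_dist, axis=1) >= 0))
```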

Understand the correspondence between neigh_dist, X_train, and X_test from the calculation of neigh_dist.

--Reproduce the following entry of the neigh_dist DataFrame by hand:
--index = 2998 # The test data at index 2998
--values = 0.015318 # Distance between [the X_train data at index 1374, which was judged to be the closest] and [the test data at the above index]

test_index = 2998
tmp1 = pd.DataFrame(X_test.iloc[test_index])
display(tmp1.T)

train_index = 1374
tmp2 = pd.DataFrame(X_train.iloc[train_index])
display(tmp2.T)

image.png

#Calculate Euclidean distance
(
sum(   (tmp1.values - tmp2.values)  **2    )
**(1/2)
)

array([0.01531811])

--The value of neigh_dist at [2998, 0] could be reproduced from the training data and the test data.
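The same check can be done for every entry at once instead of a single cell. This is a sketch on hypothetical random data, assuming the default metric of KNeighborsClassifier (Euclidean distance); `manual_dist` is a name chosen for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))      # hypothetical training features
y_train = rng.integers(0, 2, size=200)   # hypothetical binary labels
X_test = rng.normal(size=(50, 2))        # hypothetical test features

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
neigh_dist, neigh_ind = clf.kneighbors(X_test)

# X_train[neigh_ind] has shape (n_test, n_neighbors, n_features); subtracting the
# test points (broadcast over the neighbor axis) gives the difference vectors.
diff = X_train[neigh_ind] - X_test[:, None, :]
manual_dist = np.linalg.norm(diff, axis=2)

print(np.allclose(manual_dist, neigh_dist, atol=1e-6))
```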

Correspondence between y_train and clf._y

sum(clf._y == y_train) == len(y_train)

True

--This confirms that y_train and clf._y match.

Correspondence between neigh_ind and "neigh_ind class"

index_ = neigh_ind[2998,:]
pd.DataFrame(clf._y[index_]).T

image.png

--Row 2998 of "neigh_ind class": I was able to reproduce it from neigh_ind and y_train.

Finally

--The section [Details: About customized arguments] only walks through what the source code of KNeighborsClassifier.predict does, so it may be faster to read the source code on git directly. Reference: git sklearn
