0. Introduction

Datasets for supervised learning are often imbalanced. In fact, I think cases where you can collect a large amount of data with balanced classes are rare.

This time, I will introduce a library called **imbalanced-learn**, which can be useful for resampling imbalanced data.

I mainly referred to the following articles.

- Imbalanced-learn to under-sampling / over-sampling unbalanced data
- Data analysis with Python: Sampling unbalanced data with imbalanced-learn

The Official Documentation is here.

1. Install imbalanced-learn

Install it by following the official "Install and contribution" page:

pip install -U imbalanced-learn

As of March 2020, imbalanced-learn appears to require the following versions of its dependencies.

numpy (>=1.11)
scipy (>=0.17)
scikit-learn (>=0.21)
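
To compare your environment against these requirements, a minimal sketch like the following may help. It assumes Python 3.8 or later (for importlib.metadata); the helper name installed_version is made up for this example and is not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Distribution names as used by pip, not the import names.
for name in ["numpy", "scipy", "scikit-learn", "imbalanced-learn"]:
    print(f"{name}: {installed_version(name)}")
```

A value of None means the package is not installed in the current environment.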

2. Prepare pseudo data

Prepare the pseudo data used in this article. If you already have data, skip this step. I generate it with scikit-learn's make_classification function.

`In[1]`


import pandas as pd
from sklearn.datasets import make_classification
df = make_classification(n_samples=5000, n_features=10, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1, weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

Here df is a tuple of two return values: df[0] holds the features (the so-called X) and df[1] holds the labels (the so-called y). Store them in a DataFrame as follows.

`In[2]`


df_raw = pd.DataFrame(df[0], columns = ['var1', 'var2', 'var3', 'var4', 'var5', 'var6', 'var7', 'var8', 'var9', 'var10'])
df_raw['Class'] = df[1]
df_raw.head()
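
As an alternative sketch (not the code used in the rest of this article), the tuple can be unpacked directly and the column names generated in a loop. The names X_arr, y_arr, columns, and df_alt are introduced here only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Same arguments as In[1]; only the unpacking differs.
X_arr, y_arr = make_classification(n_samples=5000, n_features=10, n_informative=2,
                                   n_redundant=0, n_repeated=0, n_classes=3,
                                   n_clusters_per_class=1, weights=[0.01, 0.05, 0.94],
                                   class_sep=0.8, random_state=0)

# Generate 'var1' ... 'var10' instead of typing each column name.
columns = [f"var{i}" for i in range(1, X_arr.shape[1] + 1)]
df_alt = pd.DataFrame(X_arr, columns=columns)
df_alt["Class"] = y_arr
print(df_alt.shape)
```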

Divide this into X and y.

`In[3]`


X = df_raw.iloc[:, 0:10]
y = df_raw['Class']
y.value_counts()

`Out[3]`


2    4674
1     261
0      65
Name: Class, dtype: int64

As you can see, label 2 accounts for the overwhelming majority of the data. This completes the preparation of pseudo data.
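
To put a number on the imbalance, the counts from Out[3] can be turned into proportions. This is a small sketch that re-enters the counts by hand so it runs on its own; the variable names are assumptions for this example.

```python
import pandas as pd

# Class counts copied from Out[3].
counts = pd.Series({2: 4674, 1: 261, 0: 65}, name="Class")

# Share of each class, and how many majority rows exist per minority row.
proportions = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(4))
print(f"majority / minority = {imbalance_ratio:.1f}")
```

With a real DataFrame, y.value_counts(normalize=True) gives the same proportions directly.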

3. Split the data with train_test_split

Split the data frame prepared earlier using train_test_split.

`In[4]`


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 71, stratify=y)
y_train.value_counts()

`Out[4]`


2    3272
1     183
0      45
Name: Class, dtype: int64

These are the per-label counts in the training data after the train/test split. Passing y to the stratify argument of train_test_split performs **stratified sampling**, so the class proportions are preserved in both splits.

Statistics Web: Sample extraction method
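
The effect of stratify can be checked with a small experiment. The sketch below uses dummy features and labels with the same class counts as Out[3], and compares the share of each class that lands in the training split with and without stratification. The names y_demo, X_demo, and train_share are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as Out[3]; the features are dummy row ids.
y_demo = np.repeat([0, 1, 2], [65, 261, 4674])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)


def train_share(stratify):
    """Fraction of each class's rows that ends up in the training split."""
    _, _, y_tr, _ = train_test_split(X_demo, y_demo, test_size=0.3,
                                     random_state=71, stratify=stratify)
    return np.bincount(y_tr, minlength=3) / np.bincount(y_demo)


print("stratified  :", train_share(y_demo).round(3))
print("unstratified:", train_share(None).round(3))
```

With stratification, every class should sit close to the requested 70% training share; without it, the rare classes can drift further from that value.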

4. Undersample with RandomUnderSampler

Now for the main topic: undersampling with RandomUnderSampler. The API is here.

First, the argument **sampling_strategy**. It determines the class ratio after sampling. In earlier versions this argument was apparently called ratio; as of version 0.6 it is sampling_strategy.

This argument mainly accepts a float or a dictionary.

For a float, specify the desired ratio of minority-class samples to majority-class samples after resampling. This is only available for binary (two-label) classification problems.

For a dictionary, pass the desired sample size of each class, as follows.

`In[5]`


from imblearn.under_sampling import RandomUnderSampler

positive_count_train = y_train.value_counts()[0]
strategy = {0:positive_count_train, 1:positive_count_train*2, 2:positive_count_train*5}

rus = RandomUnderSampler(random_state=0, sampling_strategy = strategy)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
y_resampled.value_counts()

`Out[5]`


Using TensorFlow backend.
2    225
1     90
0     45
Name: Class, dtype: int64

The training data has now been undersampled.
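
Conceptually, random undersampling with a dictionary just draws the requested number of rows from each class without replacement. The sketch below reproduces that idea with pandas alone. It is an illustration, not imbalanced-learn's actual implementation, and the function name undersample_index is hypothetical.

```python
import pandas as pd


def undersample_index(y, strategy, random_state=0):
    """Pick strategy[label] rows per class without replacement; return their index."""
    kept = []
    for label, n_rows in strategy.items():
        sampled = y[y == label].sample(n=n_rows, random_state=random_state)
        kept.extend(sampled.index)
    return pd.Index(kept)


# Labels with the same class counts as Out[4].
y_demo = pd.Series([0] * 45 + [1] * 183 + [2] * 3272, name="Class")

# Same target sizes as In[5]: 45, 45 * 2 and 45 * 5.
strategy_demo = {0: 45, 1: 90, 2: 225}

kept_index = undersample_index(y_demo, strategy_demo)
print(y_demo.loc[kept_index].value_counts())
```

The same index could be used to select the matching rows of X, for example with X.loc[kept_index].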

6. Bonus

While we're at it, let's classify the data with a model.

`In[6]`


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)

print('Accuracy(test) : %.5f' %accuracy_score(y_test, y_pred))

`Out[6]`


Accuracy(test) : 0.97533
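
When reading this accuracy, keep in mind that the test data is still imbalanced. As a rough sketch, the class counts of the test set (Out[3] minus Out[4]) show that a trivial model which always predicts label 2 already reaches about 0.93 accuracy. The names y_true_demo and y_pred_demo are made up for this example, and balanced_accuracy_score is shown as one possible complementary metric, not something the original analysis used.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Test-set class counts implied by Out[3] minus Out[4]: 20, 78 and 1402 rows.
y_true_demo = np.repeat([0, 1, 2], [20, 78, 1402])

# A trivial "model" that always predicts the majority label 2.
y_pred_demo = np.full_like(y_true_demo, 2)

acc = accuracy_score(y_true_demo, y_pred_demo)
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)

print(f"accuracy          : {acc:.5f}")
print(f"balanced accuracy : {bal_acc:.5f}")
```

Balanced accuracy averages the recall of each class, so it does not reward ignoring the minority labels.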

Let's output the confusion matrix as a heat map.

`In[7]`


from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm)

The heatmap is hard to read because the test data is still imbalanced ...
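
One possible way to make such a heatmap easier to read is to normalize each row of the confusion matrix, so that every true class is shown as proportions. The sketch below uses made-up numbers (not the result of In[7]) and NumPy only; cm_demo and cm_norm are names introduced for this example.

```python
import numpy as np

# Illustrative confusion matrix only: these numbers are made up and are NOT
# the article's actual result. Rows are true labels, columns are predictions.
cm_demo = np.array([[15, 3, 2],
                    [10, 60, 8],
                    [5, 12, 1385]])

# Divide each row by its total so every true class sums to 1.
cm_norm = cm_demo / cm_demo.sum(axis=1, keepdims=True)
print(cm_norm.round(2))

# cm_norm can then be passed to seaborn, e.g.
# sns.heatmap(cm_norm, annot=True, fmt=".2f")
```

After normalization the diagonal shows the recall of each class, regardless of how many rows the class has.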

7. Summary

This time I tried undersampling.

While researching for this article, I found that there are many different undersampling and oversampling methods.

- Introduction of imbalanced-learn functions
- [Handling of imbalanced data | PortoSeguro Competition](https://data-bunseki.com/2019/11/30/%E4%B8%8D%E5%9D%87%E8%A1%A1%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%8F%96%E3%82%8A%E6%89%B1%E3%81%84-portoseguro-%E3%82%B3%E3%83%B3%E3%83%9A/)

**SMOTE** looks interesting, so I will study it next.

Corrections and comments on this article are always welcome.

[PYTHON] [Kaggle] I tried undersampling using imbalanced-learn
