[PYTHON] [Kaggle] I tried undersampling using imbalanced-learn

0. Introduction

In supervised learning, the data is often imbalanced. In fact, I think it is rare to be able to collect a large amount of data with well-balanced classes.

This time, I will introduce **imbalanced-learn**, a library that is useful for resampling imbalanced data.

I mainly referred to the following articles.

- Imbalanced-learn to under-sampling / over-sampling unbalanced data
- Data analysis with Python: Sampling unbalanced data with imbalanced-learn

The Official Documentation is here.

1. Install imbalanced-learn

Install the library by following the "Install and contribution" page of the official documentation:

pip install -U imbalanced-learn

By the way, as of March 2020, imbalanced-learn appears to require minimum versions of several dependent libraries; check the installation page for the current requirements.

2. Prepare pseudo data

Prepare the pseudo data used in this article. If you already have data, skip this step. I use scikit-learn's make_classification function.

In[1]


import pandas as pd
from sklearn.datasets import make_classification
df = make_classification(n_samples=5000, n_features=10, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1, weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

make_classification returns a tuple, so df holds two elements: df[0] is the feature matrix (the so-called X) and df[1] is the label vector (the so-called y). Store them in a data frame as follows.

In[2]


df_raw = pd.DataFrame(df[0], columns = ['var1', 'var2', 'var3', 'var4', 'var5', 'var6', 'var7', 'var8', 'var9', 'var10'])
df_raw['Class'] = df[1]
df_raw.head()

(Output: the first five rows of df_raw, with columns var1 to var10 and Class.)
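As a side note, the tuple does not have to be indexed with df[0] and df[1]. The following is a minimal sketch of the same preparation step that unpacks the return value directly and generates the column names in a loop. The names X_arr, y_arr, feature_names and df_alt are illustrative and do not appear in the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Unpack the (X, y) tuple directly instead of indexing df[0] / df[1].
X_arr, y_arr = make_classification(
    n_samples=5000, n_features=10, n_informative=2,
    n_redundant=0, n_repeated=0, n_classes=3,
    n_clusters_per_class=1, weights=[0.01, 0.05, 0.94],
    class_sep=0.8, random_state=0,
)

# Generate the column names var1 ... var10 instead of typing them out.
feature_names = [f"var{i}" for i in range(1, X_arr.shape[1] + 1)]

# Illustrative name: df_alt plays the role of df_raw above.
df_alt = pd.DataFrame(X_arr, columns=feature_names)
df_alt["Class"] = y_arr

print(df_alt.shape)
```

The result should be equivalent to df_raw; the only difference is that the column list no longer has to be kept in sync with n_features by hand.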

Divide this into X and y.

In[3]


X = df_raw.iloc[:, 0:10]
y = df_raw['Class']
y.value_counts()

Out[3]


2    4674
1     261
0      65
Name: Class, dtype: int64

As you can see, label 2 accounts for the overwhelming majority of the data (4,674 of 5,000 rows). This completes the preparation of the pseudo data.

3. Split the data with train_test_split

Split the data frame prepared earlier using train_test_split.

In[4]


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 71, stratify=y)
y_train.value_counts()

Out[4]


2    3272
1     183
0      45
Name: Class, dtype: int64

These are the per-label counts in the training data after splitting into train and test. Passing y to the stratify argument of train_test_split performs **stratified sampling**, so the class proportions are preserved in both the training and test data.

Statistics Web: Sample extraction method
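The effect of stratify can be checked on a small example. The sketch below uses hypothetical toy labels (10 / 50 / 940 samples for classes 0 / 1 / 2) and a dummy feature, not the data from this article, and simply counts the classes in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 10 / 50 / 940 samples for classes 0 / 1 / 2.
y_toy = np.array([0] * 10 + [1] * 50 + [2] * 940)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # dummy single feature

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=71, stratify=y_toy
)

# With stratify, each split keeps roughly the original class proportions,
# so even the rare class 0 appears in both the train and the test split.
print(np.bincount(y_tr))
print(np.bincount(y_te))
```

Without stratify, a class as rare as class 0 could by chance end up almost entirely in one of the two splits, which is why stratification matters for imbalanced data.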

4. Undersample with RandomUnderSampler

Now for the main topic: undersampling with RandomUnderSampler. The API is here.

First, the **sampling_strategy** argument. It determines the ratio of each class after sampling. In earlier versions this argument was apparently called ratio; it was deprecated in favor of sampling_strategy and removed in version 0.6.

This argument mainly accepts a float or a dictionary.

For a float, specify the desired ratio (number of minority-class samples) ÷ (number of majority-class samples) after resampling. However, this is only available for binary classification problems.

For a dictionary, pass the desired number of samples for each class as follows.

In[5]


from imblearn.under_sampling import RandomUnderSampler

positive_count_train = y_train.value_counts()[0]
strategy = {0:positive_count_train, 1:positive_count_train*2, 2:positive_count_train*5}

rus = RandomUnderSampler(random_state=0, sampling_strategy = strategy)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
y_resampled.value_counts()

Out[5]


Using TensorFlow backend.
2    225
1     90
0     45
Name: Class, dtype: int64

The training data has now been undersampled to the specified class sizes.

6. Bonus

While we are at it, let's train a classifier on the resampled data.

In[6]


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)

print('Accuracy(test) : %.5f' %accuracy_score(y_test, y_pred))

Out[6]


Accuracy(test) : 0.97533

Let's output the confusion matrix as a heat map.

In[7]


from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm)

(Figure: heat map of the confusion matrix on the test data.)

The heat map is hard to read because the test data is itself imbalanced: the cell for label 2 dominates the color scale ...
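One possible way to make such a matrix easier to read is to normalize each row by the number of true samples in that class, so that the diagonal shows per-class recall. The sketch below uses small hypothetical label arrays (y_true and y_hat are made up for illustration, not the results above) and the normalize option of confusion_matrix, which assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 2 / 4 / 10 true samples for classes 0 / 1 / 2.
y_true = np.array([0, 0, 1, 1, 1, 1] + [2] * 10)
y_hat = np.array([0, 2, 1, 1, 1, 2] + [2] * 9 + [1])

# normalize="true" divides each row by the number of true samples
# in that class, so every row sums to 1.
cm_norm = confusion_matrix(y_true, y_hat, normalize="true")

print(accuracy_score(y_true, y_hat))  # overall accuracy
print(cm_norm.diagonal())             # per-class recall
```

In this toy case the overall accuracy is dominated by the large class, while the diagonal shows that the rare class 0 is recognized only half of the time. Passing cm_norm to sns.heatmap with annot=True would display these per-class rates directly in the cells.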

7. Summary

This time I tried undersampling.

While researching for this article, I learned that there are many different methods for both undersampling and oversampling.

- Introduction of imbalanced-learn functions
- [Handling of imbalanced data | PortoSeguro Competition](https://data-bunseki.com/2019/11/30/%E4%B8%8D%E5%9D%87%E8%A1%A1%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%8F%96%E3%82%8A%E6%89%B1%E3%81%84-portoseguro-%E3%82%B3%E3%83%B3%E3%83%9A/)

**SMOTE** seems interesting, so I will study it next.

Corrections and comments on this article are always welcome.
