Equalizing the number of samples per class in machine learning data with Python

For machine learning, it is desirable to have the same number of samples in every class. In practice, however, data sets are not always that clean, and data with a different number of samples in each class is common.

This time, I implemented a process in Python that equalizes the number of samples per class, using the classes given in the label data, so I am writing it down here as a note.

What I want to do

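Given the following data array and its label array, I want to randomly remove samples so that every class ends up with as many samples as the smallest class (two per class here). Which samples are removed is random, so the output below is only one possible result.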

#Data array
data = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
#Label array
label = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

###############
#Data processing...
###############

>>>data
[10 11 12 14 15 16]
>>>label
[0 0 1 1 2 2]
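
A quick way to confirm the imbalance, before and after the processing, is to count the samples in each class. The following is a minimal sketch using np.unique with return_counts=True; the helper name class_counts is made up for illustration and is not part of the original code.

```python
import numpy as np

def class_counts(label):
    """Return a {class: count} dict for a 1-D label array (illustrative helper)."""
    classes, counts = np.unique(label, return_counts=True)
    return {int(c): int(n) for c, n in zip(classes, counts)}

label = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
print(class_counts(label))  # {0: 2, 1: 3, 2: 5}
```

After the processing, the same call should report an equal count for every class.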

Code

Details are in the comments. In short, for each class that has more samples than the smallest class, the code does the following:

  1. Get the indexes of the elements that belong to that class.
  2. Use random.sample() to randomly pick, from those indexes, as many as need to be deleted.
  3. Delete the elements at the picked indexes from both data and label.

import numpy as np
import random

#Data array
data = np.array(range(10,20))
print("data:", data)
#Label array
label = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
print("label:", label)
#Number of samples in each class
#(an empty np.array is a float array, which is why the counts print as 2. 3. 5.)
sample_nums = np.array([])


print("\nCalculate the number of samples for each class")
#Assumes the labels are the integers 0, 1, ..., max(label)
for i in range(max(label)+1):
    #Number of samples in class i
    sample_num = np.sum(label == i)
    #Append it to the array of per-class sample counts
    sample_nums = np.append(sample_nums, sample_num)
print("sample_nums:", sample_nums)

#Minimum number of samples in all classes
min_num = np.min(sample_nums)
print("min_num:", min_num)


print("\nAlign the number of samples for each class")
for i in range(len(sample_nums)):

    #Difference between the number of samples in this class and the minimum number of samples
    diff_num = int(sample_nums[i] - min_num)
    #A literal percent sign must be written as %% in %-formatting
    print("Class %d number of deleted samples: %d (%0.2f%%)" % (i, diff_num, (diff_num/sample_nums[i])*100))

    #Skip if nothing needs to be deleted
    if diff_num == 0:
        continue

    #Indexes of the elements that belong to this class
    #np.where returns a tuple of arrays, so take element 0 and convert it to a plain list
    indexes = np.where(label == i)[0].tolist()
    print("\tindexes:", indexes)

    #Indexes of the elements to delete, picked at random
    del_indexes = random.sample(indexes, diff_num)
    print("\tdel_indexes:", del_indexes)

    #Delete the picked elements from both data and label
    data = np.delete(data, del_indexes)
    label = np.delete(label, del_indexes)


print("\ndata:", data)
print("label:", label)

Execution result

data: [10 11 12 13 14 15 16 17 18 19]
label: [0 0 1 1 1 2 2 2 2 2]

Calculate the number of samples for each class
sample_nums: [ 2.  3.  5.]
min_num: 2.0

Align the number of samples for each class
Class 0 number of deleted samples: 0 (0.00%)
Class 1 number of deleted samples: 1 (33.33%)
	indexes: [2, 3, 4]
	del_indexes: [3]
Class 2 number of deleted samples: 3 (60.00%)
	indexes: [4, 5, 6, 7, 8]
	del_indexes: [7, 8, 6]

data: [10 11 12 14 15 16]
label: [0 0 1 1 2 2]
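
Because random.sample() picks the indexes at random, del_indexes (and therefore which values remain in data) will differ from run to run; only the per-class counts stay the same. If reproducible results are needed, one option is to seed the random generator before sampling. The sketch below shows the idea on the class 2 indexes from the run above; the seed value 0 is arbitrary and chosen only for illustration.

```python
import random

# Indexes of class 2 taken from the execution result above
indexes = [4, 5, 6, 7, 8]

random.seed(0)  # arbitrary seed, chosen only for illustration
first = random.sample(indexes, 3)

random.seed(0)  # same seed again
second = random.sample(indexes, 3)

print(first == second)  # True: the same seed gives the same selection
```

In the full script, a single random.seed() call placed before the loop should be enough.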

In closing

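Readers who are more familiar with Python and NumPy can probably make this more efficient, for example by replacing the explicit counting loop with built-in NumPy functions.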
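
As one possible direction, here is a minimal sketch of a more compact version. Note that it differs from the code above: instead of picking indexes to delete, it picks the indexes to keep for each class, and it uses NumPy's random generator rather than random.sample(). The function name undersample and its seed parameter are hypothetical names introduced for this example.

```python
import numpy as np

def undersample(data, label, seed=None):
    """Randomly keep as many samples of each class as the smallest class has.

    Hypothetical helper for illustration; kept samples stay in their original order.
    """
    rng = np.random.default_rng(seed)
    # Classes that appear in label and how many samples each has
    classes, counts = np.unique(label, return_counts=True)
    min_num = counts.min()
    # For each class, randomly choose min_num indexes to keep (without repeats)
    keep = np.concatenate([
        rng.choice(np.flatnonzero(label == c), size=min_num, replace=False)
        for c in classes
    ])
    keep.sort()  # restore the original ordering of the kept samples
    return data[keep], label[keep]

data = np.arange(10, 20)
label = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
new_data, new_label = undersample(data, label, seed=0)
print(new_label)  # [0 0 1 1 2 2]
```

Since it relies on np.unique, this sketch does not require the labels to be consecutive integers starting from 0.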
