[PYTHON] Sampling in imbalanced data

Motivation

My motivation is to learn classifiers effectively from data as imbalanced as 1:10,000 or worse. I suspect people doing web-based CV analysis struggle with this too; I am one of them.

What I wrote in this blog

- An overview of sampling methods for imbalanced data
- References
- A rough summary of the paper above

Please see the paper itself for details; it is summarized well there.

How to deal with imbalanced data

- Algorithm-level approaches: introduce a coefficient into the model to adjust for the imbalance, i.e., adjust the cost function (a minimal sketch follows below).
- Data-level approaches: reduce the majority data and increase the minority data, called undersampling and oversampling respectively.

This post mainly covers data-level approaches.
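For a concrete picture of the algorithm-level option, many libraries expose the imbalance coefficient as a class-weight parameter; below is a minimal sketch using scikit-learn, which is my illustration rather than anything from the paper:

```python
from sklearn.linear_model import LogisticRegression

# 'balanced' scales the cost function inversely to the class frequencies
clf = LogisticRegression(class_weight='balanced')

# or set the coefficient explicitly, e.g. for a 1:10,000 imbalance
clf = LogisticRegression(class_weight={0: 1.0, 1: 10000.0})
```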

Sampling type

Methods divide roughly into under-sampling, over-sampling, and hybrid methods. A sketch of the random variants follows this list.

- Under-sampling
  - Reduces the majority data.
  - Random undersampling is the simplest approach, but it may delete useful data.
  - Cluster-based methods mitigate this: each class forms distinct data groups, so no single group of useful data gets wiped out.
- Over-sampling
  - Increases the minority data.
  - Random oversampling is the simplest approach, but it tends to cause overfitting.
  - This is mitigated by generating nearby data (existing data with noise added) instead of duplicating existing points.
- Hybrid methods
  - Do both under-sampling and over-sampling.
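For reference, the random variants are simple to write directly; here is a minimal sketch in numpy (my own illustration, assuming a feature matrix X and a 0/1 label vector y):

```python
import numpy as np

def random_undersample(X, y):
    # Randomly drop majority (y == 0) rows until both classes are the same size
    majo, mino = np.where(y == 0)[0], np.where(y == 1)[0]
    keep = np.random.choice(majo, size=len(mino), replace=False)
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

def random_oversample(X, y):
    # Randomly duplicate minority (y == 1) rows until both classes are the same size
    majo, mino = np.where(y == 0)[0], np.where(y == 1)[0]
    dup = np.random.choice(mino, size=len(majo), replace=True)
    idx = np.concatenate([majo, dup])
    return X[idx], y[idx]
```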

Error type

(The original post showed the error-type definitions as an image, erro.png, which is not reproduced here.)

Sampling algorithm

- Under-sampling
  - Keep or drop the data/clusters that are farthest from or nearest to the minority data/clusters (for clusters, judged by the distance between centroids, etc.).
  - Cluster all the data with k-means and decide how many negative samples to drop from each cluster based on its ratio of positive to negative samples. ← This is the method used in this post.
- Over-sampling
  - SMOTE seems to be the de facto standard: on a k-NN basis, a new sample is synthesized by adding noise toward one of the five nearest neighbors of each minority sample.

Reference code

Under-sampling: first, the code for under-sampling.

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans2


def undersampling(imp_info, cv, m):
    # Normalization (rescale each axis to unit variance); done before the
    # minority/majority split so that all rows live in the same whitened space
    whitened = whiten(imp_info)

    # minority data (cv == 1)
    minodata = whitened[np.where(cv == 1)[0]]

    # Clustering with kmeans2
    centroid, label = kmeans2(whitened, k=3)

    C1 = []; C2 = []; C3 = []           # per-cluster samples
    C1_cv = []; C2_cv = []; C3_cv = []  # per-cluster labels
    for i in range(len(imp_info)):
        if label[i] == 0:
            C1.append(whitened[i]); C1_cv.append(cv[i])
        elif label[i] == 1:
            C2.append(whitened[i]); C2_cv.append(cv[i])
        elif label[i] == 2:
            C3.append(whitened[i]); C3_cv.append(cv[i])

    # Converted because numpy arrays are easier to slice
    C1 = np.array(C1); C2 = np.array(C2); C3 = np.array(C3)
    C1_cv = np.array(C1_cv); C2_cv = np.array(C2_cv); C3_cv = np.array(C3_cv)
    
    # Number of majority samples in each cluster
    C1_Nmajo = np.sum(C1_cv == 0); C2_Nmajo = np.sum(C2_cv == 0); C3_Nmajo = np.sum(C3_cv == 0)

    # Number of minority samples in each cluster
    C1_Nmino = np.sum(C1_cv == 1); C2_Nmino = np.sum(C2_cv == 1); C3_Nmino = np.sum(C3_cv == 1)
    t_Nmino = C1_Nmino + C2_Nmino + C3_Nmino

    # Add 1 to the denominator in case a cluster has no minority samples
    C1_MAperMI = float(C1_Nmajo) / (C1_Nmino + 1)
    C2_MAperMI = float(C2_Nmajo) / (C2_Nmino + 1)
    C3_MAperMI = float(C3_Nmajo) / (C3_Nmino + 1)

    t_MAperMI = C1_MAperMI + C2_MAperMI + C3_MAperMI

    # Majority samples to keep per cluster: m * (total minority count) overall,
    # allocated in proportion to each cluster's majority/minority ratio
    under_C1_Nmajo = int(m * t_Nmino * C1_MAperMI / t_MAperMI)
    under_C2_Nmajo = int(m * t_Nmino * C2_MAperMI / t_MAperMI)
    under_C3_Nmajo = int(m * t_Nmino * C3_MAperMI / t_MAperMI)
    t_under_Nmajo = under_C1_Nmajo + under_C2_Nmajo + under_C3_Nmajo

    # Delete majority data so that each cluster keeps only its quota
    C1 = C1[C1_cv == 0]
    np.random.shuffle(C1)
    C1 = C1[:under_C1_Nmajo, :]
    C2 = C2[C2_cv == 0]
    np.random.shuffle(C2)
    C2 = C2[:under_C2_Nmajo, :]
    C3 = C3[C3_cv == 0]
    np.random.shuffle(C3)
    C3 = C3[:under_C3_Nmajo, :]

    # Rebuild labels: kept majority samples first, then all minority samples
    cv_0 = np.zeros(t_under_Nmajo); cv_1 = np.ones(len(minodata))
    cv_d = np.hstack((cv_0, cv_1))

    info = np.vstack((C1, C2, C3, minodata))

    return cv_d, info
```
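A hypothetical call, assuming imp_info is a 2-D feature array, cv a 0/1 label vector, and m the desired majority-to-minority ratio after sampling (names from the code above; the data here is dummy):

```python
import numpy as np

imp_info = np.random.rand(10000, 8)              # dummy features
cv = (np.random.rand(10000) < 0.01).astype(int)  # ~1% minority class
cv_d, info = undersampling(imp_info, cv, m=1.0)  # aim for roughly 1:1
print(info.shape, cv_d.mean())
```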

Over-sampling: next, the code for over-sampling.

```python
import numpy as np


class SMOTE(object):
    def __init__(self, N, nnk=5):
        self.N = N       # amount of oversampling in percent (e.g. 200 = 200%)
        self.T = 0       # number of seed minority samples, set in create_synth
        self.nnk = nnk   # number of nearest neighbors to draw from

    def oversampling(self, smp, cv):
        mino_idx = np.where(cv == 1)[0]
        mino_smp = smp[mino_idx, :]

        # Brute-force kNN: for each minority sample, collect the indices of
        # its nnk nearest neighbors among all samples
        mino_nn = []
        for idx in mino_idx:
            dists = np.array([self.dist(smp[idx, :], smp[i, :])
                              for i in range(len(smp))])
            dists[idx] = np.inf  # exclude the sample itself
            mino_nn.append(np.argsort(dists)[:self.nnk])
        return self.create_synth(smp, mino_smp, np.array(mino_nn, dtype=int))

    def dist(self, smp_1, smp_2):
        return np.sqrt(np.sum((smp_1 - smp_2) ** 2))
                    
    def create_synth(self, smp, mino_smp, mino_nn):
        self.T = len(mino_smp)
        if self.N < 100:
            # For N < 100, oversample only a random N% subset of the minority
            self.T = int(self.N * 0.01 * len(mino_smp))
            self.N = 100
        self.N = int(self.N * 0.01)  # synthetic samples per seed sample

        # Random seed samples (indices into mino_smp)
        rs = np.floor(np.random.uniform(size=self.T) * len(mino_smp)).astype(int)

        synth = []
        for n in range(self.N):
            for i in rs:
                nn = np.random.randint(self.nnk)  # pick a random neighbor
                dif = smp[mino_nn[i, nn], :] - mino_smp[i, :]
                gap = np.random.uniform(size=len(mino_smp[0]))
                # New point between the seed and its neighbor; np.floor keeps
                # integer-valued features integral (drop it for continuous data)
                tmp = mino_smp[i, :] + np.floor(gap * dif)
                tmp[tmp < 0] = 0
                synth.append(tmp)
        return np.array(synth)
```
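A hypothetical call; N = 200 means 200%, i.e. two synthetic points per minority sample (dummy integer data, since the np.floor in create_synth targets integer-valued features):

```python
smp = np.random.randint(0, 100, size=(1000, 8)).astype(float)  # dummy counts
cv = (np.random.rand(1000) < 0.05).astype(int)                 # ~5% minority
synth = SMOTE(N=200).oversampling(smp, cv)
print(synth.shape)  # about twice the number of minority samples
```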

Impressions of the experiment

I wonder whether algorithm-level approaches are more robust to dirty data than data-level approaches. Then again, maybe my code is just wrong. Please let me know if you spot any mistakes.

Summary

Data-level approaches have to process the whole dataset as a batch and are likely to increase the amount of computation, so adjusting things with algorithm-level approaches felt more practical to me. With algorithm-level approaches you only increase the cost, and hence the gradient of the weight update, for minority samples, so the effect on training time is almost nil; a minimal sketch follows. If you know other good ways of handling imbalanced data, please comment.
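To make that concrete, here is a minimal sketch (my own toy illustration, not from the paper) of one class-weighted gradient step for logistic regression, where errors on minority samples are simply made more expensive:

```python
import numpy as np

def weighted_logreg_step(w, X, y, lr=0.1, mino_weight=100.0):
    # Predicted probabilities under the current weights
    p = 1.0 / (1.0 + np.exp(-X.dot(w)))
    # Per-sample cost weights: minority (y == 1) errors cost mino_weight times more
    sample_w = np.where(y == 1, mino_weight, 1.0)
    # Weighted cross-entropy gradient; no extra passes over the data are needed
    grad = X.T.dot(sample_w * (p - y)) / len(y)
    return w - lr * grad
```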
