[Implementation for learning] Implement Stratified Sampling in Python (1)

Entry summary

Stratified sampling is a technique for maintaining a good population distribution and sampling. In python, it is implemented in scikit-learn's StratifiedShuffleSplit and train_test_split. I'm often taken care of when cross-validating machine learning models. Instead of just using it as a tool, I implemented a sample code for learning in python and compared it with random sampling to deepen my understanding. As a result, it was confirmed that the smaller the number of samples from the extraction source, the better the accuracy of extraction compared to random sampling. (Although it is a natural result, I deepened my understanding by reproducing it with my own hands)

Stratified sampling

Simply put, it's a useful technique for sampling from a biased sample composition population. Divide the population into small groups called "layers". At that time, the dispersion for each layer is as small as possible, and the dispersion between layers is as large as possible. In other words, samples with the same attributes are grouped together.

For example, suppose you have 100 cards with a number from 0 to 9. Shuffle this.

0 1 9 ・ ・ ・ 5 3 7 1

Group this by the same number. It's a number, so you can sort it.

0 0 0 ・ ・ ・ 5 5 5 5 ・ ・ ・ 9 9 9 9

After grouping by the same number like this, sample from each group at the same ratio. For example, if the sampling rate is 10%, random sampling will be performed from each group with numbers 0 to 9 with a 10% probability. If there are 10 numbers from 0 to 9, random sampling is performed as follows.

0 0 0 ・ ・ ・ 0 → Randomly extract one sheet
1 1 1 ・ ・ ・ 1 → Randomly extract one sheet
2 2 2 ・ ・ ・ 2 → Randomly extract one sheet
・ ・ ・
9 9 9 ・ ・ ・ 9 → Randomly extract one sheet

This stratified sampling is effective for biased sample configurations.

Extraction from biased sample composition

As an easy-to-understand example, let's say that out of 20 cards, 2 are "0" and the remaining 18 are "1".

0 0 1 1 ・ ・ ・ 1 1 1 1 1

(1) When randomly sampling from the whole

Let's say you randomly sample the whole thing and choose 10 of them. The extraction rate is 50%. Since it is random, it may happen that 10 out of 18 "1" s are selected and "0" is not extracted. In that case, the sample after extraction does not contain any "0", so the distribution cannot be maintained.

0 0 1 1 ・ ・ ・ 1 1 1 1 1

↓ ↓ ↓ ↓ Extract 10 out of 20

1 1 1 1 1 1 1 1 1 1

→ ** Since there is no "0", it is clearly different from the distribution of the population! ** **

Therefore, stratified sampling is performed.

(2) In the case of stratified sampling

For stratified sampling, extract as follows.

0 0 1 1 ・ ・ ・ 1 1 1 1 1

↓ ↓ ↓ ↓ Extract 10 out of 20

0 1 1 1 1 1 1 1 1 1

→ ** The composition ratio of "0" and "1" is the same as the population = the distribution is the same! ** **

The samples thus extracted maintain a better population distribution than if they were randomly selected from the whole.

Implement stratified sampling

Now that we have an overview, we will implement stratified sampling. The processing logic is as follows.

--Suppose a given sample consists of K layers. --Let the extraction ratio be r. (0 <r <1) --Extract at least one from each layer. --After extracting one from each layer, random sampling is performed for each layer. --Let the layer number of each layer be i. (i = 1,2,3, .., K) --The number of samples in each layer is N (i). --Let's assume that the sampling rate for random sampling in each layer is X (i). (0 <X (i) <1; i = 1,2,3, .., K) --Since one is always extracted in each layer, random sampling is performed for the remaining N (i) --1 at the extraction rate X (i). ――The relationship between the number of extracts is as follows.

The code implemented as a sample based on the above logic is as follows.

import numpy as np
import random

def extract_stratified_sampling_result(ratio, base_samples):
    u"""
Stratified sampling is performed from a finite population by specifying the extraction ratio.
    :param ratio:Extraction ratio 0 to 1.0
    :param base_sample:Extraction source group
    :return:
    """
    #First, take out one from each group of numbers.
    #Then, it is randomly selected from each number group to bring the composition ratio closer to the population.
    #X extraction rate after extracting one from each number group(i)And. i is the group number.
    #N for the number of each digit group(i)And.
    #The number to be extracted is ratio x N(i)Is.
    #I have already taken out one, so it remains( N(i) - 1 )Extraction rate X from the pieces(i)Take out at random.
    #Therefore, when combined with the one already taken out, ratio x N(i)Become an individual.
    # X(i) x (N(i) - 1) + 1 = ratio x N(i)
    # X(i) = (ratio x N(i) - 1 )/(N(i) - 1)Is.

    block_count = np.bincount(base_samples)
    x = (ratio * block_count - 1) / (block_count - 1)

    #Calculate the random number threshold for sampling.
    #Threshold= 1.0 -Extraction rate of each group
    #Extract when the random number exceeds the threshold.
    #Extraction rate of a group is 0.If 3, then 1.0 - 0.3 = 0.7, random number is 0.If it is 7 or more, it will be extracted.
    #An array x containing the extraction rates for each number group,
    #Arrange as many numbers as there are numbers.
    threshold = np.repeat(1.0 - x, block_count)

    #Index list of each element when the original set is sorted
    #Extracting from samples in this order will result in sorting.
    sorted_pos = np.argsort(base_samples)

    #Start position of each number group
    block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))

    #When the generated random number exceeds the threshold threshold, it is extracted.
    threshold[block_start] = 0  #The first element of each number group is always extracted
    extracted = []
    for i in range(len(base_samples)):
        each_rand = random.random()
        if each_rand > threshold[i]:
            pos = sorted_pos[i]
            extracted.append(base_samples[pos])
    extracted = np.array(extracted, dtype=np.int64)
    return extracted

It is assumed that the argument base_samples contains a list of integers. Let's take a look at each code. First, let's understand the structure of the group of the extraction source base_samples. That's where np.bincount () comes in. It aggregates the number of each number that makes up the list.

    
    block_count = np.bincount(base_samples)
    

For example, if base_samples contains 9 0s and 91 1s, block_count will return the following result:

[ 9 91] 

In other words block_coount [0] = number of numbers 0, block_count [1] = number of numbers 1, That's why. This block_count corresponds to the number N (i) of each layer in the previous logic. Next, find the extraction rate X (i) for each layer. The extraction rate r corresponds to the argument ratio, so the code looks like this:

    x = (ratio * block_count - 1) / (block_count - 1)

Since block_count is a numpy array, the calculation result will also be a numpy array. That is, the contents of x look like this:

x [0] = Extraction rate of layer with number 0 x [1] = Extraction rate of the layer with the number 1

Now that x has been calculated, the next step is to find the random number threshold. If the random number is greater than or equal to this value, the sample is taken out. Therefore,

Random number threshold for each layer= 1.0 - X(i)  

It will be. For example, if the extraction rate for the number 0 is 0.10, the threshold is

1.0 - 0.10 = 0.90

It will be.

In other words, the sample is taken out only when the generated random number (0 to 1.0) is 0.90 or more.

The random number thresholds for each layer obtained in this way are repeatedly arranged for the number of samples in each layer.

    #An array x containing the extraction rates for each number group,
    #Arrange as many numbers as there are numbers.
    threshold = np.repeat(1.0 - x, block_count)

x[0] = 0.20 , block_count[0] = 2 Assuming x [1] = 0.10, block_count [1] = 18, the list threshold of random number thresholds is:

threshold = [ 0.80, 0.80, 0.90, 0.90, 0.90, ... 0.90]

Next, get the index list after sorting base_samples.

    #Index list of each element when the original set is sorted
    #Extracting from samples in this order will result in sorting.
    sorted_pos = np.argsort(base_samples)

You can use a combination of threshold, sorted_pos, and base_samples for stratified sampling. For example

--threshold [0]: Random number threshold of the first number of layer 0 --sort_pos [0]: Position (= index) where the first number of layer 0 is stored on base_samples

Therefore, if the first random number generated is threshold [0] or higher, the first element of layer 0 is extracted. By moving the scanning position of threshold and sorted_pos, stratified sampling can be performed.

After that, we will process for the logic that one is always taken out from each layer.

    #Start position of each number group
    block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))

    #When the generated random number exceeds the threshold threshold, it is extracted.
    threshold[block_start] = 0  #The first element of each number group is always extracted

block_start contains the position of the first element of each layer. For example, if there are 10 0s and 90 1s, then:

block_start = [0,10]

Layers with the number 0 start at the block_start [0] th The layer with the number 1 means that it starts at the block_start [1] th.

Use this block_start to set the random number threshold of the first element of each layer to 0. A random number threshold of 0 means that it will always be extracted.

And since the random number threshold value after the beginning of each layer is (1-X (i)), it will be randomly extracted according to the extraction rate calculated.

The entry is likely to be long, so I'll conclude here.

The following entry presents sample code that compares the performance of simple random sampling with stratified sampling.

Recommended Posts

[Implementation for learning] Implement Stratified Sampling in Python (1)
Implement stacking learning in Python [Kaggle]
Implement Enigma in python
RNN implementation in python
ValueObject implementation in Python
Implement recommendations in Python
Implement XENO in python
Implement sum in Python
Implement Traceroute in Python 3
SVM implementation in python
First deep learning in C #-Imitating implementation in Python-
Build an interactive environment for machine learning in Python
Learning flow for Python beginners
Python learning plan for AI learning
Search for strings in Python
Implement naive bayes in Python 3.3
Implement ancient ciphers in python
Techniques for sorting in Python
Implement Redis Mutex in Python
Implement extension field in Python
Implement fast RPC in Python
Implement method chain in Python
Checkio's recommendation for learning Python
Implement Dijkstra's Algorithm in python
Implement Slack chatbot in Python
Implementation of quicksort in Python
About "for _ in range ():" in python
How to implement Python EXE for Windows in Docker container
Implement R's power.prop.test function in python
Check for external commands in python
Sorting algorithm and implementation in Python
HMM parameter estimation implementation in python
Mixed normal distribution implementation in python
Web teaching materials for learning Python
Implement the Singleton pattern in Python
Implementation of life game in Python
<For beginners> python library <For machine learning>
Python: Preprocessing in Machine Learning: Overview
Implemented Perceptron learning rules in Python
Quickly implement REST API in Python
Implementation of original sorting in Python
Run unittests in Python (for beginners)
Learning history for participating in team app development in Python ~ Django Tutorial 5 ~
Learning history for participating in team application development in Python ~ Index page ~
Learning history for participating in team app development in Python ~ Django Tutorial 4 ~
How about Anaconda for building a machine learning environment in Python?
Learning history for participating in team app development in Python ~ Django Tutorial 6 ~
I tried to implement PLSA in Python
Implement __eq__ etc. generically in Python class
I tried to implement permutation in Python
Amplify images for machine learning with python
Notes on nfc.ContactlessFrontend () for nfcpy in python
Inject is recommended for DDD in Python
Implementation module "deque" in queue and Python
Implement FIR filters in Python and C
Tips for dealing with binaries in Python
[python] Frequently used techniques in machine learning
Collectively implement statistical hypothesis testing in Python
Why Python is chosen for machine learning
I tried to implement PLSA in Python 2
Summary of various for statements in Python