[PYTHON] I want to find variations in various statistics! Recommendation for re-sampling (Bootstrap)

Data analysis seems to be booming. Whether it's machine learning or whatever, the first thing you do when you're given data is to visualize the data and get a hypothesis to start your analysis. To test the hypothesis, we perform various analyzes and calculate various statistics.

If you do that, you may come across a situation where you want to know ** the variation in a population of statistics **.

If it is the mean of the data, then you can calculate the standard error of the mean. It's the standard deviation divided by the square root of the number of data.

However, in the analysis, we will look not only at the mean but also at various statistics such as the median, the percentile, and the correlation coefficient. And there is no simple formula for these that can quantify variations such as ** standard error and confidence intervals **.

In such a case, ** resampling **, especially ** Bootstrap **, is useful.

What is bootstrap?

Bootstrap first creates thousands of similar datasets from the original data by resampling. Then, for each dataset, calculate the statistic you want. Then, the distribution of the statistic is created, and the standard error of the statistic can be obtained by calculating the standard deviation of the distribution.

As an example, let's say you have IQ data for 20 people. Let's say the values are 61, 88, 89, 89, 90, 92, 93, 94, 98, 98, 101, 102, 105, 108, 109, 113, 114, 115, 120, 138, respectively. The distribution looks like this:

figure1.png

At this time, the mean value is 100.85, and the standard error of the mean value is 3.45.

However, the distribution of these values, like income, is that the median rather than the mean reflects a person's feelings. So, this time, I'd like to use bootstrap to find the median variation, that is, ** standard error of the median **.

Bootstrap algorithm

Bootstrap is performed according to the following flow.

1: Decide how many times to resample. Make a large number, such as 500, 1000, 10000, 100000, etc. This time I will try 1000.

2: Resample and artificially generate data. At this time, ** allow duplication and extract as many data samples as the number of original data samples **.

For example, if the original data is $ X_1, X_2, ..., X_5 $ and $ 5 $, the following is an example of the dataset generated each time you resample.

Resample 1 = $ X1, X2, X3, X3, X4 $ Resample 2 = $ X1, X1, X3, X4, X4 $ Resample 3 = $ X2, X2, X5, X5, X5 $ ...... Resample 998 = $ X1, X2, X3, X4, X5 $ Resample 999 = $ X1, X3, X4, X5, X5 $ Resample 1000 = $ X3, X3, X3, X3, X4 $

In this way, duplication is allowed for the number determined in step 1, and the data sample is repeatedly extracted for the number of original data, and data is artificially created.

Which data sample is included in each resample is random. Sometimes all five will have the same value, and sometimes all five will have different values.

3: For each resample, calculate the statistic you want to find. Whatever the median, each resampled data is used to find the statistic. If you resample 1000 times, you can get 1000 statistics.

Resample 1 = $ X1, X2, X3, X3, X4 $ $ \ to $ Statistic = $ S1 $ Resample 2 = $ X1, X1, X3, X4, X4 $ $ \ to $ Statistic = $ S2 $ Resample 3 = $ X2, X2, X5, X5, X5 $ $ \ to $ Statistic = $ S3 $ ...... Resample 998 = $ X1, X2, X3, X4, X5 $ $ \ to $ Statistic = $ S998 $ Resample 999 = $ X1, X3, X4, X5, X5 $ $ \ to $ Statistic = $ S999 $ Resample 1000 = $ X3, X3, X3, X3, X4 $ $ \ to $ Statistic = $ S1000 $

You got a lot of statistics. And think like this.

"You can think of the distribution of this lot of statistics as ** almost the same as the population distribution of that statistic **! Because ** anyway, when I experimented more and got a lot of data, it was similar. Only the value data will increase ** Because !! "

So bootstrap uses statistical tricks to increase the amount of data, not experimentally. And if the number of data is large to some extent, the distribution should be as close as possible to the distribution of the population.   4: Calculate the variation (standard error, confidence interval, etc.) using the obtained distribution of statistics. For example, ** standard error is the standard deviation of the population distribution **, so you can simply find the standard deviation of the obtained statistic distribution.

In this way, you can find the variation of the statistics that you are interested in.

Bootstrap practice

Let's do it in python (Matlab / Octave is also below).

First, get the required library.


# import librarys
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Use Numpy to put the data into the array.


# get data
iq = np.array([61, 88, 89, 89, 90, 92, 93, 
               94, 98, 98, 101, 102, 105, 108,
               109, 113, 114, 115, 120, 138])

Calculate the mean, standard error of the mean, and median as basic statistics.


# compute mean, SEM (standard error of the mean) and median
mean_iq = np.average(iq)
sem_iq = np.std(iq)/np.sqrt(len(iq))
median_iq = np.median(iq)

We were able to calculate the mean: 100.85, the standard error of the mean: 3.45, and the median: 99.5, respectively.

Run bootstrap to find the standard error of the median. As a sanity check to see if the bootstrap is correct, the standard error of the average value is also calculated.

In python, to randomly sort consecutive integer values from 1 to n with duplicates, do as follows.


np.random.choice(n,n,replace=True)

Use this to resample bootstrap.


# bootstrap to compute sem of the median
def bootstrap(x,repeats):
    # placeholder (column1: mean, column2: median)
    vec = np.zeros((2,repeats))
    for i in np.arange(repeats):
        # resample data with replacement
        re = np.random.choice(len(x),len(x),replace=True)
        re_x = x[re]
            
        # compute mean and median of the "new" dataset
        vec[0,i] = np.mean(re_x)
        vec[1,i] = np.median(re_x)
    
    # histogram of median from resampled datasets
    sns.distplot(vec[1,:], kde=False)
    
    # compute bootstrapped standard error of the mean,
    # and standard error of the median
    b_mean_sem = np.std(vec[0,:])
    b_median_sem = np.std(vec[1,:])
    
    return b_mean_sem, b_median_sem   

If you run this function with x = iq, repeats = 1000, you can see the distribution of the 1000 medians found by bootstrap.

figure2.png

...... I think it would have been better to have more repetitions.

For the time being, the standard error of the median can be calculated from this distribution. Simply calculate the standard deviation ** of the median of ** 1000 pieces.

--Standard error of median calculated by bootstrap: 4.22

In the same way, calculate the standard error of the mean.

--Standard error of mean value obtained by bootstrap: 3.42

First, the standard error of the mean value obtained from the original data is 3.45, so it's pretty close. Bootstrap seems to have correctly estimated the variation in statistics.

important point Bootstrap ** resamples randomly, so the results will be slightly different each time you run **. To reduce the variation in results and make more accurate estimates, you need to increase the number of iterations (~ 10,000).

Summary Summary

Bootstrap can estimate variations in statistics of interest by resampling.

At the end

I think that anyone who thinks about resampling in statistics, including bootstrap, is a genius. By artificially increasing the data and approximating the population distribution, you are freed from the penance of continuing to collect data for the time being! Of course, if you don't actually collect some data and collect representative examples of the population, no matter how much you resample, you will never be able to approach the population ...

The source code is listed below.

bootstrap_demo.py



# import librarys
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# get data
iq = np.array([61, 88, 89, 89, 90, 92, 93, 
               94, 98, 98, 101, 102, 105, 108,
               109, 113, 114, 115, 120, 138])

# compute mean, SEM (standard error of the mean) and median
mean_iq = np.average(iq)
sem_iq = np.std(iq)/np.sqrt(len(iq))
median_iq = np.median(iq)

# bootstrap to compute sem of the median
def bootstrap(x,repeats):
    # placeholder (column1: mean, column2: median)
    vec = np.zeros((2,repeats))
    for i in np.arange(repeats):
        # resample data with replacement
        re = np.random.choice(len(x),len(x),replace=True)
        re_x = x[re]
            
        # compute mean and median of the "new" dataset
        vec[0,i] = np.mean(re_x)
        vec[1,i] = np.median(re_x)
    
    # histogram of median from resampled datasets
    sns.distplot(vec[1,:], kde=False)
    
    # compute bootstrapped standard error of the mean,
    # and standard error of the median
    b_mean_sem = np.std(vec[0,:])
    b_median_sem = np.std(vec[1,:])
    
    return b_mean_sem, b_median_sem   

# execute bootstrap
bootstrapped_sem = bootstrap(iq,1000)    

Matlab / Ovctave is below.

bootstrap_demo.m



function bootstrap_demo

% data
iq = [61, 88, 89, 89, 90, 92, 93,94, 98, 98, 101, 102, 105, 108,109, 113, 114, 115, 120, 138];

% compute mean, SEM (standard error of the mean) and median
mean_iq = mean(iq);
sem_iq = std(iq)/sqrt(length(iq));
median_iq = median(iq);
disp(['the mean: ' num2str(mean_iq)])
disp(['the SE of the mean: ' num2str(sem_iq)])
disp(['the median: ' num2str(median_iq)])
disp('---------------------------------')

[b_mean_sem, b_median_sem] = bootstrap(iq, 1000);
disp(['bootstrapped SE of the mean: ' num2str(b_mean_sem)])
disp(['bootstrapped SE of the median: ' num2str(b_median_sem)])

% bootstrap to compute sem of the median
function [b_mean_sem, b_median_sem] = bootstrap(x, repeats)

% placeholder (column1: mean, column2: median)
vec = zeros(2,repeats);
for i = 1:repeats
    % resample data with replacement
    re_x = x(datasample(1:length(x),length(x),'Replace',True));
    
    % compute mean and median of the "new" dataset
    vec(1,i) = mean(re_x);
    vec(2,i) = median(re_x);
    
end

% histogram of median from resampled dataset
histogram(vec(2,:))

% compute bootstrapped standard error of the mean, and standard error of
% the median
b_mean_sem = std(vec(1,:));
b_median_sem = std(vec(2,:));

Reference

Recommended Posts

I want to find variations in various statistics! Recommendation for re-sampling (Bootstrap)
I just want to find the 95% confidence interval for the difference in population ratios in Python
I want to print in a comprehension
I want to embed Matplotlib in PySimpleGUI
I analyzed Airbnb data for those who want to stay in Amsterdam
I want to connect to PostgreSQL from various languages
I want to pin Datetime.now in Django tests
I want to create a window in Python
I want to store DB information in list
I want to merge nested dicts in Python
I want to easily find a delicious restaurant
I want to display the progress in Python!
I want to set up a mock server for python-flask in seconds using swagger-codegen.
I want to write in Python! (1) Code format check
I want to embed a variable in a Python string
I want to easily implement a timeout in python
I want to transition with a button in flask
I want to use self in Backpropagation (tf.custom_gradient) (tensorflow)
I want to write in Python! (2) Let's write a test
Even in JavaScript, I want to see Python `range ()`!
I want to find a popular package on PyPi
I want to randomly sample a file in Python
I want to work with a robot in python.
I want to write in Python! (3) Utilize the mock
When you want to plt.save in a for statement
I want to use the R dataset in python
I want to do something in Python when I finish
I want to manipulate strings in Kotlin like Python!
[For beginners] I want to explain the number of learning times in an easy-to-understand manner.
I want to exchange gifts even for myself! [Christmas hackathon]
I want to move selenium for the time being [for mac]
I want to easily delete columns containing NA in R
I want to do something like sort uniq in Python
[NetworkX] I want to search for nodes with specific attributes
[Django] I want to log in automatically after new registration
I want to make the Dictionary type in the List unique
Reference reference for those who want to code in Rhinoceros / Grasshopper
[Introduction to Pytorch] I want to generate sentences in news articles
I want to count unique values in arrays and tuples
I want to align the significant figures in the Numpy array
I want to be able to run Python in VS Code
I want to make input () a nice complement in python
I want to create a Dockerfile for the time being.
I didn't want to write the AWS key in the program
I want to solve Sudoku (Sudoku)
I want to automatically find high-quality parts from the videos I shot
[Linux] I want to know the date when the user logged in
I want to solve APG4b with Python (only 4.01 and 4.04 in Chapter 4)
I want to run Rails with rails s even in vagrant environment
I want to find the shortest route to travel through all points
LINEbot development, I want to check the operation in the local environment
[Python / AWS Lambda layers] I want to reuse only module in AWS Lambda Layers
Scraping and tabelog ~ I want to find a good restaurant! ~ (Work)
I want to create a pipfile and reflect it in docker
I want to make the second line the column name in pandas
For the time being, I want to convert files with ffmpeg !!
I want to pass the G test in one month Day 1
I want to know the population of each country in the world.
I want to use Ubuntu's desktop environment on Android for the time being (Termux version-Japanese input in desktop environment)