Data analysis seems to be booming. Whether it's machine learning or whatever, the first thing you do when you're given data is to visualize the data and get a hypothesis to start your analysis. To test the hypothesis, we perform various analyzes and calculate various statistics.
If you do that, you may come across a situation where you want to know ** the variation in a population of statistics **.
If it is the mean of the data, then you can calculate the standard error of the mean. It's the standard deviation divided by the square root of the number of data.
However, in the analysis, we will look not only at the mean but also at various statistics such as the median, the percentile, and the correlation coefficient. And there is no simple formula for these that can quantify variations such as ** standard error and confidence intervals **.
In such a case, ** resampling **, especially ** Bootstrap **, is useful.
Bootstrap first creates thousands of similar datasets from the original data by resampling. Then, for each dataset, calculate the statistic you want. Then, the distribution of the statistic is created, and the standard error of the statistic can be obtained by calculating the standard deviation of the distribution.
As an example, let's say you have IQ data for 20 people. Let's say the values are 61, 88, 89, 89, 90, 92, 93, 94, 98, 98, 101, 102, 105, 108, 109, 113, 114, 115, 120, 138, respectively. The distribution looks like this:
At this time, the mean value is 100.85, and the standard error of the mean value is 3.45.
However, the distribution of these values, like income, is that the median rather than the mean reflects a person's feelings. So, this time, I'd like to use bootstrap to find the median variation, that is, ** standard error of the median **.
Bootstrap is performed according to the following flow.
1: Decide how many times to resample. Make a large number, such as 500, 1000, 10000, 100000, etc. This time I will try 1000.
2: Resample and artificially generate data. At this time, ** allow duplication and extract as many data samples as the number of original data samples **.
For example, if the original data is $ X_1, X_2, ..., X_5 $ and $ 5 $, the following is an example of the dataset generated each time you resample.
Resample 1 = $ X1, X2, X3, X3, X4 $ Resample 2 = $ X1, X1, X3, X4, X4 $ Resample 3 = $ X2, X2, X5, X5, X5 $ ...... Resample 998 = $ X1, X2, X3, X4, X5 $ Resample 999 = $ X1, X3, X4, X5, X5 $ Resample 1000 = $ X3, X3, X3, X3, X4 $
In this way, duplication is allowed for the number determined in step 1, and the data sample is repeatedly extracted for the number of original data, and data is artificially created.
Which data sample is included in each resample is random. Sometimes all five will have the same value, and sometimes all five will have different values.
3: For each resample, calculate the statistic you want to find. Whatever the median, each resampled data is used to find the statistic. If you resample 1000 times, you can get 1000 statistics.
Resample 1 = $ X1, X2, X3, X3, X4 $ $ \ to $ Statistic = $ S1 $ Resample 2 = $ X1, X1, X3, X4, X4 $ $ \ to $ Statistic = $ S2 $ Resample 3 = $ X2, X2, X5, X5, X5 $ $ \ to $ Statistic = $ S3 $ ...... Resample 998 = $ X1, X2, X3, X4, X5 $ $ \ to $ Statistic = $ S998 $ Resample 999 = $ X1, X3, X4, X5, X5 $ $ \ to $ Statistic = $ S999 $ Resample 1000 = $ X3, X3, X3, X3, X4 $ $ \ to $ Statistic = $ S1000 $
You got a lot of statistics. And think like this.
"You can think of the distribution of this lot of statistics as ** almost the same as the population distribution of that statistic **! Because ** anyway, when I experimented more and got a lot of data, it was similar. Only the value data will increase ** Because !! "
So bootstrap uses statistical tricks to increase the amount of data, not experimentally. And if the number of data is large to some extent, the distribution should be as close as possible to the distribution of the population. 4: Calculate the variation (standard error, confidence interval, etc.) using the obtained distribution of statistics. For example, ** standard error is the standard deviation of the population distribution **, so you can simply find the standard deviation of the obtained statistic distribution.
In this way, you can find the variation of the statistics that you are interested in.
Let's do it in python (Matlab / Octave is also below).
First, get the required library.
# import librarys
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Use Numpy to put the data into the array.
# get data
iq = np.array([61, 88, 89, 89, 90, 92, 93,
94, 98, 98, 101, 102, 105, 108,
109, 113, 114, 115, 120, 138])
Calculate the mean, standard error of the mean, and median as basic statistics.
# compute mean, SEM (standard error of the mean) and median
mean_iq = np.average(iq)
sem_iq = np.std(iq)/np.sqrt(len(iq))
median_iq = np.median(iq)
We were able to calculate the mean: 100.85, the standard error of the mean: 3.45, and the median: 99.5, respectively.
Run bootstrap to find the standard error of the median. As a sanity check to see if the bootstrap is correct, the standard error of the average value is also calculated.
In python, to randomly sort consecutive integer values from 1 to n with duplicates, do as follows.
np.random.choice(n,n,replace=True)
Use this to resample bootstrap.
# bootstrap to compute sem of the median
def bootstrap(x,repeats):
# placeholder (column1: mean, column2: median)
vec = np.zeros((2,repeats))
for i in np.arange(repeats):
# resample data with replacement
re = np.random.choice(len(x),len(x),replace=True)
re_x = x[re]
# compute mean and median of the "new" dataset
vec[0,i] = np.mean(re_x)
vec[1,i] = np.median(re_x)
# histogram of median from resampled datasets
sns.distplot(vec[1,:], kde=False)
# compute bootstrapped standard error of the mean,
# and standard error of the median
b_mean_sem = np.std(vec[0,:])
b_median_sem = np.std(vec[1,:])
return b_mean_sem, b_median_sem
If you run this function with x = iq, repeats = 1000, you can see the distribution of the 1000 medians found by bootstrap.
...... I think it would have been better to have more repetitions.
For the time being, the standard error of the median can be calculated from this distribution. Simply calculate the standard deviation ** of the median of ** 1000 pieces.
--Standard error of median calculated by bootstrap: 4.22
In the same way, calculate the standard error of the mean.
--Standard error of mean value obtained by bootstrap: 3.42
First, the standard error of the mean value obtained from the original data is 3.45, so it's pretty close. Bootstrap seems to have correctly estimated the variation in statistics.
important point Bootstrap ** resamples randomly, so the results will be slightly different each time you run **. To reduce the variation in results and make more accurate estimates, you need to increase the number of iterations (~ 10,000).
Bootstrap can estimate variations in statistics of interest by resampling.
I think that anyone who thinks about resampling in statistics, including bootstrap, is a genius. By artificially increasing the data and approximating the population distribution, you are freed from the penance of continuing to collect data for the time being! Of course, if you don't actually collect some data and collect representative examples of the population, no matter how much you resample, you will never be able to approach the population ...
The source code is listed below.
bootstrap_demo.py
# import librarys
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# get data
iq = np.array([61, 88, 89, 89, 90, 92, 93,
94, 98, 98, 101, 102, 105, 108,
109, 113, 114, 115, 120, 138])
# compute mean, SEM (standard error of the mean) and median
mean_iq = np.average(iq)
sem_iq = np.std(iq)/np.sqrt(len(iq))
median_iq = np.median(iq)
# bootstrap to compute sem of the median
def bootstrap(x,repeats):
# placeholder (column1: mean, column2: median)
vec = np.zeros((2,repeats))
for i in np.arange(repeats):
# resample data with replacement
re = np.random.choice(len(x),len(x),replace=True)
re_x = x[re]
# compute mean and median of the "new" dataset
vec[0,i] = np.mean(re_x)
vec[1,i] = np.median(re_x)
# histogram of median from resampled datasets
sns.distplot(vec[1,:], kde=False)
# compute bootstrapped standard error of the mean,
# and standard error of the median
b_mean_sem = np.std(vec[0,:])
b_median_sem = np.std(vec[1,:])
return b_mean_sem, b_median_sem
# execute bootstrap
bootstrapped_sem = bootstrap(iq,1000)
Matlab / Ovctave is below.
bootstrap_demo.m
function bootstrap_demo
% data
iq = [61, 88, 89, 89, 90, 92, 93,94, 98, 98, 101, 102, 105, 108,109, 113, 114, 115, 120, 138];
% compute mean, SEM (standard error of the mean) and median
mean_iq = mean(iq);
sem_iq = std(iq)/sqrt(length(iq));
median_iq = median(iq);
disp(['the mean: ' num2str(mean_iq)])
disp(['the SE of the mean: ' num2str(sem_iq)])
disp(['the median: ' num2str(median_iq)])
disp('---------------------------------')
[b_mean_sem, b_median_sem] = bootstrap(iq, 1000);
disp(['bootstrapped SE of the mean: ' num2str(b_mean_sem)])
disp(['bootstrapped SE of the median: ' num2str(b_median_sem)])
% bootstrap to compute sem of the median
function [b_mean_sem, b_median_sem] = bootstrap(x, repeats)
% placeholder (column1: mean, column2: median)
vec = zeros(2,repeats);
for i = 1:repeats
% resample data with replacement
re_x = x(datasample(1:length(x),length(x),'Replace',True));
% compute mean and median of the "new" dataset
vec(1,i) = mean(re_x);
vec(2,i) = median(re_x);
end
% histogram of median from resampled dataset
histogram(vec(2,:))
% compute bootstrapped standard error of the mean, and standard error of
% the median
b_mean_sem = std(vec(1,:));
b_median_sem = std(vec(2,:));
Recommended Posts