[PYTHON] Bootstrap sampling with Pandas

A short script for bootstrap sampling in pandas

Bootstrap sampling draws data at random from a sample, with replacement (so duplicates are allowed), to create a slightly different resampled dataset. Repeating this 1000 times or so gives a distribution of the statistics you are interested in. I worked out how to do it with pandas, so here is a note.
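To make "with replacement" concrete, here is a minimal sketch using only the standard library. The list `sample` and its values are made up for illustration and are not part of the iris data.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
sample = [10, 20, 30, 40, 50]  # made-up values for illustration

# Draw len(sample) values with replacement:
# the same value can appear more than once, and some may not appear at all
resample = random.choices(sample, k=len(sample))
print(resample)
```

The resample has the same size as the original, but its contents differ from draw to draw.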

Trying it with the iris sample data

Import pandas and the iris sample dataset, plus the random module used for random sampling.

import pandas as pd
import random
from sklearn.datasets import load_iris

Then load the data and put it in a pandas data frame.

iris_dataset = load_iris()
df = pd.DataFrame(data=iris_dataset.data, columns=iris_dataset.feature_names)

Take a look at the data with df.describe().

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.057333	3.758000	1.199333
std	0.828066	0.435866	1.765298	0.762238
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

Then define a function that randomly samples the data. For each row of the original data frame, pick a random row number with selected_num = random.choice(range(a_data_frame.shape[0])) and take the selected row as a_data_frame[selected_num : selected_num + 1]. The selected rows are collected in a list and joined with pd.concat, because DataFrame.append was removed in pandas 2.0. Note that selecting a single row of a pandas data frame this way requires a range such as [0:1], not a single index such as [0] as in numpy.

def btstrap(a_data_frame):
    # Collect one randomly chosen single-row slice per row of the original
    sampled_rows = []
    for _ in range(a_data_frame.shape[0]):
        selected_num = random.choice(range(a_data_frame.shape[0]))
        sampled_rows.append(a_data_frame[selected_num : selected_num + 1])
    # Join the single-row data frames into one resampled data frame
    return pd.concat(sampled_rows)
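The point about selecting a single row can be checked with a small sketch. The data frame `toy_df` and its values are made up for illustration. `iloc` is shown as an alternative way to pick rows by position; it is not what the function above uses.

```python
import pandas as pd

# Made-up data for illustration
toy_df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

one_row_slice = toy_df[0:1]      # row slice: a one-row DataFrame
one_row_iloc = toy_df.iloc[[0]]  # list of positions: also a one-row DataFrame
as_series = toy_df.iloc[0]       # single position: a Series, not a DataFrame

print(type(one_row_slice).__name__, type(as_series).__name__)
```

Writing toy_df[0] instead would look for a column labeled 0, which is why the range form is needed with plain square brackets.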

Run btstr_data = btstrap(df), then check the resampled data with btstr_data.describe().

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
count	150.000000	150.000000	150.000000	150.000000
mean	5.750000	3.040667	3.660667	1.176000
std	0.728034	0.410287	1.716634	0.766644
min	4.300000	2.000000	1.100000	0.100000
25%	5.100000	2.800000	1.500000	0.200000
50%	5.700000	3.000000	4.250000	1.300000
75%	6.300000	3.300000	5.000000	1.800000
max	7.700000	4.400000	6.700000	2.500000

Run this 1000 times or so in a loop and collect the result of each fit or variable selection.
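As a minimal sketch of such a loop, assume the statistic of interest is simply the mean of each column. The names `bootstrap_means`, `n_iter`, `seed` and `toy_df` are hypothetical, the toy values are made up, and the row positions are drawn with NumPy and `iloc` rather than with the function above, to keep the example self-contained.

```python
import numpy as np
import pandas as pd


def bootstrap_means(a_data_frame, n_iter=1000, seed=0):
    """Return one row of column means per bootstrap resample."""
    rng = np.random.default_rng(seed)
    n_rows = a_data_frame.shape[0]
    means = []
    for _ in range(n_iter):
        # Draw n_rows row positions with replacement
        positions = rng.integers(0, n_rows, size=n_rows)
        means.append(a_data_frame.iloc[positions].mean())
    return pd.DataFrame(means)


# Toy data with made-up values, standing in for the iris data frame
toy_df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                       "y": [2.0, 4.0, 6.0, 8.0, 10.0]})
boot_means = bootstrap_means(toy_df, n_iter=1000)

# 2.5% and 97.5% quantiles of the bootstrap means (a percentile interval)
print(boot_means.quantile([0.025, 0.975]))
```

For fitting or variable selection, the call to .mean() would be replaced by whatever is computed on each resample.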

[Addendum] There was an easier way

With replace=True, you can do the same thing with .sample, a built-in pandas method. Thanks to @nkay for pointing this out.

df.sample(n=df.shape[0], replace=True)
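A few details of .sample are worth a quick sketch: `frac=1` draws as many rows as the original has, `random_state` makes a resample repeatable, and the sampled rows keep their original index labels. The data frame `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration
toy_df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

# frac=1 with replace=True draws as many rows as the original has
resampled = toy_df.sample(frac=1, replace=True, random_state=0)

# The same seed gives the same resample
again = toy_df.sample(frac=1, replace=True, random_state=0)

# Sampled rows keep their original index labels, so labels can repeat;
# reset_index(drop=True) renumbers them from 0
renumbered = resampled.reset_index(drop=True)
print(renumbered)
```

When repeating the resampling in a loop, leave random_state unset or change it on each iteration, otherwise every resample will be identical.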
