The Bootstrap re-Sampling

3 min read · Dec 4, 2022

The Bootstrap

The bootstrap is an extremely powerful statistical tool for situations where very little data is available, in other words where the estimate of a statistic, and especially of its uncertainty, is questionable because the sample is small. Its elegance lies in how easily it can be applied to a wide variety of statistical learning methods, such as a linear regression fit.

Each bootstrap data set contains n observations, sampled with replacement from the original data set of n observations. Each bootstrap data set is then used to obtain one estimate of the statistic of interest, and the spread of those estimates shows how uncertain the statistic is. This is as simple as it gets. However, it takes time to realize its impact holistically.
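The procedure described above can be sketched in a few lines of NumPy. This is only a minimal illustration: the helper name `bootstrap_statistic`, the twelve sample values, and the choice of the median as the statistic are assumptions made for the example, not part of the original material.

```python
import numpy as np

def bootstrap_statistic(data, statistic, n_boot=2000, seed=0):
    """Return n_boot bootstrap estimates of `statistic` computed on `data`."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # One bootstrap data set: n indices drawn with replacement.
        idx = rng.integers(0, n, size=n)
        estimates[b] = statistic(data[idx])
    return estimates

# Hypothetical sample of 12 measurements (made up for illustration).
sample = np.array([2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 3.9, 4.4, 2.5, 3.0])

medians = bootstrap_statistic(sample, np.median)

# The spread of the bootstrap estimates approximates the standard error.
se_median = medians.std(ddof=1)
print(f"sample median = {np.median(sample):.2f}, bootstrap SE = {se_median:.2f}")
```

Any function that maps an array to a single number can be passed in place of `np.median`, which is what makes the method so general.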

Using bootstrap resampling to re-populate a data set yields many resampled data sets, from which one can build the distribution of a statistic and then fit predefined curves to it.
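To make the linear regression case concrete, here is a hedged sketch that bootstraps the slope of a straight-line fit by resampling (x, y) pairs together. The synthetic data (true slope 2, intercept 1) and the helper `fit_slope` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption for this example): y = 2x + 1 + noise.
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=n)

def fit_slope(x, y):
    # np.polyfit with deg=1 returns [slope, intercept].
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope

n_boot = 2000
slopes = np.empty(n_boot)
for b in range(n_boot):
    # Resample whole (x, y) pairs so each point keeps its own response.
    idx = rng.integers(0, n, size=n)
    slopes[b] = fit_slope(x[idx], y[idx])

slope_hat = fit_slope(x, y)
ci_low, ci_high = np.percentile(slopes, [2.5, 97.5])
print(f"slope = {slope_hat:.3f}, 95% interval = [{ci_low:.3f}, {ci_high:.3f}]")
```

Resampling pairs is only one option; resampling the residuals of the fit is another common variant and is not shown here.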

Why should you trust the outcome?

If you look at the process by which a bootstrap data set is created, as shown in the image below, you can see that it is equivalent to doing real-world sampling.

It can be argued that ideal-case data sampling largely tries to avoid replacement, but it has to be understood that most real-world data sets are also probabilistic in nature. In short, one can trust the outcomes because the whole process of bootstrap resampling makes very few assumptions about the underlying distribution of the original data set. The comment section is open for disagreement with these observations :)
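One way to build some trust is to try the bootstrap on a statistic whose uncertainty is already known. The sample mean has a textbook standard error of s/√n, so the bootstrap estimate can be compared against it. The sketch below does this on skewed synthetic data; the data, seed, and number of replications are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed synthetic data (an assumption for this example).
x = rng.exponential(size=200)
n = len(x)
reps = 5000

# Each column is one bootstrap data set of n draws with replacement.
boot_means = rng.choice(x, size=(n, reps), replace=True).mean(axis=0)

se_boot = boot_means.std(ddof=1)          # bootstrap standard error of the mean
se_formula = x.std(ddof=1) / np.sqrt(n)   # textbook formula s / sqrt(n)

print(f"bootstrap SE = {se_boot:.4f}, formula SE = {se_formula:.4f}")
```

The two numbers should land close to each other, even though the bootstrap never used the formula or any assumption about the shape of the data.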

Python

I understand it would be unfair not to provide a working sample. So, here is one taken from people.duke.edu:

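import matplotlib.pyplot as plt
import numpy as np

# Jupyter/IPython magics: keep them in a notebook, drop them in a plain .py script.
%matplotlib inline
%precision 4
plt.style.use('ggplot')

# Print floats in NumPy arrays with three decimals.
np.set_printoptions(formatter={'float': lambda x: '%.3f' % x})

# For example, what is the 95% confidence interval for
# the mean of this data set if you didn't know how it was generated?

# Mixture of 200 exponential draws and 100 normal draws.
x = np.concatenate([np.random.exponential(size=200), np.random.normal(size=100)])
plt.hist(x, 25, histtype='step');

n = len(x)
reps = 10000
# Each column is one bootstrap data set: n draws from x with replacement.
xb = np.random.choice(x, (n, reps), replace=True)
# Mean of every bootstrap data set.
mb = xb.mean(axis=0)
mb.sort()

# The 2.5th and 97.5th percentiles bound the 95% percentile interval.
np.percentile(mb, [2.5, 97.5])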

Use cases

Below are some situations in which to consider bootstrap resampling:

The data set is small, with as few as 30 data points,
The distribution is not known,
A smoother distribution needs to be generated.
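As a rough illustration of the first two situations, the sketch below takes only 30 points from a skewed distribution that we pretend not to know and builds a 95% percentile interval for the 90th percentile of the data. The lognormal generator, the seed, and the choice of statistic are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical small sample: 30 points from a distribution we pretend not to know.
x = rng.lognormal(mean=0.0, sigma=0.75, size=30)
n, reps = len(x), 10000

# Each column is one bootstrap data set of n draws with replacement.
xb = rng.choice(x, size=(n, reps), replace=True)

# 90th percentile of every bootstrap data set.
p90 = np.percentile(xb, 90, axis=0)

# 95% percentile interval for that statistic.
ci_low, ci_high = np.percentile(p90, [2.5, 97.5])
print(f"90th percentile = {np.percentile(x, 90):.2f}, "
      f"95% interval = [{ci_low:.2f}, {ci_high:.2f}]")
```

With so few points the interval will be wide, and statistics near the tails are estimated less reliably than central ones, so the result is best read as a rough guide.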

Hope this helps someone with limited data, someday :)

Ref

The Bootstrap re-Sampling | True Geometry’s Blog [Original Article]
https://people.duke.edu/~ccc14/sta-663/ResamplingAndMonteCarloSimulations.html
https://towardsdatascience.com/bootstrap-resampling-2b453bb036ec
An Introduction to Statistical Learning by Gareth James et al.

Written by Dr. Manoj Kumar Yadav
