The Bootstrap re-Sampling

Dr. Manoj Kumar Yadav
3 min read · Dec 4, 2022


The Bootstrap

The bootstrap is an extremely powerful statistical tool for cases where little data is available, that is, where the estimate of a statistic is questionable because the sample is small. Its elegance lies in the fact that it can be easily applied to a wide variety of statistical learning methods, such as a linear regression fit.

Each bootstrap data set contains n observations, sampled with replacement from the original data set of n observations. Each bootstrap data set is used to obtain an estimate of the statistic of interest, and the spread of those estimates indicates how uncertain the statistic is. This is as simple as it gets. However, it takes time to realize its full impact.
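The procedure described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the helper name `bootstrap_estimates`, the ten data values, and the choice of the median as the statistic are all assumptions made for the example.

```python
import numpy as np

def bootstrap_estimates(data, statistic, reps=1000, seed=0):
    """Return `reps` estimates of `statistic`, one per bootstrap data set.

    Hypothetical helper written for this example.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    estimates = np.empty(reps)
    for i in range(reps):
        # One bootstrap data set: n observations drawn with replacement.
        sample = rng.choice(data, size=n, replace=True)
        estimates[i] = statistic(sample)
    return estimates

# Illustrative values only (not taken from the article).
data = np.array([2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.3])
medians = bootstrap_estimates(data, np.median, reps=2000)
print("bootstrap standard error of the median:", medians.std(ddof=1))
```

Any function that maps a sample to a single number can be passed as `statistic`, which is what makes the approach easy to reuse.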

Using bootstrap resampling to re-populate a data set yields many resampled data sets, from which one can build the distribution of a statistic and fit predefined curves.
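As one possible illustration of curve fitting on resampled data, the sketch below bootstraps the slope of a straight-line fit. The synthetic data (a line with slope 2 plus noise, 30 points), the number of repetitions, and the use of `np.polyfit` are assumptions chosen for the example, not something prescribed by the method.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data, assumed for illustration: y = 2x + 1 plus noise, 30 points.
x = rng.uniform(0, 10, size=30)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=30)

reps = 2000
n = len(x)
slopes = np.empty(reps)
for i in range(reps):
    # Resample (x, y) pairs: row indices drawn with replacement.
    idx = rng.integers(0, n, size=n)
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    slopes[i] = slope

# Percentile interval for the slope from the bootstrap distribution.
low, high = np.percentile(slopes, [2.5, 97.5])
print(f"slope 95% percentile interval: [{low:.2f}, {high:.2f}]")
```

Resampling whole (x, y) pairs keeps each observation intact, so the fit on every bootstrap data set is an ordinary fit on n points.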

Why should you trust the outcome?

If you look at the process by which the bootstrap data sets are created, as shown in the image below, you can see that it is equivalent to real-world sampling, with the original data set standing in for the population.

[Image: The Bootstrap re-Sampling | True Geometry’s Blog]

It can be argued that ideal sampling avoids replacement, but most real-world data sets are themselves probabilistic in nature. In short, one can trust the outcome because the bootstrap resampling process makes very few assumptions about the underlying distribution of the original data set. The comment section is open for disagreement with this observation :)
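One way to check this argument numerically is to compare the bootstrap standard error of the mean with the textbook estimate s / sqrt(n). The sketch below does that on skewed data; the exponential sample, the sample size, and the number of repetitions are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A skewed sample, assumed for illustration.
x = rng.exponential(size=200)
n = len(x)
reps = 5000

# Each row is one bootstrap data set of n observations drawn with replacement.
boot_means = rng.choice(x, size=(reps, n), replace=True).mean(axis=1)

bootstrap_se = boot_means.std(ddof=1)
analytic_se = x.std(ddof=1) / np.sqrt(n)
print("bootstrap SE:", bootstrap_se)
print("analytic SE: ", analytic_se)
```

The two numbers should come out close to each other, even though the bootstrap never used a formula for the standard error of the mean.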

Python

It would be unfair not to provide a working sample, so here is one taken from people.duke.edu:

import sys
import glob
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# The next two lines are IPython/Jupyter magics; omit them in a plain Python script
%matplotlib inline
%precision 4
plt.style.use('ggplot')

np.set_printoptions(formatter={'float': lambda x: '%.3f' % x})

# For example, what is the 95% confidence interval for
# the mean of this data set if you didn't know how it was generated?

x = np.concatenate([np.random.exponential(size=200), np.random.normal(size=100)])
plt.hist(x, 25, histtype='step');
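n = len(x)
reps = 10000
# Draw reps bootstrap samples of size n (np.random.choice samples with replacement by default)
xb = np.random.choice(x, (n, reps))
# Each column is one bootstrap sample; take its mean
mb = xb.mean(axis=0)
mb.sort()

# The 2.5th and 97.5th percentiles of the bootstrap means give the 95% interval
np.percentile(mb, [2.5, 97.5])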

Use cases

Below are some situations in which bootstrap resampling is worth considering:

  • The data set is small, with as few as 30 data points.
  • The underlying distribution is not known.
  • A smoother distribution of the statistic is needed.

Hope this helps someone with limited data, someday :)

