Confidence Interval
Interval Estimation
A point estimator is not expected to provide the exact value of the population parameter. An interval estimate
is often computed by adding and subtracting a value, called the margin of error,
to and from the point estimate. Now that is a complicated technical, statistical, scientific statement taken from the book [1].
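For the mean of a sample, one common way to write that statement down is the t-based interval below. This is a sketch under the assumption that the data are roughly normal or the sample is large; the symbols (sample mean \bar{x}, sample standard deviation s, sample size n, confidence level 1 - \alpha) are introduced here for illustration and are not taken from [1].

```latex
\bar{x} \;\pm\; \underbrace{t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}}_{\text{margin of error}}
```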
Calculating the CI with Python libraries is tricky, because the underlying assumptions are often not explained by the various examples found on the internet. For instance, one of the top search results leads to an article [2] that simply goes ahead and uses the percent point function t.ppf(q, df, loc=0, scale=1) [5], assuming the distribution to be normal. Digging further into the library, we see that there is also a non-central Student's t continuous random variable, which has its own nct.ppf(q, df, nc, loc=0, scale=1) [4] with an extra non-centrality parameter nc. So, what this really means is that the library functions are built on certain assumptions about the distribution of the random variable. One has to be really careful in choosing among them in practical scenarios.
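To make the point about assumptions concrete, here is a minimal sketch that compares the two percent point functions. The values of q, df and the non-centrality nc are arbitrary choices for illustration, not taken from [2].

```python
from scipy.stats import t, nct

q = 0.975   # cumulative probability (illustrative choice)
df = 20     # degrees of freedom (illustrative choice)

# Central Student's t: the usual choice for a CI of the mean
central = t.ppf(q, df)

# Non-central t with nc = 0 should reduce to the central t
noncentral_zero = nct.ppf(q, df, 0)

# A non-zero nc shifts the distribution, so the quantile changes
noncentral_shifted = nct.ppf(q, df, 1.5)

print("t.ppf            :", central)
print("nct.ppf, nc = 0  :", noncentral_zero)
print("nct.ppf, nc = 1.5:", noncentral_shifted)
```

Picking the wrong distribution here silently changes the critical value, and with it the width of the interval.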
Before reaching for any of these libraries, one must know some of the functions that are built on top of the geometry created by projecting the data onto 2D space.
Percent Point Function (ppf)
Returns the value x of the variable at which the cumulative distribution function (cdf) equals a given probability. Thus, given the value cdf(x) for some x, ppf returns x itself, and therefore operates as the inverse of cdf.
Cumulative Distribution Function (cdf)
Given a value x0, cdf returns the cumulative probability that x takes a value less than or equal to x0, or in other words “lies in the interval (-inf, x0]”.
Probability Density Function (pdf)
For a continuous variable, pdf returns the probability density at a specific value x, not a probability. The probability that x lies within a range of values is the area under the pdf over that range.
Survival Function (sf)
Returns the probability that the variate x takes a value greater than a specific value x0, that is, 1 - cdf(x0).
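The relationships between these four functions can be checked with a short sketch. It uses Student's t from scipy.stats; the degrees of freedom and the point x0 are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy.stats import t

df = 10    # degrees of freedom (illustrative choice)
x0 = 1.5   # point at which the functions are evaluated (illustrative choice)

p = t.cdf(x0, df)        # P(X <= x0)
x_back = t.ppf(p, df)    # ppf inverts cdf, so this recovers x0
tail = t.sf(x0, df)      # P(X > x0), equal to 1 - cdf(x0)
density = t.pdf(x0, df)  # a density, not a probability

# Probability that X lies in a range: difference of two cdf values
prob_range = t.cdf(1.6, df) - t.cdf(1.4, df)

print("cdf(x0)         :", p)
print("ppf(cdf(x0))    :", x_back)
print("sf(x0)          :", tail)
print("pdf(x0)         :", density)
print("P(1.4 < X <= 1.6):", prob_range)
```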
Why should you trust the outcome?
Probability, statistics and distributions should not be expected to hold for every single reading. These concepts work at scale, so we always need to consider the volume of observations, not the individuals. This is also a reason why many ideas using these concepts are generally discarded if the group making the decision is unaware of the nature of probability and distributions. The Confidence Interval can be a great tool to make a case for using data models and data-driven decisions. The CI essentially gives the group a way to judge how far an estimate drawn from the data can be trusted. It should also be understood that a higher confidence level results in a wider, less precise interval, and a lower confidence level does the reverse. So, using a CI with a very high confidence level may not always be a solution to the problem.
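The idea that a confidence level only makes sense over many repetitions can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is normal, and the true mean, spread, sample size and number of trials are made-up values for illustration.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)

# Hypothetical population and experiment settings
true_mean, n, trials, confidence = 5.0, 30, 2000, 0.95

# Two-sided critical value for the chosen confidence level
t_crit = t.ppf(1 - (1 - confidence) / 2, n - 1)

# Draw many independent samples and build one interval per sample
samples = rng.normal(loc=true_mean, scale=2.0, size=(trials, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

# An interval "covers" when the true mean lies within mean +/- margin
covered = np.abs(means - true_mean) <= t_crit * sems
coverage = covered.mean()

print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

Any single interval either contains the true mean or it does not; the confidence level describes roughly how often the procedure succeeds across many samples.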
Python
I understand it would be unfair not to provide a working sample. So, here is one taken from [2] and then modified for this example.
import sys
import glob
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
%precision 4
plt.style.use('ggplot')
np.set_printoptions(formatter={'float': lambda x: '%.3f' % x})
x = np.concatenate([np.random.exponential(size=200), np.random.normal(size=100)])
plt.figure(figsize = (15, 5))
plt.hist(x, 25, histtype='step');
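mean = x.mean()
print(mean)
# Sample standard deviation (ddof=1), consistent with n - 1 degrees of freedom
stdev = x.std(ddof=1)
print(stdev)
degOfFreedom = len(x) - 1
print(degOfFreedom)
confidence = 0.99
from scipy.stats import t
# Two-sided critical value of Student's t for the chosen confidence level
t_crit = np.abs(t.ppf((1 - confidence) / 2, degOfFreedom))
print(t_crit)
# Interval bounds: mean minus/plus t_crit times the standard error of the mean
CI_MIN = mean - stdev * t_crit / np.sqrt(len(x))
CI_MAX = mean + stdev * t_crit / np.sqrt(len(x))
print(CI_MIN, CI_MAX)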
plt.figure(figsize = (15, 5))
plt.axvline(x = CI_MIN, color = 'b', label = 'CI_MIN')
plt.axvline(x = mean, color = 'g', label = 'mean')
plt.axvline(x = CI_MAX, color = 'r', label = 'CI_MAX')
plt.legend(bbox_to_anchor = (1.0, 1), loc = 'upper left')
plt.hist(x, 25, histtype='step');
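As a cross-check, the hand-computed bounds can be compared with scipy's own t.interval. This is a minimal, self-contained sketch and not part of the sample from [2]: it regenerates similar data with a fixed seed (an assumption made for reproducibility) and uses the sample standard deviation with ddof=1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = np.concatenate([rng.exponential(size=200), rng.normal(size=100)])

n = len(x)
mean = x.mean()
sem = x.std(ddof=1) / np.sqrt(n)  # standard error of the mean
confidence = 0.99

# Manual interval: mean +/- t_crit * standard error
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, n - 1)
manual = (mean - t_crit * sem, mean + t_crit * sem)

# Same interval from scipy; the confidence level is passed positionally
builtin = stats.t.interval(confidence, n - 1, loc=mean, scale=sem)

print("manual :", manual)
print("builtin:", builtin)
```

Both routes rest on the same assumption, namely that the sample mean is well described by a Student's t distribution.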
Discussion
We can clearly see that in this example the mean sits at the centre of the confidence interval, as it must by construction, at a confidence of 99%. Now, let’s see what happens when we reduce the confidence to 60%:
What is clearly visible is that the confidence band shrinks as the confidence level drops, so its bounds sit closer to the mean. This is an amazing outcome, and understanding how and when to use a wider or a narrower band is usually the key to success. So, if you wish to price a product and you know the various price points available, can you pick a price that will give you a definite sale?
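The effect of the confidence level on the band can be reproduced numerically. The helper mean_ci below is a hypothetical function written for this sketch, and the data are regenerated with a fixed seed, so the exact numbers will differ from the plots above.

```python
import numpy as np
from scipy.stats import t

def mean_ci(sample, confidence):
    """Return (low, high) of a t-based confidence interval for the mean."""
    n = len(sample)
    sem = sample.std(ddof=1) / np.sqrt(n)
    t_crit = t.ppf(1 - (1 - confidence) / 2, n - 1)
    return sample.mean() - t_crit * sem, sample.mean() + t_crit * sem

rng = np.random.default_rng(42)  # fixed seed, chosen only for reproducibility
x = np.concatenate([rng.exponential(size=200), rng.normal(size=100)])

lo99, hi99 = mean_ci(x, 0.99)
lo60, hi60 = mean_ci(x, 0.60)

print(f"99% CI: ({lo99:.3f}, {hi99:.3f}), width {hi99 - lo99:.3f}")
print(f"60% CI: ({lo60:.3f}, {hi60:.3f}), width {hi60 - lo60:.3f}")
```

The 60% interval is nested inside the 99% interval: it is narrower, but it will miss the true mean far more often.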
Originally published at blog.truegeometry.com
Ref
- [1] Statistics for Business and Economics, by Anderson, Sweeney, Williams, Camm, Cochran
- [2] https://towardsdatascience.com/how-to-calculate-confidence-intervals-in-python-a8625a48e62b
- [3] http://pytolearn.csd.auth.gr/d1-hyptest/11/distros.html
- [4] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.nct.html#scipy.stats.nct
- [5] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html