Bootstrap Confidence Interval

Author

Jumbong Junior

Introduction

When working with data, it very important to assess the uncertainty associated with interest parameter. Confidence interval is one of the most popular method to estimate the uncertainty of the parameter. Let’s define it :

Given a sample \(\mathcal{X} = \{X_1, X_2, \ldots, X_n\}\) of size \(n\) from a distribution \(P\).

Given \(\theta = \theta(P)\) \(\in \Theta\) a real-valued parameter.

Let \(\alpha \in (0, 1)\) be a confidence level (commonly \(\alpha = 0.05\) for a 95% confidence interval).

A confidence interval for \(\theta\) at level \(1-\alpha\), denoted as \(IC_{1-\alpha}(\theta)\), is a random subset :

\[ \mathcal{I}_n(X) = [T_{\inf}(X), T_{\sup}(X)] \]

of Θ such that:

\[ 1 - \alpha = \mathbb{P}(\theta \in \mathcal{I}_n(X)). \]

There are serveral methods to estimate the bootstrap confidence interval. In this document, three methods will be presented :

The normal method
The pivotal method
The percentile method

Theoritical Background

Let \(\theta_n = \theta(P_n)\) be a plug-in estimator of \(\theta\).

Let \(\theta_{n,1}^*, ..., \theta_{n,B}^*\) be the bootstrap replications of \(\theta_n\).

Let \(q_{\alpha/2}\) and \(q_{1-\alpha/2}\) be the \(\alpha/2\) and \(1-\alpha/2\) quantiles of the bootstrap distribution of the sample (\(\theta_{n,1}^*, ..., \theta_{n,B}^*\)).

The Normal Interval

This is the simplest method to construct the confidence interval. This method assumes that the sampling distribution of the estimator \(\theta_n = \theta\) is approximately normal. The standard error of the estimator is estimated by the bootstrap standard error. The normal interval is given by : \[ \theta_n \pm z_{\alpha/2} \hat{se}_{boot} \]

where :

\(\hat{se}_{boot}\) is the bootsrap estimate of the standard error.
\(z_{\alpha/2}\) is the \(\alpha/2\) quantile of the standard normal distribution. Many consider \(z_{\alpha/2} = 1.96\) for a 95% confidence interval.

This interval is not accurate unless the distribution of \(T_n\) is close to the normal distribution.

The Pivotal Intervals

The pivotal method uses a pivotal statistic \(R_n(\mathcal{X}, \theta)\) that is a function of the data and the parameter.

Example : Let \(\theta\) be a location parameter for \(\theta_n(\mathcal{X})\).

\(R_n = \theta_n - \theta\) pivot
\(R_n = \frac{\theta_n - \theta}{\hat{se}_{boot}}\) pivot
\(R_n = \beta(\theta_n) - \beta(\theta)\) variance-stabilizing pivot

To construct this interval, the first step is to determine a pivotal statistic \(R_n(\mathcal{X}, \theta).\) Next, use bootstrap sampling to compute the quantiles \(q_{\alpha/2}\) and \(q_{1-\alpha/2}\) so that : \[ P_p(R_n(\mathcal{X}, \theta) \in [q_{\alpha/2}, q_{1-\alpha/2}]) = 1 - \alpha, \]

Finally, the equation is inverted to obtain the confidence interval : \[ R_n(\mathcal{X}, \theta) \in [q_{\alpha/2}, q_{1-\alpha/2}] \Leftrightarrow \theta \in [T_{inf}(\mathcal{X}), T_{sup}(\mathcal{X})] \]

The algorithm below shows how to construct the confidence interval of the pivot :

Pivotal Confidence Interval bootstrap

Variable:

\(B\) : A large number of bootstrap samples.

Begin:

For \(b = 1, 2, \ldots, B\):

Generate a bootstrap sample \(\mathcal{X}^*_b\) from \(\mathcal{X}\).
Compute \(r_n^{*b} = R_n(\mathcal{X}^{*b})\), the bootstrap replication of \(R_n(\mathcal{X})\).

End for.

Return the quantiles \(q_{\alpha/2}\) and \(q_{1-\alpha/2}\) of the bootstrap distribution of \(r_n^{*b}\). They are also the order statistics \(\lceil B \alpha \rceil\) and \(\lceil B (1-\alpha) \rceil\) of \(r_n^{*b}\).

When the pivot is \(R_n = \theta_n - \theta\), the bootstrap pivotal confidence interval is :

\[ C_n = [2\theta_n-q_{\alpha/2}, 2\theta_n-q_{1-\alpha/2}] \]

where \(q_{\alpha/2}\) and \(q_{1-\alpha/2}\) are the \(\alpha/2\) and \(1-\alpha/2\) quantiles of the bootstrap distribution of the sample (\(\theta_{n,1}^*, ..., \theta_{n,B}^*\)).

The percentile intervals

The percentile method is the conceptually straightforward method to construct the confidence interval. The quantiles distribution of the bootstrap replications of the estimator \(\theta_n\) are used to construct the confidence interval :

\[ C_n = [\theta_{\lceil B \alpha/2 \rceil}, \theta_{\lceil B (1-\alpha/2) \rceil}] \]

where \(\theta_{\lceil B \alpha/2 \rceil}\) and \(\theta_{\lceil B (1-\alpha/2) \rceil}\) are the \(\lceil B \alpha/2 \rceil\) and \(\lceil B (1-\alpha/2) \rceil\) order statistics of the bootstrap distribution of the sample (\(\theta_{n,1}^*, ..., \theta_{n,B}^*\)).

The algorithm below shows how to construct the confidence interval of the pivot :

Percentile Confidence Interval bootstrap

Variable:

\(B\) : A large number of bootstrap samples.

Begin:

For \(b = 1, 2, \ldots, B\):

Generate a bootstrap sample \(\mathcal{X}^*_b\) from \(\mathcal{X}\).
Compute \(\theta^{*b} = \theta(\mathcal{X}^{*b})\), the bootstrap replication of \(\theta(\mathcal{X})\).

End for.

Return the quantiles \(\theta_{\lceil B \alpha \rceil}\) and \(\theta_{\lceil B (1-\alpha) \rceil}\) of the bootstrap distribution of \(\theta^{*b}\).

Application in python

Let’s illustrate these three methods with two samples X1 =94,38,23,197,99,16,141 and X2 = 52,10,40,104,50,27,146,31,46. We are interested in the difference between medians.

import numpy as np
import pandas as pd

np.random.seed(123)

X1 = np.array([94,38,23,197,99,16,141])
X2 = np.array([52,10,40,104,50,27,146,31,46])
alpha = 0.05
n1 = len(X1)
n2 = len(X2)

# Compute the difference of the median
theta_n = np.median(X1) - np.median(X2)
B = 1000
tboot = np.empty(B)

for b in range(B):
    X1b = np.random.choice(X1, n1, replace=True)
    X2b = np.random.choice(X2, n2, replace=True)
    tboot[b] = np.median(X1b) - np.median(X2b)

#######################################################################
# Normal interval
#######################################################################
seboot = np.std(tboot)

Normal_inf = round(theta_n - 1.96*seboot,2)      
Normal_sup = round(theta_n + 1.96*seboot,2)
Normal = [Normal_inf, Normal_sup]

#######################################################################
# Pivot interval : theta - theta_n
#######################################################################
pivot_inf = np.quantile(tboot, 0.025)
pivot_sup = np.quantile(tboot, 0.975)
Pivot = [round(2*theta_n - pivot_sup,2), round(2*theta_n - pivot_inf,2)]

#######################################################################
# Percentile interval
#######################################################################
percentile_inf = np.quantile(tboot,0.025)
percentile_sup = np.quantile(tboot,0.975)
Percentile = [round(percentile_inf,2), round(percentile_sup,2)]

#######################################################################
# Write output in a table with the first column, the Method and the second
# column the 95% Interval
#######################################################################
output = pd.DataFrame({'Method': ['Normal', 'Pivot', 'Percentile'],
                       '95% Interval': [Normal, Pivot, Percentile]})
print(f"The point estimate of the difference of the median is {theta_n}")
print(f"The bootstrap standard error is {seboot}")
print(output)

The point estimate of the difference of the median is 48.0
The bootstrap standard error is 39.68365527518855
       Method      95% Interval
0      Normal  [-29.78, 125.78]
1       Pivot     [-5.0, 125.0]
2  Percentile    [-29.0, 101.0]

The point estimate is 48.0, the bootstrap standard error is 39.68 and the resulting approximate 95 percent confidence intervals are as follows:

Conclusion

All the confidence intervals are approximate. All three intervals have the same level of accuracy. They are are more accurate bootstrap confidence intervals but they are more complicated and computationally intensive.

References

Wasserman, Larry. 2013. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.