Bootstrapping in R: A Mathematical and Computational Approach

Introduction

Bootstrapping is a resampling method that enables the estimation of the sampling distribution of a statistic by drawing repeated samples with replacement from the observed data. In simple terms, imagine you have a jar filled with cookies. Instead of baking a new jar of cookies to see how the average size might vary, you repeatedly pick a handful of cookies from the same jar (with replacement), calculate the average size for each handful, and then study how these averages vary. This variation gives you an insight into the reliability and variability of your original measurement.

This approach contrasts sharply with traditional parametric inference. In classical parametric methods, analysts assume a specific probability distribution (often the normal distribution) for the data and use analytical formulas based on the central limit theorem to compute standard errors and confidence intervals. These methods therefore rest on strong assumptions about the underlying population distribution. Bootstrapping, by contrast, makes far weaker distributional assumptions and leverages the observed data directly, making it especially useful when the underlying distribution is unknown or when analytical approximations are unreliable.

Bootstrapping has become indispensable in modern statistics and econometrics due to its flexibility and minimal reliance on parametric assumptions. In this blog, we provide a rigorous and mathematically detailed treatment of the bootstrapping procedure as implemented in R, supplementing the analysis with advanced theoretical concepts such as bias correction, asymptotic properties, and various econometric applications.

Theoretical Framework

Suppose we have a sample $$ X = \{ x_1, x_2, \dots, x_n \} $$ drawn from an unknown population distribution $F$. Let $\theta = \theta(F)$ be a parameter of interest (e.g., the mean, variance, or a regression coefficient). The bootstrap procedure involves generating $B$ resamples $X^{*b}$ (for $b = 1, 2, \dots, B$), each of size $n$, by sampling with replacement from $X$. For each bootstrap sample, we compute the statistic $\hat{\theta}^{*b}$. The empirical distribution of $$ \{ \hat{\theta}^{*1}, \hat{\theta}^{*2}, \dots, \hat{\theta}^{*B} \} $$ serves as an approximation to the sampling distribution of $\hat{\theta}$.
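
To make the generic procedure concrete, the following minimal R sketch resamples a vector $B$ times and collects the bootstrap replicates of a statistic, here the median. The data are simulated, so the numbers are purely illustrative.

set.seed(123)                          # illustrative simulated data
x <- rexp(50, rate = 1)                # stand-in for the observed sample
B <- 2000                              # number of bootstrap resamples

# theta_hat^{*b} for b = 1, ..., B: here theta is the median
theta_star <- replicate(B, median(sample(x, size = length(x), replace = TRUE)))

# The empirical distribution of theta_star approximates the sampling
# distribution of the sample median
hist(theta_star, main = "Bootstrap distribution of the median")
sd(theta_star)                         # bootstrap estimate of the standard error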

Mathematical Definitions

  • Sample Mean: The sample mean is defined as $$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. $$

  • Sample Variance: The sample variance is given by $$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. $$

  • Bootstrap Statistic: For a given statistic (for example, the t-statistic), the bootstrap version is defined as

$$Q^* = \sqrt{n} \frac{\bar{x}^* - \bar{x}}{s^*}$$

where $\bar{x}^* $ and $s^* $ denote the sample mean and standard deviation computed from a bootstrap sample.

Hypothesis Testing via Bootstrapping

Consider testing the hypothesis for the population mean:

  • Null Hypothesis: $H_0: \mu \leq \mu_0$
  • Alternative Hypothesis: $H_1: \mu > \mu_0$

For the original sample, the test statistic is computed as $$ Q = \sqrt{n} \frac{\bar{x} - \mu_0}{s}. $$ We generate $B$ bootstrap replicates $Q^{*b}$ from the resampled data and compute the bootstrap p-value as $$ p = \frac{1}{B} \sum_{b=1}^{B} I\left(Q^{*b} > Q\right), $$ where $I(\cdot)$ is the indicator function.

Furthermore, a $(1-\alpha)\times 100\%$ confidence interval for $\mu$ can be constructed from the quantiles of the bootstrap distribution of $Q^*$. Let $q_{\alpha/2}$ and $q_{1-\alpha/2}$ denote its lower and upper quantiles, respectively. Then, an approximate (bootstrap-t) confidence interval is given by $$ \left[\bar{x} - q_{1-\alpha/2} \frac{s}{\sqrt{n}},\; \bar{x} - q_{\alpha/2} \frac{s}{\sqrt{n}}\right]. $$
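
As a sketch of how these quantiles translate into an interval, the snippet below computes the bootstrap-t confidence interval for the mean. The data are simulated so all values are placeholders; the same $Q^{*b}$ draws produced by the full implementation in the next section could be reused in exactly the same way.

set.seed(42)
x <- rnorm(30, mean = 10, sd = 2)      # illustrative sample
n <- length(x); xbar <- mean(x); s <- sd(x)
B <- 2000
Q_star <- replicate(B, {
  xs <- sample(x, size = n, replace = TRUE)
  sqrt(n) * (mean(xs) - xbar) / sd(xs) # studentized bootstrap statistic
})
alpha <- 0.05
q <- quantile(Q_star, probs = c(alpha / 2, 1 - alpha / 2))

# Bootstrap-t interval: [xbar - q_{1-alpha/2} s/sqrt(n), xbar - q_{alpha/2} s/sqrt(n)]
ci <- c(xbar - q[2] * s / sqrt(n), xbar - q[1] * s / sqrt(n))
ci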

R Implementation

The R implementation of the bootstrap procedure follows these steps:

  1. Data Preparation:
    Set a seed using set.seed() for reproducibility and load the data (e.g., annual temperature data).

  2. Calculation of Original Statistics:
    Compute the sample mean $\bar{x}$, the sample standard deviation $s$, and the test statistic $$ Q = \sqrt{n} \frac{\bar{x} - \mu_0}{s}. $$

  3. Bootstrap Resampling:
    For $b = 1, 2, \dots, B$, generate a bootstrap sample $X^{*b}$ by sampling with replacement from $X$. For each bootstrap sample, compute:

    • The bootstrap sample mean $\bar{x}^{*b}$,
    • The bootstrap sample standard deviation $s^{*b}$,
    • The bootstrap statistic $$ Q^{*b} = \sqrt{n} \frac{\bar{x}^{*b} - \bar{x}}{s^{*b}}. $$
  4. Aggregation and Analysis:
    Construct the bootstrap distribution from $(Q^{*1}, Q^{*2}, \dots, Q^{*B})$, determine the bootstrap critical value $c_\alpha^*$ as the $(1-\alpha)$ quantile of this distribution, and calculate the bootstrap p-value as described above.

The following R code implements these steps:

set.seed(1)  # for reproducibility

# Read the data and extract the numeric column of interest
# (the first column of data.csv is assumed to hold the values)
data <- read.csv("data.csv")[[1]]

n    <- length(data)
xbar <- mean(data)   # sample mean
s    <- sd(data)     # sample standard deviation
mu0  <- xbar         # hypothesized mean (placeholder: replace with the value under H0)

# Calculate the original test statistic Q = sqrt(n) * (xbar - mu0) / s
Q <- sqrt(n) * (xbar - mu0) / s

B <- 1000            # number of bootstrap replications
Q_star <- numeric(B)

for (b in 1:B) {
  # Resample with replacement and recompute the studentized statistic,
  # centered at the observed sample mean xbar
  X_star    <- sample(data, size = n, replace = TRUE)
  xbar_star <- mean(X_star)
  s_star    <- sd(X_star)
  Q_star[b] <- sqrt(n) * (xbar_star - xbar) / s_star
}

# Bootstrap critical value (the (1 - alpha) quantile) and bootstrap p-value
alpha        <- 0.05
c_alpha_star <- quantile(Q_star, probs = 1 - alpha)
p_value      <- mean(Q_star > Q)
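
Continuing with the objects defined above, the bootstrap distribution can be inspected visually and the bootstrap p-value compared with its parametric counterpart. (Note that with mu0 set equal to the sample mean, as in the placeholder above, both p-values will be close to 0.5; substitute a genuine hypothesized value in practice.)

# Visualize the bootstrap distribution of the studentized statistic
hist(Q_star, breaks = 30, main = "Bootstrap distribution of Q*")
abline(v = c_alpha_star, lty = 2)  # bootstrap critical value

# One-sided t-test as the parametric counterpart of the bootstrap p-value
t.test(data, mu = mu0, alternative = "greater")$p.value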

Advanced Topics in Bootstrapping

Consistency and Asymptotic Normality

Under standard regularity conditions, if
$$ \sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} N(0,\sigma^2), $$
then the bootstrap estimator satisfies
$$ \sqrt{n}(\hat{\theta}^* - \hat{\theta}) \xrightarrow{d} N(0,\sigma^2), $$
conditional on the observed data. This convergence in distribution justifies the use of bootstrapping for approximating the sampling distribution of $\hat{\theta}$.
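
A small simulation makes this convergence visible. The population is chosen to be normal with known $\sigma$ purely so that the limiting distribution is available for comparison; the bootstrap distribution of $\sqrt{n}(\bar{x}^* - \bar{x})$ should then resemble $N(0,\sigma^2)$.

set.seed(7)
sigma <- 3
x <- rnorm(200, mean = 5, sd = sigma)          # sample from a known population
n <- length(x); xbar <- mean(x)

# Bootstrap distribution of sqrt(n) * (xbar* - xbar)
Z_star <- replicate(5000, sqrt(n) * (mean(sample(x, size = n, replace = TRUE)) - xbar))

# Compare the bootstrap spread with the limiting standard deviation sigma
c(bootstrap_sd = sd(Z_star), limiting_sd = sigma)
qqnorm(Z_star); qqline(Z_star)                 # approximate normality of the bootstrap draws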

Bias Correction and Acceleration (BCa)

Bootstrapping can be employed not only for hypothesis testing but also for bias correction. The bias of an estimator $\hat{\theta}$ is given by
$$ \text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta. $$
The bootstrap estimate of the bias is
$$ \widehat{\text{Bias}} = \frac{1}{B}\sum_{b=1}^{B} \hat{\theta}^{*b} - \hat{\theta}. $$
The BCa method further adjusts for both bias and skewness in the bootstrap distribution, leading to more accurate confidence intervals.
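
In practice these corrections are rarely coded by hand: the boot package, distributed with standard R installations, provides both the plain bias estimate and BCa intervals. A minimal sketch follows, using simulated skewed data so the numbers are illustrative only.

library(boot)

set.seed(99)
x <- rexp(40, rate = 0.5)                      # skewed illustrative sample

# The statistic must accept the data and a vector of resampled indices
mean_stat <- function(d, idx) mean(d[idx])

boot_out <- boot(data = x, statistic = mean_stat, R = 2000)

# Bootstrap bias estimate: mean of theta*_b minus theta_hat
mean(boot_out$t) - boot_out$t0

# Bias-corrected and accelerated (BCa) confidence interval
boot.ci(boot_out, type = "bca")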

Econometric Applications

In econometrics, bootstrapping is extensively used for:

  • Estimating standard errors of regression coefficients,
  • Constructing confidence intervals for complex estimators,
  • Conducting robust hypothesis tests when traditional parametric assumptions fail.

For example, in linear regression, bootstrapping allows for the construction of the sampling distribution of the Ordinary Least Squares (OLS) estimators, providing a more reliable inference framework in the presence of heteroskedasticity or non-normal error distributions.
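
As a sketch of the regression case, the pairs (case) bootstrap below resamples whole observations $(y_i, x_i)$ and re-estimates the OLS coefficients; the standard deviation of the bootstrap replicates then serves as a standard error that does not rely on homoskedastic errors. The data, variable names, and values are simulated and purely illustrative.

set.seed(2024)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + x)        # heteroskedastic errors
df <- data.frame(x = x, y = y)

B <- 2000
beta_star <- matrix(NA_real_, nrow = B, ncol = 2)
for (b in 1:B) {
  idx <- sample(n, size = n, replace = TRUE)   # resample (y_i, x_i) pairs
  beta_star[b, ] <- coef(lm(y ~ x, data = df[idx, ]))
}

# Bootstrap standard errors of the OLS intercept and slope
apply(beta_star, 2, sd)

# Percentile confidence interval for the slope
quantile(beta_star[, 2], probs = c(0.025, 0.975))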

Discussion

The bootstrap method is a versatile and powerful tool for statistical inference. Its primary strength lies in its minimal assumptions about the underlying population distribution, making it suitable for a wide range of applications—from simple descriptive statistics to complex econometric models. The ability to correct for bias and adjust for distributional asymmetries further enhances its utility in empirical research. Moreover, the R implementation provides an accessible and reproducible approach to applying bootstrap techniques in practice.

Conclusion

This blog post has presented a comprehensive, mathematically rigorous analysis of the bootstrapping procedure as implemented in R. We have detailed the theoretical foundations, provided a step-by-step R implementation, and discussed advanced topics such as bias correction and the asymptotic properties of bootstrap estimators. The integration of these elements highlights the robustness and flexibility of bootstrapping as a tool for statistical inference and econometric analysis.
