Applied Causal Analysis

Large Sample Theory: LLN and CLT

Author: Mauricio Olivares

Published: May 5, 2026

0.1 Goals

This lecture illustrates two foundational results of large sample theory in the context of the OLS estimator:

  1. Law of Large Numbers (LLN): As the sample size grows, the OLS estimator concentrates around the true parameter value.
  2. Central Limit Theorem (CLT): Once recentered at the true parameter value and scaled by its standard error, the OLS estimator converges in distribution to a standard normal, regardless of the shape of the error distribution, provided the errors have finite variance.

The key message is that these results hold across very different data generating processes (DGPs). We will verify this by running Monte Carlo simulations under three DGPs that differ in how errors are distributed.
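Stated a little more formally, the two results above can be sketched as follows. The notation is an assumption introduced here for illustration: \(\hat{\beta}_n\) is the OLS estimate from a sample of size \(n\), \(\beta\) is the true value, and \(\widehat{\mathrm{se}}(\hat{\beta}_n)\) is a standard error that is consistent under the DGP at hand (for instance, a heteroskedasticity-robust one when the error variance depends on covariates). Under i.i.d. sampling and finite second moments:

\[\hat{\beta}_n \xrightarrow{\ p\ } \beta \quad \text{(LLN)}, \qquad \frac{\hat{\beta}_n - \beta}{\widehat{\mathrm{se}}(\hat{\beta}_n)} \xrightarrow{\ d\ } \mathcal{N}(0, 1) \quad \text{(CLT)}\]

The first statement concerns where \(\hat{\beta}_n\) ends up as \(n\) grows. The second concerns the shape of its fluctuations around \(\beta\) at a given \(n\).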

0.2 Monte Carlo Design

Rather than working with a single dataset, we simulate many datasets of varying sizes and track how the OLS estimate of the treatment coefficient \(\hat{\beta}\) behaves across those repetitions. This gives us an empirical picture of the sampling distribution of \(\hat{\beta}\) — something we cannot observe in practice but can study through simulation.

All three DGPs share the same structural equation:

\[Y_i = \beta \cdot D_i + X_i'\gamma + \varepsilon_i, \quad \beta = 1\]

where \(D_i\) is a binary treatment indicator (assigned with probability 0.5), \(X_i\) includes a constant, a standard normal covariate, and a uniform covariate, and \(\gamma = (0.5,\ 2,\ -1)'\). The DGPs differ only in the distribution of \(\varepsilon_i\):

| DGP | Error distribution | Key feature |
|---|---|---|
| Model 1 | \(\varepsilon_i \sim \mathcal{N}(0, 1)\) | Homoskedastic errors |
| Model 2 | \(\varepsilon_i \sim \mathcal{N}(0,\ e^{0.4 X_{2i}})\) | Heteroskedastic errors |
| Model 3 | \(\varepsilon_i \sim t_{2.01}\) | Heavy-tailed errors |
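To make the design concrete, here is a minimal sketch of how one dataset could be drawn from each DGP. It is not the code in the shared function library. The helper name `simulate_dgp` is hypothetical, and the sketch makes three assumptions the text does not confirm: \(X_{2i}\) is the standard normal covariate, the second argument of \(\mathcal{N}(\cdot, \cdot)\) is a variance, and the uniform covariate lives on \([0, 1]\).

```r
# Hypothetical helper for illustration; not taken from R/ACA_Function_Library.r.
simulate_dgp <- function(n, model = 1, beta = 1, gamma = c(0.5, 2, -1)) {
  d  <- rbinom(n, size = 1, prob = 0.5)  # binary treatment, P(D = 1) = 0.5
  x2 <- rnorm(n)                         # standard normal covariate (assumed to be X_2)
  x3 <- runif(n)                         # uniform covariate (assumed on [0, 1])

  # Error term: the only part that differs across the three DGPs
  eps <- switch(as.character(model),
    "1" = rnorm(n),                            # homoskedastic N(0, 1)
    "2" = rnorm(n, sd = sqrt(exp(0.4 * x2))),  # variance exp(0.4 * x2) (assumed variance, not sd)
    "3" = rt(n, df = 2.01),                    # heavy-tailed t with 2.01 degrees of freedom
    stop("model must be 1, 2, or 3")
  )

  # Structural equation: gamma[1] multiplies the constant
  y <- beta * d + gamma[1] + gamma[2] * x2 + gamma[3] * x3 + eps
  data.frame(y = y, d = d, x2 = x2, x3 = x3)
}

# Usage: one dataset from Model 2 and the OLS estimate of the treatment coefficient
set.seed(1)
dat      <- simulate_dgp(n = 500, model = 2)
fit      <- lm(y ~ d + x2 + x3, data = dat)
beta_hat <- unname(coef(fit)["d"])
```

Keeping the regressors and coefficients identical across models means any difference in the behavior of `beta_hat` can be attributed to the error distribution alone.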

0.3 Setup

Run this first. It loads the required packages and the shared function library, fixes the random seed, and sets the simulation parameters.

library(tidyverse)
library(kableExtra)

source("R/ACA_Function_Library.r")

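# Fix the seed for reproducibility; L'Ecuyer-CMRG is the generator R's
# parallel package uses for independent random number streams
set.seed(42, kind = "L'Ecuyer-CMRG")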

# Simulation parameters (shared across all chapters)
sample_sizes    <- c(50, 100, 1000)
num_simulations <- 1000
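With these parameters in place, the Monte Carlo loop might look roughly like the sketch below. It uses base R only and covers just the homoskedastic design (Model 1). The helper `one_draw` and the object `mc_summary` are hypothetical names, the uniform covariate is assumed to be on \([0, 1]\), and the shared function library may organize this differently.

```r
# Values repeated from the setup so the block runs on its own.
sample_sizes    <- c(50, 100, 1000)
num_simulations <- 1000
beta_true       <- 1

# Hypothetical helper: draw one Model 1 sample of size n and return the OLS
# estimate of beta together with its t-statistic centered at beta_true.
one_draw <- function(n) {
  d  <- rbinom(n, 1, 0.5)   # binary treatment, P(D = 1) = 0.5
  x2 <- rnorm(n)            # standard normal covariate
  x3 <- runif(n)            # uniform covariate (assumed on [0, 1])
  y  <- beta_true * d + 0.5 + 2 * x2 - x3 + rnorm(n)
  est <- summary(lm(y ~ d + x2 + x3))$coefficients["d", ]
  c(beta_hat = unname(est["Estimate"]),
    t_stat   = unname((est["Estimate"] - beta_true) / est["Std. Error"]))
}

set.seed(42)
results <- lapply(sample_sizes, function(n) {
  # One row per repetition, columns beta_hat and t_stat
  draws <- t(replicate(num_simulations, one_draw(n)))
  data.frame(n             = n,
             mean_beta_hat = mean(draws[, "beta_hat"]),
             sd_beta_hat   = sd(draws[, "beta_hat"]),
             sd_t_stat     = sd(draws[, "t_stat"]))
})
mc_summary <- do.call(rbind, results)
print(mc_summary)
```

If the LLN is at work, `sd_beta_hat` should shrink as `n` grows while `mean_beta_hat` stays near 1. If the CLT approximation is reasonable, `sd_t_stat` should sit close to 1 at every sample size.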