4 Details on Central Limit Theorem
Previously, we illustrated the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) for the estimator \(\hat{\beta}\) using Monte Carlo simulations. In particular, the LLN tells us that \(\hat{\beta}\) converges to \(\beta\) as \(n\) grows. The CLT goes further: it tells us the rate at which this happens and, more strikingly, the shape of the sampling distribution.
Let us now build more intuition in a simpler case, namely the sample mean (the intuition carries over to the estimator of the slope parameter, since \(\hat{\beta}\) is a smooth function of sample averages). Specifically, in class we stated that the sample mean based on an i.i.d. sample from a distribution with finite variance satisfies
\[\sqrt{n}\!\left(\mathbb{E}_n[Y] - \mathbb{E}[Y]\right) \xrightarrow{d} \mathcal{N}\!\left(0,\, \mathbb{V}[Y]\right)\]
In words: if we recenter the sample mean by subtracting the true population mean and rescale by \(\sqrt{n}\), the resulting quantity converges in distribution to a normal random variable with mean zero and variance \(\mathbb{V}[Y]\), no matter what the underlying distribution of \(Y\) is, provided it has finite variance.
The normality in the limit does not come from \(Y\) being normally distributed. In fact, we assumed nothing about the distribution of origin, only that the data generating process from which we draw \(Y_1, \ldots, Y_n\) has finite variance. The normal behavior emerges purely from the averaging itself. This is the mathematical foundation we will use to make approximate inference in this course.
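As a quick Monte Carlo sanity check of this statement, here is a minimal sketch (Python with NumPy and SciPy; the Exponential(1) population is an arbitrary skewed choice with \(\mathbb{E}[Y] = \mathbb{V}[Y] = 1\)) comparing the scaled, centered sample mean with its claimed normal limit:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps, n = 5_000, 1_000

# Heavily skewed population: Exponential(1), so E[Y] = 1 and V[Y] = 1.
Y = rng.exponential(scale=1.0, size=(reps, n))
Sn = np.sqrt(n) * (Y.mean(axis=1) - 1.0)        # sqrt(n)(E_n[Y] - E[Y])

# Kolmogorov-Smirnov distance between S_n and N(0, V[Y]); the statistic
# should be small and the test should not reject at this sample size.
print(stats.kstest(Sn, stats.norm(loc=0.0, scale=1.0).cdf))
```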
4.1 The scaled and centered sum
Let \(Y_1, \ldots, Y_n\) be i.i.d. with mean \(\mu = \mathbb{E}[Y]\) and finite variance \(\sigma^2 = \mathbb{V}[Y]\). Arguing as in the lecture, define the sample mean as
\[\mathbb{E}_n[Y] = \frac{1}{n}\sum_{i=1}^n Y_i\]
Now consider the centered version, scaled by \(\sqrt{n}\):
\[S_n \;=\; \sqrt{n}\!\left(\mathbb{E}_n[Y] - \mu\right) \;=\; \frac{1}{\sqrt{n}}\sum_{i=1}^n (Y_i - \mu)\]
**Mean and variance of \(S_n\).** During the lecture, we obtained the mean and variance of \(S_n\) under the i.i.d. assumption as \[\mathbb{E}[S_n] = \frac{1}{\sqrt{n}}\sum_{i=1}^n \mathbb{E}[Y_i - \mu] = 0\] and \[\mathbb{V}[S_n] = \frac{1}{n}\sum_{i=1}^n \mathbb{V}[Y_i - \mu] = \frac{1}{n} \cdot n\sigma^2 = \sigma^2\]
The sample mean \(\mathbb{E}_n[Y]\) has variance \(\sigma^2/n\), which collapses to zero as \(n\to\infty\), producing a degenerate distribution in the limit. Thus, multiplying by \(\sqrt{n}\) exactly offsets this shrinkage, stabilizing the variance at a fixed, finite level \(\sigma^2\). This is why \(\sqrt{n}\) is the right normalizing rate: it is precisely the scaling that produces a non-degenerate limit.
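To see this stabilization numerically, here is a minimal Monte Carlo sketch (Python with NumPy; the Exponential(1) population is an arbitrary choice, with \(\mu = \sigma^2 = 1\)):

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 20_000                      # Monte Carlo replications
mu, sigma2 = 1.0, 1.0              # Exponential(1): mean 1, variance 1

for n in (10, 100, 1_000):
    Y = rng.exponential(scale=1.0, size=(reps, n))
    ybar = Y.mean(axis=1)          # sample means across replications
    # Raw mean: variance collapses at rate 1/n (LLN).
    # Scaled S_n = sqrt(n) * (ybar - mu): variance stays near sigma^2.
    print(f"n={n:5d}  Var[Ybar]={ybar.var():.4f}  "
          f"Var[S_n]={(np.sqrt(n) * (ybar - mu)).var():.4f}")
```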
4.2 Why the normal though? The characteristic function argument
To see why the limit is normal and not something else, consider the characteristic function of \(S_n\). Let \(\phi_Y(t) = \mathbb{E}[e^{it(Y-\mu)}]\) be the characteristic function of a single centered observation \(Y_i - \mu\).
Since the \(\{Y_i,1\le i \le n\}\) are i.i.d., the characteristic function of \(S_n = \frac{1}{\sqrt{n}}\sum(Y_i - \mu)\) is:
\[\phi_{S_n}(t) = \left[\phi_Y\!\left(\tfrac{t}{\sqrt{n}}\right)\right]^n\]
Now expand \(\phi_Y\) in a Taylor series around \(t = 0\). Using \(\mathbb{E}[Y_i - \mu] = 0\) and \(\mathbb{E}[(Y_i-\mu)^2] = \sigma^2\):
\[\phi_Y\!\left(\tfrac{t}{\sqrt{n}}\right) = 1 - \frac{\sigma^2 t^2}{2n} + \underbrace{\frac{(it)^3\,\kappa_3}{6\,n^{3/2}}}_{\text{skewness term}} + \underbrace{\frac{(it)^4\,\kappa_4}{24\,n^{2}}}_{\text{kurtosis term}} + \cdots\]
where \(\kappa_3, \kappa_4, \ldots\) denote the higher central moments of \(Y\): \(\kappa_3\) coincides with the third cumulant (skewness), and \(\kappa_4\) differs from the fourth cumulant (excess kurtosis) only by the constant \(3\sigma^4\). For what follows, all that matters is the power of \(n\) attached to each term.
We make the following observation: \[\log\!\left[\left(1+\frac{x}{n}\right)^{n}\right]=n\log\left(1+\frac{x}{n}\right)\] Now let \(u=x/n\) so \(u\to0\) as \(n\to\infty\). Use the Taylor expansion of \(\log(1+u)=u-\frac{u^2}{2}+\frac{u^3}{3}-\ldots\) to conclude
\[n\log\left(1+\frac{x}{n}\right)=n\left(\frac{x}{n}-\frac{x^2}{2n^2}+O(n^{-3})\right)=x-\frac{x^2}{2n}+O(n^{-2})\] where \(a_n:=O(n^{-2})\) means \(a_n\) converges to 0 at least as fast as \(n^{-2}\), so as \(n\to\infty\), all terms beyond \(x\) vanish. Exponentiating both sides gives \[\lim_{n\to\infty}\left[1+\frac{x}{n}\right]^n=e^x\] so, if we take \(x=-\sigma^2 t^2/2\), then raising \(\phi_Y\) to the \(n\)-th power and taking \(n \to \infty\) yields the desired result:
\[ \phi_{S_n}(t) = \left[1 - \frac{\sigma^2 t^2}{2n} + O\!\left(n^{-3/2}\right)\right]^n \;\longrightarrow\; e^{-\sigma^2 t^2/2}\]
This is precisely the characteristic function of \(\mathcal{N}(0, \sigma^2)\). Observe that the higher-order terms vanish because, after taking logarithms and multiplying by \(n\), they contribute only \(O(n^{-1/2})\); they go to zero rather than accumulating.
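As a numerical sanity check on this limit, one can compare the empirical characteristic function of \(S_n\) with \(e^{-\sigma^2 t^2/2}\). A minimal sketch (NumPy; Exponential(1) population assumed, so \(\sigma^2 = 1\)):

```python
import numpy as np

rng = np.random.default_rng(1)
reps, n = 50_000, 200
mu, sigma2 = 1.0, 1.0                        # Exponential(1)

Y = rng.exponential(1.0, size=(reps, n))
Sn = np.sqrt(n) * (Y.mean(axis=1) - mu)      # sqrt(n)-scaled, centered means

for t in (0.5, 1.0, 2.0):
    ecf = np.exp(1j * t * Sn).mean()         # empirical char. function at t
    print(f"t={t}: |ECF|={abs(ecf):.4f}  "
          f"target e^(-sigma^2 t^2/2)={np.exp(-sigma2 * t**2 / 2):.4f}")
```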
Notice what happens to each higher-order term as \(n \to \infty\):
| Term | Behavior as \(n \to \infty\) |
|---|---|
| Variance (\(\kappa_2\)) | Stays fixed (by design of the \(\sqrt{n}\) scaling) |
| Skewness (\(\kappa_3\)) | \(n^{-1/2} \to 0\) |
| Excess kurtosis (\(\kappa_4\)) | \(n^{-1} \to 0\) |
| \(k\)-th cumulant (\(\kappa_k\), \(k \geq 3\)) | \(n^{-(k-2)/2} \to 0\) |
Averaging destroys higher-moment information faster than it destroys variance. The mean and variance are pinned — everything else vanishes. The normal distribution is the unique distribution with finite variance for which all cumulants beyond the second are zero. It is the only distribution that survives.
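These rates are easy to verify by simulation. A minimal sketch (NumPy plus scipy.stats; the Exponential(1) population, with skewness \(2\) and excess kurtosis \(6\), is an arbitrary choice):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(2)
reps = 50_000

for n in (4, 16, 64, 256):
    Y = rng.exponential(1.0, size=(reps, n))
    Sn = np.sqrt(n) * (Y.mean(axis=1) - 1.0)
    # Exponential(1): skewness 2, excess kurtosis 6.
    # Theory: skew(S_n) = 2/sqrt(n), excess kurtosis(S_n) = 6/n.
    print(f"n={n:4d}  skew={skew(Sn):+.3f} (theory {2/np.sqrt(n):.3f})  "
          f"exc.kurt={kurtosis(Sn):+.3f} (theory {6/n:.3f})")
```

The printed skewness should track \(2/\sqrt{n}\) and the excess kurtosis \(6/n\), matching the rates in the table above.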
4.3 Intuition
While we accepted the statement of the CLT, it was less clear why the sampling distribution of \(S_n\) behaves like the normal distribution rather than some other distribution. Here are three complementary perspectives, with varying degrees of detail.
4.3.1 Take 1: The normal distribution as a fixed point (technical)
The characteristic function argument above shows that repeated averaging is an operator on distributions, and \(\mathcal{N}(0, \sigma^2)\) is the fixed point of that operator. Any distribution with finite variance gets pulled toward the normal under averaging, the way iterating a contraction mapping pulls you to a fixed point.
4.3.2 Take 2: Entropy maximization (technical but more intuitive)
Recall the intuition behind entropy (differential Shannon entropy): it measures how spread out a distribution is. Roughly speaking, the more spread out a distribution is, the more uncertainty there is (in terms of information). High uncertainty means high entropy: if the distribution is spread out all over, I know very little about where the next draw will land on the support. As an extreme case, a degenerate distribution, where all the mass is concentrated at a single point, has no uncertainty, and its entropy is \(-\infty\).
Now let’s go back to our example at hand. It turns out that, among all distributions with a fixed mean \(\mu\) and fixed variance \(\sigma^2\), the normal distribution uniquely maximizes entropy. In this sense, the normal is the distribution that carries no information beyond its first two moments.
So, when we average \(n\) i.i.d. draws, the skewness of the result shrinks at rate \(1/\sqrt{n}\), the excess kurtosis at rate \(1/n\), and so on. All the idiosyncratic structure of the original distribution is progressively erased. What remains, once only the mean and variance are left standing, is the distribution that makes the fewest additional claims, that is, the one with maximum entropy. That distribution is the normal.
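This claim can be checked directly with scipy.stats, comparing differential entropies of a few distributions calibrated to the same mean and variance (the specific competitors are an arbitrary choice):

```python
import numpy as np
from scipy import stats

# All three calibrated to mean 1 and variance 1.
candidates = {
    "normal":      stats.norm(loc=1.0, scale=1.0),
    "exponential": stats.expon(scale=1.0),                       # mean 1, var 1
    "uniform":     stats.uniform(loc=1 - np.sqrt(3), scale=2 * np.sqrt(3)),
}

for name, dist in candidates.items():
    # Differential entropy in nats; the normal should come out largest.
    print(f"{name:12s} entropy = {dist.entropy():.4f}")
```

The normal should report the largest value (about 1.42 nats), with the uniform and exponential strictly below it.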
**A silly analogy.** Imagine describing a person’s face with many numbers: eye color, nose shape, jawline, freckles, skin texture, etc. Now suppose a process keeps blurring the photo slightly at each step. The specific details, like freckles or eye color, disappear first. Eventually all you can make out from the original photo is the rough oval shape of a face. That oval would be the “maximum entropy face” (least informative) given the constraint that it’s still a human head. The oval plays the role of the fixed mean and variance, and the normal distribution is the statistical equivalent of that blurred-out limit.
4.3.3 Take 3: Convolutions smooth everything out (technical, easy to visualize)
When you add two independent random variables, you convolve their densities. Each convolution makes the result smoother and more symmetric: it irons out bumps, fills in gaps, and spreads mass more evenly. After enough convolutions, the shape is determined entirely by the mean and variance, and the only shape pinned down by those two quantities alone is the normal.
We can actually see this happening in the simulation below: even a bimodal or heavily skewed population produces the familiar bell-shaped sampling distribution once \(n\) is large enough.
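Complementing that simulation, here is a minimal numerical sketch of the convolution mechanics (NumPy; the asymmetric bimodal starting density is an assumption), tracking how the standardized skewness washes out with each self-convolution:

```python
import numpy as np

# Discretize a bimodal, asymmetric density on a grid.
dx = 0.01
x = np.arange(-2.0, 2.0, dx)
f = np.exp(-((x + 1.0) ** 2) / 0.10) + 0.5 * np.exp(-((x - 1.0) ** 2) / 0.05)
f /= f.sum() * dx                        # normalize so the density integrates to 1

g = f.copy()
for k in range(2, 7):
    g = np.convolve(g, f) * dx           # density of the sum of k i.i.d. draws
    grid = np.arange(g.size) * dx        # support grid (shift is irrelevant below)
    m = (grid * g).sum() * dx            # mean of the current density
    m2 = (((grid - m) ** 2) * g).sum() * dx
    m3 = (((grid - m) ** 3) * g).sum() * dx
    # Standardized skewness shrinks roughly like 1/sqrt(k).
    print(f"sum of {k} draws: skewness = {m3 / m2 ** 1.5:+.4f}")
```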
4.4 Interactive simulation
The simulation below lets you choose any population distribution and watch the sampling distribution of the mean evolve as \(n\) grows. The orange curve is the normal fit — notice how it tracks the histogram more and more closely as \(n\) increases.
Drag the n slider slowly and watch the bell shape emerge. Even bimodal or heavily skewed populations converge to normal. The skewness stat above tracks how quickly the asymmetry washes out. A static sketch reproducing the same experiment in code appears after the list below.
- Left panel — the population distribution. Notice it can be far from bell-shaped.
- Right panel — the sampling distribution of \(\bar Y\) across all simulated samples, with an orange normal curve overlaid.
- Try the bimodal distribution first. Set \(n=1\): two clear humps. Set \(n=5\): humps soften. By \(n=30\) the fit is already very good.
- Try exponential (heavily right-skewed). The convergence is slower — you need a larger \(n\) before the skewness stat drops close to zero.
- The Bernoulli (p = 0.3) case is a good reminder that the CLT applies even to discrete distributions: for large \(n\), the sampling distribution of a proportion is approximately normal.
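For readers without the interactive version, here is a minimal static sketch (Python with NumPy, SciPy, and Matplotlib; the equal mixture of \(\mathcal{N}(-2, 0.5^2)\) and \(\mathcal{N}(2, 0.5^2)\) is an assumed stand-in for the bimodal option):

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
reps = 20_000

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, n in zip(axes, (1, 5, 30)):
    # Bimodal population: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2).
    comp = rng.choice((-2.0, 2.0), size=(reps, n))
    Y = comp + rng.normal(0.0, 0.5, size=(reps, n))
    ybar = Y.mean(axis=1)                # sampling distribution of the mean
    ax.hist(ybar, bins=60, density=True)
    grid = np.linspace(ybar.min(), ybar.max(), 200)
    ax.plot(grid, norm.pdf(grid, ybar.mean(), ybar.std()), lw=2)  # normal fit
    ax.set_title(f"n = {n}")
plt.tight_layout()
plt.show()
```

At \(n=1\) the two humps are plainly visible; by \(n=30\) the histogram hugs the normal fit.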
4.5 Key takeaways
**Centering and scaling matter.** The sample mean collapses to a point (LLN). Multiplying by \(\sqrt{n}\) holds the variance fixed and reveals the distributional shape.
**Higher moments vanish at increasing rates.** Skewness shrinks at rate \(n^{-1/2}\), excess kurtosis at \(n^{-1}\), and so on. The \(\sqrt{n}\)-scaled sum retains only the variance (the mean was already pinned down at zero by the recentering in \(S_n\)).
**The normal is the attractor.** It is the unique fixed point of the averaging operator for distributions with finite variance, the maximum-entropy distribution given the mean and variance, and the limit of any such convolution sequence. It is not assumed; it emerges in the limit as \(n\to\infty\).
**Finite variance is the key condition.** If \(\mathbb{V}[Y] = \infty\) (e.g., the Cauchy distribution), the CLT fails and the \(\sqrt{n}\) rate no longer applies; see the sketch at the end of this section.
**Practical consequence.** Even when errors are skewed, bimodal, or heteroskedastic, we can make inference (e.g., confidence intervals) based on the normal approximation in the large-\(n\) framework, because the CLT operates on the sample mean or on smooth functions of sample means.
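To see the finite-variance condition bite, the sketch below (NumPy; the interquartile range is used as the spread measure because the variance does not exist) checks that the sample mean of standard Cauchy draws never concentrates:

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 2_000

for n in (10, 1_000, 10_000):
    # The sample mean of n i.i.d. standard Cauchy draws is itself standard
    # Cauchy, so its spread does not shrink with n (no LLN, no CLT).
    C = rng.standard_cauchy(size=(reps, n)).mean(axis=1)
    q25, q75 = np.percentile(C, [25, 75])
    # For a finite-variance population this IQR would shrink like 1/sqrt(n);
    # here it stays near the standard Cauchy IQR of 2.
    print(f"n={n:6d}  IQR of sample mean = {q75 - q25:.3f}")
```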