4 Details on Central Limit Theorem
Previously, we illustrated the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) for the estimator \(\hat{\beta}\) using Monte Carlo simulations. In particular, the LLN tells us that \(\hat{\beta}\) converges to \(\beta\) as \(n\) grows. The CLT goes further: it tells us the rate at which this happens and, more strikingly, the shape of the sampling distribution.
Let us now build more intuition in a simpler case, namely the sample mean (the intuition carries over to the estimator of the slope parameter, since \(\hat{\beta}\) is a smooth function of sample averages). Specifically, in class we stated that the sample mean based on an i.i.d. sample from a distribution with finite variance satisfies
\[\sqrt{n}\!\left(\mathbb{E}_n[Y] - \mathbb{E}[Y]\right) \xrightarrow{d} \mathcal{N}\!\left(0,\, \mathbb{V}[Y]\right)\]
In words: if we recenter the sample mean by subtracting the true population mean and rescale by \(\sqrt{n}\), the resulting quantity converges in distribution to a normal random variable with mean zero and variance \(\mathbb{V}[Y]\), no matter what the underlying distribution of \(Y\) is, provided it has finite variance.
The normality in the limit does not come from \(Y\) being normally distributed. In fact, we assumed nothing about the distribution of origin, only that the data generating process from which we draw \(Y_1, \ldots, Y_n\) has finite variance. The normal behavior emerges purely from the averaging itself. This is the mathematical foundation we will use to make approximate inference in this course.
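As a quick Monte Carlo sanity check of this statement, here is a minimal sketch (Python with NumPy and SciPy; the Exponential(1) population is an arbitrary skewed choice with \(\mathbb{E}[Y] = \mathbb{V}[Y] = 1\)) comparing the scaled, centered sample mean with its claimed normal limit:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps, n = 5_000, 1_000

# Heavily skewed population: Exponential(1), so E[Y] = 1 and V[Y] = 1.
Y = rng.exponential(scale=1.0, size=(reps, n))
Sn = np.sqrt(n) * (Y.mean(axis=1) - 1.0)        # sqrt(n)(E_n[Y] - E[Y])

# Kolmogorov-Smirnov distance between S_n and N(0, V[Y]); the statistic
# should be small and the test should not reject at this sample size.
print(stats.kstest(Sn, stats.norm(loc=0.0, scale=1.0).cdf))
```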
4.1 The scaled and centered sum
Let \(Y_1, \ldots, Y_n\) be i.i.d. with mean \(\mu = \mathbb{E}[Y]\) and finite variance \(\sigma^2 = \mathbb{V}[Y]\). Arguing as in the lecture, define the sample mean as
\[\mathbb{E}_n[Y] = \frac{1}{n}\sum_{i=1}^n Y_i\]
Now consider the centered version, scaled by \(\sqrt{n}\):
\[S_n \;=\; \sqrt{n}\!\left(\mathbb{E}_n[Y] - \mu\right) \;=\; \frac{1}{\sqrt{n}}\sum_{i=1}^n (Y_i - \mu)\]
**Mean and variance of \(S_n\).** During the lecture, we obtained the mean and variance of \(S_n\) under the i.i.d. assumption as \[\mathbb{E}[S_n] = \frac{1}{\sqrt{n}}\sum_{i=1}^n \mathbb{E}[Y_i - \mu] = 0\] and \[\mathbb{V}[S_n] = \frac{1}{n}\sum_{i=1}^n \mathbb{V}[Y_i - \mu] = \frac{1}{n} \cdot n\sigma^2 = \sigma^2\]
The sample mean \(\mathbb{E}_n[Y]\) has variance \(\sigma^2/n\), which collapses to zero as \(n\to\infty\), producing a degenerate distribution in the limit. Thus, multiplying by \(\sqrt{n}\) exactly offsets this shrinkage, stabilizing the variance at a fixed, finite level \(\sigma^2\). This is why \(\sqrt{n}\) is the right normalizing rate: it is precisely the scaling that produces a non-degenerate limit.
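To see this stabilization numerically, here is a minimal Monte Carlo sketch (Python with NumPy; the Exponential(1) population is an arbitrary choice, with \(\mu = \sigma^2 = 1\)):

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 20_000                      # Monte Carlo replications
mu, sigma2 = 1.0, 1.0              # Exponential(1): mean 1, variance 1

for n in (10, 100, 1_000):
    Y = rng.exponential(scale=1.0, size=(reps, n))
    ybar = Y.mean(axis=1)          # sample means across replications
    # Raw mean: variance collapses at rate 1/n (LLN).
    # Scaled S_n = sqrt(n) * (ybar - mu): variance stays near sigma^2.
    print(f"n={n:5d}  Var[Ybar]={ybar.var():.4f}  "
          f"Var[S_n]={(np.sqrt(n) * (ybar - mu)).var():.4f}")
```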
4.2 Why the normal though? The characteristic function argument
To see why the limit is normal and not something else, consider the characteristic function of \(S_n\). Let \(\phi_Y(t) = \mathbb{E}[e^{it(Y-\mu)}]\) be the characteristic function of a single centered observation \(Y_i - \mu\).
Since the \(\{Y_i,1\le i \le n\}\) are i.i.d., the characteristic function of \(S_n = \frac{1}{\sqrt{n}}\sum(Y_i - \mu)\) is:
\[\phi_{S_n}(t) = \left[\phi_Y\!\left(\tfrac{t}{\sqrt{n}}\right)\right]^n\]
Now expand \(\phi_Y\) in a Taylor series around \(t = 0\). Using \(\mathbb{E}[Y_i - \mu] = 0\) and \(\mathbb{E}[(Y_i-\mu)^2] = \sigma^2\):
\[\phi_Y\!\left(\tfrac{t}{\sqrt{n}}\right) = 1 - \frac{\sigma^2 t^2}{2n} + \underbrace{\frac{(it)^3\,\kappa_3}{6\,n^{3/2}}}_{\text{skewness term}} + \underbrace{\frac{(it)^4\,\kappa_4}{24\,n^{2}}}_{\text{kurtosis term}} + \cdots\]
where \(\kappa_3, \kappa_4, \ldots\) denote the higher central moments of \(Y\): \(\kappa_3\) coincides with the third cumulant (skewness), and \(\kappa_4\) differs from the fourth cumulant (excess kurtosis) only by the constant \(3\sigma^4\). For what follows, all that matters is the power of \(n\) attached to each term.
We make the following observation: \[\log\!\left[\left(1+\frac{x}{n}\right)^{n}\right]=n\log\left(1+\frac{x}{n}\right)\] Now let \(u=x/n\) so \(u\to0\) as \(n\to\infty\). Use the Taylor expansion of \(\log(1+u)=u-\frac{u^2}{2}+\frac{u^3}{3}-\ldots\) to conclude
\[n\log\left(1+\frac{x}{n}\right)=n\left(\frac{x}{n}-\frac{x^2}{2n^2}+O(n^{-3})\right)=x-\frac{x^2}{2n}+O(n^{-2})\] where \(a_n:=O(n^{-2})\) means \(a_n\) converges to 0 at least as fast as \(n^{-2}\), so as \(n\to\infty\), all terms beyond \(x\) vanish. Exponentiating both sides gives \[\lim_{n\to\infty}\left[1+\frac{x}{n}\right]^n=e^x\] so, if we take \(x=-\sigma^2 t^2/2\), then raising \(\phi_Y\) to the \(n\)-th power and taking \(n \to \infty\) yields the desired result:
\[ \phi_{S_n}(t) = \left[1 - \frac{\sigma^2 t^2}{2n} + O\!\left(n^{-3/2}\right)\right]^n \;\longrightarrow\; e^{-\sigma^2 t^2/2}\]
This is precisely the characteristic function of \(\mathcal{N}(0, \sigma^2)\). Observe that the higher-order terms vanish because, after taking logarithms and multiplying by \(n\), they contribute only \(O(n^{-1/2})\); they go to zero rather than accumulating.
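As a numerical sanity check on this limit, one can compare the empirical characteristic function of \(S_n\) with \(e^{-\sigma^2 t^2/2}\). A minimal sketch (NumPy; Exponential(1) population assumed, so \(\sigma^2 = 1\)):

```python
import numpy as np

rng = np.random.default_rng(1)
reps, n = 50_000, 200
mu, sigma2 = 1.0, 1.0                        # Exponential(1)

Y = rng.exponential(1.0, size=(reps, n))
Sn = np.sqrt(n) * (Y.mean(axis=1) - mu)      # sqrt(n)-scaled, centered means

for t in (0.5, 1.0, 2.0):
    ecf = np.exp(1j * t * Sn).mean()         # empirical char. function at t
    print(f"t={t}: |ECF|={abs(ecf):.4f}  "
          f"target e^(-sigma^2 t^2/2)={np.exp(-sigma2 * t**2 / 2):.4f}")
```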
Notice what happens to each higher-order term as \(n \to \infty\):
| Term | Behavior as \(n \to \infty\) |
|---|---|
| Variance (\(\kappa_2\)) | Stays fixed (by design of the \(\sqrt{n}\) scaling) |
| Skewness (\(\kappa_3\)) | \(n^{-1/2} \to 0\) |
| Excess kurtosis (\(\kappa_4\)) | \(n^{-1} \to 0\) |
| \(k\)-th cumulant (\(\kappa_k\), \(k \geq 3\)) | \(n^{-(k-2)/2} \to 0\) |
Averaging destroys higher-moment information faster than it destroys variance. The mean and variance are pinned — everything else vanishes. The normal distribution is the unique distribution with finite variance for which all cumulants beyond the second are zero. It is the only distribution that survives.
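These rates are easy to verify by simulation. A minimal sketch (NumPy plus scipy.stats; the Exponential(1) population, with skewness \(2\) and excess kurtosis \(6\), is an arbitrary choice):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(2)
reps = 50_000

for n in (4, 16, 64, 256):
    Y = rng.exponential(1.0, size=(reps, n))
    Sn = np.sqrt(n) * (Y.mean(axis=1) - 1.0)
    # Exponential(1): skewness 2, excess kurtosis 6.
    # Theory: skew(S_n) = 2/sqrt(n), excess kurtosis(S_n) = 6/n.
    print(f"n={n:4d}  skew={skew(Sn):+.3f} (theory {2/np.sqrt(n):.3f})  "
          f"exc.kurt={kurtosis(Sn):+.3f} (theory {6/n:.3f})")
```

The printed skewness should track \(2/\sqrt{n}\) and the excess kurtosis \(6/n\), matching the rates in the table above.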
4.3 Intuition
While we accepted the statement of the CLT, it was less clear why the sampling distribution of \(S_n\) behaves like the normal distribution rather than some other distribution. Here are three complementary perspectives, with varying degrees of detail.
4.3.1 Take 1: The normal distribution as a fixed point (technical)
The characteristic function argument above shows that repeated averaging is an operator on distributions, and \(\mathcal{N}(0, \sigma^2)\) is the fixed point of that operator. Any distribution with finite variance gets pulled toward the normal under averaging, the way iterating a contraction mapping pulls you to a fixed point.
4.3.2 Take 2: Entropy maximization (technical but more intuitive)
Recall the intuition behind entropy (differential Shannon entropy): it measures how spread out a distribution is. Roughly speaking, the more spread out a distribution is, the more uncertainty there is (in terms of information). High uncertainty means high entropy: if the distribution is spread out all over, I know very little about where the next draw will land on the support. As an extreme case, a degenerate distribution, where all the mass is concentrated at a single point, has no uncertainty, and its entropy is \(-\infty\).
Now let’s go back to our example at hand. It turns out that, among all distributions with a fixed mean \(\mu\) and fixed variance \(\sigma^2\), the normal distribution uniquely maximizes entropy. In this sense, the normal is the distribution that carries no information beyond its first two moments.
So, when we average \(n\) i.i.d. draws, the skewness of the result shrinks at rate \(1/\sqrt{n}\), the excess kurtosis at rate \(1/n\), and so on. All the idiosyncratic structure of the original distribution is progressively erased. What remains, once only the mean and variance are left standing, is the distribution that makes the fewest additional claims, that is, the one with maximum entropy. That distribution is the normal.
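This claim can be checked directly with scipy.stats, comparing differential entropies of a few distributions calibrated to the same mean and variance (the specific competitors are an arbitrary choice):

```python
import numpy as np
from scipy import stats

# All three calibrated to mean 1 and variance 1.
candidates = {
    "normal":      stats.norm(loc=1.0, scale=1.0),
    "exponential": stats.expon(scale=1.0),                       # mean 1, var 1
    "uniform":     stats.uniform(loc=1 - np.sqrt(3), scale=2 * np.sqrt(3)),
}

for name, dist in candidates.items():
    # Differential entropy in nats; the normal should come out largest.
    print(f"{name:12s} entropy = {dist.entropy():.4f}")
```

The normal should report the largest value (about 1.42 nats), with the uniform and exponential strictly below it.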
**A silly analogy.** Imagine describing a person’s face with many numbers: eye color, nose shape, jawline, freckles, skin texture, etc. Now suppose a process keeps blurring the photo slightly at each step. The specific details, like freckles or eye color, disappear first. Eventually all you can make out from the original photo is the rough oval shape of a face. That oval would be the “maximum entropy face” (least informative) given the constraint that it’s still a human head. The oval plays the role of the fixed mean and variance, and the normal distribution is the statistical equivalent of that blurred-out limit.
4.3.3 Take 3: Convolutions smooth everything out (technical, easy to visualize)
When you add two independent random variables, you convolve their densities. Each convolution makes the result smoother and more symmetric: it irons out bumps, fills in gaps, and spreads mass more evenly. After enough convolutions, the shape is determined entirely by the mean and variance, and the only shape pinned down by those two quantities alone is the normal.
We can actually see this happening in the simulation below: even a bimodal or heavily skewed population produces the familiar bell-shaped sampling distribution once \(n\) is large enough.
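Complementing that simulation, here is a minimal numerical sketch of the convolution mechanics (NumPy; the asymmetric bimodal starting density is an assumption), tracking how the standardized skewness washes out with each self-convolution:

```python
import numpy as np

# Discretize a bimodal, asymmetric density on a grid.
dx = 0.01
x = np.arange(-2.0, 2.0, dx)
f = np.exp(-((x + 1.0) ** 2) / 0.10) + 0.5 * np.exp(-((x - 1.0) ** 2) / 0.05)
f /= f.sum() * dx                        # normalize so the density integrates to 1

g = f.copy()
for k in range(2, 7):
    g = np.convolve(g, f) * dx           # density of the sum of k i.i.d. draws
    grid = np.arange(g.size) * dx        # support grid (shift is irrelevant below)
    m = (grid * g).sum() * dx            # mean of the current density
    m2 = (((grid - m) ** 2) * g).sum() * dx
    m3 = (((grid - m) ** 3) * g).sum() * dx
    # Standardized skewness shrinks roughly like 1/sqrt(k).
    print(f"sum of {k} draws: skewness = {m3 / m2 ** 1.5:+.4f}")
```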
4.4 Interactive simulation
The simulation below lets you choose any population distribution and watch the sampling distribution of the mean evolve as \(n\) grows. The orange curve is the normal fit — notice how it tracks the histogram more and more closely as \(n\) increases.
Drag the n slider slowly and watch the bell shape emerge. Even bimodal or heavily skewed populations converge to normal. The skewness stat above tracks how quickly the asymmetry washes out. A static sketch reproducing the same experiment in code appears after the list below.
- Left panel — the population distribution. Notice it can be far from bell-shaped.
- Right panel — the sampling distribution of \(\bar Y\) across all simulated samples, with an orange normal curve overlaid.
- Try the bimodal distribution first. Set \(n=1\): two clear humps. Set \(n=5\): humps soften. By \(n=30\) the fit is already very good.
- Try exponential (heavily right-skewed). The convergence is slower — you need a larger \(n\) before the skewness stat drops close to zero.
- The Bernoulli (p = 0.3) case is a good reminder that the CLT applies even to discrete distributions: for large \(n\), the sampling distribution of a proportion is approximately normal.
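For readers without the interactive version, here is a minimal static sketch (Python with NumPy, SciPy, and Matplotlib; the equal mixture of \(\mathcal{N}(-2, 0.5^2)\) and \(\mathcal{N}(2, 0.5^2)\) is an assumed stand-in for the bimodal option):

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
reps = 20_000

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, n in zip(axes, (1, 5, 30)):
    # Bimodal population: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2).
    comp = rng.choice((-2.0, 2.0), size=(reps, n))
    Y = comp + rng.normal(0.0, 0.5, size=(reps, n))
    ybar = Y.mean(axis=1)                # sampling distribution of the mean
    ax.hist(ybar, bins=60, density=True)
    grid = np.linspace(ybar.min(), ybar.max(), 200)
    ax.plot(grid, norm.pdf(grid, ybar.mean(), ybar.std()), lw=2)  # normal fit
    ax.set_title(f"n = {n}")
plt.tight_layout()
plt.show()
```

At \(n=1\) the two humps are plainly visible; by \(n=30\) the histogram hugs the normal fit.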
4.5 Key takeaways
**Centering and scaling matter.** The sample mean collapses to a point (LLN). Multiplying by \(\sqrt{n}\) holds the variance fixed and reveals the distributional shape.
**Higher moments vanish at increasing rates.** Skewness shrinks at rate \(n^{-1/2}\), excess kurtosis at \(n^{-1}\), and so on. The \(\sqrt{n}\)-scaled sum retains only the variance (the mean was already pinned down at zero by the recentering in \(S_n\)).
**The normal is the attractor.** It is the unique fixed point of the averaging operator for distributions with finite variance, the maximum-entropy distribution given the mean and variance, and the limit of any such convolution sequence. It is not assumed; it emerges in the limit as \(n\to\infty\).
**Finite variance is the key condition.** If \(\mathbb{V}[Y] = \infty\) (e.g., the Cauchy distribution), the CLT fails and the \(\sqrt{n}\) rate no longer applies; see the sketch at the end of this section.
**Practical consequence.** Even when errors are skewed, bimodal, or heteroskedastic, we can make inference (e.g., confidence intervals) based on the normal approximation in the large-\(n\) framework, because the CLT operates on the sample mean or on smooth functions of sample means.
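To see the finite-variance condition bite, the sketch below (NumPy; the interquartile range is used as the spread measure because the variance does not exist) checks that the sample mean of standard Cauchy draws never concentrates:

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 2_000

for n in (10, 1_000, 10_000):
    # The sample mean of n i.i.d. standard Cauchy draws is itself standard
    # Cauchy, so its spread does not shrink with n (no LLN, no CLT).
    C = rng.standard_cauchy(size=(reps, n)).mean(axis=1)
    q25, q75 = np.percentile(C, [25, 75])
    # For a finite-variance population this IQR would shrink like 1/sqrt(n);
    # here it stays near the standard Cauchy IQR of 2.
    print(f"n={n:6d}  IQR of sample mean = {q75 - q25:.3f}")
```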