Set the seed for reproducibility.
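The seed call itself is not shown here; a minimal sketch (the value 123 is an arbitrary choice, not necessarily the seed used to produce the output below):

```r
# Fix the RNG state so the simulated data (and hence all output) is reproducible.
# The seed value 123 is arbitrary.
set.seed(123)
```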
Load the packages we will need in this tutorial. Make sure they are
installed; if not, you can install them with
install.packages("package_name").
library(here)
library(sandwich)
library(lmtest)
library(car)
To avoid clutter, I have coded a couple of functions elsewhere that we can source and use here right away. These functions help us generate data for the Monte Carlo simulations and plot the results of our simulations.
source(here("scripts/gen_data.R"))
source(here("scripts/plots_estimates.R"))
The goal of this tutorial is to illustrate how to perform heteroskedasticity-robust inference in the context of the linear regression model. We will consider single and multiple hypotheses, both linear and non-linear.
Let us generate data according to model 2 (see the
gen_data function for details). We will then estimate the
OLS model using the generated data.
data <- gen_data(n = 1000, dgp = "model 2")
Y <- data$Y
X <- data$X
lm() includes an intercept by default. We need to remove it from
our model because gen_data.R already returns a matrix with an
intercept column.
fitModel <- lm(Y ~ X - 1)
summary(fitModel)
##
## Call:
## lm(formula = Y ~ X - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8948 -0.6204 0.0040 0.6261 3.4329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## XIntercept 0.45611 0.06257 7.290 6.3e-13 ***
## XX1 2.02521 0.03045 66.513 < 2e-16 ***
## XX2 -0.97326 0.10958 -8.882 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9896 on 997 degrees of freedom
## Multiple R-squared: 0.8191, Adjusted R-squared: 0.8186
## F-statistic: 1505 on 3 and 997 DF, p-value: < 2.2e-16
Next, we compute the heteroskedasticity-robust (HC0) estimate of the covariance matrix:
Vhat <- vcovHC(fitModel, type = "HC0")
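To make the HC0 estimator concrete, here is a hand-rolled version of the sandwich formula \((X'X)^{-1} X' \mathrm{diag}(\hat{e}_i^2) X (X'X)^{-1}\). It should agree with vcovHC up to numerical precision (a sketch; the helper name hc0_manual is my own, and it assumes X is the design matrix used in the regression above):

```r
# HC0 "by hand": (X'X)^{-1} X' diag(e^2) X (X'X)^{-1},
# where e are the OLS residuals.
hc0_manual <- function(X, e) {
  bread <- solve(crossprod(X))   # (X'X)^{-1}
  meat  <- crossprod(X * e)      # X' diag(e^2) X
  bread %*% meat %*% bread
}

# Should be numerically ~0:
max(abs(hc0_manual(X, resid(fitModel)) - Vhat))
```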
Before we proceed with the inference, recall that for this example,
we generated data according to model 2, which means that the true
parameter values are \(\beta=(0.5,2,-1)\). We can use the
linearHypothesis function from the car package
to test linear hypotheses. For example, we can test if the second
coefficient is equal to 2. Notice that this is a true hypothesis, so the
test should not reject \(H_0\). Indeed,
this is the case:
test_result <- linearHypothesis(fitModel, "XX1 = 2", vcov = vcovHC(fitModel, type = "HC0"), test = "Chisq")
summary(test_result)
## Res.Df Df Chisq Pr(>Chisq)
## Min. :997.0 Min. :1 Min. :0.5668 Min. :0.4515
## 1st Qu.:997.2 1st Qu.:1 1st Qu.:0.5668 1st Qu.:0.4515
## Median :997.5 Median :1 Median :0.5668 Median :0.4515
## Mean :997.5 Mean :1 Mean :0.5668 Mean :0.4515
## 3rd Qu.:997.8 3rd Qu.:1 3rd Qu.:0.5668 3rd Qu.:0.4515
## Max. :998.0 Max. :1 Max. :0.5668 Max. :0.4515
## NA's :1 NA's :1 NA's :1
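Under the hood, linearHypothesis computes the Wald statistic \(W=(R\hat{\beta}-r)'[R\hat{V}R']^{-1}(R\hat{\beta}-r)\). A minimal hand-rolled version (a sketch; the helper name wald_linear is my own):

```r
# Wald test of H0: R b = r, given coefficients b and a covariance estimate V
# (e.g. the HC0 matrix Vhat). Each row of R encodes one linear restriction.
wald_linear <- function(b, V, R, r) {
  R <- matrix(R, ncol = length(b), byrow = TRUE)
  diff <- R %*% b - r
  W <- drop(t(diff) %*% solve(R %*% V %*% t(R)) %*% diff)
  c(Wald = W, p_value = pchisq(W, df = nrow(R), lower.tail = FALSE))
}

# Reproduces the chi-squared statistic from linearHypothesis above:
wald_linear(coef(fitModel), Vhat, R = c(0, 1, 0), r = 2)
```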
We now test the joint hypothesis \(H_0: \beta_1=1,\:\beta_2=0\). In this case, the null hypothesis is false, so we expect the test to reject \(H_0\):
joint_test_result <- linearHypothesis(fitModel, c("XX1 = 1", "XX2 = 0"), vcov = vcovHC(fitModel, type = "HC0"), test = "Chisq")
summary(joint_test_result)
## Res.Df Df Chisq Pr(>Chisq)
## Min. :997.0 Min. :2 Min. :1100 Min. :0
## 1st Qu.:997.5 1st Qu.:2 1st Qu.:1100 1st Qu.:0
## Median :998.0 Median :2 Median :1100 Median :0
## Mean :998.0 Mean :2 Mean :1100 Mean :0
## 3rd Qu.:998.5 3rd Qu.:2 3rd Qu.:1100 3rd Qu.:0
## Max. :999.0 Max. :2 Max. :1100 Max. :0
## NA's :1 NA's :1 NA's :1
Let us consider two types of non-linear hypotheses. First, we want to test \(H_0: \beta_0\beta_1+\beta_2=0\). Then, we will test the joint hypothesis \(H_0: \beta_0\beta_1+\beta_2=0,\:\beta_1^2=3\). Observe that the first non-linear hypothesis is true, while the second one is false. We therefore expect the first test not to reject \(H_0\), while the second one should reject it.
Before we proceed, let’s define a couple of functions that compute \(\hat{\theta}\) and matrix \(\hat{A}\) for the non-linear hypotheses (the Jacobian matrix); see Assumption A5 in the lecture notes for details.
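Concretely, writing \(\hat{\theta}=\theta(\hat{\beta})\) and letting \(\hat{A}\) denote the (transposed) Jacobian of \(\theta\) evaluated at \(\hat{\beta}\), the delta-method Wald statistic we compute below takes the form (my rendering of the lecture notes' setup)
\[
W = \hat{\theta}' \left( \hat{A}' \hat{V} \hat{A} \right)^{-1} \hat{\theta} \xrightarrow{d} \chi^2_q \quad \text{under } H_0,
\]
where \(q\) is the number of restrictions and \(\hat{V}\) is the robust covariance estimate.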
theta <- function(a, type = c("single", "multiple")) {
  type <- match.arg(type)
  if (type == "single") {
    return(c(a[1] * a[2] + a[3]))
  }
  if (type == "multiple") {
    return(c(a[1] * a[2] + a[3], a[2]^2 - 3))
  }
  stop("Unrecognized type. Use 'single' or 'multiple'.")
}
A_jacobian <- function(a, type = c("single", "multiple")) {
  type <- match.arg(type)
  if (type == "single") {
    return(c(a[2], a[1], 1))
  }
  if (type == "multiple") {
    # Each column is the gradient of one restriction
    # (A is the transpose of the Jacobian)
    A <- c(a[2], a[1], 1, 0, 2 * a[2], 0)
    return(matrix(A, nrow = 3, ncol = 2, byrow = FALSE))
  }
  stop("Unrecognized type. Use 'single' or 'multiple'.")
}
Let’s calculate the Wald statistic for the first non-linear hypothesis \(H_0: \beta_0\beta_1+\beta_2=0\).
theta_hat_1 <- theta(coef(fitModel), type = "single")
Ahat_1 <- A_jacobian(coef(fitModel), type = "single")
Wald_1 <- t(theta_hat_1) %*% solve(t(Ahat_1) %*% Vhat %*% Ahat_1) %*% theta_hat_1
Wald_1
## [,1]
## [1,] 0.4860079
To compute the p-value, we need the degrees of freedom. Here we have one restriction, so there is one degree of freedom.
p_value_1 <- pchisq(Wald_1, df = 1, lower.tail = FALSE)
p_value_1
## [,1]
## [1,] 0.4857135
Suppose we wanted to test the first non-linear hypothesis at a significance level of \(\alpha = 0.05\). We compare the p-value with \(\alpha\) to decide whether or not to reject the null hypothesis. In this case, the p-value is not smaller than \(\alpha\), so we cannot reject \(H_0\).
Now, let’s calculate the Wald statistic for the second non-linear hypothesis \(H_0: \beta_0\beta_1+\beta_2=0,\:\beta_1^2=3\).
theta_hat_2 <- theta(coef(fitModel), type = "multiple")
Ahat_2 <- A_jacobian(coef(fitModel), type = "multiple")
Wald_2 <- t(theta_hat_2) %*% solve(t(Ahat_2) %*% Vhat %*% Ahat_2) %*% theta_hat_2
Wald_2
## [,1]
## [1,] 104.233
To compute the p-value, we again need the degrees of freedom. Here we have two restrictions, so there are two degrees of freedom.
p_value_2 <- pchisq(Wald_2, df = 2, lower.tail = FALSE)
p_value_2
## [,1]
## [1,] 2.323185e-23
Suppose we wanted to test the second non-linear hypothesis at a significance level of \(\alpha = 0.05\). We compare the p-value with \(\alpha\) to decide whether or not to reject the null hypothesis. In this case, the p-value is smaller than \(\alpha\), so we reject the null hypothesis.
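Equivalently, we can compare the Wald statistic directly with the \(\chi^2_2\) critical value:

```r
alpha <- 0.05
crit <- qchisq(1 - alpha, df = 2)  # about 5.99
crit
# Wald_2 is far above this critical value, so we reject H0 at the 5% level.
```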