1 Preliminaries

First, set a seed so that the results below are reproducible.
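For example (the particular seed value is arbitrary; any fixed value works):

```r
set.seed(123)  # any fixed seed makes the random draws below reproducible
```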

Load the packages we will need in this tutorial. Make sure they are installed; if not, you can install them with install.packages("package_name").

library(here)
library(sandwich)
library(lmtest)
library(car)

To avoid clutter, I have written a couple of helper functions elsewhere that we can source and use right away. These functions generate the data for the Monte Carlo simulations and plot the simulation results.

source(here("scripts/gen_data.R"))
source(here("scripts/plots_estimates.R"))

1.1 Introduction

The goal of this tutorial is to illustrate how to perform heteroskedasticity-robust inference in the context of the linear regression model. We will consider single and multiple hypotheses, both linear and non-linear.

1.2 Generate the Data and Estimate the Model

Let us generate data according to model 2 (see the gen_data function for details). We will then estimate the OLS model using the generated data.

data <- gen_data(n = 1000, dgp = "model 2")
Y <- data$Y
X <- data$X

lm() includes an intercept by default. We need to remove it from the formula (with - 1) because gen_data.R returns a design matrix that already contains an intercept column.

fitModel <- lm(Y ~ X - 1)
summary(fitModel)
## 
## Call:
## lm(formula = Y ~ X - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8948 -0.6204  0.0040  0.6261  3.4329 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## XIntercept  0.45611    0.06257   7.290  6.3e-13 ***
## XX1         2.02521    0.03045  66.513  < 2e-16 ***
## XX2        -0.97326    0.10958  -8.882  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9896 on 997 degrees of freedom
## Multiple R-squared:  0.8191, Adjusted R-squared:  0.8186 
## F-statistic:  1505 on 3 and 997 DF,  p-value: < 2.2e-16

Next, we compute the heteroskedasticity-robust (HC0) estimate of the covariance matrix of the OLS estimator:

Vhat <- vcovHC(fitModel, type = "HC0")
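As a sanity check (a minimal sketch using hypothetical toy data, not the tutorial's gen_data), HC0 is the plain sandwich estimator \((X'X)^{-1} X' \mathrm{diag}(\hat{e}_i^2) X (X'X)^{-1}\), which we can reproduce by hand and compare against vcovHC:

```r
library(sandwich)

# Toy data with heteroskedastic errors (illustrative only)
set.seed(1)
n <- 200
Xmat <- cbind(1, rnorm(n))                                  # design matrix with intercept
y <- drop(Xmat %*% c(0.5, 2)) + rnorm(n) * abs(Xmat[, 2])   # error variance depends on X
fit <- lm(y ~ Xmat - 1)

# HC0 sandwich computed manually: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
e <- residuals(fit)
bread <- solve(crossprod(Xmat))                             # (X'X)^{-1}
meat <- t(Xmat) %*% (Xmat * e^2)                            # X' diag(e^2) X
V_manual <- bread %*% meat %*% bread

max(abs(V_manual - vcovHC(fit, type = "HC0")))              # numerically ~ 0
```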

1.3 Testing Linear Hypotheses

Before we proceed with the inference, recall that for this example, we generated data according to model 2, which means that the true parameter values are \(\beta=(0.5,2,-1)\). We can use the linearHypothesis function from the car package to test linear hypotheses. For example, we can test if the second coefficient is equal to 2. Notice that this is a true hypothesis, so the test should not reject \(H_0\). Indeed, this is the case:

test_result <- linearHypothesis(fitModel, "XX1 = 2", vcov = vcovHC(fitModel, type = "HC0"), test = "Chisq")
test_result
## Linear hypothesis test
## 
## Hypothesis:
## XX1 = 2
## 
## Model 1: restricted model
## Model 2: Y ~ X - 1
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df  Chisq Pr(>Chisq)
## 1    998                     
## 2    997  1 0.5668     0.4515

We now test the joint hypothesis \(H_0: \beta_1=1,\:\beta_2=0\). In this case, the null hypothesis is false, so we expect the test to reject \(H_0\):

joint_test_result <- linearHypothesis(fitModel, c("XX1 = 1", "XX2 = 0"), vcov = vcovHC(fitModel, type = "HC0"), test = "Chisq")
joint_test_result
## Linear hypothesis test
## 
## Hypothesis:
## XX1 = 1
## XX2 = 0
## 
## Model 1: restricted model
## Model 2: Y ~ X - 1
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df Chisq Pr(>Chisq)    
## 1    999                        
## 2    997  2  1100  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1.4 Testing Non-Linear Hypotheses

Let us consider two types of non-linear hypotheses. First, we want to test whether \(H_0: \beta_0\beta_1+\beta_2=0\). Then, we will test the joint hypothesis \(H_0: \beta_0\beta_1+\beta_2=0,\:\beta_1^2=3\). Observe that the first non-linear hypothesis is true, while the second one is false. We expect the first test not to reject \(H_0\), while the second one should reject it.
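We can check these claims directly by evaluating the restrictions at the true parameter vector \(\beta=(0.5,2,-1)\):

```r
beta_true <- c(0.5, 2, -1)
beta_true[1] * beta_true[2] + beta_true[3]  # = 0, so the first hypothesis holds
beta_true[2]^2 - 3                          # = 1, so the second restriction fails
```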

Before we proceed, let’s define a couple of functions that compute \(\hat{\theta}\) and the Jacobian matrix \(\hat{A}\) for the non-linear hypotheses; see Assumption A5 in the lecture notes for details.
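For reference, the Wald statistic we compute below has the usual form (a sketch, consistent with the code that follows; here \(\hat{V}\) is the robust covariance estimate of the coefficients and \(q\) is the number of restrictions):

\[
W = \hat{\theta}' \left( \hat{A}' \hat{V} \hat{A} \right)^{-1} \hat{\theta} \;\xrightarrow{d}\; \chi^2_q \quad \text{under } H_0.
\]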

# theta(a): the restriction function evaluated at the parameter vector a
theta <- function(a, type = c("single", "multiple")) {
  type <- match.arg(type)  # match.arg() errors on an unrecognized type
  if (type == "single") {
    # single restriction: beta0 * beta1 + beta2
    return(c(a[1] * a[2] + a[3]))
  }
  # multiple restrictions: beta0 * beta1 + beta2 and beta1^2 - 3
  c(a[1] * a[2] + a[3], a[2]^2 - 3)
}

# A_jacobian(a): Jacobian of the restriction function, one column per restriction
A_jacobian <- function(a, type = c("single", "multiple")) {
  type <- match.arg(type)
  if (type == "single") {
    return(c(a[2], a[1], 1))
  }
  # columns stack the gradients of each restriction
  A <- c(a[2], a[1], 1, 0, 2 * a[2], 0)
  matrix(A, nrow = 3, ncol = 2, byrow = FALSE)
}

2 Test Non-Linear Hypothesis 1

Let’s calculate the Wald statistic for the first non-linear hypothesis \(H_0: \beta_0\beta_1+\beta_2=0\).

theta_hat_1 <- theta(coef(fitModel), type = "single")
Ahat_1 <- A_jacobian(coef(fitModel), type = "single")
Wald_1 <- t(theta_hat_1) %*% solve(t(Ahat_1) %*% Vhat %*% Ahat_1) %*% theta_hat_1
Wald_1
##           [,1]
## [1,] 0.4860079

To compute the p-value, we need to know the degrees of freedom. In this case, we have one restriction, so there is one degree of freedom.

p_value_1 <- pchisq(Wald_1, df = 1, lower.tail = FALSE)
p_value_1
##           [,1]
## [1,] 0.4857135

Suppose we wanted to test the first non-linear hypothesis at a significance level of \(\alpha = 0.05\). We compare the p-value with \(\alpha\) to decide whether to reject the null hypothesis. Here the p-value exceeds \(\alpha\), so we fail to reject \(H_0\).
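Equivalently, we could compare the Wald statistic directly with the \(\chi^2_1\) critical value:

```r
alpha <- 0.05
crit <- qchisq(1 - alpha, df = 1)
crit  # about 3.84; the Wald statistic of 0.486 is well below it
```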

3 Test Non-Linear Hypothesis 2

Now, let’s calculate the Wald statistic for the second non-linear hypothesis \(H_0: \beta_0\beta_1+\beta_2=0,\:\beta_1^2=3\).

theta_hat_2 <- theta(coef(fitModel), type = "multiple")
Ahat_2 <- A_jacobian(coef(fitModel), type = "multiple")
Wald_2 <- t(theta_hat_2) %*% solve(t(Ahat_2) %*% Vhat %*% Ahat_2) %*% theta_hat_2
Wald_2
##         [,1]
## [1,] 104.233

To compute the p-value, we need to know the degrees of freedom. In this case, we have two restrictions, so there are two degrees of freedom.

p_value_2 <- pchisq(Wald_2, df = 2, lower.tail = FALSE)
p_value_2
##              [,1]
## [1,] 2.323185e-23

Suppose we wanted to test the second non-linear hypothesis at a significance level of \(\alpha = 0.05\). We compare the p-value with \(\alpha\) to decide whether to reject the null hypothesis. In this case, the p-value is far smaller than \(\alpha\), so we reject \(H_0\).
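The same conclusion follows from the \(\chi^2_2\) critical value:

```r
crit <- qchisq(1 - 0.05, df = 2)
crit  # about 5.99, far below the Wald statistic of roughly 104
```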