Lee, David S., Justin McCrary, Marcelo J. Moreira, and Jack Porter. 2022. "Valid t-Ratio Inference for IV." American Economic Review, 112 (10): 3260–90. DOI: 10.1257/aer.20211063.
STATA code to add tF critical values and standard errors to ivreg2 output
To install from STATA:
. ssc install ranktest, replace
. ssc install ivreg2, replace
. net install tf, force from(http://www.princeton.edu/~davidlee/wp/)
. help tf
Frequently Asked Questions about “Valid t-ratio Inference for IV”
Why using $\hat{\beta}\pm1.96\times(std.error)$ as a 95 percent confidence interval for (just-identified) IV is incorrect$-$and what to do about it.
What do you mean by "$\hat{\beta} \pm 1.96 \times (std.err.)$" for (just-identified) IV?
In the discussion below, $\hat{\beta}$ is the 2SLS coefficient estimate on X (the endogenous regressor of interest) when one, for example, uses the STATA command "ivregress 2sls Y W (X=Z), robust" or "ivreg2 Y W (X=Z), cluster(id)" (where W are additional controls). "(std.err.)", which we will equivalently refer to as $\sqrt{\hat{V}_{N}\left(\hat{\beta}\right)}$, is the reported standard error for $\hat{\beta}$. By "just-identified", we are referring to the case where there is a single excluded instrument Z.
More formally, the usual textbook treatment of the just-identified instrumental variable (IV) model would look something like this:
\[
Y = X\beta + u, \qquad COV(Z,u)=0, \qquad COV(X,Z) \ne 0 \tag{1}
\]
where $X$ is the endogenous regressor of interest, and $Z$ is the single excluded instrument.
When $(Y, X, Z)$ are random variables that represent the observations of those variables from a randomly drawn unit from the population, the typical textbook recommendation is to use the “sandwich” or “robust” formula for the standard error of the 2SLS estimator $\hat{\beta}$, which is given by
\[
\sqrt{\hat{V}_{N}\left(\hat{\beta}\right)} \equiv \left(\text{``robust'' IV standard error}\right) \equiv \sqrt{\frac{\hat{V}\left(Z\hat{u}\right)}{\left(\mathbf{Z}^{\prime}\mathbf{X}\right)^{2}}}
\]
where $\hat{u} \equiv Y - X\hat{\beta}$, and the bold indicates a random vector with each element being a different unit of observation. $\hat{V}\left(Z\hat{u}\right)=\sum_{i}Z_{i}^{2}\hat{u}_{i}^{2}$ in the case of heteroskedasticity-robust standard errors.
All of the results we discuss in the paper and below generalize to cases where additional exogenous controls/covariates (such as W above) are included, but we focus on the case of no covariates for exposition here.
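To make the formula concrete, here is a minimal STATA sketch, assuming variables named y, x, and z are in memory (hypothetical names) and estimating without a constant so that the expression matches model (1) exactly. It compares the standard error reported by ivreg2 with the hand-computed $\sqrt{\sum_{i}Z_{i}^{2}\hat{u}_{i}^{2}/\left(\sum_{i}Z_{i}X_{i}\right)^{2}}$; the two should agree up to any finite-sample adjustment.

* hand-check of the heteroskedasticity-robust IV standard error (sketch)
ivreg2 y (x = z), robust noconstant
gen double uhat  = y - x*_b[x]          // 2SLS residual
gen double zu_sq = (z*uhat)^2           // Z_i^2 * uhat_i^2
gen double zx    = z*x                  // Z_i * X_i
quietly summarize zu_sq
scalar Vzu = r(sum)                     // sum of Z_i^2 * uhat_i^2
quietly summarize zx
scalar ZX  = r(sum)                     // Z'X
display "hand-computed robust SE: " sqrt(Vzu/ZX^2)
display "ivreg2 robust SE:        " _se[x]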
Practitioners typically interpret the interval $\hat{\beta}\pm1.96\sqrt{\hat{V}_{N}\left(\hat{\beta}\right)}$ as a "95 percent confidence interval". They are effectively making the statement: "Under the assumptions of the traditional model above, in repeated samples, the interval $\hat{\beta}\pm1.96\sqrt{\hat{V}_{N}\left(\hat{\beta}\right)}$ is expected to contain the true $\beta$ at least 95 percent of the time, approximately, where 'approximately' acknowledges approximation errors that become negligible in large samples."
While this statement is true for other common estimators (e.g., the mean, or least squares regression coefficients), it is not true in the context of instrumental variable models, including just-identified IV models. This has been known in the econometrics literature for a long time (e.g., Staiger and Stock, 1997; Dufour, 1997), and is implicitly recognized by applied researchers who acknowledge the "weak instruments" problem.
The econometric literature (e.g. Dufour, 1997) has recognized the problems with “t-ratio-based” inference for instrumental variables for a long time now, so there are a number of different ways of understanding the problem.
One way to see it is as follows. With some algebra, it can be shown that the square of the (just-identified) 2SLS t-ratio (for testing the null hypothesis that $\beta = \beta_{0}$) is numerically equal to
\[
\hat{t}^{2} \equiv \frac{\left(\hat{\beta}-\beta_{0}\right)^{2}}{\hat{V}_{N}\left(\hat{\beta}\right)} =
\underbrace{\hat{t}_{AR}^{2}}_{\substack{\text{distributed approximately }\chi^{2}(1)\\ \text{in large samples, regardless}\\ \text{of instrument strength}}}
\times
\underbrace{\frac{1}{1-\hat{\rho}\frac{\hat{t}_{AR}}{\hat{f}}+\frac{\hat{t}_{AR}^{2}}{\hat{f}^{2}}}}_{\text{``adjustment factor''}}
\]
where $\hat{f}\equiv\frac{\hat{\pi}}{\sqrt{\hat{V}_{N}\left(\hat{\pi}\right)}}$, $\hat{\pi}$ is the first-stage coefficient, $\hat{\rho}$ is an empirical correlation between the first-stage residual and an estimate of $u$ that imposes the null, and $\hat{t}_{AR}=\frac{\hat{\pi}\left(\hat{\beta}-\beta_{0}\right)}{\sqrt{\hat{V}_{N}\left(\hat{\pi}\left(\hat{\beta}-\beta_{0}\right)\right)}}$. No approximations have been used here: this is simply a rewriting of the formula for the square of the 2SLS t-ratio. Importantly, under the null hypothesis, the distribution of $\hat{t}_{AR}$ is well-approximated by a standard normal for the same reason that least squares regression coefficients are well-approximated by normal distributions. Therefore, the distribution of $\hat{t}_{AR}^{2}$ (which also happens to be the statistic proposed by Anderson and Rubin, 1949) will be well-approximated by a chi-squared with one degree of freedom. If the "adjustment factor" above were approximately equal to 1, then the usual approach would work fine, since under the null hypothesis, a chi-squared with 1 degree of freedom is less than $1.96^{2}$ with probability $0.95$.
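For intuition (this limit is not spelled out in the text above, but it follows immediately from the identity): holding $\hat{t}_{AR}$ and $\hat{\rho}$ fixed, a very strong first stage makes $|\hat{f}|$ large, so the adjustment factor is close to 1 and the usual approximation is restored:
\[
\frac{1}{1-\hat{\rho}\frac{\hat{t}_{AR}}{\hat{f}}+\frac{\hat{t}_{AR}^{2}}{\hat{f}^{2}}}\longrightarrow 1
\quad\text{as }|\hat{f}|\to\infty,
\qquad\text{so that}\qquad
\hat{t}^{2}\approx\hat{t}_{AR}^{2},
\]
which is approximately $\chi^{2}(1)$ under the null. The problem arises precisely when $\hat{f}$ is not large relative to $\hat{t}_{AR}$.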
But even with very large samples, in repeated samples the "adjustment factor" above will not be equal to 1. Instead, it will have a non-degenerate distribution: even though $\hat{\rho}$ will be approximately equal to a constant $\rho$ (the correlation between $Zu$ and $Zv$, where $v$ is the first-stage error), $\hat{f}$ will be approximately normal with unit variance and a non-zero mean. Bottom line: $\hat{t}^{2}$ is clearly not distributed like a chi-squared with 1 degree of freedom, and therefore the critical values $\pm1.96$ would not be expected to work.
If you'd like another way to convince yourself there is a problem here, the following link allows you to run a Monte Carlo simulation for yourself: STATA demo program. The program produces 10,000 Monte Carlo draws of a sample of size 1,000, using the same Monte Carlo design used in Angrist and Pischke (2009), and produces the histogram of the 10,000 t-ratios. In the figure below, the Monte Carlo was based on $E[F]=10$ and $\rho=0.8$. The histogram clearly does not follow the standard normal density, also shown in the figure. Instead, it closely matches the theoretical density predicted by the formulas in Staiger and Stock (1997) and Stock and Yogo (2005).
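If you prefer to write the simulation yourself, here is a minimal STATA sketch in the same spirit (not the authors' exact demo program): the true $\beta$ is 0, $\rho=0.8$, and the first-stage coefficient is chosen so that $E[F]$ is roughly 10 when $N=1{,}000$. It records the 2SLS t-ratio across 10,000 draws and reports how often the nominal 5 percent test rejects.

* Monte Carlo sketch: distribution of the 2SLS t-ratio under the null
clear all
set seed 12345
program define onedraw, rclass
    drop _all
    set obs 1000
    gen double z = rnormal()
    matrix C = (1, .8 \ .8, 1)
    drawnorm u v, corr(C) double       // corr(u,v) = 0.8
    gen double x = sqrt(9/1000)*z + v  // first stage: E[F] roughly 10
    gen double y = 0*x + u             // true beta = 0
    ivreg2 y (x = z), robust
    return scalar t = _b[x]/_se[x]
end
simulate t=r(t), reps(10000) nodots: onedraw
count if abs(t) > 1.96
display "Rejection rate of the nominal 5 percent test: " r(N)/10000
histogram t, normal                    // compare to the standard normal density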
Potentially, the distortion can be really bad. The absolute worst-case scenario is when the first stage is very weak ($\pi\approx0$) and the degree of endogeneity is extreme ($\rho\approx\pm1$); in this scenario, the $\hat{\beta}\pm1.96 \times (std.err.)$ interval will contain the true parameter $\beta$ nearly zero percent of the time. Put equivalently, a 5 percent test of significance for the null hypothesis $\beta=\beta_{0}$ is, by definition, supposed to reject no more than 5 percent of the time in repeated samples when the null is true. But under this worst-case scenario, it will reject nearly 100 percent of the time.
At the opposite extreme, if $\rho=0$, or if the expected value of $\hat{f}$ is extremely large, then "$\hat{\beta}\pm1.96 \times (std.err.)$" works fine: it will contain $\beta$ at least 95 percent of the time. There are, of course, intermediate scenarios: for example, when $\rho=0.8$ (the Angrist and Pischke (2009) example above), coverage rates can be as low as 87 percent (see Figure 2(a) of the AER paper).
Unless you possess more information about the underlying parameters beyond what is assumed in (1), you won't know which scenario you are in. And that is the crux of why the "usual" approach doesn't work.
The magnitude of $\left|\rho\right|$, as well as the strength of the first stage, captured by $E\left[F\right]$ (where $F$ is a noncentral chi-squared random variable, the approximate distribution of $\hat{f}^{2}$), together dictate how bad the inferential distortion will be. But these two parameters $-$ $\rho$ and $E\left[F\right]$ $-$ are just as unknown as the parameter $\beta$. Therefore, without information beyond the model invoked in (1), standard frequentist statistical inference dictates that in order to obtain 95 percent confidence, you have to ensure at least 95 percent coverage for all scenarios $-$ including the worst-case scenario.
There are certainly assumptions that one could make, in addition to the model in (1), such that the usual $\hat{\beta}\pm1.96\sqrt{\hat{V}_{N}(\hat{\beta})}$ can properly be interpreted as a 95 percent confidence interval. For example, in an earlier version of the AER paper (Lee et al., 2020), we graphically depict the entire set of values of $E[F]$ and $|\rho|$ for which the usual confidence intervals deliver valid inference.
As the figure demonstrates, one could add to the assumptions in (1) the assertion that $E\left[F\right]>142.6$ and be assured that the usual 95 percent confidence intervals are valid. This practice, however, would require one to ignore the available evidence on whether $E\left[F\right]\ge142.6$, since the first-stage $F$ statistic $-$ while not equal to $E\left[F\right]$ $-$ is observed. As we show in Figure 1 of Lee et al. (2022), in specifications drawn from a recent sample of 61 AER articles that employ just-identified IV, the majority of first-stage $F$ statistics are below 142.6.
Alternatively, as seen from Figure 2(a) of the AER paper, one could add to the assumptions in (1) the assertion that $|\rho|\le.565$, which can be thought of as an upper bound of $.565^2\approx.32$ for the R-squared of an imagined/infeasible regression of $u$ on $v$. The .565 number is also reported in Angrist and Kolesár (2021). This approach differs from the assertion about $E\left[F\right]$ in that asserting $\left|\rho\right|\le.565$ is at least somewhat in tension with the very motivation of IV models: there is some endogeneity $\rho$ (i.e., omitted variable bias), and we don't know its sign or magnitude. (After all, if we were willing to assume $\rho=0$, we could use OLS.)
The main contribution of Lee et al. (2022) is to address the problem that the usual standard errors that practitioners compute are understated. The paper provides a simple solution:
\[
(adjusted.std.err.) = \text{``0.05 }tF\text{ standard error''} =
\underbrace{\left(\frac{\sqrt{c_{0.05}\left(\hat{F}\right)}}{1.96}\right)}_{\substack{tF\text{ adjustment}\\ \text{factor}}}
\times
\underbrace{\sqrt{\hat{V}_{N}\left(\hat{\beta}\right)}}_{\substack{\text{usual robust}\\ \text{standard error}}}
\]
So blow up your usual standard errors by a factor that depends on your first-stage $F$ statistic, using Table 3(a), and then your $\hat{\beta}\pm1.96 \times (adjusted.std.err.)$ interval will be a valid 95 percent confidence interval. (Or, if you want to test a specific hypothesis $\beta=\beta_{0}$, you can compute the square of the $t$-ratio using the usual standard errors, and use the critical value $c_{0.05}\left(\hat{F}\right)$ instead of $1.96^{2}$.)
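As a concrete sketch of the arithmetic (hypothetical variable names; the critical value from Table 3(a) is left as a placeholder, since the .ado file linked below does that lookup for you automatically):

* tF adjustment by hand (sketch)
ivreg2 y (x = z), robust
scalar Fhat = e(widstat)           // first-stage F (Kleibergen-Paap rk Wald F)
scalar c05  = .                    // <-- fill in c_0.05(Fhat) from Table 3(a)
scalar adj  = sqrt(c05)/1.96       // tF adjustment factor
scalar setF = adj*_se[x]           // 0.05 tF standard error
display "tF 95% CI: [" _b[x] - 1.96*setF ", " _b[x] + 1.96*setF "]"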
Here is a STATA .ado file that will apply the adjustment reflected in Table 3(a) for you.
The paper proves that the adjustment delivers the intended inferences. For example, Figure 2(b) of Lee et al. (2022) plots the rejection probabilities under the null, and they are now all below 0.05 for any values of the unknown parameters $E\left[F\right]$ and $\rho$. If you need more tangible evidence to feel comfortable with the adjustment, you can run Monte Carlo simulations to demonstrate its validity (for example, see the STATA demo program). Finally, note that the standard error inflation factors will change for other significance levels. We include the factors for the 1 percent (99 percent) level of significance (confidence) in Table 3(b) of the AER paper.
Lee et al. (2022) address this question by examining specifications taken from 61 recent papers published in the American Economic Review and applying the adjustment when we can. We find that for one-quarter of those specifications, corrected standard errors are at least 49 and 136 percent larger than conventional 2SLS standard errors at the 5 percent and 1 percent significance levels, respectively.
In addition to making explicit the connection between bias in the 2SLS estimator and the first-stage $F$ statistic, the seminal work of Staiger and Stock (1997) and Stock and Yogo (2005) explicitly quantified the connection between the degree of over-rejection of $t$-ratio tests and parameters like $E[F]$.
Unfortunately, the implications of the theory developed by Staiger and Stock (1997) and Stock and Yogo (2005) have been misunderstood and misapplied by practitioners. That is, practitioners have commonly been reporting the $\hat{\beta}\pm1.96 \times (std.err.)$ interval as a 95 percent confidence interval, loosely justifying its validity because the observed first-stage $F$ is "large enough" (greater than 10 or 16.38 from the Stock and Yogo (2005) tables), while ignoring the fact (as made explicit in Stock and Yogo, 2005) that even with the additional assumption that $E[F]>6.88$, the confidence level of $\hat{\beta}\pm1.96 \times (std.err.)$ cannot exceed 90 percent.
Furthermore, a careful reading and application of the Bonferroni approach of Staiger and Stock (1997) (Section 4.B) leads to the conclusion that the $\hat{\beta}\pm1.96 \times (std.err.)$ interval is an 85 percent confidence interval, provided that you additionally let the confidence set be $(-\infty,\infty)$ whenever $F<16.38$. A more refined (non-Bonferroni) calculation shows that this approach actually yields a 90.7 percent confidence interval. Note that the confidence level is even lower than 90.7 percent if a finite interval is used in the event that $F<16.38$, and also lower if one uses 16.38 as a "screening threshold" (whereby inference is conditional on observing $F>16.38$; i.e., one continues with the analysis only if $F>16.38$). (See the discussion in Andrews, Stock, and Sun, 2019.)
Since using $\hat{\beta}\pm1.96 \times (std.err.)$ if $F>16.38$ (and $(-\infty,\infty)$ otherwise) delivers a 90.7 percent confidence interval, a natural question is whether there exists a single threshold for $F$ that would deliver a 95 percent confidence interval. Lee et al., 2020 (now in the appendix of Lee et al., 2022) answers that question: 95 percent confidence is given by the interval $\hat{\beta}\pm1.96 \times (std.err.)$ if $F>104.67$ (and $\left(-\infty,\infty\right)$ otherwise).
As shown in this figure, in practice, many first stage F-statistics are below 104.67. So adhering to a simple single threshold for the F-statistic would require many studies to accept infinite confidence intervals.
A key benefit of the $tF$ critical value function is that it allows researchers to obtain valid finite confidence intervals even when the first-stage $F$ statistic is far below 104.67 $-$ potentially as low as 3.84. In this way, the $tF$ critical value function is a natural descendant of the approach of addressing size distortions via the first-stage $F$, as pioneered by Staiger and Stock (1997) and Stock and Yogo (2005).
One valid alternative is to use the confidence sets of Anderson and Rubin (1949).
However, the $AR$ procedure has two important disadvantages relative to an enhanced version of $tF$: the $VtF$ procedure, described in detail in Lee et al. (2023) and summarized at this web page.
First, $AR$ cannot be used in conjunction with the familiar "1.96" confidence interval construction while still remaining valid. That is, it would be tempting to use the usual "1.96" intervals if the first-stage $F$ were large but resort to $AR$ if the first-stage $F$ were small; as pointed out in Lee et al. (2020), however, this "two-step" procedure can still over-reject, and thus is invalid. By contrast, we show in Lee et al. (2023) that the following procedure is valid: if $\hat{F}>10+100\hat{r}$, use the usual "1.96" intervals; otherwise use the $VtF$ intervals (where $\hat{r}$ is the empirical correlation between the residuals from the first stage and the main equation using $\hat{\beta}$). A sketch of this screening rule appears after the next point.
Second, practitioners will discover that $AR$ intervals are almost always longer than $VtF$ intervals, and often by a substantial margin. We illustrate this phenomenon with a sample of published papers in top general interest journals. In 100 percent of the studies, $VtF$ intervals are shorter than $AR$ intervals. Using this sample of studies, we also document that $VtF$ was more successful than $AR$, $tF$ and even the usual 1.96 procedure in producing statistically significant results. The paper provides a theoretical explanation as to what is driving these results.
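Here is a minimal STATA sketch of the screening rule from the first point above (hypothetical variable names; the sketch follows the rule exactly as quoted, and Lee et al. (2023) give the precise definitions used by the $VtF$ procedure):

* screening rule sketch: usual 1.96 intervals vs. VtF intervals
ivreg2 y (x = z), robust
scalar Fhat = e(widstat)                 // first-stage F
predict double uhat, residuals           // main-equation residual using betahat
quietly regress x z
predict double vhat, residuals           // first-stage residual
quietly correlate uhat vhat
scalar rhat = r(rho)                     // empirical residual correlation
display "Fhat = " Fhat "  threshold = " 10 + 100*rhat
display cond(Fhat > 10 + 100*rhat, "usual 1.96 intervals are valid", "use VtF intervals")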
References
Anderson, T. W., and Herman Rubin. 1949. “Estimation of the Parameters of a Single Equation in a Complete System of Stochastic Equations.” Annals of Mathematical Statistics, 20: 46–63.
Andrews, Isaiah, James H. Stock, and Liyang Sun. 2019. “Weak Instruments in Instrumental Variables Regression: Theory and Practice.” Annual Review of Economics, 11: 727–753.
Angrist, Joshua, and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton, NJ: Princeton University Press.
Angrist, Joshua, and Michal Kolesár. 2021. “One Instrument to Rule Them All: The Bias and Coverage of Just-ID IV.” NBER Working Paper 29417.
Dufour, Jean-Marie. 1997. “Some Impossibility Theorems in Econometrics with Applications to Structural and Dynamic Models.” Econometrica, 65: 1365–1388.
Lee, David S., Justin McCrary, Marcelo J. Moreira, and Jack Porter. 2020. “Valid t-ratio Inference for IV.” arXiv working paper.
Lee, David S., Justin McCrary, Marcelo J. Moreira, and Jack Porter. 2022. "Valid t-Ratio Inference for IV." American Economic Review, 112 (10): 3260–3290.
Lee, David S., Justin McCrary, Marcelo J. Moreira, Jack Porter, and Luther Yap. 2023. "What to Do When You Can't Use '1.96' Confidence Intervals for IV." NBER Working Paper 31893.
Staiger, Douglas, and James H. Stock. 1997. “Instrumental Variables Regression with Weak Instruments.” Econometrica, 65: 557–586.
Stock, James H., and Motohiro Yogo. 2005. “Testing for Weak Instruments in Linear IV Regression.” In Identification and Inference in Econometric Models: Essays in Honor of Thomas J. Rothenberg, ed. Donald W.K. Andrews and James H. Stock, Chapter 5, 80–108. Cambridge University Press.