Lee, David S., Justin McCrary, Marcelo J. Moreira, Jack Porter, and Luther Yap. 2023. “What to do when you can't use '1.96' Confidence Intervals for IV.” NBER Working Paper #31893.
STATA code to add VtF critical values and confidence intervals to ivreg2 output
To install from STATA:
. ssc install ranktest, replace
. ssc install ivreg2, replace
. cap mkdir "`c(sysdir_plus)'v"
. net set other "`c(sysdir_plus)'v"
. net install vtf, force from(http://www.princeton.edu/~davidlee/wp/) all
. help vtf
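For orientation, the kind of just-identified ivreg2 specification that the package is designed to augment looks like the following, where y, x, and z are placeholder names for the outcome, the endogenous regressor, and the instrument (see the help file for the exact syntax and options that produce the VtF critical values and confidence intervals in the output):
. ivreg2 y (x = z), robust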
Frequently Asked Questions about “What to do when you can't use '1.96' Confidence Intervals for IV”
What's the difference between the $tF$ critical values from Lee et al. (2022) and the new "$VtF$" critical values in this paper?
The $tF$ critical values from Lee et al. (2022) were motivated by the following question: "I've got the 2SLS $t$-ratio and the first-stage $F$ statistic. How can I use only the $F$ statistic, in the spirit of Stock and Yogo (2005), to obtain, for example, a 5 percent test (or 95 percent confidence interval)?" It turns out that if you wanted to rely on a single threshold for $F$ above which the "1.96" procedure would work, it would have to be quite large -- greater than 104.67. Lee et al. (2022) therefore refine the widely adopted approach of Stock and Yogo by providing $t$-ratio critical values that smoothly depend on the first-stage $F$-statistic. This makes it possible to do inference that is valid even if the $F$-statistic is low (potentially as low as $1.96^2=3.84$).
$tF$ is especially well-suited for re-assessing the inferences made in past studies when one does not have access to the original micro-data and the study reports nothing beyond the 2SLS estimate, its standard error, and the first-stage $F$-statistic. We are unaware of a more powerful way of doing inference if one only has access to the first-stage $F$. See this page for more background on why the usual "1.96" interval is incorrect for just-identified IV.
As we make clear in Lee et al. (2023), there is valuable information in the data beyond the $F$ statistic -- specifically, $\hat{r}$, the empirical correlation between the residuals in the main equation and the first stage. Going forward, there is no reason to discard this valuable information and limit ourselves to using only the $F$ statistic. For hypothesis testing, the $VtF$ critical values additionally depend on the empirical correlation between the main-equation and first-stage residuals (imposing the null). For confidence interval construction, instead of the scaling factor 1.96, the $VtF$ scaling factors additionally depend on the empirical correlation between the first-stage residuals and the main-equation residuals formed using the 2SLS estimate.
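Concretely, suppressing covariates and using generic notation for a just-identified model with outcome $y_i$, endogenous regressor $x_i$, and instrument $z_i$: $\hat{r}=\widehat{\mathrm{corr}}(\hat{u}_i,\hat{v}_i)$, where $\hat{v}_i=x_i-z_i\hat{\pi}$ are the first-stage residuals and $\hat{u}_i=y_i-x_i\beta$ are the main-equation residuals, evaluated at the hypothesized value of $\beta$ for testing or at the 2SLS estimate $\hat{\beta}$ for confidence interval construction.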
The use of this extra information makes $VtF$ a much more powerful procedure -- i.e., it yields shorter confidence intervals and a higher likelihood of obtaining statistical significance when the null is not true.
Is $VtF$ an improvement over the $tF$ procedure?
Yes. As we show in Lee et al. (2023), $VtF$ confidence intervals are substantially shorter than $tF$ confidence intervals. We illustrate this using a sample of published studies from prominent general-interest journals: in our sample, about 80 percent of the time the $tF$ intervals are at least 30 percent longer. Furthermore, in this sample of studies, $VtF$ inference leads to substantially more frequent statistically significant results than $tF$ inference.
What about the Anderson-Rubin ($AR$) procedure? Isn't it also valid under weak instruments?
It is true that the confidence sets of Anderson and Rubin (1949) are valid no matter how strong or weak the instrument is (see the discussion in Andrews, Stock, and Sun (2019)) -- just like $VtF$ intervals. Note that $AR$ also requires the same "extra information" (beyond the $F$-statistic) that $VtF$ does.
However, Lee et al. (2023) shows that the $AR$ procedure has two important disadvantages relative to $VtF$.
First, $AR$ cannot be used in conjunction with the familiar "1.96" confidence interval construction while still remaining valid. That is, it would be tempting to use the usual "1.96" intervals when the first-stage $F$ is large and to resort to $AR$ when it is small; but as pointed out in Lee et al. (2020), this "two-step" procedure can still over-reject, and thus is invalid. By contrast, we show that the following procedure is valid: if $\hat{F}>10+100\hat{r}$, use the usual "1.96" intervals; otherwise use the $VtF$ intervals (where $\hat{r}$ is the empirical correlation between the residuals from the first stage and the main equation using $\hat{\beta}$).
Second, practitioners will find that $AR$ intervals are almost always less precise than $VtF$ intervals, often by a substantial margin. We illustrate this phenomenon with a sample of published papers in top general-interest journals: in 100 percent of the studies, $VtF$ intervals are shorter than $AR$ intervals. Using this sample of studies, we also document that $VtF$ was more successful than $AR$, $tF$, and even the usual "1.96" procedure in producing statistically significant results. The paper provides a theoretical explanation of what drives these results.
Are $VtF$ intervals really shorter than $AR$ intervals? Hasn't $AR$ been shown to be optimal and "uniformly most powerful"?
Yes, $VtF$ intervals will generally be found to be shorter, and in practice they can often be expected to be entirely contained within the $AR$ intervals. Lee et al. (2023) documents $VtF$'s clearly superior confidence interval performance. Furthermore, it shows that $VtF$ can also be more powerful over a large range of alternatives.
And yes, $AR$ has been called "optimal" and "uniformly most powerful," which at first glance may seem to contradict these findings. If the optimality result applied to all tests, then finding $VtF$ to be more powerful for some alternative values of $\beta$ would indeed be a contradiction.
However, the literature actually describes $AR$ as optimal in a more limited and qualified way. For example, one sense in which $AR$ is described as optimal is that it is uniformly most powerful among the tests whose power curves always lie above the intended significance level (e.g., 0.05).
This result thus has nothing to say about how $AR$ compares to a test whose power curve dips below, e.g., 0.05, even for only a small region of alternative values of $\beta$. $VtF$ is precisely an example of such a test. That is why the literature's optimality result for $AR$ does not contradict the findings in Lee et al. (2023) on the power of $VtF$.
And as we show in this paper, even though $VtF$ power curves can dip below 0.05 for a small range of alternatives, this does not prevent the confidence intervals from generally being shorter. See Section IV of the paper for additional discussion of $AR$ and $VtF$, and in particular, a new perspective on the connection between power and confidence intervals.
How do I use $VtF$ in practice?
Testing hypotheses is straightforward: just compute your usual 2SLS $t$-ratio for whatever null hypothesis you are interested in, and compare it to the correct critical value, as provided in Lee, McCrary, Moreira, Porter, and Yap (2023).
The STATA package referenced at the top of the page will give you the critical value for a specified hypothesis for your dataset. It will also invert the test procedure and provide confidence intervals.
Note that there is a rule of thumb that maintains validity while simplifying matters. Specifically, the following rule is perfectly valid: "if $\hat{F}>10+100\times\hat{r}$, use the usual '1.96' confidence intervals; otherwise use the $VtF$ intervals". This is a bit conservative compared to simply using the $VtF$ intervals all the time, but it still works.
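For concreteness, here is a minimal do-file sketch of the rule of thumb using built-in STATA commands (y, x, and z are placeholder variable names, covariates are suppressed, and the exact calculations -- including the $VtF$ critical values and intervals themselves -- are what the vtf package provides):

* Sketch only: computes the 2SLS t-ratio, the first-stage F, and r-hat,
* then applies the rule of thumb above.
ivregress 2sls y (x = z), vce(robust)
display "2SLS t-ratio: " _b[x]/_se[x]
predict double u_hat, residuals            // main-equation residuals at the 2SLS estimate
regress x z, vce(robust)                   // first stage
test z                                     // robust first-stage F
scalar F_hat = r(F)
predict double v_hat, residuals            // first-stage residuals
correlate u_hat v_hat
scalar r_hat = r(rho)                      // empirical residual correlation
display cond(F_hat > 10 + 100*r_hat, ///
    "rule of thumb satisfied: the usual 1.96 interval is valid", ///
    "use the VtF interval (see help vtf)")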
We have also provided a simple Monte Carlo simulation program (VtFsim.do) in STATA that you can run and experiment with, to assure yourself that these critical values and confidence-interval inflation factors are valid.
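Separately, the following self-contained simulation sketch (not VtFsim.do; the sample size, first-stage coefficient, and error correlation below are purely illustrative) shows the underlying problem in miniature: with a weak first stage and strongly correlated errors, the usual "1.96" test typically rejects a true null noticeably more often than 5 percent of the time.

clear all
set seed 12345
program define weakiv_once, rclass
    drop _all
    set obs 500
    generate z = rnormal()
    generate v = rnormal()
    generate u = 0.8*v + sqrt(1 - 0.8^2)*rnormal()    // corr(u,v) = 0.8
    generate x = 0.1*z + v                            // deliberately weak first stage
    generate y = 0*x + u                              // true beta = 0
    ivregress 2sls y (x = z)
    return scalar reject = abs(_b[x]/_se[x]) > 1.96
end
simulate reject = r(reject), reps(2000) nodots: weakiv_once
summarize reject    // the mean is the empirical rejection rate of the nominal 5 percent test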
Download Appendices
Online Appendix to “What to do when you can't use '1.96' Confidence Intervals for IV,” Lee, McCrary, Moreira, Porter, and Yap (2023).
References
Anderson, T. W., and Herman Rubin. 1949. “Estimation of the Parameters of a Single Equation in a Complete System of Stochastic Equations.” Annals of Mathematical Statistics, 20: 46–63.
Andrews, Isaiah, James H. Stock, and Liyang Sun. 2019. “Weak Instruments in Instrumental Variables Regression: Theory and Practice.” Annual Review of Economics, 11: 727–753.
Lee, David S., Justin McCrary, Marcelo J. Moreira, and Jack Porter. 2020. “Valid t-ratio Inference for IV.” arXiv working paper.
Lee, David S., Justin McCrary, Marcelo J. Moreira, and Jack Porter. 2022. “Valid t-ratio Inference for IV.” American Economic Review, 112.
Lee, David S., Justin McCrary, Marcelo J. Moreira, Jack Porter, and Luther Yap. 2023. “What to do when you can't use '1.96' Confidence Intervals for IV.” NBER Working Paper #31893.
Staiger, Douglas, and James H. Stock. 1997. “Instrumental Variables Regression with Weak Instruments.” Econometrica, 65: 557–586.
Stock, James H., and Motohiro Yogo. 2005. “Testing for Weak Instruments in Linear IV Regression.” In Identification and Inference in Econometric Models: Essays in Honor of Thomas J. Rothenberg, ed. Donald W.K. Andrews and James H. Stock, Chapter 5, 80–108. Cambridge University Press.