Corrigendum to “Lowering the Threshold of Statistical Significance to p < 0.005 to Encourage Enriched Theories of Politics” and “Questions and Answers: Reproducibility and a Stricter Threshold for Statistical Significance”

December 02, 2018

By Justin Esarey

Although The Political Methodologist is a newsletter and blog, not a peer-reviewed publication, I still think it’s important for us to recognize and correct substantively important errors.  In this case, I’m sad to report such errors in two things I wrote for TPM. The error is the same in both cases.

In “Lowering the Threshold of Statistical Significance to p < 0.005 to Encourage Enriched Theories of Politics,” I claimed that:

When K-many statistically independent tests are performed on pre-specified hypotheses that must be jointly confirmed in order to support a theory, the chance of simultaneously rejecting them all by chance is α^K, where α is the critical condition for statistical significance in an individual test. As K increases, the α value for each individual study can fall and the overall power of the study often (though not always) increases.

This argument is offered to support the conclusion that “moving the threshold for statistical significance from α = 0.05 to α = 0.005 would benefit political science if we adapt to this reform by developing richer, more robust theories that admit multiple predictions.”
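To make the retracted claim concrete, the following small sketch (my own illustration, not from the original article) works through the arithmetic it implies; the specific K values and the use of Python are assumptions for illustration only:

```python
# Arithmetic behind the retracted claim (illustration only, not from the
# original article): if the joint false-positive rate of K jointly required,
# independent tests really were alpha**K, then holding the overall rate at
# 0.005 would let each individual test use a much larger alpha.
overall_alpha = 0.005
for K in (1, 2, 3):
    per_test_alpha = overall_alpha ** (1 / K)
    print(f"K = {K}: per-test alpha = {per_test_alpha:.3f}")
# Prints roughly 0.005, 0.071, and 0.171 -- the correction below explains why
# this reasoning holds only when every predicted parameter equals zero.
```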

Similarly, in “Questions and Answers: Reproducibility and a Stricter Threshold for Statistical Significance,” I claimed that:

Another measure to lower Type I error (and the one that I discuss in my article in The Political Methodologist) is to pre-specify a larger number of different hypotheses from a theory and to jointly test these hypotheses. Because the probability of simultaneously confirming multiple disparate predictions by chance is (almost always) lower than the probability of singly confirming one of them, the size of each individual test can be larger than the overall size of the test, allowing for the possibility that the overall test is substantially more powerful at a given size.

This reasoning, which is similar to reasoning offered in Esarey and Sumner (2018b), is incorrect; it would only be true when all predicted parameters were equal to zero. When the alternative hypothesis consists of multiple directional predictions for parameters, for example βi > 0 for i ∈ 1…K, rejecting each individual null hypothesis (βi ≤ 0) separately using t-tests with size α will jointly reject all the null hypotheses at most α proportion of the time. The key insight is that the joint null hypothesis space includes the possibility that some βi parameters match the predictions while others do not; if (for example) β1 = 0 and all the other βi with i ≠ 1 are very large, the probability of falsely rejecting the joint null hypothesis is the α for the test of β1. As we note in Esarey and Sumner (2018a), this is discussed and proved in Silvapulle and Sen (2005, Section 5.3), especially Proposition 5.3.1, and in Casella and Berger (2002, Sections 8.2.3 and 8.3.3). Silvapulle and Sen cite Lehmann (1952); Berger (1982); Cohen, Gatsonis and Marden (1983); and Berger (1997) (among others) as sources for this argument. Associated calculations (such as that in Figure 4 of “Lowering the Threshold of Statistical Significance to p < 0.005 to Encourage Enriched Theories of Politics”) are therefore also incorrect.
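As an informal illustration of why the joint rejection rate is α rather than α^K in this situation, here is a simulation sketch of my own (a hypothetical setup with independent estimates and known unit variances, not drawn from either article): it puts β1 on the boundary of its null while the remaining coefficients are large, and the joint false-rejection rate comes out near α = 0.05 rather than α³.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup (illustration only): K = 3 directional predictions
# beta_i > 0, each tested one-sided at size alpha with a z-test, assuming
# independent estimates beta_hat_i ~ N(beta_i, 1).
rng = np.random.default_rng(2018)
alpha = 0.05
n_sims = 200_000
true_betas = np.array([0.0, 5.0, 5.0])  # beta_1 on the null boundary; others large
crit = norm.ppf(1 - alpha)              # one-sided critical value, ~1.645

estimates = rng.normal(loc=true_betas, scale=1.0, size=(n_sims, 3))
reject_each = estimates > crit          # reject H0: beta_i <= 0 for each i
reject_joint = reject_each.all(axis=1)  # all three nulls rejected at once

print(f"joint false-rejection rate: {reject_joint.mean():.4f}")  # ~0.05 = alpha
print(f"alpha ** 3:                 {alpha ** 3:.6f}")           # 0.000125
```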

The upshot is that my argument for making additional theoretical predictions in order to facilitate lowering the threshold for statistical significance to α = 0.005 rests on faulty reasoning and is incorrect.

I plan to post this correction as an addendum to both of the print editions featuring these articles.

References

Berger, Roger L. 1982. “Multiparameter Hypothesis Testing and Acceptance Sampling.” Technometrics 24(4):295–300.

Berger, Roger L. 1997. “Likelihood Ratio Tests and Intersection-Union Tests.” In Advances in Statistical Decision Theory and Applications, ed. Subramanian Panchapakesan and Narayanaswamy Balakrishnan. Boston: Birkhäuser, pp. 225–237.

Casella, George and Roger L. Berger. 2002. Statistical Inference, Second Edition. Belmont, CA: Brooks/Cole.

Cohen, Arthur, Constantine Gatsonis and John I. Marden. 1983. “Hypothesis Testing for Marginal Probabilities in a 2 x 2 x 2 Contingency Table with Conditional Independence.” Journal of the American Statistical Association 78(384):920–929.

Esarey, Justin and Jane Lawrence Sumner. 2018a. “Corrigendum to Marginal Effects in Interaction Models: Determining and Controlling the False Positive Rate.” Online. URL: http://justinesarey.com/interaction-overconfidence-corrigendum.pdf.

Esarey, Justin and Jane Lawrence Sumner. 2018b. “Marginal Effects in Interaction Models: Determining and Controlling the False Positive Rate.” Comparative Political Studies 51(9):1144–1176. DOI: https://doi.org/10.1177/0010414017730080.

Lehmann, Erich L. 1952. “Testing Multiparameter Hypotheses.” The Annals of Mathematical Statistics pp. 541–552.

Silvapulle, Mervyn J. and Pranab K. Sen. 2005. Constrained Statistical Inference: Inequality, Order, and Shape Restrictions. Hoboken, NJ: Wiley.