Corrigendum to “Lowering the Threshold of Statistical Significance to p < 0.005 to Encourage Enriched Theories of Politics” and “Questions and Answers: Reproducibility and a Stricter Threshold for Statistical Significance”

By Justin Esarey

Although The Political Methodologist is a newsletter and blog, not a peer-reviewed publication, I still think it’s important for us to recognize and correct substantively important errors. In this case, I’m sad to report such errors in two things I wrote for TPM. The error is the same in both cases.

In “Lowering the Threshold of Statistical Significance to p < 0.005 to Encourage Enriched Theories of Politics,” I claimed that:

When K-many statistically independent tests are performed on pre-specified hypotheses that must be jointly confirmed in order to support a theory, the chance of simultaneously rejecting them all by chance is α^K where p is the critical condition for statistical significance in an individual test. As K increases, the α value for each individual study can fall and the overall power of the study often (though not always) increases.

This argument is offered to support the conclusion that “moving the threshold for statistical significance from α = 0.05 to α = 0.005 would benefit political science if we adapt to this reform by developing richer, more robust theories that admit multiple predictions.”

Similarly, in “Questions and Answers: Reproducibility and a Stricter Threshold for Statistical Significance,” I claimed that:

Another measure to lower Type I error (and the one that I discuss in my article in The Political Methodologist) is to pre-specify a larger number of different hypotheses from a theory and to jointly test these hypotheses. Because the probability of simultaneously confirming multiple disparate predictions by chance is (almost always) lower than the probability of singly confirming one of them, the size of each individual test can be larger than the overall size of the test, allowing for the possibility that the overall test is substantially more powerful at a given size.

This reasoning, which is similar to reasoning offered in Esarey and Sumner (2018b), is incorrect; it would only be true if all predicted parameters were equal to zero. When the alternative hypothesis comprises multiple directional predictions for parameters, for example β_i > 0 for i ∈ {1, …, K}, separate t-tests of each individual null (β_i ≤ 0), each with size α, will jointly reject all the null hypotheses at most α proportion of the time. The key insight is that the joint null hypothesis space includes the possibility that some β_i parameters match the predictions while others do not; if (for example) β₁ = 0 and all other β_i (i ≠ 1) are very large, the probability of falsely rejecting the joint null hypothesis is the α for the test of β₁. As we note in Esarey and Sumner (2018a), this is discussed and proved in Silvapulle and Sen (2005, Section 5.3), especially in Proposition 5.3.1, and in Casella and Berger (2002, Sections 8.2.3 and 8.3.3). Silvapulle and Sen cite Lehmann (1952); Berger (1982); Cohen, Gatsonis and Marden (1983); and Berger (1997) (among others) as sources for this argument. Associated calculations (such as that in Figure 4 of “Lowering the Threshold of Statistical Significance to p

The upshot is that my argument for making additional theoretical predictions in order to facilitate lowering the threshold for statistical significance to α = 0.005 rests on faulty reasoning and is incorrect.
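
To make the worst-case argument concrete, here is a minimal simulation sketch. It is an illustration under simplifying assumptions that are mine, not part of the original articles: each estimate is treated as an independent z-statistic with known unit standard error (so one-sided z-tests stand in for t-tests), and the function name `joint_rejection_rate` and the example values of β are hypothetical. It compares the share of simulations in which all K one-sided tests reject when every β_i = 0 against the share when β₁ = 0 and the other parameters are very large.

```python
import numpy as np
from statistics import NormalDist


def joint_rejection_rate(betas, alpha=0.05, n_sims=200_000, seed=0):
    """Share of simulated samples in which every one-sided test rejects beta_i <= 0.

    Assumption (for illustration only): each estimate is an independent
    z-statistic distributed N(beta_i, 1), i.e. the standard error is known
    and equal to one.
    """
    rng = np.random.default_rng(seed)
    # One-sided critical value for a size-alpha test of beta_i <= 0.
    crit = NormalDist().inv_cdf(1 - alpha)
    # Draw n_sims independent vectors of z-statistics centered on betas.
    z = rng.standard_normal((n_sims, len(betas))) + np.asarray(betas, dtype=float)
    # The theory is "confirmed" only if all K individual nulls are rejected.
    return float(np.mean(np.all(z > crit, axis=1)))


# Every predicted parameter is zero: joint rejection is rare (on the order of alpha**K).
rate_all_zero = joint_rejection_rate([0.0, 0.0, 0.0])

# beta_1 = 0 while the other parameters are very large: the joint null is
# still true, but the joint rejection rate is close to alpha itself.
rate_one_zero = joint_rejection_rate([0.0, 10.0, 10.0])

print(f"all betas zero:        {rate_all_zero:.5f}")
print(f"beta_1 = 0, rest large: {rate_one_zero:.5f}")
```

Under these assumptions, the first rate should be far below α, while the second should be approximately α = 0.05. Because the second configuration is also part of the joint null hypothesis, the size of the joint test is governed by that case rather than by α^K.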

I plan to post this correction as an addendum to both of the print editions featuring these articles.

References

Berger, Roger L. 1982. “Multiparameter Hypothesis Testing and Acceptance Sampling.” Technometrics 24(4):295–300.

Berger, Roger L. 1997. “Likelihood Ratio Tests and Intersection-Union Tests.” In Advances in Statistical Decision Theory and Applications, ed. Subramanian Panchapakesan and Narayanaswamy Balakrishnan. Boston: Birkhäuser, pp. 225–237.

Casella, George and Roger L. Berger. 2002. Statistical Inference, Second Edition. Belmont, CA: Brooks/Cole.

Cohen, Arthur, Constantine Gatsonis and John I. Marden. 1983. “Hypothesis testing for marginal probabilities in a 2 x 2 x 2 contingency table with conditional independence.” Journal of the American Statistical Association 78(384):920–929.

Esarey, Justin and Jane Lawrence Sumner. 2018a. “Corrigendum to Marginal Effects in Interaction Models: Determining and Controlling the False Positive Rate.” Online. URL: http://justinesarey.com/interaction-overconfidence-corrigendum.pdf.

Esarey, Justin and Jane Lawrence Sumner. 2018b. “Marginal Effects in Interaction Models: Determining and Controlling the False Positive Rate.” Comparative Political Studies 51(9):1144–1176. DOI: https://doi.org/10.1177/0010414017730080.

Lehmann, Erich L. 1952. “Testing multiparameter hypotheses.” The Annals of Mathematical Statistics pp. 541–552.

Silvapulle, Mervyn J. and Pranab K. Sen. 2005. Constrained Statistical Inference: Inequality, Order, and Shape Restrictions. Hoboken, NJ: Wiley.
