Pitfalls when Estimating Treatment Effects Using Clustered Data

By James G. MacKinnon, Department of Economics, Queen’s University[1], and Matthew D. Webb, Department of Economics, Carleton University

Extended Abstract

There is a large and rapidly growing literature on inference with clustered data, that is, data where the disturbances (error terms) are correlated within clusters. This type of correlation is commonly observed whenever multiple observations are associated with the same political jurisdictions. Observations might also be clustered by time periods, industries, or institutions such as hospitals or schools.

When estimating regression models with clustered data, it is very common to use a “cluster-robust variance estimator” or CRVE. However, inference for estimates of treatment effects with clustered data requires great care when treatment is assigned at the group level. This is true for both pure treatment models and difference-in-differences regressions, where the data have both a time dimension and a cross-section dimension and it is common to cluster at the cross-section level.
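To make the idea of a CRVE concrete, here is a minimal sketch of one common form of the estimator for an OLS regression. The function name `ols_cluster_robust`, its arguments, and the particular small-sample correction are illustrative assumptions, not something taken from the article, which may use a different variant.

```python
import numpy as np

def ols_cluster_robust(X, y, cluster_ids):
    """OLS coefficients with cluster-robust standard errors (illustrative sketch).

    X           : (n, k) regressor matrix, including a constant if desired
    y           : (n,) outcome vector
    cluster_ids : (n,) cluster label for each observation
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    cluster_ids = np.asarray(cluster_ids).ravel()
    n, k = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta

    # "Meat" of the sandwich: sum over clusters of the outer product of the
    # cluster-level score vectors X_g' u_g.
    labels = np.unique(cluster_ids)
    G = len(labels)
    meat = np.zeros((k, k))
    for g in labels:
        mask = cluster_ids == g
        score = X[mask].T @ resid[mask]
        meat += np.outer(score, score)

    # Assumed small-sample correction (a common choice, not from the article).
    correction = (G / (G - 1)) * ((n - 1) / (n - k))
    V = correction * XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(V))
```

When every observation is its own cluster, this sketch reduces to a heteroskedasticity-robust estimator. The concern raised in this abstract is about what happens to standard errors like these when treatment is assigned at the cluster level.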

Even when the number of clusters is quite large, cluster-robust standard errors can be much too small if the number of treated (or control) clusters is small. Standard errors also tend to be too small when cluster sizes vary a lot, resulting in too many false positives. Bootstrap methods based on the wild bootstrap generally perform better than t-tests, but they can also yield very misleading inferences in some cases. In particular, what would otherwise be the best variant of the wild bootstrap can underreject extremely severely when the number of treated clusters is very small. Other bootstrap methods can overreject extremely severely in that case.
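As a rough illustration of what a wild bootstrap test looks like with clustered data, the sketch below implements one common variant: residuals are taken from a model that imposes the null hypothesis, and each cluster's residuals are multiplied by a single Rademacher (plus or minus one) draw. The function names, the choice of variant, the Rademacher weights, and the variance correction are all assumptions made for illustration; the article itself should be consulted for the procedures it actually studies.

```python
import numpy as np

def _cluster_t(X, y, codes, G, j):
    """t-statistic for coefficient j using a cluster-robust variance (sketch)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta
    # Sum the scores X_i * u_i within each cluster.
    scores = np.zeros((G, k))
    np.add.at(scores, codes, X * resid[:, None])
    meat = scores.T @ scores
    correction = (G / (G - 1)) * ((n - 1) / (n - k))  # assumed correction
    V = correction * XtX_inv @ meat @ XtX_inv
    return beta[j] / np.sqrt(V[j, j])

def wild_cluster_bootstrap_p(X, y, cluster_ids, j, n_boot=999, seed=0):
    """Bootstrap p-value for H0: beta_j = 0 (illustrative sketch, needs k >= 2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    labels, codes = np.unique(np.asarray(cluster_ids).ravel(), return_inverse=True)
    G = len(labels)

    t_obs = _cluster_t(X, y, codes, G, j)

    # Restricted fit: drop column j, which imposes beta_j = 0.
    X_r = np.delete(X, j, axis=1)
    beta_r = np.linalg.lstsq(X_r, y, rcond=None)[0]
    fitted_r = X_r @ beta_r
    resid_r = y - fitted_r

    rng = np.random.default_rng(seed)
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        # One Rademacher draw per cluster, shared by all of its observations.
        v = rng.choice([-1.0, 1.0], size=G)
        y_star = fitted_r + resid_r * v[codes]
        t_boot[b] = _cluster_t(X, y_star, codes, G, j)

    # Symmetric two-sided p-value.
    return np.mean(np.abs(t_boot) >= np.abs(t_obs))
```

With a hypothetical design matrix `X` whose column `j` is the treatment dummy, a call would look like `wild_cluster_bootstrap_p(X, y, cluster_ids, j)`. With very few treated clusters, a procedure of this kind is exactly where the abstract warns that inference can become misleading.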

In Section 2, we briefly review the key ideas of cluster-robust covariance matrices and standard errors. In Section 3, we then explain why inference based on these standard errors can fail when there are few treated clusters. In Section 4, we discuss bootstrap methods for cluster-robust inference. In Section 5, we report (graphically) the results of several simulation experiments which illustrate just how severely both conventional and bootstrap methods can overreject or underreject when there are few treated clusters. In Section 6, the implications of these results are illustrated using an empirical example from Burden, Canon, Mayer, and Moynihan (2017). The final section concludes and provides some recommendations for empirical work.

Full Article


Replication File

Replication files for the Monte Carlo simulations and the empirical example can be found at doi:10.7910/DVN/GBEKTO.

  1. We are grateful to Justin Esarey for several very helpful suggestions and to Joshua Roxborough for valuable research assistance. This research was supported, in part, by a grant from the Social Sciences and Humanities Research Council of Canada. Some of the computations were performed at the Centre for Advanced Computing at Queen’s University.
jgm-mdw-pitfalls.pdf (547 KB)