Credibility Toryism: Causal Inference, Research Design, and Evidence

September 30, 2013

By Justin Esarey

In a prior post on my personal blog, I argued that it is misleading to label matching procedures as causal inference procedures (in the Neyman-Rubin sense of the term). My basic argument was that the causal quality of these inferences depends on untested (and in some cases untestable) assumptions about the matching procedure itself. A regression model is also a “causal inference” model if various underlying assumptions are met, with one primary difference being that regression depends on linearity of the response surface while matching does not. Presumably, regression will be more efficient than matching if this assumption is correct, but less accurate if it is not.

So, if I don’t think that causal inferences come out of a particular research design or model, where do I think they come from?

Let’s step back for a moment. Research designs and statistical models are intended to help us surmount some barriers to causal inference. Presuming that our interest is in the effect of a treatment X on a response Y, we can easily make a long (though far from comprehensive) list of these barriers:

  1. endogeneity (Y causes X)
  2. spurious correlation (W causes Y and X)
  3. selection bias (only cases with certain values of Y, or with a certain relationship between X and Y, are in our sample)
  4. ecological inference problem (e.g., average X is correlated with average Y in aggregated units, but a different relationship exists at the individual level)
  5. unrepresentative sample (the relationship between X and Y in our study does not match the relationship in the population because the distribution of contextual factors Z in the sample does not match the population)
  6. external validity (contextual influences in the study environment do not match contextual influences on the population of interest)
  7. invalid or unreliable measurement
  8. chance produced a relationship between Y and X where no causal relationship exists
  9. model dependence (the statistical technique we use to fix the above overdetermines the estimate in misleading ways)

…and on and on. Many people consider the laboratory experiment to be the gold standard for causal inference because it easily overcomes many of these barriers. If I can bring a proper (e.g., random) sample of a target population into a laboratory, randomly break them into two groups, apply a treatment to one and not the other, carefully control the conditions between the groups, and observe some average response in the treatment group that I don’t observe in the control… well, that’s pretty good evidence that the treatment caused the response. Problems 1, 2, 3, 4, and 9 are nearly impossible by design.

But even in these rarefied conditions, there are caveats. Perhaps the relationship between treatment X and response Y depends on contextual factor Z, and in a live environment levels of Z are substantially different compared to the laboratory. Outside the lab, where these contexts are present, perhaps X does not cause Y–or even has the opposite effect! (This is the heart of an external validity critique, #6.)

Research designs and statistical models usually face tradeoffs in how they address these barriers to causal inference. For example, a 2SLS model that uses an instrumental variable to fix an endogeneity problem aims to solve problem 1, but increases the risk of problems 7 and 8 (by requiring a valid instrument, and by imposing additional structural requirements via the relationship of the instrument to X to Y). In the lab experiment above, we reduced problems 1-4 and 9 at the expense of problem 6 (and still must worry about 5, 7, and 8 despite the design).

Let’s index every potential inferential problem with k \in 1...K, where K is the size of the set of problems. For each problem, we specify a probability p_{k} that problem k will adversely impact our inference in some way. We can imagine p_{k} being different depending on the nature of the adverse impact. For example, even if we are sure that some degree of endogeneity exists in the relationship between X and Y, it is not clear that this will lead us to conclude that X \rightarrow Y when it does not. That depends on the nature and strength of the endogenous relationship. But it will almost certainly cause us to under- or over-estimate the size of the impact of X on Y.

Thus, in some cases, p_{k} \approx 1, but in most cases it’s much smaller. Now consider all the possible barriers to inference as a group. Insomuch as these barriers are mutually exclusive (they aren’t, but this assumption provides a convenient upper bound), we can write the total probability that any design produces a faulty inference as:

total probability of a faulty inference = \sum^{K}_{k=1}{p_{k}}

If the events aren’t mutually exclusive but are independent, we’d have to write something akin to:

total probability of a faulty inference = 1-\prod^{K}_{k=1}{(1-p_{k})}

But the basic ideas that follow apply either way.
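As a quick illustration of the two combination rules above, here is a minimal Python sketch; the problem labels and probability values are invented purely for illustration and are not estimates for any real design.

```python
# Hypothetical per-problem probabilities p_k that barrier k corrupts the
# inference of a single study; the labels and numbers are invented for
# illustration, not estimates for any real design.
p = {
    "endogeneity": 0.05,
    "spurious_correlation": 0.03,
    "selection_bias": 0.02,
    "measurement": 0.04,
    "chance": 0.05,
}

# Mutually exclusive case: the upper bound is just the sum of the p_k.
faulty_sum = sum(p.values())

# Independent case: the inference survives only if no barrier bites,
# so the probability of a faulty inference is 1 - prod(1 - p_k).
survives = 1.0
for p_k in p.values():
    survives *= 1.0 - p_k
faulty_independent = 1.0 - survives

print(f"mutually exclusive (sum):  {faulty_sum:.4f}")          # 0.1900
print(f"independent (1 - product): {faulty_independent:.4f}")  # 0.1764
```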

Now, here’s the heart of my argument. For any particular study, making different research design choices changes the p_{k} terms. Continuing the above example, we might think that the probability that endogeneity is interfering with the desired inference is \approx 1, so that the study adds little to our knowledge. Ergo, we use a 2SLS estimator with an instrumental variable or two. This decreases the probability that endogeneity is influencing the estimator, and raises the probability that model dependence or measurement problems are impacting the result in meaningful ways. The tradeoff is a good one as long as the sum of the probabilities goes down.
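To make that bookkeeping concrete, here is a hypothetical before-and-after comparison for the 2SLS example; the numbers are invented to illustrate the accounting, not to characterize any actual estimator.

```python
# Invented p_k values for the same study estimated two ways.
before_2sls = {"endogeneity": 0.85, "model_dependence": 0.05, "measurement": 0.05}
after_2sls  = {"endogeneity": 0.10, "model_dependence": 0.20, "measurement": 0.15}

# Using the mutually exclusive (sum) upper bound from above:
print(f"total risk without the instrument: {sum(before_2sls.values()):.2f}")  # 0.95
print(f"total risk with 2SLS:              {sum(after_2sls.values()):.2f}")   # 0.45
```

Under these made-up numbers, the instrument shifts risk from endogeneity onto model dependence and measurement, but the total risk still falls, so the tradeoff is worth making.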

That’s great–there’s nothing wrong with that. But I believe it’s reasonable to suspect that, in most cases, even perfect design and model choices can’t get \sum^{K}_{k=1}{p_{k}}=0. Hopefully, we can get it low–and lower is definitely better–but at a certain point unavoidable tradeoffs kick in and design choices can only push around the nature of the problems.

But there is a way to drive small but non-zero probabilities toward zero: conduct a new study. Ideally, the p_{k} values for this new study are completely uncorrelated with those of the past study. This happens when the new study uses a different design, different data sets, and tests different predictions/hypotheses from the same underlying theory. If we index studies with j and conduct J studies in total, we’re now in a world where:

total probability that all inferences are flawed = \prod^{J}_{j=1}{\sum^{K}_{k=1}{p_{jk}}}

if all the probabilities are independent across the J studies and faulty inferences are mutually exclusive within a single study. And this is very good, because \prod_{j}{p_{j}} \rightarrow 0 as J rises. [A clarifying note: this is the probability that all the studies lead to the same conclusion, and that this conclusion is flawed.] If faulty inferences are not mutually exclusive but are independent within a single study, we have:

total probability that all inferences are flawed = \prod^{J}_{j=1}{(1-\prod^{K}_{k=1}{(1-p_{jk})})}

and the same idea applies.
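Here is a minimal sketch of the multi-study bookkeeping, again with invented p_{jk} values for three hypothetical studies with different designs; it shows how quickly the probability that every study is misleading shrinks as J grows.

```python
# Invented per-problem probabilities p_jk for three studies whose designs
# differ, so they are vulnerable to different barriers.
studies = [
    {"endogeneity": 0.05, "selection_bias": 0.10, "measurement": 0.05},  # observational
    {"external_validity": 0.15, "chance": 0.05},                         # lab experiment
    {"instrument_validity": 0.10, "model_dependence": 0.10},             # 2SLS design
]

prob_all_flawed = 1.0
for j, p_jk in enumerate(studies, start=1):
    p_j = sum(p_jk.values())  # per-study risk, mutually exclusive case
    prob_all_flawed *= p_j
    print(f"study {j}: p_j = {p_j:.2f}, Pr(all {j} flawed) = {prob_all_flawed:.4f}")
```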

This also makes a second point clear: if a study has some design flaw that yields a somewhat high p_{k} for some barrier to inference, this does not make the study scientifically valueless. It would, of course, be better to make p_{k} lower for a given study. But a collection of five studies with \sum_{k}{p_{k}} = 0.1 each still yields a collective probability of faulty inference of (0.1)^5 = 0.00001, again presuming independence of the probabilities. It will be higher if the studies’ flaws are in common (i.e., if the same flaw is shared by multiple studies because they use the same data, etc.).
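One simple way to see the shared-flaw caveat is to split each study’s 0.1 risk into a component common to all five studies (say, the same defective data set) and an idiosyncratic component; the decomposition and numbers below are hypothetical, chosen only to illustrate the point.

```python
def prob_all_flawed(shared_p, n_studies=5, total_p=0.10):
    """Pr(every study is flawed) when each study's total flaw probability
    is total_p, of which shared_p comes from a single flaw common to all
    of the studies (e.g., a shared defective data set)."""
    # idiosyncratic flaw probability, chosen so the per-study total stays at total_p
    own_p = (total_p - shared_p) / (1.0 - shared_p)
    # either the common flaw hits (everything is flawed at once), or it
    # doesn't and every study must fail on its own
    return shared_p + (1.0 - shared_p) * own_p ** n_studies

print(f"{prob_all_flawed(shared_p=0.00):.6f}")  # 0.000010 -- fully independent flaws
print(f"{prob_all_flawed(shared_p=0.05):.6f}")  # 0.050000 -- the shared flaw dominates
```

In this toy setup, even a small common flaw puts a floor under the joint probability, which is exactly why it matters that the p_{jk} be uncorrelated across studies.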

This is my basic reason for being less-than-enthusiastic about certain aspects of the credibility revolution in economics and political science; I am not quite a credibility counter-revolutionary, but maybe a Tory. It’s not that I think we shouldn’t strive to maximize the causal value of a particular study, what I would call 1-\sum_{k}{p_{k}}. We should! But I think there are limits on how far we can push this program, and I also think that even studies with smallish but definitely non-zero \sum_{k}{p_{k}} have scientific value.

In brief, I think that our causal inferences are most solid when they come from a collection of studies that tackle different aspects of a theory using different methods. These studies complement each other’s strengths, and consequently the collection is greater than the sum of its parts. I also think that a collection of somewhat flawed studies can provide a better cut at causal inference than a single study with fewer individual flaws, simply because we can never totally eliminate flaws but we can negate their importance with multiple studies.

Even worse, if we force every individual study to reach some minimum value of 1-\sum_{k}{p_{k}}, some questions will probably never get answered at all. Whenever we can’t find a valid instrument for some hypothesized cause, can’t conduct a field or laboratory experiment, can’t do matching, etc., we simply won’t publish the results. And some of the most solid and important findings of political science (e.g., the democratic peace) come from situations like this, where experiments and instruments are impractical. I want work like that to continue to be done, and recognized as providing valuable information about causality.

[Update, 9/30/13 @ 4:31 PM]: some changes made to the structure of the probability of a faulty inference to correct errors and clarify points.
[Update, 10/1/13 @ 10:35 AM]: subscripts in some places changed to match rest of the paper.