Offering (constructive) criticism when reviewing (experimental) research

January 01, 2016

By Yanna Krupnikov and Adam Seth Levine

No research manuscript is perfect, and indeed peer reviews can often read like a laundry list of flaws. Some of the flaws are minor and can be easily eliminated by an additional analysis or a descriptive sentence. Other flaws often stand – at least in the mind of the reviewer – as a fatal blow to the manuscript.

Identifying a manuscript’s flaws is part of a reviewer’s job. And reviewers can potentially critique every kind of research design. For instance, they can make sweeping claims that survey responses are contaminated by social desirability motivations, formal models rest on empirically-untested assumptions, “big data” analyses are not theoretically-grounded, observational analyses suffer from omitted variable bias, and so on.

Yet, while the potential for flaws is ever-present, the key for reviewers is to go beyond this potential and instead ascertain whether such flaws actually limit the contribution of the manuscript at hand. And, at the same time, authors need to communicate why we can learn something useful and interesting from their manuscript despite the research design’s potential for flaws.

In this essay we focus on one potential flaw that is often mentioned in reviews of behavioral research, especially research that uses experiments: critiques about external validity based on characteristics of the sample.

In many ways it is self-evident that the sample (and, by extension, the population from which one is sampling) is a pivotal aspect of behavioral research. Thus it is not surprising that reviewers often raise questions not only about the theory, research design, and method of data analysis, but also about the sample itself. Yet critiques of the sample are often stated in terms of potential flaws – that is, they rest on the possibility that a certain type of sample could affect the conclusions drawn from an experiment, rather than on a demonstration of how the author’s particular sample affects the inferences we can draw from the study at hand.

Here we identify a concern with certain types of sample-focused critiques and offer recommendations for a more constructive path forward. Our goals are complementary and twofold: first, to clarify authors’ responsibilities when justifying the use of a particular sample in their work and, second, to offer constructive suggestions for how reviewers should evaluate these samples. Again, while our arguments could apply to all manuscripts containing behavioral research, we pay particular attention to work that uses experiments.

What’s the concern?

Researchers rely on convenience samples for experimental research because they are often the most feasible way (both logistically and financially) to recruit participants. Yet, when faced with convenience samples in manuscripts, reviewers may bristle. At the heart of such critiques is often the concern that the sample is too “narrow” (Sears 1986). To argue that a sample is narrow means that the recruited participants are homogeneous in a way that differs from other populations to which authors might wish to generalize their results (and in a way that affects how participants respond to the treatments in the study). Although undergraduate students were arguably the first sample to be classified as a “narrow database” (Sears 1986), more recently this label has been applied to other samples, such as university employees, residents of a single town, travelers at a particular airport, and so on.

Concerns regarding the narrowness of a sample typically stem from questions of external validity (Druckman and Kam 2011). External validity refers to whether a “causal relationship holds over variations in persons, settings, treatments and outcomes” (Shadish, Cook and Campbell 2002, 83). If, for example, a scholar observes a result in one study, it is reasonable to wonder whether the same result could be observed in a study that altered the participants or slightly adjusted the experimental context. While the sample is just one of many aspects that reviewers might use when judging the generalizability of an experiment’s results – others might include variations in the setting of the experiment, its timing, and/or the way in which theoretical entities are operationalized – sample considerations have often proved focal.

At times during the review process, the type of sample has become a “heuristic” for evaluating the external validity of a given experiment. Relatively “easy” critiques of the sample – those that dismiss the research simply because it relies on a particular convenience sample – have evolved over time. A decade ago such critiques were used to dismiss experiments altogether, as McDermott (2002, 334) notes: “External validity…tend[s] to preoccupy critics of experiments. This near obsession…tend[s] to be used to dismiss experiments.” More recently, Druckman and Kam (2011) noted that such concerns were especially likely to be directed toward experiments with student samples: “For political scientists who put particular emphasis on generalizability, the use of student participants often constitutes a critical, and according to some reviewers, fatal problem for experimental studies.” Most recently, reviewers have lodged this critique against other convenience samples, such as those from Amazon’s Mechanical Turk.

Note that, although they are writing almost a decade apart, McDermott (2002) and Druckman and Kam (2011) observe the same underlying phenomenon: reviewers dismissing experimental research simply because it involves a particular sample. Such a review argues that the participants (for example, undergraduate students, Mechanical Turk workers, or any other convenience sample) are generally problematic, rather than arguing that they pose a problem for the specific study in the manuscript.

Such general critiques that identify a broad potential problem with using a certain sample can, in some ways, be even more damning than other types of concerns that reviewers might raise. An author can address questions about analytic methods by offering robustness checks. In a well-designed experiment, the author can reason through alternative explanations using manipulation checks and alternative measures. When a review suggests that the core problem is that a sample is generally “bad”, however, the reviewer is (indirectly) stating that readers cannot glean much about the research question from the author’s study and that the reviewer him/herself is unlikely to be convinced by any additional arguments the author could make (save a new experiment on a different sample).

None of the above is to suggest that critiques of samples should not be made during the review process. Rather, we believe that they should follow the same structure as concerns reviewers might raise about other parts of a manuscript. Just as reviewers evaluate experimental treatments and measures within the context of the authors’ hypotheses and specific experimental design, evaluations of the sample also benefit from being experiment-specific. Rather than asking “is this a ‘good’ or ‘bad’ sample?”, we suggest that reviewers ask a more specific question: “is this a ‘good’ or ‘bad’ sample given the author’s research goals, hypotheses, measures, and experimental treatments?”

A constructive way forward

When reviewing a manuscript that relies on a convenience sample, reviewers sometimes dismiss the results based on the potential narrowness of a sample. Such a dismissal, we argue, is a narrow critique. The narrowness of a sample certainly can threaten the generalizability of the results, but it does not do so unconditionally. Indeed, as Druckman and Kam (2011) note, the narrowness of a sample is limiting if the sample lacks variance on characteristics that affect the way a participant responds to the particular treatments in a given study.

Consider, for example, a study that examines the attitudinal impact of alternative ways of framing health care policies. Suppose the sample is drawn from the undergraduate population at a local university, but the researcher argues (either implicitly or explicitly) that the results can help us understand how the broader electorate might respond to these alternative framings.

In this case, one potential source of narrowness might stem from personal experience. We might (reasonably) assume that undergraduate students are highly likely to have experience interacting with a doctor or a nurse (just like non-undergraduate adults). Yet, they are perhaps far less likely to have experience interacting with health insurance administrators (unlike non-undergraduate adults). When might this difference threaten the generalizability of the claims that the author wishes to make?

The answer depends upon the specifics of the study. If we believe that personal experience with health care providers and insurance administrators does not affect how people respond to the treatments, then we have no reason to believe that the narrowness of the undergraduate sample threatens the authors’ ability to generalize the results. If instead we believe that only experience with a doctor or nurse may affect how people respond to the treatments (e.g. perhaps how they comprehend the treatments, the kinds of considerations that come to mind, and so on), then again we have no reason to worry: undergraduates, like other adults, typically have this experience, so the sample is not narrow on the relevant characteristic. If, however, we also believe that experience with insurance administrators affects how people respond to the treatments, then the narrowness of the sample might indeed limit the generalizability of the results.

What does this mean for reviewers? The general point is that, even if we have reason to believe that the results would differ if a sample were drawn from a different population, this fact does not render the study or its results entirely invalid. Instead, it changes the conclusions we can draw. Returning to the example above, a study in which experience with health insurance administrators affects responses still offers some political implications about health policy messages. But (for example) its scope may be limited to those with very little experience interacting with insurance administrators.

It’s worth noting that in some cases narrowness might be based on more abstract, psychological factors that apply across several experimental contexts. For instance, perhaps reviewers are concerned that undergraduates are narrow because they are both homogeneous and different in their reasoning capacity from several other populations to which authors often wish to generalize. In that case, the most constructive review would explain why these reasoning capacities would affect the manuscript’s conclusions and contribution.

More broadly, reviewers may also consider the researcher’s particular goals. Given that some relationships are otherwise difficult to capture, experimental approaches often offer the best means of identifying a “proof of concept” – that is, whether, under theorized conditions, a “particular behavior emerges” (McKenzie 2011). These “proof of concept” studies may initially be performed only in the laboratory, often with limited samples. Then, once scholars observe some evidence that a relationship exists, more generalizable studies may be carried out. Under these conditions, a reviewer may want to weigh the possibility of publishing a “flawed” study against the possibility of publishing no evidence at all of a particularly elusive relationship.

What does this mean for authors? The main point is that it is the author’s responsibility to clarify why the sample is appropriate for the research question and the degree to which the results may generalize or be more limited. It is also the author’s responsibility to explain why the result is important despite the limitations of the sample.

What about Amazon’s Mechanical Turk?

Thus far we have (mostly) avoided mentioning Amazon’s Mechanical Turk (MTurk). We have done so deliberately, as MTurk is an unusual case. On the one hand, MTurk provides a platform for a wide variety of people to participate in tasks such as experimental studies for money. One result is that MTurk typically provides samples that are much more heterogeneous than other convenience samples and are thus less likely to be “narrow” on important theoretical factors (Huff and Tingley 2015). These participants often behave much like people recruited in more traditional ways (Berinsky, Huber and Lenz 2012). On the other hand, MTurk participants are individuals who were somehow motivated to join the platform in the first place and who, over time (due to the potentially unlimited number of studies they can take), have become professional survey takers (Krupnikov and Levine 2014; Paolacci and Chandler 2014). This latter characteristic in particular suggests that MTurk can produce an unusual set of challenges for both authors and reviewers during the manuscript review process.

Just as we argued that a narrow sample is not in and of itself a reason to advocate for a manuscript’s rejection (though the interaction between the narrowness of the sample and the author’s goals, treatments, and conclusions may provide such a reason), so too we believe that recruitment via MTurk does not provide prima facie grounds for rejection.

When using MTurk, it is the author’s responsibility to acknowledge and address any potential narrowness of the sample. It is also the author’s responsibility to design a study that accounts for the fact that MTurkers are professionalized participants (Krupnikov and Levine 2014) and to explain why the particular study is not limited by the characteristics that make MTurk unusual. At the same time, we believe that reviewers should avoid using MTurk as an unconditional heuristic for rejection and should instead consider the relationship between treatment and sample in the study at hand.

Conclusions

We are not the first to note that reviewers can voice concerns about experiments and/or the samples used in experiments. Such sample critiques often seem unconditional: no amount of information the author could offer would lead the reviewer to reconsider his or her position on the sample. Put another way, the sample type is used as a heuristic, with little consideration of the specific experimental context of the manuscript.

We are not arguing that reviewers should never critique samples. Rather, our argument is that the fact that researchers chose to recruit a convenience sample from the population of undergraduates at a local university, the population of MTurk workers, and so on is not, on its own, a justifiable reason for recommending rejection of a paper. Instead, the validity of the sample depends upon the author’s goals, the experimental design, and the interpretation of the results. The use of undergraduate students may have few limitations for one experiment but may prove largely crippling for another. And, echoing Druckman and Kam (2011), even a nationally-representative sample is no guarantee of external validity.

The reviewer’s task, then, is to examine how the sample interacts with all the other components of the manuscript. The author’s responsibility, in turn, is to clarify such matters. And both reviewer and author should acknowledge that the only way to truly answer questions about generalizability is to continue examining the question in different settings as part of an ongoing research agenda (McKenzie 2011).

Lastly, while we have focused on a common critique of experimental research, this is just one example of a broader phenomenon. All research designs are imperfect in one way or another, and thus the potential for flaws is always present. Constructive reviews should evaluate such flaws in the context of the manuscript at hand and then decide whether the manuscript credibly contributes to our knowledge base. And, similarly, authors are responsible for communicating the value of their manuscript despite any potential flaws stemming from their research design.

Bibliography

Berinsky, A. J., Huber, G. A., and Lenz, G. S. 2012. Evaluating Online Labor Markets for Experimental Research: Amazon.com’s Mechanical Turk. Political Analysis 20: 351–68.

Druckman, J. N., and Kam, C. D. 2011. Students as Experimental Participants: A Defense of the ‘Narrow Data Base’. In Handbook of Experimental Political Science, eds. Druckman, Green, Kuklinski, and Lupia. New York: Cambridge University Press.

Huff, C., and Tingley, D. 2015. “Who Are These People?” Evaluating the Demographic Characteristics and Political Preferences of MTurk Survey Respondents. Working Paper.

Krupnikov, Y., and Levine, A. S. 2014. Cross-Sample Comparisons and External Validity. Journal of Experimental Political Science 1: 59–80.

McDermott, R. 2002. Experimental Methodology in Political Science. Political Analysis 10: 325–42.

McKenzie, D. 2011. A Rant on the External Validity Double Double-Standard. Development Impact, The World Bank. http://blogs.worldbank.org/impactevaluations/a-rant-on-the-external-validity-double-double-standard (Accessed: Dec 10, 2015).

Paolacci, G., and Chandler, J. 2014. Inside the Turk: Understanding Mechanical Turk as a Participant Pool. Current Directions in Psychological Science 23: 184–88.

Sears, D. 1986. College Sophomores in the Laboratory: Influences of a Narrow Data Base on Social Psychology’s View of Human Nature. Journal of Personality and Social Psychology 51: 515–30.

Shadish, W. R., Cook, T. D., and Campbell, D. T. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.