The Use of Replication in Graduate Education and Training

December 29, 2014

By Wendy Martinek

Editor’s note: this post is contributed by Wendy Martinek, Associate Professor of Political Science at Binghamton University.

Writing almost 20 years ago as part of a symposium on the subject,[1] King (1995) articulated a strong argument in favor of the development of a replication standard in political science. As King wrote then, “Good science requires that we be able to reproduce existing numerical results, and that other scholars be able to show how substantive findings change as we apply the same methods in new contexts” (451). Key among the conditions necessary for this to occur is that the authors of published work prepare replication data sets that contain everything needed to reproduce reported empirical results. And, since a replication data set is not useful if it is not accessible, also important are authors’ efforts to make replication data sets easily available. Though his argument and the elements of his proposed replication standard were not tied to graduate education per se, King did make the following observation: “Reproducing and then extending high-quality existing research is also an extremely useful pedagogical tool, albeit one that political science students have been able to exploit only infrequently given the discipline’s limited adherence to the replication standard” (445). Given the trend towards greater data access and research transparency, King (2006) developed a guide for the production of a publishable manuscript based on the replication of a published article.

With this guide in hand, and informed by their own experiences in teaching graduate students, many faculty members have integrated replication assignments into their syllabi. As Herrnson has observed, “Replication repeats an empirical study in its entirety, including independent data collection” (1995: 452). As a technical matter, then, the standard replication assignment is more of a verification assignment than a true replication assignment. Regardless, such assignments have made their way onto graduate syllabi in increasing numbers. One prominent reason, and King’s (2006) motivation, is the facilitation of publication by graduate students. The academic job market is seemingly tighter than ever (Jaschik 2009), and publications are an important element of an applicant’s dossier, particularly when applying for a position at a national university (Fuerstman and Lavertu 2005). Accordingly, incorporating an assignment that helps students produce a publishable manuscript whenever appropriate makes good sense. Well-designed replication assignments, however, can also serve other goals. In particular, they can promote the development of practical skills, both with regard to the technical aspects of data access/manipulation and with regard to best practices for data coding/maintenance. Further, they can help students to internalize norms of data accessibility and research transparency. In other words, replication assignments are useful vehicles for advancing graduate education and training.

A replication assignment that requires students to obtain the data set and computer code to reproduce the results reported in a published article (and then actually reproduce those results) directs student attention to three very specific practical tasks. They are tasks that require skills often taken for granted once mastered, but which most political science graduate students do not possess when starting graduate school (something more advanced graduate students and faculty often forget). Most basically, it requires students to work out how to obtain the data and associated documentation (e.g., codebook). Sometimes this task turns out to be ridiculously easy, as when the data is publicly archived, either on an author’s personal webpage or through a professional data archive (e.g., Dataverse Network, ICPSR). But that is certainly not always the case, much to students’ chagrin and annoyance (King 2006: 120; Carsey 2014: 74-75). To be sure, the trend towards greater data accessibility is reflected in, for example, the editorial policies of many political science journals[2] and the data management and distribution requirements imposed by funding agencies like the National Science Foundation.[3] Despite this trend, students undertaking replication assignments not infrequently find that they have to contact the authors directly to obtain the data and/or the associated documentation. The skills needed for the simple (or sometimes not-so-simple) task of locating data may seem so basic as to be trivial for experienced researchers. However, those basic skills are not something new graduate students typically possess. A replication assignment by definition requires inexperienced students to plunge in and acquire those skills.

The second specific task that such a replication assignment requires is actually figuring out how to open the data file. A data file can be in any number of formats (e.g., .dta, .txt, .xls, .rda). For the lucky student, the data file may already be available in a format that matches the software package she intends to use. Or, if not, the student has access to something like Stat/Transfer or DBMS-Copy to convert the data file to a format compatible with her software package. This, too, may seem like a trivial skill to an experienced researcher. That is because it is a trivial skill to an experienced researcher. But it is not trivial for novice graduate students. Moreover, even more advanced graduate students (and faculty) can find accessing and opening data files from key repositories such as ICPSR daunting. For example, students adept at working with STATA, SAS, and SPSS files might still find it less than intuitive to open ASCII-format data with setup files. The broader point is that the mere act of opening a data file once it has been located is not necessarily all that obvious and, as with locating a data file, a replication assignment can aid in the development of that very necessary skill.
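To make the ASCII case concrete, here is a minimal sketch of what a setup file is asking the software to do: cut each fixed-width line of text into named fields according to column positions. The variable names, column positions, and records below are invented for illustration and do not come from any actual ICPSR study; in practice the layout would be read from the setup file or codebook distributed with the data.

```python
# Hypothetical column layout of the kind a setup file describes:
# variable name -> (first column, last column), 1-indexed and inclusive.
LAYOUT = {
    "caseid": (1, 4),
    "age": (5, 6),
    "vote": (7, 7),
}


def parse_fixed_width(lines, layout):
    """Slice each fixed-width record into a dict of stripped string fields."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        record = {}
        for name, (start, end) in layout.items():
            # Convert 1-indexed inclusive columns to a Python slice.
            record[name] = line[start - 1:end].strip()
        records.append(record)
    return records


# Made-up raw records standing in for the contents of an ASCII data file.
RAW = ["0001341", "0002 72", "0003580"]

rows = parse_fixed_width(RAW, LAYOUT)
for row in rows:
    print(row)
```

Every field comes back as a string here. Converting fields to numbers, and deciding which codes count as missing, is a separate step that depends on the codebook.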

The third specific task that such a replication assignment requires is learning how to make sense of the content of someone else’s data file. In an ideal world (one political scientists rarely if ever occupy), the identity of each variable and its coding are crystal clear from a data set’s codebook alone. Nagler outlines best practices in this regard, including the use of substantively meaningful variable names that indicate the subject and (when possible) the direction of the coding (1995: 490). Those conventions are adhered to unevenly at best, however, and the problem is exacerbated when relying on large datasets that use either uninformative codebook numbers or mnemonics that make sense only to experienced users. For example, the General Social Survey (GSS) includes the SPWRKSTA variable. Once the description of the variable is known (“spouse labor force status”) then the logic of the mnemonic makes some sense: SP = spouse, WRK = labor force, STA = status. But it makes no sense to the uninitiated and even an experienced user of the GSS might have difficulty recalling what that variable represents without reference to the codebook. There is also a good deal of variation in how missing data is coded across data sets. Not uncommonly, numeric values like 99 and -9 are used to denote a missing value for a variable. That is obviously problematic if those codes are treated as valid numeric values in calculations. Understanding what exactly “mystery name” variables reference and how such things as missing data have been recorded in the coding process is crucial for a successful replication. The fact that these things are so essential for a successful replication forces students to delve into the minutiae of the coded data and become familiar with it in a way that is easier to avoid (though still inadvisable) when simply using existing data to estimate an entirely new model.
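The missing-data problem is easy to see in a small example. The sketch below uses made-up responses on a 1-5 scale and assumes, purely for illustration, that the codebook designates 99 and -9 as missing; the codes that apply to any real variable have to be taken from its own documentation.

```python
from statistics import mean

# Hypothetical codebook entry: these numeric codes mean "missing".
MISSING_CODES = {99, -9}


def recode_missing(values, missing_codes):
    """Replace sentinel codes with None so they cannot enter calculations."""
    return [None if v in missing_codes else v for v in values]


def mean_of_observed(values):
    """Average only the non-missing values; None if nothing was observed."""
    observed = [v for v in values if v is not None]
    return mean(observed) if observed else None


# Made-up responses on a 1-5 scale, with 99 and -9 marking missing answers.
raw = [3, 4, 99, 2, -9, 5]

naive = mean(raw)                     # wrongly treats 99 and -9 as answers
cleaned = recode_missing(raw, MISSING_CODES)
correct = mean_of_observed(cleaned)   # uses only 3, 4, 2, and 5

print(naive, correct)
```

The naive mean lands far outside the 1-5 range of the scale, which is the kind of warning sign a student replicating published descriptive statistics learns to look for.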

Parenthetically, the more challenges students encounter early on when learning these skills, the better off they are in the long run for one very good reason. Students receive lots of advice and instruction regarding good data management and documentation practices (e.g., Nagler 1995). But there is nothing like encountering difficulty when using someone else’s data to bring home the importance of relying on best practices in coding one’s own data. The same is true with regard to documenting the computer code (e.g., STATA do-files, R scripts). In either case, the confusions and ambiguities with which students must contend when replicating the work of others provide lessons that are much more visceral and, hence, much more effective in fostering the development of good habits and practices than anything students could read or be told by their instructor.

These three specific tasks (acquiring a data set and its associated documentation, then opening and using that data set) require skills graduate students should master very early on in their graduate careers. This makes a replication assignment especially appealing for a first- or second-semester methods course. But replication assignments are also valuable in more advanced methods courses and substantive classes. An important objective in graduate education is the training and development of scholars who are careful and meticulous in the selection and use of methodological tools. But, with rare exception, the goal is not methodological proficiency for its own sake but, rather, methodological proficiency for the sake of advancing theoretical understanding of the phenomena under investigation. A replication assignment is ideal for grounding the development of methodological skills in a substantively meaningful context, thereby helping to fix the notion in students’ minds of methodological tools as in the service of advancing theoretical understanding.

Consider, for example, extreme bounds analysis (EBA), a useful tool for assessing the robustness of the relationship between a dependent variable and a variety of possible determinants (Leamer 1983). The basic logic of EBA is that, the smaller the range of variation in a coefficient of interest given the presence or absence of other explanatory variables, the more robust that coefficient of interest is. It is easy to imagine students focusing on the trivial aspects of determining the focus and doubt variables (i.e., the variables included in virtually all analyses and the variables that may or may not be included depending upon the analysis) in a contrived class assignment. A replication assignment by its nature, however, requires a meaningful engagement with the extant literature to understand the theoretical consensus among scholars as to which variable(s) matter (and, hence, which should be considered focus rather than doubt variables). Matching methods constitute another example. Randomized experiments, in which the treatment and control groups differ from one another only randomly vis-à-vis both observed and unobserved covariates, are the gold standard for causal inference. However, notwithstanding innovative resources such as Time-Sharing Experiments for the Social Sciences (TESS) and Amazon’s Mechanical Turk and the greater prevalence of experimental methods, much of the data available to political scientists to answer their questions of interest are observational. Matching methods are intended to provide leverage for making causal claims based on observational data through the balancing of the distribution of covariates in treatment and control groups regardless of the estimation technique employed post-matching (Ho et al. 2007). Considering matching in the context of a published piece of observational research of interest to a student requires the student to think in substantive terms about what constitutes the treatment and what the distribution of covariates looks like. As with EBA, a replication assignment in which students are obligated to apply matching methods to evaluate the robustness of a published observational study would ensure that the method is tied directly to the assessment (and, hopefully, advancement) of theoretical claims rather than treated as an end in itself.
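The basic logic of EBA described above can be sketched in a few lines. This is a deliberately simplified illustration, not Leamer's full procedure: it uses a single focus variable, refits an ordinary least squares model for every subset of the doubt variables, and reports only the range of the focus coefficient, ignoring standard errors. The data and the variable names (x, z1, z2) are invented, and the small OLS routine is written out by hand only to keep the example self-contained.

```python
from itertools import combinations


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def coefficient_range(focus, doubt, y):
    """Refit with every subset of doubt variables; track the focus coefficient."""
    estimates = {}
    names = sorted(doubt)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            cols = [focus] + [doubt[name] for name in subset]
            estimates[subset] = ols(cols, y)[1]  # index 1 = focus variable
    return min(estimates.values()), max(estimates.values()), estimates


# Invented data: y is constructed as 1 + 2*x + 0.5*z1; z2 is unrelated noise.
x = [1, 2, 3, 4, 5, 6]
doubt_vars = {"z1": [2, 1, 4, 3, 6, 5], "z2": [1, 0, 0, 1, 1, 0]}
y = [1 + 2 * xi + 0.5 * zi for xi, zi in zip(x, doubt_vars["z1"])]

lo, hi, estimates = coefficient_range(x, doubt_vars, y)
for subset, coef in estimates.items():
    print(subset, round(coef, 3))
print("range of focus coefficient:", round(lo, 3), "to", round(hi, 3))
```

The mechanics are the easy part. Deciding which variables belong in the focus role and which in the doubt role is a substantive judgment the code cannot make, and that is where engagement with the literature comes in.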

Though there remain points of contention and issues with regard to implementation that will no doubt persist, there is currently a shared commitment to openness in the political science community, incorporating both data access and research transparency (DA-RT). This is reflected, for example, in the data access guidelines promulgated by the American Political Science Association (Lupia and Elman 2014). The training and mentoring provided to graduate students in their graduate programs are key components of the socialization process by which they learn to become members of the academic community in general and their discipline in particular (Austin 2002). Replication assignments in graduate classes serve to socialize students into the norms of DA-RT. As Carsey notes, “Researchers who learn to think about these issues at the start of their careers, and who see value in doing so at the start of each research project, will be better able to produce research consistent with these principles” (2014: 75). Replication assignments serve to inculcate these principles in students. And, while they have obvious value in the context of methods courses, to fully realize the potential of replication assignments in fostering the development of these professional values in graduate students they should be part of substantive classes as well. The more engaged students are with the substantive questions at hand, the easier it should be to engage their interest in understanding the basis of the inferences scholars have drawn to answer those questions and where the basis for those inferences can be improved to the betterment of theoretical understanding. In sum, the role of replication in graduate education and training is both to develop methodological skills and to enhance theory-building abilities.

Works Cited

Austin, Ann E. 2002. “Preparing the Next Generation of Faculty: Graduate School as Socialization to the Academic Career.” Journal of Higher Education 73(1): 94-122.

Carsey, Thomas M. 2014. “Making DA-RT a Reality.” PS: Political Science & Politics 47(1): 72-77.

Fuerstman, Daniel and Stephen Lavertu. 2005. “The Academic Hiring Process: A Survey of Department Chairs.” PS: Political Science & Politics 38(4): 731-736.

Ho, Daniel E., Kosuke Imai, Gary King, and Elizabeth A. Stuart. 2007. “Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference.” Political Analysis 15(3): 199-236.

Jaschik, Scott. 2009. “Job Market Realities.” Inside Higher Ed, September 8. https://www.insidehighered.com/news/2009/09/08/market (November 29, 2014).

King, Gary. 2006. “Publication, Publication.” PS: Political Science & Politics 39(1): 119-125.

Leamer, Edward. 1983. “Let’s Take the ‘Con’ Out of Econometrics.” American Economic Review 73(1): 31-43.

Lupia, Arthur and Colin Elman. 2014. “Openness in Political Science: Data Access and Research Transparency.” PS: Political Science & Politics 47(1): 19-42.

Nagler, Jonathan. 1995. “Coding Style and Good Computing Practices.” PS: Political Science & Politics 28(3): 488-492.

Notes

[1] The symposium appeared in the September 1995 issue of PS: Political Science & Politics.

[2] See, for example, http://ajps.org/guidelines-for-accepted-articles/ (November 15, 2014).

[3] See http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf (November 15, 2014).