Reproducibility and Transparency

December 05, 2014

By Rick Wilson

The Political Methodologist is joining with 12 other political science journals in signing the Data Access and Research Transparency (DA-RT) joint statement.

The social sciences receive little respect from politicians and segments of the mass public, and there are many reasons for this.

Part of the solution to building trust is to increase the transparency of our claims, and this is why The Political Methodologist is signing on to DA-RT.

As researchers, we need to ensure that the claims we make are supported by systematic argument (either formal or normative theory) or by marshaling empirical evidence (either qualitative or quantitative). I am going to focus on empirical quantitative claims here, in large part because many of the issues I point to are more easily solved for quantitative research. The idea of DA-RT is simple and has three elements. First, an author should ensure that the data are available to the community, which means depositing them in a trusted digital repository. Second, an author should ensure that the analytic procedures on which the claims are based are a matter of public record. Third, data and analytic procedures should be properly cited, with a title, version, and persistent identifier.

Interest in DA-RT extends beyond political science. On November 3-4, 2014, the Center for Open Science co-sponsored a workshop designed to produce standards for data accessibility, transparency, and reproducibility. At the table were journal editors from the social sciences as well as from Science. The latter issued a rare joint editorial with Nature detailing standards for ensuring reproducibility and transparency in the biological sciences, and Science aims to do the same for the social sciences.
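For illustration, a citation meeting that third requirement might look like the following (a hypothetical entry; the title, version, and identifier here are invented): Wilson, Rick K. 2014. "Replication Data for: Trust and Transparency." Version 1.0. Trusted Digital Repository. doi:10.0000/example. The essential pieces are the three DA-RT names: a title, a version, and a persistent identifier that will resolve long after any personal website has gone stale.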

Ensuring that our claims are grounded in evidence may seem non-controversial. Foremost, the evidence used to generate claims needs to be publicly accessible and interpretable. For those using archived data (e.g., COW or ANES) this is relatively easy. For those collecting original data it may be more difficult. Original data require careful cataloging in a trusted digital repository (more on this in a bit). This ensures that the data you have carefully collected will persist and will be available to other scholars. Proprietary data are more problematic. Some data may be sensitive, some may be protected under Human Subjects provisions, and some may be privately owned. In lieu of providing such data, authors have a special responsibility to carefully detail the steps that could, in principle, be taken to access the data. Absent the data supporting claims, readers should be skeptical of any conclusions drawn by an author.

Surprisingly, there are objections to sharing data. Many claim that original data are proprietary: the researcher worked hard to generate them and so shouldn't have to share. This is not a principled defense. If the researcher chooses not to share the data, I see no point in allowing the researcher to share the findings; both can remain private. A second claim to data privacy is that the data have not yet been fully exploited. Editors have the ability to embargo the release of data, although this should happen only under rare circumstances. It seems odd that a researcher would request an embargo, given that the data of concern are precisely those that support the researcher's claims. Unless the author intends to use exactly the same data for another manuscript, there is no reason to grant an embargo; and if the researcher does intend to use exactly the same data, editors should be concerned about self-plagiarism. The sharing requirement, after all, covers only the data used to support a claim.

The second feature of reproducibility and transparency involves making the analytic procedures publicly available. This gets at the key element of transparency. The massaged data that are publicly posted have been generated through numerous decisions by the researcher, and a record of those decisions is critical for understanding the basis of empirical claims. For most researchers, this means providing a complete listing of the data transformation steps. All statistical programs allow for some form of log file that documents what a researcher did. More problematic may be detailing the instruments that generated some of the data. Code used for scraping data from websites, videos used as stimuli for an experiment, or physical recording devices all pose problems for digital storage. However, if critical for reaching conclusions, a detailed record of the steps taken by a researcher must be produced. The good news is that most young scholars are trained to do this routinely.
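To make this concrete, here is a minimal sketch of what a self-documenting transformation script might look like, written in Python; the file names, variables, and recodes are all hypothetical:

```python
"""Builds the analysis dataset from the raw survey file.

A minimal sketch for illustration: file names, variables, and
recodes are hypothetical. Every transformation lives in this one
script, so a reader can re-run it end to end and recover the
exact dataset used in the published analysis.
"""
import logging

import pandas as pd

logging.basicConfig(filename="build_dataset.log", level=logging.INFO)
log = logging.getLogger(__name__)

# Step 1: load the raw data exactly as collected.
raw = pd.read_csv("survey_raw.csv")
log.info("Loaded %d rows from survey_raw.csv", len(raw))

# Step 2: drop incomplete interviews, logging how many were lost.
complete = raw.dropna(subset=["vote_choice", "ideology7"]).copy()
log.info("Dropped %d incomplete interviews", len(raw) - len(complete))

# Step 3: recode a 7-point ideology item into three categories.
complete["ideology3"] = pd.cut(
    complete["ideology7"],
    bins=[0, 3, 4, 7],
    labels=["liberal", "moderate", "conservative"],
)

# Step 4: write the analysis dataset that gets deposited.
complete.to_csv("survey_analysis.csv", index=False)
log.info("Wrote %d rows to survey_analysis.csv", len(complete))
```

The point is not the particular tools; it is that every step from raw file to analysis dataset is recorded in one place and can be re-run by anyone.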

There are objections to providing this kind of information as well. Typically the objection is that it is too difficult to recreate what was done to arrive at the final dataset. If true, then the data are likely problematic: if the researcher is unable to recreate the data, how can anyone else judge them?

The final element of transparency deals with the citation of data and code. This has to be encouraged. Assembling and interpreting data is an important intellectual endeavor, and it should be rewarded by proper citation, not just by the researcher but by others. This means that the record of the researcher must have a persistent and permanent location. Here is where trusted digital repositories come into play. These may be partners in the Data Preservation Alliance for the Social Sciences (Data-PASS, http://www.data-pass.org) or institutional repositories. They are not an author's personal website. My own website, for example, is outdated, and I should not be trusted to maintain it. The task of a trusted data repository is to ensure that the data are curated and appropriately archived. Repositories do not have the responsibility for documenting the data and code; that is the responsibility of the researcher. All too often stored data have obscure variable names that are meaningful only to the researcher, and there is little way to match the data to what the researcher did in a published article.
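One lightweight way to meet that documentation responsibility is to deposit a human-readable codebook alongside the data. A sketch in Python, with hypothetical variable names and descriptions:

```python
"""Writes a minimal codebook for the deposited dataset.

Variable names and descriptions are hypothetical, for illustration.
"""
import csv

# Map each deposited variable to a plain-language description, so the
# data file can be interpreted without the original analysis code.
CODEBOOK = {
    "vote_choice": "Reported vote in the 2012 presidential election "
                   "(1=Democrat, 2=Republican, 3=Other)",
    "ideology7": "Self-placement on a 7-point ideology scale (survey item Q14)",
    "ideology3": "Ideology recoded from the 7-point item: "
                 "liberal / moderate / conservative",
}

with open("codebook.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["variable", "description"])
    for name, description in CODEBOOK.items():
        writer.writerow([name, description])
```

A file like this costs minutes to produce and spares every subsequent user the guesswork of decoding obscure variable names.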

The aim of transparency, of course, is to ensure that claims can be subjected to replication. Replication has a troubled history, in that it often looks like “gotcha” journalism. Publication is biased in that replications overturning a finding are much more likely to be published. This obscures the denominator and raises the question of how often findings are confirmed rather than rejected. We have very few means for encouraging the registration of replications. That is a shame, since we have as much to learn from instances where a finding appears to be confirmed as from those where it does not. If journals had unlimited resources, no finding would be published unless independently replicated. This isn't going to happen. However, good science should ensure that findings are not taken at face value but subjected to further testing. In this age of electronic publication it is possible to link to studies that independently replicate a finding. Journals and associations are going to have to be more creative about how claims published in their pages are supported. Replication is going to have to be embraced.

It may be that authors will resist sharing data or making their analytic decisions public. However, resistance may be futile. Journals, including The Political Methodologist, are taking the high road and will eventually require openness in science.