On the Replication of Experiments in Teaching and Training

December 12, 2014

By Jon Rogers

Editor’s note: this piece is contributed by Jon Rogers, Visiting Assistant Research Professor and member of the Social Science Experimental Laboratory (SSEL) at NYU Abu Dhabi.

Introduction

Students in the quantitative social sciences are exposed to heavy doses of rational choice theory.  Going back to Marwell and Ames (1981), we know that economists free ride, but almost no one else does (in the strict sense, anyway).  In part, this is because many social science students are essentially taught to free ride.  They see these models of human behavior and incorrectly take the lesson that human beings should be rational and free ride; to not free ride would be irrational.  Some have difficulty grasping that these are models meant to predict human behavior, not to prescribe and judge it.

Behaviorally, though, it is well established that most humans are not perfectly selfish.  Consider the dictator game, in which one player decides how much of her endowment to give to a second player.  A simple Google Scholar search for dictator game experiments returns nearly 40,000 results.  It is no stretch to posit that almost none of these report that every first player kept the whole endowment for herself (Engel, 2011).  When a new and surprising result is presented in the literature, it is important for scholars to replicate the study to examine its robustness.  Some results, however, are so well known and so robust that they graduate to the level of empirical regularity.

While replication of surprising results is good for the discipline, replication of classic experiments is beneficial for students.  In teaching, experiments can be used to demonstrate the disconnect between Nash equilibrium and actual behavior and to improve student understanding of the concept of modeling.  Discussions of free-riding, the folk theorem, warm glow, and the like can all benefit from classroom demonstration.  For graduate students, replication of experiments is also useful training, since it builds programming, analysis, and experimenter skills in an environment where the results pose little risk to the student's career.  For students of any type, replication is a useful endeavor and one that should be encouraged as part of the curriculum.

Replication in Teaching

Budding political scientists and economists are virtually guaranteed to be introduced, at some level, to rational choice.  Rational choice is characterized by methodological individualism and the maximization of self-interest.  That is, actors (even if the actor of interest is a state or a corporation) are assumed to be individuals who make choices based on what they like best.  When two actors are placed in opposition to one another, they are modeled as acting strategically to maximize their own payoffs and only their own payoffs.

Consider the classic ultimatum game.  Player A is granted an endowment of 10 tokens and is tasked with choosing how much to offer to player B.  Player B can then decide either to accept or to reject the offer.  If she accepts, then the offer is enforced and both subjects receive their payments.  If she rejects the offer, then both players receive nothing.  In their game theory course work, students are taught to identify the (subgame perfect) Nash equilibrium through backward induction.  In the second stage, player B chooses between receiving 0 and receiving the offer x, with certainty.  Since she is modeled as being purely self-interested, she accepts the offer, no matter how small.  In the first stage, player A knows that player B will accept any offer, so she offers the smallest ε > 0 possible.  This yields equilibrium payoffs of (10-ε, ε).
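To make the logic concrete, the sketch below is a minimal Python illustration (not anything run with subjects) of the backward induction for a hypothetical whole-token version of the game, with the tie-breaking assumption that player B accepts when indifferent.  With a smallest positive unit ε, the analogue is the (10-ε, ε) outcome described above.

```python
# Backward induction in a discretized ultimatum game.
# Illustrative assumptions: a 10-token endowment, whole-token offers,
# purely self-interested players, and B accepting when indifferent.

ENDOWMENT = 10

def b_accepts(offer):
    # Second stage: B compares the offer to the payoff from rejecting (zero);
    # a purely self-interested B accepts any offer at least as good as zero.
    return offer >= 0

def a_best_offer():
    # First stage: A anticipates B's rule and makes the smallest acceptable
    # offer, keeping ENDOWMENT - offer for herself.
    acceptable = [x for x in range(ENDOWMENT + 1) if b_accepts(x)]
    return min(acceptable)

offer = a_best_offer()
print(f"offer = {offer}, payoffs = ({ENDOWMENT - offer}, {offer})")
# Under these assumptions this prints: offer = 0, payoffs = (10, 0).
```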

Students are taught to identify this equilibrium and are naturally rewarded by having test answers marked correct.  Through repeated drilling of this technique, students become adept at identifying equilibria in simple games, but they make the unfortunate leap of seeing those who play the rational strategy as smarter or better.  A vast literature reports that players rarely make minimal offers and that such offers are frequently rejected (Oosterbeek, Sloof, and van de Kuilen, 2004).  Sitting with their textbooks, however, students are tempted to misuse the terminology of rational choice and deem irrational any rejection or any non-trivial offer.  Students need to be shown that Nash equilibria are sets of strategy profiles derived from models, not inherently predictions in and of themselves.  Any model is an abstraction from reality and may omit critical features of the scenario it attempts to describe.  A researcher may predict that subjects will employ equilibrium strategies, but she may just as easily predict that considerations such as trust, reciprocity, or altruism will induce non-equilibrium behavior.  The Nash equilibrium is a candidate hypothesis, but it is not the only one.

This argument can be applied to games with voluntary contribution mechanisms.  In the public goods game, for example, each player begins with an endowment and chooses how much to contribute to a group account.  All contributions are added together, multiplied by an efficiency factor, and shared evenly among all group members, regardless of any individual's level of contribution.  In principle, the group as a whole would be better off if everyone gave the maximum contribution.  Under strict rationality, however, the strong free rider hypothesis predicts zero contribution from every player.  Modeling certain situations as public goods games then leads to the prediction that public goods will be under-provided.  Again, however, students are tempted to misinterpret the lesson and consider the act of contribution to be inherently irrational.  Aspects of other-regarding behavior can be rational, if they are included in the utility function (Andreoni, 1989).
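A small numerical sketch makes the tension concrete.  The snippet below uses purely illustrative parameters of my own choosing (four players, 20-token endowments, an efficiency factor of 1.6, so each contributed token returns only 0.4 tokens to the contributor but 1.6 tokens to the group as a whole): universal contribution makes everyone better off, yet any individual earns more by contributing nothing.

```python
# Public goods game payoffs.  Illustrative assumptions: 4 players, 20-token
# endowments, efficiency factor 1.6 (marginal per-capita return 1.6 / 4 = 0.4).

N = 4
ENDOWMENT = 20
FACTOR = 1.6

def payoffs(contributions):
    # Each player keeps whatever she did not contribute and receives an equal
    # share of the multiplied group account, regardless of her own contribution.
    share = FACTOR * sum(contributions) / N
    return [ENDOWMENT - c + share for c in contributions]

print(payoffs([20, 20, 20, 20]))  # full contribution: 32 tokens each
print(payoffs([0, 0, 0, 0]))      # strong free riding: 20 tokens each
print(payoffs([0, 20, 20, 20]))   # the lone free rider earns 44; contributors earn 24
```

Because the marginal per-capita return is less than one, contributing nothing is the dominant strategy for a purely selfish player, even though full contribution is group-optimal.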

In each of the above circumstances, students could benefit from stepping back from their textbooks and remembering the purpose of modeling.  Insofar as models are neither true nor false, but useful or not (Clarke and Primo, 2012), they are meant to help researchers predict behavior, not to prescribe what a player should do when playing the game.  Simple classroom experiments, ideally run before lecturing on the game and combined with post-experiment discussion of the results, help students remember that while a game may have a pure strategy Nash equilibrium, that equilibrium is not necessarily a good prediction of behavior.  Experiments can stimulate students to consider why behavior may differ from the equilibrium and how they might revise models to be more useful.

Returning to voluntary contribution mechanisms, it is an empirical regularity in repeated play that contributions are relatively high in early rounds but tend to converge to zero over time.  Another regularity is the restart effect: even after contributions have hit zero, if play is stopped and then restarted, contributions leap upward before again trending toward zero.  Much of game theory in teaching is focused on identifying equilibria without consideration of how these equilibria (particularly Nash equilibria) are reached.  Replication of classic experiments allows for discussion of equilibrium selection, coordination mechanisms, and institutions that support pro-social behavior.

One useful way to engage students in a discussion of modeling behavior is to place them in a scenario with solution concepts other than just pure strategy Nash equilibrium.  For instance, consider k-level reasoning.  The beauty contest game takes a set of N players and gives them three options: A, B, and C.  The players' task is to guess which of the three options will be most often selected by the group.  Thus, players are asked not about their own preferences over the three options, but about their beliefs about the preferences of the other players.  In a variant of this game, Rosemarie Nagel (1995) takes a set of N players and has them pick numbers between one and one hundred.  Each player's task is to pick the number closest to what she believes will be the average guess, multiplied by a parameter p.  If p = 0.5, then subjects are attempting to guess the number between one and one hundred that will be half of the average guess.  The subject with the guess closest to this target wins.

In this case, some players will notice that no number x ∈ (50,100] can be the correct answer, since these numbers can never be half of the average.  A subject who answers 50 would be labeled level-0; she has avoided strictly dominated strategies but reasons no further.  Some subjects, however, will believe that all subjects have thought through the game at least this far and will realize that the interval of viable answers is really (0,50].  These level-1 players then respond that one half of the average will be x = 25.  The process iterates to its logical (Nash equilibrium) conclusion: if all players are strictly rational, then they will all answer 0.  Behaviorally, though, guesses of 0 virtually never win.
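The iteration itself is easy to show.  The sketch below assumes, for illustration, that p = 0.5 and that the level-0 anchor is a guess of 50; each additional level of reasoning halves the guess, marching toward the Nash equilibrium of 0, while observed winning guesses typically correspond to only one or two steps of reasoning.

```python
# Level-k guesses in the p-beauty contest.  Illustrative assumptions: p = 0.5
# and a level-0 anchor of 50 (the midpoint of the guessing interval).
# A level-k player best-responds to the belief that everyone else reasons
# exactly one level below her, so her guess is 50 * p**k.

P = 0.5
ANCHOR = 50.0

def level_k_guess(k):
    guess = ANCHOR
    for _ in range(k):
        guess *= P  # best response (in a large group) to others guessing `guess`
    return guess

for k in range(6):
    print(f"level-{k}: {level_k_guess(k)}")
# 50.0, 25.0, 12.5, 6.25, 3.125, 1.5625 -> the limit is the Nash equilibrium
# of 0, which in practice almost never wins.
```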

In a classroom setting, this game is easy to implement and quite illustrative.  Students become particularly attentive if the professor offers even modest monetary stakes, say between $0.00 and $10.00, with the winning student receiving her guess as a prize.  A class of robots would all guess 0 and the professor would suffer no monetary loss.  But it takes only a small percentage of the class entering guesses above 0 to pull the winning guess away from the Nash equilibrium.  Thus the hyper-rational students who guessed 0 see that the equilibrium answer and the winning answer are not necessarily the same thing.  (Note: the 11-20 money request game of Arad and Rubinstein (2012) is an interesting variant with no pure strategy Nash equilibrium at all.)

In each of the above settings, it is well established that many subjects do not employ the equilibrium strategy.  This is surprising to no one beyond those students who worship too readily at the altar of rational choice.  By replicating classic experiments to demonstrate to students that models are imperfect in their ability to predict human behavior, we demote game theory from life plan to its proper level of mathematical tool.  We typically think of replication as a check on faulty research or a means by which to verify the robustness of social scientific results.  Here, we are using replication of robust results to inspire critical thinking about social science itself.  For graduate students, however, replication has the added benefit of building the skills needed to carry out more advanced experiments.

Replication in Training

To some extent, the internet era has been a boon to the graduate student in the social sciences, providing ready access to a wide variety of data sources.  Responsible researchers make their data available on request at the very least, if not completely available online.  Fellow researchers can then attempt to replicate findings to test their robustness.  Students, in turn, can use replication files to practice the methods they have learned in their classes.

The same is true of experimental data sets.  However, the data analysis of experiments is rarely a complex task.  Indeed, the technical simplicity of analysis is one of the key advantages of true experiments.  For the budding experimentalist, replication of data analysis is a useful exercise, but one not nearly as useful as the replication of experimental procedures.  Most data generating processes are, to some extent, sensitive to choices made by researchers.  Most students, however, are not collecting their own nationally representative survey data.  Particularly at early stages of development, students may complete course work entirely from existing data.  The vast majority of their effort is spent on the analysis, and mistakes can be identified and often corrected with what may be little more than a few extra lines of code.

For experimentalists in training, though, the majority of the work comes on the front end, as does the majority of the risk.  From writing the experimental program in a language such as z-Tree (Fischbacher, 2007), which is generally new to the student, to physically running the experimental sessions, a student's first experiment is an ordeal.  The stress of this endeavor is compounded when its success or failure directly relates to the student's career trajectory and job market potential.  It is critical for the student to have solid guidance from a well-trained advisor.

This is, of course, true of all research methods.  The better a student's training, the greater her likelihood of successful outcomes.  Data analysis training in political science graduate programs has become considerably more sophisticated in recent years, with students often required to complete three, four, or even more methods courses.  Training for experimentalists, however, exhibits considerably more variance, and formal training may be unavailable altogether.  Some fortunate students are trained on the job, assisting more senior researchers with their experiments.  But while students benefit from an apprenticeship with an experimentalist, they suffer, ironically enough, from a lack of experimentation.

Any student can practice working with large data sets.  Many can be accessed for free or via an institutional license.  A student can engage in atheoretical data mining and practice her analysis and interpretation of results.  She can do all of this at home with a glass of beer and the television on.  When she makes a mistake, as a young researcher is wont to do, little is lost and the student has gained a valuable lesson.  Students of experiments, however, rarely get the chance to make such mistakes.  A single line of economic experiments can cost thousands of dollars, and a student is unlikely to have surplus research funds with which to gain experience.  If she is lucky enough to receive research funding, it will likely be limited to subject payments for her dissertation's experiment(s).  A single failed session could drain a meaningful portion of her budget, as subjects must be paid even if the data are unusable.  The rule at many labs is that subjects in failed sessions must still receive their show-up fees plus additional compensation for any time they have spent up to the point of the crash.  Even with modest subject payments, this can amount to hundreds of dollars.

How, then, is the experimentalist to develop her craft while under a tight budget constraint?  The answer lies in the empirical regularities discussed earlier.  The size of financial incentives in an experiment does matter, at least in terms of salience (Morton and Williams, 2010), but some effects are so robust as to be present in experiments with even trivial or non-financial incentives.  In my own classroom demonstrations, I have replicated the prisoner's dilemma, the ultimatum game, the public goods game, and many other experiments, using only fractions of extra credit points as incentives, and the results are remarkably consistent with those in the literature.[1]  At zero financial cost, I gained experience in programming and running experiments and simultaneously ran a lesson on end game effects, the restart effect, and the repeated public goods game.

Not all graduate students teach courses of their own, but all graduate students have advisors or committee members who do.  It is generally less of an imposition for an advisee to ask a faculty member to grant their students a few bonus points than it is to ask for research funds, especially funds that would not be spent directly on the dissertation.  These experiments can be run identically, in every respect, to how they would be run with monetary incentives, but without the cost or the risk to the student's career.  This practice is all the more important at institutions without established laboratories, where the student is responsible for building an ad hoc network.

Even for students with experience assisting senior researchers, independently planning and running an experiment from start to finish, without direct supervision, is invaluable practice.  The student is confronted with the question of how she will run the experiment, not how her advisor would do so.  She then writes her own program and instructions, designs her own physical procedures, and plans every detail on her own.  She can and should seek advice, but she is free to learn and develop her own routine.  The experiment may succeed or fail, but the end product is similar to atheoretical play with data: it is unlikely to result in a publication, but it will prove to be a valuable learning experience.  (Note: a well-run experiment is the result not only of a properly written program, but also of strict adherence to a set of physical procedures, such as (among many others) how to seat subjects, how to convey instructions, and how to monitor laboratory conditions.  A program can be vetted in a vacuum, but the experimenter's procedures are subject to failure in each and every session, so practice is crucial.)

Discussion

Many of the other articles in this special issue deal with the replication of studies as a matter of good science, in line with practices in the physical sciences.  But in the physical sciences, replication also plays a key role in training.  Students often begin replicating classic experiments before they can even spell the word science.  They follow structured procedures to obtain predictable results, not to advance the leading edge of science, but to build core skills and methodological discipline.

Here, though, physical scientists have a distinct advantage.  Their models are frequently based on deterministic causation and are more readily understood, operationalized, tested, and (possibly) disproved.  To the extent that students have encountered scientific models in their early academic careers, these models are likely to have been deterministic.  Most models in social science, however, are probabilistic in nature.  It is somewhat understandable that a student in the social sciences, who reads her textbook and sees the mathematical beauty of rational choice, would be enamored with its clarity.  A student, particularly one who has self-selected into majoring in economics or politics, can be forgiven for seeing the direct benefits of playing purely rational strategies.  It is not uncommon for an undergraduate to go her entire academic career without empirically testing a model.  By replicating classic experiments, particularly where rational choice fails, we can reinforce the idea that these are models meant to predict behavior, not instructions for how to best an opponent.

In contrast, graduate students explicitly train in designing and testing models.  A key component of training is the ability to make and learn from mistakes.  Medical students learn by practicing on cadavers, which cannot suffer.  Chemists learn by following procedures and comparing results to established parameters.  Large-n researchers learn by working through replication files and testing the robustness of results.  In the same spirit, experimentalists can learn by running low-risk experiments based on established designs, with predictable results.  In doing so, even if they fail, they build competence in the skills they will need to work independently in the future.  At any rate, while the tools employed in the social sciences differ from those in the physical sciences, the goal is the same: to improve our understanding of the world around us.  Replicating economic experiments aids some in their study of human behavior and others on their path to learning how to study human behavior.  Both are laudable goals.

Notes

[1] Throughout the course, students earn “Experimental Credit Units.” The top performing student at the end of the semester receives five extra credit points.  All other students receive extra credit indexed to that of the top performer.  I would love to report the results of the experiments here, but at the time I had no intention of using the data for anything other than educational purposes and thus did not apply for IRB approval.

References

1. Andreoni, James. 1989. “Giving with Impure Altruism: Applications to Charity and Ricardian Equivalence.” The Journal of Political Economy 97(6):1447-1458.

2. Arad, Ayala & Ariel Rubinstein. 2012. “The 11-20 Money Request Game: A Level-k Reasoning Study.” The American Economic Review 102(7):3561-3573.

3. Clarke, Kevin A. & David M. Primo. 2012. A Model Discipline: Political Science and the Logic of Representations. New York, NY: Oxford University Press.

4. Engel, Christoph. 2011. “Dictator Games: A Meta Study.” Experimental Economics 14(4):583-610.

5. Fischbacher, Urs. 2007. “z-Tree: Zurich Toolbox for Ready-made Economic Experiments.” Experimental Economics 10(2):171-178.

6. Marwell, Gerald & Ruth E. Ames. 1981. “Economists Free Ride, Does Anyone Else? Experiments on the Provision of Public Goods, IV.” Journal of Public Economics 15 (3):295-310.

7. Morton, Rebecca B. & Kenneth C. Williams. 2010. Experimental Political Science and the Study of Causality. New York, NY: Cambridge University Press.

8. Nagel, Rosemarie. 1995. “Unraveling in Guessing Games: An Experimental Study.” The American Economic Review 85(5):1313-1326.

9. Oosterbeek, Hessel, Randolph Sloof & Gijs van de Kuilen. 2004. “Cultural Differences in Ultimatum Game Experiments: Evidence from a Meta-Analysis.” Experimental Economics 7(2):171-188.