Scientific Conclusions versus Scientific Decisions, or We’re Having Tukey for Thanksgiving

December 01, 2013

By Justin Esarey

I recently noticed this Tweet from Carlisle Rainey, a methodologist at SUNY Buffalo and a fellow Florida State alumnus:

An essay by Tukey offers perspective on @justinesarey's alternative/supplement to hypothesis tests. http://t.co/8PzNNwtL0b

— Carlisle Rainey (@carlislerainey) November 27, 2013

Tukey’s article is well worth a read, but it seemed to me that the distinction between conclusions and decisions was not drawn cleanly enough to separate the two as a matter of methodological implementation. As a matter of qualitative description, the categories seem clean enough:

Decisions to “act for the present as if” are attempts to do as well as possible in specific situations, to choose wisely among the available gambles. …A conclusion is a statement which is to be accepted as applicable to the conditions of an experiment or observation unless and until unusually strong evidence to the contrary arises.

The final sentence in the article sums it up quite well:

There is a place for both “doing one’s best” [viz., making a decision] and “saying only what is certain” [viz., drawing a conclusion], but it is important to know, in each instance, both which one is being done, and which one ought to be done.

That makes sense to me: a decision is a course of action chosen under uncertainty in order to maximize benefit, while a conclusion is a statement of fact that we believe is certain enough to treat as true. But Tukey himself realizes that this distinction quickly gets quite muddy in practice, especially when statistics are used for the scientific purpose of hypothesis testing. Indeed, in Appendix 3 Tukey seems to think that statistical significance testing produces conclusions, while Neyman-Pearson hypothesis testing (which in practice reduces to significance testing) produces decisions:

In view of Neyman’s continued insistence on “inductive behavior”, words which relate more naturally to decisions than to conclusions, it is reasonable to suppose that the Neyman-Pearson theory of testing hypotheses was, at the very least, a long step in the direction of decision theory, and that the appearance of 5%, 1% and the like in its development and discussion was a carryover from the then dominant qualitative conclusion theory, the theory of tests of significance. If this view is correct, Wald’s decision theory now does much more nearly what tests of hypothesis were intended to do. …If one aspect of the theory of testing hypotheses has been embodied in modern decision theory, what of its other aspects? The notion of the power function of a test, which is of course strictly analogous to the notion of the operating characteristic of a sampling plan, is just as applicable to tests of significance (conclusions) as to tests of hypotheses (decisions). And, indeed, its natural generalization to confidence procedures (conclusions) seems more natural and reasonable than such conventional criteria as the average length of confidence intervals.

It is curious that two procedures which are, in practice, mathematically identical are considered to be epistemologically distinct. And, with hindsight, the significance testing regime has failed to produce conclusions in Tukey’s sense. Moreover, we immediately encounter a knotty epistemological problem when we think about how one would draw scientific conclusions using statistical evidence:

On the other hand, all of us make decisions about conclusion procedures. Some of us do it every day. “How is it best to analyze this data?” is a question which cannot be left to the experimenter alone, which the statistician is bound by his profession to try to answer. If the answer should clearly be a procedure to provide a conclusion, then he must do something about a conclusion procedure. Does he decide about it, or conclude about it?

We have to make a decision about how to draw conclusions. How deliciously hermeneutic!

I think the way out is suggested by Tukey in Appendix 2:

A scientist is helped little to know that another, given different evidence and facing a different specific situation, decided (even decided wisely) to act as if so-and-so were the true state of nature. The communication (for information, not as directives) of decisions is often inappropriate, and usually inefficient. A scientist is helped much to know that another reached a certain conclusion, that he felt that the correctness of so-and-so was established with high confidence.

I would suggest a modification of this principle: a conclusion is a hypothesis that an overwhelming majority of the scientific community would choose to accept on the basis of the evidence available. This is precisely what Nathan Danneman and I argue for in our article:

we argue that it is more helpful to assess a result’s substantive robustness, the degree to which a community with heterogeneous standards for interpreting evidence would agree that the result is substantively significant, rather than whether it meets any individual standard. This focuses attention away from the contentious choice of utility function and onto the breadth of evaluation standards that can be satisfied by a particular piece of evidence. The idea is to enable a researcher to objectively demonstrate whether his/her results should be regarded as substantively significant by a scientific community, even if there is significant disagreement in that community over what a substantively significant result looks like.

This definition also makes it possible to begin constructing a quantifiable “conclusion theory” (in Tukey’s terms), as we can ask how different dimensions of preference and judgment factor into making some decision and then label the most unanimous decisions as “conclusions.” That is, in fact, how Nathan’s and my article proceeds.
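To make that procedure concrete, here is a minimal sketch in Python (mine alone, and far simpler than the method in our article): it posits a toy posterior distribution for an estimated effect and a hypothetical community whose members differ both in the effect size they consider substantively meaningful and in how much confidence they demand, then reports the share of the community that would accept the result.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy posterior for an estimated effect: mean 0.5, sd 0.2
# (a stand-in for whatever a real analysis would produce).
draws = rng.normal(loc=0.5, scale=0.2, size=100_000)

# A hypothetical community of evaluators, each with their own
# standard: a minimum effect size they consider substantively
# meaningful, and a minimum probability they demand before
# accepting a result as established.
community = [
    {"min_effect": m, "min_prob": p}
    for m in (0.0, 0.1, 0.2, 0.3)
    for p in (0.90, 0.95, 0.99)
]

def accepts(evaluator, draws):
    """Would this evaluator call the result substantively significant?"""
    prob_meaningful = np.mean(draws > evaluator["min_effect"])
    return prob_meaningful >= evaluator["min_prob"]

share = np.mean([accepts(e, draws) for e in community])
print(f"Share of the community accepting the result: {share:.0%}")
```

In this toy setup half the community accepts the result and half balks, so it is a defensible decision for some researchers but nowhere near the unanimity I would demand of a “conclusion.”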

Tukey might disagree, as certainty rather than unanimity characterizes his definition of a conclusion:

If nothing is to be concluded, only something decided, there is no need to control the probability of error. (Only the mathematical expectation of gain needs to be positive to make a small gamble profitable. There is no need for high confidence in winning individual bets. A coin which comes heads 60% of the time will win more money safely than one that comes heads 95% or 99% of the time.)
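Tukey’s gambling aside is easy to verify with a quick simulation (the bet sizes and counts here are mine, invented for illustration): confidence in any single bet never rises above 60%, yet because the expected gain per bet is positive, a gambler who bets steadily comes out ahead essentially every time.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 gamblers each place 1,000 even-money bets of $1 on a
# coin that comes up heads (a win) 60% of the time.
wins = rng.random((10_000, 1_000)) < 0.6
profits = np.where(wins, 1, -1).sum(axis=1)

print("Mean profit per gambler:", profits.mean())        # about +$200
print("Share of gamblers ahead:", (profits > 0).mean())  # essentially 1.0
print("Confidence in any single bet: 60%")               # still near a coin flip
```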

But I would argue back that scientific decisions (to accept some hypothesis, for example) do hinge on certainty. As Tukey himself points out, choosing to “believe in uncertainty” rather than to come to a firm conclusion is itself a conclusion, and it carries consequences of its own. In the context of hypothesis testing, the level of uncertainty determines whether accepting the null hypothesis (of no meaningful relationship that needs to be investigated and theorized about) is better than accepting the alternative (of some substantively meaningful relationship that must be integrated into our body of knowledge).
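To see how the level of uncertainty does that work, consider a textbook expected-loss comparison (the loss values below are invented for illustration): accepting the alternative is the better decision only once the probability of a real relationship clears a threshold set by the relative costs of the two errors.

```python
# Invented costs: wrongly theorizing about a spurious relationship
# is taken to be four times worse than overlooking a real one.
LOSS_FALSE_POSITIVE = 4.0
LOSS_FALSE_NEGATIVE = 1.0

def best_action(p_relationship):
    """Accept whichever hypothesis has the lower expected loss."""
    loss_accept_alt = (1 - p_relationship) * LOSS_FALSE_POSITIVE
    loss_accept_null = p_relationship * LOSS_FALSE_NEGATIVE
    return "alternative" if loss_accept_alt < loss_accept_null else "null"

for p in (0.5, 0.7, 0.8, 0.9):
    print(f"P(relationship) = {p:.2f} -> accept the {best_action(p)}")
```

With these costs the decision flips once P(relationship) exceeds 0.8; change the losses and the threshold moves, which is exactly the contentious choice of utility function that substantive robustness sidesteps.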

Now, perhaps it would make more sense to allow for three possibilities: no relationship, a relationship, and we don’t know whether there’s a relationship. That echoes some of Carlisle’s own (very good!) work, and I think I’d be favorably disposed to such a proposal. But my binary classification does make sense, fits well into our current schema of hypothesis testing, and, I think, integrates nicely into a “conclusion theory” that hinges on the consensus of the scientific community around some question.