Inferential Statistics and Replications

By Stephan Lewandowsky
Professor, School of Experimental Psychology and Cabot Institute, University of Bristol
and Klaus Oberauer
Posted on 10 October 2012
Filed under Cognition

When you drop a glass, it will crash to the floor. Wherever you are on this planet, and whatever glass it is you are disposing of, gravity will ensure its swift demise. The replicability of phenomena is one of the hallmarks of science: once we understand a natural "law", we expect it to yield the same outcome in any situation in which it is applicable. (This outcome may have error bars associated with it, but that doesn't affect our basic conclusion.)

Nobel-winning cognitive scientist Dan Kahneman recently voiced his concern about the apparent lack of replicability of some results in an area of social psychology that concerns itself with "social priming", the modification of people’s behavior without their awareness. For example, it has been reported that people walk out of the lab more slowly after being primed with words that relate to the concept “old age” (Bargh et al., 1996). Alas, notes Kahneman, those effects have at least sometimes failed to be reproduced by other researchers. Kahneman's concern is therefore understandable.

How can experiments fail to replicate? There are several possible reasons, but here we focus on the role of inferential statistics in scientific research generally. It isn't just social psychology that relies on statistics; many other disciplines do too. In a nutshell, statistics enables us to decide whether or not an observed effect is likely to have occurred simply by chance. Researchers routinely test whether an observed effect is “significant”. A “significant” effect is one that is so large that it is unlikely to arise from chance alone. An effect is declared “significant” if the probability of observing an effect this large or larger by chance alone is smaller than a pre-defined “significance level”, usually set to 5% (.05).
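
To make this concrete, here is a minimal sketch of how such a significance decision is typically reached. The walking-time numbers and the choice of a two-sample t-test are illustrative assumptions made for this example only; they are not the data or analysis of Bargh et al. (1996).

```python
# A minimal sketch of a significance test. The walking-time numbers and the
# two-sample t-test are illustrative assumptions, not the data or analysis
# of Bargh et al. (1996).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(loc=7.0, scale=1.0, size=30)  # hypothetical times (s), neutral primes
primed = rng.normal(loc=7.5, scale=1.0, size=30)   # hypothetical times (s), "old age" primes

t, p = stats.ttest_ind(primed, control)
print(f"t = {t:.2f}, p = {p:.4f}")
print("significant at the .05 level" if p < .05 else "not significant at the .05 level")
```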

So, statistics can help us decide whether people walked down the hallway more slowly by chance or because they were primed by “old” words. However, our conclusion that the effect is "real" and not due to chance is inevitably accompanied by some uncertainty.

Here is the rub: if the significance level is .05 (5%), then even when an effect is entirely due to chance, there is still a 1 in 20 chance that we will erroneously conclude it is real—or put another way, out of 20 experiments, there may be 1 that reports an effect when in fact that effect does not exist. This possibility can never be ruled out (although the probability can be minimized by various means).
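
As an illustration only, a small simulation makes the point: if many experiments are run in which there is truly no effect, roughly 5% of them will nonetheless cross the .05 threshold. The parameters below are assumptions made purely for this sketch.

```python
# A minimal sketch of the false-positive rate: both groups are always drawn
# from the same distribution (no real effect), yet about 5% of "experiments"
# come out "significant" at the .05 level. Parameters are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_experiments, n_per_group = 10_000, 30
false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=n_per_group)  # both samples come from the
    b = rng.normal(size=n_per_group)  # same distribution: no real effect
    if stats.ttest_ind(a, b).pvalue < .05:
        false_positives += 1

print(false_positives / n_experiments)  # close to 0.05
```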

There is one more catch: as an experimenter reporting a single experiment, one can never be 100% sure whether one's effect is real or due to chance. One can be very confident that the effect is real if such an effect would be extremely unlikely to arise by chance alone, but the possibility that one's experiment will fail to replicate can never be ruled out with absolute certainty.

So does this mean that we can never be sure of the resilience of an effect in psychological research?

No.

Quite the contrary: we know a great deal about how people function and how they think.

This is readily illustrated with Dan Kahneman's own work, as he has produced several benchmark results in cognitive science. Consider the following brief passage about a hypothetical person called Linda:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. In university, she was involved in several social issues, including the environment, the peace campaign, and the anti-nuclear campaign.

Do you think Linda is more likely to be a bank teller or a bank teller and an active feminist?

Every time this experiment is done—and we have performed it literally dozens of times in our classes—most people think Linda is more likely to be a feminist bank teller than a mere bank teller. After all, she was engaged in environmental issues, wasn't she?

However, this conclusion is false, and people's propensity to endorse it is known as the "conjunction fallacy" (Tversky & Kahneman, 1983). It's a fallacy because an event defined by multiple conditions can never be more likely than an event requiring only one of the constituent conditions: Because there are bound to be some bank tellers who are not feminists, Linda is necessarily more likely to be a bank teller than a bank teller and an active feminist.
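
The conjunction rule behind this can be made concrete with a tiny counting sketch. The numbers below are entirely hypothetical; the inequality holds no matter what counts one chooses.

```python
# A tiny counting sketch of the conjunction rule. The population and counts
# are entirely hypothetical; the inequality holds for any choice of numbers,
# because feminist bank tellers are a subset of bank tellers.
population = 1000           # hypothetical number of people "like Linda"
bank_tellers = 50           # hypothetical count of bank tellers
feminist_bank_tellers = 40  # hypothetical count; necessarily <= bank_tellers

p_teller = bank_tellers / population
p_feminist_teller = feminist_bank_tellers / population
assert p_feminist_teller <= p_teller  # P(teller and feminist) <= P(teller)
print(p_teller, p_feminist_teller)    # 0.05 0.04
```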

Replicable effects such as the conjunction fallacy are obviously not confined to cognitive science. In climate science, for example, the iconic "hockey stick", which shows that the current increase in global temperatures is unprecedented during the past several centuries if not millennia, has been replicated numerous times since Mann et al. published their seminal paper in 1998 (Briffa et al., 2001; Briffa et al., 2004; Cook et al., 2004; D'Arrigo et al., 2006; Esper et al., 2002; Hegerl et al., 2006; Huang et al., 2000; Juckes et al., 2007; Kaufman et al., 2009; Ljungqvist, 2010; Moberg et al., 2005; Oerlemans, 2005; Pollack & Smerdon, 2004; Rutherford et al., 2005; Smith et al., 2006).

Crucially, those replications relied on a variety of proxy measures to reconstruct past climates—from tree rings to boreholes to sediments and so on. The fact that all reconstructions arrive at the same conclusion therefore increases our confidence in the robustness of the hockey stick. The sum total of replications has provided future generations with a very strong scientific (and moral) signal by which to evaluate our collective response to climate change at the beginning of the 21st century.

Let us now illustrate the specifics of the replication process in the context of one of my recent papers, written with colleagues Klaus Oberauer and Gilles Gignac, which showed (among other things) that conspiracist ideation predicted rejection of a range of scientific propositions, from the link between smoking and lung cancer to the fact that the globe is warming due to human greenhouse gas emissions. This effect was highly significant, but the possibility that it represented a statistical fluke—though seemingly unlikely—cannot be ruled out.
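
For readers unfamiliar with what "predicted" means in this statistical sense, here is a minimal, purely illustrative sketch of a predictive association between two sets of scores. The simulated data below are not the paper's data, and the paper's actual analysis involved latent variables in a regression model rather than a simple correlation of raw scores.

```python
# A minimal, purely illustrative sketch of a "predictive" association between
# two sets of scores. The simulated data are not the paper's data, and the
# paper's actual analysis involved latent variables rather than raw scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200
conspiracist_ideation = rng.normal(size=n)              # hypothetical scores
rejection_of_science = (0.3 * conspiracist_ideation     # hypothetical association
                        + rng.normal(size=n))

r, p = stats.pearsonr(conspiracist_ideation, rejection_of_science)
print(f"r = {r:.2f}, p = {p:.4f}")  # expected: a clearly positive association
```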

A replication of the study would thus be helpful to buttress one's confidence in the result.

But that doesn't mean it should be the exact same study done over again. On the contrary, the next study should differ slightly, so that replication of the effect would underscore its breadth and resilience and strengthen its theoretical impact.

For example, one might want to conduct the study using a large representative sample of the U.S. population, the kind of sample that professional survey and market research companies specialize in.

One might refine the set of items based on the results of the first study. One might provide a "neutral" option for the items this time round: the literature recognizes both strengths and weaknesses of including a neutral response option, so running the survey both ways and getting the same result would be particularly helpful.

One might also expand the set of potential worldview predictors, and one might query other controversial scientific propositions, such as GM foods and vaccinations—both said to be rejected by the political Left even though data on that claim are sparse.

Yes, such a replication would be quite helpful.

References

Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71, 230-244.

Briffa, K. R., et al. (2001). Low-frequency temperature variations from a northern tree ring density network. Journal of Geophysical Research, 106(D3), 2929-2941.

Briffa, K. R., Osborn, T. J., & Schweingruber, F. H. (2004). Large-scale temperature inferences from tree rings: A review. Global and Planetary Change, 40(1-2), 11-26.

Cook, E. R., Esper, J., & D'Arrigo, R. D. (2004). Extra-tropical Northern Hemisphere land temperature variability over the past 1000 years. Quaternary Science Reviews, 23(20-22), 2063-2074.

D'Arrigo, R., Wilson, R., & Jacoby, G. (2006). On the long-term context for late twentieth century warming. Journal of Geophysical Research, 111(D3). doi:10.1029/2005JD006352

Esper, J., Cook, E. R., & Schweingruber, F. H. (2002). Low-frequency signals in long tree-ring chronologies for reconstructing past temperature variability. Science, 295(5563), 2250-2253.

Hegerl, G. C., Crowley, T. J., Hyde, W. T., & Frame, D. J. (2006). Climate sensitivity constrained by temperature reconstructions over the past seven centuries. Nature, 440, 1029-1032.

Huang, S., Pollack, H. N., & Shen, P.-Y. (2000). Temperature trends over the past five centuries reconstructed from borehole temperatures. Nature, 403, 756-758.

Juckes, M. N., et al. (2007). Millennial temperature reconstruction intercomparison and evaluation. Climate of the Past, 3, 591-609.

Kaufman, D. S., et al. (2009). Recent warming reverses long-term Arctic cooling. Science, 325, 1236.

Ljungqvist, F. C. (2010). A new reconstruction of temperature variability in the extra-tropical Northern Hemisphere during the last two millennia. Geografiska Annaler, 92A, 339-351.

Mann, M. E., Bradley, R. S., & Hughes, M. K. (1998). Global-scale temperature patterns and climate forcing over the past six centuries. Nature, 392, 779-787.

Moberg, A., et al. (2005). Highly variable Northern Hemisphere temperatures reconstructed from low- and high-resolution proxy data. Nature, 433(7026), 613-617.

Oerlemans, J. (2005). Extracting a climate signal from 169 glacier records. Science, 308(5722), 675-677.

Pollack, H. N., & Smerdon, J. E. (2004). Borehole climate reconstructions: Spatial structure and hemispheric averages. Journal of Geophysical Research, 109(D11), D11106. doi:10.1029/2003JD004163

Smith, C. L., Baker, A., Fairchild, I. J., Frisia, S., & Borsato, A. (2006). Reconstructing hemispheric-scale climates from multiple stalagmite records. International Journal of Climatology, 26, 1417-1424.

Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90, 293-315.

Wahl, E. R., & Ammann, C. R. (2007). Robustness of the Mann, Bradley, Hughes reconstruction of Northern Hemisphere surface temperatures: Examination of criticisms based on the nature and processing of proxy climate evidence. Climatic Change, 85, 33-69.

Comments

  1. Could it be that the conjunction fallacy is a result of the question being understood differently than actually stated by the general public? Could it be that the question is not understood as "Do you think Linda is more likely to be a bank teller or a bank teller and an active feminist?" but rather as "Assuming Linda is a bank teller, is she more likely to be an active feminist or not?"

    I consider it likely that this alternative explanation has been discussed in the literature. Could you please comment on it?
  2. @bluegrue yes, people differing in their interpretation of the question is one of the alternative accounts of the conjunction fallacy. Gerd Gigerenzer, a psychologist at the Max Planck Institute, has published a criticism along these lines.
  3. Stephan Lewandowsky at 12:27 PM on 13 October, 2012
    @1 and @2: Yes, there are alternative explanations of the conjunction fallacy (e.g., if I remember correctly, Peter Juslin and Henrik Olsson have also done some work on this). However, the interpretation does not impact the basic replicability and reliability of the phenomenon. Linda is always thought to be a feminist bank teller.
  4. @1. bluegrue
    Could it be that the conjunction fallacy is a result of the question being understood differently than actually stated by the general public?

    I sympathise with your question because the way the Linda problem is presented here doesn't strike me as very easy to follow. I realise that it depends on the way people think to begin with. I first heard of the Linda problem in Kahneman's book "Thinking, Fast and Slow", and I have a very visual way of understanding, so when I read Kahneman saying that we should "think in terms of Venn diagrams" I immediately started to get a better handle on the issue.

    Also, the method Kahneman used gives me more of a sense of surprise than I get from the formulation above. If I may describe the way Kahneman presented it:

    What the respondents are asked to do in Kahneman's test is to rank the list below in order of resemblance, or likelihood of applying, to Linda:

    Linda is a teacher in elementary school.
    Linda works in a bookstore and takes yoga classes.
    Linda is active in the feminist movement.
    Linda is a psychiatric social worker.
    Linda is a member of the League of Women Voters.
    Linda is a bank teller.
    Linda is an insurance salesperson.
    Linda is a bank teller and is active in the feminist movement.



    The idea of forming an either-or question is now less imposing, I think, and leaves more scope for the respondents to simply make their own assumptions; in fact, if you read down Kahneman's list above you may see that a priming now exists that would decrease the chance of illogical responses.

    Kahneman says:

    We made up the questionnaire as you saw it, with "bank teller" in the sixth position in the list and "feminist bank teller" as the last item. We were convinced that subjects would notice the relation between the two outcomes, and that their rankings would be consistent with logic.

    Yet the consensus response was still to rank the last item “feminist bank teller” above “bank teller”! I love Kahneman’s personal notes of his feelings when he saw this result; he says:
    I was so surprised that I still retain a "flashbulb memory" of the gray color of the metal desk and of where everyone was when I made that discovery.

    The bottom line, it seems, is that in cases like these giving more information can sometimes actually prompt people to inject their own assumptions and make less reliable judgements, because their prejudices or “intuitions” are given more freedom to exert their influence.

    I heartily recommend Kahneman’s "Thinking Fast And Slow" to any fellow lay people out there. :)
  5. Thank you for the thought provoking article.

    How these studies are reported may also contribute to a poor understanding of the science of climate change, indeed of most science. Even if you were to replicate the exact same study many times, the p value would vary. I have always been uncomfortable with the arbitrary division of results into 'significant' and 'non-significant', and the lay person may well ask 'what is considered significant?'. Even scientists can sometimes get carried away with this dichotomous approach.

    How do you think reporting all science (especially controversial scientific propositions such as the increase in global temperatures) using confidence intervals instead (as recommended in the APA Publication Manual as the 'best' way to report results) would affect the climate change conversation?

    Regards,
    Stephanie
  6. Stephan Lewandowsky at 16:14 PM on 14 October, 2012
    @5: I agree that the dichotomization between "significant" and "non-significant" has serious problems. For example, I have seen too many students dismissing an effect as being non-existent because p=.058.

    I am less certain that confidence intervals are really the panacea that some people consider them to be (i.e., they are still based on conventional frequentist logic, and they are effectively just a transformation of the p-value).

    I personally think that a Bayesian approach has numerous advantages because it can deliver what we really want--namely, the probabilities associated with the various competing hypotheses. Within cognitive science, the Bayesian revolution is well on the way and I'd bet that within 5-10 years most research will be expressed in Bayesian terms.

    In climate science, Annan and Hargreaves (Annan, J. D. & Hargreaves, J. C. On the generation and interpretation of probabilistic estimates of climate sensitivity Climatic Change, 2011, 104, 423-436) have created a relevant Bayesian precedent.
  7. re: 6
    I certainly agree; turning a continuous quantity into a binary decision has never made sense to me, from the first stats course I took.
    For those interested in Bayesian turf, Andrew Gelman's blog often has good discussions.
  8. Thanks for the feedback.
  9. @6 & 7 (Steve and John)
    Thanks for the comments, and also the recommended reading. Not familiar with Bayesian, but now plan to be.
  10. (-snip-).

    (-snip-).
    Moderator Response: Intentional misquoting and strawman argumentation snipped.
  11. I think it would be useful if the authors defined what they mean by "replication" and "prediction".
  12. From the above they are not using them in a way that mainstream statistics would recognise.
    Moderator Response: Non sequitur. You fail to demonstrate that they are not being used appropriately.
  13. Stephan Lewandowsky at 00:45 AM on 17 October, 2012
    @11 and @12: "Replication" simply means running a similar--but preferably not identical--experiment or study again and obtaining the same effects. You learn more about the effect if the procedure is changed slightly rather than repeated exactly.

    "Prediction" in this instance refers to the independent (latent) variables in the regression model: It simply means that if you know one thing you can predict another (e.g., height from shoe size). This should not be taken to imply causality.
  14. Replication: In experiments in chemistry, physics, etc., one can rerun the same experiment, given an adequate description, and expect to get identical results within measurement error.

    But social sciences and medical research do not work that way, since humans are not electrons and many studies simply cannot be run twice on the same subjects.

    Some people don't seem to understand this :-)
  15. Ted Kirkpatrick at 05:24 AM on 17 October, 2012
    A statistical nit (but an important one):

    "put another way, out of 20 experiments, there may be 1 that reports an effect when in fact that effect does not exist. "

    In fact, out of 20 experiments, all of them---or none---could be reporting an effect when it does not exist. And within a frequentist perspective, we can never compute the probability of that occurring. Whereas the formula .05^20 only gives us the probability of 20 false positives assuming the nonexistence of actual effects, the more interesting probability, the likelihood of false positives in the presence of one or more actual effects, is not computable.

    The Bayesian perspective allows us to compute that "interesting probability" but requires expert judgement to select prior distributions for the effects.

    This point doesn't reduce the reliability of well-established results in psychology or climate science. Rather, it shifts our focus to the real basis of our confidence: The web of theories and convergent lines of evidence supporting the results, together with the absence of credible alternative explanations of that evidence. Statistical significance is just a small counterfactual part ("if our effect didn't exist at all, our results would only be this probable") of the bigger argument. I recall that Ronald Fisher made this point when he described the null hypothesis framework, although I can't give an actual quote.
  16. To understand replication and prediction one has to go back to the underlying probabilistic models one is assuming. This applies as much in the social sciences as it does in the physical sciences because it is the assumptions about these models that allow the use of statistical techniques including statistical inference.

    Being formal here also helps one to be explicit about hypothesis forming, testing, and measurement issues, because it forces formality in dealing with the observed data and the underlying (assumed) probabilistic model. This is particularly true in the social sciences, where both the theoretical constructs (the underlying probabilistic model) and measurement are often ill-developed.

    If your underlying model assumes certain statistical behaviours in the constructs then multiple sample measurements can provide additional information about the population (i.e. the model). Testing other aspects of a model using additional experiments is obviously desirable, but not normally referred to as replication (think "hitting a ball" rather than "dropping a glass" as a means to investigate Newtonian mechanics).

    Sitting behind all this is the testing of the underlying model: does it conform to the probabilistic assumptions, and can it in principle be falsified? To use a model for statistical inference (and this includes prediction) one expects any model to have been demonstrated as complying with its basic assumptions and to have been shown not to have been falsified. With the latter, testing the model on observations that were not used in its construction is the minimum (and doing this is not replication; it is validation).

    Now L. et al fails on all accounts. There is no statement of the underlying theoretical probabilistic model (rather, a model is developed based on the data; this could be tested/validated in the future), the measures are idiosyncratic, the sample is not drawn in a way that allows any of the statistical tests used to draw inferences about the population to do that, there is no testing of the data to show that the underlying assumptions do in fact apply, and (as noted) there is no validation of the model on independent data sets, which is a basic requirement if one is to use it to start to make inferences, including predictions.

    And yet the authors claim it "showed (among other things) that conspiracist ideation predicted rejection of a range of scientific propositions".

    It doesn't.
  17. Nonscientists who are only familiar with the rather narrow meaning of "replication" in statistics may reasonably be confused by the broader meaning of the term as used by scientists. For example, in a single experiment, I may include multiple "replicates." If I were using statistical instead of scientific jargon, however, I would call them "samples" -- these are intended to be multiple measurements of the same thing, but which will exhibit some degree of random variation due to the accuracy limits of the measurement method and perhaps small variations in the way the sample was taken. Then I will generally carry out multiple "replicate experiments." This comes closest to what a statistician or philosopher of science might call replication. I'm repeating the same experiment as exactly as possible. This may be in part to get an idea of whether there is some uncontrolled but variable factor that changes over time and that adds additional variance above that inherent in the preparation and measurement of the samples. It is additionally a check for errors.

    On the other hand, I may attempt to replicate another scientist's conclusions. In this case, I generally will not attempt to reproduce his experiment exactly; instead, I will do an "equivalent" experiment. I may use a different experimental system, different measurement method, additional controls. If I have any reservations about his methods, or if the technology has advanced, I will attempt to improve upon them. What am I replicating, then? I am replicating the logic of his experiment. The changes that I make are expected (if my theoretical understanding is correct) not to alter the outcome or conclusions. And I'll probably also do some more stuff as well, to extend or broaden the conclusions of the original author. This will make my results of greater interest to other scientists (and therefore more publishable), because they are testing not merely whether the original experimenter "got it right" in a technical or statistical sense, but also whether the conclusion is robust--in other words, whether we correctly understand which variables are consequential for the outcome and which are not.

    There's really only one circumstance in which I will attempt to exactly reproduce another scientist's experiment, and that is if my attempt to replicate his conclusions fails. In that case, I will wonder if there is some unrecognized uncontrolled variable, so I will try both approaches side-by-side to see if it makes a difference. In particularly confusing cases of conflicting results, the two scientists may get together and do the same experiment side-by-side.
  18. This is going to be great fun:) Pass the popcorn.
  19. I don't really think trrll at 09:15 AM was intended as a response to my comment, but just in case it was I'd just note that the post on which we are commenting is to do with replication and statistical inference.

    On the other hand if trrll's rather whimsical description of a scientist's experience using the word "replication" is simply to keep the thread going, I'd observe in the same spirit that it is also apparently an album by indie-metal band KanZer (h/t Wikipedia).
  20. HAS at 06:16 AM on 17 October, 2012 ... makes some very good points.

    I marvel here that posting surveys on-line, having little control over the type or indeed the quality of the replies, and then applying sophisticated statistical analyses to the survey results can actually generate a publishable and peer-reviewed paper.

    Whilst I do not know whether this is a good or a bad thing, and also acknowledging that the underlying information sought is often of great import...

    ...I can't help contrasting it with research in engineering and physics and thinking, "This isn't exactly rocket science, is it?"
  21. Nope, that's a two-semester course.