I remember being taught as an undergraduate psychology student that replication, along with the principle of falsification, was a vital ingredient in the scientific method. But when I flipped through the pages of the journals (back in those pre-digital days), the question that frequently popped into my head was ‘Where are all these replications?’ It was a question I never dared actually ask in class, because I was sure I must simply have been missing something obvious. Now, about 30 years later, it turns out I was right to wonder.
In Chris Chambers’ magisterial new book The 7 Deadly Sins of Psychology, he reports that it wasn’t until 2012 that the first systematic study was conducted into the rate of replication within the field of psychology. Makel, Plucker and Hegarty searched for the term “replicat*” among the 321,411 articles published in the top 100 psychology journals between 1900 and 2012. Just 1.57 per cent of the articles contained this term, and among a randomly selected subsample of 500 papers from that 1.57 per cent, only 342 reported some form of replication – and of these, just 62 articles reported a direct replication of a previous experiment. On top of that, only 47 per cent of replications within the subsample were produced by independent researchers (p.50).
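To make those figures concrete, here is a rough back-of-the-envelope calculation using only the numbers quoted above. It assumes the subsample of 500 is representative of all the papers containing the search term; the variable names are my own.

```python
# Figures quoted in the paragraph above.
total_articles = 321_411
share_with_term = 0.0157        # 1.57 per cent contained "replicat*"
subsample_size = 500
actual_replications = 342       # subsample papers that really reported a replication

# Approximate number of articles that mention the term at all.
articles_with_term = total_articles * share_with_term

# Assumption: the subsample is representative, so the same share of
# term-containing papers are genuine replications across the whole set.
share_genuine = actual_replications / subsample_size
overall_replication_rate = share_with_term * share_genuine

print(f"Articles containing the term: about {articles_with_term:.0f}")
print(f"Estimated overall replication rate: {overall_replication_rate:.2%}")
```

On these numbers, only around one published article in a hundred reports a replication of any kind.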
Why does this matter? It seems that researchers, over a long period, have engaged in a variety of ‘Questionable Research Practices’ (QRPs), motivated by ambitions that are often shaped by the perverse incentives of the publishing industry.
A turning point occurred in 2011 when the Journal of Personality and Social Psychology published Daryl Bem’s now-notorious paper ‘Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect’. Taking a classic paradigm in which an emotional manipulation influences the speed of people’s responses on a subsequent task, Bem conducted a series of studies in which the experimental manipulation happened after participants made their responses. His results seemed to indicate that people’s responses were being influenced by a manipulation that hadn’t yet happened. There was general disbelief among the scientific community and Bem himself said that it was important for other researchers to attempt to replicate his findings. However, when the first – failed – replication was submitted to the same journal, they rejected it on the basis that their policy was to not publish replication studies, whether or not they were successful.
In fact, many top journals – e.g. Nature, Cortex, Brain, Psychological Science – explicitly state, in various ways, that they only publish findings that are novel. A December 2015 study in the British Medical Journal, which perhaps appeared too late for inclusion in Chambers’ book, found that over a forty-year period scientific abstracts had shown a steep increase in the use of words relating to novelty or importance (e.g. “novel”, “robust”, “innovative” and “unprecedented”). Clearly, then, researchers know what matters when it comes to getting published.
A minimum requirement, though not the only one, for a result being interesting is that it is statistically significant. In the logic of null hypothesis significance testing (NHST), this means that if chance were the only factor in producing a result, then the probability of getting a result at least that extreme would be less than 5 per cent (or less than 1/20). Thus, researchers hope that any of their key tests will lead to a p-value of less than .05, as – agreed by convention – this allows them to reject the null hypothesis in favour of their experimental hypothesis (the explanation that they are actually proposing, and which they may be invested in).
It is fairly easy to see how the academic journals could be – and almost certainly are – overpopulated with papers that claim evidential support for hypotheses that are false. For instance, suppose many different researchers test a hypothesis that is, unknown to them, incorrect. Perhaps just one researcher finds a significant result, which is a fluke result arising by chance. That one person is likely to get published, whereas the others will not. In reality, many researchers will not bother to submit their null findings. But here lies another problem. A single researcher may conduct several studies of the same hypothesis, but only attempt to publish the one (or ones) that turn out significant. He or she may feel a little guilty about this, but hey! – they have careers to progress and this is the system that the publishers have forced upon them.
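The arithmetic behind this is simple enough to sketch. Assuming the studies are independent and the hypothesis really is false, each test has a 5 per cent chance of a fluke "significant" result, so the chance that at least one of n studies comes up significant is 1 - 0.95^n. The function name below is my own, purely for illustration.

```python
def prob_at_least_one_false_positive(n_studies: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent tests of a false hypothesis
    comes out 'significant' purely by chance."""
    return 1.0 - (1.0 - alpha) ** n_studies


for n in (1, 5, 10, 20):
    print(f"{n:>2} studies: {prob_at_least_one_false_positive(n):.3f}")
```

With twenty independent attempts, whether by twenty labs or by one persistent researcher, a fluke is more likely than not (roughly 64 per cent).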
Replication is supposed to help discover which hypotheses are false and which are likely to be true. As we have seen, though, failed replications may never see the light of day. More problematic is the use of ‘conceptual replications’, in which a researcher tries to replicate a previous finding using a methodology that is, to a greater or lesser degree, novel. The researcher can claim to be “extending” the earlier research by testing the generality of its findings. Indeed, having this element of originality may increase the chances of publication. However, as Chambers notes, there are three problems with conceptual replications.
Firstly, how similar must the methods be in order that the new study can count as a replication, and who decides this? Secondly, there is a risk of certain findings becoming unreplicated. If a successful conceptual replication later on turns out to have produced its result through an entirely different causal mechanism, then the original study has just been unreplicated. Thirdly, attempts at conceptual replication can fuel confirmation bias. If the attempt at a conceptual replication produces a different result to the initial study, the authors of the first study will inevitably claim that their own results were not reproduced precisely because the attempted replication didn’t follow exactly the same methodology.
Chambers sums up the replication situation as follows:
To fit with the demands of journals, psychologists have thus replaced direct replication with conceptual replication, maintaining the comfortable but futile delusion that our science values replication while still satisfying the demands of novelty and originality (p.20).
Because psychologists frequently run studies with more than one independent variable, they typically use statistical tests that provide various main effects and interactions. Unfortunately, this can tempt researchers to operate with a degree of flexibility that isn’t warranted by the original hypothesis. They may engage in HARKing – Hypothesizing After the Results are Known. Suppose a researcher predicts a couple of main effects, but that these turn out to be non-significant once the analysis has been performed. Nonetheless, there are some unpredicted significant interactions within the results. The researcher now goes through a process of trying to rationalise why the results turned out this way. Having come up with an explanation, he or she now rewrites the hypotheses as though these results were what had been expected all along. Recent surveys show that psychologists believe the prevalence of HARKing to be somewhere between 40% and 90%, though the prevalence of those who admit to doing it themselves is, of course, much lower.
Another form of QRP is p-hacking. This refers to a cluster of practices whereby a researcher can illegitimately transform a non-significant result into a significant one. Suppose an experimental result has a p-value of .08, quite close to the magical threshold of .05 but also likely to be a barrier to publication. At this point, the researcher may try recruiting some new participants to the study in the hope that this will bring the p-value below .05. However, there will always be some variation in the way that participants respond, regardless of whether or not a hypothesis is true, so “peeking” at the results and recruiting new participants until p falls below .05 simply inflates the likelihood of obtaining a false positive result.
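A small simulation shows how much damage "peeking" can do. This is my own illustrative sketch, not something from the book: it assumes a simple two-sided z-test on normally distributed scores with a known standard deviation of 1, a true effect of exactly zero, and a researcher who starts with 20 participants and adds 5 at a time up to 100. All of those numbers are arbitrary choices.

```python
import math
import random


def p_value(sample):
    """Two-sided z-test of 'mean = 0', assuming a known standard deviation of 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def run_study(rng, peek, start_n=20, step=5, max_n=100, alpha=0.05):
    """Return True if the study ends 'significant' even though the null is true."""
    sample = [rng.gauss(0, 1) for _ in range(start_n)]
    if not peek:
        # Honest design: collect everyone first, then test exactly once.
        sample += [rng.gauss(0, 1) for _ in range(max_n - start_n)]
        return p_value(sample) < alpha
    while True:
        # Peeking design: test after every batch and stop as soon as p < alpha.
        if p_value(sample) < alpha:
            return True
        if len(sample) >= max_n:
            return False
        sample += [rng.gauss(0, 1) for _ in range(step)]


def false_positive_rate(peek, n_sims=2000, seed=1):
    rng = random.Random(seed)
    return sum(run_study(rng, peek) for _ in range(n_sims)) / n_sims


fixed = false_positive_rate(peek=False)
peeking = false_positive_rate(peek=True)
print(f"Test once at n=100:     {fixed:.3f}")
print(f"Peek every 5 from n=20: {peeking:.3f}")
```

The single-test design should land close to the nominal 5 per cent, while the peeking design comes out well above it, even though there is no real effect in either case.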
A second form of p-hacking is to analyse the data in different ways until you get the result you want. There is no single agreed method for the exclusion of ‘outliers’ in the data, so a researcher may run several analyses in which differing numbers of outliers are excluded, until a significant result is returned. Alternatively, there may be different forms of statistical test that can be applied. All tests are essentially estimates, and while equivalent-but-different tests will produce broadly similar results, the difference of a couple of decimal places or so may be all that is needed to transform a non-significant result into a significant one.
A third form of p-hacking is to change your dependent variables. For example, if three different measures of an effect are all just slightly non-significant, then a researcher might try integrating these into one measure to see if this brings the p-value below .05.
Several recent studies have examined the distributions of p-values in similar kinds of studies and have found that there is often a spike in p-values just below .05, which would appear to be indicative of p-hacking. The conclusion that follows from this is that many of the results in the psychological literature are likely to be false.
Chris Chambers also examines a number of other ways in which the scientific literature can be distorted by incorrect hypotheses. One such way is the hoarding of data. Many journals do not require, or even ask, that authors deposit their data with them. Authors themselves often refuse to provide data when a request is received, or will only provide it under certain restrictive conditions (almost certainly not legally enforceable). Yet one recent study found that statistical errors were more frequent in papers where the authors had failed to provide their data. Refusal to share may, of course, be one way of hiding misconduct. Chambers argues that data sharing should be the norm, not least because even the most scrupulous and honest authors may, over time, lose their own data, whether because of the updating of computer equipment or in the process of changing institutions. And, of course, everyone dies sooner or later. So why not ensure that all research data is held in accessible repositories?
Chapter 7 – The Sin of Bean Counting – covers some ground that I discussed in an earlier blog, when I reviewed Jerry Muller’s book The Tyranny of Metrics. Academic journals now have a ‘Journal Impact Factor’ (JIF), which uses the citation counts of their papers to index the overall quality of the work published in the journals. Yet a journal’s JIF is driven by only a very small proportion of the papers it carries. Most papers only have a small number of citations. Worse, the supposedly high impact journals are in fact the ones with the highest rates of retractions owing to fraud or suspected fraud. Chambers argues that it would be more accurate to refer to them as “high retraction” journals rather than “high impact” journals. The JIF is also easily massaged by editors and publishers, and, rather than being objectively calculated, is a matter of negotiation between the journals and the company that determines the JIF (Thomson Reuters).
Despite all the evidence that JIF is more-or-less worthless, the psychological community has become ensnared in a groupthink that lends it value.
It is used within academic institutions to help determine hiring and promotions, and even redundancies. Many would argue that JIF and other metrics have damaged the collegial atmosphere that one associates with universities, which in many instances have become arenas of overwork, stress and bullying.
Indeed, recent years have seen a number of instances of fraudulent behaviour by psychologists, most notably Diederik Stapel, who invented data for over 50 publications before eventually being exposed by a group of junior researchers and one of his own PhD students. By his own account, he began by engaging in “softer” acts of misrepresentation before graduating to more serious behaviours. Sadly, his PhD students, who had unwittingly incorporated his fraudulent results into their own PhDs (which they were allowed to retain), had their peer-reviewed papers withdrawn from the journals in which they had been published. Equally sad is ‘Kate’s Story’ (also recounted in Chapter 5) which describes the unjust treatment meted out to a young scientist after she was caught up in a fraud investigation against the Principal Investigator of the project she was working on, even though she was not the one who had reported him. Kate is reported as saying that if you suspect someone of making up data, but lack definitive proof, then do not expect any sympathy or support for speaking out.
Fortunately, Chris Chambers has given considerable thought as to how psychology’s replication crisis might be addressed. Indeed, he and a number of other psychologists have been instrumental in effecting some positive changes in academic publishing. His view is that it would be hopeless to try to address the biases (many likely unconscious) that researchers possess. Rather, it is the entire framework of the scientific and publishing enterprise which must be changed. His suggestions include:
- The pre-registration of studies. Researchers submit their research idea to a journal in advance of carrying out the work. This includes details of hypotheses to be tested, the methodology and the statistical analyses that will be used. If the peer reviewers are happy with the idea, then the journal commits to publication of the findings – however they turn out – if the researchers have indeed carried out the work in a satisfactory manner.
- The use of p-curve analyses to determine which fields in psychology are suffering from p-hacking.
- The use of disclosure statements. Joe Simmons and colleagues have pioneered a 21-word statement:
We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.
- Data sharing.
- Solutions to allow “optional stopping” during data collection. One method is to reduce the alpha-level every time a researcher “peeks” at the data. A second method is to use Bayesian hypothesis testing instead of NHST. Whereas NHST only actually tests the null hypothesis (and doesn’t provide an estimate of the likelihood of the null hypothesis), the Bayesian approach allows researchers to directly estimate the probability of the null hypothesis relative to the experimental hypothesis.
- Standardization of research practices. This may not always be possible, but where researchers conduct more than one type of analysis then the details of each should be reported and the robustness of the outcomes summarised.
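To give a flavour of the Bayesian alternative mentioned in the list above, here is a minimal sketch of a Bayes factor for the simplest possible case. It is my own illustration rather than a method taken from the book, and it rests on several assumptions: the data are normal with a known standard deviation of 1, the null hypothesis says the true mean is exactly 0, and the experimental hypothesis says the mean is drawn from a normal prior centred on 0 with standard deviation `tau`. The function name and the choice `tau=1.0` are hypothetical.

```python
import math


def bayes_factor_null(z: float, n: int, tau: float = 1.0) -> float:
    """BF01: how much more likely the data are under the null (mean = 0)
    than under the alternative (mean ~ Normal(0, tau^2)).

    z is the usual z statistic, i.e. sample mean * sqrt(n), with known sd = 1.
    Values above 1 favour the null; values below 1 favour the alternative.
    """
    shrink = (n * tau**2) / (1 + n * tau**2)
    return math.sqrt(1 + n * tau**2) * math.exp(-0.5 * z**2 * shrink)


for z in (0.0, 1.96, 3.0):
    print(f"z = {z:4.2f}, n = 100 -> BF01 = {bayes_factor_null(z, 100):.2f}")
```

If the two hypotheses are considered equally plausible beforehand, the Bayes factor equals the posterior odds of the null against the experimental hypothesis, which is the kind of direct comparison that NHST cannot give. Under these particular assumptions, a result sitting right at the p = .05 boundary (z = 1.96) with 100 participants does not favour the experimental hypothesis at all, although the answer depends on the prior chosen.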
Chambers devotes most space to the discussion of pre-registration. Many objections have been raised against this idea, and Chambers tackles these objections (convincingly, I think) in his Chapter 8: Redemption.
Although the issue of replication and QRPs is not unique to psychology, evidence indicates that it may be a bigger problem than for other disciplines. Therefore, if psychologists wish to be taken seriously then it is incumbent upon them to clean up their act. Fortunately, a number of psychologists – Chambers included – have been at the forefront of both uncovering poor practice and proposing ways to improve matters. A good starting point for anyone wanting to appreciate the scale of the problem and how to deal with it would be to read this book. Indeed, I think every university library should have at least one copy of this book on its shelves, and it should be on the reading list for classes in research methods and statistics. Despite being a book on methodology, I didn’t find it a dry read. On the contrary, it is something of a detective story – like Sherlock Holmes explaining how he worked out whodunnit – and, as such, I found it rather gripping.