Review: The 7 Deadly Sins of Psychology

I remember being taught as an undergraduate psychology student that replication, along with the principle of falsification, was a vital ingredient in the scientific method. But when I flipped through the pages of the journals (back in those pre-digital days), the question that frequently popped into my head was ‘Where are all these replications?’ It was a question I never dared actually ask in class, because I was sure I must simply have been missing something obvious. Now, about 30 years later, it turns out I was right to wonder.

In Chris Chambers’ magisterial new book The 7 Deadly Sins of Psychology, he reports that it wasn’t until 2012 that the first systematic study was conducted into the rate of replication within the field of psychology. Makel, Plucker and Hegarty searched for the term “replicat*” among the 321,411 articles published in the top 100 psychology journals between 1900 and 2012. Just 1.57 per cent of the articles contained this term, and among a randomly selected subsample of 500 papers from that 1.57 per cent,

only 342 reported some form of replication – and of these, just 62 articles reported a direct replication of a previous experiment. On top of that, only 47 per cent of replications within the subsample were produced by independent researchers (p.50).
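To get a feel for what these figures imply, here is a minimal sketch of the arithmetic, using only the numbers quoted above. Extrapolating from the subsample to the whole literature assumes that the 500 papers are representative of all the articles containing the term; the variable names are my own.

```python
# Figures as quoted above from Makel, Plucker and Hegarty (via Chambers).
total_articles = 321_411
share_with_term = 0.0157        # 1.57% of articles contained "replicat*"
subsample_size = 500
actual_replications = 342       # subsample papers reporting some form of replication
direct_replications = 62        # of those, direct replications

# Roughly how many articles contained the search term at all.
articles_with_term = total_articles * share_with_term

# Share of the subsample that really was a replication.
replication_share = actual_replications / subsample_size

# Assumption: the subsample is representative, so we can scale up.
estimated_overall_rate = share_with_term * replication_share

# Share of the replications that were direct replications.
direct_share = direct_replications / actual_replications

print(f"articles containing the term: about {articles_with_term:.0f}")
print(f"share of subsample that were replications: {replication_share:.1%}")
print(f"estimated replication rate across all articles: {estimated_overall_rate:.2%}")
print(f"direct replications among the replications: {direct_share:.1%}")
```

On these numbers, only about one published article in a hundred reported a replication of any kind.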

Why does this matter? It seems that researchers, over a long period, have engaged in a variety of ‘Questionable Research Practices’ (QRPs), motivated by ambitions that are often shaped by the perverse incentives of the publishing industry.

A turning point occurred in 2011 when the Journal of Personality and Social Psychology published Daryl Bem’s now-notorious paper ‘Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect’. Taking a classic paradigm in which an emotional manipulation influences the speed of people’s responses on a subsequent task, Bem conducted a series of studies in which the experimental manipulation happened after participants made their responses. His results seemed to indicate that people’s responses were being influenced by a manipulation that hadn’t yet happened. There was general disbelief among the scientific community and Bem himself said that it was important for other researchers to attempt to replicate his findings. However, when the first – failed – replication was submitted to the same journal, they rejected it on the basis that their policy was to not publish replication studies, whether or not they were successful.

In fact, many top journals – e.g. Nature, Cortex, Brain, Psychological Science – explicitly state, in various ways, that they only publish findings that are novel. A December 2015 study in the British Medical Journal, which perhaps appeared too late for inclusion in Chambers’ book, found that over a forty-year period scientific abstracts had shown a steep increase in the use of words relating to novelty or importance (e.g. “novel”, “robust”, “innovative” and “unprecedented”). Clearly, then, researchers know what matters when it comes to getting published.

A minimum requirement, though not the only one, for a result being interesting is that it is statistically significant. In the logic of null hypothesis significance testing (NHST), this means that if chance were the only factor in producing a result, then the probability of getting a result at least that extreme would be less than 5 per cent (or less than 1/20). Thus, researchers hope that any of their key tests will lead to a p-value of less than .05, as – agreed by convention – this allows them to reject the null hypothesis in favour of their experimental hypothesis (the explanation that they are actually proposing, and which they may be invested in).

It is fairly easy to see how the academic journals could be – and almost certainly are – overpopulated with papers that claim evidential support for hypotheses that are false. For instance, suppose many different researchers test a hypothesis that is, unknown to them, incorrect. Perhaps just one researcher finds a significant result, which is a fluke result arising by chance. That one person is likely to get published, whereas the others will not. In reality, many researchers will not bother to submit their null findings. But here lies another problem. A single researcher may conduct several studies of the same hypothesis, but only attempt to publish the one (or ones) that turn out significant. He or she may feel a little guilty about this, but hey! – they have careers to progress and this is the system that the publishers have forced upon them.
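The fluke scenario is easy to put numbers on. The sketch below assumes that the studies are independent, that the hypothesis really is false, and that each study uses the conventional .05 threshold; the function name is hypothetical and simply illustrates the calculation.

```python
def prob_at_least_one_false_positive(n_studies, alpha=0.05):
    """Chance that at least one of n independent tests of a false
    hypothesis comes out 'significant' purely by chance."""
    # Each study avoids a false positive with probability (1 - alpha),
    # so all n avoid one with probability (1 - alpha) ** n.
    return 1 - (1 - alpha) ** n_studies

for n in (1, 5, 20):
    print(n, "studies:", round(prob_at_least_one_false_positive(n), 3))
```

With twenty such studies, whether run by twenty labs or by one persistent researcher, the chance of at least one publishable fluke is closer to two in three than to one in twenty.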

Replication is supposed to help discover which hypotheses are false and which are likely to be true. As we have seen, though, failed replications may never see the light of day. More problematic is the use of ‘conceptual replications’, in which a researcher tries to replicate a previous finding using a methodology that is, to a greater or lesser degree, novel. The researcher can claim to be “extending” the earlier research by testing the generality of its findings. Indeed, having this element of originality may increase the chances of publication. However, as Chambers notes, there are three problems with conceptual replications.

Firstly, how similar must the methods be in order that the new study can count as a replication, and who decides this? Secondly, there is a risk of certain findings becoming unreplicated. If a successful conceptual replication later on turns out to have produced its result through an entirely different causal mechanism, then the original study has just been unreplicated. Thirdly, attempts at conceptual replication can fuel confirmation bias. If the attempt at a conceptual replication produces a different result to the initial study, the authors of the first study will inevitably claim that their own results were not reproduced precisely because the attempted replication didn’t follow exactly the same methodology.

Chambers sums up the replication situation as follows:

To fit with the demands of journals, psychologists have thus replaced direct replication with conceptual replication, maintaining the comfortable but futile delusion that our science values replication while still satisfying the demands of novelty and originality (p.20).

Because psychologists frequently run studies with more than one independent variable, they typically use statistical tests that provide various main effects and interactions. Unfortunately, this can tempt researchers to operate with a degree of flexibility that isn’t warranted by the original hypothesis. They may engage in HARKing – Hypothesizing After the Results are Known. Suppose a researcher predicts a couple of main effects, but that these turn out to be non-significant once the analysis has been performed. Nonetheless, there are some unpredicted significant interactions within the results. The researcher now goes through a process of trying to rationalise why the results turned out this way. Having come up with an explanation, he or she now rewrites the hypotheses as though these results were what had been expected all along. Recent surveys show that psychologists believe the prevalence of HARKing to be somewhere between 40% and 90%, though the prevalence of those who admit to doing it themselves is, of course, much lower.

Another form of QRP is p-hacking. This refers to a cluster of practices whereby a researcher can illegitimately transform a non-significant result into a significant one. Suppose an experimental result has a p-value of .08, quite close to the magical threshold of .05 but also likely to be a barrier to publication. At this point, the researcher may try recruiting some new participants to the study in the hope that this will push the p-value below .05. However, because there will always be some variation in the way that participants respond, regardless of whether or not a hypothesis is true, “peeking” at the results and recruiting new participants until p falls below .05 simply inflates the likelihood of obtaining a false positive result.
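A small simulation makes the inflation visible. This is only a sketch under assumptions of my own choosing: the hypothesis is false (every score is drawn from a standard normal distribution), the test is a simple two-sided z-test with a known standard deviation of 1, and the "peeking" researcher checks the result at 20 participants and then after every further 10, up to a maximum of 100.

```python
import random
from statistics import NormalDist

def two_sided_p(sample):
    """z-test of 'mean = 0', assuming the population SD is known to be 1."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def run_study(rng, peek, start_n=20, step=10, max_n=100, alpha=0.05):
    """Return True if the study ends 'significant' even though the null is true."""
    data = [rng.gauss(0, 1) for _ in range(start_n)]
    if not peek:
        # Honest design: collect all the planned data, then test once.
        data += [rng.gauss(0, 1) for _ in range(max_n - start_n)]
        return two_sided_p(data) < alpha
    while True:
        # Peeking design: test, and stop as soon as p drops below alpha.
        if two_sided_p(data) < alpha:
            return True
        if len(data) >= max_n:
            return False
        data += [rng.gauss(0, 1) for _ in range(step)]

def false_positive_rate(peek, n_sims=2000, seed=1):
    rng = random.Random(seed)
    return sum(run_study(rng, peek) for _ in range(n_sims)) / n_sims

fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"fixed sample size: {fixed_rate:.3f}, peeking: {peeking_rate:.3f}")
```

The fixed-sample design should land near the nominal 5 per cent, while the peeking design produces false positives noticeably more often, even though nothing else about the study has changed.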

A second form of p-hacking is to analyse the data in different ways until you get the result you want. There is no single agreed method for the exclusion of ‘outliers’ in the data, so a researcher may run several analyses in which differing numbers of outliers are excluded, until a significant result is returned. Alternatively, there may be different forms of statistical test that can be applied. All tests are essentially estimates, and while equivalent-but-different tests will produce broadly similar results, the difference of a couple of decimal places or so may be all that is needed to transform a non-significant result into a significant one.

A third form of p-hacking is to change your dependent variables. For example, if three different measures of an effect are all just slightly non-significant, then a researcher might try integrating these into one measure to see if this brings the p-value below .05.

Several recent studies have examined the distributions of p-values in similar kinds of studies and have found that there is often a spike in p-values just below .05, which would appear to be indicative of p-hacking. The conclusion that follows from this is that many of the results in the psychological literature are likely to be false.

Chris Chambers also examines a number of other ways in which the scientific literature can be distorted by incorrect hypotheses. One such way is the hoarding of data. Many journals do not require, or even ask, that authors deposit their data with them. Authors themselves often refuse to provide data when a request is received, or will only provide it under certain restrictive conditions (almost certainly not legally enforceable). Yet one recent study found that statistical errors were more frequent in papers where the authors had failed to provide their data. Refusal to share may, of course, be one way of hiding misconduct. Chambers argues that data sharing should be the norm, not least because even the most scrupulous and honest authors may, over time, lose their own data, whether because of the updating of computer equipment or in the process of changing institutions. And, of course, everyone dies sooner or later. So why not ensure that all research data is held in accessible repositories?

Chapter 7 – The Sin of Bean Counting – covers some ground that I discussed in an earlier blog, when I reviewed Jerry Muller’s book The Tyranny of Metrics. Academic journals now have a ‘Journal Impact Factor’ (JIF), which uses the citation counts of their papers to index the overall quality of the work published in the journals. Yet a journal’s JIF is accounted for only by a very small proportion of the papers it carries. Most papers only have a small number of citations. Worse, the supposedly high impact journals are in fact the ones with the highest rates of retractions owing to fraud or suspected fraud. Chambers argues that it would be more accurate to refer to them as “high retraction” journals rather than “high impact” journals. The JIF is also easily massaged by editors and publishers, and, rather than being objectively calculated, is a matter of negotiation between the journals and the company that determines the JIF (Thomson Reuters).


Despite all the evidence that JIF is more-or-less worthless, the psychological community has become ensnared in a groupthink that lends it value.

It is used within academic institutions to help determine hiring and promotions, and even redundancies. Many would argue that JIF and other metrics have damaged the collegial atmosphere that one associates with universities, which in many instances have become arenas of overwork, stress and bullying.

Indeed, recent years have seen a number of instances of fraudulent behaviour by psychologists, most notably Diederik Stapel, who invented data for over 50 publications before eventually being exposed by a group of junior researchers and one of his own PhD students. By his own account, he began by engaging in “softer” acts of misrepresentation before graduating to more serious behaviours. Sadly, his PhD students, who had unwittingly incorporated his fraudulent results into their own PhDs (which they were allowed to retain) had their peer-reviewed papers withdrawn from the journals in which they had been published. Equally sad is ‘Kate’s Story’ (also recounted in Chapter 5) which describes the unjust treatment meted out to a young scientist after she was caught up in a fraud investigation against the Principal Investigator of the project she was working on, even though she was not the one who had reported him. Kate is reported as saying that if you suspect someone of making up data, but lack definitive proof, then do not expect any sympathy or support for speaking out.

Fortunately, Chris Chambers has given considerable thought as to how psychology’s replication crisis might be addressed. Indeed, he and a number of other psychologists have been instrumental in effecting some positive changes in academic publishing. His view is that it would be hopeless to try to address the biases (many likely unconscious) that researchers possess. Rather, it is the entire framework of the scientific and publishing enterprise which must be changed. His suggestions include:

  • The pre-registration of studies. Researchers submit their research idea to a journal in advance of carrying out the work. This includes details of hypotheses to be tested, the methodology and the statistical analyses that will be used. If the peer reviewers are happy with the idea, then the journal commits to publication of the findings – however they turn out – if the researchers have indeed carried out the work in a satisfactory manner.
  • The use of p-curve analyses to determine which fields in psychology are suffering from p-hacking.
  • The use of disclosure statements. Joe Simmons and colleagues have pioneered a 21-word statement:

We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.

  • Data sharing.
  • Solutions to allow “optional stopping” during data collection. One method is to reduce the alpha-level every time a researcher “peeks” at the data. A second method is to use Bayesian hypothesis testing instead of NHST. Whereas NHST only actually tests the null hypothesis (and doesn’t provide an estimate of the likelihood of the null hypothesis), the Bayesian approach allows researchers to directly estimate the probability of the null hypothesis relative to the experimental hypothesis.
  • Standardization of research practices. This may not always be possible, but where researchers conduct more than one type of analysis then the details of each should be reported and the robustness of the outcomes summarised.
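The first of the "optional stopping" methods can be sketched in a few lines. I should stress the assumptions: the book's preferred correction may well differ, and here I simply use a Bonferroni-style split, dividing the .05 alpha-level equally across nine planned peeks. The data are simulated under a true null hypothesis and tested with a two-sided z-test assuming a known standard deviation of 1; all the function names are hypothetical.

```python
import random
from statistics import NormalDist

def p_value(total, n):
    """Two-sided z-test p-value for 'mean = 0', assuming a known SD of 1."""
    z = total / n ** 0.5          # same as (total / n) * sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def significant_at_any_look(rng, looks, threshold):
    """Simulate one null study, testing at each planned sample size in `looks`."""
    total, n = 0.0, 0
    for target_n in looks:
        while n < target_n:
            total += rng.gauss(0, 1)
            n += 1
        if p_value(total, n) < threshold:
            return True           # researcher stops and declares an effect
    return False

def error_rate(threshold, looks, n_sims=2000, seed=2):
    rng = random.Random(seed)
    hits = sum(significant_at_any_look(rng, looks, threshold) for _ in range(n_sims))
    return hits / n_sims

looks = list(range(20, 101, 10))                   # nine peeks: n = 20, 30, ..., 100
uncorrected = error_rate(0.05, looks)              # test at .05 on every peek
corrected = error_rate(0.05 / len(looks), looks)   # assumed Bonferroni-style split
print(f"uncorrected: {uncorrected:.3f}, corrected: {corrected:.3f}")
```

Splitting alpha equally is a blunt and conservative choice, but it shows the principle: the stricter threshold at each peek keeps the overall false positive rate at or below the nominal 5 per cent, at some cost in statistical power.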

Chambers devotes most space to the discussion of pre-registration. Many objections have been raised against this idea, and Chambers tackles these objections (convincingly, I think) in his Chapter 8: Redemption.

Although the issue of replication and QRPs is not unique to psychology, evidence indicates that it may be a bigger problem there than in other disciplines. Therefore, if psychologists wish to be taken seriously then it is incumbent upon them to clean up their act. Fortunately, a number of psychologists – Chambers included – have been at the forefront of both uncovering poor practice and proposing ways to improve matters. A good starting point for anyone wanting to appreciate the scale of the problem and how to deal with it would be to read this book. Indeed, I think every university library should have at least one copy of this book on its shelves, and it should be on the reading list for classes in research methods and statistics. Despite being a book on methodology, I didn’t find it a dry read. On the contrary, it is something of a detective story – like Sherlock Holmes explaining how he worked out whodunnit – and, as such, I found it rather gripping.




Review: ‘The Mind is Flat’ by Nick Chater

The nature of consciousness is a topic that psychologists and philosophers have spilt much ink and many pixels over. Outside of psychoanalytic circles, what has been less discussed is the nature of the ‘unconscious mind’. Claims made by some psychologists about the power of the unconscious mind to influence behaviour have proven controversial.

Now, in a book that will have psychoanalysts and many others protesting loudly, cognitive scientist Nick Chater has plunged a stake through the very concept of an unconscious mind. In The Mind Is Flat Chater argues that our minds have no depths, let alone hidden ones. His primary claim is that the brain exists to make sense of the world by creating a stable perception of it and ourselves; but the brain does not provide us with an account of its own workings. These perceptions are created from our interpretations of a limited number of sensory inputs, with the assistance of various memory traces (themselves based on our interpretations of past events).

Chater’s opening chapter, The Power of Invention, describes how we can create an apparently rich internal picture of a fictional person or location based on a limited description that may have gaps or inconsistencies (Chater discusses Anna Karenina and Gormenghast). So it is with our perceptions of the actual world and, indeed, ourselves.  Most of our visual receptors are incapable of colour detection, yet we perceive the world in glorious colour. Our eyes are continually darting about all over the place, yet our perception of the world is smooth, not jerky. In short, much or most of what we perceive is an illusion foisted upon us by our brains.

For centuries, philosophers consulted their ‘inner oracle’ in order to determine how the world works. Yet, Chater points out, the inner oracle has consistently misled us about concepts such as heat, weight, force and energy. Early researchers in artificial intelligence (AI) tried to do the same thing. They tried to excavate the mental depths of experts, recover ‘common sense theory’ and then devise methods to reason over this database. However, by the 1980s it had become clear that this program was going nowhere, and so was quietly abandoned.

As Chater puts it:

The mind is flat: our mental ‘surface’, the momentary thoughts, explanations and sensory experiences that make up our stream of consciousness is all there is to mental life. (p.31)

One reason why we are unaware of the fictional nature of our perceptions is precisely because our eyes are constantly moving about and picking up new sensory fragments. I may be unaware of the type of flower on the mantelpiece, but if you mention it my eyes go there automatically. In gaze-contingent eye tracking studies, the text on a screen changes according to where a person is looking. In fact, most of the text on the screen consists of Xs. As a participant’s eyes move across the screen the Xs that would have been in their fixation point change to become real words, and the area where they had been looking reverts to Xs. The participant, however, perceives that the entire page consists of meaningful text.

Likewise, when we construct a mental image it is never truly a ‘picture in the mind’. If we are asked to describe some details from the image, we simply ‘create’ those in our imagination in response to the question. Nothing is being retrieved from a complete image.

We often talk about a battle between ‘the heart and the head’, but Chater argues that we are in fact simply posing one reason against another reason. Citing the Kuleshov Effect, and the work of Schachter & Singer (1962) and Dutton & Aron (1974) on the labelling of emotional states, Chater concludes that “our feelings do not burst unbidden from within – they do not pre-exist at all” (p.98). Indeed:

The meaning of pretty much anything comes from its place in a wider network of relationships, causes and effects – not from within. (p.107)

Despite, or perhaps because of, our lack of inner depth, we are extremely good at dreaming up explanations for all kinds of things, including our inner motives. Perhaps my favourite example is from the work on choice blindness, in which participants were asked to choose the most attractive of two faces, each of which was presented on a card. After a participant made their choice, the researcher supposedly passed them the card they had chosen and asked them to explain why they had preferred that face. In fact, the researcher used sleight-of-hand to pass them the face they hadn’t chosen. Most people didn’t spot the discrepancy and readily provided an explanation as to why they preferred the face that they had not in fact chosen.

This research links to a wider body of work in decision making research, which shows that people’s preferences are constructed during the process of choice, depending on various contextual factors, as opposed to the conventional economic account that assumes people to have stable preferences that are revealed by the choices they make.

Chater also goes on to talk about people’s attentional limitations, arguing that – in almost all circumstances – our brains are only able to work on one problem at a time (where a problem is something which requires an act of interpretation on our part, rather than an habitual action such as putting one foot in front of the other when walking). This also fits with decades of work on human judgment, which has repeatedly found that people are unable to reliably integrate multiple items of information when trying to make a judgment.

Finally, Chater isn’t arguing that there are no unconscious processes. However, these unconscious processes aren’t ‘thoughts’. The mind isn’t like an iceberg, with a few thoughts appearing in consciousness and many others below the level of consciousness. Rather, the real nature of the unconscious is “the vastly complex patterns of nervous activity that create and support our slow, conscious experience” (p.175). Thus:

There is just one type of thought, and each thought has two aspects: a conscious read-out, and unconscious processes operating the read-out. And we can have no more conscious access to these brain processes than we can have conscious awareness of the chemistry of digestion or the biophysics of our muscles.

The Mind is Flat is a book that I wish I’d written, in that it expresses, with evidence, a viewpoint that I have held for some time. The writing is clear and entertaining, and I devoured the book in just a few days. Recommended.