Review: The 7 Deadly Sins of Psychology


I remember being taught as an undergraduate psychology student that replication, along with the principle of falsification, was a vital ingredient in the scientific method. But when I flipped through the pages of the journals (back in those pre-digital days), the question that frequently popped into my head was ‘Where are all these replications?’ It was a question I never dared actually ask in class, because I was sure I must simply have been missing something obvious. Now, about 30 years later, it turns out I was right to wonder.

In Chris Chambers’ magisterial new book The 7 Deadly Sins of Psychology, he reports that it wasn’t until 2012 that the first systematic study was conducted into the rate of replication within the field of psychology. Makel, Plucker and Hegarty searched for the term “replicat*” among the 321,411 articles published in the top 100 psychology journals between 1900 and 2012. Just 1.57 per cent of the articles contained this term, and among a randomly selected subsample of 500 papers from that 1.57 per cent, only 342 reported some form of replication – and of these, just 62 articles reported a direct replication of a previous experiment. On top of that, only 47 per cent of replications within the subsample were produced by independent researchers (p.50).

Why does this matter? It seems that researchers, over a long period, have engaged in a variety of ‘Questionable Research Practices’ (QRPs), motivated by ambitions that are often shaped by the perverse incentives of the publishing industry.

A turning point occurred in 2011 when the Journal of Personality and Social Psychology published Daryl Bem’s now-notorious paper ‘Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect’. Taking a classic paradigm in which an emotional manipulation influences the speed of people’s responses on a subsequent task, Bem conducted a series of studies in which the experimental manipulation happened after participants made their responses. His results seemed to indicate that people’s responses were being influenced by a manipulation that hadn’t yet happened. There was general disbelief among the scientific community, and Bem himself said that it was important for other researchers to attempt to replicate his findings. However, when the first – failed – replication was submitted to the same journal, it was rejected on the grounds that the journal did not publish replication studies, whether successful or not.

In fact, many top journals – e.g. Nature, Cortex, Brain, Psychological Science – explicitly state, in various ways, that they only publish findings that are novel. A December 2015 study in the British Medical Journal, which perhaps appeared too late for inclusion in Chambers’ book, found that over a forty-year period scientific abstracts had shown a steep increase in the use of words relating to novelty or importance (e.g. “novel”, “robust”, “innovative” and “unprecedented”). Clearly, then, researchers know what matters when it comes to getting published.

A minimum requirement, though not the only one, for a result being interesting is that it is statistically significant. In the logic of null hypothesis significance testing (NHST), this means that if chance alone were operating, the probability of obtaining a result at least as extreme as the one observed would be less than 5 per cent (less than 1 in 20). Thus, researchers hope that their key tests will yield a p-value of less than .05, as – by convention – this allows them to reject the null hypothesis in favour of their experimental hypothesis (the explanation that they are actually proposing, and in which they may be invested).
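To make that logic concrete, here is a minimal simulation of my own (not something from the book): when the null hypothesis is true, a conventional two-sample t-test will still dip below .05 about one time in twenty.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate 10,000 two-group experiments in which the null hypothesis is true:
# both groups are drawn from the same normal distribution, so any
# "significant" difference is a fluke.
n_experiments, n_per_group = 10_000, 30
false_positives = 0
for _ in range(n_experiments):
    group_a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    group_b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    _, p = stats.ttest_ind(group_a, group_b)
    false_positives += p < 0.05

# Expect roughly 5 per cent "significant" results even though no effect exists.
print(f"False positive rate: {false_positives / n_experiments:.3f}")
```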

It is fairly easy to see how the academic journals could be – and almost certainly are – overpopulated with papers that claim evidential support for hypotheses that are false. For instance, suppose many different researchers test a hypothesis that is, unknown to them, incorrect. Perhaps just one researcher finds a significant result – a fluke arising by chance. That one person is likely to get published, whereas the others will not. In reality, many researchers will not bother to submit their null findings. But here lies another problem. A single researcher may conduct several studies of the same hypothesis, but only attempt to publish the one (or ones) that turn out significant. He or she may feel a little guilty about this, but hey! – they have careers to progress and this is the system that the publishers have forced upon them.
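A rough back-of-the-envelope calculation (my own illustration – the specific proportions are assumptions for the sake of argument, not figures from the book) shows how quickly selective publication can fill the literature with flukes:

```python
# Illustrative assumptions: most tested hypotheses are false, power is modest,
# and only significant results are submitted and published.
n_hypotheses = 1000      # hypotheses tested across a field
prop_true    = 0.10      # assume only 10% of tested hypotheses are actually true
power        = 0.50      # assume a 50% chance of detecting a real effect
alpha        = 0.05      # conventional significance threshold

true_hits       = n_hypotheses * prop_true * power          # 50 genuine effects detected
false_positives = n_hypotheses * (1 - prop_true) * alpha    # 45 flukes reaching p < .05

ppv = true_hits / (true_hits + false_positives)
print(f"Share of published 'significant' findings that reflect real effects: {ppv:.0%}")
# With these assumptions, nearly half of the published findings are false positives.
```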

Replication is supposed to help discover which hypotheses are false and which are likely to be true. As we have seen, though, failed replications may never see the light of day. More problematic is the use of ‘conceptual replications’, in which a researcher tries to replicate a previous finding using a methodology that is, to a greater or lesser degree, novel. The researcher can claim to be “extending” the earlier research by testing the generality of its findings. Indeed, having this element of originality may increase the chances of publication. However, as Chambers notes, there are three problems with conceptual replications.

First, how similar must the methods be in order for the new study to count as a replication, and who decides this? Second, there is a risk of certain findings becoming unreplicated. If a successful conceptual replication later turns out to have produced its result through an entirely different causal mechanism, then the original study has, in effect, just been unreplicated. Third, attempts at conceptual replication can fuel confirmation bias. If an attempted conceptual replication produces a different result from the initial study, the authors of the first study can simply claim that their results were not reproduced because the replication didn’t follow exactly the same methodology.

Chambers sums up the replication situation as follows:

To fit with the demands of journals, psychologists have thus replaced direct replication with conceptual replication, maintaining the comfortable but futile delusion that our science values replication while still satisfying the demands of novelty and originality (p.20).

Because psychologists frequently run studies with more than one independent variable, they typically use statistical tests that provide various main effects and interactions. Unfortunately, this can tempt researchers to operate with a degree of flexibility that isn’t warranted by the original hypothesis. They may engage in HARKing – Hypothesizing After the Results are Known. Suppose a researcher predicts a couple of main effects, but these turn out to be non-significant once the analysis has been performed. Nonetheless, there are some unpredicted significant interactions within the results. The researcher now goes through a process of trying to rationalise why the results turned out this way. Having come up with an explanation, he or she rewrites the hypotheses as though these results were what had been expected all along. Recent surveys suggest that psychologists believe the prevalence of HARKing to be somewhere between 40% and 90%, though the proportion who admit to doing it themselves is, of course, much lower.

Another form of QRP is p-hacking. This refers to a cluster of practices whereby a researcher can illegitimately transform a non-significant result into a significant one. Suppose an experimental result has a p-value of .08 – quite close to the magical threshold of .05, but still likely to be a barrier to publication. At this point, the researcher may try recruiting some new participants to the study in the hope that this will push the p-value below .05. However, bearing in mind that there will always be some variation in the way that participants respond, regardless of whether or not a hypothesis is true, “peeking” at the results and recruiting new participants until p falls below .05 simply inflates the likelihood of obtaining a false positive result.
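The inflation is easy to demonstrate. The sketch below is my own illustration (the sample sizes and stopping rule are arbitrary assumptions): a simulated researcher studying a non-existent effect keeps recruiting and re-testing until p dips below .05 or the maximum sample size is reached.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def peeking_experiment(start_n=20, max_n=100, step=10, alpha=0.05):
    """Run a two-group study with no true effect, re-testing after every
    batch of new participants and stopping as soon as p < alpha."""
    a = list(rng.normal(0, 1, start_n))
    b = list(rng.normal(0, 1, start_n))
    while True:
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            return True                      # declared "significant"
        if len(a) >= max_n:
            return False                     # gave up: a genuine null result
        a.extend(rng.normal(0, 1, step))     # recruit a few more and peek again
        b.extend(rng.normal(0, 1, step))

n_sims = 2000
hits = sum(peeking_experiment() for _ in range(n_sims))
# With repeated peeking, the false positive rate climbs well above the nominal 5%.
print(f"False positive rate with optional stopping: {hits / n_sims:.3f}")
```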

A second form of p-hacking is to analyse the data in different ways until you get the result you want. There is no single agreed method for the exclusion of ‘outliers’ in the data, so a researcher may run several analyses in which differing numbers of outliers are excluded, until a significant result is returned. Alternatively, there may be different forms of statistical test that can be applied. All tests are essentially estimates, and while equivalent-but-different tests will produce broadly similar results, a small difference in the resulting p-value may be all that is needed to transform a non-significant result into a significant one.

A third form of p-hacking is to change your dependent variables. For example, if three different measures of an effect are each just slightly non-significant, then a researcher might try combining them into a single measure to see if this brings the p-value below .05.

Several recent studies have examined the distributions of p-values in similar kinds of studies and have found that there is often a spike in p-values just below .05, which would appear to be indicative of p-hacking. The conclusion that follows from this is that many of the results in the psychological literature are likely to be false.
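This is the intuition behind those p-value distribution (‘p-curve’) analyses. In the rough sketch below – my own illustration, with arbitrary effect sizes and an arbitrary stopping rule – significant p-values from genuine effects pile up close to zero, whereas p-values rescued by optional stopping on a null effect bunch up just under the .05 threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def significant_p(effect, n=30, hack=False):
    """Return a p-value below .05 from one simulated two-group study, or None.
    If hack=True, keep adding participants until p < .05 or 3n per group is reached."""
    a = list(rng.normal(0, 1, n))
    b = list(rng.normal(effect, 1, n))
    while True:
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            return p
        if not hack or len(a) >= 3 * n:
            return None
        a.extend(rng.normal(0, 1, 5))
        b.extend(rng.normal(effect, 1, 5))

genuine, hacked = [], []
for _ in range(3000):
    p = significant_p(effect=0.5)             # real effect, honest stopping
    if p is not None:
        genuine.append(p)
    p = significant_p(effect=0.0, hack=True)  # no effect, optional stopping
    if p is not None:
        hacked.append(p)

bins = [0, 0.01, 0.02, 0.03, 0.04, 0.05]
print("genuine effects:", np.histogram(genuine, bins)[0])  # counts pile up near zero
print("p-hacked nulls: ", np.histogram(hacked, bins)[0])   # counts bunch up just under .05
```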

Chris Chambers also examines a number of other ways in which the scientific literature can become distorted. One such is the hoarding of data. Many journals do not require, or even ask, that authors deposit their data with them. Authors themselves often refuse to provide data when a request is received, or will only provide it under certain restrictive conditions (almost certainly not legally enforceable). Yet one recent study found that statistical errors were more frequent in papers whose authors had failed to provide their data. Refusal to share may, of course, be one way of hiding misconduct. Chambers argues that data sharing should be the norm, not least because even the most scrupulous and honest authors may, over time, lose their own data, whether through the updating of computer equipment or in the process of changing institutions. And, of course, everyone dies sooner or later. So why not ensure that all research data is held in accessible repositories?

Chapter 7 – The Sin of Bean Counting – covers some ground that I discussed in an earlier blog, when I reviewed Jerry Muller’s book The Tyranny of Metrics. Academic journals now have a ‘Journal Impact Factor’ (JIF), which uses citation counts to index the supposed overall quality of the work a journal publishes. Yet a journal’s JIF is driven by only a very small proportion of the papers it carries; most papers attract only a small number of citations. Worse, the supposedly high impact journals are in fact the ones with the highest rates of retractions owing to fraud or suspected fraud. Chambers argues that it would be more accurate to refer to them as “high retraction” journals rather than “high impact” journals. The JIF is also easily massaged by editors and publishers, and, rather than being objectively calculated, is a matter of negotiation between the journals and the company that determines the JIF (Thomson Reuters).

Yet:

Despite all the evidence that JIF is more-or-less worthless, the psychological community has become ensnared in a groupthink that lends it value.

It is used within academic institutions to help determine hiring and promotions, and even redundancies. Many would argue that JIF and other metrics have damaged the collegial atmosphere that one associates with universities, which in many instances have become arenas of overwork, stress and bullying.

Indeed, recent years have seen a number of instances of fraudulent behaviour by psychologists, most notably Diederik Stapel, who invented data for over 50 publications before eventually being exposed by a group of junior researchers and one of his own PhD students. By his own account, he began by engaging in “softer” acts of misrepresentation before graduating to more serious behaviours. Sadly, his PhD students, who had unwittingly incorporated his fraudulent results into their own PhDs (which they were allowed to retain), had their peer-reviewed papers withdrawn from the journals in which they had been published. Equally sad is ‘Kate’s Story’ (also recounted in Chapter 5), which describes the unjust treatment meted out to a young scientist after she was caught up in a fraud investigation against the Principal Investigator of the project she was working on, even though she was not the one who had reported him. Kate is reported as saying that if you suspect someone of making up data, but lack definitive proof, then you should not expect any sympathy or support for speaking out.

Fortunately, Chris Chambers has given considerable thought as to how psychology’s replication crisis might be addressed. Indeed, he and a number of other psychologists have been instrumental in effecting some positive changes in academic publishing. His view is that it would be hopeless to try to address the biases (many likely unconscious) that researchers possess. Rather, it is the entire framework of the scientific and publishing enterprise which must be changed. His suggestions include:

  • The pre-registration of studies. Researchers submit their research idea to a journal in advance of carrying out the work. This includes details of hypotheses to be tested, the methodology and the statistical analyses that will be used. If the peer reviewers are happy with the idea, then the journal commits to publication of the findings – however they turn out – if the researchers have indeed carried out the work in a satisfactory manner.
  • The use of p-curve analyses to determine which fields in psychology are suffering from p-hacking.
  • The use of disclosure statements. Joe Simmons and colleagues have pioneered a 21-word statement:

We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.

  • Data sharing.
  • Solutions to allow “optional stopping” during data collection. One method is to reduce the alpha-level every time a researcher “peeks” at the data. A second method is to use Bayesian hypothesis testing instead of NHST. Whereas NHST only ever tests the null hypothesis (and doesn’t provide an estimate of how probable the null hypothesis is), the Bayesian approach allows researchers to directly estimate the evidence for the null hypothesis relative to the experimental hypothesis (a brief sketch follows this list).
  • Standardization of research practices. This may not always be possible, but where researchers conduct more than one type of analysis then the details of each should be reported and the robustness of the outcomes summarised.
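To give a flavour of the Bayesian alternative mentioned above, here is a minimal sketch of my own (a coin-fairness toy example using the Savage–Dickey density ratio, not an analysis from the book). Unlike a p-value, the Bayes factor can quantify evidence for the null hypothesis as well as against it.

```python
from scipy import stats

# Toy example: is a coin fair? Suppose we observe 52 heads in 100 flips.
heads, flips = 52, 100

prior = stats.beta(1, 1)                                # uniform prior on the coin's bias
posterior = stats.beta(1 + heads, 1 + flips - heads)    # posterior after seeing the data

# Savage-Dickey density ratio for a point null (bias = 0.5):
# BF01 = posterior density at 0.5 / prior density at 0.5.
bf01 = posterior.pdf(0.5) / prior.pdf(0.5)
print(f"BF01 = {bf01:.1f}  (values above 1 favour the null hypothesis)")
```

In this toy case the data actually favour the fair-coin hypothesis – something a non-significant p-value could never tell you, since NHST is silent about evidence for the null.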

Chambers devotes most space to the discussion of pre-registration. Many objections have been raised against this idea, and Chambers tackles these objections (convincingly, I think) in his Chapter 8: Redemption.

Although the issue of replication and QRPs is not unique to psychology, evidence indicates that the problem may be bigger there than in other disciplines. Therefore, if psychologists wish to be taken seriously it is incumbent upon them to clean up their act. Fortunately, a number of psychologists – Chambers included – have been at the forefront of both uncovering poor practice and proposing ways to improve matters. A good starting point for anyone wanting to appreciate the scale of the problem, and how to deal with it, would be to read this book. Indeed, I think every university library should have at least one copy on its shelves, and it should be on the reading list for classes in research methods and statistics. Despite being a book on methodology, I didn’t find it a dry read. On the contrary, it is something of a detective story – like Sherlock Holmes explaining how he worked out whodunnit – and, as such, I found it rather gripping.


Destroying the soul, by numbers

I think the first time I became aware of metrics in the workplace was between 1990 and 1993, when I was studying for a PhD at the University of Wales, College of Cardiff (now simply ‘Cardiff University’). One day, A4 sheets of paper appeared on walls and doors in the Psychology Department proclaiming “We are a five star department!” A friend explained to me that this related to our performance in the ‘Research Assessment Exercise’ (RAE), about which I knew nothing. He scoffed at the proclamation, clearly thinking that this kind of rating exercise had little to do with what really mattered in science. I didn’t realise then how right he was. But the RAE was used to determine how much research income institutions could expect from government (via the funding councils).

A few years later, in my first full-time lecturing post, at London Guildhall University, I was put in charge of organising our entry to the next RAE. Part of this pre-exercise exercise was to determine which members of staff would be included and which excluded. Immediately this raised the question in my mind: “If the RAE is supposed to assess a department’s strengths in research, then shouldn’t all staff members be included?” Such was my introduction to the “gaming” of metrics. Every institution was, of course, gaming the system in this and various other ways. Those that could afford it would buy in star performers just before the RAE (often to depart not long afterwards), leading to new rules to prevent such behaviour.

At some point, universities also got landed with the National Student Survey (NSS), which consisted of numerous questions relating to the “student experience”, but with most of the impact falling on lecturing staff who, either explicitly or implicitly, were informed that they needed to improve. With the introduction of – and subsequent increase in – tuition fees, students were now seen as consumers, for whom league tables in research and the NSS were sources of information that could be used to distinguish between institutions when applying. The NSS has also led to gaming, sometimes not so subtly – as when lecturers or managers have warned students that low ratings could cost the institution income and so damage the students’ own educational experience.

These changes within universities have been accompanied by another change: an expansion in the number of administrative staff employed and a shift in power away from academics. And academic staff themselves now spend considerably more time on paperwork than was the case in the past.

A new book by Jerry Z. Muller, The Tyranny of Metrics, shows that the experience of higher education is typical of many areas of working life. He traces the history of workplace metrics, the controversies surrounding them and the evidence of their effectiveness (or lack thereof). As far back as 1862, the Liberal MP Robert Lowe was proposing that the funding of schools should be determined on a payment-by-results basis, a view challenged by Matthew Arnold (himself a schools inspector) for the narrow and mechanical conception of education that it promoted.

In the early twentieth century, Frederick Winslow Taylor promoted the idea of “scientific management”, based on his time-and-motion studies of pig iron production in factories. He advocated that people should be paid according to output in a system that required enforced standardisation of methods, enforced adoption of the best implements and working conditions, and enforced cooperation. Note that the use of metrics and pay-for-performance are distinct things, but often go together in practice.

Later in the century, the doctrine of managerialism became more prominent. This is the idea that the differences among organisations are less important than their similarities. Thus, traditional domain-specific expertise is downplayed and senior managers can move from one organisation to another where the same kinds of management techniques are deployed. In the US, Defence Secretary Robert McNamara took metrics to the army, where “body counts” were championed as an index of American progress in Vietnam. Officers increasingly took on a managerial outlook.

The use of metrics found supporters on both the political left and the right. Particularly in the 1960s, the left were suspicious of established elites and demanded greater accountability, whilst the right were suspicious that public sector institutions were run for the benefit of their employees rather than the public. For both sides, numbers seemed to give the appearance of transparency and objectivity.

Other developments included the rising ideology of consumer choice (especially in healthcare), whereby empowerment of the consumer in a competitive market environment would supposedly help to bring down costs. ‘Principal-Agent Theory’ highlighted that there was a gap between the purposes of institutions and the interests of the people who run them and are employed by them. Shareholders’ interests are not necessarily the same as the interests of corporate executives, and the interests of executives are not necessarily the same as those of their subordinates (and so on). Principals (those with an interest) were needed to monitor agents (those charged with carrying out their interests), which meant motivating them with pecuniary rewards and punishments.

In the 1980s, the ‘New Public Management’ developed. This advocated that not-for-profit organisations needed to function more like businesses, such that students, patients, or clients all became “customers”. Three strategies helped determine value for money:

  1. The development of performance indicators (to replace price).
  2. The use of performance-related rewards and punishments.
  3. The development of competition among providers and the transparency of performance indicators.

Critics of this approach have noted that not-for-profit organisations often have multiple purposes that are difficult to isolate and measure, and that their employees tend to be motivated more by the mission than by the money. Of course, money does matter, but that recognition should come through the basic salary rather than through performance-related rewards.

Indeed, evidence indicates that extrinsic (i.e. external to the person) rewards are most effective in commercial organisations. Where a job attracts people for whom intrinsic rewards (e.g. personal satisfaction, verbal praise) are more important, the application of pay-for-performance can undermine intrinsic motivation. Moreover, the people doing the monitoring tend to adopt measures for those things that are most visible or most easily measured, neglecting many other things that are important but which are less visible or not easily measured. This can lead to a distortion of organisational goals.

Many conservative and classical liberal thinkers have criticised such ideas, including Hayek, who drew a comparison with the failed attempts of socialist governments (notably the Soviet Union) at large-scale economic planning. Nonetheless, from Thatcher to Blair, and from Clinton to Bush and Obama, politicians of different hues have continued to extend metrics further into the public domain.

Muller is not entirely a naysayer on metrics, noting that they can sometimes genuinely highlight areas of poor performance. In particular, he notes that in the US there have been some success stories associated with the application of metrics in healthcare. However, closer examination shows that these successes owe more to their being embedded within particular organisational cultures than to measurement per se. Indeed, they seem to be the exceptions rather than the rule, with other research showing no lasting effect on outcomes and no change in consumer behaviour. Research by the RAND Corporation found that the stronger a study’s methodological design, the less likely it was to identify significant improvements associated with pay-for-performance.

What is clear – and Muller looks at universities, schools, medicine, policing, the military, business, charities and foreign aid – is that metrics have a range of unintended consequences. These include various ways in which managers and employees try to game the system: teaching to the test (education), treating to the test (medicine), risk aversion (e.g. in medicine, not operating on the most severely ill patients), and short-termism (e.g. police arresting the easy targets rather than chasing down the crime bosses). There is also outright cheating (e.g. teachers changing the test results of their pupils).

Incidentally, another recent book, The Seven Deadly Sins of Psychology (by Chris Chambers), documents how institutional pressures and the publishing system have incentivized a range of behaviours that have led to ‘bad science’. For instance, ‘Journal Impact Factors’ (JIFs) supposedly provide information about the overall quality of the research that appears in different journals. Researchers can cite this information when applying for tenure, promotion, or for their inclusion in the UK’s Research Excellence Framework (formerly the RAE). However, only a small number of publications in any given journal account for most of the citations that feed into the JIF. Another issue with JIFs concerns statistical power – the likelihood that a study will detect a genuine effect where one exists (power depends on sample size and several other factors). It turns out that there is no relationship between a journal’s JIF and the average level of statistical power of the studies it publishes. Worse, high impact journals have a higher rate of retractions due to errors or outright fraud.

But one of the impacts of metrics is an expansion of the resources (people, time, money, equipment) devoted to monitoring. Even the people being monitored must give up time and effort in order to produce the documentation the system demands. And as new rules are introduced to crack down on attempts to game the system, the administrative burden expands even further. This diversion of resources obviously works against the productivity gains that metrics are supposed to deliver.

I was less convinced by the penultimate chapter in Muller’s book, in which he addresses transparency in politics and diplomacy. He speaks scornfully of the actions of Chelsea Manning and Edward Snowden in disclosing secret documents, which he says have had detrimental effects on American intelligence. Undoubtedly, transparency can sometimes be a hazard – compromise between different parties is made harder under its full glare – and there is a balance to be struck, but I would argue that the scale of wrongdoing revealed by these individuals justifies the actions they took and for which they have both paid a price. In the UK, as I write, there is an ongoing scandal over the related issues of illegal blacklisting of trade union activists in the construction industry and spying on political and campaigning groups (including undercover police officers having sexual relationships with campaigners). A current TV programme (A Very English Scandal) concerns the leader of a British political party who – in living memory – arranged the attempted murder of his former lover, and was exonerated following an outrageously biased summing up in court by the judge. And of course the Chilcot report into the Iraq war found that Prime Minister Blair deliberately exaggerated the threat posed by the Iraqi regime, and was damning about the way the final decision was made (of which no formal record was kept).

However, as far as the ordinary workplace is concerned, especially in not-for-profit organisations, the message is clear – beware of metrics!