A big Noise (review)

Noise: A flaw in human judgment

by Daniel Kahneman, Olivier Sibony and Cass R. Sunstein.

Published by William Collins, 2021.

By now, many people are familiar with Nobel prize winner Daniel Kahneman’s previous book Thinking Fast and Slow, in which he popularised the idea that rapid unconscious thought processes underlie many of our judgments and decisions. It is this manner of thought that we equate with intuition. Kahneman showed us how intuitive thinking can give rise to a range of systematic errors, referred to as biases. In this new book, he has teamed up with Olivier Sibony and Cass Sunstein to talk about another source of error in judgment, referred to as noise.

The authors state that error in judgment arises from a combination of bias and noise. Noise in judgment is defined as unwanted variability, and we are told that this is a more pervasive problem than bias. The book describes various studies of noise that researchers have conducted over several decades, including the notable contribution of Marvin Frankel, a US judge who was outraged by the variability of criminal sentencing in the American legal system. The authors contend, however, that the topic of noise has tended to be overshadowed by the topic of bias. Specifically:

[I]n public conversations about human error and in organizations all over the world, noise is rarely recognized. Noise is a bit player, usually offstage. The topic of bias has been discussed in thousands of scientific articles and dozens of popular books, few of which even mention the issue of noise. This book is our attempt to redress the balance (p.6).

Noise can be observed and measured even where a ‘right’ answer may not exist or cannot be verified. There is no objective standard for assessing whether a movie is ‘good’, for example, but because most professional critics give a numerical rating we can see the extent to which they agree or disagree with each other. There may be little consequence to variability in the judgments of film critics, but in many domains we would hope for high levels of consistency. For example, if any of us were to find ourselves the defendant in a court case, we would rightfully expect that the fairness of the outcome should not depend upon which judge happens to be hearing the case that day. Regrettably, the evidence reported by the authors indicates that noise pervades the legal system and many other areas of life. They note that noisy judgments have been observed in medicine, child custody decisions, professional forecasting, asylum decisions, personnel decisions, bail decisions, forensic science, and patent decisions. 

In Chapter 6, the authors describe different types of noise. The example of a legal defendant obtaining a different outcome depending on which judge handles the case is an illustration of system noise. Observations of courtroom sentencing have long suggested that judges vary in the way they treat similar cases, a conclusion which is supported by controlled research. A study published in 1981 presented 208 US Federal judges with the same 16 cases and asked them to set a sentence for each. Sure enough, wide variation was observed in sentencing. There is of course no way of knowing what the ‘right’ sentence is, and while it is tempting to suggest that the average sentence for a case represents the ‘right’ sentence, the average may also reflect the existence of bias (e.g. racial discrimination in sentencing).

System noise is itself the product of two other distinct forms of noise. One of these is level noise. In the case of courtroom judges this would represent the tendency of some judges to be more severe than others. The other contribution to system noise comes from pattern noise. This occurs when a judge treats certain types of case more severely than other types (a judge x case interaction). As the authors put it:

One judge, for instance, may be harsher than average in general but relatively more lenient toward white-collar criminals. Another may be inclined to punish lightly but more severely when the offender is a recidivist.

Another type of noise arises when the same individual makes different judgments about the same information when it is encountered at different times. Such within-person variability is referred to as occasion noise. Logically, if a person is operating as part of a system, then occasion noise must also contribute to system noise, but this is rather difficult to tease apart. Occasion noise has been widely studied and arises from numerous factors, such as variation in mood, stress, fatigue, and even changes in the weather. Contextual information can also have an impact: in the US, a judge who has just granted asylum to the previous two applicants is less likely to grant it to the next applicant.
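
The decomposition is easy to see in a toy simulation. The sketch below is purely illustrative (the numbers are made up and it is not drawn from the book): it generates sentencing judgments as the sum of case severity, level noise, pattern noise and occasion noise, and then recovers the overall system noise from the spread of judgments across judges for each case.

```python
import numpy as np

rng = np.random.default_rng(0)
n_judges, n_cases = 50, 16

# Hypothetical components of a prison sentence (in months) -- illustrative values only.
case_severity  = rng.normal(36, 12, size=n_cases)             # real case-to-case differences (not noise)
level_noise    = rng.normal(0, 6, size=n_judges)              # some judges harsher than others overall
pattern_noise  = rng.normal(0, 4, size=(n_judges, n_cases))   # judge-by-case interactions
occasion_noise = rng.normal(0, 3, size=(n_judges, n_cases))   # same judge, different day or mood

sentences = case_severity[None, :] + level_noise[:, None] + pattern_noise + occasion_noise

# System noise: how much judges disagree about the same case, averaged over cases.
system_sd = np.sqrt(np.mean(np.var(sentences, axis=0, ddof=1)))
print(f"observed system noise:        {system_sd:.1f} months")
print(f"expected from the components: {np.sqrt(6**2 + 4**2 + 3**2):.1f} months")

# With only one judgment per judge-case pair, pattern noise and occasion noise cannot be
# separated here -- echoing the book's point that occasion noise is hard to tease apart.
```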

The authors propose a range of remedies for the problem of noisy judgments, which they class under the umbrella heading of decision hygiene. Any organisation concerned about the issue of noise in judgment, they suggest, should conduct a noise audit in order to determine the extent to which it is affected (an appendix provides guidance on how to go about this). The first principle of decision hygiene is that “The goal of judgment is accuracy, not individual expression”. Statistical models have generally been found to outperform human judges on repeated decisions, including models that were created from analyses of human judges. This has been known for a long time, though the advent of machine learning has given even greater scope for the application of such models. The great advantage of statistical models is that they are free from occasion noise, although there is a danger that models based on human judgment will incorporate societal biases (e.g. racial discrimination). There is some discussion of the problem of bias in AI systems, though the authors seem largely unconcerned. This was a real issue for me: their rather casual dismissal of these concerns struck me as hand-wavy and unconvincing.

However, acknowledging that people often resist challenges to their autonomy, the authors suggest that some situations – such as job interviews – may benefit most from being structured, with options rated and those ratings used as the input to a discussion among decision makers rather than to an algorithm.

A second principle is that judges should “think statistically and take the outside view of the case”. Thinking about how a current situation might be similar to other situations that have been encountered can help root thinking in statistical reality, and avoid excessive optimism.

Thirdly, judges should “structure judgments into several independent tasks”. This has long been a basic principle of decision analysis: people’s limited cognitive capacities are better able to manage a series of smaller problems than one big, complex problem. Kahneman et al. describe a specific procedure for organisational decision making, which they call the mediating assessments protocol.

A fourth principle is to “avoid premature intuitions”. In Chapter 20 the authors provide an alarming description of how the forensic analysis of fingerprints can be biased in the US legal system. Whenever a laboratory links the partial fingerprint from a crime scene to the complete fingerprint of a suspect, a second laboratory is asked to carry out the same analysis. Unfortunately, the second laboratory knows that it is only being asked to do the analysis because another laboratory made an identification, hence they are potentially biased at the outset.

The book finishes with a comparison of rules and standards as a means for regulating behaviour. Rules have clear-cut criteria (“Do not exceed the speed limit”), though as noted earlier they can also be biased. Standards, on the other hand, allow for the exercise of discretion (“Drive carefully”). Standards are often adopted because it can be difficult to get people to agree on the precise criteria for rules. However, the more open-ended the language used in a standard, the more judgment is needed and the more likely noise is to creep in. The authors give the example of Facebook’s Community Standards, which are meant to determine what is and isn’t acceptable online content. When the standards were first introduced, the thousands of Facebook reviewers applying them ended up making highly variable decisions. To address this problem, Facebook created a non-public document for its reviewers called the Implementation Standards, which – for example – included graphic images to depict what it meant by “glorifying violence”. In so doing, it essentially created a set of rules to underpin its public standards.

There appears to be no clear-cut way to determine whether a rule or a standard should be used, and the authors suggest that at a first approximation any organisation needs to consider the costs of decisions and the costs of errors. Creating a rule can be difficult, but applying a rule in a decision situation is relatively easy. Conversely, creating a standard is easier, but where a person has to make many decisions the need to be continually exercising judgment can be quite a burden. The authors suggest that the costs of errors depend on “whether agents are knowledgeable and reliable, and whether they practice decision hygiene”. Where agents can be trusted in this regard, then a standard might work well, but otherwise a rule may be more appropriate.

That is the book in summary, then. With three co-authors you might wonder how stylistically consistent the book would be, but I found it to be remarkably consistent, with no obvious clue as to who wrote what. However, there is also quite a bit of repetition, and a more rigorous editing process could have cut the length substantially. Overall, though, I found the book to be quite engaging, much more so than Thinking Fast and Slow, which I found rather hard-going (I didn’t manage to finish that book, although I was familiar with most of the content anyway).

There has been some academic sniping over Noise, though I don’t think it’s very interesting for a review to begin reviewing the other reviewers (one highly critical review with links to other critical reviews can be found here). Some of the criticism, in my view, is overstated, and there is a sense of people trying to cut down one of the “tall poppies” in the field. Nonetheless, one of the reasons that Kahneman, in particular, has become something of a target is that a number of weaknesses have been identified in his previous book, Thinking Fast and Slow. Kahneman was perhaps unfortunate to have published his best-seller in the same year in which one well-known psychologist was revealed to have fabricated data in many studies, and in which one of the most controversial papers in psychology appeared, a paper which has prompted a great deal of soul-searching within the discipline. It transpires that for a long time a range of questionable research practices (QRPs) have been used in psychology (and, to be fair, in other disciplines, though not to the same degree). As a result of this introspection, it turns out that Kahneman’s book contains many studies which have either failed to be replicated by other researchers or which are severely “underpowered” (too few participants), meaning that there is a good chance they would not replicate. The implicit priming studies featured in Chapter 4 of Thinking Fast and Slow are particularly problematic, and a critique can be read here. A broader critique can be found here.

Kahneman has not (yet) revised Thinking Fast and Slow to address the problems identified, and the millions of non-psychologist readers are unlikely to be aware that there are any problems. Those who are aware of the problems identified in psychological research will justifiably wonder about the validity of the studies reported in Noise. I have no doubt that noise exists, but to what extent, and are the psychological explanations correct? One widely-cited study reported in Noise found that the parole decisions of experienced US judges became increasingly unfavourable the further they got into a session, with about 65% of decisions being favourable at the start of a session and none at the end. Immediately following a break for food, favourable decisions predominated once more before going into a gradual decline again. Whereas most psychological effects are no more than modest in size, this one was substantial. Not reported by Kahneman and colleagues is the fact that this finding has been the subject of some contention.

One response suggested that the results could be explained by the non-random ordering of cases – prisoners without legal representation tend to have their cases heard later in the session – although the original researchers argued that including representation in their analysis did not change the results. It has also been claimed that the “hungry judge” effect arises from the sensible planning of rational judges: judges tend to end a session when they foresee that the next case is likely to take a long time, and longer cases are more likely to result in favourable outcomes. If correct, this account would suggest that the case for noise in this instance has been overstated and that the supposed cause is false. Finally, the wider concept of “ego-depletion”, upon which the original hungry judge finding rests, has itself been called into question.

In conclusion, Noise is somewhat overlong and repetitive, but I think the breakdown of different types of noise is very interesting. There are some potentially useful suggestions for minimising noise, though the authors gloss over concerns about bias in AI-driven decisions. Also, the idea of a noise audit for organisations sounds quite bureaucratic (though potentially a money-spinner for consultants), so presumably ought to be considered only by organisations where noise is a major concern. A healthy skepticism about the psychological research is advised.

[Note: I made a slight edit for clarity to the “hungry judge” section – 12.30pm, 9th August 2021]

Fear is the key: A review of ‘Why Horror Seduces’ (by Mathias Clasen)

Looking back, I think the first horror movie I saw at the cinema must have been The Omega Man, the 1971 version of I Am Legend, Richard Matheson’s tale of modern-day vampires. I was only nine at the time, which was probably below the age certification for that film, though I think back then the ticket sellers at my local cinema weren’t always too bothered about checking and enforcing such matters. For a long time, The Omega Man remained my favourite film. I still love the scene where Charlton Heston sits alone in a cinema watching the documentary film of the 1969 Woodstock festival, listening to hippies talking about a world of peace and love, a wonderful juxtaposition with the world we know Heston is now living in – bereft of human beings during the daytime and besieged by malevolent vampires at night.

A little later I saw Jaws (1975). This was still a time when your ticket enabled you to enter at any point during the film and then stay for the next showing (hence the saying “This is where we came in”). Thus, my introduction to the film was seeing Quint disappearing into the mouth of the shark, without any of the dramatic build-up to this point, which is arguably more fear-inducing than the final scenes.

Both I Am Legend (the book) and Jaws (the film) are among the works included in a selective review of American horror fiction discussed in the 2017 book Why Horror Seduces, by Mathias Clasen, Associate Professor of Literature and Media at Aarhus University. Before we get to this review, though, Clasen addresses the wider questions of what horror is, how it works, and how it has been and should be studied. Horror is notoriously hard to define other than in terms of the reactions that a work of fiction elicits from the viewer or reader. Whilst some theorists date the origins of horror to the advent of Gothic fiction in the nineteenth century, Clasen agrees with one of the genre’s most celebrated practitioners, Lovecraft, who wrote that “the horror tale is as old as human thought and speech themselves”.

Survey research shows that most people enjoy horror but, like Goldilocks’ porridge, it needs to be just right – it doesn’t work if it fails to provoke unease or a fear reaction, and likewise most people don’t want horror to be too frightening. But why do we want to be frightened at all by a work of fiction? Enter a plethora of theorists who want to tell us that the stories we love are actually about something other than an entity or situation that is scaring the crap out of us. There are Freudian, feminist, queer, Lacanian, Marxist, race studies, post-colonial and post-structuralist readings of horror fiction. Sometimes, a critical interpretation is simultaneously based on several of these approaches.

Whilst acknowledging that works of fiction may include multiple themes, and also expressing pleasure that the horror genre is taken seriously by these writers, Clasen believes that their various theoretical approaches invariably miss what is at the core of horror fiction. He is especially scathing, albeit in a polite fashion, about the psychoanalytic approaches to horror, which are so divorced from empirical evidence that they enable the shark in Jaws to be interpreted as both “a greatly enlarged, marauding penis” (Peter Biskind) and a “vagina dentata” (Jane Caputi), a giant vagina with teeth.

Critical interpretations are sometimes at odds with the explicitly expressed intentions of writers or directors. Thus, one Lacanian reading of The Shining insists that the novel is really about repressed homoerotic and Oedipal desires, despite author Stephen King’s insistence that the story is based on his own battle with alcohol. The early slasher movies provoked a range of critical reactions. One critic insisted that these films were a means for young people to assuage their guilt about their own hedonistic lifestyles, a claim with no evidential basis whatsoever. Others claimed that slasher movies were inherently misogynistic, depicting their female victims as being punished for their sexually active lifestyles. Yet content analysis of slasher films has shown that men are just as likely as women to be victims. In Halloween, the character of Laurie Strode (Jamie Lee Curtis) is supposedly spared death because she adheres to socially conservative norms, an interpretation that overlooks the fact that Michael Myers is actually trying to kill her, and that this is what terrifies her. Furthermore, writer/director John Carpenter explicitly rejects the moralistic interpretation: Laurie Strode survives because she is the only character to detect and adequately respond to the danger.

Carpenter’s explanation is aligned with Clasen’s own interpretation of how horror works and why we are drawn to it. The human capacity for emotion, he points out, is a product of evolution. Fear and anxiety are the most primal of emotions, as these help shape our responses to immediate and anticipated threats, respectively (more on the topic of evolution and emotions can be found in Randolph Nesse’s new book Good Reasons for Bad Feelings). Potentially threatening stimuli, such as strange noises in the house at night, tend to grab our attention, even if in fact there is no danger. It is better to be anxious about something that turns out to be harmless than to be unconcerned about something truly dangerous. Organisms that do not respond to potential threats are fairly quickly removed from the gene pool. For our hunter-gatherer ancestors, those threats included environmental hazards, non-human predators, other people (in the form of physical danger, loss of status, and potentially lethal social ostracization), and diseases in the form of invisible pathogens, bacteria and viruses (hence certain stimuli, such as excrement and rotting meat, universally lead to feelings of disgust).

Our hunter-gatherer ancestors faced regular challenges to their survival on a scale that most of us will never experience. Even contemporary hunter-gatherers mostly live shorter lives than the rest of us. Horror fiction enables us to experience emotional reactions to potentially threatening stimuli within a safe environment. As Clasen puts it (p.147):

The best works of horror have the capacity to change us for life – to sensitize us to danger, to let us develop crucial coping skills, to enhance our capacity for empathy, to qualify our understanding of evil, to enrich our emotional repertoire, to calibrate our moral sense, and to expand our imaginations into realms of the dark and disturbing.

In his review of several works of American horror fiction, Clasen not only skewers the inadequacy of many previous critical approaches to horror, but spells out precisely the behavioural challenges that are posed to the characters in these works. For example, a staple ingredient of many zombie films is the tension between acting self-interestedly versus cooperating with others to fight the encroaching threat (zombies themselves arouse feelings of disgust associated with contagion). Often goodness and selfishness are embodied in different characters, yet in some works of fiction they may represent a conflict within a single character. One such example is Jack Torrance in The Shining. His failing literary career represents a loss of status, which he hopes to address by focusing on his writing whilst at the Overlook Hotel. When it becomes clear that there is some kind of threat to his son, Danny, feelings of parental concern are aroused. However, the hotel itself – once the home to various gangsters and corrupt politicians – exerts an evil influence on Jack, a recovering alcoholic, poisoning him against his own son.

Elsewhere, in Rosemary’s Baby author Ira Levin “successfully targeted evolved fears of intimate betrayal, contamination of the body, and persecution by metaphysical forces of evil” (p.91), whilst The Blair Witch Project plays upon our tendency to attribute negative value to a place where something bad has happened, a tendency which is adaptive because it makes people avoid dangerous places. As Clasen notes (p.143):

The same psychological phenomenon is at work when people shun houses in which murders or other particularly violent or grisly forms of crime have taken place.

Although Clasen himself says that there is far more we don’t know about how horror fiction works than what we do know, the evolutionary psychology approach would appear to offer a far more promising prospect for our understanding than any other approach that has so far been proposed. It is also an approach which should be far more satisfying to those of us who enjoy horror fiction, because it is in line with our intuitive understanding that we like horror because we simply enjoy the thrill of being scared whilst knowing that we are not really in danger.

Review – Meltdown: Why Our Systems Fail and What We Can Do About It

In the opening chapter of Meltdown, the authors Chris Clearfield and András Tilcsik describe the series of events that led to a near-disaster at the Three Mile Island nuclear facility in the United States. The initiating event was relatively minor, and occurred during routine maintenance, but as problems began to multiply the operators were confused. They could not see first-hand what was happening and were reliant on readouts from individual instruments, which did not show the whole story and which were open to misinterpretation.

The official investigation sought to blame the plant staff, but sociology professor Charles “Chick” Perrow argued that the incident was a system problem. Perrow, the author of Normal Accidents: Living with High-Risk Technologies, characterised systems along two dimensions: complexity and coupling. Complex systems have many interacting parts, and frequently the components are invisible to the operators. Tightly-coupled systems are those in which there is little or no redundancy or slack; a perturbation in one component may have multiple knock-on effects. Perrow argued that catastrophic failures tend to be those where there is a combination of high complexity and tight coupling. His analysis forms the explanatory basis for many of the calamities described in Meltdown. Not all of these are life-threatening. Some are merely major corporate embarrassments, such as when PricewaterhouseCoopers cocked up the award for Best Picture at the 89th Academy Awards. Others nonetheless had a big impact on ordinary people, such as the problems with the UK Post Office’s Horizon software system, which led to many sub-postmasters being accused of theft, fraud and false accounting. Then there are the truly lethal events, such as the Deepwater Horizon oil rig explosion. Ironically, it is often the safety systems themselves that are the source of trouble. Perrow is quoted as saying that “safety systems are the biggest single source of catastrophic failure in complex tightly-coupled systems”.

The second half of Meltdown is devoted to describing some of the ways in which we can reduce the likelihood of things going wrong. These include Gary Klein’s idea of the premortem. When projects are being planned, people tend to be focused on how things are going to work, which can lead to excessive optimism. Only when things go wrong do the inherent problems start to appear obvious (hindsight bias). Klein suggests that planners envisage a point in time after their project has been implemented, and imagine that it has been a total disaster. Their task is to write down the reasons why it has all gone so wrong. By engaging in such an exercise, planners are forced to think about things that might not otherwise have come to mind, to find ways to address potential problems, and to develop more realistic timelines.

Clearfield and Tilcsik also discuss ways to improve operators’ mental models of the systems they are using, as well as the use of confidential reporting systems for problems and near-misses.

They devote several chapters to the important topic of allowing dissenting voices to speak openly about their concerns. There is ample evidence that lack of diversity in teams, including corporate boards, has a detrimental effect on the quality of discussion. Appointing the “best people for the job” may not be such a great idea if the best people are all the same kind of people. One study found that American community banks were more likely to fail during periods of uncertainty when they had higher proportions of banking experts on their boards. It seems that these experts were overreliant on their previous experiences, were overconfident, and – most importantly – were over-respectful of each other’s opinions. Moreover, domination by banking experts made it harder for challenges to be raised by the non-bankers on the boards. However, where there were higher numbers of non-bankers, the bankers had to explain issues in more detail and their opinions were challenged more often.

Other research shows that both gender and ethnic diversity are important, too. An experimental study of stock trading, involving simulations, found that ethnically homogeneous groups of traders tended to copy each other, including each other’s mistakes, resulting in poorer performance. Where groups were more diverse, people were generally more skeptical in their thinking and therefore more accurate overall. Another study found that companies were less likely to have to issue financial restatements (corrections owing to error or fraud) where there was at least one woman director on the board.

Clearfield and Tilcsik argue that the potential for catastrophe is changing as technologies develop. Systems which previously were not both complex and tightly-coupled are increasingly becoming so. This can of course result in great performance benefits, but may also increase the likelihood that any accidents that do occur will be catastrophic ones.

Meltdown has deservedly received a lot of praise since its publication last year. The examples it describes are fascinating, the explanations are clear, and the proposed solutions (although not magic bullets) deserve attention. Writing in the Financial Times, Andrew Hill cited Meltdown when talking about last year’s UK railway timetable chaos, saying that “organisations must give more of a voice to their naysayers”. The World Economic Forum’s Global Risks Report 2019 carries a short piece by Tilcsik and Clearfield, titled Managing in the Age of Meltdowns.

I highly recommend this excellent book.

Review – Lab Rats: Why Modern Work Makes People Miserable

When scientists develop new antidepressant drugs, they first administer them to rats. This means initially inducing depression in the poor rodents. No physical pain is involved. Rather, for a prolonged period the animals experience unpredictable negative changes to their environment, such as wet bedding, dirty sawdust, the sounds of predators, or changes to the cycle of light and dark. Eventually, the rats slide into an apathetic state, ceasing to groom themselves or build nests, and not bothering to use their exercise wheels.

In Lab Rats, a book that is by turns funny and frightening, Dan Lyons likens the plight of modern workers to that of these experimental rats. Constant change, whether it be office layout, new technologies or new methodologies, is producing a workforce that is increasingly stressed, depressed, and sometimes suicidal. Three other factors contribute to decreasing satisfaction with work. The first factor is money; or, specifically, the lack of it. Over the last few decades the incomes of ordinary workers have fallen, whilst those of the wealthiest have boomed. Secondly, workers are increasingly insecure. The third factor is dehumanization, whereby people are increasingly used by technology, rather than vice versa.

According to Lyons, there are two key reasons why modern work has become so much worse. The first is the shift from stakeholder capitalism to shareholder capitalism. For much of the twentieth century, company executives often recognised that they had responsibilities to employees and the wider community, as well as to investors. However, a significant change in attitude occurred when Milton Friedman, a University of Chicago economist (later to be awarded the Nobel prize), promoted the idea that the only responsibility that companies had was towards their shareholders. Aided by anti-trades union legislation, Friedman’s ideas led to a more ruthless form of capitalism, in which jobs were cut or moved abroad, wages slashed, and work frequently outsourced to the lowest bidder. The gig economy developed, in which organisations assembled a body of contract employees, people who were in fact often classed as “self-employed” so that they didn’t have to be awarded the kinds of benefits, such as paid holidays and sick pay, enjoyed by regular employees. The development of the Internet served to speed up these processes.

The second key factor identified by Lyons is the rise of Silicon Valley. Indeed, in the American edition of Lab Rats, Silicon Valley is explicitly identified in the subtitle. Once upon a time, Silicon Valley was full of hippies who grew up in the counter-culture of the 1960s. Companies like Hewlett Packard were a model of how to treat employees well. However, with the advent of shareholder capitalism the hippies were replaced by ruthless oligarchs (e.g. Jeff Bezos, Mark Zuckerberg, Travis Kalanick and Elon Musk), an army of wannabes desperate to get rich quick, and a bunch of venture capitalists who hold out the hope that this can be achieved.

The lack of morality in modern-day Silicon Valley is surely best illustrated by the following example. The rise of the tech oligarchs and their billion-dollar campuses, such as the Googleplex and Apple’s spaceship campus, has pushed up housing prices so far that these tech installations are now fringed by neighbourhoods where people live in camper vans, tents, or simply on the sidewalks. In 2016:

a bunch of rich techies came up with their own solution, sponsoring a ballot proposition that would let police forcibly remove homeless people from the sidewalks. Homeless people would get twenty-four hours to either move to a shelter or get a bus ticket out of town. If they didn’t comply, the cops could seize their tents and belongings. (p.36)

The proposition was passed.

But back to the venture capitalists. In Lyons’s words, the Valley has become a “casino”. The ambition of the modern ‘techie’ is to create a start-up business that attracts sufficient money from venture capitalists such that they are able to get rich by floating the business on the stock market (essentially, getting taken over by other monied interests). Lyons refers to these business start-ups as “unicorns”. Along the way, these businesses typically lose heaps of money, which is why their employees are treated so poorly. But no matter – if all goes to plan, the start-up bosses flog off their outfits, then write a best-selling book about how to run a ‘disruptive’ company. Outside of Silicon Valley, many CEOs – fearful that their organisation is at risk of stagnating in the new economy – lap up this guidance on how to do things a new way. Yet, to quote Lyons:

Silicon Valley has no fountain of youth. Unicorns do not possess any secret management wisdom. Most start-ups are terribly managed, half-assed outfits run by buffoons and bozos and frat boys, and funded by amoral investors who are only hoping to flip the company into the public markets and make a quick buck. They have no operations expertise, no special insight into organizational behavior (p.45)

Why is it that CEOs are so ready to seek out the guidance of business gurus? It seems that the simple truth is that no-one really knows how to run a big company. Lyons writes:

The business world has a seemingly insatiable appetite for management gurus. You probably can’t blame CEOs. It may be that no human is really smart enough to run something as vast and complex as a corporation. Yet someone has to do it. Clinging to a system, any system, at least provides the illusion of structure. The system also gives the boss something to blame when things go wrong. Managers grasp at systems the way that drowning people reach for life jackets.

Indeed, although Lyons doesn’t mention this, even the Harvard Business Review has previously noted that CEOs who run highly successful corporations frequently fail to repeat that success when they move to another company (with correspondingly vast salary). It is as though success occurs despite the CEOs’ presence rather than because of it.

Lyons traces a rough history of business systems, beginning with the work of “shameless fraud” Frederick Taylor (“He fudged his numbers. He cheated and lied”) in the 1890s. Taylor claimed to have devised a scientific method to optimise the efficiency of any process. In reality, he ramped up the quotas until staff began to leave. Taylor was subsequently fired from the company where he did his “research”. Despite his work being thoroughly debunked, Taylorism became almost a religion. Since Taylor, we have had Peter Drucker (who coined the term “knowledge worker”), Michael Porter, Jim Collins, and countless others. The business fads that we’ve been sold include the ‘Five Forces Framework’, ‘Six Sigma’, ‘Lean Manufacturing’, ‘Lean Startup’ and ‘Agile’.

Some space is devoted to description and discussion of ‘Agile’, which is perhaps the most recent fad to be widely adopted. In 2001 a group of software developers authored a ‘Manifesto for Agile Software Development’, an idea that was subsequently pounced on by others and expanded far beyond its original domain of application. Lyons describes the business application of Agile as:

a management fad that has swept the corporate world and morphed into what some call a movement but is more like widespread mental illness (p.55)

Like other fads, Agile is really just another version of Taylorism. All these ideas basically boil down to trying to do more with fewer people for less money. The authors of the original Agile manifesto have sought to distance themselves from what Agile has become, saying they can no longer make sense of it.

Sadly, one particular group of workers may find themselves particularly at risk:

The pressure is extra high on older workers, who are experienced enough to realize that this is bullshit, and that Agile usually fails, but wise enough to realize that the real point of Agile may be to create an excuse to fire older workers, and so the smart thing to do is to shut up and go along. (p.59)

Lyons does hold out some hope for better things, and the final section of his book is called “The No-Shit-Sherlock School of Management”. He points out that Fortune magazine’s list of ‘Legends’ comprises companies that are incredibly successful and treat their employees exceptionally well. Elsewhere, a number of businesses have sprung up in Silicon Valley that are reacting against the spread of shareholder capitalism by not only treating their staff well but also doing good for the community. One venture capital firm, Kapor Capital, engages only in “gap-closing” investing, putting money into companies that are “serving low-income communities and/or communities of color to close gaps of access, opportunity, or outcome” (p.201).

It also seems that increasing numbers of students are drawn towards business courses that have more of a social emphasis. Elsewhere, workers have rediscovered the value of becoming organised. For example, Google employees succeeded in getting the company to abandon its involvement in a military drone programme, and a number of gig economy workers have successfully organised to challenge their working conditions and contracts.

Lab Rats is an eminently readable book that will both amuse and horrify. To be sure, Dan Lyons’s emphasis is on reforming capitalism, which may seem a little optimistic to some on the left. Indeed, I felt he might have been viewing the pre-Friedman economic era through slightly rose-tinted spectacles. Also, whilst he holds up Starbucks as a company that treats its employees very well, he omits to mention that it has also been criticised for using legal mechanisms to minimise the tax it pays in the countries where it operates (Lyons does criticise Apple for the same thing). However, CEOs who are looking for a business reason to treat their employees well might note the work of the psychologist Dan Ariely, who found that companies where people felt physically and emotionally safe tended to outperform the stock market.

Double book review: Margaret Boden and Gary Smith on Artificial Intelligence

AI – Its nature and future, by Margaret A. Boden. Oxford University Press. 2016.

The AI Delusion, by Gary Smith. Oxford University Press. 2018.

AI, machine learning, algorithms, robots, automation, chatbots, sexbots, androids – in recent years all these terms have regularly been appearing in the media, either to tell us about the latest achievements in technology, about exciting future possibilities, or in the context of warnings about threats to our jobs and freedoms.

Two recent books, from Margaret Boden and Gary Smith, respectively, are useful guides to the perplexed in explaining the issues. Each is clearly written and highly readable. Margaret Boden, Research Professor of Cognitive Science at the University of Sussex, begins with a basic definition:

Artificial intelligence (AI) seeks to make computers do the sorts of things that minds can do.

People who work in AI tend to work in one of two different camps (though occasionally both). They either take a technological approach, whereby they attempt to create systems that can perform certain tasks, regardless of how they do it; or they take a scientific approach, whereby they are interested in answering questions about human beings or other living things.

Boden’s book is essentially a potted history of the field, guiding the reader through the different approaches and philosophical arguments. Alan Turing, of Bletchley Park fame, seems to have envisaged all the current developments in the field, though during his lifetime the technology wasn’t available to implement these ideas. The first approach to hit the big time is now known as ‘Good Old-Fashioned AI (GOFAI)’. This assumes that intelligence arises from physical entities that can process symbols in the right kind of way, whether these entities are living organisms, arrangements of tin cans, silicon chips or whatever else. The other approaches are not reliant on sequential symbol processing. These are: 1. Artificial Neural Networks (ANNs), or connectionism, 2. Evolutionary programming, 3. Cellular automata (CA), and 4. Dynamical systems. Some researchers argue in favour of hybrid systems that combine elements of symbolic and non-symbolic processing.

For much of the 1950s, researchers of different theoretical persuasions all attended the same conferences and exchanged ideas, but in the late ’50s and 1960s a schism developed. In 1956 John McCarthy coined the term ‘Artificial Intelligence’ to refer to the symbol processing approach. This was seized upon by journalists, particularly as this approach began to have successes with the Logic Theory Machine (Newell & Simon) and General Problem Solver (Newell, Shaw, and Simon). By contrast, Frank Rosenblatt’s connectionist Perceptron model was found to have serious limitations and was contemptuously dismissed by many symbolists. Professional jealousies were aroused and communication between the symbolists and the others broke down. Worse, funding for the connectionist approach largely dried up.

Work within the symbol processing, or ‘classical’, approach has taught us some important lessons. These include the need to make problems tractable by directing attention to only part of the ‘search space’, by making simplifying assumptions and by ordering the search efficiently. However, the symbolic approaches also faced the issue of ‘combinatorial explosion’, meaning that logical processes would draw conclusions that were true but irrelevant. Likewise, in classical – or ‘monotonic’ – logic, once something is proved to be true it stays true, but in everyday life that is often not the case. Boden writes:

AI has taught us that human minds are hugely richer, and more subtle, than psychologists previously imagined. Indeed, that is the main lesson to be learned from AI.

Throughout the lean years for connectionist AI a number of researchers had plugged away regardless, and in the late 1980s there was a sudden explosion of research under the name of ‘Parallel Distributed Processing’ (PDP). These models consist of many interconnected units, each one capable of computing only one thing. There are multiple layers of units, including an input layer, an output layer, and a ‘hidden layer’ or layers in between. Some connections feed forward, others backwards, and others connect laterally. Concepts are represented within the state of the entire network rather than within individual units.

PDP models have had a number of successes, including their ability to deal with messy input. Perhaps the most notable finding occurred when a network produced over-generalisation of past-tense learning (e.g. saying ‘go-ed’ rather than ‘went’), indicating – contrary to Chomsky – that this aspect of language learning may not depend on an inborn linguistic rule. Consequently, the research funding tap was turned back on, especially from the US Department of Defense. Nonetheless, PDP models have their own weaknesses too, such as not being able to represent precision as well as classical models:

Q: What’s 2 + 2?

A: Very probably 4.

Learning within ANNs usually involves changing the strength (the ‘weights’) of the links between units, as expressed in the saying “fire together, wire together”. It involves the application of ‘backprop’ (backwards propagation) algorithms, which trace responsibility for performance back from the output layer into the hidden layers, identifying the units whose weights need to be adjusted, and thence to the input layer. The algorithm needs to know the precise state of the output layer when the network is giving the correct answer.
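
As a rough illustration of the idea (a generic textbook-style sketch, not code from Boden’s book), here is a tiny network trained by backwards propagation of error on the XOR problem; with an unlucky random initialisation it can get stuck, in which case a different seed or more hidden units will help.

```python
import numpy as np

# Tiny feed-forward network: 2 inputs -> 3 hidden units -> 1 output, trained on XOR.
rng = np.random.default_rng(1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])          # the 'correct answers' the algorithm needs

W1, b1 = rng.normal(0, 1, (2, 3)), np.zeros((1, 3))   # input -> hidden weights and biases
W2, b2 = rng.normal(0, 1, (3, 1)), np.zeros((1, 1))   # hidden -> output weights and biases
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(20000):
    # Forward pass: compute the network's current answers.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    # Backward pass: trace responsibility for the output error back into the hidden layer.
    err_out = (output - y) * output * (1 - output)
    err_hid = (err_out @ W2.T) * hidden * (1 - hidden)
    # Adjust each weight in proportion to its share of the blame.
    W2 -= 0.5 * hidden.T @ err_out; b2 -= 0.5 * err_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * X.T @ err_hid;      b1 -= 0.5 * err_hid.sum(axis=0, keepdims=True)

print(np.round(output.ravel(), 2))   # should end up close to [0, 1, 1, 0]
```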

Although PDP propaganda plays up the similarity between network models and the brain’s neuronal connections, in fact there is no backwards propagation in the brain. Synapses feed forwards only. Also, brains aren’t strict hierarchies. Boden also notes (p.91):

a single neuron is as computationally complex as an entire PDP system, or even a small computer.

Subsequent to the 1980s PDP work it has been discovered that connections aren’t everything:

Biological circuits can sometimes alter their computational function (not merely make it more or less probable), due to chemicals diffusing through the brain.

One example of this is nitric oxide. Researchers have now developed new types of ANNs, including GasNets, used to evolve “brains” for autonomous robots.

Boden also discusses other approaches within the umbrella of AI, including robots and artificial life (‘A-life’), and evolutionary AI. These take in concepts such as distributed cognition (minds are not within individual heads), swarm intelligence (simple rules can lead to complex behaviours), and genetic algorithms (programs are allowed to change themselves, using random variation and non-random selection).

But are any of these systems intelligent? Many AI models have been very successful within specific domains and have outperformed human experts. However, the essence of human intelligence – even though the word itself does not have a standard definition among psychologists – is that it involves the ability to perform in many different domains, including perception, language, memory, creativity, decision making, social behaviour, morality, and so on. Emotions appear to be an important part of human thought and behaviour, too. Boden notes that there have been advances in the modelling of emotion, and there are programs that have demonstrated a certain degree of creativity. There are also some programs that operate in more than one domain, but are still nowhere near matching human abilities. However, unlike some people who have warned about the ‘singularity’ – the moment when machine intelligence exceeds that of humans – Boden does not envisage this happening. Indeed, whilst she holds the view that, in principle, truly intelligent behaviour could arise in non-biological systems, in practice this might not be the case.

Likewise, the title of Gary Smith’s book is not intended to decry all research within the field of AI. He also agrees that many achievements have occurred and will continue to do so. However, the ‘delusion’ of the title occurs when people assign to computers an ability that they do not in fact possess. Excessive trust can be dangerous. For Smith:

True intelligence is the ability to recognize and assess the essence of a situation.

This is precisely what he argues AI systems cannot do. He gives the example of a drawing of a box cart. Computer systems can’t identify this object, he says, whereas almost any human being could not only identify it, but suggest who might use it, what it might be used for, what the name on the side means, and so on.

Smith refers to the Winograd Schema Challenge. The Stanford computer science professor Terry Winograd has put up a $25,000 prize for anyone who can design a system that is at least 90% accurate in interpreting sentences like this one:

I can’t cut that tree down with that axe; it is too [thick/small]

Most people realise that if the bracketed word is ‘thick’ it refers to the tree, whereas if it is ‘small’ it refers to the axe. Computers are typically – ahem – stumped by this kind of sentence, because they lack the real-world experience to put words in context.

Much of Smith’s concern is about the data-driven (rather than theory-driven) way that machine learning approaches use statistics. In essence, when a machine learning program processes data it does not stop to ask ‘Where did the data come from?’ or ‘Why these data?’ These are important questions to ask and Smith takes us through various problems that can arise with data (his previous book was called Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics).

One important limitation associated with data is the ‘survivor bias’. A study of Allied warplanes returning to Britain after bombing runs over Germany found that most of the bullet and shrapnel holes were on the wings and rear of the plane, but very few on the cockpit, engines, or fuel tanks. The top brass therefore planned to attach protective metal plates to the wings and rear of their aircraft. However, the statistician Abraham Wald pointed out that the planes that returned were, by definition, the ones that had survived the bullets and shrapnel. The planes that had not returned had most likely been struck in the areas that the returning planes had been spared. These were the areas that should be reinforced.

Another problem is the one discussed in my previous blog, that of fake or bad data, arising from the perverse incentives of academia and the publishing world. The ‘publish-or-perish’ climate, together with the wish of journals to publish ‘novel’ or ‘exciting’ results, has led to an exacerbation of ‘Questionable Research Practices’ or outright fakery, with the consequence that an unfortunately high proportion of published papers contain false findings.

Smith is particularly scathing about the practice of data mining, something that for decades was regarded as a major ‘no-no’ in academia. This is particularly problematic with the advent of big data, when machine learning algorithms can scour thousands upon thousands of variables looking for patterns and relationships. However, even among sequences that are randomly generated, correlations between variables will occur. Smith shows this to be the case with randomly generated sequences of his own. He laments that

The harsh truth is that data-mining algorithms are created by mathematicians who often are more interested in mathematical theory than practical reality.

and

The fundamental problem with data mining is that it is very good at finding models that fit the data, but totally useless in gauging whether the models are ludicrous.
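
Smith’s point about random data is easy to reproduce. The sketch below is my own illustration rather than his: it generates data that are pure noise and then, data-mining style, hunts for the strongest correlations it can find.

```python
import numpy as np

# 1,000 observations on 200 variables of pure random noise -- no real relationships at all.
rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 200))

# Data-mining style: correlate every variable with every other and keep the 'best' findings.
corr = np.corrcoef(data, rowvar=False)
upper = np.abs(corr[np.triu_indices(200, k=1)])   # all 19,900 distinct pairs

print(f"pairs examined:       {upper.size}")
print(f"strongest |r| found:  {upper.max():.3f}")
print(f"pairs with |r| > 0.1: {(upper > 0.1).sum()}")
# Plenty of 'relationships' emerge, every one of them meaningless by construction.
```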

When it comes to the choice of linear or non-linear models, Smith says that expert opinion is necessary to decide which is more realistic (though one recent systematic comparison of methods, involving a training set of data and a validation set, found that the non-linear methods associated with machine learning were dominated by the traditional linear methods). Other problems arise with particular forms of regression analysis, such as stepwise regression and ridge regression. Data reduction methods, such as factor analysis or principal components analysis, can also cause problems because the transformed data are hard to interpret; especially if mined from thousands of variables, they will contain nonsense. Smith looks at some dismal attempts to beat the stock market using data mining techniques.
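
The training/validation logic behind that kind of comparison can be sketched as follows. This uses simulated data and is my own illustration (not the cited study): when the underlying relationship is simple and noisy, the flexible non-linear model tends to fare worse on the held-out validation set than plain linear regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Simulated data: a genuinely linear relationship plus a lot of noise.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=3.0, size=300)

# Fit on one half of the data, judge on the other half.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

for name, model in [("linear regression", LinearRegression()),
                    ("random forest", RandomForestRegressor(random_state=0))]:
    model.fit(X_train, y_train)
    err = mean_squared_error(y_val, model.predict(X_val))
    print(f"{name}: validation MSE = {err:.2f}")
```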

But as if the statistical absurdities weren’t bad enough, Smith’s penultimate chapter – the one that everything else has been leading up to, he says – concerns the application of these techniques to our personal affairs in ways which impinge upon our privacy. For example, software exists that examines the online behaviour of job applicants. Executives who ought to know better may draw inappropriate causal inferences from such data. One of the major examples discussed earlier in the book is Hillary Clinton’s presidential campaign. Although not widely known, her campaign made use of a powerful computer program called Ada (named after Ada Lovelace, the nineteenth-century computing pioneer). This crunched masses of data about potential voters across the country, running 400,000 simulations per day. No-one knows exactly how Ada worked, but it was used to guide decisions about where to target campaigning resources. The opinions of seasoned campaigners were entirely sidelined, including perhaps the greatest campaigner of all, Bill Clinton (who was reportedly furious about this). We all know what happened next.

Review: The 7 Deadly Sins of Psychology

I remember being taught as an undergraduate psychology student that replication, along with the principle of falsification, was a vital ingredient in the scientific method. But when I flipped through the pages of the journals (back in those pre-digital days), the question that frequently popped into my head was ‘Where are all these replications?’ It was a question I never dared actually ask in class, because I was sure I must simply have been missing something obvious. Now, about 30 years later, it turns out I was right to wonder.

In his magisterial new book The 7 Deadly Sins of Psychology, Chris Chambers reports that it wasn’t until 2012 that the first systematic study was conducted into the rate of replication within the field of psychology. Makel, Plucker and Hegarty searched for the term “replicat*” among the 321,411 articles published in the top 100 psychology journals between 1900 and 2012. Just 1.57 per cent of the articles contained this term, and among a randomly selected subsample of 500 papers from that 1.57 per cent,

only 342 reported some form of replication – and of these, just 62 articles reported a direct replication of a previous experiment. On top of that, only 47 per cent of replications within the subsample were produced by independent researchers (p.50).

Why does this matter? It seems that researchers, over a long period, have engaged in a variety of ‘Questionable Research Practices’ (QRPs), motivated by ambitions that are often shaped by the perverse incentives of the publishing industry.

A turning point occurred in 2011 when the Journal of Personality and Social Psychology published Daryl Bem’s now-notorious paper ‘Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect’. Taking a classic paradigm in which an emotional manipulation influences the speed of people’s responses on a subsequent task, Bem conducted a series of studies in which the experimental manipulation happened after participants made their responses. His results seemed to indicate that people’s responses were being influenced by a manipulation that hadn’t yet happened. There was general disbelief among the scientific community, and Bem himself said that it was important for other researchers to attempt to replicate his findings. However, when the first – failed – replication was submitted to the same journal, it was rejected on the basis that the journal’s policy was not to publish replication studies, whether or not they were successful.

In fact, many top journals – e.g. Nature, Cortex, Brain, Psychological Science – explicitly state, in various ways, that they only publish findings that are novel. A December 2015 study in the British Medical Journal, that perhaps appeared too late for inclusion in Chambers’ book, found that over a forty year period scientific abstracts had shown a steep increase in the use of words relating to novelty or importance (e.g. “novel”, “robust”, “innovative” and “unprecedented”). Clearly, then, researchers know what matters when it comes to getting published.

A minimum requirement, though not the only one, for a result being interesting is that it is statistically significant. In the logic of null hypothesis significance testing (NHST), this means that if chance were the only factor at work, the probability of obtaining a result at least as extreme as the one observed would be less than 5 per cent (or less than 1 in 20). Thus, researchers hope that any of their key tests will yield a p-value of less than .05, as – by convention – this allows them to reject the null hypothesis in favour of their experimental hypothesis (the explanation that they are actually proposing, and in which they may be invested).

It is fairly easy to see how the academic journals could be – and almost certainly are – overpopulated with papers that claim evidential support for hypotheses that are false. For instance, suppose many different researchers test a hypothesis that is, unknown to them, incorrect. Perhaps just one researcher finds a significant result, a fluke arising by chance. That one person is likely to get published, whereas the others will not. In reality, many researchers will not bother to submit their null findings. But here lies another problem. A single researcher may conduct several studies of the same hypothesis, but only attempt to publish the one (or ones) that turn out significant. He or she may feel a little guilty about this, but hey! – they have careers to progress and this is the system that the publishers have forced upon them.
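
The arithmetic is easy to make vivid with a small simulation (my own, purely illustrative): give a hundred hypothetical labs the same non-existent effect to study, and a handful of them will nevertheless obtain ‘significant’ results.

```python
import numpy as np
from scipy import stats

# 100 hypothetical labs each test the same false hypothesis (no real effect),
# running a two-group experiment with 20 participants per group.
rng = np.random.default_rng(3)
significant = 0
for _ in range(100):
    a = rng.normal(size=20)   # control group
    b = rng.normal(size=20)   # 'treatment' group, drawn from the identical population
    t, p = stats.ttest_ind(a, b)
    significant += p < 0.05

print(f"labs with p < .05: {significant} of 100")
# Around 5 labs will hit 'significance' by chance; if only they submit their results,
# the published literature shows nothing but apparent support for a false hypothesis.
```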

Replication is supposed to help discover which hypotheses are false and which are likely to be true. As we have seen, though, failed replications may never see the light of day. More problematic is the use of ‘conceptual replications’, in which a researcher tries to replicate a previous finding using a methodology that is, to a greater or lesser degree, novel. The researcher can claim to be “extending” the earlier research by testing the generality of its findings. Indeed, having this element of originality may increase the chances of publication. However, as Chambers notes, there are three problems with conceptual replications.

First, how similar must the methods be in order for the new study to count as a replication, and who decides this? Second, there is a risk of certain findings becoming unreplicated: if a successful conceptual replication later turns out to have produced its result through an entirely different causal mechanism, then the original study has just been unreplicated. Third, attempts at conceptual replication can fuel confirmation bias. If a conceptual replication produces a different result to the initial study, the authors of the first study will inevitably claim that their own results were not reproduced precisely because the attempted replication didn’t follow exactly the same methodology.

Chambers sums up the replication situation as follows:

To fit with the demands of journals, psychologists have thus replaced direct replication with conceptual replication, maintaining the comfortable but futile delusion that our science values replication while still satisfying the demands of novelty and originality (p.20).

Because psychologists frequently run studies with more than one independent variable, they typically use statistical tests that provide various main effects and interactions. Unfortunately, this can tempt researchers to operate with a degree of flexibility that isn’t warranted by the original hypothesis. They may engage in HARKing – Hypothesizing After the Results are Known. Suppose a researcher predicts a couple of main effects, but these turn out to be non-significant once the analysis has been performed. Nonetheless, there are some unpredicted significant interactions within the results. The researcher now goes through a process of trying to rationalise why the results turned out this way. Having come up with an explanation, he or she rewrites the hypotheses as though these results were what had been expected all along. Recent surveys suggest that psychologists believe the prevalence of HARKing to be somewhere between 40% and 90%, though the proportion who admit to doing it themselves is, of course, much lower.
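
The temptation is easier to understand once you see how often chance alone produces something ‘significant’ when several effects are tested at once. The toy simulation below (my own illustration, which approximates the two main effects and the interaction of a 2×2 design as three independent null comparisons) finds an unpredicted ‘effect’ in roughly one study in seven.

```python
# How often does *some* effect come out significant when nothing is real?
# A toy stand-in for a 2x2 design: each simulated study runs three null
# comparisons (two 'main effects' and an 'interaction'), approximated here
# as three independent two-sample t-tests. Illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_studies, n_per_group, tests_per_study = 5000, 25, 3

hits = 0
for _ in range(n_studies):
    p_values = [
        stats.ttest_ind(rng.normal(0, 1, n_per_group),
                        rng.normal(0, 1, n_per_group)).pvalue
        for _ in range(tests_per_study)
    ]
    if min(p_values) < .05:          # at least one unpredicted 'finding'
        hits += 1

print(f"At least one significant effect in {hits / n_studies:.1%} of null studies")
# Expected rate: 1 - 0.95**3, i.e. about 14% of studies in which nothing is real.
```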

Another form of QRP is p-hacking. This refers to a cluster of practices whereby a researcher can illegitimately transform a non-significant result into a significant one. Suppose an experimental result has a p-value of .08 – quite close to the magical threshold of .05, but still likely to be a barrier to publication. At this point, the researcher may try recruiting some new participants to the study in the hope that this will bring the p-value below .05. However, bearing in mind that there will always be some variation in the way participants respond, regardless of whether or not a hypothesis is true, “peeking” at the results and recruiting new participants until p falls below .05 simply inflates the likelihood of obtaining a false positive result.
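
A small simulation makes the inflation concrete. In the sketch below (my own illustration, with arbitrary sample sizes), a simulated researcher checks the p-value after every few added participants and stops as soon as it dips below .05; although no effect exists, the false positive rate ends up well above the nominal 5 per cent.

```python
# Optional stopping ('peeking') inflates the false positive rate.
# No real effect exists in these data; the numbers are purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, start_n, max_n, step = 2000, 10, 100, 5

false_positives = 0
for _ in range(n_studies):
    a = list(rng.normal(0, 1, start_n))
    b = list(rng.normal(0, 1, start_n))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < .05:                      # stop and 'publish' as soon as p dips below .05
            false_positives += 1
            break
        if len(a) >= max_n:              # give up at the maximum sample size
            break
        a.extend(rng.normal(0, 1, step)) # recruit a few more participants and peek again
        b.extend(rng.normal(0, 1, step))

print(f"False positive rate with peeking: {false_positives / n_studies:.1%} "
      f"(nominal rate: 5%)")
```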

A second form of p-hacking is to analyse the data in different ways until you get the result you want. There is no single agreed method for excluding ‘outliers’ from the data, so a researcher may run several analyses in which differing numbers of outliers are excluded, until a significant result is returned. Alternatively, there may be different forms of statistical test that can be applied. All tests are essentially estimates, and while equivalent-but-different tests will produce broadly similar results, a difference in the second or third decimal place may be all that is needed to transform a non-significant result into a significant one.

A third form of p-hacking is to change your dependent variables. For example, if three different measures of an effect are all just slightly non-significant, then a researcher might try integrating them into one measure to see if this brings the p-value below .05.

Several recent studies have examined the distributions of p-values in similar kinds of studies and have found that there is often a spike in p-values just below .05, which would appear to be indicative of p-hacking. The conclusion that follows from this is that many of the results in the psychological literature are likely to be false.
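
The reasoning behind these checks on the distribution of p-values can be illustrated with another small simulation (mine, not one of the studies Chambers describes): honest null studies produce p-values spread evenly between 0 and 1, whereas studies ‘hacked’ by stopping at the first significant result pile up just below .05.

```python
# Why a spike just below .05 is suspicious: compare the p-values produced by
# honest null studies with those produced by stopping as soon as p < .05.
# Entirely simulated, for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def honest_p(n=50):
    # One look at the data, fixed sample size, no real effect.
    return stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue

def hacked_p(start=10, max_n=100, step=5):
    # Keep adding participants and peeking until p < .05 or we run out of budget.
    a, b = list(rng.normal(0, 1, start)), list(rng.normal(0, 1, start))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < .05 or len(a) >= max_n:
            return p
        a.extend(rng.normal(0, 1, step))
        b.extend(rng.normal(0, 1, step))

honest = [honest_p() for _ in range(2000)]
hacked = [hacked_p() for _ in range(2000)]

for label, ps in [("honest", honest), ("p-hacked", hacked)]:
    just_below = np.mean([(0.04 < p <= 0.05) for p in ps])
    just_above = np.mean([(0.05 < p <= 0.06) for p in ps])
    print(f"{label:9s}: {just_below:.1%} of p-values in (.04, .05], "
          f"{just_above:.1%} in (.05, .06]")
```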

Chris Chambers also examines a number of other ways in which the scientific literature can become distorted. One such way is the hoarding of data. Many journals do not require, or even ask, that authors deposit their data with them. Authors themselves often refuse to provide data when a request is received, or will only provide it under certain restrictive conditions (almost certainly not legally enforceable). Yet one recent study found that statistical errors were more frequent in papers whose authors had failed to provide their data. Refusal to share may, of course, be one way of hiding misconduct. Chambers argues that data sharing should be the norm, not least because even the most scrupulous and honest authors may, over time, lose their own data, whether through the updating of computer equipment or in the process of changing institutions. And, of course, everyone dies sooner or later. So why not ensure that all research data is held in accessible repositories?

Chapter 7 – The Sin of Bean Counting – covers some ground that I discussed in an earlier blog, when I reviewed Jerry Muller’s book The Tyranny of Metrics. Academic journals now have a ‘Journal Impact Factor’ (JIF), which uses citation counts to index the overall quality of the work a journal publishes. Yet a journal’s JIF is driven by only a very small proportion of the papers it carries; most papers receive only a small number of citations. Worse, the supposedly high impact journals are in fact the ones with the highest rates of retraction owing to fraud or suspected fraud. Chambers argues that it would be more accurate to refer to them as “high retraction” journals rather than “high impact” journals. The JIF is also easily massaged by editors and publishers and, rather than being objectively calculated, is a matter of negotiation between journals and the company that determines the JIF (Thomson Reuters).

Yet:

Despite all the evidence that JIF is more-or-less worthless, the psychological community has become ensnared in a groupthink that lends it value.

It is used within academic institutions to help determine hiring and promotions, and even redundancies. Many would argue that JIF and other metrics have damaged the collegial atmosphere that one associates with universities, which in many instances have become arenas of overwork, stress and bullying.

Indeed, recent years have seen a number of instances of fraudulent behaviour by psychologists, most notably Diederik Stapel, who invented data for over 50 publications before eventually being exposed by a group of junior researchers and one of his own PhD students. By his own account, he began by engaging in “softer” acts of misrepresentation before graduating to more serious behaviours. Sadly, his PhD students, who had unwittingly incorporated his fraudulent results into their own PhDs (which they were allowed to retain), had their peer-reviewed papers withdrawn from the journals in which they had been published. Equally sad is ‘Kate’s Story’ (also recounted in Chapter 5), which describes the unjust treatment meted out to a young scientist after she was caught up in a fraud investigation against the Principal Investigator of the project she was working on, even though she was not the one who had reported him. Kate is reported as saying that if you suspect someone of making up data, but lack definitive proof, then you should not expect any sympathy or support for speaking out.

Fortunately, Chris Chambers has given considerable thought to how psychology’s replication crisis might be addressed. Indeed, he and a number of other psychologists have been instrumental in effecting some positive changes in academic publishing. His view is that it would be hopeless to try to address the biases (many of them likely unconscious) that researchers possess. Rather, it is the entire framework of the scientific and publishing enterprise that must be changed. His suggestions include:

  • The pre-registration of studies. Researchers submit their research idea to a journal in advance of carrying out the work. This includes details of hypotheses to be tested, the methodology and the statistical analyses that will be used. If the peer reviewers are happy with the idea, then the journal commits to publication of the findings – however they turn out – if the researchers have indeed carried out the work in a satisfactory manner.
  • The use of p-curve analyses to determine which fields in psychology are suffering from p-hacking.
  • The use of disclosure statements. Joe Simmons and colleagues have pioneered a 21-word statement:

We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.

  • Data sharing.
  • Solutions that make “optional stopping” during data collection legitimate. One method is to reduce the alpha-level every time a researcher “peeks” at the data. A second is to use Bayesian hypothesis testing instead of NHST. Whereas NHST only ever tests the null hypothesis (and provides no estimate of how likely the null hypothesis is), the Bayesian approach allows researchers to weigh the evidence for the null hypothesis directly against the experimental hypothesis (a minimal sketch of this idea follows the list).
  • Standardization of research practices. This may not always be possible, but where researchers conduct more than one type of analysis, the details of each should be reported and the robustness of the outcomes summarised.
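
To give a flavour of the Bayesian alternative mentioned above, here is a minimal sketch (my own illustration, not a method Chambers prescribes) that uses a rough BIC-based approximation to the Bayes factor to weigh the null hypothesis against the experimental hypothesis for a simple two-group comparison; the data and all names in it are made up.

```python
# A taste of the Bayesian alternative to NHST: instead of only testing the
# null, compare the evidence for the null (equal means) against the
# experimental hypothesis (different means). This uses a rough BIC-based
# approximation to the Bayes factor; the data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, 40)    # 'control' group (no real effect in these data)
b = rng.normal(0.0, 1.0, 40)    # 'treatment' group

y = np.concatenate([a, b])
n = len(y)

# Residual sums of squares under each model.
rss_null = np.sum((y - y.mean()) ** 2)                                # one common mean
rss_alt = np.sum((a - a.mean()) ** 2) + np.sum((b - b.mean()) ** 2)   # two group means

# BIC = n*ln(RSS/n) + k*ln(n), with k counting the mean parameters
# (the shared variance parameter cancels in the comparison).
bic_null = n * np.log(rss_null / n) + 1 * np.log(n)
bic_alt = n * np.log(rss_alt / n) + 2 * np.log(n)

# Approximate Bayes factor in favour of the null hypothesis: lower BIC wins.
bf01 = np.exp((bic_alt - bic_null) / 2)
print(f"BF01 = {bf01:.2f} (values above 1 favour the null, below 1 favour the effect)")
```

Dedicated packages compute better-calibrated Bayes factors, but even this rough version illustrates the key difference from NHST: the null hypothesis can accumulate positive evidence in its favour, rather than merely failing to be rejected.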

Chambers devotes most space to the discussion of pre-registration. Many objections have been raised against this idea, and Chambers tackles these objections (convincingly, I think) in his Chapter 8: Redemption.

Although the issue of replication and QRPs is not unique to psychology, evidence indicates that the problem may be bigger there than in other disciplines. Therefore, if psychologists wish to be taken seriously then it is incumbent upon them to clean up their act. Fortunately, a number of psychologists – Chambers included – have been at the forefront of both uncovering poor practice and proposing ways to improve matters. A good starting point for anyone wanting to appreciate the scale of the problem, and how to deal with it, would be to read this book. Indeed, I think every university library should have at least one copy on its shelves, and it should be on the reading list for classes in research methods and statistics. Despite being a book on methodology, I didn’t find it a dry read. On the contrary, it is something of a detective story – like Sherlock Holmes explaining how he worked out whodunnit – and, as such, I found it rather gripping.

 

 

 

Review: ‘The Mind is Flat’ by Nick Chater

The nature of consciousness is a topic over which psychologists and philosophers have spilt much ink and many pixels. Outside of psychoanalytic circles, what has been less discussed is the nature of the ‘unconscious mind’. Claims made by some psychologists about the power of the unconscious mind to influence behaviour have proven controversial.

Now, in a book that will have psychoanalysts and many others protesting loudly, cognitive scientist Nick Chater has plunged a stake through the very concept of an unconscious mind. In The Mind Is Flat Chater argues that our minds have no depths, let alone hidden ones. His primary claim is that the brain exists to make sense of the world by creating a stable perception of it and ourselves; but the brain does not provide us with an account of its own workings. These perceptions are created from our interpretations of a limited number of sensory inputs, with the assistance of various memory traces (themselves based on our interpretations of past events).

Chater’s opening chapter, The Power of Invention, describes how we can create an apparently rich internal picture of a fictional person or location based on a limited description that may have gaps or inconsistencies (Chater discusses Anna Karenina and Gormenghast). So it is with our perceptions of the actual world and, indeed, ourselves.  Most of our visual receptors are incapable of colour detection, yet we perceive the world in glorious colour. Our eyes are continually darting about all over the place, yet our perception of the world is smooth, not jerky. In short, much or most of what we perceive is an illusion foisted upon us by our brains.


For centuries, philosophers consulted their ‘inner oracle’ in order to determine how the world works. Yet, Chater points out, the inner oracle has consistently misled us about concepts such as heat, weight, force and energy. Early researchers in artificial intelligence (AI) attempted something similar: they tried to excavate the mental depths of experts, recover a ‘common sense theory’ and then devise methods to reason over this database. However, by the 1980s it had become clear that this programme was going nowhere, and it was quietly abandoned.

As Chater puts it:

The mind is flat: our mental ‘surface’, the momentary thoughts, explanations and sensory experiences that make up our stream of consciousness is all there is to mental life. (p.31)

One reason why we are unaware of the fictional nature of our perceptions is precisely because our eyes are constantly moving about and picking up new sensory fragments. I may be unaware of the type of flower on the mantelpiece, but if you mention it my eyes go there automatically. In gaze-contingent eye tracking studies, the text on a screen changes according to where a person is looking. In fact, most of the text on the screen consists of Xs. As a participant’s eyes move across the screen the Xs that would have been in their fixation point change to become real words, and the area where they had been looking reverts to Xs. The participant, however, perceives that the entire page consists of meaningful text.

Likewise, when we construct a mental image it is never truly a ‘picture in the mind’. If we are asked to describe some details from the image, we simply ‘create’ those in our imagination in response to the question. Nothing is being retrieved from a complete image.

We often talk about a battle between ‘the heart and the head’, but Chater argues that we are in fact simply posing one reason against another. Citing the Kuleshov Effect, and the work of Schachter & Singer (1962) and Dutton & Aron (1974) on the labelling of emotional states, Chater concludes that “our feelings do not burst unbidden from within – they do not pre-exist at all” (p.98). Indeed:

The meaning of pretty much anything comes from its place in a wider network of relationships, causes and effects – not from within. (p.107)

Despite, or perhaps because of, our lack of inner depth, we are extremely good at dreaming up explanations for all kinds of things, including our inner motives. Perhaps my favourite example comes from the work on choice blindness, in which participants were asked to choose the more attractive of two faces, each presented on a card. After a participant made their choice, the researcher supposedly passed them the card they had chosen and asked them to explain why they had preferred that face. In fact, the researcher used sleight of hand to pass them the face they hadn’t chosen. Most people didn’t spot the discrepancy and readily provided an explanation as to why they preferred the face that they had not in fact chosen.

This research links to a wider body of work in decision making research, which shows that people’s preferences are constructed during the process of choice, depending on various contextual factors, as opposed to the conventional economic account that assumes people to have stable preferences that are revealed by the choices they make.

Chater also goes on to talk about people’s attentional limitations, arguing that – in almost all circumstances – our brains are only able to work on one problem at a time (where a problem is something which requires an act of interpretation on our part, rather than an habitual action such as putting one foot in front of the other when walking). This also fits with decades of work on human judgment, which has repeatedly found that people are unable to reliably integrate multiple items of information when trying to make a judgment.

Finally, Chater isn’t arguing that there are no unconscious processes. However, these unconscious processes aren’t ‘thoughts’. The mind isn’t like an iceberg, with a few thoughts appearing in consciousness and many others below the level of consciousness. Rather, the real nature of the unconscious is “the vastly complex patterns of nervous activity that create and support our slow, conscious experience” (p.175). Thus:

There is just one type of thought, and each thought has two aspects: a conscious read-out, and unconscious processes operating the read-out. And we can have no more conscious access to these brain processes than we can have conscious awareness of the chemistry of digestion or the biophysics of our muscles.

 The Mind is Flat is a book that I wish I’d written, in that it expresses, with evidence, a viewpoint that I have held for some time. The writing is clear and entertaining, and I devoured the book in just a few days. Recommended.

 

Book review: Behind the Shock Machine (author: Gina Perry)

One evening in 1974, at a home in New Haven, the family of the late Jim McDonough gathered around their television to watch The Phil Donahue Show. To their horror, a piece of 1960s black and white footage was being shown in which Jim was having electrodes attached to his body. Jim was apparently the learner in an experiment whereby he would receive increasingly strong electric shocks whenever he failed to deliver a correct response to a question.

Bearing in mind that Jim had died of a heart attack in the mid-60s, his widow Kathryn must have been concerned that there might be a connection with this extraordinary piece of research. She wrote to the show’s producer, asking to be put in touch with the man who’d run the experiment, Dr Stanley Milgram. Shortly afterwards, she received a phone call from Milgram, who reassured her that her late husband had not in reality received any electric shocks at all. He also sent her an inscribed copy of the book that had caused the media interest: Obedience to Authority.

The Milgram shock experiments are the subject of an enthralling book by psychologist Gina Perry, published in 2012: Behind the Shock Machine: The Untold Story of the Notorious Milgram Psychology Experiments. By sifting through Milgram’s archive material, as well as interviewing some of his experimental subjects  and assistants (or their surviving relatives), Perry shows that the popular account of the shock experiments, as promoted by Milgram himself, is but a pale and dubious version of what really happened and what the research means.

The popular account goes as follows. Milgram wanted to know whether the behaviour of the Nazis during the Holocaust was due to something specific about German culture, or whether it reflected a deeper aspect of humanity. In other words, could the same thing happen anywhere? In order to investigate this question, Milgram created an experimental scenario in which people would be pressured to commit a potentially lethal act. His subjects were recruited through newspaper advertisements promising payment for taking part in a study of learning and memory. As each subject arrived at Milgram’s laboratory at Yale University, a second subject (actually a paid staff member) would also appear. The experimenter (also a paid confederate of Milgram’s) explained that they were to take part in a study of the effects of punishment on learning: one of them would be the teacher and the other the learner. The two men drew slips of paper to determine which would be which, but this was of course rigged: the subject was always the teacher. The teacher was told that any shocks received by the learner would be painful but not dangerous. He would then receive a small shock himself as an illustration of what he would potentially be delivering to the learner. During the experiment, the teacher and learner would be in separate rooms, unseen to each other but connected by audio.

At the beginning of the experiment, the teacher would read out a list of word pairs to the learner. After this, he would read out each target word followed by four words, only one of which had been paired with the target. The learner would supposedly press a button corresponding to the word he thought was correct. If the learner picked the wrong word, then the teacher had to flick a switch on a machine in order to deliver an electric shock to the learner. The level of shock increased with each wrong answer, ranging from 15 volts up to 450 volts. The two highest settings on the shock machine were labelled ‘XXX – dangerous, severe shock’. The experimenter was always present to oversee the teacher and, if the teacher began to show concern or balk at giving further shocks, would deliver an increasingly stern series of commands (according to a script) requiring the teacher to carry on.

In the first version of the experiment the teacher did not hear from the learner, but in other experiments the learner would begin to call out in increasing levels of distress once the 150V level was reached. There were additional variations, too, such as having the learner and teacher in the same room, having the teacher place the learner’s hand on the shock plate, changing the actors, changing the location to a downtown building, having the learner mention heart trouble, and using female subjects. The experiments began in August 1961 and concluded in May 1962. During the last three days of the experiments, Milgram shot the documentary footage that would form the basis of his film Obedience.

Obedient subjects were defined as those who delivered the highest possible supposed shock of 450V. In most scenarios about 65% of subjects were classed as obedient, though some of the variations (such as teacher and learner in the same room) did lead to lower levels of obedience. By the time Milgram came to write up his research, the Nazi Adolf Eichmann had been tried and hanged in Israel and Hannah Arendt had coined the phrase “the banality of evil”. The observation that dull administrative processes could lie behind the most atrocious war crimes was an ideal peg on which Milgram could hang his research. In an era when the Korean war had given rise to concerns about brainwashing, the concept of ‘American Eichmanns’ took hold.

Milgram’s first account of his work was published in October 1963 in the Journal of Abnormal and Social Psychology, but his famous book – still in print – did not appear until 1974. The original publication of Milgram’s work, and the later publication of his book, met with a mixed response from academics. Critics raised ethical concerns about the treatment of his subjects, pointed to the lack of any underlying theory, and wondered whether it all really meant anything. Wasn’t Milgram just showing what we all knew already – that people can be pushed to commit extreme acts? In response, Milgram pointed to a survey of psychiatrists, most of whom believed that his subjects would not be willing to cause extreme harm to the learners. He also cited follow-up interviews with subjects by a psychiatrist, Dr Paul Errera, which concluded that they had not been harmed and that most had endorsed Milgram’s research.

In his 1974 book, Milgram provided the theory to explain the behaviour of his obedient subjects. This was the notion of the ‘agentic shift’, according to which the presence of an authority figure leads people to view themselves as the agents of another person and therefore not responsible for their own actions. I can recall reading Obedience to Authority as a student in the late ’80s and being confused. To me, the agentic shift theory didn’t seem to explain anything; it simply raised the further question of why people might give up their sense of responsibility in the presence of an authority figure. Gina Perry points out that the theory also fails to explain the substantial proportion of people who didn’t obey, not to mention the discomfort, questions and objections of those who nonetheless ended up delivering the maximum supposed shock (these objections figured in Milgram’s earlier publications, but less so in his book). In suggesting that ordinary Americans could behave like Nazis, Milgram was also ignoring the entire counterculture movement, and especially the widespread protest and civil disobedience in relation to America’s involvement in the Vietnam war.

But Perry goes deeper than merely questioning Milgram’s theory, which many other academics have also done. Her research into the archives resulted in the realisation that, over time, Milgram’s paid actors began to depart from their script. The experimenter was provided with a series of four increasingly strict commands that he was expected to give when faced with a subject who was reluctant to continue. If the subject still refused to continue, then the experimenter was expected to call a halt. But John Williams, Milgram’s usual paid experimenter, began to extemporise some of his commands and to cycle back through the list of four. In other words, some subjects were classed as obedient when in fact they should have been classed as disobedient.

It also turns out that many or most of Milgram’s subjects were not told straight away that the study they had taken part in was a hoax. In a relatively small community, he didn’t want the word to get about that this was the case. Despite this, in the published reports Milgram referred to “dehoaxing” the subjects at the end of the study. Subjects were sent a report about the study, including that the procedure had been a hoax, a little while after the entire series of studies had been completed. However, for whatever reason, some of the people that Gina Perry tracked down said they had never received such a report. They had gone most of their lives not knowing the truth.

Worse than this, contrary to what Milgram claimed, it is clear that some subjects were not happy about the nature of his research, either at the time (the usual experimenter, John Williams, appears to have been assaulted on more than one occasion) or later on. Some appear to have been adversely affected by their participation. In some cases, Milgram did manage to mollify people by taking them into his confidence. He then cited them as evidence that subjects were happy to endorse his studies. Some of Milgram’s subjects were Jewish, an ironic fact given Milgram’s linkage of his research to the Holocaust (Milgram himself was Jewish, but this was not something he disclosed in his earlier writings).

It also turns out that the clean bill of health given to Milgram’s research by the Yale psychiatrist Paul Errera was not quite what it seemed. In fact, Errera’s interviews with some of Milgram’s subjects had taken place at the insistence of Yale University after complaints had been made. Only a small proportion of subjects were contacted and an even smaller number agreed to be interviewed, yet in his book Milgram referred to these – against Errera’s wishes – as the “worst cases”, who had nonetheless endorsed his work. Milgram actually watched the interviews from behind a one-way mirror and, in some instances, revealed himself to the subjects and interacted with them. Perry suggests that Errera’s endorsement of Milgram’s work may have been influenced by a reluctance to derail the career of a young psychologist who clearly had so much riding on his controversial research. In any case, the presence of Milgram at the interviews was hardly ideal.

Milgram moved to Harvard University in July 1963. Perhaps mindful of the controversy surrounding his work, he avoided personal contact with subjects in his research there. In 1967, having been denied tenure at Harvard, he left for a job at the City University of New York. Perry notes that, with both staff and students, Milgram could alternate between graciousness and rudeness. She wonders whether his mood swings might have been influenced by his drug use. This does not feature prominently in the book, but Milgram had been using drugs since his student days, including marijuana, cocaine and methamphetamine. When writing Obedience to Authority he used drugs to help overcome his writer’s block, and occasionally kept notes on the influence of his intake on the creative process.

Did his research ultimately tell us much at all? It seems unlikely that it really sheds light on the Holocaust, an event involving the actions of people working in groups and in the grip of a specific ideology. By contrast, Milgram’s subjects were acting as individuals in a highly ambiguous context. On the one hand they believed they were being instructed by a scientist, a highly trusted figure whom they would have been reluctant to let down. On the other hand, the setup didn’t make sense. Why was it necessary for a member of the public to play the role of the teacher in the experiment? Why didn’t the experimenter do this for himself? Also, some of Milgram’s own subjects were aware that punishment is not an effective method for making people learn, something that was well-established by the time that he ran his studies. One of Milgram’s research assistants, Taketo Murata, conducted an analysis that showed that the subjects who delivered the maximum shock were more often the ones who expressed disbelief in the veracity of the setup. Whilst Milgram argued that their responses after the study couldn’t be trusted, he was nonetheless happy to use these when it suited him.

Gina Perry shows that in private Milgram often shared many of the doubts that critics voiced about his work, including their ethical concerns. Publicly, though, he strongly defended his work, and more so with the passage of time. He wanted to be seen among the greats of social psychology, including his own mentor Solomon Asch, whose work on conformity was an obvious precursor to Milgram’s work. It seems, though, that Asch eventually stopped responding to Milgram’s letters, presumably increasingly uncomfortable with the ethical issues surrounding the shock experiments. Another famous psychologist, Lawrence Kohlberg, had watched some of the experimental trials with Milgram behind the one-way mirror. Yet he subsequently regretted his own passivity in the face of unethical research. In a letter to the New York Times he described Milgram as “another victim, another banal perpetrator of evil”.

What about Milgram’s paid actors, Williams and McDonough? Were they also culpable in perpetrating evil? Perry is sympathetic to these men. Like the subjects, they had been duped: they needed the money and had responded to an advertisement for assistants in a study of learning and memory. Possibly, as the trials proceeded, they themselves became desensitised to what was happening. In any case, they received two pay rises from Milgram in recognition of the efforts they were making on his behalf. Another actor, Bob Tracy, took part in some trials but quit after an army buddy arrived at the lab and he couldn’t go through with the deception. But what kind of pressure were Williams and McDonough under? We know that Williams was assaulted more than once in the lab. And both men were dead of heart attacks within five years of the research ending. This is ironic, as many of the experiments featured the learner stating at the outset that he had a heart problem. There is also evidence that McDonough did experience a heart ‘flutter’ during one of the trials. Did Milgram know about his heart problem, and did he deliberately incorporate it into the experimental scenario?

In conclusion, it is undeniably true that human beings, under certain circumstances, can do terrible things. But Gina Perry has done us a great service by showing that the behaviour of authority figures does not automatically turn us into unthinking automata who will commit atrocities. Through an exemplary piece of detective work she has shown that the people who served as Milgram’s subjects were, by turns, concerned, questioning, rebellious and even disbelieving. Some, though, were affected by the experiments for years afterwards. After all, if you had been pressured into delivering very painful shocks, possibly even a lethal shock, in the name of science, only to be told that you were the person being studied, and possibly never being told that no real shocks were delivered, how would you feel about yourself later on?

Note: Gina Perry is also the author of a new book ‘The Lost Boys’, which I hope to write about in due course.