A big Noise (review)

Noise: A flaw in human judgment

by Daniel Kahneman, Olivier Sibony and Cass R. Sunstein.

Published by William Collins, 2021.

By now, many people are familiar with Nobel Prize winner Daniel Kahneman’s previous book Thinking, Fast and Slow, in which he popularised the idea that rapid unconscious thought processes underlie many of our judgments and decisions. It is this manner of thought that we equate with intuition. Kahneman showed us how intuitive thinking can give rise to a range of systematic errors, referred to as biases. In this new book, he has teamed up with Olivier Sibony and Cass Sunstein to talk about another source of error in judgment, referred to as noise.

The authors state that error in judgment arises from a combination of bias and noise. Noise in judgment is defined as unwanted variability, and we are told that this is a more pervasive problem than bias. The book describes various studies of noise that researchers have conducted over several decades, including the notable contribution of Marvin Frankel, a US judge who was outraged by the variability of criminal sentencing in the American legal system. The authors contend, however, that the topic of noise has tended to be overshadowed by the topic of bias. Specifically:

[I]n public conversations about human error and in organizations all over the world, noise is rarely recognized. Noise is a bit player, usually offstage. The topic of bias has been discussed in thousands of scientific articles and dozens of popular books, few of which even mention the issue of noise. This book is our attempt to redress the balance (p.6).
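The idea that error combines bias and noise can be made concrete with a small sketch. It assumes a setting where the true value is known, treats bias as the average error and noise as the standard deviation of the errors, and checks the standard identity that mean squared error equals bias squared plus noise squared. The function name and the forecast numbers are hypothetical and are not taken from the book.

```python
from statistics import fmean, pstdev

def error_breakdown(judgments, true_value):
    """Split the error of a set of judgments into bias and noise.

    bias  = average error (systematic deviation from the true value)
    noise = standard deviation of the errors (unwanted variability)
    mse   = mean squared error, which equals bias**2 + noise**2
    """
    errors = [j - true_value for j in judgments]
    bias = fmean(errors)
    noise = pstdev(errors)  # population SD, so the identity holds exactly
    mse = fmean([e * e for e in errors])
    return bias, noise, mse

# Hypothetical forecasts of a quantity whose true value turned out to be 10.
bias, noise, mse = error_breakdown([12, 8, 10, 14], true_value=10)
print(f"bias={bias:.2f}  noise={noise:.2f}  mse={mse:.2f}")
```

Because the two terms are squared and added, reducing noise by a given amount lowers overall error just as much as reducing bias by the same amount.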

Noise can be observed and measured even where a ‘right’ answer may not exist or cannot be verified. There is no objective standard for assessing whether a movie is ‘good’, for example, but because most professional critics give a numerical rating we can see the extent to which they agree or disagree with each other. There may be little consequence to variability in the judgments of film critics, but in many domains we would hope for high levels of consistency. For example, if any of us were to find ourselves the defendant in a court case, we would rightfully expect that the fairness of the outcome should not depend upon which judge happens to be hearing the case that day. Regrettably, the evidence reported by the authors indicates that noise pervades the legal system and many other areas of life. They note that noisy judgments have been observed in medicine, child custody decisions, professional forecasting, asylum decisions, personnel decisions, bail decisions, forensic science, and patent decisions. 

In Chapter 6, the authors describe different types of noise. The example of a legal defendant obtaining a different outcome depending on which judge handles the case is an illustration of system noise. Observations of courtroom sentencing have always suggested that judges vary in the way they treat similar cases, a conclusion which is supported by controlled research. A study published in 1981 presented 208 US Federal judges with the same 16 cases and asked them to set a sentence for each. Sure enough, wide variation was observed in sentencing. There is of course no way of knowing what the ‘right’ sentence is, and while it is tempting to suggest that the average sentence for a case represents the ‘right’ sentence, the average may also reflect the existence of bias (e.g. racial discrimination in sentencing).

System noise is itself the product of two other distinct forms of noise. One of these is level noise. In the case of courtroom judges this would represent the tendency of some judges to be more severe than others. The other contribution to system noise comes from pattern noise. This occurs when a judge treats certain types of case more severely than other types (a judge × case interaction). As the authors put it:

One judge, for instance, may be harsher than average in general but relatively more lenient toward white-collar criminals. Another may be inclined to punish lightly but more severely when the offender is a recidivist.
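The decomposition of system noise into level noise and pattern noise can be sketched as follows. This is a minimal illustration, assuming a complete table in which every judge sentences every case exactly once; the function name and the sentencing figures are made up for the example. With only one judgment per judge per case, the pattern term here also absorbs any within-person variability.

```python
from math import sqrt
from statistics import fmean, pvariance

def noise_audit(ratings):
    """Decompose system noise for a complete judges x cases table.

    ratings[j][c] is judge j's judgment of case c (hypothetical data).
    Returns standard deviations; the squares satisfy
    system**2 == level**2 + pattern**2.
    """
    n_judges, n_cases = len(ratings), len(ratings[0])
    judge_means = [fmean(row) for row in ratings]
    case_means = [fmean([row[c] for row in ratings]) for c in range(n_cases)]
    grand_mean = fmean(judge_means)

    # System noise: spread across judges within a case, averaged over cases.
    system_var = fmean(
        [pvariance([row[c] for row in ratings]) for c in range(n_cases)]
    )
    # Level noise: spread of the judges' overall averages (severity).
    level_var = pvariance(judge_means)
    # Pattern noise: judge x case interaction left after removing both means.
    residuals = [
        ratings[j][c] - judge_means[j] - case_means[c] + grand_mean
        for j in range(n_judges)
        for c in range(n_cases)
    ]
    pattern_var = fmean([r * r for r in residuals])
    return sqrt(system_var), sqrt(level_var), sqrt(pattern_var)

# Three hypothetical judges sentencing the same three cases (years).
sentences = [[5, 7, 3], [8, 9, 7], [2, 6, 1]]
system, level, pattern = noise_audit(sentences)
print(f"system={system:.2f}  level={level:.2f}  pattern={pattern:.2f}")
```

No 'right' sentence is needed for any of these quantities, which is why noise can be measured even where accuracy cannot.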

Another type of noise arises when the same individual makes different judgments about the same information when it is encountered at different times. Such within-person variability is referred to as occasion noise. Logically, if a person is operating as part of a system, then occasion noise must also contribute to system noise, but this is rather difficult to tease apart. Occasion noise has been widely studied and arises from numerous factors, such as variation in mood, stress, fatigue, and even changes in the weather. Contextual information can also have an impact: in the US, a judge who has just granted asylum to the previous two applicants is less likely to grant it to the next applicant.

The authors propose a range of remedies for the problem of noisy judgments, which they class under the umbrella heading of decision hygiene. Any organisation concerned about the issue of noise in judgment, they suggest, should conduct a noise audit in order to determine the extent to which they are affected (an appendix provides guidance for how to go about this). The first principle of decision hygiene is that “The goal of judgment is accuracy, not individual expression”. Statistical models have been found to generally outperform human judges on repeated decisions, including models that were created from analyses of human judges. This has been known for a long time, though the advent of machine learning has given even greater scope for the application of such models. The great advantage of statistical models is that they are free from occasion noise, although there is a danger that models based on human judgment will incorporate societal biases (e.g. racial discrimination). There is some discussion about the problem of bias in AI systems, though the authors seem largely unconcerned. This was a real issue for me. I found their rather casual dismissal of bias to be hand-wavy and unconvincing.

However, acknowledging the fact that people often resist challenges to their autonomy, the authors suggest that some situations – such as job interviews – may best benefit from being structured: the options are rated, and those ratings then serve as the input to a discussion among decision makers rather than as the input to an algorithm.

A second principle is that judges should “think statistically and take the outside view of the case”. Thinking about how a current situation might be similar to other situations that have been encountered can help root thinking in statistical reality, and avoid excessive optimism.

Thirdly, judges should “structure judgments into several independent tasks”. This has always been a basic principle of decision analysis. People’s limited cognitive capacities are better able to manage a series of smaller problems than one big, complex problem. Kahneman et al. describe a specific procedure for organisational decision making, which they call the mediating assessments protocol.

A fourth principle is to “avoid premature intuitions”. In Chapter 20 the authors provide an alarming description of how the forensic analysis of fingerprints can be biased in the US legal system. Whenever a laboratory links the partial fingerprint from a crime scene to the complete fingerprint of a suspect, a second laboratory is asked to carry out the same analysis. Unfortunately, the second laboratory knows that it is only being asked to do the analysis because another laboratory made an identification, hence they are potentially biased at the outset.

The book finishes with a comparison of rules and standards as a means for regulating behaviour. Rules have clear-cut criteria (“Do not exceed the speed limit”), though as noted earlier they can also be biased. Standards, on the other hand, allow for the exercise of discretion (“Drive carefully”). Standards are often adopted because it can be difficult to get people to agree on the precise criteria for rules. However, the more open-ended the language used in a standard is, the more judgment is needed and the more likely noise is to creep in. The authors give the example of Facebook’s Community Standards, which are meant to determine what is and isn’t acceptable online content. When first introduced, precisely because there were thousands of Facebook reviewers working according to standards, they ended up making decisions that were highly variable. To address this problem, Facebook created a non-public document for its reviewers called the Implementation Standards, which – for example – included graphic images to depict what it meant by “glorifying violence”. In so doing, they basically created a set of rules to underpin their public standards.

There appears to be no clear-cut way to determine whether a rule or a standard should be used, and the authors suggest that as a first approximation any organisation needs to consider the costs of decisions and the costs of errors. Creating a rule can be difficult, but applying a rule in a decision situation is relatively easy. Conversely, creating a standard is easier, but where a person has to make many decisions the need to be continually exercising judgment can be quite a burden. The authors suggest that the costs of errors depend on “whether agents are knowledgeable and reliable, and whether they practice decision hygiene”. Where agents can be trusted in this regard, then a standard might work well, but otherwise a rule may be more appropriate.

That is the book in summary, then. With three co-authors you might wonder how stylistically consistent the book would be, but I found it to be remarkably consistent, with no obvious clue as to who did what. However, there is also quite a bit of repetition and a more rigorous editing process could have cut down the length substantially. Overall, though, I found the book to be quite engaging, much more so than Thinking, Fast and Slow, which I found rather hard going (I didn’t manage to finish that book, although I was familiar with most of the content anyway).

There has been some academic sniping over Noise, though I don’t think it’s very interesting for a review to begin reviewing the other reviewers (one highly critical review with links to other critical reviews can be found here). Some of the criticism, in my view, is overstated and there is a sense of people trying to cut down one of the “tall poppies” in the field. Nonetheless, one of the reasons that Kahneman, in particular, has become something of a target is that a number of weaknesses have been identified in his previous book, Thinking, Fast and Slow. Kahneman was perhaps unfortunate to have published his best-seller in the same year in which one well-known psychologist was revealed to have fabricated data in many studies, and in which one of the most controversial papers in psychology appeared, a paper which has prompted a great deal of soul-searching within the discipline. It transpires that for a long time, a range of questionable research practices (QRPs) have been used in psychology (and, to be fair, in other disciplines, though not to the same degree). As a result of this introspection, it turns out that Kahneman’s book contains many studies which have either failed to be replicated by other researchers or which are severely “underpowered” (too few participants), meaning that there is a good chance they would not be replicated. The implicit priming studies featured in Chapter 4 of Thinking, Fast and Slow are particularly problematic, and a critique can be read here. A broader critique can be found here.
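To give a feel for what “underpowered” means, here is a rough sketch of a power calculation. It assumes a simple two-group comparison with equal group sizes and uses a normal approximation to the two-sided test; the function name and the example numbers are illustrative and are not drawn from any of the studies discussed.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Normal approximation: ignores the small-sample t correction and the
    negligible chance of rejecting in the wrong direction.
    effect_size_d is the standardised mean difference (Cohen's d).
    """
    z = NormalDist()  # standard normal
    z_crit = z.inv_cdf(1 - alpha / 2)
    expected_z = effect_size_d * sqrt(n_per_group / 2)
    return z.cdf(expected_z - z_crit)

# A medium-sized effect (d = 0.5) with 20 versus 64 participants per group.
for n in (20, 64):
    print(f"n per group = {n}: power is roughly {approx_power(0.5, n):.2f}")
```

Under these assumptions, a study with 20 participants per group would detect a real medium-sized effect only about a third of the time, so a failure to replicate would be unsurprising even if the effect were genuine.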

Kahneman has not (yet) revised Thinking, Fast and Slow to address the problems identified, and the millions of non-psychologist readers are unlikely to be aware that there are any problems. Those who are aware of the problems identified in psychological research will justifiably wonder about the validity of the studies reported in Noise. I have no doubt that noise exists, but to what extent, and are the psychological explanations correct? One widely cited study reported in Noise found that the parole decisions of experienced US judges became increasingly unfavourable the further they got into a session, with about 65% of decisions being favourable at the start of a session and none at the end of a session. Immediately following a break for food, favourable decisions predominated once more before going into a gradual decline again. Whereas most psychological effects are no more than modest in size, this one was substantial. Not reported by Kahneman and colleagues is the fact that this finding has been the subject of some contention.

One response suggested that the results could be explained by the non-random ordering of cases: prisoners without legal representation tend to have their cases heard later in the session. The original researchers argued, however, that including representation in their analysis did not change the results. It has also been claimed that the “hungry judge” effect arises from the sensible planning of rational judges: judges tend to end a session when they foresee that the next case is likely to take a long time, and longer cases are more likely to result in favourable outcomes. If correct, then this account would suggest that the case for noise in this instance has been overstated and the supposed reason is false. Finally, the wider concept of “ego-depletion”, upon which the original hungry judge finding rests, has itself been called into question.

In conclusion, Noise is somewhat overlong and repetitive, but I think the breakdown of different types of noise is very interesting. There are some potentially useful suggestions for minimising noise, though the authors gloss over concerns about bias in AI-driven decisions. Also, the idea of a noise audit for organisations sounds quite bureaucratic (though potentially a money-spinner for consultants), so presumably ought to be considered only by organisations where noise is a major concern. A healthy scepticism about the psychological research is advised.

[Note: I made a slight edit for clarity to the “hungry judge” section – 12.30pm, 9th August 2021]