A big Noise (review)

Noise: A flaw in human judgment

by Daniel Kahneman, Olivier Sibony and Cass R. Sunstein.

Published by William Collins, 2021.

By now, many people are familiar with Nobel prize winner Daniel Kahneman’s previous book Thinking Fast and Slow, in which he popularised the idea that rapid unconscious thought processes underlie many of our judgments and decisions. It is this manner of thought that we equate with intuition. Kahneman showed us how intuitive thinking can give rise to a range of systematic errors, referred to as biases. In this new book, he has teamed up with Olivier Sibony and Cass Sunstein to talk about another source of error in judgment, referred to as noise.

The authors state that error in judgment arises from a combination of bias and noise. Noise in judgment is defined as unwanted variability, and we are told that this is a more pervasive problem than bias. The book describes various studies of noise that researchers have conducted over several decades, including the notable contribution of Marvin Frankel, a US judge who was outraged by the variability of criminal sentencing in the American legal system. The authors contend, however, that the topic of noise has tended to be overshadowed by the topic of bias. Specifically:

[I]n public conversations about human error and in organizations all over the world, noise is rarely recognized. Noise is a bit player, usually offstage. The topic of bias has been discussed in thousands of scientific articles and dozens of popular books, few of which even mention the issue of noise. This book is our attempt to redress the balance (p.6).

Noise can be observed and measured even where a ‘right’ answer may not exist or cannot be verified. There is no objective standard for assessing whether a movie is ‘good’, for example, but because most professional critics give a numerical rating we can see the extent to which they agree or disagree with each other. There may be little consequence to variability in the judgments of film critics, but in many domains we would hope for high levels of consistency. For example, if any of us were to find ourselves the defendant in a court case, we would rightfully expect that the fairness of the outcome should not depend upon which judge happens to be hearing the case that day. Regrettably, the evidence reported by the authors indicates that noise pervades the legal system and many other areas of life. They note that noisy judgments have been observed in medicine, child custody decisions, professional forecasting, asylum decisions, personnel decisions, bail decisions, forensic science, and patent decisions. 

In Chapter 6, the authors describe different types of noise. The example of a legal defendant obtaining a different outcome depending on which judge handles the case is an illustration of system noise. Observations of courtroom sentencing have long suggested that judges vary in the way they treat similar cases, a conclusion which is supported by controlled research. A study published in 1981 presented 208 US Federal judges with the same 16 cases and asked them to set a sentence for each. Sure enough, wide variation was observed in sentencing. There is of course no way of knowing what the ‘right’ sentence is, and while it is tempting to suggest that the average sentence for a case represents the ‘right’ sentence, the average may also reflect the existence of bias (e.g. racial discrimination in sentencing).

System noise is itself made up of two other distinct forms of noise. One of these is level noise. In the case of courtroom judges this would represent the tendency of some judges to be more severe than others. The other contribution to system noise comes from pattern noise. This occurs when a judge treats certain types of case more severely than other types (a judge × case interaction). As the authors put it:

One judge, for instance, may be harsher than average in general but relatively more lenient toward white-collar criminals. Another may be inclined to punish lightly but more severely when the offender is a recidivist.

Another type of noise arises when the same individual makes different judgments about the same information when it is encountered at different times. Such within-person variability is referred to as occasion noise. Logically, if a person is operating as part of a system, then occasion noise must also contribute to system noise, but this is rather difficult to tease apart. Occasion noise has been widely studied and arises from numerous factors, such as variation in mood, stress, fatigue, and even changes in the weather. Contextual information can also have an impact: in the US, a judge who has just granted asylum to the previous two applicants is less likely to grant it to the next applicant.
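The decomposition is additive when expressed in variances: under the simple assumption that the components are independent, system noise squared equals level noise squared plus pattern noise squared, with occasion noise folded into the pattern component. A minimal simulation makes the arithmetic concrete; the sketch below is mine rather than the book’s, and the standard deviations are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_judges, n_cases = 1000, 50

# Invented standard deviations for each noise component (illustrative only).
sd_level, sd_pattern, sd_occasion = 1.0, 0.7, 0.5
baseline = 5.0  # an arbitrary "average" sentence

level = rng.normal(0, sd_level, (n_judges, 1))              # judge severity
pattern = rng.normal(0, sd_pattern, (n_judges, n_cases))    # judge-by-case interaction
occasion = rng.normal(0, sd_occasion, (n_judges, n_cases))  # day-to-day variability

judgments = baseline + level + pattern + occasion

# System noise: disagreement between judges about the same case,
# measured as the variance across judges within a case, averaged over cases.
system_noise_var = judgments.var(axis=0).mean()

# Level noise: variance of each judge's average severity across all cases.
level_noise_var = judgments.mean(axis=1).var()

print(f"system noise variance:          {system_noise_var:.2f}")  # ~1.74
print(f"level noise variance:           {level_noise_var:.2f}")   # ~1.00
print(f"pattern + occasion (remainder): {system_noise_var - level_noise_var:.2f}")
```

The system-noise variance comes out as roughly the sum of the level, pattern, and occasion variances, which is the additive relationship the book describes.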

The authors propose a range of remedies for the problem of noisy judgments, which they class under the umbrella heading of decision hygiene. Any organisation concerned about the issue of noise in judgment, they suggest, should conduct a noise audit in order to determine the extent to which it is affected (an appendix provides guidance on how to go about this). The first principle of decision hygiene is that “The goal of judgment is accuracy, not individual expression”. Statistical models have been found to generally outperform human judges on repeated decisions, including models that were created from analyses of human judges. This has been known for a long time, though the advent of machine learning has given even greater scope for the application of such models. The great advantage of statistical models is that they are free from occasion noise, although there is a danger that models based on human judgment will incorporate societal biases (e.g. racial discrimination). There is some discussion of the problem of bias in AI systems, though the authors seem largely unconcerned. This was a real issue for me: I found their rather casual dismissal of the problem hand-wavy and unconvincing.
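The claim that a model of a judge can outperform the judge it was derived from is easy to illustrate with a small simulation. The sketch below is mine, not the authors’: the cues, weights, and noise levels are all invented, and the point is simply that a linear model fitted to a judge’s own ratings, being free of occasion noise, tracks the outcome better than the judge does.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cases, n_cues = 500, 4

# Hypothetical case cues and a "true" outcome they partly predict.
cues = rng.normal(size=(n_cases, n_cues))
true_weights = np.array([0.8, 0.5, 0.3, 0.1])
outcome = cues @ true_weights + rng.normal(0, 1.0, n_cases)

# A simulated judge weights the cues sensibly but inconsistently:
# occasion noise is added to every judgment.
judge_ratings = cues @ true_weights + rng.normal(0, 1.5, n_cases)

# "Model of the judge": an ordinary least-squares fit of the judge's own ratings.
design = np.c_[np.ones(n_cases), cues]
coef, *_ = np.linalg.lstsq(design, judge_ratings, rcond=None)
model_ratings = design @ coef

print("judge vs outcome:", round(float(np.corrcoef(judge_ratings, outcome)[0, 1]), 2))
print("model vs outcome:", round(float(np.corrcoef(model_ratings, outcome)[0, 1]), 2))
```

Because the fitted model applies the judge’s implicit weighting policy with perfect consistency, it typically correlates more strongly with the outcome than the noisy ratings it was estimated from.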

However, acknowledging that people often resist challenges to their autonomy, the authors suggest that some situations – such as job interviews – may benefit most from being structured, with options rated and the ratings then used as input to a discussion among decision makers rather than being fed straight into an algorithm.

A second principle is that judges should “think statistically and take the outside view of the case”. Thinking about how a current situation might be similar to other situations that have been encountered can help root thinking in statistical reality, and avoid excessive optimism.

Thirdly, judges should “structure judgments into several independent tasks”. This has always been a basic principle of decision analysis. People’s limited cognitive capacities are better able to manage a series of smaller problems than one big, complex problem. Kahneman et al. describe a specific procedure for organisational decision making, which they call the mediating assessments protocol.

A fourth principle is to “avoid premature intuitions”. In Chapter 20 the authors provide an alarming description of how the forensic analysis of fingerprints can be biased in the US legal system. Whenever a laboratory links the partial fingerprint from a crime scene to the complete fingerprint of a suspect, a second laboratory is asked to carry out the same analysis. Unfortunately, the second laboratory knows that it is only being asked to do the analysis because another laboratory made an identification, hence they are potentially biased at the outset.

The book finishes with a comparison of rules and standards as a means for regulating behaviour. Rules have clear-cut criteria (“Do not exceed the speed limit”), though as noted earlier they can also be biased. Standards, on the other hand, allow for the exercise of discretion (“Drive carefully”). Standards are often adopted because it can be difficult to get people to agree on the precise criteria for rules. However, the more open-ended the language used in a standard is, the more judgment is needed and the more that noise is likely to creep in. The authors give the example of Facebook’s Community Standards, which are meant to determine what is and isn’t acceptable online content. When these were first introduced, the thousands of Facebook reviewers working to them ended up making highly variable decisions, precisely because the standards left so much to individual judgment. To address this problem, Facebook created a non-public document for its reviewers called the Implementation Standards, which – for example – included graphic images to depict what it meant by “glorifying violence”. In so doing, Facebook basically created a set of rules to underpin its public standards.

There appears to be no clear-cut way to determine whether a rule or a standard should be used, and the authors suggest that at a first approximation any organisation needs to consider the costs of decisions and the costs of errors. Creating a rule can be difficult, but applying a rule in a decision situation is relatively easy. Conversely, creating a standard is easier, but where a person has to make many decisions the need to be continually exercising judgment can be quite a burden. The authors suggest that the costs of errors depend on “whether agents are knowledgeable and reliable, and whether they practice decision hygiene”. Where agents can be trusted in this regard, then a standard might work well, but otherwise a rule may be more appropriate.

That is the book in summary, then. With three co-authors you might wonder how stylistically consistent the book would be, but I found it remarkably consistent, with no obvious clue as to who wrote what. However, there is also quite a bit of repetition, and a more rigorous editing process could have cut the length substantially. Overall, though, I found the book quite engaging, much more so than Thinking Fast and Slow, which I found rather hard-going (I didn’t manage to finish that book, although I was familiar with most of the content anyway).

There has been some academic sniping over Noise, though I don’t think it’s very interesting for a review to begin reviewing the other reviewers (one highly critical review with links to other critical reviews can be found here). Some of the criticism, in my view, is overstated and there is a sense of people trying to cut down one of the “tall poppies” in the field. Nonetheless, one of the reasons that Kahneman, in particular, has become something of a target is that a number of weaknesses have been identified in his previous book, Thinking Fast and Slow. Kahneman was perhaps unfortunate to have published his best-seller in the same year in which one well-known psychologist was revealed to have fabricated data in many studies, and in which one of the most controversial papers in psychology appeared, a paper which has prompted a great deal of soul-searching within the discipline. It transpires that for a long time a range of questionable research practices (QRPs) have been used in psychology (and, to be fair, in other disciplines, though not to the same degree). This soul-searching has revealed that Kahneman’s book features many studies which have either failed to be replicated by other researchers or are severely “underpowered” (too few participants), meaning there is a good chance they would not replicate. The implicit priming studies featured in Chapter 4 of Thinking Fast and Slow are particularly problematic, and a critique can be read here. A broader critique can be found here.

Kahneman has not (yet) revised Thinking Fast and Slow to address the problems identified, and the millions of non-psychologist readers are unlikely to be aware that there are any problems. Those who are aware of the problems identified in psychological research will justifiably wonder about the validity of the studies reported in Noise. I have no doubt that noise exists, but to what extent, and are the psychological explanations correct? One widely-cited study reported in Noise found that the parole decisions of experienced US judges became increasingly unfavourable the further they got into a session, with about 65% of decisions being favourable at the start of a session and none at the end. Immediately following a break for food, favourable decisions predominated once more before going into a gradual decline again. Whereas most psychological effects are no more than modest in size, this one was substantial. Not reported by Kahneman and colleagues is the fact that this finding has been the subject of some contention.

One response suggested that the results could be explained by the non-random ordering of cases (prisoners without legal representation tend to have their cases heard later in the session), although the original researchers argued that including representation in their analysis did not change the results. It has also been claimed that the “hungry judge” effect arises from the sensible planning of rational judges: judges tend to end a session when they foresee that the next case is likely to take a long time, and longer cases are more likely to result in favourable outcomes. If correct, this account would suggest that the case for noise in this instance has been overstated and the supposed explanation is false. Finally, the wider concept of “ego-depletion”, upon which the original hungry-judge finding rests, has itself been called into question.

In conclusion, Noise is somewhat overlong and repetitive, but I think the breakdown of different types of noise is very interesting. There are some potentially useful suggestions for minimising noise, though the authors gloss over concerns about bias in AI-driven decisions. Also, the idea of a noise audit for organisations sounds quite bureaucratic (though potentially a money-spinner for consultants), so it presumably ought to be considered only by organisations where noise is a major concern. A healthy scepticism about the underlying psychological research is advised.

[Note: I made a slight edit for clarity to the “hungry judge” section – 12.30pm, 9th August 2021]

Double book review: Margaret Boden and Gary Smith on Artificial Intelligence

AI – Its nature and future, by Margaret A. Boden. Oxford University Press. 2016.

The AI Delusion, by Gary Smith. Oxford University Press. 2018.

AI, machine learning, algorithms, robots, automation, chatbots, sexbots, androids – in recent years all these terms have regularly appeared in the media, either to tell us about the latest achievements in technology and exciting future possibilities, or to warn us about threats to our jobs and freedoms.

Two recent books, from Margaret Boden and Gary Smith respectively, are useful guides for the perplexed. Each is clearly written and highly readable. Margaret Boden, Research Professor of Cognitive Science at the University of Sussex, begins with a basic definition:

Artificial intelligence (AI) seeks to make computers do the sorts of things that minds can do.

People who work in AI tend to fall into one of two camps (though occasionally both). They either take a technological approach, whereby they attempt to create systems that can perform certain tasks, regardless of how they do it; or a scientific approach, whereby they are interested in answering questions about human beings or other living things.


Boden’s book is essentially a potted history of the field, guiding the reader through the different approaches and philosophical arguments. Alan Turing, of Bletchley Park fame, seems to have envisaged all the current developments in the field, though during his lifetime the technology wasn’t available to implement these ideas. The first approach to hit the big time is now known as ‘Good Old-Fashioned AI (GOFAI)’. This assumes that intelligence arises from physical entities that can process symbols in the right kind of way, whether these entities are living organisms, arrangements of tin cans, silicon chips or whatever else. The other approaches are not reliant on sequential symbol processing. These are: 1. Artificial Neural Networks (ANNs), or connectionism, 2. Evolutionary programming, 3. Cellular automata (CA), and 4. Dynamical systems. Some researchers argue in favour of hybrid systems that combine elements of symbolic and non-symbolic processing.

For much of the 1950s, researchers of different theoretical persuasions all attended the same conferences and exchanged ideas, but in the late ’50s and 1960s a schism developed. In 1956 John McCarthy coined the term ‘Artificial Intelligence’ to refer to the symbol processing approach. This was seized upon by journalists, particularly as this approach began to have successes with the Logic Theory Machine (Newell & Simon) and General Problem Solver (Newell, Shaw, and Simon). By contrast, Frank Rosenblatt’s connectionist Perceptron model was found to have serious limitations and was contemptuously dismissed by many symbolists. Professional jealousies were aroused and communication between the symbolists and the others broke down. Worse, funding for the connectionist approach largely dried up.

Work within the symbol processing, or ‘classical’, approach has taught us some important lessons. These include the need to make problems tractable by directing attention to only part of the ‘search space’, by making simplifying assumptions and by ordering the search efficiently. However, the symbolic approaches also faced the problem of ‘combinatorial explosion’: unconstrained logical inference generates vast numbers of conclusions that are true but irrelevant. Likewise, in classical – or ‘monotonic’ – logic, once something is proved to be true it stays true, but in everyday life that is often not the case. Boden writes:

AI has taught us that human minds are hugely richer, and more subtle, than psychologists previously imagined. Indeed, that is the main lesson to be learned from AI.

Throughout the lean years for connectionist AI a number of researchers had plugged away regardless, and in the late 1980s there was a sudden explosion of research under the name of ‘Parallel Distributed Processing’ (PDP). These models consist of many interconnected units, each one capable of computing only one thing. There are multiple layers of units, including an input layer, an output layer, and a ‘hidden layer’ or layers in between. Some connections feed forward, others backwards, and others connect laterally. Concepts are represented within the state of the entire network rather than within individual units.

PDP models have had a number of successes, including their ability to deal with messy input. Perhaps the most notable finding occurred when a network produced over-generalisation in past-tense learning (e.g. saying ‘go-ed’ rather than ‘went’), indicating – contrary to Chomsky – that this aspect of language learning may not depend on an inborn linguistic rule. Consequently, the research funding tap was turned back on, especially from the US Department of Defense. Nonetheless, PDP models have their own weaknesses too, such as not being able to represent precision as well as classical models:

Q: What’s 2 + 2?

A: Very probably 4.

Learning within ANNs usually involves changing the strength (the ‘weights’) of the links between units, as expressed in the saying “fire together, wire together”. It involves the application of ‘backprop’ (backwards propagation) algorithms, which trace responsibility for performance back from the output layer into the hidden layers, identifying the weights that need to be adjusted, and thence to the input layer. The algorithm needs to know the precise state of the output layer when the network is giving the correct answer.
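A toy example may help readers picture what the hidden layer and the backward pass actually involve. The sketch below is mine rather than Boden’s, with arbitrary choices of layer size and learning rate; it trains a tiny network on XOR, the task a single-layer perceptron famously cannot learn.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy task: XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 4 units; the weights are the connection strengths.
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 0.5
for _ in range(10_000):
    # Forward pass: input layer -> hidden layer -> output layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass ('backprop'): propagate the error from the output layer
    # back through the hidden layer, assigning responsibility to each weight.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Adjust the weights in proportion to their responsibility for the error.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))  # should end up close to [[0], [1], [1], [0]]
```

The backward pass is the ‘tracing of responsibility’ described above: the output error is pushed back through the hidden layer to determine how much each connection should change.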

Although PDP propaganda plays up the similarity between network models and the brain’s neuronal connections, in fact there is no backwards propagation in the brain. Synapses feed forwards only. Also, brains aren’t strict hierarchies. Boden also notes (p.91):

a single neuron is as computationally complex as an entire PDP system, or even a small computer.

Subsequent to the 1980s PDP work it has been discovered that connections aren’t everything:

Biological circuits can sometimes alter their computational function (not merely make it more or less probable), due to chemicals diffusing through the brain.

One example of such a chemical is nitric oxide. Researchers have now developed new types of ANNs, including GasNets, which have been used to evolve “brains” for autonomous robots.

Boden also discusses other approaches within the umbrella of AI, including robots and artificial life (‘A-life’), and evolutionary AI. These take in concepts such as distributed cognition (minds are not within individual heads), swarm intelligence (simple rules can lead to complex behaviours), and genetic algorithms (programs are allowed to change themselves, using random variation and non-random selection).
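Genetic algorithms, at least in toy form, are simple enough to sketch in a few lines. The example below is mine rather than Boden’s: a population of bit-strings evolves towards an arbitrary target using random variation (mutation and crossover) and non-random selection (keeping the fitter half of each generation).

```python
import random

random.seed(0)
TARGET = [1] * 20              # the "fittest" genome in this toy problem
POP_SIZE, MUT_RATE = 50, 0.02

def fitness(genome):
    # Number of positions matching the target.
    return sum(g == t for g, t in zip(genome, TARGET))

def mutate(genome):
    # Random variation: flip each bit with a small probability.
    return [1 - g if random.random() < MUT_RATE else g for g in genome]

def crossover(a, b):
    # Random variation: splice two parent genomes at a random point.
    cut = random.randrange(len(a))
    return a[:cut] + b[cut:]

population = [[random.randint(0, 1) for _ in TARGET] for _ in range(POP_SIZE)]

for generation in range(200):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == len(TARGET):
        break
    # Non-random selection: the fitter half become parents of the next generation.
    parents = population[: POP_SIZE // 2]
    offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                 for _ in range(POP_SIZE - len(parents))]
    population = parents + offspring

print(f"best fitness {fitness(population[0])}/{len(TARGET)} after {generation} generations")
```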

But are any of these systems intelligent? Many AI models have been very successful within specific domains and have outperformed human experts. However, the essence of human intelligence – even though the word itself does not have a standard definition among psychologists – is that it involves the ability to perform in many different domains, including perception, language, memory, creativity, decision making, social behaviour, morality, and so on. Emotions appear to be an important part of human thought and behaviour, too. Boden notes that there have been advances in the modelling of emotion, and there are programs that have demonstrated a certain degree of creativity. There are also some programs that operate in more than one domain, but are still nowhere near matching human abilities. However, unlike some people who have warned about the ‘singularity’ – the moment when machine intelligence exceeds that of humans – Boden does not envisage this happening. Indeed, whilst she holds the view that, in principle, truly intelligent behaviour could arise in non-biological systems, in practice this might not be the case.

Likewise, the title of Gary Smith’s book is not intended to decry all research within the field of AI. He also agrees that many achievements have occurred and will continue to do so. However, the ‘delusion’ of the title occurs when people assign to computers an ability that they do not in fact possess. Excessive trust can be dangerous. For Smith:

True intelligence is the ability to recognize and assess the essence of a situation.

This is precisely what he argues AI systems cannot do. He gives the example of a drawing of a box cart. Computer systems can’t identify this object, he says, whereas almost any human being could not only identify it, but suggest who might use it, what it might be used for, what the name on the side means, and so on.


Smith refers to the Winograd Schema Challenge. The Stanford computer science professor Terry Winograd has put up a $25,000 prize for anyone who can design a system that is at least 90% accurate in interpreting sentences like this one:

I can’t cut that tree down with that axe; it is too [thick/small]

Most people realise that if the bracketed word is ‘thick’ it refers to the tree, whereas if it is ‘small’ it refers to the axe. Computers are typically – ahem – stumped by this kind of sentence, because they lack the real-world experience to put words in context.

Much of Smith’s concern is about the data-driven (rather than theory-driven) way that machine learning approaches use statistics. In essence, when a machine learning program processes data it does not stop to ask ‘Where did the data come from?’ or ‘Why these data?’ These are important questions to ask and Smith takes us through various problems that can arise with data (his previous book was called Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics).

One important limitation associated with data is the ‘survivor bias’. A study of Allied warplanes returning to Britain after bombing runs over Germany found that most of the bullet and shrapnel holes were on the wings and rear of the plane, but very few on the cockpit, engines, or fuel tanks. The top brass therefore planned to attach protective metal plates to the wings and rear of their aircraft. However, the statistician Abraham Wald pointed out that the planes that returned were, by definition, the ones that had survived the bullets and shrapnel. The planes that had not returned had most likely been struck in the areas that the returning planes had been spared. These were the areas that should be reinforced.

Another problem is the one discussed in my previous blog, that of fake or bad data, arising from the perverse incentives of academia and the publishing world. The ‘publish-or-perish’ climate, together with the wish of journals to publish ‘novel’ or ‘exciting’ results, has led to an exacerbation of ‘Questionable Research Practices’ or outright fakery, with the consequence that an unfortunately high proportion of published papers contain false findings.

Smith is particularly scathing about the practice of data mining, something that for decades was regarded as a major ‘no-no’ in academia. It is particularly problematic with the advent of big data, when machine learning algorithms can scour thousands upon thousands of variables looking for patterns and relationships. However, correlations between variables will turn up even among sequences that are randomly generated. Smith shows this with randomly generated sequences of his own, and a sketch reproducing the effect follows the quotations below. He laments that

The harsh truth is that data-mining algorithms are created by mathematicians who often are more interested in mathematical theory than practical reality.

and

The fundamental problem with data mining is that it is very good at finding models that fit the data, but totally useless in gauging whether the models are ludicrous.
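That last point is easy to demonstrate for oneself. The sketch below is mine, not Smith’s, and the sample sizes are arbitrary: it generates a purely random ‘outcome’ and a thousand purely random ‘predictors’, then reports the strongest correlation a data-mining pass would find.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_vars = 50, 1000

outcome = rng.normal(size=n_obs)               # a purely random "outcome"
candidates = rng.normal(size=(n_vars, n_obs))  # 1,000 purely random "predictors"

# Correlate every candidate variable with the outcome and keep the strongest.
corrs = np.array([np.corrcoef(c, outcome)[0, 1] for c in candidates])
print(f"strongest correlation found in pure noise: {np.abs(corrs).max():.2f}")
print(f"number of candidates with |r| > 0.3: {(np.abs(corrs) > 0.3).sum()}")
```

With only 50 observations and 1,000 candidate variables, correlations of 0.4 or more routinely turn up in pure noise, exactly the kind of ‘pattern’ an unguarded data-mining exercise would report as a discovery.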

When it comes to the choice of linear or non-linear models, Smith says that expert opinion is necessary to decide which is more realistic (though one recent systematic comparison of methods, involving a training set of data and a validation set, found that the non-linear methods associated with machine learning were dominated by the traditional linear methods). Other problems arise with particular forms of regression analysis, such as stepwise regression and ridge regression. Data reduction methods, such as factor analysis or principal components analysis, can also cause problems because the transformed variables are hard to interpret and, especially if mined from thousands of variables, will contain nonsense. Smith looks at some dismal attempts to beat the stock market using data-mining techniques.

But as if the statistical absurdities weren’t bad enough, Smith’s penultimate chapter – the one that everything else has been leading up to, he says – concerns the application of these techniques to our personal affairs in ways which impinge upon our privacy. For example, software exists that examines the online behaviour of job applicants. Executives who ought to know better may draw inappropriate causal inferences from the data. One of the major examples discussed earlier in the book is Hillary Clinton’s presidential campaign. Although not widely known, her campaign made use of a powerful computer program called Ada (after Ada Lovelace, the nineteenth-century computing pioneer). This crunched masses of data about potential voters across the country, running 400,000 simulations per day. No-one knows exactly how Ada worked, but it was used to guide decisions about where to target campaigning resources. The opinions of seasoned campaigners were entirely sidelined, including those of perhaps the greatest campaigner of all – Bill Clinton (who was reportedly furious about this). We all know what happened next.