Review – Meltdown: Why Our Systems Fail and What We Can Do About It

In the opening chapter of Meltdown, the authors Chris Clearfield and András Tilcsik describe the series of events that led to a near-disaster at the Three Mile Island nuclear facility in the United States. The initiating event was relatively minor and occurred during routine maintenance, but as problems multiplied, the operators became confused. They could not see first-hand what was happening and were reliant on readouts from individual instruments, which did not show the whole story and were open to misinterpretation.

The official investigation sought to blame the plant staff, but sociology professor Charles “Chick” Perrow argued that the incident was a system problem. Perrow, the author of Normal Accidents: Living With High Risk Technologies, states that systems can be characterised along two dimensions: complexity and coupling. Complex systems have many interacting parts, and frequently the components are invisible to the operators. Tightly-coupled systems are those in which there is little or no redundancy or slack: a perturbation in one component may have multiple knock-on effects. Perrow argues that catastrophic failures tend to occur where high complexity combines with tight coupling. His analysis forms the explanatory basis for many of the calamities described in Meltdown.

Not all of these are life-threatening. Some are merely major corporate embarrassments, such as when PricewaterhouseCoopers cocked up the award for Best Picture at the 89th Academy Awards. Others nonetheless had a big impact on ordinary people, such as the problems with the UK Post Office’s Horizon software system, which led to many sub-postmasters being accused of theft, fraud and false accounting. Then there are the truly lethal events, such as the Deepwater Horizon oil rig explosion. Ironically, it is often the safety systems themselves that are the source of trouble. Perrow is quoted as saying that “safety systems are the biggest single source of catastrophic failure in complex tightly-coupled systems”.

The second half of Meltdown is devoted to describing some of the ways in which we can reduce the likelihood of things going wrong. These include Gary Klein’s idea of the premortem. When projects are being planned, people tend to focus on how things are going to work, which can lead to excessive optimism. Only after things go wrong do the inherent problems start to appear obvious (hindsight bias). Klein suggests that planners envisage a point in time after their project has been implemented, and imagine that it has been a total disaster. Their task is to write down the reasons why it has all gone so wrong. By engaging in such an exercise, planners are forced to think about things that might not otherwise have come to mind, to find ways to address potential problems, and to develop more realistic timelines.

Clearfield and Tilcsik also discuss ways to improve operators’ mental models of the systems they are using, as well as the use of confidential reporting systems for problems and near-misses.

They devote several chapters to the important topic of allowing dissenting voices to speak openly about their concerns. There is ample evidence that lack of diversity in teams, including corporate boards, has a detrimental effect on the quality of discussion. Appointing the “best people for the job” may not be such a great idea if the best people are all the same kind of people. One study found that American community banks were more likely to fail during periods of uncertainty when they had higher proportions of banking experts on their boards. It seems that these experts were overreliant on their previous experiences, were overconfident, and – most importantly – were over-respectful of each other’s opinions. Moreover, domination by banking experts made it harder for the non-bankers on the boards to raise challenges. Where boards included more non-bankers, however, the banking experts had to explain issues in greater detail and their opinions were challenged more often.

Other research shows that both gender and ethnic diversity are important, too. An experimental study of stock trading, involving simulations, found that ethnically homogeneous groups of traders tended to copy each other, including each other’s mistakes, resulting in poorer performance. Where groups were more diverse, traders were generally more skeptical in their thinking and therefore more accurate overall. Another study found that companies were less likely to have to issue financial restatements (corrections owing to error or fraud) where there was at least one woman director on the board.

Clearfield and Tilcsik argue that the potential for catastrophe is changing as technologies develop. Systems which previously were not both complex and tightly-coupled are increasingly becoming so. This can of course result in great performance benefits, but may also increase the likelihood that any accidents that do occur will be catastrophic ones.

Meltdown has deservedly received a lot of praise since its publication last year. The examples it describes are fascinating, the explanations are clear, and the proposed solutions (although not magic bullets) deserve attention. Writing in the Financial Times, Andrew Hill cited Meltdown when talking about last year’s UK railway timetable chaos, saying that “organisations must give more of a voice to their naysayers”. The World Economic Forum’s Global Risks Report 2019 carries a short piece by Tilcsik and Clearfield, titled Managing in the Age of Meltdowns.

I highly recommend this excellent book.