Things go wrong. Sometimes things go catastrophically wrong, with hundreds to tens of thousands of lives lost and millions to billions in property damage.
When reviewing events afterward, a pattern frequently appears: people made mistakes which made things worse. The most obvious mistakes happen in the moment, in the tumult of emergency calls and rushing to action. Often the more severe mistakes were made well beforehand: in setting up policies, in designing equipment or systems to handle both ordinary events and emergency overflow, in setting schedules to check equipment and to replace components, or in setting up funding to cover all that.
In reviewing the events, and the mistakes, we constantly need to remind ourselves of the maxim: never attribute to malice that which can be adequately explained by incompetence. That’s generally good advice. Yes, there are malicious people around who will screw someone over for fun and profit, but there are a lot more poorly-trained newbies, people who wouldn’t have the job if they weren’t related to the owner, and idiot neighbors.
Sometimes the Incompetence theory is strained. When half a dozen independent decisions or evaluations all go wrong, and all go wrong in the same direction, an honest observer would suspect Malice. My go-to example is the review of Ronald Reagan’s Strategic Defense Initiative, the plan for putting up satellites to kill ICBMs. The “independent, nonpartisan” scientists published a report which claimed that the number of satellites needed was more than 10,000 times the number which was later calculated. At every step of the way they made an estimate: the amount of laser power needed to disrupt a missile, the kill rate needed to make an attack not worth kicking off, and so on. Almost every estimate, every assumption, and every calculation was wrong, and they were all wrong in the same direction, toward showing that the space portion of the SDI was infeasible both technically and economically.
Jerry Pournelle, who had worked with many of the scientists making that erroneous estimate, defended them by saying that surely they didn’t deliberately tank their work; surely it was a matter of making mistakes and letting them stand if they matched the scientists’ prior beliefs, but rechecking them if they went the other way.
I don’t buy it. First, just slopping something together and only half-checking it isn’t the way a scientific review is supposed to go, especially one performed by luminaries in the field. Second, the report was allegedly peer-reviewed, meaning that either the reviewers made exactly the same errors or they didn’t bother to check the work, only the conclusion. Put these together and it’s much more plausible that all of the estimates and assumptions were deliberately high-balled, and that the fact checkers went along with it because they, too, opposed the SDI on ideological grounds.
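To make the arithmetic concrete: a handful of modest, same-direction overestimates multiply together into an enormous one, and the odds of honest errors all falling on the same side are small. The sketch below is a back-of-the-envelope illustration only; the bias factors are made up for the example, not taken from the actual report.

```python
# Back-of-the-envelope sketch of how same-direction estimation errors compound.
# The factors below are invented for illustration; they are not the report's numbers.
from functools import reduce

# Suppose six independent estimates (laser power, kill rate, coverage, and so on)
# are each biased high by a modest, individually defensible factor.
bias_factors = [5, 4, 5, 4, 5, 2.5]

overall_bias = reduce(lambda a, b: a * b, bias_factors)
print(f"Combined overestimate: {overall_bias:,.0f}x")  # 5,000x, the same order as 10,000x

# If each estimate were an honest error, equally likely to land high or low,
# the chance of all six landing on the same (high) side is small:
p_same_side = 0.5 ** len(bias_factors)
print(f"Chance of all errors falling the same way by accident: {p_same_side:.3f}")  # ~0.016
```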
Other examples abound. Some are obvious lies, with blatant malicious acts being written off as simple mistakes or happenstance events. The American elections in 2020 give a lot of examples: voting machine failures predominantly in Republican-heavy districts; preloaded test data “accidentally” left on the tabulating machines before the counting began, always giving Democrats several thousand votes; and so on. (This doesn’t address poll watchers being thrown out and then bags of ballots being pulled out of boxes rather than official transport cases, as caught on video. I’m talking only about events which are claimed to be simple, honest mistakes.)
Other examples are less clear. A highway bridge in New York collapsed about 40 years ago. Somehow it had fallen through the cracks, pun intended, in the inspection schedules and one day it just fell down.
An engineering office lost hundreds of thousands of dollars’ worth of data, and thousands of dollars of hardware, because the building’s power line got cut by a construction crew a couple hundred yards away, no one had set up backup power for those servers, and no one had made data backups in a couple of years. The person whose job it was had left, and no one had thought to assign the job to someone else.
A municipal water system had to issue a boil-water advisory because maintenance had been deferred and deferred again, and then something failed: one branch couldn’t hold pressure and potentially allowed untreated water to contaminate the purified drinking water.
These three examples all involve engineering. That’s because I’m an engineer, these types of things catch my eye, and I understand how they’re supposed to work and how they failed. (All three also affected me, which helped them to stick in my memory.) As with the above, other examples abound, such as business reports being put together with the wrong client’s data, reviewed by several colleagues and at least one manager, and then sent to the correct client, thereby leaking proprietary information. (I saw that one happen, too.)
These are always presented as an unfortunate run of bad luck or, at worst, mistakes, regrettable but certainly not malicious. It strains belief, though: how is it possible that decades of engineering best practices, written policies, and a list of every vehicular bridge in the state could have let one bridge (at least one bridge!) be dropped from the lists to be inspected? It boggles the imagination that no one at DoT noticed that there are 1000 bridges in the state but the crews inspected only 999 each year for ten years in a row. It has to be deliberate, doesn’t it? It couldn’t be that everyone missed it?
I propose that this is exactly what happened: Everyone honestly, though incompetently, missed that the bridge was not being inspected. Everyone honestly, though incompetently, let valuable, non-backed-up data reside only on servers which were known to fail completely if the power flickered.
Many systems today are too complex for anyone but a genius to fully understand: engineered systems, business systems, economic systems, organizational systems. Most systems start simple, but as needs change or problems are found they gradually increase in complexity, from something comprehensible by a bright but not outstanding man to a Gordian knot of relationships and dependencies and “don’t change this section; we don’t know why but if you touch it the whole thing breaks”. Others were complex from the start, set up by a genius and then put into the hands of the only-slightly-above-average to operate.
Regardless of how they became complex, these systems work well enough so long as nothing goes wrong, but something will always go wrong sooner or later. Someone will do things out of order, someone will use a tool or a web page in a way that the designer didn’t expect, power will fail, data will be garbled in transmission, some boss will demand a trivial change with unforeseen ramifications. Something will go wrong.
The problem is that our expectation is for everything to go right. Any deviation from perfection is seen as a problem.
When mistakes are made or things just go wrong, the result is a failed product popping out of the assembly line, a loss of efficiency, or a bridge falling down. Hardly ever does something going wrong result in things going better than expected. (This does happen but it’s rare enough that tales of fortuitous discoveries are endlessly repeated until they seem commonplace.)
Why don’t mistakes make things go better? Because the system has been optimized over the years to be as good as people can make it. Doing things differently is probably going to be worse. You can think of it like assembling a flatpack: swapping parts or doing steps out of order sometimes doesn’t matter and sometimes will screw up the product. Only very rarely will a change make the product better. For the most part the parts list and the instructions were arranged in pretty much the best possible order. The same goes for getting timecards processed and people paid or for keeping a power plant running for years.
This doesn’t contradict what I said before about people not being smart enough to set up a complex system. Trial and error over lots of years and lots of sites will usually settle on a system which is about as good as we can get, even if no one fully understands it.
We can make allowances for things going wrong, and in particular for people not doing everything right. Sometimes the system will include checks to make sure the less-capable or less-conscientious or even the less-honest are doing their jobs right, and fail-safes for when they don’t. Sometimes checks are not included. Checks and fail-safes make a complex system more complex.
If a system is too complex for people to fully understand, they can’t anticipate all the ways in which it can fail. Worse, some systems can be so complex that even known failure modes can’t be properly addressed, often because fixing this thing over here breaks that thing over there.
One of the forms of “breaking that thing over there” is making part of a system too expensive, whether in terms of requiring more highly refined source materials, needing more computing resources to thoroughly check all data inputs before processing them, or having humans follow more detailed checklists with more supervisor approval.
More complex systems with more thorough checks are more expensive to run, too. Every check has a cost as the system runs, as people have to follow more steps or fill out more paperwork or as additional components have to be powered. Every fail-safe has a cost to create and sometimes a cost as the system runs.
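As a toy illustration of that running cost (the pressure readings, thresholds, and logging below are invented for the example, not drawn from any real system), compare a bare-bones routine with the same routine after a couple of checks are bolted on: the useful work is still one line, but every call now pays for validation and logging, and there is several times as much code to read, test, and maintain.

```python
# Toy illustration: each added check makes the system more complex and adds a
# cost on every run. Names and thresholds are invented for the example.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pressure")

# The original, simple version: trust the input and do the work.
def record_pressure_simple(reading_psi: float, history: list) -> None:
    history.append(reading_psi)

# The hardened version: the same one line of useful work, wrapped in checks.
def record_pressure_checked(reading_psi: float, history: list,
                            min_psi: float = 0.0, max_psi: float = 150.0) -> bool:
    # Check 1: reject garbled or implausible readings.
    if not isinstance(reading_psi, (int, float)) or not (min_psi <= reading_psi <= max_psi):
        log.warning("Rejected implausible reading: %r", reading_psi)
        return False
    # Check 2: flag sudden jumps that suggest a failing sensor.
    if history and abs(reading_psi - history[-1]) > 20.0:
        log.warning("Reading jumped from %.1f to %.1f; flagging for review",
                    history[-1], reading_psi)
    history.append(reading_psi)  # the actual work
    return True
```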
It often happens that the executives or the bean-counters insist on reducing scheduled inspections and maintenance because “once every other year is really enough” or cut back safety margins because “it was overdesigned from the beginning”. Then, when the electrical substation catches fire because it was running at 200% of its rated capacity for five years, the spokesman will tell reporters that the power company had been following appropriate guidelines regarding use, maintenance, and replacement of the equipment, not mentioning that the company is the entity which set the guidelines and that they’d been revised annually.
OK, so we see the problem: Most systems, of any type, are either too complex for most people to understand now or they will become so in the future. Attempting to make them more tolerant of errors makes them even more complex. Making the problem worse, the systems are often unintentionally sabotaged in order to save money.
What to do about it? That’s a fine question. The obvious solution is to put very smart people in charge of creating and maintaining the most important and most complex systems, leaving the less bright to operate them or to set up the less important systems. The problems with this are that there might not be enough very smart people to go around, given other demands such as scientific research, and that few executives and managers are willing to turn over control (and funding and implicit power) of something they don’t understand. I’m sure that that is not universal but it’s almost so in my experience. There are the related problems that few corporations and probably no bureaucracies are willing to pay a top performer what he’s worth and that few managers and no HR departments are able to distinguish between a genius and a fraud.
Another approach is to scale back large, complex systems to the point that they can be understood by the people available to work on them. That’s not going to happen, not willingly. The lure of ever-bigger government and of economies of scale is too strong. The urge to make just one more little tweak to a repeatedly tweaked system, rather than redesigning it to properly address new requirements, is just as strong.
The only realistic approach is to be more structured about learning from mistakes and problems and about creating systems based on best practices. Yes, I recognize the irony of setting up a complex system for creating complex systems. Some engineering disciplines do this to some extent, spreading around lessons learned from problems and setting up best practices which professionals are expected to follow. Commercial aviation is well known for doing so, and it’s almost managing to overcome the increase in incompetence at airports. The medical profession also does this, though I’m not sure how much of it is just lip service.
I’m not confident that this approach will be followed, not in general. What I expect is that things will fail or fall apart more and more often in the future. The few bright spots of improvement will be outnumbered by the failures.
Sorry to end on a down note, but that’s the way I see it going. And, hey, at least now you have a better understanding of why you have no electricity in the middle of winter.
There are two additional points that I want to make which didn’t fit into the narrative above.
First, be aware of bias in noticing and reporting. When things go wrong in a big way, it’s noticed and reported on and the cause (or scapegoat) is searched for. When things go wrong but the checks or fail-safes work, it counts as the system working and no one talks about it much, except perhaps grumbling about the production line being halted for three hours because someone shipped the wrong thickness of steel sheets.
Second, sometimes things go wrong not because of incompetence or intent by the operators but because someone had a hidden motivation. This can result in a system set up to fail. A number of government projects in the US seem to be this way, especially IT projects. The conflicted mess of written requirements could not possibly be implemented correctly by the best team under the best of circumstances. Constant interference and changes by politicians on high-visibility projects make it worse. As I said at the start of this article, I’ll take it that in most cases this truly is because of incompetence rather than because a Moriarty in the bureaucracy is setting it up to fail for some purpose of his own.
EDIT: Francis Porretto has expanded on these thoughts with a valuable contribution of his own. Hie thee hence.