Posts Tagged System Safety
The quality of risk management has mostly fallen for the past few decades. There are signs of change for the better.
Risk management is a broad field; many kinds of risk must be managed. Risk is usually defined in terms of probability and cost of a potential loss. Risk management, then, is the identification, assessment and prioritization of risks and the application of resources to reduce the probability and/or cost of the loss.
The earliest and most accessible example of risk management is insurance, first documented in about 1770 BC in the Code of Hammurabi (e.g., rules 23, 24, and 48). The Code addresses both risk mitigation, through threats and penalties, and minimizing loss to victims, through risk pooling and insurance payouts.
Insurance was the first example of risk management getting serious about risk assessment. Both the frequentist and quantified subjective risk measurement approaches (see recent posts on belief in probability) emerged from actuarial science developed by the insurance industry.
Risk assessment, through its close relatives, decision analysis and operations research, got another boost from World War II. Big names like Alan Turing, John Von Neumann, Ian Fleming (later James Bond author) and teams at MIT, Columbia University and Bletchley Park put quantitative risk analyses of several flavors on the map.
Today, “risk management” applies to security guard services, portfolio management, terrorism and more. Oddly, much of what is called risk management involves no risk assessment at all, and is therefore inconsistent with the above definition of risk management, paraphrased from Wikipedia.
Most risk assessment involves quantification of some sort. Actuarial science and the probabilistic risk analyses used in aircraft design are probably the “hardest” of the hard risk measurement approaches, Here, “hard” means the numbers used in the analyses come from measurements of real world values like auto accidents, lightning strikes, cancer rates, and the historical failure rates of computer chips, valves and motors. “Softer” analyses, still mathematically rigorous, involve quantified subjective judgments in tools like Monte Carlo analyses and Bayesian belief networks. As the code breakers and submarine hunters of WWII found, trained experts using calibrated expert opinions can surprise everyone, even themselves.
A much softer, yet still quantified (barely), approach to risk management using expert opinion is the risk matrix familiar to most people: on a scale of 1 to 4, rate the following risks…, etc. It’s been shown to be truly worse than useless in many cases, for a variety of reasons by many researchers. Yet it remains the core of risk analysis in many areas of business and government, across many types of risk (reputation, credit, project, financial and safety). Finally, some of what is called risk management involves no quantification, ordering, or classifying. Call it expert intuition or qualitative audit.
These soft categories of risk management most arouse the ire of independent and small-firm risk analysts. Common criticisms by these analysts include:
1. “Risk management” has become jargonized and often involves no real risk analysis.
2. Quantification of risk in some spheres is plagued by garbage-in-garbage-out. Frequency-based models are taken as gospel, and believed merely because they look scientific (e.g., Fukushima).
3. Quantified/frequentist risk analyses are not used in cases where historical data and a sound basis for them actually exists (e.g., pharmaceutical manufacture).
4. Big consultancies used their existing relationships to sell unsound (fluff) risk methods, squeezing out analysts with sound methods (accused of Arthur Anderson, McKinsey, Bain, KPMG).
5. Quantitative risk analyses of subjective type commonly don’t involve training or calibration of those giving expert opinions, thereby resulting in incoherent (in the Bayesian sense) belief systems.
6. Groupthink and bad management override rational input into risk assessment (subprime mortgage, space shuttle Challenger).
7. Risk management is equated with regulatory compliance (banking operations, hospital medicine, pharmaceuticals, side-effect of Sarbanes-Oxley).
8. Some professionals refuse to accept any formal approach to risk management (medical practitioners and hospitals).
While these criticisms may involve some degree of sour grapes, they have considerable merit in my view, and partially explain the decline in quality of risk management. I’ve worked in risk analysis involving uranium processing, nuclear weapons handling, commercial and military aviation, pharmaceutical manufacture, closed-circuit scuba design, and mountaineering. If the above complaints are valid in these circles – and they are – it’s easy to believe they plague areas where softer risk methods reign.
Several books and scores of papers specifically address the problems of simple risk-score matrices, often dressed up in fancy clothes to look rigorous. The approach has been shown to have dangerous flaws by many analysts and scholars, e.g., Tony Cox, Sam Savage, Douglas Hubbard, and Laura-Diana Radu. Cox shows examples where risk matrices assign higher qualitative ratings to quantitatively smaller risks. He shows that risks with negatively correlated frequencies and severities can result in risk-matrix decisions that are worse than random decisions. Also, such methods are obviously very prone to range compression errors. Most interestingly, in my experience, the stratification (highly likely, somewhat likely, moderately likely, etc.) inherent in risk matrices assume common interpretation of terms across a group. Many tests (e.g., Kahneman & Tversky and Budescu, Broomell, Por) show that large differences in the way people understand such phrases dramatically affect their judgments of risk. Thus risk matrices create the illusion of communication and agreement where neither are present.
Nevertheless, the risk matrix has been institutionalized. It is embraced by government (MIL-STD-882), standards bodies (ISO 31000), and professional societies (Project Management Institute (PMI), ISACA/COBIT). Hubbard’s opponents argue that if risk matrices are so bad, why do so many people use them – an odd argument, to say the least. ISO 31000, in my view, isn’t a complete write-off. In places, it rationally addresses risk as something that can be managed through reduction of likelihood, reduction of consequences, risk sharing, and risk transfer. But elsewhere it redefines risk as mere uncertainty, thereby reintroducing the positive/negative risk mess created by economist Frank Knight a century ago. Worse, from my perspective, like the guidelines of PMI and ISACA, it gives credence to structure in the guise of knowledge and to process posing as strategy. In short, it sets up a lot of wickets which, once navigated, give a sense that risk has been managed when in fact it may have been merely discussed.
A small benefit of the subprime mortgage meltdown of 2008 was that it became obvious that the financial risk management revolution of the 1990s was a farce, exposing a need for deep structural changes. I don’t follow financial risk analysis closely enough to know whether that’s happened. But the negative example made public by the housing collapse has created enough anxiety in other disciplines to cause some welcome reappraisals.
There is surprising and welcome activity in nuclear energy. Several organizations involved in nuclear power generation have acknowledged that we’ve lost competency in this area, and have recently identified paths to address the challenges. The Nuclear Energy Institute recently noted that while Fukushima is seen as evidence that probabilistic risk analysis (PRA) doesn’t work, if Japan had actually embraced PRA, the high risk of tsunami-induced disaster would have been immediately apparent. Late last year the Nuclear Energy Institute submitted two drafts to the U.S. Nuclear Regulatory Commission addressing lost ground in PRA and identifying a substantive path forward: Reclaiming the Promise of Risk-Informed Decision-Making and Restoring Risk-Informed Regulation. These documents acknowledge that the promise of PRA has been stunted by distrust of the method, focus on compliance instead of science, external audits by unqualified teams, and the above-mentioned Fukushima fallacy.
Likewise, the FDA, often criticized for over-regulating and over-reach – confusing efficacy with safety – has shown improvement in recent years. It has revised its decades-old process validation guidance to focus more on verification, scientific evidence and risk analysis tools rather than validation and documentation. The FDA’s ICH Q9 (Quality Risk Management) guidelines discuss risk, risk analysis and risk management in terms familiar to practitioners of “hard” risk analysis, even covering fault tree analysis (the “hardest” form of PRA) in some detail. The ASTM E2500 standard moves these concepts further forward. Similarly, the FDA’s recent guidelines on mobile health devices seem to accept that the FDA’s reach should not exceed its grasp in the domain of smart phones loaded with health apps. Reading between the lines, I take it that after years of fostering the notion that risk management equals regulatory compliance, the FDA realized that it must push drug safety far down into the ranks of the drug makers in the same way the FAA did with aircraft makers (with obvious success) in the late 1960s. Fostering a culture of safety rather than one of compliance distributes the work of providing safety and reduces the need for regulators to anticipate every possible failure of every step of every process in every drug firm.
This is real progress. There may yet be hope for financial risk management.
In a recent post I mentioned that probabilistic failure models are highly vulnerable to wrong assumptions of independence of failures, especially in redundant system designs. Common-mode failures in multiple channels defeats the purpose of redundancy in fault-tolerant designs. Likewise, if probability of non-function is modeled (roughly) as historical rate of a specific component failure times the length of time we’re exposed to the failure, we need to establish that exposure time with great care. If only one channel is in control at a time, failure of the other channel can go undetected. Monitoring systems can detect such latent failures. But then failures of the monitoring system tend to be latent.
For example, your car’s dashboard has an engine oil warning light. That light ties to a monitor that detects oil leaks from worn gaskets or loose connections before the oil level drops enough to cause engine damage. Without that dashboard warning light, the exposure time to an undetected slow leak is months – the time between oil changes. The oil warning light alerts you to the condition, giving you time to deal with it before your engine seizes.
But what if the light is burned out? This failure mode is why the warning lights flash on for a short time when you start your car. In theory, you’d notice a burnt-out warning light during the startup monitor test. If you don’t notice it, the exposure time for an oil leak becomes the exposure time for failure of the warning light. Assuming you change your engine oil every 9 months, loss of the monitor potentially increases the exposure time from minutes to months, multiplying the probability of an engine problem by several orders of magnitude. Aircraft and nuclear reactors contain many such monitoring systems. They need periodic maintenance to ensure they’re able to detect failures. The monitoring systems rarely show problems in the check-ups; and this fact often lures operations managers, perceiving that inspections aren’t productive, into increasing maintenance intervals. Oops. Those maintenance intervals were actually part of the system design, derived from some quantified level of acceptable risk.
Common-mode failures get a lot press when they’re dramatic. They’re often used by risk managers as evidence that quantitative risk analysis of all types doesn’t work. Fukushima is the current poster child of bad quantitative risk analysis. Despite everyone’s agreement that any frequencies or probabilities used in Fukushima analyses prior to the tsunami were complete garbage, the result for many was to conclude that probability theory failed us. Opponents of risk analysis also regularly cite the Tacoma Narrows Bridge collapse, the Chicago DC-10 engine-loss disaster, and the Mount Osutaka 747 crash as examples. But none of the affected systems in these disasters had been justified by probabilistic risk modeling. Finally, common-mode failure is often cited in cases where it isn’t the whole story, as with the Sioux City DC-10 crash. More on Sioux City later.
On the lighter side, I’d like to relate two incidents – one personal experience, one from a neighbor – that exemplify common-mode failure and erroneous assumptions of exposure time in everyday life, to drive the point home with no mathematical rigor.
I often ride my bicycle through affluent Marin County. Last year I stopped at the Molly Stone grocery in Sausalito, a popular biker stop, to grab some junk food. I locked my bike to the bike rack, entered the store, grabbed a bag of chips and checked out through the fast lane with no waiting. Ninety seconds at most. I emerged to find no bike, no lock and no thief.
I suspect that, as a risk man, I unconsciously model all risk as the combination of some numerical rate (occurrence per hour) times some exposure time. In this mental model, the exposure time to bike theft was 90 seconds. I likely judged the rate to be more than zero but still pretty low, given broad daylight, the busy location with lots of witnesses, and the affluent community. Not that I built such a mental model explicitly of course, but I must have used some unconscious process of that sort. Thinking like a crook would have served me better.
If you were planning to steal an expensive bike, where would you go to do it? Probably a place with a lot of expensive bikes. You might go there and sit in your pickup truck with a friend waiting for a good opportunity. You’d bring a 3-foot long set of chain link cutters to make quick work of the 10 mm diameter stem of a bike lock. Your friend might follow the victim into the store to ensure you were done cutting the lock and throwing the bike into the bed of your pickup to speed away before the victim bought his snacks.
After the fact, I had much different thought thoughts about this specific failure rate. More important, what is the exposure time when the thief is already there waiting for me, or when I’m being stalked?
My neighbor just experienced a nerve-racking common mode failure. He lives in a San Francisco high-rise and drives a Range Rover. His wife drives a Mercedes. He takes the Range Rover to work, using the same valet parking-lot service every day. He’s known the attendant for years. He takes his house key from the ring of vehicle keys, leaving the rest on the visor for the attendant. He waves to the attendant as he leaves the lot on way to the office.
One day last year he erred in thinking the attendant had seen him. Someone else, now quite familiar with his arrival time and habits, got to his Range Rover while the attendant was moving another car. The thief drove out of the lot without the attendant noticing. Neither my neighbor nor the attendant had reason for concern. This gave the enterprising thief plenty of time. He explored the glove box, finding the registration, which includes my neighbor’s address. He also noticed the electronic keys for the Mercedes.
The thief enlisted a trusted colleague, and drove the stolen car to my neighbor’s home, where they used the electronic garage entry key tucked neatly into its slot in the visor to open the gate. They methodically spiraled through the garage, periodically clicking the button on the Mercedes key. Eventually they saw the car lights flash and they split up, each driving one vehicle out of the garage using the provided electronic key fobs. My neighbor lost two cars though common-mode failures. Fortunately, the whole thing was on tape and the law men were effective; no vehicle damage.
Should I hide my vehicle registration, or move to Michigan?
In theory, there’s no difference between theory and practice. In practice, there is.
Last time I started with my friend Willie’s bold claim that he doesn’t believe in probability; then I gave a short history of probability. I observed that defining probability is a controversial matter, split between objective and subjective interpretations. About the only thing these interpretations agree on is that probability values range from zero to one, where P = 1 means certainty. When you learn probability and statistics in school, you are getting the frequentist interpretation, which is considered objective. Frequentism relies on directly equating observed frequencies with probabilities. In this model, the probability of an event exactly equals the limit of the relative frequency of that outcome in an infinitely large number of trials.
The problem with this interpretation in practice – in medicine, engineering, and gambling machines – isn’t merely the impossibility of an infinite number of trials. A few million trials might be enough. Running trials works for dice but not for earthquakes and space shuttles. It also has problems with things like cancer, where plenty of frequency data exists. Frequentism requires placing an individual specimen into a relevant population or reference class. Doing this is easy for dice, harder for humans. A study says that as a white males of my age I face a 7% probability of having a stroke in the next 10 years. That’s based on my membership in the reference class of white males. If I restrict that set to white men who don’t smoke, it drops to 4%. If I account for good systolic blood pressure, no family history of atrial fibrillation or ventricular hypertrophy, it drops another percent or so.
Ultimately, if I limit my population to a set of one (just me) and apply the belief that every effect has a cause (i.e., some real-world chunk of blockage causes an artery to rupture), you can conclude that my probability of having a stroke can only be one of two values – zero or one.
Frequentism, as seen by its opponents, too closely ties probabilities to observed frequencies. They note that the limit-of-relative-frequency concept relies on induction, which might mean it’s not so objective after all. Further, those frequencies are unknowable in many real-world cases. Still further, finding an individual’s correct reference class is messy, possibly downright subjective. Finally, no frequency data exists for earthquakes that haven’t happened yet. All that seems to do some real damage to frequentism’s utility score.
The subjective interpretations of probability propose fixes to some of frequentism’s problems. The most common subjective interpretation is Bayesianism, which itself comes in several flavors. All subjective interpretations see probability as a degree of belief in a specific outcome, as held by a rational person. Think of it as a fair bet with odds. The odds you’re willing to accept for a bet on your race horse exactly equals your degree of belief in that horse’s ability to win. If your filly were in the same race an infinite number of times, you’d expect to break even, based on those odds, whether you bet on her or against her.
Subjective interpretations rely on logical coherence and belief. The core of Bayesianism, for example, is that beliefs must 1) originate with a numerical probability estimate, 2) adhere to the rules of probability calculation, and 3) follow an exact rule for updating belief estimates based on new evidence. The second rule deals with the common core of probability math used in all interpretations. These include things like how to add and multiply probabilities and Bayes theorem, not to be confused with Bayesianism, the belief system. Bayes theorem is an uncontroversial equation relating the probability of A given B to the probability of A and the probability of B. The third rule of Bayesianism is similarly computational, addressing how belief is updated after new evidence. The details aren’t needed here. Note that while Bayesianism is generally considered subjective, it is still computationally exacting.
The obvious problem with all subjective interpretations, particularly as applied to engineering problems, is that they rely, at least initially, on expert opinion. Life and death rides on the choice of experts and the value of their opinions. As Richard Feynman noted in his minority report on the Challenger, official rank plays too large a part in the choice of experts, and the higher (and less technical) the rank, the more optimistic the probability estimates.
The engineering risk analysis technique most consistent with the frequentist (objective) interpretation of probability is fault tree analysis. Other risk analysis techniques, some embodied in mature software products, are based on Bayesian (subjective) philosophy.
When Willie said he didn’t believe in probability, he may have meant several things. I’ll try to track him down and ask him, but I doubt the incident stuck in his mind as it did mine. If he meant that he doesn’t believe that probability was useful in system design, he had a rational belief; but I disagree with it. I doubt he meant that though.
Willie may have been leaning toward the ties between probability and redundancy in system design. Probability is the calculus by which redundancy is allocated to redundant systems. Willie may think that redundancy doesn’t yield the expected increase in safety because having more equipment means more things than can fail. This argument fails to face that, ideally speaking, a redundant path does double the chance having a component failure, but squares the probability of system failure. That’s a good thing, since squaring a number less than one makes it smaller. In other words, the benefit in reducing the chance of system failure vastly exceeds the deficit of having more components to repair. If that was his point, I disagree in principle, but accept that redundancy is no excuse for lack of component design excellence.
He may also think system designers can be overly confident of the exponential increase in modeled probability of system reliability that stems from redundancy. That increase in reliability is only valid if the redundancy creates no common mode failures and no latent (undetected for unknown time intervals) failures of redundant paths that aren’t currently operating. If that’s his point, then we agree completely. This is an area where pairing the experience and design expertise of someone like Willie with rigorous risk analysis using fault trees yields great systems.
Unlike Willie, Challenger-era NASA gave no official statement on its belief in probability. Feynman’s report points to NASA’s use of numeric probabilities for specific component failure modes. The Rogers Commission report says that NASA management talked about degrees of probability. From this we might guess that NASA believed in probability and its use in measuring risk. On the other hand, the Rogers Commission report also gives examples of NASA’s disbelief in probability’s usefulness. For example, the report’s Technical Management section states that, “NASA has rejected the use of probability on the basis that such techniques are insufficient to assure that adequate safety margins can be applied to protect the lives of the crew.”
Regardless of what NASA’s beliefs about porbability, it’s clear that NASA didn’t use fault tree analysis for the space shuttle program prior to the Challenger disaster. Nor did it use Bayesian inference methods, any hybrid probability model, or any consideration of probability beyond opinions about failures of critical items. Feynman was livid about this. A Bayesian (subjective, but computational) approach would have at least forced NASA to make it subjective judgments explicit and would have produced a rational model of its judgments. Post-Challenger Bayesian analyses, including one by NASA, varied widely, but all indicated unacceptable risk. NASA has since adopted risk management approaches more consistent with those used in commercial and military aircraft design.
An obvious question arises when you think about using a frequentist model on nearly one-of-a-kind vehicles. How accurate can any frequency data be for something as infrequent as a shuttle flight? Accurate enough, in my view. If you see the shuttle as monolithic and indivisible, the data is too sparse; but not if you view it as a system of components, most of which, like o-ring seals, have close analogs in common use, with known failure rates.
The FAA mandated probabilistic risk analyses of the frequentist variety (effectively mandating fault trees) in 1968. Since then flying has become safe, by any measure. In no other endeavor has mankind made such an inherently dangerous activity so safe. Aviation safety progressed through many innovations, redundant systems being high on the list. Probability is the means by which you allocate redundancy. You can’t get great aircraft systems without designers like Willie. Nor can you get them without probability. Believe it or not.
On reading my praise of Richard Feynman, a fellow systems engineer and INCOSE member (International Council on Systems Engineering) suggested that I read Feynman’s Minority Report to the Space Shuttle Challenger Enquiry. He said I might not like it. I read it, and I don’t like it, not from the perspective of a systems engineer.
Challenger explosion, Jan. 28, 1986
I should be clear on what I mean by systems engineering. I know of three uses of the term: first, the engineering of embedded systems, i.e., firmware (not relevant here); second, an organizational management approach (relevant, but secondary); third, a discipline aimed at design of assemblies of components to achieve a function that is greater than those of its constituents (bingo). Definitions given by others are useful toward examining Feynman’s minority report on the Challenger.
Simon Ramo, the “R” in TRW and inventor of the ICBM, put it like this: “Systems engineering is a discipline that concentrates on the design and application of the whole (system) as distinct from the parts. It involves looking at a problem in its entirety, taking into account all the facets and all the variables and relating the social to the technical aspect.”
Howard Eisner of GWU says, “Systems engineering is an iterative process of top-down synthesis, development, and operation of a real-world system that satisfies, in a near optimal manner, the full range of requirements for the system.”
INCOSE’s definition is pragmatic (pleasantly, as their guide tends a bit toward strategic-management jargon): “Systems engineering is an interdisciplinary approach and means to enable the realization of successful systems.”
Feynman reaches several sound conclusions about root causes of the flight 51-L Challenger disaster. He observes that NASA’s safety culture had critical flaws and that its management seemed to indulge in fantasy, ignoring the conclusions, advice and warnings of diligent systems and component engineers. He gives specific examples of how NASA management grossly exaggerated the reliability of many systems and components in the shuttle. On this point he concludes, “reality must take precedence over public relations, for nature cannot be fooled.” He describes a belief by management that because an anomaly was without consequence in a previous mission, it is therefore safe. Most importantly, he cites the erroneous use of the concept of factor of safety around the O-ring seals between the two lower segments of the solid rocket motors by NASA management (the Rogers Commission also agrees that failure of these O-rings was the root cause of the disaster). An NASA report on seal erosion in an earlier mission (flight 51-C) had assigned a safety factor of three, based on the seals having eroded only one third of the amount thought to be critical. Feynman replies that the O-rings were not designed to erode, and hence the factor-of-safety concept did not apply. Seal erosion was a failure of the design, catastrophic or not; there was no safety factor at all. “Erosion was a clue that something was wrong; not something from which safety could be inferred.”
But later Feynman incorrectly states that establishing a hypothetical propulsion system failure rate of 1 in 100,000 missions would require an inordinate number of tests to determine with confidence. Here he seems not to grasp both the exponential impact of redundancy on reliability, and that fault tree analysis could confidently calculate low system failure rates based on historical failure rates of large populations of constituent components, combined with the output of FMEAs (failure mode effects analyses) on those components in the relevant systems. This error does not impact Feynman’s conclusions about the root cause of the Challenger disaster. I mention it here because Feynman might be viewed as an authoritative source on systems engineering, but is here doing a poor job of systems engineering.
Discussing the liquid fuel engines, Feynman then introduces the concept of top-down design, which he criticizes. It isn’t clear exactly what he means by top-down. The most charitable reading would be a critique of NASA top management’s overruling the judgments of engineering management and engineers; but, on closer reading, it’s clear this cannot be his meaning:
The usual way that such engines are designed (for military or civilian aircraft) may be called the component system, or bottom-up design. First it is necessary to thoroughly understand the properties and limitations of the materials to be used (for turbine blades, for example), and tests are begun in experimental rigs to determine those. With this knowledge larger component parts (such as bearings) are designed and tested individually…
The Space Shuttle Main Engine was handled in a different manner, top down, we might say. The engine was designed and put together all at once with relatively little detailed preliminary study of the material and components. Then when troubles are found in the bearings, turbine blades, coolant pipes, etc., it is more expensive and difficult to discover the causes and make changes.
All mechanical-system design is necessarily top-down, in the sense of top-down used by Eisner, above. This use of the term is metaphor for progressive functional decomposition from mission requirements down to component requirements. Engineers cannot, for example, size a shuttle’s fuel pumps based on the functional requirement of having five men and two women orbit the earth to deploy a communications satellite. The fuel pump’s performance requirements ultimately emerge from successive derivations of requirements for subsystem design candidates. This design process is top-down, whether the various layers of subsystem design candidates are themselves newly conceived systems or ones that are already mature products (“off the shelf”). Wikipedia’s article and several software methodology sites incorrectly refer to design using off-the-shelf components as bottom-up – not involving functional decomposition. They err by failing to consider that piecing together existing subsystems toward a grander purpose still first requires functional decomposition of that grander purpose into lower-level requirements that serve as a basis for selecting existing subsystems. Simply put, you’ve got to know what you want a thing to do, even if you build that thing from available parts – software or hardware – in order to select those parts. Using off-the-shelf software subsystems still requires functional decomposition of the desired grander system.
F-117 frontal view
Off-the-shelf is a common strategy in aerospace, primarily for cost and schedule reasons. The Lockheed F-117, despite its unique design, used avionics taken from the C-130 and the F-16, brakes from the F-15, landing gear from the T-38, and other parts from commercial and military aircraft. This was for expediency. For the F-117, these off-the-shelf components still had to go through the necessary requirements validation, functional and stress testing, certification, and approval by all of the “ilities” (reliability, maintainability, supportability, durability, etc) required to justify their use in the vehicle – just as if they were newly designed. Likewise for the Challenger, the choice of new design vs. off-the-shelf should have had no impact on safety or reliability if proper systems engineering occurred. Whether its constituents were new designs or off-the-shelf, the shuttle’s propulsion system is necessarily – and desirably – the result of top-down design. Feynman may simply mean that the design and testing phases were rushed, that omissions were made, and that testing was incomplete. Other evidence suggests this; but these omissions are not a negative consequence of top-down design, which is the only sound process for the design of aircraft and other systems of systems.
It is difficult to imagine any sound basis for Feynman’s use of – and defense of – bottom-up design other than the selection of off-the-shelf components, which, as mentioned above, still entails functional decomposition (top-down design). Other uses of the term appear in discussions of software methodologies. I also found a handful of academic papers that incorrectly – incoherently, in my view – equate top-down with analysis and deduction, and bottom-up with synthesis and induction. The erroneous equation of analysis with deductive reasoning pops up in Design Thinking and social science literature (e.g., at socialresearchmethods.net). It fails to realize that analysis as a means of inferring cause from observed result (i.e., what made this happen?) always entails inductive reasoning. Geometry is deduction; science and engineering are inherently inductive.
The use of bottom-up shows up in software circles in a disparaging sense. It describes a state of system growth that happens with no conscious design beyond that of an original seed. It is non-design, in a sense. Such “organic growth” happens in enterprise software when new features, not envisioned during the original design, are later bolted-on. This can stem from naïve mismanagement by those unaware of the damage done to maintainability and further extensibility of the software system, or through necessity in a merger/acquisition scenario where the system’s owners are aware of the consequences but have no other alternatives. This scenario obviously does not apply to the hardware or software of the Challenger; and if it did, such bottom-up “design” would be a defect of the system, not a virtue.
Hydro-mechanical system components in 737 gear bay
Aerospace has in its legacy an attitude – as opposed to a design method – sometimes called a bottom-up mindset. I’ve encountered this as a form of resistance to methodological system-design-for-safety and the application of redundancy. In my experience it came from expert designers of electro-hydro-mechanical subsystems. A legendary aerospace systems designer once told me with a straight face, “I don’t believe in probability.” You can trace this type of thinking back to the rough and ready pioneers of manned flight. Charles Lindbergh, for example, said something along the lines of, “give me one good engine and one good pilot.” Implicit in this mentality is the notion that safety emerges from component quality rather than from system design. The failure rates of the best aerospace components tend to vary from those of average components by factors of two or ten, whereas redundancy has an exponential effect. Feynman’s criticism of top-down and endorsement of bottom-up – whatever he meant by it – could unfortunately be seen as support for this harmful and oddly persistent notion of bottom-up.
Toward the end of Feynman’s report, he reveals another misunderstanding about design of life-critical systems. In the section on avionics, he faults NASA for using 15-year-old software and hardware designs, concluding that the electronics are obsolete. He claims that modern chip sets are more reliable and of higher quality. This criticism runs contrary to his complaint about top-down design of the main engines, and it misses a key point. The improvements in reliability of newer chips would contribute only negligibly toward improved availability of the quad-redundant system containing them. More importantly, older designs of electronic components are often used in avionics precisely because they are old, mature designs. Accelerated-life testing of electronics is known to be tricky business. We use old-design chips because there is enough historical usage data to determine their failure rates without relying on accelerated-life testing. Long ago at McDonnell Douglas I oversaw use of the Intel 87C196 chip for a system on the C-17 aircraft. The Intel rep told me that this was the first use of the Intel 8086-derivative chip in a military aircraft. We defended its use, over the traditional but less capable Motorola chips, on the basis that the then 10+ year history of 8086’s in similar environments was finally sufficient to establish a statistical failure rate usable in our system availability calculations. Interestingly, at that time NASA had already been using 8086 chips in the shuttle for years.
Feynman’s minority report on the Challenger contains misunderstandings and technical errors from the perspective of a systems engineer. While these errors may have little impact on his findings, they should be called out because of the possible influence they may have on future generations of engineers. The tyranny of pedigree, as we saw with Galileo, can extend a wrong idea’s life for generations.
That said, Feynman makes several key points about the psychology of engineering management that deserve much more attention than they get in engineering circles. First among these in my mind is the fallacy of induction from near-misses viewed as successes, thereby producing undue confidence about future missions.
“His legs were weary, but his mind was at ease, free from the presentiment of change. The sense of security more frequently springs from habit than from conviction, and for this reason it often subsists after such a change in the conditions as might have been expected to suggest alarm. The lapse of time during which a given event has not happened is, in the logic of habit, constantly alleged as a reason why the event should never happen, even when the lapse of time is precisely the added condition which makes the event imminent. A man will tell you that he has worked in a mine for forty years unhurt by an accident, as a reason why he should apprehend no danger, though the roof is beginning to sink; and it is often observable that the older a man gets, the more difficult it is to retain a believing conception of his own death.”
– from Silas Marner, by George Eliot (Mary Ann Evans Cross), 1861
Text and aircraft photos copyright 2013 by William Storage. NASA shuttle photos public domain.