The Path to More Cost-Effective System Safety
Nancy Leveson
Aeronautics and Astronautics Dept., MIT
Changes in the Last 50 Years
- New causes of accidents created by the use of software
- Role of humans in systems and in accidents has changed
- Increased recognition of the importance of organizational and social factors in accidents
- Faster pace of technological change
  - Learning from experience ("fly-fix-fly") no longer as effective
  - Introduces unknowns and new paths to accidents
- Less exhaustive testing is possible
- Increasing complexity
- Decreasing tolerance for single accidents
Why Our Efforts Are Often Not Cost-Effective
- Efforts are superficial, isolated, or misdirected
  - Often isolated from engineering design
- Too much time and effort spent on assurance rather than building safety in from the beginning
  - Focusing on making arguments that systems are safe rather than making them safe
  - Safety cases: subject to confirmation bias
  - Traditional System Safety tries to prove the system is unsafe (looks for paths to hazards), not that it is safe
- Safety must be built in; it cannot be assured in or argued in
Why Our Efforts Are Often Not Cost-Effective (2)
- Safety efforts start too late
  - 80-90% of safety-critical decisions are made in early system concept formation (C.O. Miller)
  - Cannot add safety to an unsafe design
  - Most of our techniques require a relatively complete design to work
Why Our Efforts Are Often Not Cost-Effective (3)
- Using inappropriate techniques for the systems being built today
  - Most widely used hazard analysis techniques were created 40-50 years ago
  - Developed for relatively simple electromechanical systems
- New technology is increasing the complexity of system designs and introducing new accident causes
  - Complexity is creating new causes of accidents
- Need new, more powerful safety engineering approaches to deal with complexity and these new causes of accidents
Relation of Complexity to Safety
- In complex systems, behavior cannot be thoroughly
  - Planned
  - Understood
  - Anticipated
  - Guarded against
- Critical factor is intellectual manageability
  - Leads to unknowns in system behavior
- Need tools to
  - Stretch our intellectual limits
  - Deal with new causes of accidents
Types of Complexity Relevant to Safety
- Interactive complexity: arises in the interactions among system components
- Non-linear complexity: cause and effect are not related in an obvious way
- Dynamic complexity: related to changes over time
Chain-of-Events Model
- Explains accidents in terms of multiple events, sequenced as a forward chain over time
- Simple, direct relationship between events in the chain
- Events almost always involve component failure, human error, or an energy-related event
- Forms the basis for most safety engineering and reliability engineering
  - Analysis: e.g., FTA, PRA, FMECA, event trees, etc.
  - Design: e.g., redundancy, overdesign, safety margins, etc.
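To make the failure-combination view concrete, the following is a minimal sketch, not taken from the slides, of the kind of calculation FTA/PRA-style analyses rest on: basic failure events combined through AND/OR gates, with probabilities multiplied under an independence assumption. The events, tree structure, and numbers are entirely hypothetical.

```python
# Hypothetical fault-tree sketch illustrating the chain-of-events / failure-
# combination view formalized by FTA and PRA. All events and probabilities
# are made up; independence of basic events is assumed when combining them.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class BasicEvent:
    name: str
    p: float  # assumed random, independent failure probability


@dataclass
class Gate:
    kind: str  # "AND" or "OR"
    inputs: List[Union["Gate", BasicEvent]]


def probability(node: Union[Gate, BasicEvent]) -> float:
    """Top-event probability under the independence assumption."""
    if isinstance(node, BasicEvent):
        return node.p
    child_ps = [probability(c) for c in node.inputs]
    if node.kind == "AND":            # all inputs must fail
        prob = 1.0
        for p in child_ps:
            prob *= p
        return prob
    prob_none = 1.0                   # OR gate: at least one input fails
    for p in child_ps:
        prob_none *= 1.0 - p
    return 1.0 - prob_none


# Hypothetical tree: loss of cooling requires the pump to fail AND
# (the relief valve to stick OR the operator to miss the bypass).
top = Gate("AND", [
    BasicEvent("pump fails", 1e-3),
    Gate("OR", [BasicEvent("valve stuck", 1e-2),
                BasicEvent("bypass not opened", 5e-2)]),
])
print(f"Top-event probability: {probability(top):.2e}")
```

The limitations listed on the next slide concern exactly what this style of model leaves out: accidents arising from interactions among components that have not failed, systemic factors, and design errors.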
Limitations of Chain-of-Events Causation Models
- Oversimplifies causality
- Excludes or does not handle
  - Component interaction accidents (vs. component failure accidents)
  - Indirect or non-linear interactions among events
  - Systemic factors in accidents
  - Human errors
  - System design errors (including software errors)
  - Adaptation and migration toward states of increasing risk
Reliability Engineering Approach to Safety
- Many accidents occur without any component failure
  - Caused by equipment operation outside the parameters and time limits upon which the reliability analyses are based
  - Caused by interactions of components all operating according to specification
- Highly reliable components are not necessarily safe
Reliability is NOT equal to safety in complex systems
"It's only a random failure, sir! It will never happen again."
Software and Safety
- Software is very different from hardware
- We cannot simply apply techniques built on hardware assumptions and expect them to work
- We need something new that fits the properties of software
Software-Related Accidents
- Are usually caused by flawed requirements
  - Incomplete or wrong assumptions about the operation of the controlled system or the required operation of the computer
  - Unhandled controlled-system states and environmental conditions
- Merely trying to get the software correct or to make it reliable will not make it safer under these conditions (sketched in the code below)
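As a minimal, hypothetical sketch of that point: the controller below is assumed to implement its stated requirement exactly (close the fill valve when the measured tank level reaches a high limit), yet the system can still overflow because the requirement never considered a stale sensor reading, an unhandled controlled-system state. The requirement, names, and numbers are illustrative only and not from the slides.

```python
# Hypothetical example: code that is "correct" against its requirement but
# unsafe because the requirement ignored a controlled-system state (a frozen
# level sensor). Threshold, fill rate, and scenario are made up.
from typing import Optional

HIGH_LIMIT = 90.0  # percent full (hypothetical threshold)


def valve_command(measured_level: float) -> str:
    """Correctly implements the assumed requirement: close at HIGH_LIMIT."""
    return "CLOSE" if measured_level >= HIGH_LIMIT else "OPEN"


def simulate(sensor_freezes_at_step: Optional[int] = None) -> float:
    """Tank fills 5%/step while the valve is open; the sensor may freeze."""
    actual = 0.0    # real tank level
    measured = 0.0  # what the software sees
    for step in range(40):
        if sensor_freezes_at_step is None or step < sensor_freezes_at_step:
            measured = actual               # healthy sensor tracks reality
        if valve_command(measured) == "OPEN":
            actual += 5.0                   # tank keeps filling
    return actual


print("Sensor healthy:      tank ends at", simulate(), "% full")    # 90.0
print("Sensor frozen early: tank ends at", simulate(10), "% full")  # 200.0, overflow
```

No amount of testing against the stated requirement or improving the code's reliability would expose this flaw; the problem lies in what the requirement assumed about the controlled system.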
Human Error
- Human error is a symptom, not a cause
- All behavior is affected by the context (system) in which it occurs
- To do something about error, we must look at the system in which people work:
  - Design of the equipment
  - Usefulness of the procedures
  - Existence of goal conflicts and production pressures
- The role of operators in our systems is changing
  - Supervising rather than directly controlling
  - Systems are stretching the limits of comprehensibility
- We design systems in which operator error is inevitable, then blame accidents on the operators rather than the designers
Why Our Efforts Are Often Not Cost-Effective (4)
- Focus efforts only on the technical components of systems
- Ignore or only superficially handle
  - Management decision making
  - Operator error (and operations in general)
  - Safety culture
- Focus on development and often ignore operations
Why Our Efforts Are Often Not Cost-Effective (5)
- Inadequate risk communication
- Applying probabilistic risk analysis to events that are not random
  - Software errors are design errors, not random failures
  - Human error is not random (slips vs. mistakes)
  - Component interaction accidents (system design errors) are not random
- End up either leaving things out or making up numbers
- Need better ways to assess and communicate risk
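A small, hypothetical sketch of the first sub-point: a software design error manifests deterministically whenever its triggering condition occurs and never otherwise, so modeling it with a component-style random failure rate (as PRA does for hardware) describes the operational profile rather than the software. The function and its flaw are made up for illustration.

```python
# Hypothetical illustration that software errors are design errors, not random
# failures: this flaw fires every single time its triggering condition occurs
# and never otherwise, regardless of how "reliable" the component is.

def average_rate(events: int, elapsed_seconds: int) -> float:
    # Flawed design: nobody considered elapsed_seconds == 0
    # (e.g., two timestamps landing in the same second).
    return events / elapsed_seconds


# Deterministic, not random: same inputs, same outcome, every time.
for elapsed in (10, 5, 0, 0, 10):
    try:
        print(f"elapsed={elapsed:2d} -> rate={average_rate(7, elapsed):.2f}/s")
    except ZeroDivisionError:
        print(f"elapsed={elapsed:2d} -> FAILS (every time this condition occurs)")
```

How often the loss occurs depends on how often operations produce the triggering condition, which is why assigning it a component-style failure probability amounts to making up a number.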
Why Our Efforts Are Often Not Cost-Effective (6)
- Limited learning from events
  - Oversimplification of accident causation
  - Blame is the enemy of safety: focus on who and not why
  - "Root cause seduction": believing in a root cause appeals to our desire for control
- Leads to a sophisticated game of whack-a-mole
  - Fix symptoms but not the process that led to those symptoms
  - In continual fire-fighting mode
  - Having the same accident over and over
The Starting Point: Questioning Our Assumptions
"It's never what we don't know that stops us, it's what we do know that just ain't so." (attributed to many people)
Many of our hazard analysis and System Safety techniques are based on assumptions that no longer hold.
Assumptions No Longer True
- Safety is increased by increasing system or component reliability; if components do not fail, then mishaps will not occur.
- Mishaps are caused by chains of directly related events. We can understand mishaps and assess risk by examining the events leading to the loss.
- Probabilistic risk analysis based on event chains is the best way to assess and communicate safety and risk information.
Assumptions No Longer True (2)
- Most mishaps are caused by operator error. Rewarding safe behavior, punishing unsafe behavior, and better training will eliminate or significantly reduce accidents.
- Human error is random and can be assessed using probabilities.
- Highly reliable software is safe.
- Most major mishaps occur from the chance simultaneous occurrence of random events.
- Assigning blame is necessary to learn from and prevent mishaps.
Summary
- The world of engineering is changing. If system safety does not change with it, it will become more and more irrelevant.
- Trying to shoehorn new technology and new levels of complexity into old methods does not work.
So What Do We Need to Do? Engineering a Safer World
- Expand our accident causation models
- Create new, more powerful and inclusive hazard analysis techniques
- Use new system design techniques (safety-guided design)
- Integrate System Safety more into system engineering
- Improve accident analysis and learning from events
- Improve control of safety during operations
- Improve management decision-making and safety culture
Additional information in: Nancy Leveson, Engineering a Safer World: Systems Thinking Applied to Safety, MIT Press, January 2012.