Resilience Engineering as an approach to safety for industry and society. Erik Hollnagel, Professor, Ph.D.


1 Resilience Engineering as an approach to safety for industry and society Erik Hollnagel, Professor, Ph.D.

2 The meaning of safety From French "sauf" = unharmed / except. How can it be done? How much risk is acceptable? How much risk is affordable? SAFETY = FREEDOM FROM UNACCEPTABLE RISK. What can go wrong? Prevention of unwanted events. Protection against unwanted outcomes. Unwanted outcome. Unexpected event. LIFE, PROPERTY, MONEY. Normal performance. Accidents, incidents,

3 Safety-I definitions - examples Safety is the state in which the risk of harm to persons or of property damage is reduced to, and maintained at or below, an acceptable level through a continuing process of hazard identification and risk management. Safety is defined as freedom from accidental injury, which can be achieved by avoiding injuries or harm to patients from care that is intended to help them. Industrial safety can be defined as the ability to manage the risks inherent to operations or related to the environment. Industrial safety is not a dislike of risks; rather it is a commitment to clearly identify them in relation to production operations, assess them in terms of quality and quantity, and manage them.

4 The causality credo Adverse outcomes (accidents, incidents) happen when something goes wrong. Adverse outcomes therefore have causes, which can be found and treated. Find the component that failed by reasoning backwards from the final consequence. Accidents result from a combination of active failures (unsafe acts) and latent conditions (hazards). Find the probability that something breaks, either alone or by simple, logical and fixed combinations. Look for single failures combined with latent conditions that may degrade barriers and defences.

5 Different process => different outcome Function (work as imagined) Success (no adverse events) Acceptable outcomes Malfunction, non-compliance, error Failure (accidents, incidents) Unacceptable outcomes

6 Increasing safety by reducing failures Analysis of adverse events is central to safety. Function (work as imagined) Success (no adverse events) Acceptable outcomes Malfunction, non-compliance, error Failure (accidents, incidents) Unacceptable outcomes

7 Safety-I when nothing goes wrong Safety-I: Safety is defined as a condition where the number of adverse outcomes (accidents / incidents / near misses) is as low as possible. Safety has traditionally been defined by its opposite, the lack of safety. The lack of safety means that something goes wrong or can go wrong. Safety-I requires the ability to prevent things from going wrong. This is achieved by: 1. Finding the causes of what goes wrong (RCA). 2. Eliminating causes and disabling possible cause-effect links. 3. Measuring results by how many fewer things go wrong.

8 Why only look at what goes wrong? Safety-I = Reduced number of adverse events. Focus is on what goes wrong. Look for failures and malfunctions. Try to eliminate causes and improve barriers. Safety and core business compete for resources. Learning only uses a fraction of the data available (:= 1 failure in events). Safety-II = Ability to succeed under varying conditions. Focus is on what goes right. Use that to understand everyday performance, to do better and to be safer. Safety and core business help each other. Learning uses most of the data available (:= non-failures in events).
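The contrast in the learning base can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name `learning_base` and the event counts (3 failures in 1,000 events) are hypothetical and are not the figures from the slide.

```python
def learning_base(total_events: int, failures: int) -> dict:
    """Share of events each safety view learns from.

    Safety-I studies only the events that went wrong; Safety-II also
    studies the events that went right. Counts are supplied by the caller.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= failures <= total_events:
        raise ValueError("failures must be between 0 and total_events")
    successes = total_events - failures
    return {
        "safety_i_fraction": failures / total_events,    # what went wrong
        "safety_ii_fraction": successes / total_events,  # what went right
    }


# Hypothetical numbers, chosen only to illustrate the asymmetry.
shares = learning_base(total_events=1000, failures=3)
print(f"Safety-I learns from {shares['safety_i_fraction']:.1%} of events")
print(f"Safety-II learns from {shares['safety_ii_fraction']:.1%} of events")
```

The rarer failures are, the smaller the share of experience a failure-only view can draw on.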

9 Various risks in practice Likelihood of being in a fatal accident on a commercial flight. 1 : 7,000, x 10-7 Core Damage Frequency for a nuclear reactor (per reactor year). Likelihood of iatrogenic harm when admitted to a hospital. 1 : 20, x : x 10-1

10 What should we be looking for? Easy to see; complicated aetiology; difficult to change; difficult to manage. Difficult to see; uncomplicated aetiology; easy to change; easy to manage. Easy to see; complicated aetiology; difficult to change; difficult to manage.

11 A formalism for system characterisation Understanding how the system functions can range from Easy (E) to Difficult (D). The regularity of system functions can range from Low ("disorderly", L) to High ("clockwork", H). Descriptions can be Simple (S) or Complicated (C), in terms of number of parts and relations. Figure axes: Understanding (E to D), Regularity (L to H), Description (S to C).

12 Tractable, homogeneous systems Standardized industrial production is highly regular, easy to understand, and relatively simple to describe. System functions are Tractable and Homogeneous. Figure axes: Understanding (E = easy, D = difficult), Regularity (L = low, H = high), Description (S = simple, C = complicated).

13 Intractable, heterogeneous systems Services for unscheduled demands, such as an emergency room, are irregular, complicated to describe and difficult to understand. System functions are Intractable and Heterogeneous. Figure axes: Understanding (E = easy, D = difficult), Regularity (L = low, H = high), Description (S = simple, C = complicated).
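The three-axis characterisation in slides 11 to 13 can be sketched as a small data structure. This is a minimal illustration, not part of the original deck: the class names, the `characterise` function, and the "mixed" label for profiles between the two corner cases are all assumptions made here.

```python
from dataclasses import dataclass
from enum import Enum


class Understanding(Enum):
    EASY = "E"
    DIFFICULT = "D"


class Regularity(Enum):
    LOW = "L"    # "disorderly"
    HIGH = "H"   # "clockwork"


class Description(Enum):
    SIMPLE = "S"
    COMPLICATED = "C"


@dataclass(frozen=True)
class SystemProfile:
    name: str
    understanding: Understanding
    regularity: Regularity
    description: Description


def characterise(profile: SystemProfile) -> str:
    """Map a profile to one of the two corner cases named in the slides.

    Anything in between is labelled "mixed", which is an assumption of
    this sketch rather than a term from the slides.
    """
    axes = (profile.understanding, profile.regularity, profile.description)
    if axes == (Understanding.EASY, Regularity.HIGH, Description.SIMPLE):
        return "tractable, homogeneous"
    if axes == (Understanding.DIFFICULT, Regularity.LOW, Description.COMPLICATED):
        return "intractable, heterogeneous"
    return "mixed"


production = SystemProfile("standardized industrial production",
                           Understanding.EASY, Regularity.HIGH, Description.SIMPLE)
emergency_room = SystemProfile("emergency room",
                               Understanding.DIFFICULT, Regularity.LOW,
                               Description.COMPLICATED)

for system in (production, emergency_room):
    print(f"{system.name}: {characterise(system)}")
```

The two example profiles correspond to the cases the slides describe: standardized production and an emergency room.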

14 Why do people adjust their work? AVOID anything that may have negative consequences for yourself, your group, or organisation MAINTAIN/CREATE conditions that may be of use in case of future problems. COMPENSATE FOR unacceptable conditions so that it becomes possible to do your work.

15 Same process => different outcomes Constraining performance variability to prevent failures will also prevent successful everyday work. Function (work as imagined) Success (no adverse events) Acceptable outcomes Everyday work (performance variability) Malfunction, non-compliance, error Failure (accidents, incidents) Unacceptable outcomes

16 Increase safety by facilitating work Understanding the variability of everyday performance is the basis for patient safety. Function (work as imagined) Success (no adverse events) Acceptable outcomes Everyday work (performance variability) Malfunction, non-compliance, error Failure (accidents, incidents) Unacceptable outcomes

17 Safety-II when everything goes right Safety-II: Safety is defined as a condition where the number of successful outcomes (meaning everyday work) is as high as possible. It is the ability to succeed under varying conditions. Safety-II is achieved by trying to make sure that things go right, rather than by preventing them from going wrong. Individuals and organisations must adjust to the current conditions in everything they do. Everyday performance must be variable in order for things to work. Figure labels: Everyday performance variability; Expected outcomes (success); Unexpected outcomes (failure).

18 Resilience and safety management A system is resilient if it can adjust its functioning prior to, during, or following changes and disturbances, and thereby sustain required operations under both expected and unexpected conditions. This requires that all levels of the organisation are able to: Respond to regular and irregular conditions in an effective, flexible manner (actual). Monitor short-term developments and threats; revise risk models (critical). Learn from past events, understand correctly what happened and why (factual). Anticipate long-term threats and opportunities (potential).

19 The ability to respond (actual) What: For which events is there a response ready? How was the list of events created? When and why is the list revised? When: What is the threshold of response? How soon can a response be given? How long can it be sustained? How: How was the type of response determined? How many resources are allocated to response readiness? How is the readiness verified or maintained?

20 Ability to respond? Resilient?

21 The ability to monitor (critical) How have the indicators been defined ("articulated" vs. "common sense")? How, and when, are they revised? How many are leading indicators and how many are lagging? How are the measurements made (qualitative, quantitative)? When are the measurements made (continuously, regularly)? What are the delays between measurement and interpretation? Are effects transient or permanent?

22 The real indicators Indicators shown: Unplanned Automatic Scrams per 7,000 Hours Critical; Industrial Safety Accident Rate; Collective Radiation Exposure; Unit Capability Factor; Unplanned Capability Loss Factor; Forced Loss Rate; Σ PSI_i Response_i. Safety is defined as that which is measured by the indicators. Availability is more important than meaningfulness. Proxy indicator: Indirect measure or sign that represents a phenomenon in the absence of a direct measure or sign. Indicators are based on an articulated description (model) of the system and of safety. Meaningfulness is more important than availability.

23 The ability to learn (factual) What is the learning based on (successes, failures)? When does learning take place (continuously or event-driven)? What is the nature of learning (qualitative, quantitative)? What is the target of learning (individuals, organisation)? How are the effects of learning verified and maintained?

24 What should we look for and learn from? Classical safety approaches: Look for what went wrong. Reconstruct failure sequence (time-line). Find the component or subsystem that failed. Look at events based on their severity. Resilience engineering approach: Look for what did not go right. Construct account of everyday, successful performance. Look for how performance variability, alone or in combination, could lead to loss of control. Look at events based on their frequency.

25 The ability to look ahead (potential) Mechanistic view: The future is a mirror image of the past (repetition, extrapolation). Probabilistic view: The future is described as a (re)combination of past events and conditions. Realistic view: The future has not been seen before. It involves a combination of known performance variability that usually is seen as irrelevant for safety.

26 Conclusion: Two approaches to safety Something that goes right cannot go wrong at the same time. Yet we cannot make something go right simply by preventing it from going wrong. We can only make something go right by understanding the nature of everyday performance, instead of taking it for granted and blissfully neglecting it. But which kind of safety? Protective safety (Safety-I). Productive safety (Safety-II).