STPA: A New Hazard Analysis Technique. Presented by Sanghyun Yoon

STPA: A New Hazard Analysis Technique Presented by Sanghyun Yoon

Introduction Hazard analysis can be described as investigating an accident before it occurs. Potential causes of accidents can be eliminated or controlled in design or operations before damage occurs. The most widely used existing hazard analysis techniques were developed fifty years ago and have serious limitations. This chapter describes a new approach to hazard analysis, bases on the STAMP causality model, called STPA (System-Theoretic Process Analysis). 2

Goals for a New Hazard Analysis Technique Old hazard analysis techniques Fault Tree Analysis Event Tree Analysis HAZOP FMEA 3

Goals for a New Hazard Analysis Technique (cont d) Including the new causal factors identified in STAMP that are not handled by the older techniques. Old techniques did not include design errors, software flaws, component interaction accidents, and so on. The goal is to identify accident scenarios that encompass the entire accident process, not just the electro-mechanical components. Old techniques have different assumptions which are different from new causal factors. To provide guidance to the users in getting good results. Fault and event tree: the tree itself is the result of analysis HAZOP: guidewords more than, less than and opposite It can be used before a design has been created. 4

The STPA Process (1) STPA can be used at any stage of the system lifecycle. STPA uses a control diagram, the functional requirements, system hazards, the safety constraints and safety requirements for components as defined in chapter 7. When STPA is applied to an existing design (chapter 8) This Information is available when analysis process begins. When STPA is used for safety-guided design, (chapter 9) Only system the system-level requirements and constraints may be available at the beginning of the process. These requirements and constraints are refined and traced to individual system components as the iterative design and analysis process proceeds. 5

The STPA Process (2) Main steps 1. Identify the potential for inadequate control of the system that could lead to a hazardous state. a. A required control action is not provided or not followed; b. An incorrect or unsafe control action is provided; c. A potentially safe control action is provided too early or too late, that is, at the wrong time or in the wrong sequence; d. A correct control action is stopped too soon; 6

The STPA Process (3) Main steps 2. Determine how each potentially hazardous control action identified in step 1 could occur. a. Augment the control structure with process models for each control component. b. For each unsafe control action, examine the parts of the control loop to see if they could cause it. Design controls and mitigation measures if they do not already exist or evaluate existing measures if the analysis is being performed on an existing design. For multiple controllers of the same component or safety constraint, identify conflicts and potential coordination problems. c. Consider how the designed controls could degrade over time and build in protection, including 1) Management of change procedures to ensure safety constraints are enforced in planned changes. 2) Performance audits where the assumptions underlying the hazard analysis are the preconditions for the operational audits and controls so that unplanned changes that violated the safety constraints can be detected. 3) Accident and incident analysis to trace anomalies to the system design. 7

The STPA Process (4) Control structure example 8

Hazard: Inadvertent launch 9

Identifying Potentially Hazardous Control Actions (Step 1) Assess the safety controls to determine the potential for leading to a hazard. Control actions can be hazardous in four ways a. A required control action is not provided or not followed. b. An incorrect or unsafe control action is provided that lead to hazard. c. A potentially safe control action is provided too early, too late, or out of sequence. d. A correct control action is stopped too soon (for a continuous or nondiscrete control action). 10

Identifying Potentially Hazardous Control Actions (Step 1) - Interlock example Safety constrains The power must always be off when the door is open; A power off command must be provided within x milliseconds after the door is opened A power on command must never be issued when the door is open The power on command must never be given until the door is fully closed 11

Identifying Potentially Hazardous Control Actions (Step 1) - FMIS example (1) 12

Identifying Potentially Hazardous Control Actions (Step 1) - FMIS example (2) 13

STPA: A New Hazard Analysis Technique DETERMINING HOW UNSAFE CONTROL ACTIONS COULD OCCUR (STEP 2) 14

Purpose of Step 2 Step 1 provides the component safety requirements(constraints). Step 2 used to identify the ways that the software safety constraint might still not be enforced by the software logic and system design. Step 2 identifies the scenarios or paths to hazard found in a classic analysis. 15

Step 2a: Augment the control structure with process models To create causal scenarios, the control structure diagram is first augmented with process models for each component. If the system exists, The content of these models should be easily determined by looking at the system functional design and its documentation. If the system does not yet exist, The analysis can start with a best guess then be refined and changed as the analysis process proceeds. 16

Step 2b: Determine how hazardous control actions could occur 1. Identify how hazardous control action could happen. 2. For each potential causes Design controls and mitigation measures if do not already exist Evaluate existing measure if the analysis is being performed on an existing design 19

Identify how hazardous control action could happen 20

Try to eliminate the hazard completely Interlock example Design: add a circuit which is broken as soon as the door opens Feedback: add a light that identifies whether the power supply is on or off Incorrect assumption More about the safety design? Chapters 16 and 17 of Safeware and Chapter 9 of this book Changes must be analyzed for new hazards or causal scenarios. 21

Step 2b: FMIS example (1) A missing case in the requirements A common occurrence in accidents where software is involved Described in next slide. 22

track Radar detected a missile -> Weapons Free -> Fire Enable 23

Step 2b: FMIS example (1) (cont d) When an operator make a track inactive? A given period passes with no radar input The total predicted impact time elapses for the track An intercept is confirmed One case was note considered by the designers If an operator deselects all of the options, no tracks will be marked as inactive -> FIRE ENABLE command would be send to the launch station immediately, even if there were no threats Solution Fix the software requirements and the software design to include the missing case. 24

Step 2b: FMIS example (2) confusion between the regular and the test software Launch Station has two operating modes Testing Live fire Causal sequence Small window of time before the launch station notifies the fire control component of the change 25

Step 2c: Consider the degradation of controls over time Build in protection against the degradation. Degradations, interlock example: warning light Operator(human who opens the door) might create workarounds to bypass the light They do not understand the purpose The light might be partially blocked from view because of workplace changes -> The assumptions and required audits should be identified during the system design process 26

Step 2c: Consider the degradation of controls over time (cont d) Management of change procedures need to be developed and the STPA analysis revisited. (chapter 10) After accidents and incidents, the design and the hazard analysis should be revisited to determine why the controls were not effective. (chapter 12) 27

Human controllers (1) Humans in the system can be treated in the same way as automated components in Step 1 of STPA. Difference: Humans need an additional process model All controllers need a model of the process they are controlling directly Human controllers also need a model of any process 28

Human controllers (2) Difference: algorithm Automated system: basically static control algorithms Humans: dynamic control algorithms that they change as a result of feedback and changes in goals Human error is best modeled and understood using feedback loops Example: Driving a car Driving algorithm is changed by car s performance, in goals and motivation, or driving experience. 30

Human controllers (3) Hazard analysis must identify the specific human behaviors that can lead to the hazard. We are not able to redesign humans. Training can be helpful, but not nearly enough Take the information obtained in the hazard analysis about worst case human behavior, Use it in the design of the other system components 31

STPA: A New Hazard Analysis Technique USING STPA ON ORGANIZATIONAL COMPONENTS OF THE SAFETY CONTROL STRUCTURE 32

Programmatic and organizational risk analysis NASA Independent Technical Authority (ITA) function risk analysis. 33

Gap Analysis Mapping the identified safety requirements and constraints to the assigned responsibilities in the safety control structure and identify any gaps. The responsibility for the system-level safety requirement: 1a. Sate-of-the art safety standards and requirements for NASA missions must be established, implemented, enforced, and maintained that protect the astronauts, the workforce, and the public 35

Gap Analysis (cont d) finding holes Omission: technical conscience No defined way to express concerns about the warrant holders. Omission: person(s) who see that engineers and managers are trained to use the results of hazard analyses in their decision-making. 37

Hazard analysis to identify organizational and programmatic risks Risks in STAMP terms can be divided into two types Basic inadequacies in the way individual components in the control structure fulfill their responsibilities Risks involved in the coordination of activities and decision-making that can lead to unintended interactions and consequences 38

Hazard analysis to identify organizational and programmatic risks basic risks (1) General types of risks: Unsafe decisions are made or approved by the Chief Engineer or warrant holders. Safe decisions are disallowed (e.g., overly conservative decision-making that undermines the goals of NASA and long-term support for ITA) Decision-making takes too long, minimizing impact and also reducing support for the ITA. Good decisions are made by the ITA, but do not have adequate impact on system design, construction, and operation. General responsibilities must be refined into more specific control actions to omit the control actions. 39

Hazard analysis to identify organizational and programmatic risks basic risks (2) The control actions the Chief Engineer has available to implement this responsibility are: To develop, monitor, and maintain technical standards and policy In coordination with programs and projects, to establish or approve the technical requirements and ensure they are enforced and implemented in the programs and projects (ensure the design is compliant with the requirements) To approve all changes to the initial technical requirements To approve all variance (waivers, deviations, exceptions to the requirements) Etc. 40

Hazard analysis to identify organizational and programmatic risks basic risks (3) To develop, monitor, and maintain technical standards and policy is identified using STPA Step 1 include: General technical and safety standards are not created. Inadequate standards and requirements are created. Standards degrade over time due to external pressure to weaken them. The process for approving changes is flawed. Standards are not changed over time as the environment changes. 41

Hazard analysis to identify organizational and programmatic risks coordination risks Coordination risks arise when multiple people or groups control the same process. The types of unsafe interactions that may result include Both controllers assume the other is performing the control responsibilities and as a result nobody does Controllers provide conflicting control actions that have unintended side effects. When similar responsibilities related to the same system requirement are identified, the potential for new coordination risks needs to be considered. 42

Use of the analysis and potential extensions The application of Step 2 of STPA to identify causes of the risks will help to provide better control measures. Used Fig. 8.6 43

Use of the analysis and potential extensions The application of Step 2 of STPA to identify causes of the risks will help to provide better control measures. Used Fig. 8.6 The risks are categorized. Categories for ITA: Immediate concern: An immediate and substantial concern that should be part of a near term-assessment. Longer-term concern: A substantial longer-term concern that should potentially be part of future assessments, as the risk will increase over time or cannot be evaluated without future knowledge of the system or environment behavior. Standard Process: An important concern that should be addressed through standard processes, such as inspections, rather than an extensive special assessment procedure. 45

STPA: A New Hazard Analysis Technique RE-ENGINEERING A SOCIO-TECHNICAL SYSTEM: PHARMACEUTICAL SAFETY AND THE VIOXX TRAGEDY 46

This section provides an example using pharmaceutical safety. Couturier has performed a STAMP-based causal analysis of the incidents associated with the introduction and withdrawal of Vioxx. Vioxx events 47