An Overview of Software Reliability

Software Design For Reliability (DfR) Seminar
An Overview of Software Reliability
Bob Mueller, bobm@opsalacarte.com, www.opsalacarte.com

Software Quality and Software Reliability: Related Disciplines, Yet Very Different

Definition of Software Quality (ISO 9126 Quality Model)

Software quality: the level to which the software characteristics conform to all the specifications.

Factors and their criteria:
- Functionality: suitability, accuracy, interoperability, security
- Usability: understandability, learnability, operability, attractiveness
- Reliability: maturity, fault tolerance, recoverability
- Efficiency: time behavior, resource utilization
- Maintainability: analysability, changeability, stability, testability
- Portability: adaptability, installability, co-existence, replaceability

(v0.5) Ops A La Carte

Most Common Misconception (shown against the ISO 9126 quality model)

What organizations believe they are doing: "We have a strong SW quality program. We don't need to add SW reliability practices."

What the organizations are really doing: implementing only a sparse set of SW quality practices.

What is missing: implementing sufficient SW reliability practices to satisfy customer expectations.

Background on Software Reliability

Software Reliability Can Be Measured

Software reliability is 20 years behind HW reliability. Why?
- Ramifications of failure.
- Education on the consumer side: many consumers simply expect unreliable SW.
- Education on the manufacturer's side: manufacturers don't know the new, innovative methods, and don't work out how users will actually use the product.
- Software engineers are more free-spirited than HW engineers.
- The entry cost for a SW development team is lower than for HW.

Reliability vs. Cost

[Chart: cost vs. reliability, with a rising reliability-program cost curve, a falling HW warranty cost curve, and a total cost curve whose minimum marks the optimum cost point.]

The SW impact on HW warranty costs is minimal at best.

Reliability vs. Cost, continued

SW has no associated manufacturing costs, so warranty costs and savings are almost entirely allocated to HW. If there are no cost savings associated with improving software reliability, why not leave it as is and focus on improving HW reliability to save money?

- One study found that the root causes of typical embedded system failures were SW, not HW, by a ratio of 10:1.
- Customers buy systems, not just HW.

The benefits of a SW Reliability Program are not direct cost savings, but rather:
- Increased SW/FW staff availability and relief for operational schedules, resulting from less corrective maintenance work.
- Increased customer goodwill based on improved customer satisfaction.

Defining Software Reliability

Software Reliability Definitions

The customer's perception of the software's ability to deliver the expected functionality in the target environment without failing.

Practical rewording of the definition: software reliability is a measure of the software failures that are visible to a customer and prevent a system from delivering essential functionality.

Software Reliability Can Be Measured

Measurements are a required foundation. This differs from quality, which is not defined by measurements.
- All measurements and metrics are based on run-time failures.
- Only customer-visible failures are targeted: only defects that produce customer-visible failures affect reliability.

Corollaries:
- Defects that do not trigger run-time failures do NOT affect reliability (e.g., badly formatted or commented code, defects in dead code).
- Not all defects that are triggered at run-time produce customer-visible failures (e.g., corruption of an unused region of memory).

SW reliability evolved from HW reliability, but SW reliability focuses only on design reliability; HW reliability has no counterpart to this.
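A small sketch of how these run-time measurements are typically expressed. The function names and figures below are illustrative, not from the slides; the point is that only customer-visible, run-time failures enter the calculation.

```python
# Illustrative sketch (not from the slides): reliability metrics are
# computed only from customer-visible, run-time failures -- defects in
# dead code or badly commented code are excluded by definition.

def failure_rate(visible_failures: int, operating_hours: float) -> float:
    """Failures per operating hour, counting only customer-visible failures."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return visible_failures / operating_hours

def mtbf(visible_failures: int, operating_hours: float) -> float:
    """Mean time between (visible) failures, in hours."""
    if visible_failures == 0:
        return float("inf")
    return operating_hours / visible_failures

# e.g., 4 customer-visible failures over 2,000 hours of operation:
print(failure_rate(4, 2000.0))  # 0.002 failures/hour
print(mtbf(4, 2000.0))          # 500.0 hours
```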

Software Reliability Is Based On Usage

SW failure characteristics are derived from the usage profile of a particular customer or set of customers. Each usage profile triggers a different set of run-time SW faults and failures.

Example: examine product usage by two different customers.
- Customer A's usage profile only exercises the sections of SW that produce very few failures.
- Customer B's usage profile overlaps with Customer A's, but additionally exercises other sections of SW that produce many, frequent failures.

Customer assessment of the product's software reliability:
- Customer A's assessment: the SW reliability is high.
- Customer B's assessment: the SW reliability is low.
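The two-customer example can be sketched numerically. The section names and per-section failure intensities below are invented for illustration; the point is that the same software yields very different perceived reliability under different usage profiles.

```python
# Illustrative sketch of the two-customer example: section names and
# failure intensities are assumptions, not data from the slides.

# Failures per 1,000 invocations of each section of the software.
FAILURE_INTENSITY = {"core": 0.1, "reporting": 0.2, "import": 8.0}

def expected_failures(usage_profile: dict, invocations: int) -> float:
    """Expected failure count for a usage profile (fraction of calls per section)."""
    per_call = sum(fraction * FAILURE_INTENSITY[section] / 1000.0
                   for section, fraction in usage_profile.items())
    return per_call * invocations

customer_a = {"core": 0.8, "reporting": 0.2}                 # never touches "import"
customer_b = {"core": 0.6, "reporting": 0.2, "import": 0.2}  # exercises the weak section

# Same software, very different perceived reliability:
print(expected_failures(customer_a, 10_000))  # ~1.2 failures
print(expected_failures(customer_b, 10_000))  # ~17 failures
```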

Reliability vs. Correctness

Correctness is a measure of the degree of intended functionality implemented by the SW; it measures the completeness of requirements and the accuracy of defining a SW model based on those requirements.

Reliability is a measure of the behavior (i.e., failures) that prevents the software from delivering the implemented functionality.

Defects, Faults, and Failures

Terminology: Defect

A defect is a flaw in the requirements, design, or source code that produces implementation logic that will trigger a fault.

Defects of omission:
- Not all requirements were used in creating a design model.
- The design satisfies all requirements but is incomplete.
- The source code did not implement all of the design.
- The source code has missing or incomplete logic.

Defects of commission:
- Incorrect requirements are specified.
- Requirements are incorrectly translated into a design model.
- The design is incorrectly translated into source code.
- The source code logic is flawed.

Defects are static and can be detected and removed without executing the source code. Defects that cannot trigger a SW failure are not tracked or measured (e.g., quality defects such as test case and soft maintenance defects, and defects in dead code).

Terminology (continued): Fault and Failure

A fault is the result of triggering a SW defect by executing the associated implementation logic.
- Faults are NOT always visible to the customer.
- A fault can be the transitional state that results in a failure.
- Trivially simple defects (e.g., display spelling errors) do not have intermediate fault states.

A failure is a customer (or operational system) observation or detection that is perceived as an unacceptable departure of operation from the designed SW behavior.
- Failures MUST be observable by the customer or an operational system.
- Failures are the visible, run-time symptoms of faults.
- Not all failures result in system outages.

Defect -> Fault -> Failure
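The defect -> fault -> failure chain can be made concrete with a tiny, hypothetical example: the defect below is static and harmless until the flawed logic is executed with a boundary input, at which point the fault surfaces as a customer-visible failure.

```python
# Hypothetical illustration of defect -> fault -> failure:
# the DEFECT ('<' instead of '<=') sits latently in the code.

def is_valid_percentage(value: int) -> bool:
    # DEFECT: should be 'value <= 100'; wrong only for the boundary value 100.
    return 0 <= value < 100

# The defect is not triggered by these inputs -- no fault, no failure:
assert is_valid_percentage(50)
assert not is_valid_percentage(101)

# FAULT triggered by executing the flawed logic with input 100, and the
# wrong result is visible to the caller -- a FAILURE:
print(is_valid_percentage(100))  # False, even though 100 is a valid percentage
```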

Basic Failure Classification

High-level SW failure classification, based on the complexity and time-sensitivity of triggering the associated defect: Bohr bugs, Heisen bugs, and aging bugs.

Bohr Bugs
- Named after the Bohr atom.
- Connotation: deterministic failures that are straightforward to isolate.
- Failures are easily reproducible, even after a system restart/reboot.
- Most frequent failure category detected during development, testing, and early deployment.
- These are considered trivial defects, since every execution of the associated logic results in a failure.

Basic Failure Classification (continued)

Heisen Bugs
- Named after the Heisenberg uncertainty principle.
- Connotation: failures that are difficult to isolate to a root cause.
- Intermittent failures that are rarely triggered and difficult to reproduce; unlikely to recur following a system restart/reboot.
- Common root causes: synchronization boundaries between SW components; improper or insufficient exception handling; interdependent timing of multiple events.
- Rarely detected when the SW is not mature (i.e., during early development and testing phases).
- The best methods for dealing with these tough defects are identification using SW failure analysis and impact mitigation using fault-tolerant code.
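A classic Heisen-bug root cause from the list above is a synchronization boundary. This hypothetical sketch shows an unsynchronized read-modify-write on a shared counter: whether updates are lost depends on thread interleaving, which is exactly what makes the failure intermittent and hard to reproduce, while a lock removes the race.

```python
# Hypothetical Heisen-bug sketch: a racy read-modify-write whose failure
# depends on thread interleaving, versus a lock-protected version.
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_racy(self):
        v = self.value     # read
        v += 1             # modify (another thread can interleave here)
        self.value = v     # write -> lost updates under contention

    def increment_safe(self):
        with self._lock:   # serialize the critical section
            self.value += 1

def hammer(increment, threads=8, per_thread=10_000):
    """Call increment() concurrently; return the number of calls made."""
    def worker():
        for _ in range(per_thread):
            increment()
    ts = [threading.Thread(target=worker) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return threads * per_thread

safe = Counter()
expected = hammer(safe.increment_safe)
print(safe.value == expected)  # True: the lock prevents lost updates

racy = Counter()
hammer(racy.increment_racy)
# racy.value may or may not equal `expected` on any given run; that
# interleaving-dependence is what makes Heisen bugs hard to reproduce.
```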

Basic Failure Classification (continued)

Aging Bugs
- Attributed to the results of continuous, long-term operation or use.
- Connotation: failures resulting from an accumulation of erroneous conditions.
- Transient failures occur after extended run-time or functional cycles, where the contributing faults have occurred numerous times.
- Preceding faults may lead to system performance degradation before a failure occurs.
- Extremely unlikely to recur following a system restart/reboot, due to the longevity requirement.
- Common root causes: deterioration in the availability of OS resources (e.g., depletion of device handles, memory leaks, heap fragmentation); data corruption; application race conditions; accumulation of numerical round-off errors; gradual data accumulation from sampling or queue build-up.
- The best methods for dealing with these tough defects are identification using SW failure analysis and impact mitigation using fault-tolerant code.
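A minimal, hypothetical sketch of an aging bug in the resource-depletion family: each request leaks one handle, so the failure appears only after long, continuous operation, and a restart (a fresh pool) makes it vanish, just as the slide describes.

```python
# Hypothetical aging-bug sketch: a handle leak that only causes a
# customer-visible failure after many operational cycles.

class HandlePool:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.capacity:
            raise RuntimeError("resource exhaustion: no handles left")
        self.in_use += 1

    def release(self):
        self.in_use -= 1

def handle_request(pool: HandlePool, leak: bool):
    pool.acquire()
    if not leak:
        pool.release()  # the DEFECT of omission is a missing release

def requests_until_failure(capacity: int) -> int:
    """Count successful requests before the leak exhausts the pool."""
    pool = HandlePool(capacity)  # a 'restart' would recreate this pool
    count = 0
    while True:
        try:
            handle_request(pool, leak=True)
        except RuntimeError:
            return count
        count += 1

print(requests_until_failure(1024))  # fails only after 1024 requests
```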

What Is Reliable Software??

Reliable Software Characteristics

- Operates within the reliability specification that satisfies customer expectations, measured in terms of failure rate and availability level. The goal is rarely defect-free or ultra-high reliability.
- Gracefully handles erroneous inputs from users, other systems, and transient hardware faults; attempts to prevent state or output data corruption from erroneous inputs.
- Quickly detects, reports, and recovers from SW and transient HW faults; the SW makes the system behave as if continuously monitoring, self-diagnosing, and self-healing.
- Prevents as many run-time faults as possible from becoming system-level failures.
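A small sketch of the "gracefully handles erroneous inputs" characteristic: validate at the boundary, reject bad data with a clear error rather than letting it corrupt downstream state, and keep the system running. The sensor scenario, names, and plausible range are assumptions made for the illustration.

```python
# Hypothetical sketch of graceful input handling: the sensor scenario,
# function names, and plausible range are invented for illustration.

def parse_temperature(raw) -> float:
    """Parse a sensor reading, rejecting garbage and physically
    implausible values instead of propagating them downstream."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"unparseable sensor reading: {raw!r}")
    if not (-273.15 <= value <= 1000.0):  # assumed plausible range
        raise ValueError(f"reading out of plausible range: {value}")
    return value

def safe_average(readings):
    """Fault tolerance: skip bad readings and keep operating,
    rather than letting one erroneous input crash the system."""
    good = []
    for raw in readings:
        try:
            good.append(parse_temperature(raw))
        except ValueError:
            pass  # detect (and, in real code, report) -- then recover
    return sum(good) / len(good) if good else None

print(safe_average(["20.0", "garbage", "21.0", "9999"]))  # 20.5
```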

Common Paths to Software Reliability

Traditional SW Reliability Programs (Predictions)
- The program is directed by a separate team of reliability engineers; the development process is viewed as a SW-generating black box.
- Prediction models are developed to estimate the number of faults in the SW.
- Reliability techniques are used to identify defects and produce SW reliability metrics: traditional HW failure analysis techniques (e.g., FMEAs or FTAs), defect estimation and tracking.

SW Process Control
- Based on the assumption of a correlation between development process maturity and latent defect density in the final SW. Ex: CMM Level 3 organizations can develop SW with 3.5 defects/KSLOC.
- If the current process level does not yield the desired SW reliability, audits and stricter process controls are implemented.

Quality Through SW Testing
- The most prevalent approach for implementing SW reliability.
- Assumes reliability is increased by expanding the types of system tests (e.g., integration, performance, and loading) and increasing the duration of testing.
- Measured by counting and classifying defects.

Common Paths to Software Reliability (continued)

These approaches generally do not provide a complete solution:
- Reliability prediction models are not well understood.
- SW engineers find it difficult to apply HW failure analysis techniques to detailed SW designs.
- Only 20% of the SW defects identified by quality processes during development (e.g., code inspections) affect reliability.
- System testing is an inefficient mechanism for finding run-time failures; it generally identifies no more than 50% of them.
- Quality processes for tracking defects do not produce SW reliability information such as defect density and failure rates.

Net effect: SW engineers still end up spending more than 50% of their time debugging, instead of focusing on designing or implementing source code.

Design for Reliability (DfR)

Software Defect Distributions

Average distribution of SW defects by lifecycle phase:
- Requirements: 20%
- Design: 30%
- Coding: 35%
- Bad defect fixes (introduction of secondary defects): 10%
- Customer documentation: 5%

Average distribution of SW defects at the time of field deployment (based on 1st-year field defect report data):
- Severity 1 (catastrophic): 1%
- Severity 2 (major): 20%
- Severity 3 (minor): 35%
- Severity 4 (annoyance): 44%

Typical Defect Tracking (System Test)

    Build         Sev 1   Sev 2   Sev 3   Sev 4   Total
    SysBuild-01       7       9      16      22      54
    SysBuild-02       5       5      14      26      50
    SysBuild-03       4       6       8      16      34
    ...
    SysBuild-07       0       1       4       6      11

Defect Origin and Discovery

Typical behavior: defects originate across requirements, design, coding, testing, and maintenance, but discovery is skewed toward the testing and maintenance phases - a late "Surprise!".

Goal of best practices: discover defects in the same phase in which they originate (requirements defects during requirements, design defects during design, and so on).

Defect Removal Efficiencies

Defect removal efficiency is a key reliability measure:

    Removal efficiency = Defects found / Defects present

Defects present is the critical parameter; it is estimated from inspections, testing, and field data across the stages: requirements, design, coding, unit testing, system & subsystem testing, and field deployment.

Example:

    Origin                        Defects Found
    Inspections                        90
    Unit Testing                       25
    System & Subsystem Testing         55
    Field Deployment                   40
    TOTAL                             210

    Metric                  Removal Efficiency
    Inspection Efficiency   43%  (= 90 / 210)
    Testing Efficiency      38%  (= 80 / 210)
    Overall Efficiency      81%  (= 170 / 210)
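The worked example above reduces to a short calculation; this sketch reproduces the slide's numbers, rounding to whole percent as the slide does.

```python
# Reproducing the slide's defect-removal-efficiency example.

def removal_efficiency(found: int, present: int) -> int:
    """Removal efficiency as a whole percentage."""
    return round(100 * found / present)

defects_found = {
    "Inspections": 90,
    "Unit Testing": 25,
    "System & Subsystem Testing": 55,
    "Field Deployment": 40,
}
present = sum(defects_found.values())  # 210 defects present in total

inspection = removal_efficiency(defects_found["Inspections"], present)
testing = removal_efficiency(
    defects_found["Unit Testing"]
    + defects_found["System & Subsystem Testing"], present)
overall = removal_efficiency(present - defects_found["Field Deployment"], present)

print(inspection, testing, overall)  # 43 38 81
```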

Reliability Defect Tracking (All Phases)

Critical defects/failures found, broken down by the phase in which they originated:

    Activity             Total     Total Crit.  Defect   Reqmts  Design  Code  Unit Test  Sys Test
                         Failures  Failures     Density
    Reqmts                  75        12         16%        12
    Design                 123        45         37%         4      41
    Code                   158        72         46%         4       6     62
    Unit Test               78        25         35%         1       4     18       2
    Development Totals     434       154                    21      51     80       2
    DRE (development)                                      57%     80%    78%    100%
    System Test            189        53         68%         1      11     31       6        2
    DRE (after
      system testing)                                      55%     66%    56%     25%     100%

Defect Removal Technique Impact

Each column is one combination of techniques (Y = used, n = not used):

    Design Inspections / Reviews     n    n    n    n    Y    n    Y    Y
    Code Inspections / Reviews       n    n    n    Y    n    n    Y    Y
    Formal SQA Processes             n    Y    n    n    n    Y    n    Y
    Formal Testing                   n    n    Y    n    n    Y    n    Y
    Median Defect Efficiency        40%  45%  53%  57%  60%  65%  85%  99%

Typical Defect Reduction Goals

[Chart: defects found per system-test build, SysBld #1 through SysBld #5, on a 0-200 scale.]

Design for Reliability

[Chart: failures found across all phases - Requirements, Design, Code, Unit Test, SysBld #1 through #5, and Field - spanning development, system test, and deployment, on a 0-200 scale. The goal is to predict defect totals for the next phase.]

Software Reliability Practices

Goals of Reliability Practices

Reliability practices split the development lifecycle into two opposing phases:

Pre-deployment (focus on fault intolerance):
- Fault avoidance techniques: prevent defects from being introduced.
- Fault removal techniques: detect and repair faults.
- Goal: increase reliability by eliminating critical defects, thereby reducing the failure rate.

Post-deployment (focus on fault tolerance):
- Fault tolerance techniques: allow a system to operate predictably in the presence of faults.
- System restoration techniques: quickly restore the operational state of a system in the simplest manner possible.
- Goal: increase availability by reducing or avoiding the effects of faults.

Software Reliability Practices

Analysis: Formal Scenario/Checklist Analysis, FRACAS, FMECA, FTA, Petri Nets, Change Impact Analysis, Common Cause Failure Analysis, Sneak Analysis

Design: Formal Interface Specification, Defensive Programming, Fault Tolerance, Modular Design, Error Detection and Correction, Critical Functionality Isolation, Design by Contract, Reliability Allocation, Design Diversity

Verification: Boundary Value Analysis, Equivalence Class Partitioning, Reliability Growth Testing, Fault Injection Testing, Static/Dynamic Code Analysis, Coverage Testing, Usage Profile Testing, Cleanroom
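Two of the verification practices listed above can be sketched together: equivalence-class partitioning picks one representative input per class, and boundary value analysis adds the edges of each class, where defects cluster. The discount function and its spec are invented for the illustration.

```python
# Hypothetical sketch of equivalence-class partitioning and boundary
# value analysis; the discount spec below is an assumed example.

def ship_discount(quantity: int) -> float:
    """Assumed spec: 0% below 10 units, 10% for 10-99 units, 20% for 100+."""
    if quantity < 0:
        raise ValueError("quantity must be non-negative")
    if quantity < 10:
        return 0.0
    if quantity < 100:
        return 0.10
    return 0.20

# Equivalence classes: invalid (<0), no discount (0-9),
# standard (10-99), bulk (100+) -- one representative each.
assert ship_discount(5) == 0.0
assert ship_discount(50) == 0.10
assert ship_discount(500) == 0.20

# Boundary values: the edges of each class.
for qty, expected in [(0, 0.0), (9, 0.0), (10, 0.10),
                      (99, 0.10), (100, 0.20)]:
    assert ship_discount(qty) == expected
print("all partitions and boundaries pass")
```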

Design and Code Inspections

The original rationale for inspections (current payback): inspections require less time and fewer resources to detect and repair defects than traditional testing and debugging. Work done at Nortel Technologies in 1991 demonstrated that 65% to 90% of operational defects were detected by inspections, at 1/4 to 2/3 the cost of testing.

Soft maintenance rationale (future payback): data collected on 130 inspection sessions classified the long-term, software-maintenance benefits of inspections as follows:
- True defects: the code behavior was wrong, and an execution-affecting change was made to resolve it.
- False positives: any issue not requiring a code or document change.
- Soft maintenance changes: any other issue that resulted in a code or document change, e.g., code restructuring or the addition of code comments.

Spectrum of Inspection Methodologies

    Method / Originator     Team Size    Sessions  Detection Method            Collection Meeting               Post-Process Feedback
    Fagan                   Large        1         Ad hoc                      Yes, group oriented              None
    Bisant                  Small        1         Ad hoc                      Yes, group oriented              None
    Gilb                    Large        1         Checklist                   Yes, group oriented              Root Cause Analysis
    Meetingless Inspection  Large        1         Unspecified                 No, individual oriented          None
    ADR                     Small        >1        Scenario                    Yes, group oriented              None
    Britcher                Unspecified  4         Parallel scenario           Yes, group oriented              None
    Phased Inspection       Small        >1        Sequential checklist (comp) No (mtg only to reconcile data)  None
    N-fold                  Small        >1        Parallel ad hoc             Yes, group oriented              None
    Code Reading            Small        1         Ad hoc                      No, meeting optional             None

WOW! No wonder inspections are not well understood: there are too many methodologies. And there are even more options.

Spectrum of Technical Review Methodologies

Inspections are just one of the many classes of technical review methodologies, which span a spectrum:

Informal end: individual initiative, small time commitment, general feedback, defect detection.
Formal end: team-oriented, multiple meetings and pre-meeting preparation, compliance with standards, satisfies specifications.

The spectrum: Ad hoc Review, Pair Programming, Team Review, Peer Desk Check (Passaround Check), Walkthrough, Inspection.

Why Isn't Software Reliability Prevalent?

"Those are very good ideas. We would like to implement them, and we know we should try. However, there just isn't enough time."

These erroneous arguments all assume testing is the most effective defect detection methodology:
- Results from inspections/reviews are generally poor.
- Engineers believe that testers will do a more thorough and efficient job than any effort they implement themselves (inspections and unit testing).
- Managers believe progress can be demonstrated faster and better once the SW is in the system test phase.

Remember, just like the story of the lumberjack and his axe: if you don't have time to do it correctly the first time, then you must have time to do it over later!

Software DfR Tools by Phase

- Concept. Activities: define SW reliability requirements. Tools: Benchmarking, Internal Goal Setting, Gap Analysis.
- Design (architecture & high-level design; low-level design). Activities: modeling & predictions; identify core, critical, and vulnerable sections of the design; static detection of design defects. Tools: SW Failure Analysis, SW Fault Tolerance, Human Factors Analysis, Derating Analysis, Worst Case Analysis.
- Coding. Activities: static detection of coding defects. Tools: FRACAS, RCA.
- Unit Testing. Activities: dynamic detection of design and coding defects. Tools: FRACAS, RCA.
- Integration and System Testing. Activities: SW statistical testing, SW reliability testing. Tools: FRACAS, RCA.
- Operations and Maintenance. Activities: continuous assessment of product reliability. Tools: FRACAS, RCA.

Questions?