Develop Quantitative Reliability Roadmap to Meet Market's Expectations

1 Develop Quantitative Reliability Roadmap to Meet Market's Expectations
Xuemei Zhang, Alcatel-Lucent
April 27, 2007

2 Introduction
- Gaps between a product's target and current-release availability can arise:
  - in early releases of new products
  - when a product is deployed in a new scenario, such as a traditionally IP data-only product supporting VoIP or IPTV
  - when significant software features or hardware/architecture changes are made
- Reliability roadmapping is the best practice for managing closure of an availability gap
- Product management owns product roadmaps; reliability roadmaps are a key input to overall product roadmaps
- This presentation details what a reliability roadmap is, how to construct one, and how to use that roadmap to manage closure of an availability gap

3 Outline
- The Business Problem and Solution
  - New Product Reliability Risk
  - New Deployment Scenario Reliability Risk
  - New Feature Reliability Risk
  - Reliability Roadmap as a Solution
- Reliability Roadmap Elements
- Availability Improving Features
- Connecting-the-Dots Roadmapping
- End-to-End Solution Availability
- Recommendations for Product Managers

4 Business Problem: New Product Reliability Risk
- Market expects 99.999% availability for most of Lucent's products
  - Best practice for assessing the market's availability expectation is given in a companion presentation
- Significant risk in achieving 99.999% availability in initial (or early) releases of most products because:
  1. Some availability features may have been deferred from initial product release(s) in favor of higher-priority features
  2. High availability system configurations (e.g., N+K, duplex controllers) may not be supported in initial release(s) (note: high availability configurations may be required in RFxs, but not actually be purchased and hence not reflected in business cases)
  3. Software may not be sufficiently mature to have a low enough failure rate
  4. Software may not be sufficiently mature to have sufficiently effective and efficient automatic failure detection, isolation, alarming and recovery mechanisms

5 Business Problem: New Deployment Scenario Reliability Risk
- As existing products are deployed in new scenarios, they may encounter different availability expectations, thus exposing a gap; for example:
  - Network element availability expectations for VoIP and IPTV may be higher than for data-only deployments
  - Basestation availability expectations for wireless local-loop may be higher than for typical mobility deployments

6 Business Problem: New Feature Reliability Risk
- As existing products evolve, large, availability-impacting features may be added, such as:
  - Adding VoIP or other major capability
  - Expanding architecture/configuration (e.g., adding duplex controllers)
  - Changing blades or major hardware elements
- Significant changes to existing products risk:
  1. degrading software reliability (increasing failure rate), or
  2. reducing the system's ability to effectively detect and isolate failures (lowering coverage factor), or
  3. adding latency to recovery/restart times, thus adding software downtime
- Note: hardware downtime for a particular element typically changes little from release to release, so release-by-release roadmapping of hardware elements is less common

7 Business Solution: Reliability Roadmap
- The risk in purchasing a release of a system that doesn't currently meet a customer's availability expectations can be reduced by providing a credible, concrete plan for closing the availability gap in an upcoming release, a.k.a. a reliability roadmap
- Key elements of a reliability roadmap:
  1. Ultimate quantitative system availability goal(s) and definition
  2. Availability estimate of current release and system configuration
  3. A target release and system configuration to meet a specific availability level
  4. Per-release availability budgets to plausibly close the gap between current release performance and the specific availability goal in the target release
  5. By-release enumeration of features and/or factors that will support this availability growth

8 Outline
- The Business Problem and Solution
- Reliability Roadmap Elements
  1. Ultimate Availability Goal
  2. Estimate Availability of Current Release
  3. Specific Release Identified to Meet Goal
  4. Per-Release Availability Improvement Targets
  5. Per-Release Availability Improvement Features
  - Graphical Example
- Availability Improving Features
- Connecting-the-Dots Roadmapping
- End-to-End Solution Availability
- Recommendations for Product Managers

9 Roadmap Element 1: Set Ultimate Availability Goal
- Availability goals are typically set for annualized minutes of unplanned, supplier-attributable total system unavailability (meaning greater than 90% capacity lost)
  - Includes both hardware and software downtime, but may exclude planned/scheduled downtime for upgrades, updates, growth, etc.
  - Market expectation for most telecom products is 5.25 down-minutes per year (99.999% availability)
- Partial-capacity-loss events are quite common, and thus sophisticated customers may have availability expectations for pro-rated partial-capacity-loss availability
  - TL 9000 defines partial-capacity-loss to be greater than 10% capacity loss, but less than 90% capacity loss
- Planned unavailability includes system downtime for upgrades, updates, reconfiguration, growth, degrowth, and so on. Sophisticated customers may have clear planned downtime expectations
  - Some sophisticated customers (e.g., Nextel) explicitly define their % availability requirement to include planned events, as well as unplanned events
- Quantitatively define exactly what the ultimate objective is
  - Example: Availability goal for Product A is % unplanned, supplier-attributable, (partial) pro-rated availability
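
The relationship between an availability percentage and annual down-minutes quoted on this slide can be sketched as below. This is a minimal illustration, not part of the original deck; the function names and the 365.25-day year are assumptions.

```python
# Assumption: an average year of 365.25 days (525,960 minutes).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def annual_downtime_minutes(availability: float) -> float:
    """Expected down-minutes per year for a fractional availability such as 0.99999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(minutes_per_year: float) -> float:
    """Fractional availability implied by a given number of down-minutes per year."""
    return 1.0 - minutes_per_year / MINUTES_PER_YEAR


# Five-nines availability works out to roughly 5.26 min/yr,
# close to the 5.25 down-minutes quoted on the slide.
print(round(annual_downtime_minutes(0.99999), 2))
```

Stating the goal in down-minutes per year rather than as a percentage makes per-release budgets easier to add and compare.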

10 Roadmap Element 2: Estimate Availability of Current Release
- Estimating the availability of the current release of a product provides the baseline availability and helps identify the gap with the market's availability expectation
- The availability of a baseline release can be estimated from:
  - Field data, if the release is out in the field and reliable data exists
  - Lab data, via system reliability modeling
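
A field-data baseline of the kind described above can be approximated by dividing observed outage minutes by the system-years of exposure. The sketch below is illustrative only: the function name, its inputs, and the sample numbers are hypothetical, and it ignores pro-rating and outage classification.

```python
def field_downtime_per_system_year(outage_minutes, systems_in_service, observation_years):
    """Average annual downtime per system, in min/yr, from a list of outage durations."""
    system_years = systems_in_service * observation_years
    if system_years <= 0:
        raise ValueError("need a positive amount of field exposure")
    return sum(outage_minutes) / system_years


# Hypothetical data: three outages across 40 systems observed for half a year.
baseline = field_downtime_per_system_year([12.0, 45.0, 3.0], 40, 0.5)
print(baseline)  # 60 outage minutes over 20 system-years
```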

11 Roadmap Element 3: Set Specific Release to Meet Target
- As with any business objective, explicitly setting a clear scheduled completion goal is essential
- Since products are typically planned and managed on a release basis (rather than a calendar basis), recommend setting a target release

12 Roadmap Element 4: Set By-Release Improvement Targets
- Based on the availability of the baseline release and the release planned to meet the market expectation, by-release reliability improvement targets can be set to plan the reliability growth.
[Figure: Product A Reliability Roadmap. Annual downtime (min/yr) by release, RX through R(X+4), showing actual Release X downtime and a linear-growth target line.]
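
The linear-growth targets shown in the figure can be sketched as a straight-line interpolation between the baseline downtime and the goal. The figure's actual values did not survive transcription, so the numbers below are hypothetical, as is the function name.

```python
def linear_downtime_targets(baseline, goal, releases):
    """Evenly spaced downtime targets (min/yr) from the baseline release to the target release."""
    if len(releases) < 2:
        raise ValueError("need at least a baseline release and a target release")
    step = (baseline - goal) / (len(releases) - 1)
    return {release: baseline - i * step for i, release in enumerate(releases)}


# Hypothetical: 25 min/yr in Release X, 5 min/yr targeted for R(X+4).
releases = ["RX", "R(X+1)", "R(X+2)", "R(X+3)", "R(X+4)"]
for release, target in linear_downtime_targets(25.0, 5.0, releases).items():
    print(release, target)
```

A straight line is only one possible shape; a roadmap could equally front-load or back-load the improvement depending on when features can be delivered.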

13 Roadmap Element 5: Set By-Release Feature Investments
- Investing in reliability-improving features is often required to achieve high availability in a timely manner.
[Figure: Example Product A reliability roadmap. Release-by-release reliability feature sets for RX through R(X+4).]

14 Roadmap Example
[Figure: Product A Reliability Roadmap. Annual downtime (min/yr) by release, RX through R(X+4), annotated with the five roadmap elements:]
- Element 1: Set ultimate availability goal
- Element 2: Estimate current availability
- Element 3: Pick a release to achieve availability goal
- Element 4: Set rough per-release targets
- Element 5: Set per-release feature investments to achieve availability goal

15 Outline
- The Business Problem and Solution
- Reliability Roadmap Elements
- Availability Improving Features
- Connecting-the-Dots Roadmapping
- End-to-End Solution Availability
- Recommendations for Product Managers

16 Availability Improving Features
Product availability improves in 3 general ways:
- Maturation of software and support (both service provider and Lucent) reduces software failure rates, shortens outage durations for manually-recovered events, and improves reliability of manual maintenance activities
  - This growth is fairly slow, often not keeping pace with reliability degrowth from addition of new features
- Investment in reliability/availability-improving features. Broadly, these features address one or more of the following:
  1. Reduce failure rates
  2. Reduce impact of failures
  3. Improve efficiency of failure detection, isolation, alarming and recovery
  4. Shorten recovery latency
  5. Improve Design-for-Serviceability (DfS)
  6. Reduce planned downtime
  7. Policy and other items
- Technology change: products can undergo significant changes in architecture, configuration, hardware or software which can significantly affect availability
  - Often managed via the product's feature roadmap

17 Availability Improvement via Reducing Failure Rates
- System downtime is typically a linear function of hardware and software failure rates. Reducing hardware and software failure rates is an efficient way to reduce system downtime
- General feature categories:
  - Hardware
    - Use lower failure rate components
    - Enhance thermal environment: better cooling and lower temperatures mean lower hardware failure rates
  - Software
    - More/better testing
    - More mature development processes
    - More/better static and dynamic analysis (Purify, lint, clean compilations/builds, etc.)
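
The statement that downtime is linear in failure rates can be illustrated with a simple sum of (failures per year) times (minutes of outage per failure). This is a simplified sketch, not the deck's reliability model; the function name and the example rates are hypothetical.

```python
def annual_downtime(failure_modes):
    """Sum of failure rate (per year) times outage minutes per failure, in min/yr."""
    return sum(rate_per_year * minutes_per_failure
               for rate_per_year, minutes_per_failure in failure_modes)


# Hypothetical system: hardware fails 0.5 times/yr at 6 min each,
# software fails 2 times/yr at 1 min each.
before = annual_downtime([(0.5, 6.0), (2.0, 1.0)])
# Halving the software failure rate removes half of the software downtime.
after = annual_downtime([(0.5, 6.0), (1.0, 1.0)])
print(before, after)
```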

18 Availability Improvement via Reducing Impact of Failures
- Better product architectures will reduce failure group sizes so failures impact the smallest feasible system capacity
- General approaches:
  - Redundant hardware elements, so service can be rapidly restored after hardware failure
  - Clustering: put resources in a pool and run in load-sharing mode. Load from a failed unit can be redistributed to other healthy units.
  - Partitioning: intelligently distribute hardware and software failure rates to minimize pro-rated downtime. For example:
    - Moving high failure rate software modules from the top controller (where a failure would have a large failure footprint) to subordinate linecards (where a failure will have a smaller failure footprint)
    - Separating OA&M software from service-related software so failures in OA&M software don't cause service downtime
  - Application-specific architectures/mechanisms
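
The effect of a smaller failure footprint on pro-rated downtime can be sketched per outage event as duration times the fraction of capacity lost. The function name and the 16-linecard example are hypothetical, and the sketch compares single events only; yearly totals also depend on how many instances of a module run and on how often each fails.

```python
def prorated_outage_minutes(duration_minutes, capacity_lost_fraction):
    """Pro-rated outage minutes for one event: duration scaled by capacity lost."""
    if not 0.0 <= capacity_lost_fraction <= 1.0:
        raise ValueError("capacity_lost_fraction must be between 0 and 1")
    return duration_minutes * capacity_lost_fraction


# Hypothetical 10-minute failure on the top controller (all capacity lost)
# versus the same failure on one of 16 linecards.
print(prorated_outage_minutes(10.0, 1.0))
print(prorated_outage_minutes(10.0, 1.0 / 16))
```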

19 Availability Improvement via Improving Failure Detection Efficiency
- Typically, detected failures can be recovered by fast automatic failovers or restarts
- Failures that are not properly automatically detected, isolated and alarmed must be addressed manually, thereby significantly prolonging outage duration
- Slowly detected failures also increase downtime
- General approaches:
  - Timeouts, watchdogs, heartbeats, audits of data integrity, etc.
  - Leveraging platform- and OS-provided monitoring and recovery facilities
  - Fault insertion testing to validate the system's automatic failure detection, isolation, alarming, and recovery capabilities
  - Higher-layer failure detection/integrity monitoring applications
  - Application- and protocol-specific techniques, such as Event Correlation Service (RECS) or Reliability Integrity Monitoring (RIM)
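
The downtime cost of undetected failures can be sketched with a coverage factor: the fraction of failures that are automatically detected and recovered. This is a common simplification rather than the deck's own model; the function name and all numbers are hypothetical.

```python
def expected_downtime(failures_per_year, coverage, auto_recovery_min, manual_recovery_min):
    """Annual downtime (min/yr) when a fraction `coverage` of failures recovers automatically."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    per_failure = coverage * auto_recovery_min + (1.0 - coverage) * manual_recovery_min
    return failures_per_year * per_failure


# Hypothetical: 2 failures/yr, 0.5 min automatic recovery, 30 min manual recovery.
print(expected_downtime(2.0, 0.90, 0.5, 30.0))  # 90% coverage
print(expected_downtime(2.0, 0.99, 0.5, 30.0))  # 99% coverage
```

In this example the uncovered failures dominate downtime, which is why raising coverage pays off quickly.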

20 Availability Improvement via Shortening Recovery Latency
- Shortening recovery latency for both automatically detected and recovered failures and manual restarts is an efficient approach to reduce system downtime.
- Although detected failures typically contribute less system downtime than uncovered failures, faster detection and recovery mechanisms also improve availability, especially under the TL 9000 outage discounting rules. Failures that cause less than 30 seconds of service disruption are not TL 9000 reportable outages, and can thus be excluded.
- General approaches:
  - Switchover and restart times are often shortened via a combination of optimizations in high-availability middleware and application-specific mechanisms
  - Faster hardware can also reduce recovery times
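
The benefit of pushing recovery times under the reporting threshold can be sketched as a filter over outage durations. The 30-second threshold is taken from the slide, not from the standard's text; the function name and the sample durations are hypothetical.

```python
def reportable_outage_minutes(durations_seconds, threshold_seconds=30.0):
    """Total minutes from outages lasting at least the reporting threshold."""
    reportable = [d for d in durations_seconds if d >= threshold_seconds]
    return sum(reportable) / 60.0


# Hypothetical year of service disruptions, in seconds.
print(reportable_outage_minutes([5, 20, 45, 600]))
# Shortening the 45-second switchover to 25 seconds drops it from the count.
print(reportable_outage_minutes([5, 20, 25, 600]))
```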

21 Availability Improvement via Design for Serviceability
- General approach:
  - Follow DfS requirements and guidelines during architecture and design
  - LWS-Serviceability Engineering team performs DfS assessment and identifies gaps
  - Invest to close serviceability gaps

22 Availability Improvement via Reducing Planned Downtime
- Planned events, such as retrofits, upgrades and updates, often happen more frequently than unplanned downtime events; thus reducing downtime associated with planned events increases system availability
- Acceptability of planned downtime often varies by product category; some markets accept planned downtime if it occurs in a scheduled maintenance window, other markets won't
- General approach is to drive planned events to be less than 15 seconds
  - TL 9000 guidelines suggest planned service disruptions of 15 seconds or less can be excluded from availability calculations
- Various operating-system-, middleware- and application-specific mechanisms are often used to minimize planned downtime for updates, upgrades, retrofits and other planned maintenance events

23 Availability Improvement via Policy and Other Items
- Many items beyond traditional network element design can improve system availability, including:
  - Sparing strategies: maintaining an adequate supply of spare FRUs close to network elements can shorten repair times
    - Sparing entire network elements (e.g., "cold standby") is sometimes appropriate
  - Support agreements: having service providers purchase support agreements from key equipment suppliers can shorten outage resolution times
  - Training of support engineers: appropriate training of service providers' maintenance staff can shorten outage resolution times and improve reliability of both planned and on-demand maintenance actions

24 Outline
- The Business Problem and Solution
- Reliability Roadmap Elements
- Availability Improving Features
- Connecting-the-Dots Roadmapping
- End-to-End Solution Availability
- Recommendations for Product Managers

25 Reliability Roadmapping Recipe: Initial Effort
1. Product management sets availability target
2. Estimate availability of current product release
3. Based on significant features and changes planned for future releases, estimate likely availability of those future releases
4. Set target release to close (expected) availability gap
5. Identify candidate availability-improving features (detailed on next slide)
6. Estimate availability benefit of candidate features
7. Product management selects a suitable set of candidate availability-improving features to invest in and slots them into specific releases
- Goal is to select the right combination of functional, nonfunctional, and availability-improving features that cost-effectively meets the market's needs
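
Steps 5 through 7 of the recipe can be sketched as ranking candidate features by estimated downtime saved per unit of cost and selecting until the availability gap is covered. This greedy ranking is an illustrative assumption, not the deck's selection method; the feature names, savings, and costs are hypothetical, savings are treated as additive, and costs are assumed positive.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    downtime_saved: float  # estimated min/yr removed (assumed additive)
    cost: float            # relative development cost, must be > 0


def select_features(candidates, gap_minutes):
    """Pick features by best saving-per-cost until the gap is covered."""
    ranked = sorted(candidates, key=lambda f: f.downtime_saved / f.cost, reverse=True)
    chosen, remaining = [], gap_minutes
    for feature in ranked:
        if remaining <= 0:
            break
        chosen.append(feature)
        remaining -= feature.downtime_saved
    return chosen, max(remaining, 0.0)


candidates = [
    Feature("faster switchover", 4.0, 2.0),
    Feature("watchdog audit", 3.0, 1.0),
    Feature("hitless upgrade", 5.0, 10.0),
]
chosen, uncovered = select_features(candidates, gap_minutes=6.0)
print([f.name for f in chosen], uncovered)
```

In practice the selected features would still need to be slotted into specific releases alongside functional features, which this sketch does not model.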

26 Identifying Candidate Features
Candidate reliability/availability-improving features can be identified in several ways:
1. Input from System Architects, Developers, Systems Engineers, and Technical Support Engineers: they know many of the weaknesses and areas for improvement in their products
2. Analysis of field outages and lab data
3. High availability techniques assessment tool
- Targeted downtime reduction analysis, e.g., if there is too much downtime from HW or SW on a particular FRU or module, then focus on ways to minimize that downtime

27 Refreshing Reliability Roadmap
Recommend refreshing the reliability roadmap for every major release:
1. Gather and analyze latest field data to estimate latest field availability performance and estimate availability parameters
2. Use lab data to estimate availability of the most recent product release (since field data probably isn't available)
3. Re-estimate availability of future releases based on latest observed field and lab data, and latest feature plans
4. If a significant gap appears between updated estimated future availability and baselined by-release availability targets, then:
  - Increase investment in availability-improving features or decrease investment in reliability-degrading features, and/or
  - Revise baselined by-release availability targets (e.g., postpone the release targeted to meet the availability goal, adopt a more aggressive availability-improvement plan for future releases), and/or
  - Reset availability expectations with customers
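
Step 4 of the refresh, comparing re-estimated availability with the baselined targets, can be sketched as a per-release gap check. The function name, the tolerance parameter, and all numbers are hypothetical.

```python
def releases_off_target(targets, estimates, tolerance=0.0):
    """Releases whose estimated downtime exceeds target by more than `tolerance` (min/yr)."""
    gaps = {}
    for release, target in targets.items():
        if release in estimates:
            excess = estimates[release] - target
            if excess > tolerance:
                gaps[release] = excess
    return gaps


# Hypothetical baselined targets and refreshed estimates, in min/yr.
targets = {"R(X+1)": 20.0, "R(X+2)": 15.0, "R(X+3)": 10.0}
estimates = {"R(X+1)": 19.0, "R(X+2)": 18.0, "R(X+3)": 14.0}
print(releases_off_target(targets, estimates))
```

Each flagged release then calls for one of the responses listed above: more investment, revised targets, or reset customer expectations.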

28 Outline
- The Business Problem and Solution
- Reliability Roadmap Elements
- Availability Improving Features
- Connecting-the-Dots Roadmapping
- End-to-End Solution Availability
  - Roadmapping Solutions
  - Elements of a Solution Availability Roadmap
  - Availability-improving Features for Solutions
- Recommendations for Product Managers

29 Roadmapping End-to-End Solutions
- New solutions or expanded deployment of existing solutions may have availability gaps compared to incumbent or alternative solutions, such as:
  - IPTV over DSL v. cable TV
  - Wireless local loop v. DSL or cable
  - 3G v. 2G; eventually 4G v. 3G
- For solution-level roadmapping, it is essential to:
  - Select one (or more) specific solution configurations to model and analyze
  - Select and precisely define the correct solution-level availability metrics
- While % service availability may be fairly easy to understand and define for a single network element, it is much harder to precisely characterize availability for IMS, 3G or IPTV solutions

30 Solution Availability Roadmap Elements
Key elements of a solution availability roadmap:
1. Precisely define solution availability metrics
2. Define solution deployments/applications to be analyzed
3. Specify ultimate quantitative system availability goal(s)
4. Construct appropriate mathematical availability model of solution(s)
5. Insert availability estimates (or actuals) for all elements in solution, and compute resulting solution-level availability
6. Create downtime budget for all network elements that achieves desired availability and is consistent with business considerations
7. Commit availability-improving feature plans to close gaps between network element availabilities and targets required to achieve solution availability
  - Note: these features are likely to affect solution architecture/configuration, Lucent-developed products, and partner/OEM/ODM products
8. Construct a by-solution-release view of when availability-improving features will be phased into solution
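
Steps 4 and 5 can be sketched with the simplest possible model: network elements in series with independent failures, so that solution availability is the product of the element availabilities. Real solutions usually need a richer model; the function name and the element values below are hypothetical.

```python
import math


def series_availability(element_availabilities):
    """Availability of a chain in which every element must be up (independent failures assumed)."""
    return math.prod(element_availabilities)


# Hypothetical three-element solution.
elements = [0.9999, 0.99995, 0.99999]
solution = series_availability(elements)
print(solution)
# The solution is less available than its weakest element.
print(solution < min(elements))
```

Because unavailabilities roughly add in a series chain, a solution-level downtime budget has to be split across the elements, which is what step 6 asks for.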

31 Solution Availability-Improving Features
Solution-level availability is improved via the following general techniques:
1. Reconfigure elements in the solution (e.g., add redundant elements or interconnects, make network elements geographically redundant)
2. Increase robustness of end-to-end applications software (e.g., protocol enhancements, reliable/dependable transactions/services)
3. Improve availability of individual network elements
4. More/better network-level testing
5. Replace network element with alternative product (perhaps from alternate supplier)
6. Adopt alternate solution architecture/configuration/protocol (e.g., support distributed elements/protocols rather than standalone elements)
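
The gain from adding redundant elements (technique 1) can be sketched with the standard parallel-availability formula. It assumes independent failures and perfect, instantaneous failover, which real deployments only approximate; the function name and values are hypothetical.

```python
def redundant_availability(element_availability, copies):
    """Availability of `copies` identical elements where any one suffices.

    Assumes independent failures and perfect failover.
    """
    if copies < 1:
        raise ValueError("need at least one element")
    return 1.0 - (1.0 - element_availability) ** copies


# Hypothetical element at three nines, alone and as a redundant pair.
print(redundant_availability(0.999, 1))
print(redundant_availability(0.999, 2))
```

Imperfect failure detection and failover latency reduce this gain in practice, which ties redundancy back to the coverage and recovery-latency features discussed earlier.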

32 New Challenges: Availability for Solutions that Provide Blended Services
- Different perspectives of solution availability:
  - End user's view
  - Service provider's view
- Solution availability metrics:
  - Downtime oriented: downtime min/yr, 5 9's availability, etc.
  - Defects oriented: ineffective attempts, cut-off calls, etc.
  - Service oriented: service reliability, etc.
  - Risk oriented: security-related availability, etc.
- Analysis complexities:
  - Access solutions vs. core
  - Control plane vs. traffic plane
  - Call processing vs. management visibility
  - Application variations

33 Outline
- The Business Problem and Solution
- Reliability Roadmap Elements
- Availability Improving Features
- Connecting-the-Dots Roadmapping
- End-to-End Solution Availability
- Recommendations for Product Managers

34 Recommendations for Product Managers
PdM should own and drive the reliability roadmap process:
1. Product management should assign an owner for the reliability roadmap (product manager, delegated to SAE, or managed jointly)
2. Product management should set a quantitative availability goal and a release target to achieve that goal
  - E.g., MR/ECO SRD
3. Owner for the reliability roadmap (with support from a cross-functional team of Architecture, Development, Reliability Team, Systems Engineering or other) should analyze and propose availability-improving features
4. Product management selects the right mix of availability-improving features per release and gets those features committed
5. Revisit the reliability roadmap for every major release, and make revisions as appropriate

35 Thank You