Aspira Dependability Prediction with UltraSAN


Aspira Systems Engineering
Bryce Kuhlman, Steve Beaudet

Vision Statement
A standardized method of system dependability modeling that streamlines communication between designers and dependability analysts and provides a framework for the development of 99.999% available systems.

Goals
- Modeling is used as a tool to develop a detailed understanding of system dependability performance.
- Analysis throughout the entire product life cycle.
- Streamlined communication between product engineers, third-party vendors, and dependability analysts.
- One model, many measures (with distributions).
- Timely modeling and analysis
  - Low cycle time for new models and analyses (usually less than one week, depending on product complexity and familiarity).
  - Models evolve with the design; more detail is added as it becomes available.
- Trade studies defined by the availability team and system designers drive design for high availability.

Modeling Process
- Two levels of modeling: description and computation.
- Dependability Description Model (DDM)
  - Explains how the system works in an availability sense.
  - Standardized framework, based on existing methodologies in dependability analysis and system design, that can be easily understood by all.
  - Provides detailed system descriptions.
  - Designers become an integral part of system dependability evaluation.
  - The focus is on description, not evaluation, which circumvents the need for expertise in particular modeling techniques (Markov processes, SPN, SAN, etc.).
  - To achieve 99.999% availability, all possible sources of service interruption must be addressed.
- Dependability Computation Model (DCM)
  - Calculation of the measures defined in the DDM.
  - UltraSAN

Dependability Description Model
- Measures
  - Availability: probability of a user being able to set up a new connection
  - Reliability: measured by the probability of an existing connection being dropped
  - Maintenance: number of maintenance events necessary
  - Bellcore / TL-9000: outage, DPM, OFM, etc.
- Model Assumptions
- System Description
- Dependability Block Model
  - Identification of serial blocks: common failure impact, detection, response, repair
  - Dependency graphs
- Block Dependability Information
  - General Information
  - Failure Information
  - Detection Information
  - Recovery Information
  - Notification Information
  - Repair Information
  - Upgrade Information

Dependability Description Model
For each serial block, for each type of information:
- Description of design aspect
- Impact on other components
- Applicable parameters
  - Time distribution and parameters
  - Probability
  - Basis for parameter estimate
- Effect(s) of failed activity
  - Reference the next escalation level of detection or the effect of failed detection, and the next level of response
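The per-block information listed above could be captured in a small record structure. The following is a minimal sketch only, not the authors' DDM format: the class names, field names, and every example value (`port_adapter`, the 0.9 probability, the 1-second mean) are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Information types listed under "Block Dependability Information".
INFO_TYPES = ["general", "failure", "detection", "recovery",
              "notification", "repair", "upgrade"]


@dataclass
class InfoEntry:
    """One type of information for one serial block (field names are assumed)."""
    description: str                          # description of design aspect
    impact: str                               # impact on other components
    time_distribution: Optional[str] = None   # e.g. "exponential", "weibull"
    time_parameters: Dict[str, float] = field(default_factory=dict)
    probability: Optional[float] = None       # probability the activity succeeds
    basis: str = ""                           # basis for the parameter estimate
    effect_if_failed: str = ""                # effect(s) of failed activity
    next_level: Optional[str] = None          # next escalation level, if any


@dataclass
class SerialBlock:
    """A serial block with its dependencies and per-type information."""
    name: str
    depends_on: List[str] = field(default_factory=list)
    info: Dict[str, InfoEntry] = field(default_factory=dict)

    def missing_info(self) -> List[str]:
        """Return the information types not yet described for this block."""
        return [t for t in INFO_TYPES if t not in self.info]


# Hypothetical example values, for illustration only.
port_adapter = SerialBlock(
    name="port_adapter",
    depends_on=["processor"],
    info={
        "detection": InfoEntry(
            description="Hardware watchdog detects adapter failure",
            impact="All connections on the adapter are dropped",
            time_distribution="exponential",
            time_parameters={"mean_s": 1.0},
            probability=0.9,
            basis="engineering estimate (assumed)",
            effect_if_failed="Fault stays latent until the next detection level",
            next_level="software audit",
        )
    },
)

print(port_adapter.missing_info())
```

A completeness check such as `missing_info()` is one way a designer and an analyst might track which parts of a block description are still open.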

Dependability Computation Model (DCM)
- Simulation and analysis for the purpose of estimating the measures defined in the DDM.
- Created by dependability analysts based on information contained in the DDM.
- Models precisely how the system behaves.

UltraSAN Selection Factors
- Output measures are accompanied by estimated distributions.
  - Distribution shape gives insight.
  - Expected performance of individual networks and small populations can be understood.
- Monte Carlo simulation is used to avoid state-space explosion and to support non-exponential time distributions.
- All details specified in the DDM can be modeled using UltraSAN simulation.
  - Models how the design works.
  - Avoids Markov model simplifications.
- One model to estimate all measures.
- Composed model supports modular programming, model reuse, and development time reduction.
- Exceptionally fast simulation time.
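To illustrate why Monte Carlo simulation gives a distribution of outcomes and accommodates non-exponential times, here is a minimal sketch in plain Python. It is not UltraSAN and does not use SANs: it simulates a single repairable component with Weibull failure times and lognormal repair times, and all parameter values are assumptions picked for the example.

```python
import random
import statistics


def simulate_availability(horizon_h, rng,
                          weibull_scale_h=8760.0, weibull_shape=1.5,
                          repair_mu=1.0, repair_sigma=0.5):
    """One replication: alternate up and down periods, return fraction of time up."""
    t = 0.0
    up = 0.0
    while t < horizon_h:
        # Time to next failure (Weibull, so the failure rate is not constant).
        ttf = rng.weibullvariate(weibull_scale_h, weibull_shape)
        up += min(ttf, horizon_h - t)   # count only up time inside the horizon
        t += ttf
        if t >= horizon_h:
            break
        # Repair time in hours (lognormal).
        t += rng.lognormvariate(repair_mu, repair_sigma)
    return up / horizon_h


def run_study(replications=200, horizon_h=10 * 8760.0, seed=1):
    """Run independent replications and summarize the spread of availability."""
    rng = random.Random(seed)
    samples = sorted(simulate_availability(horizon_h, rng)
                     for _ in range(replications))
    return {
        "mean": statistics.mean(samples),
        "min": samples[0],
        "p05": samples[int(0.05 * replications)],
        "max": samples[-1],
    }


result = run_study()
print(result)
```

Each replication can be read as one simulated system over ten years, so the spread between `min`, `p05`, and `mean` hints at what an individual network or a small population might experience, not only the long-run average.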

Modeling Capability Details
- Rate Distributions (exponential, Weibull, lognormal, etc.)
  - Failure time distributions
  - Failure detection, response, and notification (time distribution and probability)
  - Distributions that reflect real experience
- Detection and Response
  - Multiple levels of detection and response escalation
  - Effects of protocols and packet networks in fault management
- Software Architecture
  - Model details of how the software modules work and fail together, how they interact, and their relationship to the hardware.
  - Event edge (time-independent) impacts

Modeling Capability Details
- Repair Dependency
  - Example: If a single port on a multi-port adapter fails, the entire port adapter and all of its connections must be disabled to replace the port adapter.
- Operational Dependency
  - Failure of some elements disables other elements.
  - Example: If a processor fails, all applications are disabled and cannot fail until the processor is brought back online.
- Procedural Errors
  - Failures caused by network operators performing routine or specialized operations on the network.
- Maintenance strategies
- Planned upgrade
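Detection escalation with a time distribution and a probability at each level can be sketched as a simple sampling routine. This is an illustrative assumption, not the authors' UltraSAN template: the level names, coverage probabilities, and mean times in `LEVELS` are hypothetical, and the sketch assumes each level consumes its full detection time before the fault escalates to the next one.

```python
import random

# Hypothetical escalation ladder:
# (level name, probability the level detects the fault, mean detection time in seconds)
LEVELS = [
    ("hardware watchdog", 0.90, 1.0),
    ("software audit", 0.80, 60.0),
    ("operator report", 1.00, 1800.0),
]


def sample_detection(rng, levels=LEVELS):
    """Return (name of the level that detected the fault, elapsed seconds).

    Returns (None, elapsed) if no level detects the fault.
    """
    elapsed = 0.0
    for name, coverage, mean_s in levels:
        # Exponential detection time for this level (an assumption; any
        # distribution could be substituted here).
        elapsed += rng.expovariate(1.0 / mean_s)
        if rng.random() < coverage:
            return name, elapsed
        # Detection failed at this level: escalate to the next one.
    return None, elapsed


rng = random.Random(42)
samples = [sample_detection(rng) for _ in range(10000)]
first_level_share = sum(1 for name, _ in samples
                        if name == "hardware watchdog") / len(samples)
print(f"share detected at first level: {first_level_share:.3f}")
```

The same loop structure extends to response escalation by giving each level a response time and success probability in addition to detection.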

UltraSAN Modeling Process
- Defined UltraSAN templates cover an extensive range of configurations and procedures:
  - Component failure (HW and SW)
  - Redundancy (active and standby)
  - Operational and repair dependency
  - Detection and response time and probability
  - Detection and response escalation
  - Repair / replacement
  - Maintenance / procedural error
  - Upgrades
- Standardized measurement definitions
- Standards for naming conventions, time increment, and variable usage
- Detailed model validation procedures
- Strict configuration management guidelines

Desired UltraSAN Enhancements
- GUI enhancements
  - Cut and paste, rename
  - More robust text/code editor
  - Complete model compilation at all levels of definition, to circumvent the mandatory subnet -> composed -> reward -> study sequence
- Additional model composition formalisms (graph models, etc.)
- Path-based reward variables
- Integration with Design of Experiments functionality for evaluation of sensitivities
- Token specification (colored tokens, data structures, etc.)
- Easier specification of user-defined functions
- Triangular distribution
- Architecture-independent multi-processor runs
- Port to Windows 2000
- Alternate project documentation format (HTML, PDF, etc.)
- Improved documentation
  - Worked complex examples
  - Including tricks

Summary
- Attaining five-nines (5 NINES) availability requires detailed design specifically targeted at availability enhancement.
- A process has been developed that drives and records the availability design detail.
- UltraSAN has allowed us to calculate the results of that implementation detail:
  - Easy to learn without an extensive mathematical background
  - Deals with the large state space that reflects design detail
  - Rapid simulation time
  - Distributions as part of inputs and in outputs
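As a rough guide to what a five-nines target implies, the allowed downtime per year follows directly from the availability figure. The short sketch below works through that arithmetic; the function name and the 365.25-day year are assumptions made for the example.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes an average year of 365.25 days


def downtime_minutes_per_year(availability):
    """Expected downtime per year, in minutes, for a steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Downtime budget for two through five nines.
budgets = {n: downtime_minutes_per_year(1.0 - 10.0 ** -n) for n in range(2, 6)}
five_nines = downtime_minutes_per_year(0.99999)

for n, minutes in budgets.items():
    print(f"{n} nines: {minutes:.2f} minutes per year")
```

At five nines the budget is a little over five minutes per year, which is why every source of service interruption has to be accounted for in the model.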