Certifying Automotive Electronic Solutions


The failure of electronic components is a significant cause of breakdowns in the automotive domain today. The share of breakdowns that can be traced back to bugs in the electronic system has risen to 55 percent, and analysts expect a further increase in electronically induced breakdowns to as much as 60 percent in 2004 [1]. At the same time, sustainable innovations in the automotive industry depend increasingly on more complex electronic and software systems. Introducing a new technology into automotive applications may involve unpredictable failure modes, imposing severe risks on the car maker and causing potential financial losses.

New safety-relevant control applications such as driver assistance systems or advanced chassis control systems require sophisticated fault tolerance mechanisms and result in complex electronic control units (ECUs). Furthermore, shrinking geometries, lower supply voltages, and higher frequencies have a negative impact on reliability [2]. Integrating safety-relevant and non-safety-relevant sub-systems into a common distributed architecture (in an attempt to reduce costs and increase reliability) makes the situation even worse. When built from scratch, safety-relevant control applications may surpass the capabilities of even the most experienced multidisciplinary engineering teams. The main concern will be how to manage the quality, reliability, and safety of complex distributed systems in cars.

Achieving System Safety: Fault Hypothesis is Key

A new dimension of safety emerges not only for automotive systems without mechanical backup but also for systems where the functional difference between the fully operational electronic system and the purely mechanical backup function is substantial. This difference in functionality may well lead to critical situations. It is therefore increasingly required that electronics failures be extremely unlikely (typically 10^-9 per hour). Such low failure rates can only be achieved by a fail-operational system, which keeps up operation until a safe state is reached. The fail-operational assumption requires that the system remain fully operational after an arbitrary fault of at least a single component.

The key challenge in the design of these systems is achieving the required safety and reliability at the system level. Typically, the likelihood of failures that lead to a safety-critical situation must not exceed 10^-9 per hour. Such ultra-high dependability can only be realized by fault tolerance, since the intrinsic reliability of an ECU or System-on-a-Chip (SoC) is orders of magnitude lower. For all these considerations, the key question is which faults need to be tolerated by the system. This is defined by the fault hypothesis.
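
To make the gap between component reliability and the 10^-9 per hour target concrete, the following minimal sketch compares a single component with a group of three replicas that tolerates one failure. The component failure rate of 10^-5 per hour is an illustrative assumption (the text only says that intrinsic reliability is orders of magnitude lower than the target), and the model assumes independent failures, an exponential failure law, and no repair within the interval considered.

```python
import math

def component_failure_probability(rate_per_hour, hours=1.0):
    """Probability that one component fails within `hours` (exponential model)."""
    # expm1 keeps precision for the very small rates used here.
    return -math.expm1(-rate_per_hour * hours)

def group_failure_probability(rate_per_hour, replicas, tolerated, hours=1.0):
    """Probability that more than `tolerated` of `replicas` independent
    components fail within `hours`, i.e. that the redundant group fails."""
    p = component_failure_probability(rate_per_hour, hours)
    return sum(
        math.comb(replicas, k) * p**k * (1.0 - p) ** (replicas - k)
        for k in range(tolerated + 1, replicas + 1)
    )

ASSUMED_COMPONENT_RATE = 1e-5  # failures per hour; illustrative assumption only
TARGET = 1e-9                  # per-hour target quoted in the text

single = group_failure_probability(ASSUMED_COMPONENT_RATE, replicas=1, tolerated=0)
triple = group_failure_probability(ASSUMED_COMPONENT_RATE, replicas=3, tolerated=1)

print(f"single component: {single:.2e} per hour, meets target: {single <= TARGET}")
print(f"three replicas  : {triple:.2e} per hour, meets target: {triple <= TARGET}")
```

Under these assumptions the single component misses the target by several orders of magnitude, while the replicated group stays below it. The result depends entirely on the independence assumption: a common-mode failure affecting several replicas would invalidate the calculation.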

As one component is an appropriate unit of failure, several non-fault-tolerant components are grouped into a fault-tolerant unit (FTU). If one of these components fails, the services of the overall system must still be available. This masks one failure and thus lies within the so-called single-fault hypothesis. The single-fault hypothesis requires that any single component fault, up to the arbitrary loss of a complete ECU, is reliably detected and tolerated. Moreover, massive transient faults affecting several components must be handled such that the system resumes operation within a short time interval after the fault has ended.

The failure modes of a component can be classified as:

Fail-silent: If one of the components of an FTU fails, it sends no further messages, or it sends an explicit "I am out of order" message. This implies that the rest of the system gets a consistent view of the failure of the component. The unit detects an error in the time or value domain fast enough to shut down before sending anything wrong.

Fail-restrained: The erroneous component may send wrong values for a bounded interval of time. After a periodic self-test it recognizes the failure and switches itself off. The rest of the system gets a consistent view of the failure within a bounded time interval.

Fail-consistent: The erroneous component may send wrong values and has no means of detecting this itself. The remaining FTU components get a consistent view of the state and output of the faulty one, so the faulty component can be detected by applying a specific agreement algorithm. The agreed value is delivered to the application.

Fail-uncontrolled: The nodes in the system may show Byzantine behavior, i.e. different nodes receive different results from the same sender. The rest of the system gets arbitrary views of the state and output of the Byzantine node.
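
The difference between the fail-silent and fail-consistent cases can be sketched in a few lines. This is an illustration only: the function names are hypothetical, `None` stands for a replica that stayed silent, and a plain majority vote stands in for the "specific agreement algorithm", which the text does not name.

```python
from collections import Counter
from typing import Optional, Sequence

def select_fail_silent(values: Sequence[Optional[int]]) -> Optional[int]:
    """Fail-silent replicas never send wrong values, so any delivered value
    can be used. Returns None only if every replica is silent."""
    for value in values:
        if value is not None:
            return value
    return None

def majority_vote(values: Sequence[Optional[int]]) -> Optional[int]:
    """Fail-consistent replicas may send wrong values, but every receiver sees
    the same set of values, so a strict majority masks one faulty replica.
    Returns None if no value is backed by more than half of the replicas."""
    delivered = [v for v in values if v is not None]
    if not delivered:
        return None
    value, count = Counter(delivered).most_common(1)[0]
    return value if count > len(values) // 2 else None

# One silent replica out of three: the remaining values are trustworthy.
print(select_fail_silent([None, 7, 7]))
# One replica delivers a wrong value: the other two outvote it.
print(majority_vote([42, 42, 17]))
```

Neither function covers the fail-uncontrolled case, because a Byzantine sender may deliver different values to different receivers, so the receivers would not even be voting on the same inputs.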
Byzantine behavior can be caused, for instance, by electromagnetic interference or by a slightly-off-specification (SOS) error. For the safety analysis it is most desirable that the system can handle the worst failure mode: fail-uncontrolled behavior. Moreover, during system design it is of paramount importance to achieve independence, i.e. to ensure that there are no common-mode failures affecting more than one component. If these conditions are satisfied, there are known procedures to establish a safety case and attain certification.

Certification and Safety Standards

The next relevant step after addressing failures during system operation is to address design faults. Given that a sound fault-tolerant electronics architecture for safety-relevant automotive systems is already in place, certification is typically used to avoid design faults and minimize their likelihood. In the aerospace domain, accepted standards such as the Radio Technical Commission for Aeronautics (RTCA) standards DO-178B and DO-254 have already led to a well-established and very successful safety culture concerning safety-relevant electronics. DO-178B, "Software Considerations in Airborne Systems and Equipment Certification", focuses on the software aspects of system development. It defines a development process for software components, qualifies software tools, and aims at reusability of software. DO-254, "Design Assurance Guidance for Airborne Electronic Hardware", addresses the hardware development process and including
guidelines for the development of circuit boards, ASICs (Application-Specific Integrated Circuits) or other programmable hardware components.

There are several standards, guidelines, and methods in the automotive domain that provide information on processes as well; among them are the Motor Industry Software Reliability Association (MISRA) guidelines. IEC 61508 is a generic standard that can be used directly by industry but can also help to develop sector standards (e.g. machinery, chemical plants, medicine, or rail) or product standards (e.g. power drive systems). It provides a means for users and regulators to gain confidence when using computer-based technology. IEC 61508 focuses on system-level safety and is related to the system-level safety processes of ARP4761 (including ARP4754), with the design assurance guidelines DO-178B for software and DO-254 for complex hardware [3]. These standards support the development of safe systems by providing a framework based on a prescribed set of best practices for the development of safe software and hardware, together with system-level safety considerations (see Fig. 1). Although there is not yet an established standard for certification in the automotive industry, this situation is expected to change soon owing to the increased safety relevance of electronic systems and rising customer awareness of the impact of electronics failures.

Fig. 1: Basic overview of integrated development and safety processes

There can be no doubt that IEC 61508 ranks among the most important safety standards for electronic systems. This international standard addresses the functional safety of electric, electronic, and programmable electronic systems and will play an important role in the whole development process of various industrial applications. To ensure reliability, safety, and maintainability, IEC 61508 covers the entire life cycle of a safety-relevant system, including concept, development, utilization, and decommissioning of the system.

IEC 61508, "Functional Safety of Electrical, Electronic and Programmable Electronic Safety-Related Systems", has been published in full and is now in a "maintenance phase". The first request to national committees for comments on all parts
of IEC 61508 took place in January 2001; the revised version of IEC 61508 will be published in March 2006 [4]. In August 2002, the European Committee for Electrotechnical Standardization (CENELEC) adopted, ratified, and published IEC 61508 as DIN EN 61508 (classification VDE 0803), a series of standards.

TTA: A Case Study of an Architecture for Safety-Relevant Applications

The Time-Triggered Architecture (TTA) provides a computing infrastructure for the design and implementation of dependable distributed embedded systems. A large real-time application is decomposed into nearly autonomous networks and nodes. A fault-tolerant global time base of known precision is generated at every node. In the TTA this global time is used to precisely specify the interfaces among the nodes, to simplify the communication and agreement protocols, to perform prompt error detection, and to guarantee the timeliness of real-time applications.

The TTA supports a two-phase design methodology: architecture design and component design. During the architecture design phase, the interactions among the distributed components and the interfaces of the components are fully specified in the value and time domain. In the succeeding component implementation phase, the components are built, taking these interface specifications as constraints. This two-phase design methodology is a prerequisite for the composability of applications implemented in the TTA and for the reuse of pre-validated components within the TTA [5].

The Time-Triggered Protocol (TTP) is the communication protocol of the TTA for hard real-time fault-tolerant communication. TTP provides hard real-time message delivery with minimal jitter. Different fault tolerance strategies are supported. It is guaranteed that no single failure of any part of the communication system can disrupt communication. TTP provides distributed fault-tolerant clock synchronization.
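
The text does not say how the distributed clock synchronization works. One well-known technique for this kind of problem is a fault-tolerant average, sketched below under that assumption: each node measures the deviation between its own clock and the clocks of other nodes, discards the most extreme measurements, and corrects its clock by the average of the rest. The function name, the parameter `k`, and the sample values are hypothetical.

```python
def fault_tolerant_average(deviations, k=1):
    """Average the measured clock deviations after dropping the k largest and
    the k smallest values, so that up to k arbitrarily wrong measurements
    cannot drag the correction term outside the range of the correct ones."""
    if len(deviations) <= 2 * k:
        raise ValueError("need more than 2*k measurements to tolerate k faults")
    trimmed = sorted(deviations)[k:len(deviations) - k]
    return sum(trimmed) / len(trimmed)

# Deviations in microseconds as seen by one node (illustrative values);
# the 500 comes from a faulty clock and is discarded before averaging.
measured = [2, -1, 3, 500]
correction = fault_tolerant_average(measured, k=1)
print(f"clock correction: {correction} microseconds")
```

Because every node applies the same rule to its own measurements, no central time master is needed, which matches the distributed character described in the text.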
Extensive mechanisms for error detection, recovery, and reintegration of nodes are provided. The protocol has been designed for the highest data efficiency and minimal protocol overhead. Furthermore, TTP supports composability through its precisely defined behavior in the value and time domain. TTP focuses on safety-relevant high-speed applications, combining low cost with high protocol efficiency [6].

TTP is based on more than 20 years of development work. During that time a great number of patents were filed and the protocol was stabilized. All those activities resulted in the launch of the first TTP communication controller in [...]. An automotive-qualified third-generation communication controller has been available since [...]. Products for TTP are currently developed, verified, and validated in compliance with Federal Aviation Administration (FAA) guidelines, commonly used for safety-critical applications in the aerospace industry. The OSEKtime-based operating system TTP OS is developed according to the RTCA software standard DO-178B Level A. The firmware providing the protocol functionality of the AS8202NF communication controller (based on TTP-C2NF) is certifiable in compliance with the DO-178B standard for Level A applications. TTP Verify has been designed as a software verification tool in compliance with the software development standard RTCA DO-178B and supports the verification of safety-critical distributed control systems developed under RTCA DO-178B Level A. The experience gained in making TTP products meet aerospace standards forms a solid basis for developing automotive applications that are certifiable according to IEC 61508.

Certifiable development increases the safety of the system infrastructure. Therefore, the design of TTP products follows rigid aerospace safety-oriented processes. First of all, the specification and
requirements are written down and peer-reviewed by independent reviewer teams. The architecture is developed on the basis of these requirements, and traceability from high-level to low-level requirements and on to source code and test cases is established. The source code is developed from the low-level requirements, thereby guaranteeing that none of the features is omitted in the code and architecture. Any functionality not covered by requirements must be removed. It has to be proven that no unintended functionality is present, because unintended functionality increases the probability of unknown failure modes. Quality and safety are designed in from the beginning rather than only tested at the end of development. The costs of safety-relevant development are reduced by upfront investments in specification and requirement activities. This does not mean, however, that the tests are developed, reviewed, and executed with less scrutiny (see Fig. 2).

Fig. 2: Overview of development process

Furthermore, traceability is established from the requirements to module, integration, and system-level tests. This ensures that all requirements are tested. Evidence that all code is tested must be given by a 100% Modified Condition/Decision Coverage (MC/DC) analysis. This means that all statements in the code have been executed, all decision outcomes have been checked, and each condition in a decision has been shown to independently affect the outcome of that decision. After such tests it is extremely improbable that any unintended functionality is built into the software. Thus, certification provides evidence that the system has the intended functionality and features, and that the probability of unintended functionality is reduced to a minimum.

Complex hardware (ASIC) behaves in a way similar to software. Its reliability is in most cases determined not by statistical failures but by failure modes introduced through immature development methods. The number of internal states of a complex device can be too large to test exhaustively in a given time. Therefore, a safety-relevant design process similar to the software processes must be applied in order to prevent safety risks in design and requirements. In the field of safety-relevant electronic systems, formal verification and fault injection are heavily used methods to guarantee a high degree of reliability in addition to a mature development
process. Both theoretical proofs and empirical tests were applied to TTP. Leading universities and research centers such as SRI International and the University of Ulm carried out formal verification of the Time-Triggered Protocol and its pertinent mechanisms [7]. The process of formal verification is being continued in a NASA-sponsored project. EC-funded projects such as Predictably Dependable Computing Systems (PDCS) and Fault Injection TTA (FIT) covered fault avoidance, fault tolerance, fault handling, and fault prediction. Real tests with millions of faults, heavy-ion radiation experiments, and experiments with electromagnetic interference were also carried out successfully [8].

Fig. 3: More than 170 man-years of development invested in safety

A qualifiable communication infrastructure, developed with safety as its primary objective (see Fig. 3), is a prerequisite for safe and reliable ECU networks. However, in order to reduce costs and improve system-level safety, the following issues must also be managed:

System complexity (architecture and infrastructure): The ever-increasing number of automotive networks creates the need for methods to manage the overall system complexity in cars. The TTP tool chain provides seamless sub-system composability, clean interface design, and automatic code generation from MATLAB/Simulink models.

Car maker/supplier collaboration: As a basic property of the Time-Triggered Architecture, the separation of system and sub-system development and the focus on system functionality and requirements in the early phase of design help both the car makers and their suppliers.

Safety-relevant software/hardware development processes: To reduce the costs of software and hardware development, system testing, and validation of system-level safety, the development process of safety-relevant systems needs to be controlled by state-of-the-art engineering and organizational methods. The quality of safety-relevant products for the TTA is guaranteed by rigid safety-oriented development procedures.
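
One of the procedures named in the development process above, Modified Condition/Decision Coverage, can be illustrated with a small check. The decision `a and (b or c)` and the test vectors are hypothetical, and the checker implements the unique-cause form of MC/DC: a test set qualifies only if, for every condition, it contains two tests that differ in that condition alone and yield different decision outcomes.

```python
from itertools import combinations

def decision(a, b, c):
    """Hypothetical decision with three conditions, used only for illustration."""
    return a and (b or c)

def achieves_mcdc(func, tests, n_conditions):
    """Return True if, for every condition, some pair of tests differs only in
    that condition and produces different decision outcomes."""
    covered = set()
    for t1, t2 in combinations(tests, 2):
        differing = [i for i in range(n_conditions) if t1[i] != t2[i]]
        if len(differing) == 1 and func(*t1) != func(*t2):
            covered.add(differing[0])
    return covered == set(range(n_conditions))

mcdc_tests = [
    (True, True, False),   # baseline, decision is True
    (False, True, False),  # only a flipped, decision becomes False
    (True, False, False),  # only b flipped, decision becomes False
    (True, False, True),   # only c flipped from the previous test, decision is True
]

# Two tests can exercise both decision outcomes without showing that each
# condition matters on its own.
decision_only_tests = [(True, True, True), (False, False, False)]

print(achieves_mcdc(decision, mcdc_tests, 3))
print(achieves_mcdc(decision, decision_only_tests, 3))
```

The second test set reaches both outcomes of the decision but fails the check, which is why MC/DC is a stricter criterion than decision coverage.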

TTA provides a verified, validated, and certifiable core technology for safety-relevant systems. The available tool environment offers comprehensive means to support the development of dependable systems. The experience gained in aerospace certification can be leveraged to enhance the safety of automotive applications and to fulfill upcoming automotive certification requirements.

Summary

The automotive industry is gradually extending certification to the domain of safety-relevant electronic systems. IEC 61508, "Functional Safety of Electrical, Electronic and Programmable Electronic Safety-Related Systems", is regarded as a prospective generic safety standard for automotive electronic systems. Any system architecture for dependable automotive applications should be certifiable in order to guarantee a trustworthy level of dependability of the whole system. This in turn requires the definition of a proper fault hypothesis. Certification of newly developed systems is facilitated by proper architectural design guidelines and reusable certification packages. The Time-Triggered Architecture (TTA) provides a mature framework for developing innovative safety-relevant applications. The TTA is a qualifiable infrastructure for projects that need certification for the design of easily integrated systems and safe control networks.

References

[1] C. Constantinescu. Trends and Challenges in VLSI Circuit Reliability. IEEE Computer Society, Jul-Aug
[2] F. Dudenhöffer in Chip-Forum "Computer als virtueller Beifahrer" at IAA in Frankfurt, Sep
[3] C. Bauer and D. Plawecki. A comparative study IEC 61508 vs. RTCA/DO-178B: Applicability and Adequacy for Software Development and Certification of Airborne Systems. TÜVit Conference IEC61508, Jan
[4] International Electrotechnical Commission (IEC). IEC 61508 Frequently asked questions. Nov
[5] H. Kopetz and G. Bauer. The Time-Triggered Architecture. Proceedings of the IEEE Special Issue on Modeling and Design of Embedded Software, Jan
[6] S. Poledna et al. Die Kommunikationsarchitektur für X-by-wire Systeme. Automotive Electronics. Sep
[7] J. Rushby. An Overview of Formal Verification for the Time-Triggered Architecture. SRI International, Menlo Park, California, Sep
[8] H. Sivencrona. Heavy-Ion Fault Injection in TTP-C2 Implementation. Report of the SP Swedish National Testing and Research Institute, Sep 2003.

Contact

TTTech Computertechnik AG
Schoenbrunner Strasse 7
A-1040 Vienna, Austria
office@tttech.com