ARC BRIEF. PAT and the Need for High-Availability Platforms

By ARC Advisory Group
OCTOBER 2005

Continuing advances in biomedical technology, new and emerging compliance requirements, socio-political pressures, and global competition are driving new mission-critical systems into pharmaceutical manufacturing. This paper looks at the hardware and system implications of the new regulatory environment.

THOUGHT LEADERS FOR MANUFACTURING & SUPPLY CHAIN

Pharmaceutical Manufacturing Needs High Availability Systems

Continuing advances in biomedical technology, new and emerging compliance requirements, socio-political pressures, and global competition are driving new mission-critical systems into pharmaceutical manufacturing.

It is estimated that within the next few years, over 40 percent of new drug entities will come from biotechnology processes rather than traditional chemical processes. Small-molecule chemical reactions are well understood. Large-molecule biotechnology processes, however, are less well understood and less predictable. Batches also tend to be more expensive and may take over 60 days to complete. Because this represents increased financial risk and potential product quality variability, substantially more on-line process data are required to monitor and analyze in-process batches.

Rising healthcare costs and demand for more affordable drugs have begun to erode the industry's traditionally high margins. At the same time, the benefits offered by patent protection continue to erode, reducing the commercial life cycle of a drug and the financial returns it provides. In addition, competitive non-patent-infringing drugs often reach the market within a few years, eroding both market share and margins.

The FDA's new scientific, risk-based approach to regulating the pharmaceutical industry in the 21st century, along with initiatives like Process Analytical Technology (PAT), is changing inspection, compliance, and enforcement policy requirements.

[Figure: The New Pharmaceutical Business Model Requires Improved Automation Systems with Reliability, Availability, and Security. Labels include: Scientific, Risk-based Approach; GMP Initiative; PAT Initiative; Counterfeit Drug Initiative; Research; Clinical Trials; Clinical Manufacturing; Commercial Manufacturing; Continuous Process Optimization; QMS; CAPA; Revalidation; Technology Suppliers; Customers; Patients.]

Pharmaceutical manufacturers must now be able to demonstrate an increasing knowledge of their manufacturing processes, implement continuous quality verification, and make continuous improvement a part of their quality management system. Low asset utilization and compliant paper records obtained from automated systems will no longer be adequate. To succeed in this challenging business environment and continue to comply with evolving FDA requirements, manufacturing will need substantially more on-line process data, fully electronic records, and on-line multi-factorial analysis tools. Increased dependence on extensive real-time electronic batch records demands increased availability of the applications and of the real-time data storage devices used to collect and store those records. ARC believes that the PAT initiative is forcing the industry to re-examine the reliability, availability, and security of its entire automation infrastructure and architecture.

Conservative Technology Adoption

Pharmaceutical manufacturers have lagged in using IT on a broader scale to automate and streamline time-consuming manufacturing processes, such as batch record systems, and to maintain process quality control. Limited automation, electronic record keeping, and information collection diminish manufacturing efficiency and make for a costly regulatory reporting infrastructure. Pharmaceutical industry equipment utilization hovers around 40 percent, which is significantly below that of most industries. Industry statistics for batch quality failures range from 5 to 15 percent.

PAT and electronic batch record (EBR) information systems are key elements in eliminating inefficiencies and transforming manufacturing processes so that they better meet regulatory and market demands. PAT and EBR systems can create a common data framework that turns plant floor data into a strategic-level tool. The information they generate can help production managers increase yield while helping analysts and executives root out inefficiency, plan capacity usage, and meet regulatory reporting requirements.

Transforming production systems will be challenging because doing so is not simply a matter of extending conventional corporate IT systems and management approaches down to the manufacturing plant level. Reporting requirements and public safety regulations make reliability a much higher priority at the manufacturing level than in the general corporate IT environment. Outages and glitches that are common in corporate infrastructures can mean loss of batch data and of the batch itself.

PAT & EBR Information Systems

The industry needs real-time information technology solutions to meet market and regulatory pressures, and these solutions in turn need reliable computing infrastructures that are always available. For PAT to provide real-time quality control, its underlying computing infrastructure must be continuously available. EBR needs continuous computing, as even one system crash breaks the electronic record chain, potentially turning a batch of product into the quarter's biggest loss. This need is driving the rise of fault-tolerant computing infrastructures in the industry, from vendors such as Stratus Technologies. Continuous availability infrastructures based on fault-tolerant hardware and software components provide the much higher degree of reliability ([figure missing] percent uptime) that pharmaceutical companies need to ensure more robust and reliable EBR and PAT solutions.

Manufacturing companies should be cautious when evaluating how their technology vendors define the term continuous availability. Vendors often describe continuous availability in differing ways and have varying experience in delivering these solutions.

Error Classifications

Computer hardware crashes can generally be attributed to two classes of errors: hard errors and transient errors. While both hard and transient errors usually result in downtime for a standard server and initiate a failover recovery procedure in a cluster, the similarities end there. Hard errors are usually reproducible, consistent, and easy to isolate. In contrast, transient errors are unpredictable random events that are virtually impossible to isolate on a conventional server.

Compounding the problem, transient errors can cause silent data corruption, in which the system generates false outputs without reporting any fault. The consequences can be severe. Irretrievable loss of critical data, costly solution downtime, and failure to meet regulatory compliance may all occur when silent data corruption goes unchecked.

The typical industry-standard server is designed with price/performance as its primary goal. With availability viewed as a secondary objective, minimal design margins and marginal components are all too often the outcome. Such a system is prone to transient errors when subjected to system load, variations in the component manufacturing process, and environmental conditions. Over time, these factors can cause affected components to move from a fully functioning state to an intermittent state and, finally, to a hard-failed state. Depending on the defect, the component may be in an intermittent state for a relatively long period of time, during which transient errors may occur more frequently.

Solution Options: Clusters v. Fault Tolerance

New, low-cost technology for fault-tolerant platforms is now available for Microsoft Windows environments. These platforms consist of fully replicated, fault-tolerant hardware, with duplicate components operating in lockstep, bundled together with Microsoft Windows so that the whole platform is highly available. In the event of a component failure, there is no interruption in processing, no lost data, and no slowdown in performance. Manufacturers should revisit some old assumptions about where they might benefit from deploying these platforms. Not only is the price point of these fault-tolerant systems very attractive compared to decades-old products, but there are also substantial cost savings compared to cluster-based solutions.

Comparison of Fault Tolerant Solutions

Description                  Stratus               Cluster
Availability                 [figure missing] %    99.9%
Recovery Time                Zero                  Minutes
Copies of O/S                1                     Multiple
Symmetric Multi-Processing   Available             Available
System Operation             Single System Image   Multisystem Cluster
Implementation               No work required      Script Development and Testing
Single Support Contact       Yes                   3rd Party

Compliance requirements and new regulatory trends put a premium on real-time manufacturing information, and fault-tolerant systems can help ensure that the information is always available. Mission-critical automation systems, production management systems, business systems, and collaborative systems can all benefit from this technology.

Cluster Issues

Clustering is a technique in which physical connections and software programs link two or more servers (nodes) so that when a failure occurs on one node, its workload can fail over to a surviving node. The goal of the cluster is to establish a high-availability environment that minimizes application downtime.

In the Windows world, the de facto cluster offering is Microsoft's Cluster service. Cluster service supports the formation of clusters that can contain eight nodes with up to 64 processors per node. A number of vendors, including Legato Systems, NCR, Oracle, and VERITAS Software, offer products that complement or compete with Cluster service. A cluster is not a single product per se, but rather a collection of hardware, software, and enabling technologies (such as Cluster service) that are combined to create a solution that provides a high level of availability.

While the traditional clustering approach does provide enhanced availability, it has significant limitations. Cluster solutions do not provide fault tolerance (where failure and repair/recovery are transparent to the user), only failover (where a backup system automatically restarts the applications and logs on the users). Implementation requires the development, testing, and support of custom failover scripts, licensing and installation of multiple copies of software, and possibly application modifications for a cluster environment. In the event of a hardware failure, a cluster failover always loses all memory contents, and several minutes are required to recover. Cluster solutions offer 99.9 percent availability (about 8 hours down per year), but fault-tolerant solutions offer [figure missing] percent availability (about 5 minutes down per year).

Hardware Fault Tolerance

The first requirement for high-availability systems is hardware fault tolerance.
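The downtime figures quoted above for clusters follow from simple arithmetic, and the same arithmetic is a useful check when comparing vendor availability claims. The sketch below assumes a 365-day year and counts every hour as scheduled uptime; the function names are illustrative only. Under those assumptions, 99.9 percent availability works out to roughly 8.76 hours of downtime per year, and 5 minutes of downtime per year corresponds to roughly 99.999 percent availability. That second value is derived here from the stated downtime, not recovered from the brief.

```python
# Convert between an availability percentage and expected downtime per year.
# Assumptions (not from the brief): a 365-day year, and no exclusion of
# planned maintenance windows from the availability calculation.

HOURS_PER_YEAR = 365 * 24  # 8760 hours


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


def availability_for_downtime(minutes_per_year: float) -> float:
    """Availability percentage implied by a yearly downtime in minutes."""
    return 100.0 * (1.0 - minutes_per_year / (HOURS_PER_YEAR * 60.0))


if __name__ == "__main__":
    # Cluster figure quoted in the brief: 99.9 percent
    print(f"99.9% available -> {downtime_hours_per_year(99.9):.2f} hours down per year")
    # Fault-tolerant figure quoted in the brief: about 5 minutes down per year
    print(f"5 minutes down per year -> {availability_for_downtime(5):.4f}% available")
```

When evaluating a vendor's definition of continuous availability, it is worth asking which of these assumptions the quoted percentage relies on, since excluding planned downtime can make the same system look considerably better on paper.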
All aspects of the design must work concurrently to prevent unplanned downtime, not simply minimize it. Preventing downtime is a key design point that differentiates server manufacturers from each other. The ftServer W Series family from Stratus Technologies is one such server line, differentiated in this way from robust traditional servers and high-availability clusters. Notably, off-the-shelf Windows-based applications need not be modified in any way to benefit from the designed-in safeguards of the ftServer. This advantage represents a considerable improvement compared with clusters that require failover scripting, repeated test procedures, and software changes to make applications cluster-aware.

The ftServer W Series family eliminates single points of failure by using replicated components that continue uninterrupted processing even in the event of a component malfunction. Hardware faults are handled automatically by the system, without failover delay or data loss. Using Stratus intellectual property (lockstep technology), W Series systems maintain multiple CPU-memory units in precise synchronization, executing the same instructions at exactly the same clock cycle. Lockstep processing ensures that any errors, even transient errors, are detected and that the system can survive any CPU-memory unit error without interrupting processing and without losing any data or state.

The fault-tolerant I/O subsystem is logically separate from the CPU-memory subsystem. Hardware logic, in the form of custom ASICs, acts as a PCI bridge between the CPU and I/O, and provides the core error detection, fault isolation, and synchronization logic for the lockstep architecture. Custom logic within the CPU-memory subsystem contains the primary PCI interfaces, interrupt control functions, and transaction ordering logic. Custom logic within the I/O subsystem contains the voting logic, secondary PCI interfaces, and error registers.

Fault-tolerant I/O is implemented through the use of replicated PCI buses, replicated I/O adapters, and replicated devices. All critical PCI adapters are duplicated as well: SCSI, SATA, Ethernet, remote management, and Fibre Channel. Internal SCSI and SATA disk storage, along with expansion Fibre Channel storage, is mirrored (RAID 1) and connected via two independent storage buses. Connections to external Fibre Channel hardware RAID arrays are also duplicated to ensure full fault-tolerant operation.
Multiple paths are therefore available to any logical I/O operation, including both internal and external storage operations. Any I/O operation failure results in a retry using an alternate path, which ensures successful completion of the I/O operation.

Stratus' approach to availability is based on a design philosophy that detects, isolates, and corrects errors before they cause system downtime or corruption of valuable business data. Preventing downtime is a key design point that differentiates Stratus servers from conventional servers and high-availability clusters. While many servers offer duplicated power supplies, fans, and disk drives, the ftServer system from Stratus provides protection for core system components that include motherboards, processors, memory, I/O buses, and I/O adapters. Another advantage of this approach is that an ftServer system presents a single-system view and runs a single copy of all software, which typically reduces software licensing costs and simplifies administration compared with multi-node cluster alternatives.

Software Availability

The second requirement for high-availability systems is maximizing software availability. Clusters rely on standard hardware, software, and service models that do not help prevent failures, isolate failures, or resolve failures. They simply recover from failures. Stratus software availability features seek to prevent outages, minimize those that cannot be prevented, and resolve problems so that they do not happen again.

Because software is particularly vulnerable to hardware errors, proper error handling can avert many potential software problems. With conventional servers, many problems attributed to software are actually caused by transient hardware errors. While no computer system can prevent a transient error from occurring, Stratus' line of fault-tolerant systems has been engineered to detect, isolate, and withstand transient hardware errors. The lockstep processing discussed above ensures that any errors, including transient errors, are detected and that the system can survive any CPU-memory unit error without interrupting processing and without loss of data or state information. In addition to riding through the error condition, Stratus systems capture and log information about the transient occurrence and automatically take the affected component out of service if it reaches a threshold beyond which a failure is likely to occur.
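The detect, ride-through, and retire behavior described above can be illustrated with a small software model. The Python sketch below is a teaching aid under stated assumptions, not a description of Stratus' implementation: the three-unit majority vote, the class and function names, and the threshold of three logged errors are all hypothetical choices made for the example. The brief mentions voting logic and an error threshold but does not specify either.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Unit:
    """One replicated CPU-memory unit (hypothetical model, not a Stratus API)."""
    name: str
    compute: Callable[[int], int]
    error_count: int = 0
    in_service: bool = True


class LockstepSystem:
    """Runs every in-service unit on the same input and votes on the result."""

    def __init__(self, units: List[Unit], error_threshold: int = 3):
        self.units = units
        self.error_threshold = error_threshold  # assumed value, for illustration
        self.log: List[str] = []

    def step(self, value: int) -> int:
        active = [u for u in self.units if u.in_service]
        if not active:
            raise RuntimeError("no units in service")
        results = {u.name: u.compute(value) for u in active}
        majority, votes = Counter(results.values()).most_common(1)[0]
        if votes * 2 <= len(active):
            # No strict majority: this simple model cannot tell which unit is wrong.
            raise RuntimeError("units disagree and no majority exists")
        for u in active:
            if results[u.name] != majority:
                # Ride through the error, but record it against the unit.
                u.error_count += 1
                self.log.append(f"{u.name}: mismatch on input {value}")
                if u.error_count >= self.error_threshold:
                    u.in_service = False
                    self.log.append(f"{u.name}: taken out of service")
        return majority


def square(x: int) -> int:
    return x * x


def make_flaky(bad_inputs: set) -> Callable[[int], int]:
    """Return a unit function that flips the lowest result bit on chosen inputs."""
    def compute(x: int) -> int:
        result = x * x
        return result ^ 1 if x in bad_inputs else result
    return compute


def demo():
    units = [Unit("A", square), Unit("B", square), Unit("C", make_flaky({2, 3, 5}))]
    system = LockstepSystem(units, error_threshold=3)
    outputs = [system.step(x) for x in range(1, 7)]
    return system, outputs


if __name__ == "__main__":
    system, outputs = demo()
    print(outputs)       # every output is correct despite unit C's errors
    print(system.log)    # three mismatches, then C is taken out of service
```

In the demo, unit C returns a corrupted value on three inputs. Each time, the vote masks the error so the caller never sees a wrong result, and after the third logged mismatch C is retired while A and B carry on.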
In the event that a component is taken out of service, its partner component simply continues to operate as normal.

Stratus does not change any of the core Windows code. This guarantees 100 percent binary compatibility of all Windows applications. The ftServer systems running Windows have demonstrated hardware and operating system availability levels beyond [figure missing] percent, as measured by actual production system data. Stratus does change the Windows environment, but only in areas designed to be customized by hardware and software partners and separated from the main body of Windows code by documented, well-defined interfaces.

Drivers cause a significant percentage of Windows NT failures. Stratus driver hardening goes beyond Windows improvements to further reduce driver-induced OS failures. The driver defines its memory boundaries and works with Stratus hardware to automatically prevent memory transfers beyond those boundaries. This prevents a bad PCI card from crashing the system. The new Microsoft driver model for Windows uses WMI (Windows Management Instrumentation) for management, control, and reporting functions. Stratus hardened drivers are completely compatible with WMI. Stratus recommends that all drivers be hardened. Hardened drivers for all installed adapters are required in order to receive Stratus' 100 percent availability guarantee.

Incompatible versions of hardware and software from different suppliers are common. The Resource Inventory Manager (RIM) identifies all system hardware and software configuration elements, along with their revision levels, at initial install and at every configuration change. This information is stored and is also sent to the Stratus Customer Assistance Center (CAC), which can check for known conflicts and help diagnose any problems.

Serviceability

The third requirement for high-availability systems is designed-in serviceability. Serviceability is built into the ftServer hardware design in the form of customer-replaceable modules, automatic fault isolation, and remote management and reporting through the Stratus remote management card. The Stratus Service Network (SSN) enables remote access to every customer system. The Stratus Customer Assistance Center provides 24/7 critical support.

The Stratus Technologies ftServer system automatically isolates failures to the component level while continuing operation on a second component. Failures are automatically reported to the CAC via a dial-up connection.
A replacement component is shipped from Stratus for next-day arrival. The customer replaces the component while the system continues to operate. The new component is automatically integrated into the running system. The system and application continue to run normally through this entire process.
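The driver-hardening idea described earlier under Software Availability (a driver declares its memory boundaries, and transfers outside them are blocked before they can do damage) can be pictured with a short model. The Python sketch below is purely conceptual: in the brief this check is performed with the help of Stratus hardware, whereas here it is plain software, and every name in it (MemoryWindow, GuardedBus, TransferBlocked) is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MemoryWindow:
    """Address range a driver has declared for its transfers (hypothetical)."""
    start: int
    length: int

    def contains(self, address: int, size: int) -> bool:
        end = self.start + self.length
        return size > 0 and self.start <= address and address + size <= end


class TransferBlocked(Exception):
    """Raised when a transfer falls outside the driver's declared window."""


class GuardedBus:
    """Software stand-in for hardware that checks each transfer against a window."""

    def __init__(self) -> None:
        self._windows: Dict[str, MemoryWindow] = {}
        self.blocked = 0  # count of rejected transfers, kept for reporting

    def register(self, driver: str, window: MemoryWindow) -> None:
        self._windows[driver] = window

    def transfer(self, driver: str, address: int, size: int) -> int:
        window = self._windows.get(driver)
        if window is None or not window.contains(address, size):
            self.blocked += 1
            raise TransferBlocked(f"{driver}: {size} bytes at {address:#x} rejected")
        return size  # number of bytes that would be transferred


if __name__ == "__main__":
    bus = GuardedBus()
    bus.register("nic", MemoryWindow(start=0x1000, length=0x100))
    print(bus.transfer("nic", 0x1000, 0x100))  # inside the window: allowed
    try:
        bus.transfer("nic", 0x10F0, 0x20)      # runs past the end of the window
    except TransferBlocked as exc:
        print("blocked:", exc)
```

The point of the model is that a misbehaving adapter fails one transfer instead of overwriting memory that belongs to the rest of the system.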

Each ftServer comes with two ftServer Management PCI adapters. These adapters are themselves board-level computers. They run independently of the host system and remain powered even if the rest of the system is powered off. Either of the redundant ftServer Management adapters provides full control over the ftServer. Access is controlled through a TCP/IP interface via dial-up modem or local Ethernet.

If a customer calls, Stratus will troubleshoot the problem. If the problem is in Microsoft Windows code, Stratus calls in Microsoft for support. Stratus has also licensed Windows source code and maintains a staff of kernel-trained engineers. Microsoft has also given Stratus access to its OS debugging tools.

Closing Thoughts

The new scientific, risk-based approach to regulating the pharmaceutical industry is changing the paradigm for compliance, inspection, and enforcement. It puts a premium on understanding your process, which requires real-time manufacturing information. Fault-tolerant systems can help ensure that the information is always available. Mission-critical automation systems, production management systems, business systems, and collaborative systems can all benefit from this technology.

When considering the impact of the PAT initiative, pharmaceutical manufacturers should re-examine the reliability, availability, and security of their entire automation infrastructure and architecture. Manufacturers should consider the low-cost technology for fault-tolerant platforms now available for Microsoft Windows environments, and should revisit some old assumptions about where they might benefit from deploying these platforms. Not only is the price point of these fault-tolerant systems very attractive compared to their decades-old predecessors, but there are also substantial cost savings compared to today's cluster-based solutions.

About the Authors:

Greg Gorbach: As ARC's Vice President, Collaborative Manufacturing and Architecture, Greg Gorbach is a thought leader in Collaborative Manufacturing and provides clients in a number of manufacturing vertical markets with strategic advice in dealing with boundary-crossing business processes. Greg's primary areas of focus are Collaborative Manufacturing, Production Management, Business Process Management, Manufacturing Performance Services, and the synchronization of plant systems with CRM, ERP, PLM, Supply Chain, and other business systems. He brings over twenty years of hands-on experience to ARC, with direct experience within manufacturing organizations, as well as extensive experience with suppliers to manufacturers.

John Blanchard: John is part of the manufacturing automation consulting group at ARC covering the food & beverage, life sciences, fine chemicals, and CPG industries. He concentrates on batch process automation, governmental regulations such as US FDA 21 CFR Part 11, industry automation requirements, and evolving issues and technologies affecting these industries. John has over 25 years of experience as both a user and a supplier in the food, beverage, and pharmaceutical industries as a manufacturing engineer, project engineer, project manager, industry marketing manager, and automation consultant.

Founded in 1986, ARC Advisory Group has grown to become the Thought Leader in Manufacturing and Supply Chain solutions. For even your most complex business issues, our analysts have the expert industry knowledge and firsthand experience to help you find the best answer. We focus on simple, yet critical goals: improving your return on assets, operational performance, total cost of ownership, project time-to-benefit, and shareholder value.

All information in this report is proprietary to and copyrighted by ARC. No part of it may be reproduced without prior permission from ARC.
ARC Advisory Group, Three Allied Drive, Dedham, MA USA
Tel: , Fax: , ggorbach@arcweb.com
Visit our web page at ARCweb.com
Copyright ARC Advisory Group ARCweb.com

3 ALLIED DRIVE, DEDHAM, MA USA
BOSTON, MA | PITTSBURGH, PA | PHOENIX, AZ | SAN FRANCISCO, CA | CAMBRIDGE, U.K. | DÜSSELDORF, GERMANY | MUNICH, GERMANY | HAMBURG, GERMANY | TOKYO, JAPAN | BANGALORE, INDIA