Accelerate and Operationalize AI Deployments Using AI-Optimized Infrastructure


Sponsored by: IBM
June 2018
Written by: Ritu Jyoti, Vice President

In the digital era, artificial intelligence (AI) has grabbed the center stage of business intelligence. While the power and promise of AI is exciting, organizations are challenged to successfully transition past proofs of concept to production and scale.

AT A GLANCE: KEY STATS
IDC predicts that by 2019, 40% of DX initiatives will use AI services; by 2021, 75% of commercial enterprise apps will use AI, over 90% of consumers will interact with customer support bots, and over 50% of new industrial robots will leverage AI.

Introduction

Digital transformation (DX) is reaching a macroeconomic scale. Intelligent applications based on artificial intelligence (AI), machine learning (ML), and deep learning (DL) are the next wave of technology transforming how consumers and enterprises work, learn, and play. Data is at the core of the new digital economy, but success also depends on how you sense the environment and manage data from edge to core to cloud, analyze it in near real time, learn from it, and then act on it to affect outcomes. The Internet of Things (IoT), mobile devices, big data, AI, ML, and DL all combine to continually sense and collectively learn from an environment. What differentiates winning organizations is how they leverage that learning to deliver meaningful, value-added predictions and actions for improving industrial processes, healthcare, experiential engagement, or any other kind of enterprise decision making. AI business objectives are balanced between tactical and strategic; they can range from improving operational efficiency to increasing competitive differentiation, and from maximizing existing product revenue to launching new digital revenue streams.
AI has grabbed the center stage of business intelligence, despite having been around for decades, due to the growing pervasiveness of data, the scalability of cloud computing, the availability of AI accelerators, and the sophistication of ML and DL algorithms. IDC predicts that by 2019, 40% of DX initiatives will use AI services; by 2021, 75% of commercial enterprise apps will use AI, over 90% of consumers will interact with customer support bots, and over 50% of new industrial robots will leverage AI. However, AI's critical role within almost half of DX initiatives will put new stress on the skills needed in IT. By 2020, IDC predicts that 85% of new operation-based technical hires will be screened for analytical and AI skills, enabling the development of data-driven DX projects. At the same time, CIOs must create and continuously enhance an integrated enterprise digital platform that will enable a new operating and monetization model. The IT organization will need to be among the enterprise's best and first use-case environments for AI in development, data management, and cybersecurity.

AI Model and Workload Deployments: Challenges and Needs

AI is changing the way business processes are carried out in the digital era. While the power and promise of AI is exciting, deploying AI models and workloads is not easy. Despite all the buzz, most organizations are struggling through proofs of concept (POCs), and only a few have made it to full production. At issue is the fact that ML and DL algorithms need huge quantities of training data (typically 8-10 times the volume used for traditional analytics), and AI effectiveness depends heavily on high-quality, diverse, and dynamic data inputs. Historically, data analytics centered on large files, sequential access, and batched data. Modern data sources and characteristics are different. Today, data consists of small to large files and structured, semistructured, and unstructured content. Data access varies from random to sequential. By 2025, more than a quarter of the global data set will be real time in nature, and real-time IoT data will make up more than 95% of it. In addition, data is increasingly distributed across on-premises, colocation, and public cloud environments. In January 2018, IDC surveyed 405 IT and data professionals in the U.S. and Canada who had successfully completed an AI project, had budget control or influence, and were responsible for evaluating or architecting a platform to run AI workloads. The survey sought to determine how organizations use and manage AI-enabled technologies and to identify the infrastructure used for running cognitive/ML/AI workloads, the deployment location of the technology, and the associated challenges and needs. As shown in Figure 1, survey respondents identified their key AI deployment challenges as dealing with massive data volumes and associated quality and management issues.
FIGURE 1: CHALLENGES DEPLOYING AI WORKLOADS
- Data volume and quality: 50%
- Advanced data management: 47%
- Skills gap: 44%
- Access to high degrees of compute power: 39%
- Algorithmic bias, privacy, and ethical concerns: 39%
- Concern about job loss: 28%
Source: Cognitive, ML, and AI Workloads Infrastructure Market Survey, IDC, January 2018; n = 405; 1,000+ employees (U.S.), 500+ employees (Canada)

Poor data quality correlates directly with biased and inaccurate model buildout. Ensuring data quality across large volumes of dynamic, diverse, and distributed data sets is difficult because developers cannot know, predict, and code for all the appropriate checks and validations. To address these challenges, enterprises want an autonomous data quality and validation solution. Such a solution would automatically learn data's expected behavior, create thousands of data validation checks without the need for coding, update and maintain the checks over time, and eliminate both anticipated and unanticipated data quality errors to make the data more trustworthy and usable.

Talent, meaning both AI engineers and data scientists, will be needed to support the growing segment of AI-dependent DX initiatives. However, IT organizations are grappling with a shortage of these professionals as well as a skills gap (see Figure 1). For example, building, optimizing, and training models involves a steep learning curve, and it requires a skill set most data scientists don't have. As businesses adopt AI, the following user personas are intricately involved:

» Data scientists are big data wranglers and model builders. They take a mass of data points (unstructured and structured) and use their formidable skills in math, statistics, and programming to clean, massage, and organize the data. They then apply all their analytic powers (industry knowledge, contextual understanding, skepticism of existing assumptions) to uncover hidden solutions to business challenges.
A data scientist may be required to:
o Extract, clean, and prune huge volumes of data from multiple internal and external sources
o Employ sophisticated analytics programs, ML, and statistical methods to prepare data for use in predictive and prescriptive modeling
o Explore and examine data to determine hidden weaknesses, trends, or opportunities
o Invent or build new algorithms and models to solve problems and build new tools to automate work
o Train, optimize, and deploy data-driven AI models at scale against the most pressing challenges
o Maintain the accuracy of AI models
o Communicate predictions and findings to management and IT through effective data visualizations and reports

» Data engineers/administrators build massive reservoirs for data. They develop, construct, test, and maintain architectures such as databases and large-scale data processing systems. Once continuous pipelines are installed to and from these huge "pools" of filtered information, data scientists can pull relevant data sets for their analyses.

» Data/IT architects integrate AI frameworks into the infrastructure deployment strategy and are responsible for supporting the scale, agility, and flexibility of the environment.
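The autonomous data quality approach described above can be illustrated with a minimal sketch in Python: learn a field's expected behavior (value range, null rate) from a sample of known-good records, then apply the generated checks to incoming data without hand-coding each rule. The field name, the 3-sigma range rule, and the null-rate tolerance here are illustrative assumptions, not details from the report.

```python
import statistics

def learn_checks(rows, field):
    """Learn simple expectations (value range, null rate) for one field
    from a sample of known-good records."""
    values = [r[field] for r in rows if r.get(field) is not None]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return {
        "min": mean - 3 * stdev,   # flag values beyond 3 sigma
        "max": mean + 3 * stdev,
        "max_null_rate": 1 - len(values) / len(rows),
    }

def validate(rows, field, checks):
    """Apply the learned checks to a new batch; return offending rows
    and whether the batch's null rate is abnormally high."""
    bad = [r for r in rows
           if r.get(field) is not None
           and not (checks["min"] <= r[field] <= checks["max"])]
    null_rate = sum(1 for r in rows if r.get(field) is None) / len(rows)
    return bad, null_rate > checks["max_null_rate"] + 0.05  # small tolerance

# Usage: learn from history, then screen incoming data.
history = [{"temp": t} for t in (20, 21, 19, 22, 20, 21, 20)]
checks = learn_checks(history, "temp")
bad, too_many_nulls = validate([{"temp": 20}, {"temp": 500}], "temp", checks)
```

A production solution would generate many such checks per column (type, cardinality, distribution drift) and refresh them over time, but the learn-then-validate loop is the core idea.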

When IDC asked these user personas for their top decision criteria in choosing an AI solution, they identified security, cost effectiveness, and operationalization (building, tuning, optimizing, training, deployment, and inferencing) of data models/intelligence, as shown in Figure 2.

FIGURE 2: DECISION CRITERIA FOR AI INFRASTRUCTURE/SOLUTIONS
Decision criteria, rated by data scientists and data architects/administrators: operationalizing data models and intelligence, meets compliance requirements, cost effectiveness, security, scalability, and flexibility.
Source: IDC, 2018

With all these factors in mind, the essential keys to successful AI deployments are:

» Data Scientists' Productivity
Building, testing, optimizing, training, inferencing, and maintaining the accuracy of models is integral to the AI workflow. These neural-network models are hard to build. To build, test, and deploy large-scale ML/DL models, data scientists normally use a variety of tools like RStudio and Spark with open source frameworks like TensorFlow and Caffe, as well as programming languages like R or Python. However, selecting and installing the open source frameworks and initiating the modeling processes can be cumbersome, and it may take weeks or months to get things working. Building and optimizing models can require manually testing thousands of combinations of hyperparameters. Training models may take weeks or months to complete in some use cases: for example, one healthcare organization took a year to build and train a medical model to detect an early-stage cancer. Training's iterative nature can require millions of jobs over hours, days, or weeks. Currently, a job must run to completion before anyone can see whether training succeeded, which means organizations might run a training job for a week before realizing it did not converge. Training jobs must also be restarted from scratch if a server/GPU fails.
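The two pain points just described (discovering non-convergence only after a long job completes, and restarting from scratch after a server/GPU failure) can both be mitigated by adding checkpointing and convergence monitoring to the training loop. The plain-Python sketch below uses a toy loss sequence to show the idea; a real job would checkpoint model weights and optimizer state, not just a loss counter.

```python
import json
import os
import tempfile

def train(step_fn, total_steps, ckpt_path, patience=3, tol=1e-4):
    """Checkpointed training loop: resumes from the last checkpoint after a
    failure and stops early if the loss stops improving, so non-convergence
    is caught quickly instead of after a week-long run."""
    state = {"step": 0, "loss": float("inf"), "stale": 0}
    if os.path.exists(ckpt_path):                      # resume after a crash
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        loss = step_fn(state["step"])
        state["stale"] = 0 if loss < state["loss"] - tol else state["stale"] + 1
        state["loss"] = min(state["loss"], loss)
        state["step"] += 1
        with open(ckpt_path, "w") as f:                # persist progress
            json.dump(state, f)
        if state["stale"] >= patience:                 # not converging: stop
            return state, "stopped_early"
    return state, "finished"

# Usage with a toy loss that plateaus after step 5.
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
losses = [1.0, 0.8, 0.6, 0.5, 0.45, 0.44, 0.44, 0.44, 0.44, 0.44]
state, status = train(lambda s: losses[s], total_steps=10,
                      ckpt_path=path, patience=3)
```

If the process dies mid-run, rerunning the same call picks up from the saved step instead of starting over, which is the behavior the workload-management tools discussed here aim to provide across a cluster.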

Data scientists look for efficiency and automation in cleaning data from multiple sources and reducing the noise. They also need help in building, tuning, and choosing the right features for the model, including simplification in determining hyperparameter settings. Data scientists look for speed and agility in iterative and cyclic processes. They also look for superior performance and flexible scaling of infrastructure resources during the training phase to cut down long training times, and for tools to easily run jobs in a distributed cluster environment. There is a need for agile workload management, especially the ability to run jobs more efficiently to maximize resource utilization for performance and cost efficiency, and to visualize and stop jobs if they are not converging. Software that allows jobs to automatically roll to a different server/GPU if the previous one fails is also needed.

» Optimized Infrastructure and Efficient Data Management
Examining the data pipeline for AI workloads shown in Figure 3, we see that the application profiles, compute profiles, and I/O profiles change from ingestion to real-world inferencing. ML and DL need huge quantities of training data. Both training and inferencing are compute intensive and require high performance for fast execution. AI applications push the limits on thousands of GPU cores or thousands of CPU servers. AI and DL require a new class of accelerated infrastructure primarily based on GPUs. For the linear math computations needed to train neural network models, a single system configured with GPUs is significantly more powerful than a cluster of nonaccelerated systems. However, not all AI deployments are the same. Organizations should explore heterogeneous processing architectures (e.g., GPUs, FPGAs, ASICs, or manycore processors) based on the performance, operating environment, required skill sets, costs, and energy demands of their AI deployments.
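One common way to simplify hyperparameter determination, as called for above, is to replace manual exploration of thousands of combinations with automated random search. The sketch below uses a stand-in objective function; the parameter names and search space are illustrative assumptions, and production tools would train a real model per trial (often with smarter strategies such as Bayesian optimization).

```python
import random

# Hypothetical search space for a small neural network; the parameter
# names are illustrative, not taken from the report.
SPACE = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "hidden_units": [64, 128, 256],
}

def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials random configurations and keep the best score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Stand-in objective: a real one would train and validate a model,
# which is why each trial is expensive and automation matters.
def fake_accuracy(cfg):
    return 0.9 - abs(cfg["learning_rate"] - 1e-3) - 1.0 / cfg["hidden_units"]

best, score = random_search(fake_accuracy, SPACE, n_trials=20)
```

Because each trial is independent, this loop also parallelizes naturally across a cluster, which is exactly the kind of job the workload managers discussed in this paper are designed to schedule.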
FIGURE 3: DATA PIPELINE FOR AI WORKLOADS
Stage: compute profile / I/O profile
- Ingest: normal compute / write heavy
- Transform: normal compute / read+write heavy
- Explore: compute intensive / read heavy
- Train: compute intensive / read+write heavy
- Real-world inference: compute intensive / read heavy, concurrent access
Source: IDC, 2018

We also know that parallel compute demands parallel storage. While the training phase requires large data stores, inferencing has less need for them. The inference models are often stored in a DevOps-style repository, where they benefit from ultra-low-latency access. Although the execution model is set once it has been developed from the data and the workload has moved to the inferencing stage, re-training of the model is often needed as new or modified data comes to light. In some cases, the

real-time nature of the application may require near constant re-training and updating of the model. Organizations also may benefit from re-training the model over time as additional data sources and insights come into play. If data doesn't flow smoothly through the pipeline, productivity will be compromised and organizations will need to commit increasing effort and resources to managing the pipeline. Data architects and engineers are challenged to ensure the agility, flexibility, scale, performance, security, and compliance requirements of AI workloads while keeping costs in check. Enterprises clearly cannot support a cutting-edge tool like AI on legacy infrastructure that struggles to meet the required scale, elasticity, compute power, performance, and data management needs. Organizations are now using different infrastructure solutions and approaches to support the data pipeline for AI, a process that generally leads to data silos. Some create duplicate copies of the data for the pipeline so as not to disturb stable applications. Instead, organizations need to adopt infrastructure that is dynamically adaptable, scalable, and intelligent (self-configuring, self-optimizing, and self-healing). Such an infrastructure is tuned for varied data formats and access patterns, and it can process and analyze large volumes of data. It also possesses the speed to support faster compute calculations and decision making, manage risks, and reduce the overall costs of AI deployments.

» Enterprise Readiness
Businesses are always concerned about the impact of bringing emerging technologies and frameworks into an enterprise setting. They are looking for enterprise readiness: security, reliability, support, and the other criteria noted in Figure 2. Most of the AI/ML/DL frameworks, tool kits, and applications available do not implement security, relegating their use to disconnected experiments and lab implementations.
Also, most companies choose to invest in separate clusters to run AI, which is costly and inefficient. An additional challenge with DIY-built systems is the difficulty of getting enterprise-grade support from multiple vendors.

Considering IBM for AI/ML/DL Workloads

IBM's strategy is to make AI/ML/DL more accessible and more performant. By combining IBM PowerAI, IBM Spectrum Conductor Deep Learning Impact, and IBM Spectrum Scale software with IBM Power Systems and IBM Elastic Storage Server (ESS), organizations can quickly deploy an optimized, fully supported, high-performance platform for AI/ML/DL workloads. This approach lessens the challenges of pulling a solution together from open source. All the components of IBM's solution are sourced, acquired, and fully backed and supported by the company (including level 1-3 support). If support is required, the same single point of contact is used for all the IBM components, and all cases are owned and managed by IBM. The company also manages all software patches and upgrades to the systems. With this solution, organizations can simplify the development experience and reduce the time required for training the AI model. The solution also enables a productive POC and allows growth into a multitenant system in which multiple data scientists running different environments and frameworks can seamlessly share a common set of resources that scale to serve an organization, while integrating the platform into existing IT infrastructure. IBM has one of the most comprehensive solution stacks, with tools and software for all the critical personas of AI deployments, including data scientists. Unlike the competition, IBM has a

diverse and growing set of enterprise customers, including financial organizations, that run their AI deployments in production environments with full confidence in the security and data privacy functionality of the system. The IBM Reference Architecture shown in Figure 4 consists of the following software components and infrastructure stack:

» IBM PowerAI is a software distribution package containing major open source DL frameworks for model training, such as TensorFlow and Caffe, and their associated libraries. These frameworks deliver superior throughput and performance because they are optimized to take advantage of the NVLink interconnect between CPU and GPU on IBM Power Systems servers.

» IBM Spectrum Conductor is a highly available multitenant application designed to build a shared, enterprise-class environment for deploying and managing modern computing frameworks and services such as Spark, Anaconda, TensorFlow, Caffe, MongoDB, and Cassandra. Data transformation and preparation is a highly manual set of steps: identifying and connecting to data sources, extracting to a staging server, and using tools and scripts to manipulate the data (e.g., removing extraneous elements, breaking large images down to "tile" size so they fit in GPU memory). IBM Spectrum Conductor, which provides a distributed Apache Spark and application environment, helps automate these tasks and run them in parallel to speed up the process. It establishes and maintains connections to storage resources and captures data formatting information to enable rapid iteration through the end-to-end DL process. Spectrum Conductor also provides centralized management and monitoring, and it implements end-to-end security for running the frameworks and applications within its context, creating a secure AI/ML/DL environment. Security is implemented around:
o Authentication: Support for Kerberos, SiteMinder, AD/LDAP, and OS authentication is provided, as is Kerberos authentication for HDFS.
o Authorization: This feature allows for fine-grained access control: ACL/role-based access control (RBAC), Spark binary life cycle, notebook updates, deployments, resource planning, reporting, monitoring, log retrieval, and execution.
o Impersonation: Different tenants can define production execution users.
o Encryption: SSL and authentication between all daemons is enabled.

» IBM Spectrum Conductor Deep Learning Impact builds a DL environment with an end-to-end workflow that allows data scientists to focus on training, tuning, and deploying models into production. The build/training stage of the AI/ML/DL process is compute heavy and very iterative, and expertise in model tuning and optimization is scarce. This is the stage where an organization's gaps in DL and data science skills hurt the most. PowerAI and Spectrum Conductor Deep Learning Impact assist in the selection and creation of models, using cognitive algorithms to suggest and optimize the hyperparameters. They also support elastic training to allow flexible allocation of resources at and during runtime, which provides the ability to dynamically share resources, prioritize one job over another, and enable resiliency to failure. Runtime training visualization lets the data scientist see the model's progress and provides the opportunity to stop training if it is not delivering the right results, which helps produce more accurate neural models faster.

» IBM Spectrum Scale is an enterprise-grade parallel file system that provides resiliency, scalability, and control, as well as security with storage encryption. Spectrum Scale delivers scalable capacity and performance to handle demanding data analytics, content repositories, and technical computing workloads. Storage administrators can combine flash, disk, cloud, and tape storage into a unified system with higher performance and lower cost than traditional approaches.

FIGURE 4: IBM REFERENCE ARCHITECTURE
IBM AI Architecture from Experimentation to Production. The architecture progresses through three stages: Experimentation (single tenant), Stabilization & Production (secure multitenant), and Expansion (enterprise scale / multiple lines of business). It combines existing organization infrastructure and data scientists' workstations with a high-speed network subsystem; master, failover master, and login nodes (IBM Power Systems LC921 & LC922); a training cluster and a training & inference cluster (IBM Power Systems AC922, LC921 & LC922 with internal SAS drives and NVMe); and IBM Elastic Storage Server (ESS). One software stack spans experimentation to production: IBM Spectrum Conductor Deep Learning Impact, IBM PowerAI (optimized OSS frameworks), OSS ML/DL frameworks/BYOF, Red Hat Linux/Docker, and IBM Spectrum Scale.
Source: IBM

» Infrastructure for the solution consists of the following elements:
o Compute: The IBM Power Systems servers feature a CPU:GPU NVLink connection, which delivers higher I/O bandwidth than x86 servers. They are also capable of supporting large amounts of system memory.
» IBM Power System AC922 supports 2-6 NVIDIA Tesla V100 GPUs with NVLink, providing CPU:GPU bandwidth of 100 GB/sec (air cooled) or 150 GB/sec (water cooled).
This system supports up to 2TB of total memory.
» IBM Power System S822LC for HPC supports 2-4 NVIDIA Tesla P100 GPUs with NVLink, providing CPU:GPU bandwidth of 64 GB/sec. This system supports up to 1TB of total memory.
o Storage: IBM ESS combines IBM Spectrum Scale software with IBM POWER8 processor-based I/O-intensive servers and dual-ported storage enclosures. IBM Spectrum Scale is the parallel file system at the heart of IBM ESS; it extends system throughput as the system grows while providing a single namespace.

Challenges and Opportunities

According to IDC, public cloud leads in the deployment of AI models and workloads, closely followed by private cloud deployments. The decision to run the AI pipeline on public cloud versus on-premises is typically driven by data gravity: where the data currently exists or is likely to be stored. Easy access to compute resources and applications, as well as the speed with which the capabilities need to be explored and deployed, are also important factors. Public cloud services provide data scientists and IT professionals with the infrastructure and tools to train AI models, experiment with new algorithms, or learn new skills and techniques in an easy and agile way. In the longer term, organizations will likely deploy model training on-premises, driven by the desire to hold on to their IP and insights. Edge deployments are in their infancy. Resources at the edge are scarce, yet in some situations it is imperative that processing happen there. For example, a temperature sensor on a critical machine on a factory floor sends readings associated with imminent failure to the edge infrastructure, the readings are quickly analyzed, and a technician is dispatched to repair the machine in time to avoid a costly shutdown. In IDC's opinion, IBM's Reference Architecture for AI is an optimized software and hardware stack. It consists of tested, supported, and ready-to-use open source frameworks; high-throughput GPUs; and storage that is intelligent, scalable, secure, metadata rich, cloud integrated, multiprotocol, high performing, and efficient. As stated, the solution can reduce the complexity of AI deployments and help organizations improve productivity and efficiency, lower acquisition and support costs, and accelerate adoption of AI.
Potential opportunities to enhance the solution include the following:

» Seamless integration with multiple public cloud services such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure, along with IBM Cloud, in ever-expanding hybrid and multicloud deployments.
» Support for smaller form-factor edge infrastructure that can hold temporal data for edge inferencing.
» Support for heterogeneous processing architectures (e.g., GPUs, FPGAs, ASICs, or manycore processors) to give organizations the flexibility to select the appropriate acceleration technology based on the performance, operating environment, required skill sets, costs, and energy demands of their AI deployments.

Conclusion

Running applications infused with AI/ML/DL algorithms to achieve superior business outcomes is critical for enterprises worldwide and necessary for the bulk of DX efforts and use cases. To help organizations accelerate AI-driven business outcomes and overcome deployment obstacles, IDC offers the following guidance:

» Focus on business outcomes, keep the project timeline well defined, and prioritize projects with immediate revenue and cost impact.
» Seek out software tools that simplify and automate data preparation and accelerate the iterative building, training, and deployment of AI models to drive improved business outcomes.

» Look for dynamically adaptable, simple, flexible, secure, cost-efficient, and elastic infrastructure that can support high capacity along with the high throughput and low latency needed for a high-performance training and inferencing experience.
» Embrace intelligent infrastructure, leverage it for predictive analytics and valuable insights, and then slowly phase in task automation once the trustworthiness and quality of the data are established.

About the analyst: Ritu Jyoti, Vice President

Ritu Jyoti is Vice President, Systems Infrastructure Programs, for IDC's Enterprise Storage, Server, and Infrastructure Software team, which includes research offerings and quarterly trackers as well as advisory services and consulting programs. Ms. Jyoti's core research coverage includes a cognitive/artificial intelligence, big data, and analytics workloads service that looks at the impact of new machine learning, Hadoop, NoSQL database, and analytics technologies on infrastructure software and hardware markets and on digital transformation and IT transformation data infrastructure strategies.

IDC Corporate USA, 5 Speen Street, Framingham, MA 01701, USA | idc-insights-community.com

This publication was produced by IDC Custom Solutions. The opinion, analysis, and research results presented herein are drawn from more detailed research and analysis independently conducted and published by IDC, unless specific vendor sponsorship is noted. IDC Custom Solutions makes IDC content available in a wide range of formats for distribution by various companies. A license to distribute IDC content does not imply endorsement of or opinion about the licensee.

External Publication of IDC Information and Data: Any IDC information that is to be used in advertising, press releases, or promotional materials requires prior written approval from the appropriate IDC Vice President or Country Manager. A draft of the proposed document should accompany any such request.
IDC reserves the right to deny approval of external usage for any reason. Copyright 2018 IDC. Reproduction without written permission is completely forbidden.