Cognitive, AI and Analytics: examples, trends and directions
Ulrich Walter, Cognitive Systems HPC & Cloud Sales Leader
Hanover, 22.12.2017
Overall Artificial Intelligence (AI) Space

- Artificial Intelligence: human intelligence exhibited by machines
- Machine Learning: trained using large amounts of data, with the ability to learn how to perform the task
- Deep Learning: IT systems break tasks down using artificial neural networks

New data sources: NoSQL, Hadoop & analytics

New class of applications:
- Machine learning & training
- Pattern matching (image), real-time decision support
- Complex workflows, data lakes

Extend enterprise applications:
- Finance: fraud detection / prevention
- Retail: shopping advisors
- Healthcare: diagnostics and treatment
- Supply chain and logistics

Extend predictive analytics to advanced analytics with AI
Growing across compute, middleware, and storage
Deep Learning in a Nutshell

Shallow (supervised) machine learning pipeline:
- Feature extraction is done by human experts; it is very difficult to find robust mathematical representations
- The extracted feature vector (e.g. 0.34, 9.34, 1.45, 0.01, 2.55) is fed into a learning model that outputs a label such as "coffee mug"
Deep Learning in a Nutshell

Deep machine learning pipeline:
- Unstructured data is fed directly into the learning pipeline; feature extraction and modeling are learned jointly
- The network outputs a semantic label, e.g. "coffee mug"
- This end-to-end optimization problem is solved by neural networks with many layers
Current State of DL Infrastructure

Open source software:
- Caffe (Berkeley)
- Torch
- Theano
- TensorFlow (Google)
- CNTK (Microsoft)
- DSSTNE (Amazon)
- ...

https://www.bloomberg.com/news/articles/2016-07-21/google-sprints-aheadin-ai-building-blocks-leaving-rivals-wary
Impacts on Sizing

- Neural network size
- Time
- Data volume
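These three factors multiply: a rough, hedged back-of-envelope estimate of training time can be formed from network size (parameters), data volume (samples × epochs), and sustained throughput. All numbers below are illustrative assumptions, not figures from the slides.

```python
# Back-of-envelope training-time estimate (illustrative assumptions only):
# a forward pass costs roughly 2 * parameters FLOPs per sample, and the
# backward pass brings the total to roughly 3x the forward cost.

def training_time_hours(params, samples, epochs, sustained_tflops):
    """Estimate wall-clock training hours from model size, data volume and throughput."""
    flops = 3 * 2 * params * samples * epochs   # forward + backward passes
    seconds = flops / (sustained_tflops * 1e12)
    return seconds / 3600

# Hypothetical example: 25M-parameter network, 1.2M samples, 90 epochs,
# 10 TFLOPS sustained on a GPU-accelerated node
est = training_time_hours(25e6, 1.2e6, 90, 10)
```

Doubling any single factor doubles the estimate, which is why network size, time, and data volume have to be sized together.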
For machine learning you need the following components:

1. A large set of tagged data
2. A neural network with an input layer, one or more hidden layers, and an output layer
3. An HPC server with GPUs and extremely high internal bandwidth
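The layer structure named above can be sketched in a few lines of plain Python: an input vector, one hidden layer, and an output layer wired as a feed-forward network. The weights and sizes here are illustrative, not a real trained model.

```python
import math
import random

def sigmoid(x):
    """Standard logistic activation."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_output):
    """One forward pass: input layer -> hidden layer -> output layer."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in w_hidden]
    return [sigmoid(sum(w * h for w, h in zip(ws, hidden))) for ws in w_output]

random.seed(0)
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 3 inputs -> 4 hidden units
w_output = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 4 hidden -> 2 outputs
out = forward([0.34, 9.34, 1.45], w_hidden, w_output)
```

Training adjusts `w_hidden` and `w_output` against the tagged data; the GPU server matters because real networks apply this arithmetic to millions of weights at once.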
Convolutional Neural Networks (CNN)

In a CNN, multiple convolutional layers transform raw data into iterated (intermediate) representations and finally into tagged data, e.g. distinguishing elephants from chairs.
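The core operation of each convolutional layer is a small filter sliding over the raw data. A hedged sketch of a single 2D convolution (valid padding, stride 1) in pure Python, with an illustrative edge-detecting kernel:

```python
# One 2D convolution step, as used inside each CNN layer (illustrative sketch).

def conv2d(image, kernel):
    """Slide `kernel` over `image` (valid padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A vertical-edge detector applied to a tiny image whose right half is bright
image = [[0, 0, 1, 1]] * 4
kernel = [[-1, 1], [-1, 1]]          # responds where brightness jumps left-to-right
feature_map = conv2d(image, kernel)  # peaks in the middle column, at the edge
```

Stacking many such layers, each feeding its feature maps into the next, is what turns raw pixels into the iterated representations shown on the slide.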
Sic Transit Gloria Mundi

- Google Brain, 2012: 16,000 servers, ~8 MW, ~50 TFLOPS
- 3 NVIDIA Pascal GPUs: ~0.9 kW, ~62 TFLOPS
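The point of the comparison is energy efficiency, which can be checked with simple arithmetic using the figures quoted on the slide:

```python
# Performance-per-watt comparison using the figures quoted on the slide.
brain_2012 = {"tflops": 50, "kw": 8000.0}  # ~16,000 servers, ~8 MW
pascal_box = {"tflops": 62, "kw": 0.9}     # 3 NVIDIA Pascal GPUs

def tflops_per_kw(system):
    return system["tflops"] / system["kw"]

improvement = tflops_per_kw(pascal_box) / tflops_per_kw(brain_2012)
# roughly four orders of magnitude better energy efficiency in four years
```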
Some Principles of AI

- Collect / detect data: text, voice, image & video, sensors
- Store data: compression, MapReduce
- Analyse: semantics, syntax, sentiment, NLP, image recognition, analytics, sensor data analytics
- Consolidated response or analysis: recommended action (human intervention) or automated action

Infrastructure: data collection, storage and distribution; PowerAI framework; storage nodes; InfiniBand EDR 100 Gbit switch fabric; complementing IBM Watson
Leveraging the first CPU designed for accelerated computing

POWER8:
- Faster cores than x86
- Larger caches per core than x86
- 5X faster CPU-GPU data communication
- High-performance cores, fast & large memory system
- Fast PowerAccel interconnects for accelerators: CAPI, NVLink, PCIe
Introducing the IBM Power Systems LC Line
OpenPOWER servers for cloud and cluster deployments that are different by design

- S812LC: storage-rich single-socket system for big data applications; ideal for storage-centric and high-data-throughput workloads
- S822LC for Big Data: two POWER8 sockets for big data workloads; big data acceleration with CAPI and GPUs
- S822LC for High Performance Computing (NEW): incorporates the new POWER8 processor with NVIDIA NVLink; delivers 2.8X the bandwidth to GPU accelerators; up to 4 integrated NVIDIA Pascal GPUs
- S821LC: two POWER8 sockets in a 1U form factor; ideal for environments requiring dense computing; 2X the memory bandwidth of Intel x86 systems; memory-intensive workloads

2016 IBM Corporation
Power S822LC for HPC (aka Minsky) vs x86 with P100 GPUs

- 2.8X the CPU-GPU bandwidth compared to x86-based systems; the S822LC for HPC's CPU-GPU NVLink capability is not available on x86 servers
- ~13% faster than any PCIe platform with 4 GPUs; the S822LC for HPC packaging allows for higher power/frequency

Performance comparisons:
- Kinetica: 2.7X vs x86 with 4 PCIe-based P100s
- CPMD: 3X the performance of the CPU-only implementation (the first ever GPU-accelerated version of CPMD)
- NAMD: 30% increase when combined with visualization code
PowerAI takes advantage of NVLink between POWER8 & P100 to increase system bandwidth

- NVLink between CPUs and GPUs enables fast memory access to large data sets in system memory
- Two NVLink connections between each GPU and CPU lead to faster data exchange: 80 GB/s over NVLink between each POWER8 CPU and its Tesla P100 GPUs, alongside 115 GB/s from each CPU to system memory
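Why the link bandwidth matters can be seen with a simple staging estimate. The NVLink figure (80 GB/s) is from the slide; the PCIe Gen3 x16 figure and the data set size are assumptions for illustration:

```python
# Hedged estimate of staging a data set from system memory into GPU memory.

def transfer_seconds(gigabytes, gb_per_s):
    return gigabytes / gb_per_s

dataset_gb = 512                            # hypothetical working set in system memory
nvlink = transfer_seconds(dataset_gb, 80)   # 80 GB/s CPU-GPU NVLink, per the slide
pcie   = transfer_seconds(dataset_gb, 16)   # ~16 GB/s PCIe Gen3 x16 (assumed)
speedup = pcie / nvlink                     # matches the "5X faster" claim earlier
```

For training jobs that repeatedly stream batches larger than GPU memory, this staging time recurs every epoch, which is where the NVLink advantage compounds.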
Nvidia P100 Pascal GPUs inside a Minsky system
POWER8 CAPI: Coherent Accelerator Processor Interface
Compute Node: IBM Power S822LC for HPC

- 2× POWER8+ CPUs, 8 or 10 cores each, connected by POWER8 SMP links (3 × 12.8 GB/s)
- 115 GB/s memory bandwidth per CPU (4 lanes per CPU)
- 4× NVIDIA Tesla P100 GPUs, attached via NVLink (40 GB + 40 GB bidirectional per link)
- PEX/CAPI attach on each CPU
- IB EDR adapter: 2 × 100 Gbit
- On board: 4 × 10 Gbit Ethernet
- Storage: SSD or SAS; NVMe 1.6 TB
This illustrates the timing and data-bandwidth effects that are important for optimizing the runtime of AI training jobs: the NVLink effect.
Built with Collaborative Innovation

- OpenPOWER: 299 OpenPOWER members contribute 87 OpenPOWER-ready products and 17 servers, delivering choice to the industry
- Nvidia: close partnership with a major AI/accelerator industry leader
- Open source workloads: now hyper-focused on expanding cognitive/AI industry applications; enterprise support/subscription model
- OpenCAPI: high-bandwidth open interconnect to attach accelerators and SCM
- Open frameworks: highly optimized & accelerated cognitive/AI frameworks; cognitive/AI SDK with development and deployment tools
Problem Identification

1. Is it a machine perception problem? If no, look at other approaches.
2. Is there sufficient data to train on? If no, gather more data.
3. Data transformation: align relevant data sets to a standard schema using big data ETL middleware.
4. Select and define the training algorithm.
5. Training and inference: execute training models; evaluate results and fine-tune the algorithms.
6. Deploy for production.
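The decision flow above can be sketched as a small function; the step labels are illustrative paraphrases of the slide, not a real API:

```python
# The problem-identification flow, sketched as a function (labels illustrative).

def plan_project(is_perception_problem, has_sufficient_data):
    """Return the next action (or remaining steps) for an AI project."""
    if not is_perception_problem:
        return "look at other approaches"
    if not has_sufficient_data:
        return "gather more data"
    return [
        "align data sets to a standard schema (ETL)",
        "select and define training algorithm",
        "train, evaluate results, fine-tune",
        "deploy for production",
    ]
```

The two gating questions come first deliberately: both are cheap to answer and either one can save the cost of the entire training pipeline.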
Several Options to Realize Performance Enhancements via GPU Acceleration

From easiest to use to best application performance:

1. Libraries (ESSL/PESSL; NVIDIA libraries such as the math library, cuBLAS, NPP): easy to implement, tested and supported; limited, so your needs may not be covered.
2. Programming models supporting directives (OpenACC, OpenMP): modify existing programs with directives; the compiler assists with mapping to the device.
3. A programming language that targets the GPU (CUDA): most time-intensive and requires expertise, but achieves the best performance results.
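The library-vs-hand-written end of that spectrum can be felt even without a GPU. As a plain-Python analogy (not GPU code), compare calling a tuned, tested library routine with writing the same reduction loop yourself:

```python
# Analogy for option 1 vs option 3: library call vs hand-written kernel.
import math

data = [0.1] * 1000

# Library route: one call, tested and numerically careful
lib_sum = math.fsum(data)

# Hand-written route: full control, more effort, and subtle pitfalls
# (naive accumulation drifts through floating-point rounding)
hand_sum = 0.0
for x in data:
    hand_sum += x
```

The hand-written loop is close but not identical to the library result, which mirrors the slide's point: hand-coding (CUDA) buys control and peak performance, but libraries deliver correctness and support with far less effort.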
Deep learning in action
Thank you!
ibm.com/systems/hpc