RAPIDS GPU POWERED MACHINE LEARNING

Richard Cox
5 years ago
1 RAPIDS GPU POWERED MACHINE LEARNING

2 RISE OF GPU COMPUTING
[Chart: performance over time on a log scale. GPU-Computing perf: 1.5X per year, 1000X by 2025, compared with Single-threaded perf. Layers: APPLICATIONS, ALGORITHMS, SYSTEMS, CUDA, ARCHITECTURE]
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for by K. Rupp

3 EXTENDING DL BIG DATA ANALYTICS
From Business Intelligence to Data Science
[Diagram: DATA SCIENCE and ARTIFICIAL INTELLIGENCE. Analytics; Traditional Machine Learning (regressions, decision trees, graph) on TABULAR/SPARSE DATA; Deep Learning on DENSE DATA TYPES (images, video, voice)]

4 USE CASES IN EVERY INDUSTRY
CONSUMER INTERNET: Personalized recommendations to drive viewership; Optimized ad targeting; Preventing churn by identifying factors that influence loyalty
RETAIL: Inventory forecasting; Personalized recommendations; Optimized pricing and promotions; Preventing credit card fraud and cyber attacks
FINANCIAL SERVICES: Personalized guidance on financial products; Return optimization based on market signals; Fraud detection
HEALTHCARE: Better disease prediction with genomic medicine; Improved health outcomes with analysis of EMRs; Predictive care/treatment

5 TODAY'S DATA SCIENCE STIFLES INNOVATION, i.e. HURRY UP AND WAIT
Stages: Manage Data, Training, Evaluate, Deploy
Pipeline: All Data, ETL, Structured Data Store, Data Preparation, Model Training, Visualization, Inference
Slow Training Times for Data Scientists

7 DATA SCIENCE CHALLENGES
SLOW TRAINING: 30+ Hours to build GBDT
SLOW DATA PROCESSING: Days for Data Transformation, Weeks for Feature Engineering, Months for Scoring Pipelines
ESCALATING TCO: More servers and infrastructure yielding diminishing performance returns

8 XGBOOST
Definition: XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. It is a powerful tool for solving classification and regression problems in a supervised learning setting.
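To make "gradient boosted decision trees" concrete, here is a minimal pure-Python sketch of boosting for regression. It is not XGBoost's implementation: it assumes squared-error loss, a single feature, and depth-1 trees (stumps), and the toy data, function names, and parameter values are all hypothetical. The idea it illustrates is that each new tree is fitted to the residuals left by the trees before it.

```python
# Gradient boosting for regression with depth-1 trees ("stumps"), squared loss.
# Sketch only: XGBoost adds regularization, second-order gradient information,
# and many systems-level optimizations that are not shown here.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        if not left or not right:
            continue  # a split must put at least one point on each side
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]  # assumes xs holds at least two distinct values


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x < threshold else right_value


def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit each new stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict_boosted(model, x):
    total = sum(predict_stump(s, x) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * total


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Hypothetical toy data: one feature, one numeric target.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = fit_boosted(xs, ys)
baseline_mse = mse([model["base"]] * len(ys), ys)
boosted_mse = mse([predict_boosted(model, x) for x in xs], ys)
print(f"baseline MSE: {baseline_mse:.4f}, boosted MSE: {boosted_mse:.4f}")
```

The learning rate shrinks each stump's contribution, so more rounds are needed but each step is less likely to overshoot.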

9 PREDICT: WHO ENJOYS COMPUTER GAMES Example of Decision Tree Source:

10 COMBINE TREES FOR STRONGER PREDICTIONS Example of Using Ensembled Decision Trees Source:
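The two slides above can be sketched as follows: each decision tree assigns a score to a person, and the ensemble prediction is the sum of the per-tree scores. The feature names, thresholds, and leaf scores here are hypothetical illustration values, not taken from the slides.

```python
# Each tree maps a person to a score; the ensemble prediction is the sum.
# Features, thresholds, and leaf scores are hypothetical illustration values.

def tree_1(person):
    """First tree: split on age, then on gender for the younger group."""
    if person["age"] < 15:
        return 2.0 if person["is_male"] else 0.1
    return -1.0


def tree_2(person):
    """Second tree: a single split on daily computer use."""
    return 0.9 if person["uses_computer_daily"] else -0.9


def ensemble_score(person, trees=(tree_1, tree_2)):
    """Higher score means more likely to enjoy computer games."""
    return sum(tree(person) for tree in trees)


boy = {"age": 10, "is_male": True, "uses_computer_daily": True}
grandpa = {"age": 70, "is_male": True, "uses_computer_daily": False}

for name, person in [("boy", boy), ("grandpa", grandpa)]:
    print(name, round(ensemble_score(person), 2))
```

Neither tree is a strong predictor on its own; summing them lets the second tree correct cases the first one scores poorly.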

11 RAPIDS OVERVIEW

12 RAPIDS: GPU Accelerated Data Science
RAPIDS is a set of open source libraries for GPU-accelerating data preparation and machine learning.
OSS website:

13 RE-IMAGINING DATA SCIENCE WORKFLOW
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA
cuDF: Data preparation / wrangling
cuML: Optimized ML model training
Visualization: Data visualization libraries, data insights

14 RAPIDS LIBRARIES
cuDF: GPU-accelerated software for data manipulation and data preparation. Accelerates loading, filtering, and manipulation of data to prepare it for model training. Python drop-in Pandas replacement built on CUDA C++.
cuML: GPU-accelerated traditional machine learning libraries. XGBoost, Kalman, K-means, KNN, DBSCAN, PCA, TSVD and more.
cuGraph: Collection of graph analytics libraries. Coming soon.

15 RAPIDS: OPEN GPU DATA SCIENCE
Software Stack
[Diagram: Data Preparation (cuDF), Model Training (cuML), Visualization (cuGraph). Stack components: Python, Dask, RAPIDS (cuDF, cuML, cuGraph), deep learning frameworks (cuDNN), CUDA, Apache Arrow on GPU memory]

16 THE RAPIDS VALUE PROPOSITION
High Performance, Easy-to-use
Data Scientist
Reduced Training Time: Drastically improve your productivity with near-interactive data science
Top Model Accuracy: Increase machine learning model accuracy by iterating on models faster and deploying them more frequently
Hassle-Free Integration: Accelerate your Python data science toolchain with minimal code changes and no new tools to learn
Data Science Leader
Increased Data Scientist Productivity: Reduce training time, allow data scientists to be more productive
Open Source: Customizable, extensible, interoperable; the open-source software is supported by NVIDIA and built on Apache Arrow
TCO Reduction: Decrease the server costs, footprint, and power consumption of your ML workloads, reducing the TCO

17 RAPIDS DEPLOYMENT STACK
TARGET INDUSTRIES: Retail, Finance, CICN, Healthcare
TARGET AUDIENCE AND RECOMMENDED SYSTEMS
Individual Data Scientist: Quadro GV100 WS (2 GV100, NVLink); DGX Station (4 V100, NVLink); Cloud (V100 Cloud Instances)
Shared Infrastructure For Data Scientists: V100 Servers (4-8 V100, NVLink, HGX-1, HGX-2); DGX-1 (8 V100, NVLink); DGX-2 (16 V100, NVLink); Cloud (V100 Cloud Instances)

18 PILLARS OF RAPIDS PERFORMANCE
CUDA Architecture: Massively parallel processing
NVLink/NVSwitch: High-speed connection between GPUs for distributed algorithms
Memory Architecture: Large virtual GPU memory, high-speed memory
[Diagrams: 6x NVLink and NVSwitch; stacked DRAM core dies on a base die with TSVs and µ-bumps]

19 DESIGNED TO DO THE PREVIOUSLY IMPOSSIBLE
NVIDIA Tesla V100 32 GB Tensor Core GPUs
Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB Total HBM2 Memory, interconnected by Plane Card
Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
Eight EDR InfiniBand/100 GigE: 1600 Gb/sec Total Bi-directional Bandwidth
PCIe Switch Complex
30 TB NVMe SSDs Internal Storage
Two Intel Xeon Platinum CPUs
TB System Memory
Dual 10/25/100 Gb/sec Ethernet
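The totals on this slide follow from the per-board figures. The short check below recomputes them; the variable names are made up for illustration, and reading the network figure as 100 Gb/sec per port in each direction is an assumption about how the slide's 1600 Gb/sec total was derived.

```python
# Recompute the DGX-2 totals quoted on the slide from its per-board figures.
gpu_boards = 2
gpus_per_board = 8
hbm2_per_gpu_gb = 32
nvswitches_per_board = 6
network_ports = 8
port_speed_gbps = 100  # ASSUMPTION: 100 Gb/sec per port, per direction

total_gpus = gpu_boards * gpus_per_board
total_hbm2_gb = total_gpus * hbm2_per_gpu_gb
total_nvswitches = gpu_boards * nvswitches_per_board
total_network_gbps = network_ports * port_speed_gbps * 2  # both directions

print(f"GPUs: {total_gpus}")
print(f"HBM2 memory: {total_hbm2_gb} GB")
print(f"NVSwitches: {total_nvswitches}")
print(f"Network bandwidth (bidirectional): {total_network_gbps} Gb/sec")
```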

20 NVSWITCH: THE REVOLUTIONARY AI NETWORK FABRIC
Inspired by leading-edge research that demands unrestricted model parallelism
Like the evolution from dial-up to broadband, NVSwitch delivers a networking fabric for the future, today
Delivering 2.4 TB/s bisection bandwidth, equivalent to a PCIe bus with 1,200 lanes
NVSwitches on DGX-2 = all of Netflix HD <45s
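One way to arrive at the 1,200-lane comparison is sketched below. The slide does not say which PCIe generation it has in mind or how the figure was derived, so the per-lane rate (roughly 1 GB/s in each direction, about what a PCIe 3.0 lane delivers) and the choice to count both directions are assumptions made purely for illustration.

```python
# Rough reconstruction of the "PCIe bus with 1,200 lanes" comparison.
nvswitch_bisection_gb_s = 2400  # 2.4 TB/s from the slide, expressed in GB/s

# ASSUMPTION: about 1 GB/s per lane per direction (roughly PCIe 3.0),
# counted in both directions. Neither detail is stated on the slide.
pcie_lane_gb_s_per_direction = 1
directions = 2

per_lane_gb_s = pcie_lane_gb_s_per_direction * directions
equivalent_lanes = nvswitch_bisection_gb_s // per_lane_gb_s

print(f"Equivalent PCIe lanes under these assumptions: {equivalent_lanes}")
```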

21 TRADITIONAL HPC CLUSTER 300 Servers $3M 180 kW

22 GPU-ACCELERATED HPC + AI CLUSTER 1 DGX-2 10 kW 1/8 the Cost 1/15 the Space 1/18 the Power
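The ratios on these two slides can be cross-checked with a few lines of arithmetic. The power ratio follows directly from the quoted figures. The DGX-2 cost is not given on the slides, so the value computed below is only what the "1/8 the Cost" claim implies. Variable names are hypothetical.

```python
from fractions import Fraction

# Figures quoted on the two slides.
cpu_cluster = {"servers": 300, "cost_usd": 3_000_000, "power_kw": 180}
dgx2_power_kw = 10
claimed_cost_ratio = Fraction(1, 8)

# Power: 10 kW against 180 kW reduces to the slide's "1/18 the Power".
power_ratio = Fraction(dgx2_power_kw, cpu_cluster["power_kw"])

# Cost: implied by the "1/8 the Cost" claim, not stated on the slide.
implied_dgx2_cost_usd = cpu_cluster["cost_usd"] * claimed_cost_ratio

print(f"Power ratio: {power_ratio}")
print(f"Implied DGX-2 cost: ${int(implied_dgx2_cost_usd):,}")
```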

23 FASTER INSIGHTS FOR MACHINE LEARNING
DGX-2: 544X Speedup Compared to CPU-Only Server Nodes
[Chart: Process Time (min) for 1, 20, 30, and 50 CPU instances versus DGX-2, broken down into cuIO/cuDF (Load and Data prep), Data Conversion, and XGBoost]
GPU measurements completed on DGX-2 running RAPIDS. CPU: 20 CPU cluster; comparison is prorated to 1 CPU (61 GB of memory, 8 vCPUs, 64-bit platform), Apache Spark.
Dataset: US Mortgage Data, Fannie Mae and Freddie Mac, M mortgages
Benchmark: 200GB CSV dataset; data preparation includes joins, variable transformations

24 FASTER SPEEDS, REAL WORLD BENEFITS
[Charts: time in seconds, shorter is better, for 20, 30, 50, and 100 CPU nodes versus DGX-2 and 5x DGX-1. Panels: cuIO/cuDF Load and Data Preparation; cuML XGBoost; End-to-End (cuIO/cuDF Load and Data Preparation, Data Conversion, XGBoost)]
Benchmark: 200GB CSV dataset; data preparation includes joins, variable transformations
CPU Cluster Configuration: CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark
DGX Cluster Configuration: 5x DGX-1 on InfiniBand network

25 DGX POD FOR RAPIDS RAPIDS.AI - Open GPU Data Science

27 PORTING EXISTING CODE: CPU vs GPU
Principal Component Analysis (PCA), Before and Now
PCA training and query results: CPU: ~5 minutes; GPU: ~7 seconds
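The speedup implied by these two timings can be worked out directly. Both figures are approximate on the slide, so the result should be read as a rough ratio and not as a measured benchmark.

```python
# Approximate speedup implied by the PCA timings quoted on the slide.
cpu_seconds = 5 * 60  # "~5 minutes"
gpu_seconds = 7       # "~7 seconds"

speedup = cpu_seconds / gpu_seconds
print(f"Approximate speedup: {speedup:.0f}x")
```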

28 HOW? DOWNLOAD AND DEPLOY
Source available on GitHub
Container available on NGC and Docker Hub
Conda and pip (pip available at a later date)
[Diagram: NGC provides source code, libraries, and packages for on-premises and cloud deployment]

29 ACCELERATING MACHINE LEARNING The RAPIDS Ecosystem Open Source Community Enterprise Data Science Platforms Startups Deep Learning Integration GPU Servers Storage Partners