HETEROGENEOUS SYSTEM ARCHITECTURE: FROM THE HPC USAGE PERSPECTIVE Haibo Xie, Ph.D. Chief HSA Evangelist AMD China
AGENDA: GPGPU in HPC, what are the challenges Introducing Heterogeneous System Architecture (HSA) How HSA benefits GPGPU in HPC usage Taking HSA to the Industry
GPU IN HPC WHAT ARE THE CHALLENGES? Massively Parallel Processing? Finding Parallelism? SIMDs/Vector-Arrays? Bringing Data to Computation? Refine the algorithm? 3 HPC China 2012 HSA: from the HPC usage perspective Oct. 30, 2012
THE PROBLEM WHY IS IT DIFFICULT? Algorithms and programming; hardware and tool-chain. Not every HPC domain-science programmer can use GPUs: significant effort goes into tailoring algorithms, even to the size of the problem, and code reuse remains an issue. Data transfer cost, distributed memory spaces between CPU and GPU, awkward (legacy) programming models, high software runtime overhead, and special-purpose devices that lack the necessary tools.
BUT Several efforts are still targeted at utilizing GPUs in HPC. Hybrid computing became a common term; heterogeneity is now becoming the norm. Getting performance is still a problem in general-purpose HPC. The US Department of Energy's 20 MW expectation for an ExaScale system is probably going to end up being an optimization problem to solve.
RE-THINKING CPU+dGPU
CHANGING THE THINKING
INTRODUCING HETEROGENEOUS SYSTEM ARCHITECTURE Brings All the Processors in a System into Unified Coherent Memory. POWER EFFICIENT | INDUSTRY SUPPORT | EASY TO PROGRAM | OPEN STANDARD | FUTURE LOOKING | ESTABLISHED TECHNOLOGY FOUNDATION
HSA APU FEATURE ROADMAP
Physical Integration: Integrate CPU & GPU in silicon; Unified Memory Controller; Common Manufacturing Technology.
Optimized Platforms: GPU Compute C++ support; User mode scheduling; Bi-Directional Power Mgmt between CPU and GPU.
Architectural Integration: Unified Address Space for CPU and GPU; GPU uses pageable system memory via CPU pointers; Fully coherent memory between CPU & GPU.
System Integration: GPU compute context switch; GPU graphics pre-emption; Quality of Service; Extend to Discrete GPU.
HSA COMPLIANT FEATURES Optimized Platforms
GPU Compute C++ support: supports OpenCL C++ directions and Microsoft's upcoming C++ AMP language. This eases programming of both CPU and GPU working together to process parallel workloads.
User mode scheduling: drastically reduces the time to dispatch work, requiring no OS kernel transitions or services and minimizing software driver overhead.
Bi-Directional Power Mgmt between CPU and GPU: enables power sloshing, where CPU and GPU can dynamically lower or raise their power and performance depending on the activity and which one is better suited to the task at hand.
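The latency argument for user mode scheduling can be illustrated with a minimal C sketch. This is not the real HSA queue ABI — the packet layout and names below are invented for illustration — but it shows the shape of the idea: the application writes a dispatch packet into a shared ring buffer and bumps a write index entirely in user space, with no system call or driver transition on the dispatch path.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical dispatch packet; the real HSA packet format differs. */
typedef struct {
    uint32_t kernel_id;   /* which kernel to run         */
    uint32_t grid_size;   /* number of work-items        */
    uint64_t kernarg;     /* address of kernel arguments */
} packet_t;

#define RING_SLOTS 8u     /* power of two so masking wraps the index */

typedef struct {
    packet_t slots[RING_SLOTS];
    uint64_t write_idx;   /* bumped by the producer (the application) */
    uint64_t read_idx;    /* bumped by the consumer (the "GPU")       */
} queue_t;

/* Producer side: entirely in user space, no OS kernel transition. */
static int enqueue(queue_t *q, const packet_t *p)
{
    if (q->write_idx - q->read_idx == RING_SLOTS)
        return 0;                        /* ring full */
    q->slots[q->write_idx % RING_SLOTS] = *p;
    q->write_idx++;                      /* acts as the "doorbell" */
    return 1;
}

/* Consumer side: the packet processor polls the same shared memory. */
static int dequeue(queue_t *q, packet_t *out)
{
    if (q->read_idx == q->write_idx)
        return 0;                        /* ring empty */
    *out = q->slots[q->read_idx % RING_SLOTS];
    q->read_idx++;
    return 1;
}
```

In the legacy model every dispatch crosses into the kernel-mode driver; here the cost of a dispatch is a struct copy and an index increment, which is why user mode queuing cuts launch overhead so sharply.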
HSA COMPLIANT FEATURES Architectural Integration
Unified Address Space for CPU and GPU: the unified address space makes it easier for developers to create applications. On HSA platforms a pointer is really a pointer — no separate memory pointers are needed for CPU and GPU. The GPU can take advantage of the CPU virtual address space.
GPU uses pageable system memory via CPU pointers: with pageable system memory, the GPU can reference data directly in the CPU domain. In prior architectures, data had to be copied between the two spaces or page-locked prior to use. And there is no GPU memory size limitation!
Fully coherent memory between CPU & GPU: allows data to be cached by both the CPU and the GPU, and referenced by either. In all previous generations, GPU caches had to be flushed at command buffer boundaries prior to CPU access. And unlike discrete GPUs, the CPU and GPU in an APU share a high-speed coherent bus.
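The "a pointer is really a pointer" point is the crux for pointer-containing data structures. A small plain-C sketch (simulating the old copy-based model; the names are invented for illustration) shows why: byte-copying linked nodes into a separate "device" buffer leaves their next pointers aimed back at the original host allocation, so the copy is not self-contained without pointer translation — whereas with a unified address space the GPU can simply walk the original list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct node {
    int value;
    struct node *next;    /* embedded pointer: the problem case */
} node_t;

/* Copy-based model: memcpy node images into a disjoint buffer, the
 * way a discrete-GPU driver stages data into device memory. The
 * next pointers still reference the HOST nodes afterwards, so the
 * copied structure is broken unless every pointer is "fixed up". */
static void copy_to_device(node_t *dev, const node_t *host, size_t n)
{
    memcpy(dev, host, n * sizeof(node_t));
}
```

The assertion that `dev[0].next` still points into the host array, not at `dev[1]`, is exactly the failure mode that shared virtual memory removes: under HSA there is no second copy, so the pointers stay valid for both processors.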
FULL HSA FEATURES System Integration
GPU compute context switch: GPU tasks can be context switched, making the GPU a multi-tasker. Context switching means faster application, graphics and compute interoperation, and users get a snappier, more interactive experience.
GPU graphics preemption: as more applications enjoy the performance and features of the GPU, it is important that the interactivity of the system stays good. This means low-latency access to the GPU from any process.
Quality of service: with context switching and pre-emption, time criticality is added to the tasks assigned to the processors. Direct access to the hardware for multiple users or multiple applications is either prioritized or equalized.
HSA SOLUTION STACK
System components: compliant heterogeneous computing hardware; a software compilation stack; a user-space runtime system; kernel-space system components.
Overall vision:
Make the GPU easily accessible — support mainstream languages, expandable to domain-specific languages; a complete GPU tool-chain, with programming, debugging and profiling as on the CPU.
Make compute offload efficient — direct path to the GPU (avoiding graphics overhead); eliminate memory copies; low-latency dispatch.
Make it ubiquitous — drive HSA as a standard through the HSA Foundation; open source key components.
[Stack diagram: application software and domain-specific libraries (Bolt, OpenCV, many others) sit atop the HSA Runtime and HSAIL, which the HSA Finalizer lowers to GPU ISA via the HSA software drivers; OpenCL, DirectX and other runtimes with legacy drivers run alongside, all on differentiated hardware — CPU(s), GPU(s), other accelerators.]
DRIVER STACK VS. HSA SOFTWARE STACK
[Diagram: today, apps reach the hardware through domain libraries, the OpenCL 1.x and DX runtimes, user mode drivers and the graphics kernel mode driver. Under HSA, apps use HSA domain libraries, an HSA JIT and task queuing libraries over the HSA Runtime and the HSA kernel mode driver, on APUs, CPUs and GPUs. AMD supplies the user mode and kernel mode components; all others are contributed by third parties or AMD.]
HETEROGENEOUS COMPUTE DISPATCH How compute dispatch operates today in the driver model. How compute dispatch improves tomorrow under HSA.
HSA COMMAND AND DISPATCH CPU <-> GPU
[Diagram: the application/runtime dispatches work directly to CPU1, CPU2 and the GPU.]
HSA INTERMEDIATE LAYER - HSAIL
HSAIL is a virtual ISA for parallel programs, finalized to native ISA by a JIT compiler or Finalizer. Low level, for fast JIT compilation. Explicitly parallel, designed for data-parallel programming. Support for exceptions, virtual functions, and other high-level language features. Syscall methods: GPU code can call directly to system services, IO, printf, etc. Debugging support. HPC Advisory Council HSA: platform for the future Oct. 28, 2012
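The two-stage model — a portable virtual ISA finalized to native code at run time — can be sketched conceptually in C. To be clear, this is not HSAIL syntax; the two ops and every name below are invented for illustration. A "finalizer" lowers each virtual instruction to a native function pointer once, and dispatch then runs the finalized table with no further translation cost.

```c
#include <assert.h>
#include <stddef.h>

/* Invented virtual ops standing in for a virtual ISA like HSAIL. */
typedef enum { V_ADD, V_MUL } vop_t;

typedef struct { vop_t op; int imm; } vinst_t;

/* "Native code" here is just host functions selected per op. */
typedef int (*native_fn)(int x, int imm);

static int n_add(int x, int imm) { return x + imm; }
static int n_mul(int x, int imm) { return x * imm; }

/* Finalizer: lower every virtual instruction to native code once,
 * the analogue of JIT-compiling HSAIL down to the GPU ISA. */
static void finalize(const vinst_t *prog, native_fn *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (prog[i].op == V_ADD) ? n_add : n_mul;
}

/* Execute the finalized program over one work-item's value. */
static int run(const vinst_t *prog, const native_fn *code, size_t n, int x)
{
    for (size_t i = 0; i < n; i++)
        x = code[i](x, prog[i].imm);
    return x;
}
```

The design point mirrors the slide: because the virtual ISA is low level and explicitly parallel, the finalize step can stay cheap and fast, while the virtual program remains portable across vendors' hardware.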
HSA TAKING PLATFORM TO PROGRAMMERS
Balance between CPU and GPU for performance and power efficiency. Make GPUs accessible to a wider audience of programmers: programming models close to today's CPU programming models; enabling more advanced language features on the GPU. Shared virtual memory enables complex pointer-containing data structures (lists, trees, etc.) and hence more applications on the GPU. A kernel can enqueue work to any other device in the system (e.g. GPU->GPU, GPU->CPU), enabling task-graph style algorithms, ray tracing, etc. Complete tool-chain for programming, debugging and profiling. HSA provides a compatible architecture across a wide range of programming models and HW implementations.
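Device-to-device enqueue is what makes task-graph style algorithms practical, because a finishing kernel can hand its continuation to another processor without a driver round-trip through the host. A minimal C sketch (simulated queues; all names are hypothetical, not an HSA API) illustrates the flow: a "kernel" drained from the GPU queue pushes its follow-up task directly onto the CPU queue.

```c
#include <assert.h>

#define QCAP 4
typedef struct { int tasks[QCAP]; int head, tail; } devq_t;

/* Tiny per-device work queue; no overflow handling in this sketch. */
static void push(devq_t *q, int task) { q->tasks[q->tail++ % QCAP] = task; }
static int  pop(devq_t *q)            { return q->tasks[q->head++ % QCAP]; }
static int  empty(const devq_t *q)    { return q->head == q->tail; }

/* A "kernel" running on the GPU queue that, on finishing stage 1,
 * enqueues stage 2 directly onto the CPU's queue (GPU->CPU). */
static void gpu_kernel_stage1(devq_t *cpu_q, int result)
{
    push(cpu_q, result + 100);   /* hand follow-up work to the CPU */
}
```

In the legacy model the GPU would signal completion, the host would be interrupted, and the CPU-side driver would schedule the next stage; here the graph edge is just another enqueue.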
HSA VALUES GPGPU EASIER TO PROGRAM
A pointer is a pointer! Expressive runtime for rich high-level programming languages: C/C++, Java, Python, C#. More programming model support: OpenCL, C++ AMP, OpenMP. Cacheable and coherent memory; more data structures can be freely shared. Single source for all processors on the SoC.
HSA VALUES GPGPU PERFORMANCE AND POWER EFFICIENCY
Pass a pointer rather than moving data, supporting more problems with different dataset sizes. Reduced kernel launch time and efficient CPU/GPU communication: hardware-managed queues and scheduling allow very low-latency communication between devices, good for performance and power efficiency. Preemption and context switching: support for multiple concurrent GPU processes and preemptive multitasking of CPU/GPU resources. Bi-Directional Power Mgmt between CPU and GPU, with Turbo Core technology for more power efficiency.
TAKING HSA TO THE INDUSTRY Copyright 2012 HSA Foundation. All Rights Reserved.
HSA FOUNDATION INITIAL FOUNDERS, represented by: CVP, Heterogeneous Applications and Developer Solutions; ARM Fellow and VP of Technology, Media Processing; Vice President, Marketing; Senior Director, CTO Office; Director, Linux Development Center.
AMD'S OPEN SOURCE COMMITMENT TO HSA
We will open source our Linux execution and compilation stack: jump-start the ecosystem, allow a single shared implementation where appropriate, and enable university research in all areas.
Component (AMD specific? — rationale):
HSA Bolt Library — no — enable understanding and debug
OpenCL HSAIL Code Generator — no — enable research
LLVM Contributions — no — industry and academic collaboration
HSA Assembler — no — enable understanding and debug
HSA Runtime — no — standardize on a single runtime
HSA Finalizer — yes — enable research and debug
HSA Kernel Driver — yes — for inclusion in Linux distros
THE FUTURE OF HETEROGENEOUS COMPUTING
The architectural path for the future is clear: programming patterns established on Symmetric Multi-Processor (SMP) systems migrate to the heterogeneous world. An open architecture, with published specifications and an open source execution software stack. Heterogeneous cores working together seamlessly in coherent memory, with low-latency dispatch and no software fault lines. APU servers will unleash GPGPU power in the HPC domain.
WHERE ARE WE TAKING YOU? Switch the compute, don't move the data! Every processor now has serial and parallel cores; all cores are capable, with performance differences. A simple and efficient programming model. Platform design goals: easy support of massive data sets; support for task-based programming models; solutions for all platforms; open to all.
THANK YOU! Access HSA: http://developer.amd.com http://hc.csdn.net Haibo Xie: haibo.xie@amd.com
DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. 2012 Advanced Micro Devices, Inc.