Optimizing Fine-grained Communication in a Biomolecular Simulation Application on Cray XK6


1 Optimizing Fine-grained Communication in a Biomolecular Simulation Application on Cray XK6
Yanhua Sun (1), Gengbin Zheng (1), Chao Mei (1), Eric J. Bohm (1), James C. Phillips (1), Terry Jones (2), Laxmikant (Sanjay) V. Kalé (1)
(1) University of Illinois at Urbana-Champaign, (2) Oak Ridge National Laboratory
sun51@illinois.edu
November 27, 2012

2 Motivation - Molecular Dynamics Simulation
Critical in understanding the functioning of biological machinery
Challenging - femtosecond timesteps are needed to maintain accuracy
Hundreds of nanoseconds (even microseconds) of simulated time are needed to observe interesting phenomena
10^8 steps => requires millisecond-per-timestep simulation performance
A single time step of a 1-million-atom simulation: 20 seconds
Scaling to hundreds of thousands of cores requires fine-grained decomposition

3 Outline
Goal: analyze and optimize fine-grained molecular dynamics simulation on the Cray XK6 supercomputer
- Background on Cray XK6, Charm++, and NAMD
- Detecting the performance bottleneck
- Optimization techniques
- Performance results

4 Cray XK6 - Jaguar at ORNL
Processors: 16-core Interlagos CPU plus a Kepler GPGPU per node (results here are on Fermi)
CPU partition - Jaguar XK6; GPU partition - TitanDev
Network: Gemini interconnect, 3D torus
Hardware support for RDMA
user Generic Network Interface (uGNI)

5 Charm++ Runtime System
Charm++ is a parallel programming model that implements message-driven objects (see the sketch below).
Improves both performance and productivity
Adaptive runtime system - mapping, load balancing, etc.
Multithreaded (SMP) mode for multi-core machines
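To make "message-driven objects" concrete, here is a minimal, hypothetical Charm++ sketch (not from the talk; Hello and greet are invented names): invoking an entry method on a chare sends an asynchronous message, and the runtime executes the method when the message is delivered.

    // hello.ci (Charm++ interface file, hypothetical module):
    //   mainmodule hello {
    //     array [1D] Hello {
    //       entry Hello();
    //       entry void greet(int from);
    //     };
    //   };
    #include "hello.decl.h"

    class Hello : public CBase_Hello {
     public:
      Hello() {}
      Hello(CkMigrateMessage *) {}
      void greet(int from) {
        // Invoked by the runtime when a greet() message is delivered.
        CkPrintf("element %d greeted by %d\n", thisIndex, from);
        if (thisIndex + 1 < 4)                        // 4-element array, created elsewhere
          thisProxy[thisIndex + 1].greet(thisIndex);  // asynchronous, never blocks
      }
    };
    #include "hello.def.h"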

6 SMP Charm++ Implementation on uGNI
Worker threads put outgoing messages into the communication thread's queues
Medium/large messages (> 1 KB) - RDMA
Small messages - SMSG
The communication thread polls the network for incoming messages (schematic below)
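A schematic of this send path, with hypothetical queue and message types (GNI_SmsgSend and GNI_PostRdma are the underlying uGNI calls, elided here; real machine-layer code also needs locking and completion-queue handling):

    #include <cstddef>
    #include <queue>

    struct Msg { void *data; std::size_t bytes; int destNode; };
    static std::queue<Msg> commQueue;     // worker -> comm thread (lock omitted)

    const std::size_t SMSG_MAX = 1024;    // messages over 1 KB go through RDMA

    void workerSend(const Msg &m) {       // called on worker threads
      commQueue.push(m);
    }

    void commThreadLoop() {               // runs on the communication thread
      while (!commQueue.empty()) {
        Msg m = commQueue.front(); commQueue.pop();
        if (m.bytes <= SMSG_MAX) {
          // eager path: GNI_SmsgSend(...) copies into a mailbox
        } else {
          // rendezvous path: GNI_PostRdma(...) does a one-sided put
        }
      }
      // ...then poll the completion queues for incoming messages
    }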

7 NAMD
A highly scalable molecular dynamics application developed since the mid-1990s, built on Charm++.
The simulation box is spatially divided into patches
Force calculation between two patches is assigned to compute objects (illustrated below)
Complexity is O(N log N), achieved by splitting forces into short-range and long-range calculations
GPU case: short-range work runs on the GPUs, long-range work on the CPUs
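A back-of-the-envelope illustration of this decomposition (the numbers are hypothetical, not NAMD's actual bookkeeping): patches are cubes with edge length at least the cutoff, so each patch interacts only with its 26 neighbors, and each neighboring pair gets one compute object.

    #include <cstdio>

    int main() {
      const double box = 108.0, cutoff = 12.0;       // ApoA1-like sizes
      const int perDim = (int)(box / cutoff);        // 9 patches per dimension
      const int patches = perDim * perDim * perDim;  // 729 patches
      // One self-compute per patch plus one compute per neighbor pair
      // (26 neighbors, each pair counted once):
      const int computes = patches + patches * 26 / 2;
      std::printf("%d patches, ~%d compute objects\n", patches, computes);
      return 0;
    }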

8 PME
The long-range calculation is implemented via the particle-mesh Ewald (PME) method
[Figure: pencil PME communication pattern, sketched below]
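The pencil pattern, roughly (an illustrative sketch with hypothetical sizes, not a figure from the talk): the 3D FFT grid is split into 1D pencils, each FFT direction is computed locally, and two transpose phases exchange messages only between pencils sharing a plane.

    #include <cstdio>

    int main() {
      // A 216^3 grid split into a 12x12 array of z-pencils, each
      // holding an 18x18x216 block of grid points.
      const int px = 12, py = 12, pz = 12;
      // Transpose 1: z-pencil (x,y) sends one message to each y-pencil
      // (x,z) in its x-plane; transpose 2 is analogous, y -> x.
      const long phase1 = (long)px * py * pz;  // pz messages per z-pencil
      const long phase2 = (long)px * pz * py;  // py messages per y-pencil
      std::printf("transpose messages: %ld + %ld per 3D FFT\n",
                  phase1, phase2);
      return 0;
    }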

9 Molecular Systems

Molecule    Atoms        Cutoff (Å)  Simulation Box (Å)
DHFR        23,558       9           62 x 62 x 62
ApoA1       92,224       12          108 x 108 x 77
1M STMV     1,066,628    12          216 x 216 x 216
100M STMV   106,662,800  12          1084 x 1084 x 867

Table: Parameters for the four molecular systems

10 Earlier Work
[Figure: NAMD 100M-atom benchmark on Jaguar XT5 with PME - speedup vs. number of processor cores, compared against ideal scaling]
NAMD scaled up to 224K cores on Cray XT5 for a 100-million-atom simulation.
Speedup starts to falter beyond 64K cores.

11 Performance Analysis
Trace-based performance analysis tool: Projections
- Automatic runtime instrumentation module
- Java-based GUI program to visualize and analyze the performance data

12 Timelines for CPU and GPU Runs (4 nodes)
[Figure: Projections timelines - 4 nodes with CPU only vs. 4 nodes with GPUs]
Purple: short-range work; green: long-range work; red: integration; white: idle

13 Timelines for CPU and GPU Runs (512 nodes)
[Figure: Projections timelines - 512 nodes with CPU only vs. 512 nodes with GPUs]

14 Tracing Back Multiple Messages on the Critical Path

15 Speeding Up Communication on the Critical Path - Priority Messages
Short-range calculation and PME work are driven by different messages
Sender: out-of-band sending
Receiver: priority execution (example below)
Increases responsiveness
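On the receiver side, Charm++'s standard prioritized-queueing API looks like the fragment below (MyMsg, pmeProxy, and recvGrid are hypothetical names; a lower integer means higher priority, so critical-path PME messages bypass queued short-range work):

    // Allocate the message with 32 bits of priority space.
    MyMsg *msg = new (8 * sizeof(int)) MyMsg();
    *(int *)CkPriorityPtr(msg) = -1;        // smaller value = runs sooner
    CkSetQueueing(msg, CK_QUEUEING_IFIFO);  // integer-priority FIFO queueing
    pmeProxy[pencil].recvGrid(msg);         // ordinary asynchronous send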

16 Message Tracing of Patch-to-PME
[Figure: Projections timelines of patch-to-PME message tracing for DHFR on 1024 cores - without optimization vs. with optimization]

17 Persistent Messages
Persistent channels for the FFT are set up at the beginning (sketch below)
No need to allocate memory for each message
Direct one-sided put
10% overall performance improvement
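A sketch of how such a channel is used through Converse's persistent-communication interface (the Cmi* calls are the documented API, though exact signatures vary across Charm++ versions; MAX_FFT_BYTES is a placeholder):

    #include "converse.h"

    #define MAX_FFT_BYTES (1 << 20)  /* placeholder upper bound on message size */

    static PersistentHandle chan;

    void setupChannel(int destPE) {
      /* Register receiver-side memory once, sized for the largest FFT
         message; later sends become direct one-sided puts. */
      chan = CmiCreatePersistent(destPE, MAX_FFT_BYTES);
    }

    void sendFFTData(int destPE, char *msg, int bytes) {
      CmiUsePersistentHandle(&chan, 1);   /* route the next sends via the channel */
      CmiSyncSend(destPE, bytes, msg);
      CmiUsePersistentHandle(NULL, 0);    /* back to the normal path */
    }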

18 PME Decomposition
T = T_comm + T_comp = D / (4Bα) + (N log N) / P    (1)
- Reduce the number of messages to utilize the network more efficiently (1 pencil per physical node)
- More parallelism to utilize CPU resources (1 pencil per core)
- Tradeoff: 1 pencil per Charm++ process
- CkLoop library to utilize all cores (sketch below)
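With one pencil per process, CkLoop can still spread a pencil's 1D FFTs over all worker threads; a sketch using the CkLoop API (PencilWork and run_1d_fft are hypothetical, and exact CkLoop signatures differ slightly across Charm++ versions):

    #include <cstddef>
    #include "CkLoopAPI.h"

    struct PencilWork { float *grid; int fftLen; };
    void run_1d_fft(float *line, int len);  // stand-in for the FFT kernel

    // Each chunk transforms lines [first, last] of this pencil.
    static void fftChunk(int first, int last, void * /*result*/,
                         int /*paramNum*/, void *param) {
      PencilWork *w = (PencilWork *)param;
      for (int line = first; line <= last; ++line)
        run_1d_fft(w->grid + (std::size_t)line * w->fftLen, w->fftLen);
    }

    void computePencil(PencilWork *w, int numLines, int numChunks) {
      // Farm the loop out to this process's worker threads.
      CkLoop_Parallelize(fftChunk, 1, w, numChunks, 0, numLines - 1);
    }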

19 Performance Results - DHFR
[Figure: NAMD DHFR CPU performance on TitanDev - ms/step vs. number of cores for optimized uGNI, baseline uGNI, and MPI-based Charm++]

20 Performance Results - XK6 vs. XT5 (100 million atoms)
[Figure: ms/step vs. number of cores (4K to 128K) for uGNI-based Charm++ on TitanDev (GPU), uGNI-based Charm++ on Jaguar XK6, MPI-based Charm++ on Jaguar XT5, and MPI-based Charm++ on Jaguar XK6]

21 Conclusion and Future Work
Conclusion
- Techniques to analyze and optimize NAMD at both the application and runtime-system levels
- Timestep of the 100M-atom STMV simulation improved from 26 ms/step on Jaguar XT5 to 13 ms/step on XK6
Future Work
- Topology-aware PME distribution and communication
- Multilevel summation to replace PME

22 Related Events at SC12
- Tutorial: Charm++, 8:30AM-12:00PM, Sunday, November 11
- HPC Challenge BoF, 12:15PM-1:15PM, Tuesday, November 13, in 255-A
- Sidney Fernbach Award Talk, 11:30AM-12:00PM, Wednesday, November 14, in 155-E
- Dissertation showcase: Saving Energy and Power, 11:15AM-11:30AM, Wednesday, November 14, in 155-F
- Paper talk: Optimizing Fine-grained Communication in a Biomolecular Simulation Application on Cray XK6, Wednesday, November 14, in 355-EF
- Charm++ BoF, 12:15PM-1:15PM, Thursday, November 15, in 255-A