Challenges for Performance Analysis in High-Performance RC


1 Challenges for Performance Analysis in High-Performance RC
July 20, 2007
Seth Koehler, Ph.D. Student, University of Florida
John Curreri, Ph.D. Student, University of Florida
Dr. Alan D. George, Professor of ECE, University of Florida

2 Outline
- Introduction
- Background information
- Challenges for RC performance analysis
- Performance analysis framework
- Case study: N-Queens
- Conclusions

3 Introduction
- RC applications (CPUs and FPGAs) can achieve large performance gains or use less power
- Difficult to understand and improve application performance:
  - Complexity of designs
  - Large-scale systems
  - Heterogeneous resources
- Performance analysis tools (PATs):
  - Collect, view, and analyze performance data
  - Handle instrumentation and measurement automatically
  - Organize and analyze large volumes of data
  - Integrate multiple sources of data (CPU and FPGA)
  - Understand application behavior to locate performance problems
[Figure: difficulty of debug and performance analysis increases from sequential to parallel to dual-paradigm applications]

4 Background information
- Goals for PATs: low impact on application behavior, high-fidelity performance data, concise visualization; adaptable, portable, automated
- Traditional PATs monitor from the CPU; a significant amount of application behavior is left unmonitored in RC applications
[Figure: PAT workflow (instrument, measure, present, analyze, optimize) alongside an RC system hierarchy from network and node down to FPGA boards, interconnects, and app cores]

5 Challenges for RC performance analysis
How do we expand the notion of software PATs into the software-hardware realm of RC?
- Instrumentation: choosing data to monitor; selecting an instrumentation level (source, binary, etc.); automating instrumentation
- Measurement: recording and storing performance data; managing shared resources
- Presentation: viewing the diverse behavior of CPUs and FPGAs
- Integration: effectively monitoring software and hardware performance in a unified tool

6 Challenges: Instrumentation
- Choosing data to monitor
  - Control and communication are good starting points
  - State machines, pipelines, memory accesses (on-chip, on-board, off-board), component idle time
  - Application knowledge helpful
- Selecting an instrumentation level*
  - Source (HDL), binary (FPGA bitstream), or intermediate
  - Binary instrumentation difficult; lack of access to (or documentation of) intermediate levels
  - Tradeoffs in portability (system & language), flexibility, perturbation of design area/speed, difficulty, source correlation, and time required
- Automating instrumentation
  - Non-trivial to detect FPGA control and communication
  - No standard function calls (e.g., MPI_Send in software)
* P. Graham, B. Nelson, and B. Hutchings, "Instrumenting Bitstreams for Debugging FPGA Circuits," Proc. of 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Apr. 2001.

7 Challenges: Measurement
- Recording and storing performance data
  - Compromise between understanding application behavior and exhausting memory
  - Profiling (summary stats) vs. tracing (time & associated data)
  - Memory scenarios are complex (size & speed, owned vs. shared, availability, usage, etc.)
  - May require transfer of performance data to CPU main memory; CPU may have to initiate transfer (polling or interrupt)
- Managing shared resources
  - No virtualization of resources: communication channel (FPGA interconnect), on-board memory
  - Splicing into resources previously owned by the application is tricky
[Figure: performance module on the FPGA board shares resources with the application through shared-resource management and communicates with the CPU]
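The profiling-versus-tracing tradeoff above can be made concrete with two record layouts. This is an illustrative sketch, not the framework's actual data structures: a profile counter keeps only summary statistics regardless of how many events fire, while a trace record consumes buffer space per event.

```c
#include <stdint.h>

/* Profiling: fixed-size summary statistics per monitored event.
   Memory cost is constant no matter how long the run is. */
typedef struct {
    uint32_t count;        /* number of times the event fired */
    uint64_t total_cycles; /* accumulated time in the monitored state */
} profile_counter_t;

/* Tracing: one timestamped record per event occurrence.
   Higher fidelity, but a buffer of these exhausts Block RAM quickly. */
typedef struct {
    uint64_t timestamp; /* FPGA cycle count when the event occurred */
    uint16_t event_id;  /* which monitored signal fired */
    uint16_t data;      /* small payload, e.g., a state-machine state */
} trace_record_t;

/* Profiling folds each event into the running summary. */
static void profile_event(profile_counter_t *c, uint64_t cycles)
{
    c->count++;
    c->total_cycles += cycles;
}
```

A thousand events cost a thousand `trace_record_t` entries but still only one `profile_counter_t`, which is why profiling is the default when on-chip memory is scarce.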

8 Challenges: Presentation
- Viewing the diverse behavior of CPUs and FPGAs
  - FPGAs do not fit neatly into the CPU model: many active tasks per processing node, and components vary in type & purpose
  - Difficult to show computation progress of a single component
- Scalability
  - Trace-based displays are useful but scale poorly
  - Hierarchical views (like Ganglia) may be effective
- Ideally, visualizations capture both the potential and the observed communication and computation
[Figure: example visualization of communication bandwidth among eight CPUs and four FPGAs]

9 Challenges: Integration
- Effectively monitoring software & hardware performance in a unified tool
  - Clocks must be synchronized to determine event ordering among nodes
  - Must choose where to monitor an event (hardware, software, or both), balancing efficiency, difficulty, and accuracy
  - May need complementary modifications to software and hardware (e.g., implementing a shared communication channel)
- Parallel Performance Wizard (PPW)
  - Designed to monitor performance in PGAS (Partitioned Global Address Space) languages
  - Higher abstractions in PGAS languages (e.g., implicit communication and data distribution) make it difficult to understand application behavior
  - Extend the GASP (Global Address Space Performance) interface with generic events (e.g., FPGA Write, Read, Reset, Initialize)
  - Lack of standard APIs for accessing FPGAs is problematic
  - Augment visualizations to support (and be meaningful to) RC applications
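One way to picture the generic-event idea is a wrapper that brackets each FPGA access with start/end notifications. Everything below is a hypothetical sketch, not the actual GASP or PPW interface: `notify` stands in for the tool's event callback, and `vendor_write` stands in for whichever platform-specific write call a given RC system exposes (the lack of a standard API being exactly the problem the slide notes).

```c
#include <stddef.h>

/* Illustrative generic FPGA events, in the spirit of the slide's
   "FPGA Write, Read, Reset, Initialize" list. Names are made up. */
typedef enum { EVT_FPGA_INITIALIZE, EVT_FPGA_WRITE,
               EVT_FPGA_READ, EVT_FPGA_RESET } fpga_event_t;
typedef enum { EVT_START, EVT_END } event_phase_t;

/* Stand-in for the performance tool: just record events in order. */
static int event_log[16];
static int event_count = 0;
static void notify(fpga_event_t e, event_phase_t p)
{
    if (event_count < 16)
        event_log[event_count++] = (int)e * 2 + (int)p;
}

/* Stand-in for a vendor-specific FPGA write call. */
static size_t vendor_write(const void *buf, size_t len)
{
    (void)buf;
    return len; /* pretend all bytes were transferred */
}

/* Instrumented wrapper: the application calls this instead of the
   vendor API, so the tool sees every FPGA transfer. */
static size_t traced_write(const void *buf, size_t len)
{
    notify(EVT_FPGA_WRITE, EVT_START);
    size_t n = vendor_write(buf, len);
    notify(EVT_FPGA_WRITE, EVT_END);
    return n;
}
```

Because each platform's `vendor_write` differs, one such wrapper must be written (or generated) per platform, which is why a standard FPGA access API would simplify integration considerably.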

10 Performance analysis framework
[Figure: Hardware Measurement Module (HMM) buffers performance data in Block RAM and on-board memory (DDR/QDR)]

11 Performance analysis framework (cont.)
- Instrumentation wraps the user application on both sides: FPGA access methods (wrapper) around the HLL application on the CPU(s), and a new top-level file around the HDL modules and submodules on the FPGA(s)
- Process is automatable; additions are temporary
[Figure: color-coded diagram distinguishing the original RC application from the additions made by instrumentation]

12 Performance analysis framework (cont.)
- Performance thread (HMM_Main) periodically transfers data from FPGA to memory
- Adaptive polling frequency can be employed to balance fidelity and overhead
- Measurement can be stopped and restarted (similar to a stopwatch)
[Figure: application calls HMM_Init, HMM_Start, HMM_Stop, and HMM_Finalize, which control the HMM_Main thread]
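The adaptive-polling idea can be sketched as a simple feedback rule. This is an assumed policy for illustration, not the framework's actual code: after each drain of the FPGA-side buffer, the performance thread adjusts its polling interval based on how full the buffer was, polling faster when data risks being dropped and backing off when polls come up nearly empty.

```c
#include <stddef.h>

/* Hypothetical poller state for the HMM_Main performance thread. */
typedef struct {
    unsigned interval_us; /* current polling period */
    unsigned min_us;      /* floor: bounds overhead on the bus/CPU */
    unsigned max_us;      /* ceiling: bounds worst-case staleness */
} hmm_poller_t;

/* After draining the buffer, adapt the interval to its occupancy:
   >= 75% full -> halve the interval (risk of lost events);
   <= 25% full -> double it (polls are mostly wasted overhead). */
static void hmm_adapt(hmm_poller_t *p, size_t used, size_t capacity)
{
    if (used * 4 >= capacity * 3) {
        p->interval_us /= 2;
        if (p->interval_us < p->min_us)
            p->interval_us = p->min_us;
    } else if (used * 4 <= capacity) {
        p->interval_us *= 2;
        if (p->interval_us > p->max_us)
            p->interval_us = p->max_us;
    }
}
```

In the real thread this rule would sit inside the HMM_Main loop between transfers; the min/max bounds are what keep fidelity and overhead in balance at the extremes.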

13 Performance analysis framework (cont.)
- New top-level file arbitrates between application and performance framework for off-chip communication
- Splice into the communication scheme
  - Acquire address space in memory map
  - Acquire network address or other unique identifier
- Connect hardware together; define event triggers
- Challenges in automation
  - Custom APIs for FPGAs
  - Custom user schemes for communication
  - Application knowledge not available to the instrumentation framework

14 Case study: N-Queens*
- Overview: find the number of distinct ways n queens can be placed on an n x n board without attacking each other
- Performance analysis overhead
  - Sixteen 32-bit profile counters
  - One 96-bit trace buffer (completed cores)
- Main state machine optimized based on data; improved speedup (from 34 to 37 vs. Xeon code)

N-Queens results for board size of 16:
  XD1, Original:       9,… slices; … Block RAM; … MHz; <1 KB/s communication
  XD1, Instr.:         9,901 slices (+4%); 15 Block RAM (+2%); 123 MHz (-1%); 33 KB/s
  Xeon-H101, Original: 23,… slices; … Block RAM; … MHz; <1 KB/s
  Xeon-H101, Instr.:   26,218 slices (+6%); 22 Block RAM (0%); 101 MHz (0%); … KB/s
  (Slices and Block RAM given as % relative to device; frequency as % relative to original)

* Standard backtracking algorithm employed
[Figure: application speedup over a single 3.2GHz Xeon for 8-node 3.2GHz Xeon, 8-node H101, and optimized 8-node H101 configurations]
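For reference, the standard backtracking algorithm the case study accelerates can be written compactly in software using bitmasks for attacked columns and diagonals. This is a common software formulation of the same recursion, shown for clarity; it is not the FPGA core's HDL, which pipelines this search across many parallel cores.

```c
#include <stdint.h>

/* Count N-Queens solutions by placing one queen per row.
   cols: columns already occupied; ld/rd: squares attacked along the
   left/right diagonals, shifted as the search descends one row. */
static uint64_t solve(int n, uint32_t cols, uint32_t ld, uint32_t rd)
{
    uint32_t mask = (1u << n) - 1;
    if (cols == mask)
        return 1; /* all n queens placed */
    uint64_t count = 0;
    uint32_t freeq = ~(cols | ld | rd) & mask; /* safe squares in this row */
    while (freeq) {
        uint32_t bit = freeq & (0u - freeq); /* lowest safe square */
        freeq -= bit;
        count += solve(n, cols | bit, (ld | bit) << 1, (rd | bit) >> 1);
    }
    return count;
}

static uint64_t nqueens(int n) { return solve(n, 0, 0, 0); }
```

The board-size-16 run in the table corresponds to `nqueens(16)`; the FPGA implementation partitions the top rows of this search tree across hardware cores so each explores an independent subtree.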

15 Conclusions
- Developed RC performance analysis framework
  - First RC performance concept and tool framework (per extensive literature review)
  - Tracing, profiling, & sampling available
  - Automated instrumentation in progress
- Application case study performed
  - Observed minimal overhead from tool
  - Speedup achieved due to performance analysis
- Future work: automation, integration with software PATs, analysis & visualization, HLL mappers, further case studies