OpenWLC: A Workload Characterization System. Hong Ong, Rajagopal Subramaniyan, R. Scott Studham ORNL July 19, 2005

1 OpenWLC: A Workload Characterization System Hong Ong, Rajagopal Subramaniyan, R. Scott Studham ORNL July 19, 2005

2 Outline: Background and Motivation; Existing Tools and Services; NWPerf; OpenWLC Overview; Goals; Preliminary Design; Timeline.

3 Typical Machine Evaluation: Determine a machine's applicability to a range of applications. Use synthetic codes and application kernels to capture performance properties of: hardware and architecture (e.g., NAS, SPEC, SPLASH); operating system components (e.g., lmbench); software systems and applications (e.g., NPB, HPCtoolkits, Genesis, STSWM); various kinds of middleware. Work with vendors and system architects to improve the systems. Focus: evaluation of emerging platforms to understand their benefits.

4 What do we want to know? Methods to answer the following: What percentage of jobs use all that memory/disk? What is the average job size? How does job size impact the efficiency of the calculation? Where is the bottleneck for the average run of a job? What is the sustained performance for the average job? What is the impact of slow memory on the average user? Could we have less storage on the next system if we had a shared memory system or a shared disk pool? What is the average user doing vs. what the box was designed for? And more.

5 Workload Characterization: Focus: evaluation of deployed platforms to understand how they are used by average users. How effectively is the box used, as opposed to what the box was designed for? Determine how existing platforms are utilized by applications and general users: efficiency of applications, system utilization, average job sizes. Work with application developers, users, and manufacturers to decrease application runtime and increase resource utilization.

6 Most users do not utilize the full capability of the system. Most jobs use <25% of available memory (max available is 6-8G). Large jobs use more memory. Callout: "I need 6GB of memory per CPU". [Figure: memory footprint per node during FY04 on the 11.4TF HPCS2 system at PNNL; axes: percent of jobs, CPUs used.]

7 Aggregate Results: 10% of the >256-CPU jobs have the CPUs scheduled for idle >50% of the time. The median sustained performance for jobs over 256 CPUs is 3%. [Figures: sustained performance as a function of CPU count; busy cycles as a function of job size.]

8 Aggregate Results: Stalled cycles as a function of job size. Stalled cycles are the sum of all stalls due to: BE_FLUSH_BUBBLE_ALL (branch misprediction flush or exception); BE_EXE_BUBBLE_ALL (execution unit stalls); BE_L1D_FPU_BUBBLE_ALL (stalls due to the L1D (L1 data cache) micropipeline or the FPU (floating-point unit) micropipeline); BE_RSE_BUBBLE_ALL (register stack engine (RSE) stalls); BACK_END_BUBBLE_FE (frontend stalls).
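The stall definition on this slide is a plain sum over five counter events. A minimal sketch of that arithmetic follows, assuming the per-job counter totals are already available as a dictionary keyed by event name; the function name and the dictionary layout are illustrative assumptions, not part of NWPerf.

```python
# Event names are taken from the slide; everything else here is hypothetical.
STALL_EVENTS = (
    "BE_FLUSH_BUBBLE_ALL",
    "BE_EXE_BUBBLE_ALL",
    "BE_L1D_FPU_BUBBLE_ALL",
    "BE_RSE_BUBBLE_ALL",
    "BACK_END_BUBBLE_FE",
)


def stalled_fraction(counters, total_cycles):
    """Return stalled cycles divided by total cycles for one job.

    `counters` maps event name to a cycle count; events that were not
    collected are treated as zero.
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    stalled = sum(counters.get(name, 0) for name in STALL_EVENTS)
    return stalled / total_cycles
```

Plotting this fraction against job size would give a stalled-cycles view of the kind the slide describes.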

9 Job size: HPCS2 is primarily focused on small capacity jobs. [Figure: percent of cycles used by job size, by month from October through July; series: <12% of the computer, <33% of the computer, <50% of the computer, >50% of the computer.]

10 Some Observations: After analyzing 19,883 jobs, we have determined that most users do not use the system as designed. Median sustained FLOP performance for jobs over 256 CPUs is 3%. Less than 20% of the memory is in use at any given point in time. Less than 5% of the jobs use heavy I/O (>25MB/s per GF DGEMM). Over 60% of the cycles are for jobs that use less than 1/8 of the system. 10% of the >256-CPU jobs have the CPUs scheduled for idle >50% of the time. [Diagram: user expertise (low to high) vs. scalability of application on platform (low to high), with regions labeled platform evaluation, computer design, and most utilization.]

11 Existing Tools. System monitoring: provides continuous collection and aggregation of system performance data. Examples: PerfMiner, Ganglia, NWPerf, SuperMon, CluMon, Nagios, PCP, MonALISA. Application monitoring (our main focus): measures actual application performance via a batch system. Examples: PerfMiner, NWPerf. Profiling: provides static instrumentation tools that focus on source code over which users have direct control. Examples: PerfSuite, HPCToolkits, SvPablo, TAU, Kojak, Vampir, IPM.

12 Existing Tools: Limitations. Ganglia: Great tool, easy to use. Difficult to extend in a way that keeps system impact low. Unpredictable system performance due to unsynchronized collection (ref: Petrini). RRD storage works well for fixed time series graphs, but is not suitable for ad-hoc queries. Supermon: Good per-node performance. Polling-based systems are almost implicitly unsynchronized. Has yet to provide a storage solution. Node management seemed excessively cumbersome and manual. Kernel module reliance reduces portability and increases the administrative cost of deployment (similar to dproc).

13 Existing Tools: Limitations. Other solutions such as dproc, CARD, and PARAMON have not been shown to scale to large clusters. Clumon/PCP are oriented towards point-in-time data, not long-term systematic analysis (although potential exists for extension).

14 NWPerf: 27 metrics are collected on all nodes once per minute: hardware performance counters, including flops, memory bytes/cycle, and total stalls; local scratch usage (obtained via fstat()); memory swapped out (total), swap blocks in and out; memory free, used, and used as system buffers; block I/O in and out; kernel scheduler CPU allocation to user, kernel, and idle time; processes running and blocked; interrupts and context switches per second; Lustre I/O (shared global filesystem). [Figure: all-to-all execution time [s] on 256 CPUs (12K runs, llcbench); series: None, 1X, 10X, No LSF; min and mean shown.] The 3 graphs are from the same 3-day 600-CPU run. NWPerf: Ryan Mooney, Scott Studham, Ken Schmidt.

15 NWPerf: Unresolved Issues. Deployment and configuration: difficult to deploy and tune. User interface: requires work to standardize and extend existing APIs. Software architecture: the centralized scheduler and collection server constitute a single point of failure; unable to scale to O(1000) nodes. Data management: collection and storage may not be able to scale. Visualization: simple. Portability: specific to Linux clusters. PROJECT IS HALTED!

16 What is next?
                   Version 1 (CASC 2003)         Version 2 (NWPerf)             Version 3 (OpenWLC)
Ease of use        Requires user interaction     No user interaction            No user interaction
Portability        Hard-coded for one computer   Designed for a Linux cluster   Runnable on all major architectures
Users              O(10)                         O(100)                         O(1000)
Jobs               O(10^2)                       O(10^4)                        O(10^5)
Metrics collected  O(10)                         O(20)                          O(40)
Deployed sites     1                             1                              O(10-100)

17 OpenWLC Goals: Resolve issues in NWPerf, i.e., cross-system portability, standardized system interface, better visualization, improved scalability. Perform cross-node and cross-job event correlation; for example, we want to be able to say that job one is using the shared file system, so jobs ten and twenty were slowed down. Perform validation tests at different sampling rates to determine preferred sample rates for different data points. Detect and describe event anomalies based on gathered profiles. Perform (potentially) hardware fault analysis. More importantly, we want to characterize more accurately what parts of the system are important for efficient job performance from an average user's perspective. Deployed at DOD MOD and DOE test locations.
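The cross-job correlation goal can be illustrated with a first filtering step: given a job known to load the shared file system, list the jobs whose run times overlap it. The sketch below assumes each job is recorded with a start and end time; the `Job` record and the function name are hypothetical. Overlap only yields candidates for a slowdown and does not by itself establish cause.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record: an id plus start/end times in seconds."""
    job_id: int
    start: float
    end: float


def overlapping_jobs(io_job, jobs):
    """Return ids of jobs whose run interval overlaps that of `io_job`.

    Intervals are treated as half-open [start, end), so a job that starts
    exactly when `io_job` ends is not reported.
    """
    return sorted(
        j.job_id
        for j in jobs
        if j.job_id != io_job.job_id
        and j.start < io_job.end
        and io_job.start < j.end
    )
```

A later step would compare each candidate's measured performance during the overlap against its performance outside it.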

18 OpenWLC Design Constraints: Load > O(40) metrics (e.g., FLOPs, Int Ops, Mem BW, Network BW, Disk BW) for all jobs run on the system into a (central) database. Cannot impact job performance by more than 1%. Fine data granularity, to see the need to use different algorithms. Keep all data to develop mathematical center profiles. Cannot be architecture specific, for portability. [Figure: efficiency (FLOPS) over time, 0% to 80%, with labels ccsd and ccsd(t); 24% average efficiency over 128 nodes (~1/3TF).]
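The 1% impact constraint translates into a simple per-node time budget. The sketch below assumes the only cost is the CPU time each sample takes on a node; the function names and the example numbers in the comments are illustrative. This is a lower bound on real impact, since unsynchronized collection across nodes can slow a parallel job by more than the per-node cost suggests.

```python
def overhead_fraction(sample_cost_s, interval_s):
    """Fraction of node time spent collecting, given cost per sample."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return sample_cost_s / interval_s


def sample_budget_s(interval_s, budget=0.01):
    """Longest a single sample may take while staying within `budget`."""
    return interval_s * budget


def min_interval_s(sample_cost_s, budget=0.01):
    """Shortest sampling interval that keeps overhead within `budget`."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return sample_cost_s / budget
```

For instance, at one sample per minute a 1% budget leaves about 0.6 seconds of collection time per sample on each node.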

19 Foreseen Critical Challenges. Scalability: Single node: software scalability and features; doing more without breaking or rewriting code; number of metrics. System-wide (due to increasing number of nodes): communication management; network load; data volume at the collection node. Data management. Visualization. Storage. A sensible workload model to characterize what parts of the system are important for efficient job performance from an average user's perspective.

20 Preliminary OpenWLC Framework

21 Solutions to Scalability. Solving single-node scalability: variable sampling frequency, e.g., collect a set of 10 parameters; modules/drivers may run as a kernel thread; encode collected data for transmission to the collection node to reduce transmission overhead and network traffic. Solving system-wide scalability: hierarchically layered architecture; scheduled communication with sub-collectors; identify a medium-term data store for highly structured time series data (existing solutions, MySQL and PostgreSQL, are less than ideal); better data aggregation methods.
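Two of the single-node ideas above, variable sampling frequency and compact encoding, can be sketched together. The example assumes each metric has its own period measured in collection ticks and that samples are sent as fixed-size binary records; the metric names, periods, and wire layout are all hypothetical and not the OpenWLC design.

```python
import struct

# Hypothetical per-metric periods, in collection ticks.
PERIODS = {"flops": 1, "mem_free": 1, "scratch_used": 5, "swap_out": 10}
# Small integer ids keep each record short on the wire.
METRIC_IDS = {name: i for i, name in enumerate(sorted(PERIODS))}
HEADER = "!IH"   # tick (uint32), record count (uint16), network byte order
RECORD = "!Hd"   # metric id (uint16), value (float64)


def due_metrics(tick, periods=PERIODS):
    """Return the metrics whose period divides the current tick."""
    return sorted(name for name, p in periods.items() if tick % p == 0)


def encode(tick, samples):
    """Pack one tick's samples into a compact byte string."""
    payload = struct.pack(HEADER, tick, len(samples))
    for name in sorted(samples):
        payload += struct.pack(RECORD, METRIC_IDS[name], samples[name])
    return payload


def decode(payload):
    """Inverse of encode: return (tick, {metric name: value})."""
    names = {i: name for name, i in METRIC_IDS.items()}
    tick, count = struct.unpack_from(HEADER, payload, 0)
    offset = struct.calcsize(HEADER)
    samples = {}
    for _ in range(count):
        metric_id, value = struct.unpack_from(RECORD, payload, offset)
        samples[names[metric_id]] = value
        offset += struct.calcsize(RECORD)
    return tick, samples
```

With this layout a tick carrying two metrics costs 26 bytes, and slow-changing metrics such as scratch usage are only sent on the ticks where they are due.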

22 Data Aggregation and Management: Want a high-throughput, low-overhead, highly synchronized, massively scalable data collection subsystem for collecting arbitrary data. NWPerf does a pretty good job of most of this, except for the arbitrary data part and ease of use. Supermon has some interesting capabilities in its data encapsulation, but its collection infrastructure has some weaknesses. Maybe a hybrid approach.
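One way to keep data volume down in a hierarchical collection layout is to have each sub-collector forward a mergeable summary instead of raw per-node values. The sketch below is an assumption about how that could look, not a description of NWPerf or Supermon; the `Summary` type and `aggregate` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Mergeable summary of one metric over a set of nodes."""
    count: int
    total: float
    minimum: float
    maximum: float

    @staticmethod
    def of(values):
        """Summarize raw per-node values at a sub-collector."""
        values = list(values)
        if not values:
            raise ValueError("cannot summarize an empty set of values")
        return Summary(len(values), sum(values), min(values), max(values))

    def merge(self, other):
        """Combine two summaries without access to the raw values."""
        return Summary(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        return self.total / self.count


def aggregate(summaries):
    """Merge the summaries received from sub-collectors at the root."""
    iterator = iter(summaries)
    result = next(iterator)
    for summary in iterator:
        result = result.merge(summary)
    return result
```

Because `merge` is associative, the same code works at every level of the hierarchy, and the root sees one fixed-size record per sub-collector rather than one per node.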

23 Analytical/Modeling subsystem. Could use variations of clustering algorithms. Need to identify workload model for different platforms. Problem: This is a huge project in itself. Potential future work. Do not have to re-engineer our current (preliminary) design.
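As one illustration of the clustering idea, the sketch below runs plain k-means over per-job feature vectors to group jobs into workload classes. The choice of k-means, the feature layout (for example, fraction of memory used and fraction of busy cycles), and the caller-supplied starting centroids are all assumptions made for the example; the slides do not name a specific algorithm.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Cluster `points` (tuples of floats) around the given start centroids.

    Returns (centroids, assignment), where assignment[i] is the index of
    the centroid that points[i] was nearest to in the last iteration run.
    """
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                updated.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                updated.append(old)  # keep an empty cluster where it was
        if updated == centroids:
            break  # converged: centroids no longer move
        centroids = updated
    return centroids, assignment
```

Each resulting centroid can be read as a rough profile of one class of jobs, which is one possible starting point for a per-platform workload model.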

24 Summary: Work with the application community to raise users' awareness of how to efficiently use the systems, and ensure that users' applications scale on the given NLCF platforms. [Diagram, shown twice: user expertise (low to high) vs. scalability of application on platform (low to high), with regions labeled computer design, platform evaluation, capacity center users, and most utilization.]

25 Benefits. Benefit system designers: What is really important (memory, flops, something else, all of the above)? Benefit users: Is my application sized properly? Is it having obvious problems? Benefit code developers: What resources is the application using? Does it have good performance? Stop the hype: What is performance? We need a simple way to compare performance characteristics for a large set of real-world applications.

26 Status. Collaborators: ANL, Louisiana Tech. Kick-off meeting in June. Presently focusing on OpenWLC system design and APIs. Timeline: October 2005: finalized design and APIs. Early 2006: beta version.

27 Acknowledgements: Funding for OpenWLC is from the Department of Defense High Performance Computing Modernization Office. The NWPerf research described in this presentation was performed using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by the U.S. Department of Energy's Office of Biological and Environmental Research and located at the Pacific Northwest National Laboratory. PNNL is operated for the Department of Energy by Battelle. Thanks to Ryan Mooney and Ken Schmidt for their tireless work to develop NWPerf, and to Jarek Nieplocha for his guidance on how to quantify system impacts. Experiments and data collection were performed on the Pacific Northwest National Laboratory (PNNL) 977-node Linux 11.8 TFLOPs cluster (HPCS2) with 1954 Itanium-2 processors. The data collection server is a dual Xeon Dell system with a 1TB ACNC IDE-to-SCSI RAID array. This research is sponsored by the Office of Advanced Scientific Computing Research, U.S. Department of Energy. The work was performed at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725.
