NSF Future of High Performance Computing. Bill Kramer

1 NSF Future of High Performance Computing Bill Kramer

2 Why Sustained Performance is the Critical Focus
Memory Wall: the limitation on computation speed caused by the growing disparity between processor speed and memory latency and bandwidth. From 1986 to 2000, processor speed increased at an annual rate of 55%, while memory speed improved by only 10% per year.
Issue: memory latency and bandwidth limitations within the processor make it difficult to achieve a major fraction of a chip's peak performance, and latency and bandwidth limitations of the communication fabric make it difficult to scale science and engineering applications to large numbers of processors.
[Chart: relationship between Peak, Linpack, and Sustained performance (TF/s) using SSP, and the ratio of Linpack to SSP, for NERSC systems: Peak (TF), Linpack (TF), Normalized SSP (TF), Linpack/SSP ratio.]
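As a rough sense of scale, the two quoted growth rates can be compounded directly; this is a back-of-the-envelope sketch using only the rates above, not a figure from the slides:

```python
# Compound the processor/memory gap implied by the quoted growth rates
# (55%/yr vs. 10%/yr) over the 1986-2000 period cited above.
years = 2000 - 1986                      # 14 years
cpu_growth, mem_growth = 1.55, 1.10      # annual improvement factors
gap_per_year = cpu_growth / mem_growth   # ~1.41x divergence per year
total_gap = gap_per_year ** years        # roughly two orders of magnitude
print(f"gap grows ~{gap_per_year:.2f}x/year, ~{total_gap:.0f}x over {years} years")
```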

3 Recommendation
Adopt a longer-term focus rather than the three-to-five-year focus, which is really just the useful lifetime of a single system. Achieving and using an exascale system, or the equivalent of tens of 100-petascale systems, will span 15 years and a progression of resource deployments. NSF would be well served to create a 15-year funding program that combines the total cost of acquiring, supporting, and using the resources. This strategy should include creating a supporting facility infrastructure that allows technology refreshes to be quickly deployed and integrated with existing resources.
To enable effective resource insertion, NSF should separate the selection of organizations that provision and support HPC resources from the selection of the resources themselves. The current NSF practice of issuing separate solicitations that combine an organization as a service provider with a sole system choice for each resource refresh leads to sub-optimization that can yield neither the most effective organization nor the best-value technology.
Focus on true sustained application performance. Using a measure such as Sustained System Performance (SSP) to determine the best-value resource solutions will give NSF the most cost-effective computing environments for the computational science communities. Use state-of-the-practice open, best-value procurements that compare technology choices on sustained performance while allowing vendors flexibility. NSF should take the lead in redefining the debate away from simple metrics and the TOP500, and toward meaningful measures for science.

4 Recommendation
NSF should follow the industry trend of concentrating its computational and data storage resources at a few locations that can make long-term investments amortized over a series of technology refreshes. These locations should be chosen by the organization's ability to manage large-scale, early-release systems, support an evolving computational science community, provide cost-effective extreme-scale infrastructure, and attract and engage world-class computer science and computational science staff.
NSF should develop an appropriate balance of production-quality and experimental resources. Production quality means systems built from well-known architectures (albeit possibly early-delivery versions of new generations) with proven performance, effectiveness, reliability, consistency, and usability, whose primary mission is use by the computational science community. Experimental resources are those with the potential to be disruptive technologies leading to significant (~10x) performance and/or price-performance improvements; their mission is clearly different. A typical investment strategy might be 85% production / 15% experimental.
NSF should establish a best-practice review of both US-funded resources and international funding programs. NSF should invest in performance-based design for all application areas.

5 Geographic Distribution of PRAC Leaders

6 Recommendation
NSF should separate the provisioning of a national science network from middleware software and/or compute and storage resource provisioning. A national science network that serves extreme-scale computational and data resources, major communities of computational and data scientists, and major observational and experimental facilities needs a long-term roadmap with consistent funding and a plan for technology insertion. A model for such a plan can be found in the DOE's ESnet program, among others.
NSF should likewise have a sustained program for distributed (aka cloud) middleware software creation and support. This support needs to be synchronized with the computational, data, and networking components of the NSF strategy, but needs to be an independent program component. NSF should support expanded development and evolution of extreme-scale system software aligned with the IESP roadmap.
There are contract arrangements that can assure both high-quality systems and services and innovation and advanced technology, in whatever balance NSF needs:
- Performance- and rewards-based contracts
- Deployment project management and ongoing operational assessments a la ITIL
- Example agreement: 6-year base term, renewable for up to a total of 16 years, with automatic as well as discretionary extensions that benefit both NSF and the providing organizations

7 ADDITIONAL SLIDES

8 A Generalized Sustained System Performance (SSP) Framework
An effective and flexible way to evaluate systems: determine the Sustained System Performance for each phase of each system.
1. Establish a set of performance tests that reflect the intended work the system will do. Any number of tests can be used as long as they share a common measure of performance.
2. A test consists of a code and a problem set.
3. Establish the amount of work (operations) the test needs to do for a fixed concurrency or a fixed problem set.
4. Time each test execution using wall-clock time.
5. Determine the amount of work done per scalable unit (node, socket, core, task, thread, interface, etc.): Work per unit = total operations / (total time x number of scalable units used for the test).
6. Composite the per-unit work across all tests. The composite function is chosen based on circumstances and test-selection criteria, and can be weighted or not as desired.
7. Determine the SSP of a system at any time period by multiplying the composite work per scalable unit by the number of scalable units in the system.
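A minimal sketch of steps 5-7 in Python, assuming an unweighted geometric mean as the composite function and entirely hypothetical test names and measurements (the framework itself leaves both the test set and the composite choice open):

```python
from math import prod

# Hypothetical per-test measurements: total operations (Tflop) and wall-clock
# time (s) for a run on a given number of scalable units (e.g., nodes).
tests = [
    {"name": "app_A", "ops_tflop": 1200.0, "time_s": 600.0, "units": 128},
    {"name": "app_B", "ops_tflop": 800.0,  "time_s": 450.0, "units": 128},
    {"name": "app_C", "ops_tflop": 2000.0, "time_s": 900.0, "units": 256},
]

# Step 5: work rate per scalable unit for each test (Tflop/s per unit).
per_unit = [t["ops_tflop"] / (t["time_s"] * t["units"]) for t in tests]

# Step 6: composite across tests; an unweighted geometric mean is one option.
composite = prod(per_unit) ** (1.0 / len(per_unit))

# Step 7: SSP of a system with N scalable units during this time period.
system_units = 4096
ssp_tflops = composite * system_units
print(f"composite per-unit rate: {composite:.4f} TF/s, SSP: {ssp_tflops:.1f} TF/s")
```

A weighted composite or a different mean could be substituted at step 6 without changing the rest of the calculation.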

9 Examples of Using the SSP Framework
Test a system upon delivery, use it to select a system, etc.
Determine the potency of the system: how well the system will perform the expected work over some time period. Potency is the sum, over the specified time, of the products of the system's SSP and the duration for which that SSP holds. There can be different SSPs for different periods, and different SSPs for different types of computational units (heterogeneous systems).
Determine the cost of systems. Cost can be in any resource units ($, watts, space, ...) and of any complexity (initial cost, TCO, ...).
Determine the value of the system. Value is the potency divided by a cost function. If needed, compare the value of different system alternatives or compare against expectations.
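Continuing the sketch above, potency and value follow directly from these definitions; the SSP values, durations, and cost below are invented for illustration:

```python
# Hypothetical (SSP in TF/s, duration in hours) pairs, e.g. reflecting upgrades
# that change the system's sustained rate over its deployment lifetime.
ssp_periods = [
    (50.0, 8760.0),    # year 1 at 50 TF/s sustained
    (65.0, 8760.0),    # year 2 after a technology refresh
    (65.0, 17520.0),   # years 3-4 at the refreshed rate
]

# Potency: sum over periods of SSP x duration (here, TF/s x hours).
potency = sum(ssp * hours for ssp, hours in ssp_periods)

# Cost can be expressed in any resource unit; assume a TCO figure in $M here.
cost_tco_musd = 55.0

# Value: potency divided by a cost function; higher is better when comparing options.
value = potency / cost_tco_musd
print(f"potency: {potency:.3e} TF/s-hours, value: {value:.3e} per $M")
```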