Making the Right Hand Turn to Power Efficient Computing

1 Making the Right Hand Turn to Power Efficient Computing
Justin Rattner, Intel Fellow and Director Labs, Intel Corporation

2 Outline
- Technology scaling
- Types of efficiency
- Making the right hand turn

3 Historic Perspective
[Figure: three log-scale charts tracking vacuum tubes, bipolar, NMOS, and CMOS: size (cubic meters), gate delay (seconds), and energy per transition (joules). The slot after CMOS is marked with a question mark.]
Scaling will continue

4 Technology Scaling
[Figure: MOS transistor cross-sections before and after scaling, labeled gate, source, drain, body, Xj, Tox, Leff, and D.]
- X and Y dimensions scale down by 30%
- Z dimension: oxide thickness scales down
- Vcc and Vt scale down
- Doubles transistor density
- Faster transistors, higher performance
- Lower active power
Technology scaling is a great thing
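The density claim on this slide follows from the 30% linear shrink. A minimal sketch of the arithmetic, assuming ideal area scaling where transistor area shrinks with the square of the linear dimension (the function names are illustrative, not from the talk):

```python
# Linear shrink per generation, from the slide: X and Y scale down by 30%.
LINEAR_SCALE = 0.7

def area_scale(linear_scale: float) -> float:
    """Relative area of one transistor after the shrink (ideal scaling assumed)."""
    return linear_scale ** 2

def density_gain(linear_scale: float) -> float:
    """Relative number of transistors that fit in the same die area."""
    return 1.0 / area_scale(linear_scale)

print(f"area per transistor: {area_scale(LINEAR_SCALE):.2f}x")
print(f"transistor density:  {density_gain(LINEAR_SCALE):.2f}x")
```

A 0.7X shrink in both X and Y gives roughly half the area per transistor, which is where the doubling of density comes from.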

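5 Supply Voltage Scaling
[Figure: supply voltage (V) by process generation (0.5um, 0.35um, 0.25um, 0.18um, 0.13um, 90nm, 65nm, 45nm, 30nm), annotated ~30% for the earlier generations and ~15% or less for the later ones.]
Supply voltage scaling has slowed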

6 Active Power Projection
[Figure: active power (W) by generation, 180nm to 45nm, with curves for two transistor growth rates, one of them 1.5X.]
Assumptions (per generation):
- 15% Vdd scaling
- 50% frequency scaling
Power will limit transistor integration
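The projection on this slide can be approximated with the usual dynamic power relation (power proportional to transistor count times capacitance times Vdd squared times frequency). The sketch below is a rough model under stated assumptions, not the model behind the slide: the 0.7X capacitance scaling per transistor and the 2.0X growth case are hypothetical inputs, while the 15% Vdd scaling, 50% frequency scaling, and 1.5X growth case come from the slide.

```python
VDD_SCALE = 0.85    # from the slide: 15% Vdd scaling per generation
FREQ_SCALE = 1.5    # from the slide: 50% frequency scaling per generation
CAP_SCALE = 0.7     # ASSUMPTION: capacitance per transistor shrinks with linear size

def per_generation_factor(transistor_growth: float) -> float:
    """Relative active power after one generation (P ~ N * C * Vdd^2 * f)."""
    return transistor_growth * CAP_SCALE * VDD_SCALE ** 2 * FREQ_SCALE

def projected_power(transistor_growth: float, generations: int) -> float:
    """Relative active power after several generations, starting from 1.0."""
    return per_generation_factor(transistor_growth) ** generations

# 180nm -> 130nm -> 90nm -> 65nm -> 45nm is four generations.
for growth in (1.5, 2.0):  # 2.0 is a hypothetical second case
    print(f"{growth}X growth: {projected_power(growth, 4):.2f}x power at 45nm")
```

Under these assumptions active power still rises every generation even at 1.5X transistor growth, because the frequency gain outweighs the modest Vdd reduction.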

7 Sub-threshold Leakage
[Figure: MOS transistor characteristics, Ids (log scale) versus Vgs, marking Vt and Ioff; Ioff (nA/µm) versus temperature (C) by process generation.]
- Exponential increase in Ioff
Transistors will be dimmers, not switches

8 SD Leakage Power
[Figure: source-drain (SD) leakage power (W) by generation, 180nm to 45nm, with curves for two transistor growth rates, one of them 1.5X.]
Assumptions (per generation):
- 15% Vdd scaling
- 5X Ioff scaling
Excessive sub-threshold leakage power
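A rough way to see why leakage grows so quickly is to treat leakage power as proportional to transistor count times device width times Ioff per unit width times Vdd. The sketch below uses the slide's 5X Ioff and 15% Vdd assumptions. The 0.7X width scaling and the 2.0X growth case are hypothetical, and the model itself is an assumption rather than the one used for the slide.

```python
IOFF_SCALE = 5.0    # from the slide: 5X Ioff scaling per generation
VDD_SCALE = 0.85    # from the slide: 15% Vdd scaling per generation
WIDTH_SCALE = 0.7   # ASSUMPTION: transistor width shrinks with linear size

def leakage_factor(transistor_growth: float) -> float:
    """Relative leakage power after one generation (P ~ N * W * Ioff * Vdd)."""
    return transistor_growth * WIDTH_SCALE * IOFF_SCALE * VDD_SCALE

for growth in (1.5, 2.0):  # 2.0 is a hypothetical second case
    print(f"{growth}X growth: {leakage_factor(growth):.2f}x leakage per generation")
```

Even with the smaller growth rate, leakage multiplies several times per generation in this model, far faster than the active power trend.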

9 Frequency Variations in P, V, and T
[Figure: die-to-die spread, annotated 30% (frequency) and 20X (leakage).]
Process
- Device Ion
- Die-to-die variation and within-die variation
- Static for each die
- Time-dependent degradation: very slow (years)
Voltage
- [Figure: supply voltage (V) versus time (usec)]
- Vmax: reliability & power; Vmin: frequency
- Chip activity change, current delivery (RLC)
- Dynamic: ns to us
- Within-die variation
Temperature
- [Figure: temperature (C) versus time (usec) for cache and core, with 70C and 120C marked]
- Tmax: frequency & power; throttle
- Activity & ambient change
- Dynamic: us
- Within-die variation

10 Impact of Critical Paths
[Figure: percentage of dies versus clock frequency for different numbers of critical paths; mean clock frequency versus number of critical paths.]
Impact of transistor parameter variations:
- Wide distribution of circuit frequency
- Lower mean frequency as the number of critical paths grows
- Encourages more localized, clustered designs
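The drop in mean frequency with more critical paths can be illustrated with a small Monte Carlo experiment. This is a toy model, not the analysis behind the slide: it assumes each critical path has an independent, normally distributed delay (mean 1.0, standard deviation 5%), and that a die runs only as fast as its slowest path.

```python
import random
import statistics

def mean_die_frequency(num_paths: int, num_dies: int = 2000,
                       sigma: float = 0.05, seed: int = 0) -> float:
    """Mean clock frequency over simulated dies with `num_paths` critical paths."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    frequencies = []
    for _ in range(num_dies):
        # The slowest of the critical paths sets the cycle time of the die.
        slowest_delay = max(rng.gauss(1.0, sigma) for _ in range(num_paths))
        frequencies.append(1.0 / slowest_delay)
    return statistics.mean(frequencies)

for paths in (1, 10, 100):
    print(f"{paths:>3} critical paths: mean frequency {mean_die_frequency(paths):.3f}")
```

The more paths compete to be the slowest, the more likely one of them lands in the slow tail, so the mean frequency falls as the path count rises.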

11 Tough Platform Demands
- Shrinking volume: PC tower, mini tower, µ tower, slim line, small PC
- Reduced noise
- Yet, higher performance
Thermal requirement
- Reduced thermal budget
- Higher heat sink volume
- Higher air flow rate
[Figure: system volume (cubic inches) by form factor; thermal resistance (C/W), heat-sink volume (cubic inches), and air flow rate (CFM) versus power (W), with Pentium III and Pentium processors marked, plus projected air flow rate and projected heat dissipation volume.]
Requirements at odds

12 BOM Cost Squeeze
[Figure: desktop PC ASP (US $) and value for performance PCs (source: Dataquest Personal Computers); breakdown of a $2000 PC cost ('97) into CPU, fax/modem, motherboard, monitor, memory, CD-ROM, HDD, BOM, assembly, and margin; motherboard items including power supply, EMI, thermals, MB PWB, and power delivery (dollar labels: $1, $7, $15, $28, $28).]
Budget for power and cooling is shrinking

13 Data Centers: Rack Mount Limits
[Figure: power (watts) with series labeled Servers & Storage (~15%), Network Equipment (~28%), and Data Center Cooling Capability (~8%); cost (annotated 4X) and rack utilization versus W/sq. ft.]
Wasted space and higher cost in future

14 Outline
- Technology scaling
- Looking at efficiency
- Making a right hand turn

15 Power Efficiency
[Figure: growth (X) from the previous microarchitecture in die area, performance, and power, for super-scalar, dynamic, and deep-pipe designs.]
In the same process technology, compare scalar, super-scalar, dynamic, and deep pipe:
- 2-3X growth in area
- ~1.4X growth in integer performance
- ~1.7X growth in total performance
- 2-2.5X growth in power
2-2.5X growth in power per generation

16 Energy Efficiency
[Figure: reduction in MIPS/Watt (0% to 40%) for super-scalar, dynamic, and deep-pipe designs.]
20-30% drop in energy efficiency per generation
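The efficiency drop can be cross-checked against the growth figures on slide 15, since MIPS/Watt changes by the ratio of performance growth to power growth. The sketch below only brackets the range implied by those headline numbers; it does not reproduce the per-microarchitecture bars in the chart, and the function name is illustrative.

```python
def efficiency_change(perf_growth: float, power_growth: float) -> float:
    """Relative change in performance per watt (negative means a drop)."""
    return perf_growth / power_growth - 1.0

# Slide 15 figures: ~1.4X integer and ~1.7X total performance, 2-2.5X power.
for perf in (1.4, 1.7):
    for power in (2.0, 2.5):
        change = efficiency_change(perf, power)
        print(f"perf {perf}X, power {power}X: {change:+.0%} MIPS/Watt")
```

The four corner cases straddle the 20-30% drop quoted on this slide, with the best case (total performance, low end of power growth) somewhat milder and the worst case somewhat steeper.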

17 Circuit Efficiency
[Figure: % power for 1% performance across scalar, super-scalar, dynamic, and deep-pipe designs; legend labels: Static; 5% Domino; 15% Domino; 5% Other fast circuits, 15% Domino.]
Assumptions:
- Activity: Static = 0.2, Domino = 0.5
- Clock consumes 40% of full chip power
Faster circuits contribute to power inefficiency

18 Outline
- Technology scaling
- Types of efficiency
- Making a right hand turn

19 Power Comes First
[Figure: performance versus watts across market segments (desktop, mobile & 1U/blade server, PDA & slim mobile), with the Pentium, Pentium Pro, and Pentium 4 microarchitectures marked.]
Design approaches shown:
- Power not a consideration
- Optimize performance, then optimize power
- Set power envelope, then optimize performance
Business as usual is not an option

20 Low Power and High Performance
Maximize battery life (fixed energy)
- $Energy = T_{exec} \times Power \propto (1/Perf) \times Power$
- Increasing the performance by 10% and the power by 10% ends up with the same battery life
Maximize performance within a given power envelope (thermal constraints)
- $f \approx K \cdot V$
- $Power = \alpha \cdot C \cdot V^2 \cdot f \propto \alpha \cdot C \cdot f^3$
- $\Delta Power / Power = ((f + \Delta f)^3 - f^3) / f^3 \approx 3 \Delta f / f$
- $Perf = IPC \times f$
The right trade-off between performance and power: weigh a gain in IPC against the 3X power cost of an equal frequency gain; 3 is the metric
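Both relations on this slide are easy to check numerically. The sketch below assumes, as the slide does, that voltage scales linearly with frequency, so power scales with the cube of frequency; the function names are illustrative.

```python
def power_increase(freq_increase: float) -> float:
    """Relative power increase for a relative frequency increase (P ~ f^3)."""
    return (1.0 + freq_increase) ** 3 - 1.0

def energy_ratio(perf_increase: float, power_increase_: float) -> float:
    """Energy for a fixed task relative to baseline (E ~ Power / Perf)."""
    return (1.0 + power_increase_) / (1.0 + perf_increase)

df = 0.10
print(f"exact power increase for +10% frequency: {power_increase(df):.1%}")
print(f"linear approximation 3 * df:             {3 * df:.1%}")
print(f"energy ratio for +10% perf, +10% power:  {energy_ratio(0.10, 0.10):.2f}")
```

The exact cubic result sits slightly above the 3X approximation, and an equal percentage gain in performance and power leaves the energy per task unchanged, which is the battery life point the slide makes.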

21 Less is More
Strive to accomplish the same task in less energy and less time
- Higher performance at lower energy can always be traded for the same performance at lower power
Methodology works at all levels
- Aggressive clock gating
- Caching, dumb and smart
- Better branch predictors
- Smart work reduction
- Prioritize useful over speculated work
- Fixed functions

22 Less is More in Banias
Improved branch prediction
- Over 20% fewer branch mispredictions
Dedicated stack manager
- Over 5% uop reduction
Uop fusion
- Over 10% uop reduction
Big L2 cache
Achieving higher performance at lower power

23 Reducing Active Power
Multiple supply voltages
- Low supply voltage for slow blocks
- High supply voltage for fast blocks
Throughput oriented design
- One logic block at Vdd: Freq = 1, Vdd = 1, Throughput = 1
- Two logic blocks at Vdd/2: Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Pwr Den =
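The throughput oriented numbers above can be reproduced from the dynamic power relation. This is a minimal sketch, assuming power per block scales as Vdd squared times frequency with capacitance per block held constant, and that throughput adds across identical parallel blocks; the function names are illustrative.

```python
def relative_power(num_blocks: int, vdd: float, freq: float) -> float:
    """Total dynamic power relative to one block at Vdd = 1, Freq = 1."""
    return num_blocks * vdd ** 2 * freq

def relative_throughput(num_blocks: int, freq: float) -> float:
    """Total throughput relative to one block at Freq = 1."""
    return num_blocks * freq

baseline = (relative_power(1, 1.0, 1.0), relative_throughput(1, 1.0))
parallel = (relative_power(2, 0.5, 0.5), relative_throughput(2, 0.5))
print(f"one block at Vdd:      power {baseline[0]}, throughput {baseline[1]}")
print(f"two blocks at Vdd / 2: power {parallel[0]}, throughput {parallel[1]}")
```

Two half-speed blocks at half the voltage match the baseline throughput at a quarter of the power, at the cost of twice the area.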

24 Critical Scheduling
[Figure: data management system and register file feeding a slow cluster with a large, slow schedule window and a fast cluster with a small, fast schedule window; tag and data paths.]
Large schedule window (ILP)
Exploit instruction criticality
- Latency tolerant bypass
- Limited critical resources
50-60% non-critical, 40-50% critical

25 Recycling Waste
Wasted execution (speculatively executed vs. retired)
- ~30% in 1st gen OOO
- ~60% in 2nd gen OOO
- ~160% in future
Leverage info from wasted execution
Improve branch prediction
- 30% reduction in misprediction rate
- 18% to 48% less wasted execution
Can we reuse some of the execution results too?
[Figure: Exec/Retire ratio; misprediction latency.]

26 Efficiency Through xmt
[Figure: gap between CPU and memory clock rates (MHz); execution timelines for a single thread (ST) and for multiple threads (MT1, MT2, MT3), showing waits for memory.]
- Thermals & power delivery designed for full HW utilization
- Multithreading improves performance without impacting thermals & power delivery

27 Power Efficient Asymmetric Threads
Function asymmetric threading (with co-processor ISA)
- Partition a single thread into a main thread with special function threads
- Special function unit is more area and power efficient
Performance asymmetric threading (ISA compatible)
- Serial code on heavy core
- Parallel code on smaller, power-efficient cores

28 Chip Multi-Processing
[Figure: four cores (C1 to C4) around a shared cache; relative performance of CMP versus single thread (ST) against die area and power.]
- Multi-core, each core MT
- Shared cache and front side bus
- Each core has different Vdd & Freq
- Core re-cycling to spread hot spots
- Lower junction temperature

29 Summary
- Technology scaling can and will continue
- Challenges to power and energy efficiency are real but surmountable
- ...through evolutionary approaches to circuits and microarchitecture