EV8: The Post-Ultimate Alpha. Dr. Joel Emer Intel Fellow Intel Architecture Group Intel Corporation

Size: px

Start display at page:

Download "EV8: The Post-Ultimate Alpha. Dr. Joel Emer Intel Fellow Intel Architecture Group Intel Corporation"

Brittney Reed
6 years ago
Views:

1 EV8: The Post-Ultimate Alpha Dr. Joel Emer Intel Fellow Intel Architecture Group Intel Corporation

2 Alpha Microprocessor Overview Higher Performance Lower Cost 0.35µm EV EV µm 0.18µm EV µm EV µm EV µm EV First System Ship

3 Goals Leadership single stream performance Extra multistream performance with multithreading Without major architectural changes Without significant additional cost

4 EV8 Architecture Overview Aggressive instruction fetch unit 8-wide super-scalar scalar execution unit 4-way simultaneous multithreading (SMT) Large on-chip L2 cache Direct RAMBUS interface On-chip router for system interconnect for glueless,, directory-based, ccnuma with up to 512-way multiprocessing

5 System Block Diagram EV8 M IO EV8 M IO EV8 M IO EV8 M IO EV8 M IO EV8 M IO EV8 M IO EV8 M IO EV8 M IO

6 Instruction Issue Time Reduced function unit utilization due to dependencies

7 Superscalar Issue Time Superscalar leads to more performance, but lower utilization

8 Predicated Issue Time Adds to function unit utilization, but results are thrown away

9 Chip Multiprocessor Time Limited utilization when only running one thread

10 Fine Grained Multithreading Time Intra-thread dependencies still limit performance

11 Simultaneous Multithreading Time Maximum utilization of function units by independent operations

12 Basic Out-of of-order order Pipeline Fetch Decode/ Map Queue Reg Read Execute Dcache/ Store Buffer Reg Write Retire PC Register Map Regs Dcache Regs Icache Thread-blind

13 SMT Pipeline Fetch Decode/ Map Queue Reg Read Execute Dcache/ Store Buffer Reg Write Retire PC Register Map Icache Regs Dcache Regs

14 Architectural Abstraction 1 CPU with 4 Thread Processing Units (TPUs( TPUs) Shared hardware resources TPU 0 TPU1 TPU2 TPU3 Icache TLB Dcache Scache

15 Key Design Principles High throughput single stream design Enhancements for SMT

16 Little s Law Average Number of Tasks in Region (N) Throughput (T) = Average Latency in Region (L)

17 Little s Law for Instruction Fetch L = fixed pipe length + average memory latency N = number of instructions fetched N T = L

18 Instruction Fetch Unit Wider fetch Fetch more statically consecutive instructions Limited by trace length Trace Cache Build sequences of dynamically consecutive instructions Significantly greater complexity Double fetch Fetch two non-consecutive blocks of instructions

19 Instruction Fetch Unit Address Latches Icache Rate Matching Buffer Collapsing Logic Line Predictor Misprediction Calculation Branch Predictor Jump Address Predictor Return Predictor

20 Instruction Fetch Characteristics Two 8-instruction 8 fetches per cycle 16 branch predictions per cycle Jump target prediction Return address prediction Rate matching buffer of fetched instructions Collapse fetched instructions into groups of 8

21 Execution Unit Issue Queue Regs Dcache Regs

22 Execution Unit Characteristics Single issue queue 8-wide 112+ entries Register file 512 registers 16 read/8 write ports Function units 8 integer ALUs 4 floating ALUs 4 memory operations (2 read/2 write)

23 Little s Law for Execution Unit L (min) = Number of cycles in pipe T (desired) = Number of desired instructions per cycle (8) N 8 =

24 Little s Law for Execution Unit Little's Law for the IQ Max IPC Average Q chunk Lifetime

25 Key Design Principles High throughput single stream design Enhancements for SMT

26 Additions for SMT Replication required resources Program counters Register File (architectural space) Register maps Sharable resources Register file (rename space) Instruction queue Branch predictor First and second level caches Translation buffers.

27 Approaches Replicated resources used for all per TPU state (except register file) some sharable resources where design is easier (*) E.g,, return stack predictor Shared resources used for register file (*) all other sharable resources (*) * Policy may be needed to make priority decisions

28 Choosing Policy

29 Choosing policies FIFO trivial Round robin easy Proportional special case Icount-style fair

30 Icount Choosing Policy

31 Why Does Icount Make Sense? N T = L N/4 T/4 = L

32 Choosers Address Latches Line Predictor Retire

33 Choosers - Fetch Address Latches Line Predictor Retire

34 Choosers Fetch Address Latches Line Predictor Retire

35 Choosers - Map Address Latches Line Predictor Retire

36 Choosers - Retire Address Latches Line Predictor Retire

37 Choosers LD/ST numbers cache Address Latches Line Predictor Retire

38 Choosers LD/ST numbers cache Address Latches Line Predictor Retire

39 Choosers Miss/Store cache Address Latches Line Predictor Retire

40 Choosers Fetch Chooser - Icount Map Chooser - Icount LD/ST Number Chooser - Proportional Retire Chooser Round Robin Load miss Chooser Round Robin Store Buffer Chooser - FIFO

41 Area Cost of SMT Support Total overhead: 6%±

42 Multiprogrammed workload 250% 200% 150% 100% 1T 2T 3T 4T 50% 0% SpecInt SpecFP Mixed Int/FP

43 Decomposed SPEC95 Applications 250% 200% 150% 100% 1T 2T 3T 4T 50% 0% Turb3d Swm256 Tomcatv

44 Multithreaded Applications 300% 250% 200% 150% 100% 1T 2T 4T 50% 0% Barnes Chess Sort TP

45 Acknowledgements Tryggve Fossum Chuan-Hua Chang George Chrysos Steve Felix Chris Gianos Partha Kundu Jud Leonard Matt Mattina Matt Reilly

Physically Constrained Architecture for Chip Multiprocessors

Physically Constrained Architecture for Chip Multiprocessors A Dissertation Presented to the faculty of the School of Engineering and Applied Science University of Virginia In Partial Fulfillment of the