GTC Using GPUs to Speedup Chip Verification. Tomer Ben-David, VP R&D

1 GTC-2012 Using GPUs to Speedup Chip Verification Tomer Ben-David, VP R&D

2 The Verification Bottleneck
- Long verification process
  - In 66% of designs, verification takes 50% of the design cycle
  - In ~40% of projects, simulation regression runtime is longer than 1 day
- SoC simulation challenge
  - Over 40% of designs are larger than 10M gates
  - Difficult to simulate the entire design/SoC
- Excessive computing resources
  - Requires 10s or 100s of GBytes of memory
  - Needs the most advanced CPUs
Source: 2010 Wilson Research Group and Mentor Graphics Functional Verification Study

3 RocketSim™
- Verification: verification is the bottleneck of the design cycle
- Simulation: simulation speed is the bottleneck of verification
- Rocketick's solution: RocketSim™ accelerates simulations by 10X or more

4 Simulators using CPUs
- Event driven, implemented with a single queue of events
  - HDL (Verilog/VHDL) maps naturally to event-driven simulation, even better than it maps to synthesis
  - The calculations are very short and simple, so handling them one at a time makes a lot of sense
- Memory access patterns
  - For every complete clock cycle, most of the simulation state is accessed
  - As design size increases, we get more cache misses
- Multi-core CPUs
  - The HW
    - Only one order of magnitude
    - Limited memory bandwidth
  - The SW solutions
    - High-level partitioning
    - Relaxation latency
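The single-queue, event-driven scheme described on this slide can be sketched roughly as follows. This is a minimal illustration, not the code of any real simulator: the function name simulate, the gate and fanout tables, and the unit gate delay are all assumptions made for the example.

```python
import heapq
from itertools import count

def simulate(gates, fanout, initial_events, values):
    """Hypothetical single-queue event-driven gate simulator (unit delay).

    gates:  output signal -> (function, tuple of input signal names)
    fanout: signal -> list of gate outputs that read it
    initial_events: iterable of (time, signal, value)
    values: dict with the current value of every signal
    """
    seq = count()  # tie-breaker so events with equal time stay in FIFO order
    queue = [(t, next(seq), sig, val) for t, sig, val in initial_events]
    heapq.heapify(queue)
    processed = 0
    while queue:
        time, _, sig, val = heapq.heappop(queue)  # one event at a time
        processed += 1
        if values.get(sig) == val:
            continue  # value did not change, nothing to propagate
        values[sig] = val
        for out in fanout.get(sig, []):
            fn, ins = gates[out]
            new_val = fn(*(values[i] for i in ins))
            heapq.heappush(queue, (time + 1, next(seq), out, new_val))
    return values, processed

# Tiny example circuit: n1 = a AND b, y = NOT n1
gates = {"n1": (lambda a, b: a & b, ("a", "b")),
         "y": (lambda n: 1 - n, ("n1",))}
fanout = {"a": ["n1"], "b": ["n1"], "n1": ["y"]}
values = {"a": 0, "b": 0, "n1": 0, "y": 1}
final, processed = simulate(gates, fanout, [(0, "a", 1), (0, "b", 1)], values)
print(final, processed)
```

Each event touches a different part of the simulation state, which is why cache misses grow with design size, and the single queue serializes the work, which is why extra CPU cores help so little.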

5 Applications using GPUs
Source: NVIDIA

6 Typical CUDA development flow
- Application/Problem
- Find algorithm that solves the problem
- Split code to CPU/GPU regions
- Profiling and performance analysis
- Coding for CPU/GPU parts
- Debug & fix

7 The nature of the problem at hand
- Many CUDA developers face the same challenge: efficient coding is not trivial
- In our case, the problem varies with every user's chip design:
  - Graph with billions of processing nodes
  - Huge amount of dependencies
  - Mapping different Verilog constructs to SIMT
- Compilation of the user's design has to take minutes, while the development flow with CUDA takes weeks!

8 The solution: Virtual Machine
- Rocketick's VM is implemented using CUDA
- The VM processes recipes efficiently, maintaining graph dependencies
- Proprietary tools for debug and profiling
- Ideal-fit target platform for RocketSim's compiler
- Forward compatibility for each GPU
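The slides do not describe the recipe format, so the following is only a hedged sketch of the general idea of a virtual machine that executes precompiled recipes while respecting graph dependencies. The opcode table OPS, the (opcode, destination, sources) tuple layout and the function run_recipes are hypothetical, and plain Python stands in for the CUDA implementation.

```python
# Hypothetical opcode table; the real recipe encoding is not given in the slides.
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOT": lambda a: a ^ 1,
}

def run_recipes(levels, state):
    """Execute recipes level by level over a flat state array.

    levels: list of levels; each level is a list of (opcode, dst, srcs)
            recipes that do not depend on each other.
    state:  list of signal values, indexed by the slots used in the recipes.
    """
    for level in levels:
        # Read phase: a recipe only reads slots written by earlier levels,
        # so all recipes in a level could run in parallel (one thread each).
        results = [(dst, OPS[op](*(state[s] for s in srcs)))
                   for op, dst, srcs in level]
        # Write phase: commit the results before the next level starts.
        for dst, val in results:
            state[dst] = val
    return state

# Half adder plus an inverted carry: slots are [a, b, sum, carry, not_carry]
levels = [
    [("XOR", 2, (0, 1)), ("AND", 3, (0, 1))],  # level 0: independent gates
    [("NOT", 4, (3,))],                        # level 1: depends on carry
]
state = run_recipes(levels, [1, 1, 0, 0, 0])
print(state)
```

The point of the design is that only the recipe data changes from one chip design to the next; the interpreter itself is written and tuned once, so compiling a user's design does not involve a CUDA development cycle.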

9 Rocketick's technology: Breaking the Dependency Barrier

10 Traditional Simulators vs. RocketSim

11 RocketSim - Compilation Stages
- Analyze
  - Parse HDL source files
  - Static elaboration
  - RTL elaboration
- Compile
  - Create optimal dependency graphs
  - Calculate optimal GPU invocation schemes
  - Generate skeleton (ske.v)
- Assembly
  - Calculate optimal memory allocation for variables
  - Generate final recipes for the GPU virtual machine
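One common way to turn a dependency graph into an invocation scheme is levelization: nodes are grouped into levels so that every node depends only on nodes in earlier levels, and each level can then be issued as one parallel invocation. The sketch below shows that generic technique (Kahn-style topological levels). It is an assumption that RocketSim does anything similar; the function name levelize and the input format are made up for the example.

```python
from collections import defaultdict

def levelize(deps):
    """Group the nodes of a dependency graph into parallel levels.

    deps: node -> iterable of nodes it depends on.
    Returns a list of levels; every node in a level depends only on
    nodes in earlier levels.
    """
    nodes = set(deps)
    for srcs in deps.values():
        nodes.update(srcs)
    # Number of not-yet-scheduled dependencies per node.
    remaining = {n: len(set(deps.get(n, ()))) for n in nodes}
    users = defaultdict(list)  # node -> nodes that read it
    for n, srcs in deps.items():
        for s in set(srcs):
            users[s].append(n)

    level = sorted(n for n, c in remaining.items() if c == 0)
    levels, done = [], 0
    while level:
        levels.append(level)
        done += len(level)
        nxt = []
        for n in level:
            for u in users[n]:
                remaining[u] -= 1
                if remaining[u] == 0:
                    nxt.append(u)
        level = sorted(nxt)
    if done != len(nodes):
        raise ValueError("graph has a cycle (combinational loop)")
    return levels

# n1 and n2 are independent of each other; y needs both.
deps = {"n1": ["a", "b"], "n2": ["b", "c"], "y": ["n1", "n2"]}
levels = levelize(deps)
print(levels)
```

A long chain of sequential logic produces many levels with very few nodes each, which leaves most of the GPU idle.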

12 The challenge of squeezing out performance
- Wide memory accesses
- Efficient kernel invocations (static & dynamic work aggregation)
- Very large dataset, very small GPU shared memory
- Host-GPU synchronization (simulation state)
- Long sequential logic (implementation optimizations and GPU-CPU handoff)
- Multi-GPUs (much larger latencies)

13 NVIDIA tools vs. proprietary tools
- When gdb is just not enough
- When the CUDA profiler is just not enough

    sid index: 6 (2x5_#14_comb_logic_6)
    dump region: global mem=0x000a3800 (sid:2x5_#14_comb_logic_6 seq/peg: 0/0) {8 : <UID_wPEGelmnt:[100009]>}
    [0080] (1B) OUT [-S] 00 ROCK_SKE_TOP.csr_dtct_0[0]
    [0081] (1B) IN [L-] 00 ROCK_SKE_TOP.csr_posneg_1[0]
    [0082] (1B) OUT [-S] 00 ROCK_SKE_TOP.pos_dtct_1[0]
    [0083] (1B) OUT [-S] 00 ROCK_SKE_TOP.neg_dtct_2[0]
    [0084] (1B) IN [L-] 00 counter_tb.my_counter.rst[0]
    [0085] (1B) OUT [-S] 01 counter_tb.my_counter.n5[0]
    [0086] (1B) IN [L-] 00 counter_tb.my_counter.clk[0]
    [0087] (1B) IN [L-] 00 ROCK_SKE_TOP.csr_p_0[0]
    [0091] (1B) IN [L-] 00 1'h0[7:0]
    [0092] (4B) CACHE [L-] AAAAAAAA 32'haaaaaaaa[31:0]
    [0096] (4B) CACHE [L-] FFFFFFFF 32'hffffffff[31:0]
    [0100] (4B) CACHE [L-] 'h0[31:0]
    [0104] (4B) CACHE [L-] counter_tb.my_counter.n7[31:0]

14 RocketSim bringing results!

15 RocketSim Overview Summary
- 10X or more acceleration factor
- Works seamlessly with all leading simulators
- Supports extremely large designs (giga-gates)

16 Rocketick
- Need: big industry pain, with a huge amount of money spent on verification
- Challenge: mapping a graph with dependencies to the GPU is very hard to do!
- Results: 10X-30X acceleration on the largest designs in the world

17 Thank you. For more information, please visit our web site: