Improving RTL Design Power Efficiency with Interactive Debug and Automated Optimization, ANSYS 2015


1 Improving RTL Design Power Efficiency with Interactive Debug and Automated Optimization. ANSYS 2015

2 Power Budgeting Challenge. Design trends: multi-IP, multi-core integration; GHz+ performance. Consequences: increasing power gap (Ref: Cisco); reduced battery life (Ref: Samsung, Asia Tech Forum); degraded thermal performance.

3 Power Reduction: Early Power Decisions, High Impact. [Figure: power-reduction impact, on a 100% to 0% scale labeled from Large Impact to Small Impact, across RTL Design, Logic Synthesis, Physical Design, and Timing Closure.] RTL Design-for-Power: power-performance-area trade-offs; voltage / power domain planning; block-level clock and data gating; eliminate redundant activity. Low Power Implementation: power switch sizing / placement; clock gater cloning / decloning; multi-Vt optimization; power integrity verification.

4 Best Practices for Low-Power RTL Design. RTL Design-for-Power Platform: perform design trade-offs; profile design activity; check power vs. budget; debug power hotspots; reduce power automatically; track power via regressions. [Figure: RTL Power Regression Flow, Power (W) for Version 1 and Version 2 in Typ and Idle modes.] [Figure: TRANSMIT MODE vs. RECEIVE MODE, showing residual receive activity in transmit mode.] [Figure: power waveform, peak power = 391 mW, average power = 239 mW.] [Figure: enabled clock with inactive data.]

5 1. Perform Design Trade-Offs at RTL. Quick design iterations: RTL power in ~22 mins vs. gate-level power in ~20 hours. Effective design-for-power: power-per-function at RTL (adder, mux, register) vs. power-per-gate at gate level. [Figure: flow from Design Specification to RTL Design, Gate-Level Design, and Layout.]

6 Explore Power of Micro-Architectures. Compare power efficiency across different architectures; validate across multiple modes of operation. [Figure: Power (W) for Arch #1 and Arch #2 in typical and idle modes.] Ref: Architectural Exploration: Area-Power tradeoff in a transmitter design, MIT

7 2. Profile Design Activity. Conventional signal activity viewer: difficult to validate activity coverage; difficult to analyze activity per hierarchy. PowerArtist Activity Viewer: identify power-critical windows; qualify vectors per mode; identify wasted activity. [Figure: activity over time showing reset activity, pipeline fill activity, the optimal time interval, and redundant pipeline activity.]

8 Best Practices for Analyzing Design Activity. Per mode (transmit mode, receive mode): is there redundant data activity? Per hierarchy. Per net category (clock, register, memory). Flop clock pins per hierarchy: is there redundant clock activity?

9 Identify Power-Critical Cycles for Integrity. [Figure: clock/power gating, di/dt, and V-drop; Vdd, package, and chip with L di/dt.] Automatic cycle selection on a GPU core: the di/dt event does not occur at the same time as the peak; peak = 6X average power. Selected frames: CYCLE_POWER and DIDT. Input: GB FSDB, 632K cycles, 3.3M instances. High-performance engine optimized for M+ cycles. Identifies peak and dp/dt activity windows. Direct interface: PowerArtist to RedHawk.
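The observation that the di/dt event and the power peak fall in different cycles can be illustrated with a small sketch. It assumes a per-cycle power trace is available as a plain list; the function name and the trace values are hypothetical, and the cycle-to-cycle power step is used only as a rough stand-in for di/dt, not as the tool's actual selection algorithm.

```python
from typing import List, Tuple


def peak_and_ramp_cycles(power_mw: List[float]) -> Tuple[int, int]:
    """Return (index of the peak-power cycle,
    index of the cycle with the largest cycle-to-cycle power step)."""
    if len(power_mw) < 2:
        raise ValueError("need at least two cycles")
    # Cycle with the highest absolute power.
    peak_idx = max(range(len(power_mw)), key=lambda i: power_mw[i])
    # Cycle whose power differs most from the previous cycle
    # (a rising or falling step; used here as a proxy for di/dt).
    ramp_idx = max(
        range(1, len(power_mw)),
        key=lambda i: abs(power_mw[i] - power_mw[i - 1]),
    )
    return peak_idx, ramp_idx


# Hypothetical per-cycle power values in mW.
trace = [10.0, 12.0, 11.0, 40.0, 45.0, 48.0, 20.0]
peak_cycle, ramp_cycle = peak_and_ramp_cycles(trace)
print(peak_cycle, ramp_cycle)
```

In this made-up trace the peak is reached after the steep ramp has already happened, which is why both windows are worth passing to power-integrity analysis.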

10 3. Check Power vs. Budget, Early. Get early visibility into power: average, peak, power waveform. Guide power-related design decisions early: grid, package, decap. Avoid schedule and price impact. Average power: power by hierarchy, category, mode; power by clock, power domains. Time-based power: peak power and time; power waveforms per hierarchy, category.
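A budget check of this kind reduces to comparing the average and the peak of a power waveform against two limits. The sketch below assumes the waveform has been exported as a list of samples; the function name, the sample values, and the budget numbers are all hypothetical.

```python
from statistics import mean
from typing import Dict, List


def check_power_budget(
    waveform_mw: List[float], avg_budget_mw: float, peak_budget_mw: float
) -> Dict[str, object]:
    """Compare average and peak power of a waveform against budgets."""
    if not waveform_mw:
        raise ValueError("empty waveform")
    avg = mean(waveform_mw)
    peak = max(waveform_mw)
    return {
        "average_mw": avg,
        "peak_mw": peak,
        "average_ok": avg <= avg_budget_mw,
        "peak_ok": peak <= peak_budget_mw,
    }


# Hypothetical time-based power samples in mW.
samples = [200.0, 250.0, 400.0, 150.0]
result = check_power_budget(samples, avg_budget_mw=300.0, peak_budget_mw=350.0)
print(result)
```

Here the average is within budget while the peak is not, the kind of case that affects grid, package, and decap decisions rather than battery life.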

11 Profile RTL Power for Live Applications. Conventional two-step power analysis flow: (1) activity generation, where the emulator writes an FSDB; (2) power analysis, where PowerArtist reads the FSDB. With 100+M cycles the files are very large; file writing slows the emulator and file reading slows the power tool. Accelerated dynamic activity streaming flow (PAVES): streaming interface over a socket with a dynamic-read waveform API; activity generation and power analysis run in parallel as a single step; RTL power budgeting and gate sign-off for live applications; full-chip capacity; up to 4.5X faster TAT vs. FSDB; no loss of accuracy.

12 Is RTL Power Reliable for Early Decisions? RTL power relies on WLMs, clock modeling, and inferencing. Example RTL: module PA (... always @(posedge clk) begin dout <= din1; end assign out = sel ? dout : din2; ... endmodule. Post-layout power depends on clock distribution, parasitics, multiple Vt, and low-power structures. PACE bridges the RTL-to-implementation gap: a PowerArtist PACE model derived from a representative layout enables pre-layout power budgeting. PACE net parasitics: replace inaccurate WLMs. PACE clock trees: RTL CTS engine, mesh and tree. PACE cell selection: cell distribution including multiple Vt.

13 RTL Power Accuracy for FinFET Designs. PACE-based RTL power accuracy for 16FF IP. RTL vs. gates total power: within 15%. RTL vs. gates clock power: within 15%. Reported differences: Combo 20.6%, Register 2.8%, Clock 9.0%, Total 14.4%.

14 RTL vs. Gates: Accuracy and Performance. RTL power accuracy: ~15%. RTL power: ~30X faster.

15 4. Identify and Debug Power Hotspots. Foundation technology: accurate power engine. Graphical debug: quickly spot power anomalies (where?); interactively identify root cause (why? when?); trace power-annotated schematics and design hierarchy. Tcl-based debug: automate custom power reduction beyond standard tool reports; industry-standard performance with collections access to OpenAccess DB; complete access to power and activity properties. Power efficiency metrics: identify inefficiency at different abstractions (design hierarchy, clock hierarchy, instance); cycle-accurate metrics include clock gating efficiency.

16 Visually Debug Power for Anomalies. Power bugs (for example, inactive data with an active clock): power incorrect, functionally correct; large power savings. Designers spot bugs: browse by absolute power; browse by relative power; cross-probe to schematics and RTL; use power metrics.
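The "inactive data, active clock" anomaly can be expressed as a simple per-cycle count. The sketch below assumes two per-cycle boolean traces for one register bank; it defines gating efficiency as the fraction of cycles in which the clock is gated off, which is an assumption for illustration and may differ from the metric definition the tool uses. All names and values are hypothetical.

```python
from typing import Dict, List


def clock_activity_metrics(
    clock_active: List[bool], data_changed: List[bool]
) -> Dict[str, float]:
    """Summarize how much clock activity is useful for one register bank."""
    if len(clock_active) != len(data_changed) or not clock_active:
        raise ValueError("traces must be non-empty and of equal length")
    total = len(clock_active)
    # Cycles in which the clock was gated off.
    gated = sum(1 for c in clock_active if not c)
    # Cycles in which the clock toggled but the data did not change:
    # functionally harmless, but wasted power.
    redundant = sum(
        1 for c, d in zip(clock_active, data_changed) if c and not d
    )
    return {
        "gating_efficiency": gated / total,
        "redundant_clock_cycles": redundant,
    }


# Hypothetical 8-cycle traces.
clk = [True, True, True, False, True, False, True, True]
dat = [True, False, False, False, True, False, False, True]
metrics = clock_activity_metrics(clk, dat)
print(metrics)
```

A low gating efficiency combined with a high redundant-cycle count points at a register bank that is functionally correct but a candidate for additional clock gating.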

17 Automate Custom Reports with Tcl. Quick access to power and design properties. Example reports: memory consuming high power; glitchy ALUs (ALUs with either input not coming directly from registers); list of ungated registers (RTL file and line, power consumed, bit width); clock enable efficiency (per clock gate, with the downstream power the gate controls); ALU mux selects (list of signals that constitute the mux select for ALU data).
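The slide describes these reports as Tcl scripts against the tool database, whose commands are not shown here. As a tool-independent illustration of the "list of ungated registers" report, the following Python sketch filters and ranks hypothetical register records; the record fields mirror the properties named on the slide, but the data structure and all values are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Register:
    name: str
    rtl_file: str
    line: int
    bit_width: int
    power_uw: float
    clock_gated: bool


def ungated_register_report(registers: List[Register]) -> List[Register]:
    """Return ungated registers, highest power first."""
    ungated = [r for r in registers if not r.clock_gated]
    return sorted(ungated, key=lambda r: r.power_uw, reverse=True)


# Hypothetical design data.
regs = [
    Register("u_core.acc", "core.v", 120, 32, 85.0, False),
    Register("u_core.cfg", "core.v", 45, 8, 4.0, True),
    Register("u_dma.buf", "dma.v", 210, 64, 140.0, False),
]
report = ungated_register_report(regs)
for r in report:
    print(f"{r.name}  {r.rtl_file}:{r.line}  {r.bit_width} bits  {r.power_uw} uW")
```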

18 Review Power Efficiency Using Metrics: by hierarchical instance; by clock; by flop / latch.

19 5. Reduce Power Early at RTL: Block-level Clock and Data Gating. Block-level clock gating targets cycles where the clock is active and the data is inactive; block-level data gating targets cycles where the clock is inactive and the data is active. Wasted activity per mode (redundant activity in read mode). Table 1.1 Clock Pins, columns: Redundant Cycles, Total Cycles, Pin Name, Mode Name, Instance Name; row: CLKA, read, top.core1.t1.dpmem.m. Table Input and Redundant Pins, columns: Redundant Toggles, Total Toggles, Pin Name, Mode Name, Instance Name; row: AB[8], read, top.core1.t1.dpmem.m. Leverage quasi-static signals for stability-based coarse CG. [Figure: mux with inputs 0 and 1, select sel, sel (t-1), and data.]
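The idea behind stability-based gating is that a register need not be clocked in cycles where its input holds the value from the previous cycle. The sketch below estimates how many clock cycles such a scheme could remove for one quasi-static signal; it is a simplified model under that single assumption, with a hypothetical function name and made-up samples, and it ignores the power cost of the comparison logic itself.

```python
from typing import List, Sequence


def stable_cycle_fraction(samples: Sequence[int]) -> float:
    """Fraction of cycles in which the sampled value equals the value
    in the previous cycle, so a downstream register could be gated."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    cycles = len(samples) - 1
    # Count cycles where the value actually changed and a clock is needed.
    changed = sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)
    return 1 - changed / cycles


# Hypothetical per-cycle values of a quasi-static configuration signal.
sel_trace: List[int] = [3, 3, 3, 7, 7, 7, 7, 7, 2]
fraction = stable_cycle_fraction(sel_trace)
print(fraction)
```

A high fraction suggests the signal is a good candidate for a coarse, block-level gating condition.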

20 Leverage Automated Techniques. Clock / clock gating: increase CG coverage; improve clock enable efficiency; stability and observability. Memory subsystem: eliminate redundant access; split wide memories; exercise sleep modes. Control logic and datapath: eliminate redundant activity; use don't-care conditions; isolate datapath operators.

21 Maximize Reduction, Minimize Iterations. Prioritize high-impact reductions; minimize design impact. Saving estimates include added and removed logic and changed activity. [Figure: predicted power savings (normalized) vs. number of RTL changes (design effort); the top 5 RTL changes account for 50% of identified power savings.]
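Prioritizing by predicted saving amounts to ranking the candidate changes and reading off how many are needed to reach a target share of the total. A minimal sketch, assuming each candidate RTL change comes with one predicted saving; the function name and the numbers are hypothetical.

```python
from typing import List


def changes_needed(savings_mw: List[float], target_fraction: float) -> int:
    """Number of highest-saving changes needed to reach the target
    fraction of the total identified savings."""
    total = sum(savings_mw)
    if total <= 0:
        raise ValueError("no positive savings identified")
    running = 0.0
    # Apply changes in order of decreasing predicted saving.
    for count, saving in enumerate(sorted(savings_mw, reverse=True), start=1):
        running += saving
        if running >= target_fraction * total:
            return count
    return len(savings_mw)


# Hypothetical predicted savings per candidate RTL change, in mW.
candidates = [1.0, 8.0, 2.0, 5.0, 0.5, 3.0, 0.5]
needed = changes_needed(candidates, target_fraction=0.5)
print(needed)
```

Because savings are typically concentrated in a few changes, a small number of edits tends to capture a large share of the total, which keeps design impact low.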

22 6. Track Power via Regressions. Monitor power creep across the design development cycle. Tcl interface to database enables custom queries. Detailed tracking across design hierarchy, clock, supply. Utility tracks change in power across two versions (sample report). Typical regression framework: 30+ blocks in a typical SoC; 2+ vectors per block; vectors written for power (idle, active); daily block-level runs; weekly chip-level runs; track power change; track reduction opportunities.
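Tracking power creep between two design versions can be sketched as a per-block comparison with a tolerance. The example assumes per-block power numbers have already been extracted into dictionaries; the function name, the 5% threshold, and the block data are hypothetical and do not represent the utility mentioned on the slide.

```python
from typing import Dict


def power_creep(
    baseline_mw: Dict[str, float],
    current_mw: Dict[str, float],
    threshold_pct: float = 5.0,
) -> Dict[str, float]:
    """Return blocks whose power grew by more than threshold_pct,
    mapped to their percentage increase."""
    flagged = {}
    for block, base in baseline_mw.items():
        if block not in current_mw or base <= 0:
            continue  # block removed or no meaningful baseline
        change_pct = 100.0 * (current_mw[block] - base) / base
        if change_pct > threshold_pct:
            flagged[block] = round(change_pct, 1)
    return flagged


# Hypothetical per-block power for two design versions, in mW.
version_1 = {"cpu": 100.0, "dsp": 50.0, "usb": 20.0}
version_2 = {"cpu": 103.0, "dsp": 60.0, "usb": 19.0}
creep = power_creep(version_1, version_2)
print(creep)
```

Run per vector (idle, active) in a daily block-level regression, a check like this surfaces increases close to the commit that caused them.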

23 RTL Capacity: Large Designs / FSDBs. The FSDB captures only power-critical signals identified by PowerArtist. FSDB size: 1/4. TAT: 4X faster. Loss of accuracy: 2%.

24 PACE: physical-aware, predictable RTL power budgeting. High-capacity, high-performance power analysis and reduction. PAVES: power profiling for real-time applications. RPM: RTL power-driven physical power integrity.

25 ANSYS Customers in DAC 2015: Early Power Grid Prototyping for Atom Cores; Early Power Prediction Enables Better Power Efficiency and Faster Design Closure; Automated Flow for Technology and Design Specific Power Grid Design; Accurate Silicon Correlation using Chip-Package Co-analysis; Clock Jitter Analysis Flow using RedHawk-PJX; Thermal Integrity and Thermal-aware EM Reliability Check for 3D Stacked Die; Power Integrity Challenges for Next Generation Smart Phone SoC; Taming Power Integrity Sign-off Challenges of Gigascale Designs; RTL Power Methodology for Energy-efficient IP and SoC Designs for Mobile Application; IR Analysis of Large ICs through Distributed Multi-Processing Technology; Power Integrity Analysis of a Combined PCB, Package and Die; Achieving Power and Reliability Signoff for Automotive Semiconductor Design; Integrating PathFinder into SoC ESD CAD Sign-off Flow; High-frequency, High-power Magnetic Component Design with Maxwell 3D

26 Thank you