

### Power Management Early in the Design Flow: Exploration to Implementation

### April, 2007 Holly Stump, VP Marketing



© 2007 by Sequence Design, Inc.



#### Power Management Early in the Design Flow: Exploration to Implementation

- Managing Power at the Architectural Level
- **RTL Power Management**
- Power Debug Environment
- Silicon-Aware Power Management
- Software and Mode-Dependent Stimulus
- Identifying and Eliminating Wasted Power at RTL
- Popular Power Reduction Techniques
- Clock Power and Clock Gating
- Multi-Vt
- Voltage Islands
- Power Gating
- Power Regression Testing
- **Power Metrics**
- Summary





### **Design Investigations and Power Requirements**

**Design Investigations Tracked by Power Consumption** 

Total # Design Investigations Tracked = 13,546 (Jan - Oct, 2006)



Ref: Chip Design Trends Newsletter, John Blyler, Dec 2006





### **Power Is The New Performance!**

Low Power Is Critical Due To:





Survey Summary of SoC Designers DAC 2006 Sample size = 115



# **Managing Power at the Architectural Level**

• Are you an architect?

- What-if analysis for micro-architectures
- Optimization for power, performance, area

### ESL and other explorations

### Intelligent debug environment

- RTL power estimation
- RTL power management





### **Power-Aware ESL Synthesis Flow**



Area vs. performance tradeoff Area vs. performance vs. **power** tradeoff





### 802.11a: Optimized for power, area, performance

802.11a Wi-Fi transmitter

### 7 candidate micro-architectures

- Push-button tool flow:
  - Bluespec for design
  - Sequence for power
  - Synopsys for RTL synthesis
- Final design: 4 milliwatts



### 7 micro-architectures implemented and explored in only 5 engineer-days

Source: Dave, Pellauer, Gerding & Arvind Courtesy: MIT





### **802.11a Transmitter Overview**








# **IFFT module (combinational)**



Each of the 48 radix4 blocks looks like this:



All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power,



. . .





# **IFFT: Micro-architectural exploration**



Each stage's 16 radix4 blocks could also be implemented with 8, 4, 2 or 1 radix4 block(s) used over multiple cycles

Each stage is almost identical, so why not fold and reuse what you can?
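As a rough illustration of the folding trade-off, the sketch below counts the cycles one IFFT needs when each stage's 16 radix4 operations are shared among fewer physical blocks. It is a simplified model, not the MIT implementation: it assumes three identical stages (48 blocks at 16 per stage), one radix4 operation per block per cycle, and no control or I/O overhead. The function name `ifft_cycles` is hypothetical.

```python
import math

RADIX4_OPS_PER_STAGE = 16                     # from the slide: 16 radix4 blocks per stage
STAGES = 48 // RADIX4_OPS_PER_STAGE           # 48 blocks in total -> 3 stages


def ifft_cycles(physical_blocks: int) -> int:
    """Cycles for one IFFT in a simplified folding model (assumption:
    each physical block performs one radix4 operation per cycle)."""
    passes_per_stage = math.ceil(RADIX4_OPS_PER_STAGE / physical_blocks)
    return STAGES * passes_per_stage


for n in (16, 8, 4, 2, 1):
    print(f"{n:2d} radix4 block(s): {ifft_cycles(n)} cycles per IFFT")
```

Fewer physical blocks save area but lengthen the computation, so the clock has to run faster to sustain the same symbol rate.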







# Superfolded circular pipeline: Just one Radix-4 node!









### **Performance Results**

All the combinations were created and explored within <u>five</u> days.

Designers were <u>astounded</u> to find that their intuitions were wrong and that the critical areas for reducing power were not where they suspected.

[Screenshot: PowerTheater SOC power consumption view for comb.scn, showing the design hierarchy color-coded by percentage of power consumed and a power summary panel (comb.res) with static, dynamic, and total power per contribution.]

| 802.11a Design (by IFFT block type) | Area (um^2) | Symbol Latency (cycles) | Throughput (clks/symbol) | Min frequency required (MHz) | Average Power (mW) | Notes |
|----------------------------------------|----------------|-------------------------------|---------------------------------|------------------------------|--------------------------|----------------|
| Combinational                          | 4.91           | 10                            | 4                               | 1.0                          | 3.99                     | **Optimal power** |
| Pipelined                              | 5.25           | 12                            | 4                               | 1.0                          | 4.92                     |                |
| Folded - 16 radix4                     | 3.97           | 12                            | 4                               | 1.0                          | 7.27                     |                |
| Folded - 8 radix4                      | 3.69           | 15                            | 6                               | 1.5                          | 10.9                     |                |
| Folded - 4 radix4                      | 2.45           | 21                            | 12                              | 3.0                          | 14.4                     | Original       |
| Folded - 2 radix4                      | 1.84           | 33                            | 24                              | 6.0                          | 21.1                     | designer       |
| Folded - 1 radix4                      | 1.52           | 57                            | 48                              | 12.0                         | 34.6                     | intuition      |
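The minimum frequency column follows from the throughput column once a symbol period is fixed. The sketch below assumes the 4 µs OFDM symbol period of 802.11a (an assumption not stated on the slide, but consistent with 4 clocks per symbol mapping to 1.0 MHz) and copies the throughput and power figures from the table. The helper names are hypothetical.

```python
SYMBOL_PERIOD_US = 4.0  # assumed 802.11a OFDM symbol period, in microseconds

# (design, clocks per symbol, average power in mW), copied from the table
designs = [
    ("Combinational", 4, 3.99),
    ("Pipelined", 4, 4.92),
    ("Folded - 16 radix4", 4, 7.27),
    ("Folded - 8 radix4", 6, 10.9),
    ("Folded - 4 radix4", 12, 14.4),
    ("Folded - 2 radix4", 24, 21.1),
    ("Folded - 1 radix4", 48, 34.6),
]


def min_frequency_mhz(clocks_per_symbol: int) -> float:
    """Lowest clock rate that still finishes one symbol per symbol period."""
    return clocks_per_symbol / SYMBOL_PERIOD_US


def energy_per_symbol_nj(power_mw: float) -> float:
    """Energy spent per symbol; mW multiplied by microseconds gives nJ."""
    return power_mw * SYMBOL_PERIOD_US


for name, clocks, power_mw in designs:
    print(f"{name:20s} {min_frequency_mhz(clocks):5.1f} MHz "
          f"{energy_per_symbol_nj(power_mw):7.2f} nJ/symbol")

lowest_power = min(designs, key=lambda d: d[2])
print("Lowest average power:", lowest_power[0])
```

The smallest design needs a 12x faster clock than the combinational one, which is one way to see why the area-driven intuition did not produce the lowest power.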





# **ESL Study in Video Decoding**






### **ESL Synthesis, Estimation: Decoder Hardware**





RTL Estimate Results for "video demo" application:

- 40.4k gates
- 100MHz
- Library: TSMC 90G Worst Case
- 0.9V power supply
- Power analysis
  - RTL estimate = 4.79mW
  - Gate-level results = 4.40mW
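To put the two numbers above side by side, the sketch below computes the RTL estimate's deviation from the gate-level result. The function name is hypothetical, and treating the gate-level figure as the reference is an assumption.

```python
def relative_error_pct(estimate: float, reference: float) -> float:
    """Signed deviation of an estimate from a reference, in percent."""
    return 100.0 * (estimate - reference) / reference


rtl_estimate_mw = 4.79   # RTL estimate from the slide
gate_level_mw = 4.40     # gate-level result from the slide

error = relative_error_pct(rtl_estimate_mw, gate_level_mw)
print(f"RTL estimate deviates from gate level by {error:+.1f}%")
```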







### **RTL Power Management**

- 80% of chip power is determined at RTL (or earlier)
  - SoC power must be dealt with at RTL
  - Gate level appropriate for high-accuracy verification



SEQUENCE Enabling Power-Aware SoC Design<sup>SI</sup>



# **Power Debug Environment**

#### What is critical?

- Architectural trade-offs not available at gate
- Estimate block, IP and full chip power
- Vector and vectorless modes

#### **Performance and Capacity**

- RTL: 10X gate level throughput
- RTL abstraction / capacity

#### Accuracy

- Within 20% of gate
  - Clock power algorithms
  - Macro-level power modeling
  - Library, memory, IP power attributes
  - Robust power arc matching

#### **Fast Power Debug**

- Visibility and prioritization
  - Thermal map on design hierarchy tree highlights areas to investigate
  - Cross probing to source: isolate power problems
  - Detailed visual and textual reports



## Silicon-Aware Power Management






### Software / Vectors / Modal Power Analysis



Gate-level power verification

Dynamic voltage drop


#### Modal Power Analysis at RTL

- Run simulation vectors for all critical modes of operation
- Analyze power and activity per mode
- Find modal power bugs

#### **Power Vector Forward**

- Identify and qualify worst case power cycles
- Feed forward to gate-level power analysis
- Feed forward to implementation DVD analysis
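One way to picture "identify worst case power cycles" is a sliding-window search over a per-cycle power trace. The sketch below is a generic illustration, not PowerTheater's algorithm; the trace values, the window length, and the function name are all hypothetical.

```python
def worst_case_window(power_mw, window):
    """Return (start_index, average_power) of the contiguous window of
    `window` cycles with the highest average power."""
    if window <= 0 or window > len(power_mw):
        raise ValueError("window must be between 1 and the trace length")
    window_sum = sum(power_mw[:window])
    best_sum, best_start = window_sum, 0
    for i in range(window, len(power_mw)):
        # Slide the window by one cycle: add the new sample, drop the oldest.
        window_sum += power_mw[i] - power_mw[i - window]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, i - window + 1
    return best_start, best_sum / window


# Hypothetical per-cycle power trace in mW
trace = [10.0, 12.0, 11.0, 30.0, 35.0, 28.0, 12.0, 9.0]
start, average = worst_case_window(trace, 3)
print(f"Worst 3-cycle window starts at cycle {start}, average {average:.1f} mW")
```

The cycles selected this way are the ones worth carrying forward to the slower gate-level and voltage drop analyses.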



# Identifying, Eliminating Wasted Power

#### Common errors, mishaps and wasted power

- Enabled clock toggles while data is inactive (shown)
- Data toggles on register input while clock is inactive (register power)
- Wasted (un-gated) clock toggles while data is inactive
- Use clock gating cell with local explicit clock enable instead of feedback mux
- Enable active, data / clock not
- Mux select active, data inactive....
- Memory splitting advisory
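The first item in the list can be checked on simulation activity alone. The sketch below models one enabled register and counts the cycles in which it is clocked but captures a value identical to the one it already holds. It is an illustrative model with hypothetical signal traces and function name, not the tool's rule implementation.

```python
def count_wasted_enabled_cycles(enable, data_in, reset_value=0):
    """Count cycles where the register is enabled (clock toggles at the
    flops) but the captured data equals the stored value."""
    stored = reset_value
    wasted = 0
    for en, d in zip(enable, data_in):
        if en:
            if d == stored:
                wasted += 1      # clock power spent, state unchanged
            stored = d
    return wasted


# Hypothetical per-cycle traces for one register
enable = [1, 1, 1, 0, 1, 1]
data_in = [5, 5, 5, 7, 7, 7]
print("Wasted enabled cycles:", count_wasted_enabled_cycles(enable, data_in))
```

A high count relative to the total number of enabled cycles suggests the enable condition could be tightened.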


#### A variety of errors pertaining to

- Clock, datapath, control, memory, I/O
- Muxes, clock-gated registers, memories





# **Eliminate Wasted Power**

- Pre-defined topology rules based on activity
  - Reporting where, how and how much power can be reduced
    - Clock, datapath, control, memory, I/O
    - Muxes, clock-gated registers, memories

[Screenshot: PowerTheater reduction results (rtl_pre_opt.red), listing prioritized power savings per rule with total, internal, and clock power and area impact. Rules shown include local explicit clock enable and datapath operator isolation. The view supports cross probing to the RTL source and shows a detailed rule description with definition and implementation schematics.]



### **Ubicom Slashes Power 25%**



#### **Communications and Media Processor standout Ubicom**

- StreamEngine 5000<sup>™</sup> family
- Chip includes
  - 10 MPUs
  - Commercial IP
  - Memory
  - 350K gates of standard cell logic

#### Reduced power in multi-core IC logic arrays by 25%

- RTL power analysis and optimization
- RTL clock power analysis
- Automated power reduction (wasted power)

#### RTL power analysis correlated to within 5% of gates



# Airgo Networks (now Qualcomm)



- Single chipset that supports 802.11 a/b/g
  - 802.11 MIMO chip sets (baseband and RF)
    - First commercial/consumer MIMO systems
- First cost-effective True MIMO products with 2x max data rate
  - 108Mbps in one RF channel

- 6x to 8x rate/range performance
- 2x3 MIMO System Architecture
- Target: Acceptable power consumption levels





[Figure: bar chart of power (mW), axis 0 to 700; labels include 605 and PowerTheater (RTL).]


"PowerTheater predicted power usage within 10% of silicon for our AGN100BB chip. We achieved this level of accuracy by leveraging [...]"

- Derrick Lin, Senior Director, ASIC Engineering, Airgo Networks

- RTL data, RTL clock: 12% of silicon
- RTL data, gate clock: ~5% of silicon




# Airgo Methodology



- RTL: fastest, but not the most accurate
- Gate: good accuracy, but too slow and a new flow for projects that don't do gate-level sims
- 40% of dynamic power consumed in the clock tree
  - Estimate clock tree accurately (SDF back-annotated simulations along with SPEF data)
  - Estimate "data" power using RTL simulations
  - Add the two numbers up (using clock activity factors from RTL) to get the "average" dynamic power
- Determine a good "power vector" based on design knowledge
- Modeling requirements
  - ASIC library has to be characterized for power
  - RAMs and IOs also have to be characterized for power
  - Wireload models increase data power accuracy
- Taped out several chips with comfortable margins
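The "add the two numbers up" step can be sketched as follows. This is one reading of the methodology, assuming that each clock domain's gate-level clock tree power is scaled by an activity factor taken from RTL simulation; the domain names, power values, and function name are hypothetical.

```python
def average_dynamic_power(clock_domains, data_power_mw):
    """Combine gate-level clock tree power with RTL data power.

    clock_domains maps a domain name to (always-running clock tree power
    in mW, activity factor from RTL in the range 0..1).
    Returns (total_mw, clock_mw).
    """
    clock_mw = sum(full_mw * activity
                   for full_mw, activity in clock_domains.values())
    return clock_mw + data_power_mw, clock_mw


# Hypothetical numbers for illustration only
domains = {"core": (100.0, 0.5), "bus": (40.0, 0.75)}
total_mw, clock_mw = average_dynamic_power(domains, data_power_mw=120.0)
print(f"Total {total_mw:.0f} mW, clock share {100 * clock_mw / total_mw:.0f}%")
```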





### **Popular Power Management Techniques**

Clock gating




# **Clock Power and Clock Gating**

- Clock power is significant
  - Frequently 40-50% of total active power
  - Clock and clock tree

### **Clock gating**

Explore Power Savings: >25% of clock power
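Combining the two figures on this slide gives a feel for the chip-level effect. The sketch below assumes the savings apply uniformly to clock power and that nothing else changes; the function name is hypothetical.

```python
def total_savings_fraction(clock_share: float, clock_savings: float) -> float:
    """Fraction of total active power saved when `clock_savings` of the
    clock power is removed and clock power is `clock_share` of the total."""
    return clock_share * clock_savings


for share in (0.40, 0.50):
    saved = total_savings_fraction(share, 0.25)
    print(f"Clock share {share:.0%}: 25% clock savings -> "
          f"{saved:.1%} of total active power")
```

With clock power at 40-50% of the total, a 25% clock power reduction corresponds to roughly 10-12.5% of total active power.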



# **Reporting Clock Domain Power**






### **Hierarchical Clock Gating**







### Multi-Vt

### The good

- Multi-Vt libraries can save approximately 5-15%
- Multi-Vt is not a challenge; easy to do

### The bad

- Cost of multi-Vt libraries
- Timing issues
- Signal integrity
- The tradeoffs are generally obvious...and tools exist...





# Voltage Islands

### The good

- Voltage islands can save approximately 10-50%

### The bad

- Cost of characterization of libraries
- Area penalties
  - Power grid
  - Level shifters
- Performance degradation
  - Timing
- Complexity
  - Timing, SI, voltage drop

### The ugly

- Design, verification, and implementation tools are not integrated or robust



# **Power Gating...Power vs Penalty**

### The good

- Power gating can reduce leakage by approximately 10-1000X

### The bad

- Increases complexity
  - Rush current problems; wake-up time, switch sizing and sequencing
  - Timing, SI, voltage drop
- Area penalties
  - Power grid
  - Level shifters, isolation cells
- Impacts performance and creates timing issues

### The ugly

- Again, design, verification, and implementation tools are not integrated or robust

"Leakage will become a major industry crisis, threatening the survival of CMOS itself" (ITRS 2005 Executive Summary)







### **Cradle Architecture**



- Capable of real-time encoding
  - 16 channels of MPEG4 SP@L3
  - 4 channels of MPEG4 ASP@L5
  - or 1 channel of H.264 Main Profile D1
- 55 Million Transistors (In-House RISC and DSP processor design), 180K Flops, 4 clock domains
- 24 processing cores
  - 16 DSPs and 8 General-Purpose Processors (GPPs)
- A smart I/O subsystem providing up to 144 fully programmable I/O pins
- A DDR SDRAM interface to support high data throughputs for high-definition video processing



# Cradle Architecture



- Low power consumption across the family
  - As low as 1.5 W

- Loosely coupled multiprocessor architecture enables more efficient system level performance:
  - Megapixel sensor interfaces, image enhancement, data encryption, video/audio Codecs, complex network stacks and system





### **Power DSE Results**



| Test | Description         | Total / Clock (mW) | IF (mW)        | MAC (mW)     | RF (mW)        | DF (mW)      |
|------|---------------------|--------------------|----------------|--------------|----------------|--------------|
| 1    | Directed test       | 174 / 61           | 35.09 (20.13%) | 27.5 (15.8%) | 20.92 (12.45%) | 8.26 (4.91%) |
| 2    | Directed test       | 173 / 61           | 34.42          | 27.3         | 20.68          | 8.61         |
| 3    | Directed test       | 169 / 61           | 34.65          | 23.1         | 20.88          | 8.38         |
| 4    | MAC usage           | 194 / 61           | 34.7           | 41.5         | 21.29          | 11.05        |
| 5    | MAC power down mode | 138 / 40           | 25.01          | 4.84         | 21.99          | 7.15         |
| 6    | Application         | 159 / 61           | 33.76          | 19.9         | 20.09          | 6.64         |





# **Power Estimation at RTL: DSP at Cradle**

### **PT Accuracy**



| Design     | PT Power | Actual Silicon Power | Difference |
|------------|----------|----------------------|------------|
| DSE block  | 160 mW*  | 150 mW*              | 8% less    |
| Full chip  | 5.3 W**  | 4.9 W**              | 8% less    |
| Clock tree | 2.17 W   | 1.95 W               | 10% less   |

\* Block power depends on test and activity. It ranges from 138mW to 194mW

\*\*Full Chip power is based on specific vectors and does NOT represent overall power in applications

Sequence Low Power Seminar




### Design for Power! Simple Power Saving Techniques



- Power debug
  - Apply power reduction schemes first to the sub-modules that consume the most power
  - Determine and eliminate "hot spots"

#### Clocks

The clock tree consumes 40-50% of total power, so a clock power reduction scheme is very important

#### Vectors

Develop accurate power vectors that exercise all possible nodes

#### Write RTL and optimize for power

- Shut off data switching when in idle state
- Provide chip enables and output enables effectively
- Use gated clocks to all data flops
- Use multiplexed flops to change data when enables are set
- Use a Gray code scheme for FIFO pointers and memory addresses
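The last item works because a Gray-coded counter changes exactly one bit per increment, so pointer and address buses toggle less than with plain binary counting. The sketch below uses the standard binary-reflected Gray code and compares bit toggles over one full wrap of a 4-bit pointer; the pointer width and helper names are assumptions for illustration.

```python
def binary_to_gray(n: int) -> int:
    """Standard binary-reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Inverse conversion: XOR together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bit_toggles(a: int, b: int) -> int:
    """Number of bit positions that differ between a and b."""
    return bin(a ^ b).count("1")


WIDTH = 4                      # hypothetical FIFO pointer width
DEPTH = 1 << WIDTH

binary_toggles = sum(bit_toggles(i, (i + 1) % DEPTH) for i in range(DEPTH))
gray_toggles = sum(
    bit_toggles(binary_to_gray(i), binary_to_gray((i + 1) % DEPTH))
    for i in range(DEPTH)
)
print(f"Binary pointer: {binary_toggles} bit toggles per full wrap")
print(f"Gray pointer:   {gray_toggles} bit toggles per full wrap")
```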

#### Memories

- Power-efficient memory selection is key
- Replace flop cluster with custom or standard compiled memories
- Try different memory configurations (1 bank, 2 banks, different aspect ratios, etc.)
- Shut off clocks to all memories when not in use: Memory Chip Enable
- Splitting memory can reduce power by as much as 30%
- Multi-Vt libraries reduce power and leakage





### **Power Regression Testing Methodology**



Tensilica: Configurable and extensible microprocessor cores for embedded SOC designs

You Can't Fix What You Can't Measure!






Design Goal



- Reduce power dissipation by 25% compared to the previous generation design
- Previous generation already optimized for low power operation
  - Must work on lowering power during early design phase
- Special challenges for configurable IP cores
  - Configurable cores have numerous combinations to test
  - Soft IP characterized for various fabrication processes
  - Requires database of area, timing, and power numbers
- Need a methodology for monitoring power dissipation on a regular basis, with meaningful feedback to designers
- RTL advantages over gate
  - Debug visibility
  - Performance and capacity for long simulations





### **Power Regressions**

- Measure effectiveness of clock gating
  - Xtensa processor employs global & functional clock gating

### Tune DSP extensions for low power

- Profile power dissipated executing common DSP kernels
- Tune assembly code and hardware implementation to meet aggressive power goals
- Guard against any undue increase in power
- Generate characterization data for the Xtensa "Processor Generator"
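The "guard against any undue increase in power" step lends itself to an automated check. The sketch below compares the latest power numbers per test against a stored baseline and flags increases beyond a tolerance. It is a generic illustration rather than Tensilica's flow; the kernel names, power values, tolerance, and function name are hypothetical.

```python
def check_power_regression(baseline_mw, current_mw, tolerance_pct=2.0):
    """Return a list of (test, percent_increase) for every test whose
    power rose by more than tolerance_pct relative to the baseline."""
    failures = []
    for test, base in baseline_mw.items():
        change_pct = 100.0 * (current_mw[test] - base) / base
        if change_pct > tolerance_pct:
            failures.append((test, change_pct))
    return failures


# Hypothetical weekly results in mW
baseline = {"fir_kernel": 50.0, "fft_kernel": 80.0, "idle_loop": 10.0}
current = {"fir_kernel": 49.0, "fft_kernel": 84.0, "idle_loop": 10.1}

failures = check_power_regression(baseline, current)
for test, pct in failures:
    print(f"POWER REGRESSION: {test} increased by {pct:.1f}%")
```

Run on every regression cycle, a check like this gives designers the regular feedback the design goal calls for.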





# **Weekly Power Regression Tracking**



Goal met: 25% power reduction in 15 weeks




# Verify and Refine Power at Gate Level

Eliminate power creep

- Enhance productivity and ease of debug
- Gate-level power analysis using RTL simulations
- Power vs time waveforms correlated to event waveforms








# **Emerging Low Power Standards**

- Design for power intent
  - Enabling innovation




# Best Practices Summary

- Design for Low Power Intent!
- Architectural exploration has great impact
- Verify power early and often
- Prioritize "power offenders" and take corrective action during the RTL design phase
- Choose worst case power vectors from each mode of operation
- Eliminate "wasted" power


- Run "power regressions" throughout RTL to tapeout
  - A single low power standard: to unify the methodology flow
- Control the true costs of power.....

