

# Challenges and Opportunities of ESL Design Automation

### Zhiru Zhang\*, Deming Chen+

### \*AutoESL Design Technologies, Inc. +ECE/University of Illinois, Urbana-Champaign





# Outline

- Introduction
- Opportunities and Challenges
- Modeling
- Synthesis and Optimization
  - Advanced Memory Synthesis
  - Effective Power Analysis and Optimization
  - Variation-Aware High-Level Synthesis
- Conclusions

## Introduction

- The rapid increase of design complexity urges the design community to raise the level of abstraction beyond RTL.
- Electronic system-level (ESL) design automation has been widely identified as the next productivity boost for the semiconductor industry.
- High-level synthesis (HLS) is a key cornerstone of ESL design automation.
- However, the transition to ESL design will not be as well accepted as the transition to RTL in the early 1990s unless
  - robust analysis and synthesis technologies can be built to produce high-quality architectures
  - highly optimized implementations can be automatically generated

# **Opportunities**

- ESL models and tools offer
  - early embedded software development
  - architecture modeling
  - design space exploration
  - rapid prototyping
- HLS fits in nicely for architecture exploration and rapid prototyping
  - early performance/area/power estimations & analyses
  - allows system architects to explore different architectures efficiently
  - automated flows to map to an FPGA-based system for system emulation, functional validation and real-time debugging

### **Challenges - Modeling**

- The most efficient virtual platform models may not be fully synthesizable
- How to maintain a single synthesizable model as the golden reference for both simulation and synthesis?



### Challenges - Analysis and Optimization (1)

- Efficient support of the memory hierarchy and memory optimization
  - limited memory ports often become the performance bottleneck
  - oversized memory blocks create wiring detours and routability problems
- Accurate high-level power and performance analysis
  - sophisticated activity propagation
  - clock tree with clock gating
  - multi-voltage islands, dynamic voltage frequency scaling, and power gating
  - low-level physical implementations
  - interconnect

### Challenges - Analysis and Optimization (2)

- Effective power and performance optimization
  - large design space
  - most of the problems are NP-hard
  - scheduling, binding and resource allocation are interdependent
  - parallelism extraction
  - quality convergence of layout-driven synthesis
- Process variation
  - variation modeling at high level
  - yield analysis and optimization

# **Challenges - Others**

- HLS for reliability
- HLS for thermal optimization
- ECO
- Verification
- IP integration

### Modeling – Dynamic Behavior and Standardization

- The synthesis tool shall continue to improve to handle a broader class of language constructs.
  - support dynamic behaviors in certain restricted forms.
  - extract the static binding and connectivity from the seemingly dynamic specifications.
  - extend and enhance the predominant static analysis methods.
- The design community and synthesis tool providers shall converge to a standard synthesizable subset.
  - On top of the standard, industry and academia shall collaborate to make available a set of reusable templates and libraries as references for efficient synthesis of common design patterns.
  - The reference templates and libraries should be relatively efficient in execution time and memory footprint.

### Modeling - Separation of Functionality and Constraints

- Synthesize hardware details from target-neutral source code
  - Golden functional spec for reuse
  - Technology/platform-dependent RTLs
  - Synthesis influenced by separated constraints & directives

Source code (What)

void DUT(int in[N], int out[N])
{ ... }

Constraint/directive (How)



set\_interface -type stream -port {in out}
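As a minimal sketch of the "what" side, the kernel below keeps the slide's `DUT` signature but fills in a hypothetical body (scaling by 2) and a hypothetical length `N = 8`; neither comes from the talk. The point is only that the source reads `in` and writes `out` once each, in index order, so the stream directive above could apply without changing the C code.

```c
#include <assert.h>

#define N 8  /* hypothetical length; the slide leaves N unspecified */

/* Hypothetical kernel body (assumption, not from the talk): out[i] = 2 * in[i].
 * Each element of in[] is read once and each element of out[] is written
 * once, both in increasing index order. Nothing here names a bus, FIFO,
 * or RAM; that choice is left to the separate constraint/directive. */
void DUT(int in[N], int out[N])
{
    assert(in != 0 && out != 0);  /* precondition check only */
    for (int i = 0; i < N; i++)
        out[i] = 2 * in[i];
}
```

Calling `DUT` on `{1, 2, ..., 8}` in a plain C testbench yields `{2, 4, ..., 16}`; the same testbench can serve as the functional reference whichever interface is selected.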



# **Advanced Memory Synthesis**

- On-chip memory partitioning for throughput optimization [Cong, et al., ICCAD'09]
- Support of efficient memory hierarchies including automatic caching and prefetching [Putnam, et al. ISCA'09]
- Communication overlapping with computation
- Efficient access to external memories shared by the host processor and accelerator

## A Case Study: Loop Pipelining

- Computation kernels are captured by perfect loop nests
- Loop pipelining allows a new iteration to begin processing before the previous iteration completes
  - Initiation interval (II): number of time steps before the next iteration begins processing
  - Performance limitation
    - Loop carried dependence
    - Resource constraints

for (i = 2; i < N; i++) sum += A[i] + A[i-1] + A[i-2];



Pipelining with II=1 is infeasible using a dual-port memory

Courtesy: [Cong, et al., ICCAD'09]
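The port bottleneck in the loop above can be checked with a little arithmetic. The sketch below assumes a simple cyclic partitioning (element `idx` goes to bank `idx % banks`) purely for illustration; it is not claimed to be the partitioning algorithm of the cited work, and all function names are hypothetical.

```c
#include <assert.h>

/* Lower bound on II from memory ports alone:
 * ceil(accesses per iteration / ports available per cycle). */
int min_ii_ports(int accesses_per_iter, int ports)
{
    assert(accesses_per_iter >= 0 && ports > 0);
    return (accesses_per_iter + ports - 1) / ports;
}

/* Assumed cyclic partitioning: element idx is stored in bank idx % banks. */
int bank_of(int idx, int banks)
{
    assert(idx >= 0 && banks > 0);
    return idx % banks;
}

/* Largest number of the three reads A[i], A[i-1], A[i-2]
 * that land in the same bank during iteration i (i >= 2). */
int max_reads_per_bank(int i, int banks)
{
    int worst = 0;
    for (int b = 0; b < banks; b++) {
        int hits = 0;
        for (int k = 0; k <= 2; k++)
            if (bank_of(i - k, banks) == b)
                hits++;
        if (hits > worst)
            worst = hits;
    }
    return worst;
}
```

With a single dual-port memory, `min_ii_ports(3, 2)` gives 2, so II=1 is out of reach. Under the assumed cyclic scheme, two banks already cap the per-bank load at two reads per iteration, which fits within two ports per bank.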

### **Motivation Example**



Scheduling can affect memory partitioning



Generates optimal memory partitioning solutions integrated with scheduling problem

Courtesy: [Cong, et al., ICCAD'09]

### Experimental Results (Throughput)

### Platform: Xilinx Virtex-4 FPGA

|            | Original<br>II | AMP<br>II | Original<br>Slices | AMP<br>Slices | COMP |
|------------|----------------|-----------|--------------------|---------------|------|
| fir        | 3              | 1         | 241                | 510           | 2.12 |
| idct       | 4              | 1         | 354                | 359           | 1.01 |
| litho      | 16             | 1         | 1220               | 2066          | 1.69 |
| matmul     | 4              | 1         | 211                | 406           | 1.92 |
| motionEst  | 5              | 1         | 832                | 961           | 1.16 |
| palindrome | 2              | 1         | 84                 | 65            | 0.77 |
| avg        |                | 5.67x     |                    |               | 1.45 |

#### Average 6x performance improvement with 45% area overhead

Courtesy: [Cong, et al., ICCAD'09]
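The summary figures can be reproduced from the table rows. The sketch below copies the six benchmark rows and averages the per-benchmark ratios; it assumes COMP is the slice ratio AMP/Original, which matches the listed values. Reading the II ratio as a throughput gain further assumes the clock period is unchanged.

```c
#include <assert.h>

#define NUM_BENCH 6

/* Rows copied from the table: fir, idct, litho, matmul, motionEst, palindrome. */
static const int orig_ii[NUM_BENCH]     = {3, 4, 16, 4, 5, 2};
static const int amp_ii[NUM_BENCH]      = {1, 1, 1, 1, 1, 1};
static const int orig_slices[NUM_BENCH] = {241, 354, 1220, 211, 832, 84};
static const int amp_slices[NUM_BENCH]  = {510, 359, 2066, 406, 961, 65};

/* Mean of per-benchmark II ratios (Original II / AMP II). */
double avg_ii_ratio(void)
{
    double sum = 0.0;
    for (int i = 0; i < NUM_BENCH; i++) {
        assert(amp_ii[i] > 0);
        sum += (double)orig_ii[i] / amp_ii[i];
    }
    return sum / NUM_BENCH;
}

/* Mean of per-benchmark slice ratios (AMP slices / Original slices). */
double avg_slice_ratio(void)
{
    double sum = 0.0;
    for (int i = 0; i < NUM_BENCH; i++) {
        assert(orig_slices[i] > 0);
        sum += (double)amp_slices[i] / orig_slices[i];
    }
    return sum / NUM_BENCH;
}
```

The two averages come out at about 5.67 and 1.45, in line with the "6x" and "45% area overhead" headline.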

### **Effective Power Analysis and Optimization**

- Three case studies
  - FPGA power estimation and optimization [Chen, et al., ASPDAC'07]
  - Scheduling with Soft Constraints [Cong, et al., ICCAD'09]
  - Variation-Aware, Layout Driven HLS for Performance Yield Optimization [Lucas, et al., ASPDAC'09]

### **Case 1: Area Characterization**

#### **FPGA** power estimation relies on area characterization

| Operation                            | Resource | Usage                                                               |
|--------------------------------------|----------|---------------------------------------------------------------------|
| Add/Subtract                         | LE       | $N$                                                                 |
| Bitwise and/or/xor                   | LE       | $N$                                                                 |
| Compare (=, >, ≥)                    | LE       | $\mathrm{round}(0.67N + 0.62)$                                      |
| Shift (with variable shift distance) | LE       | $\mathrm{round}(0.045N^2 + 3.76N - 8.22)$                           |
| Multiply                             | DSP9x9   | $N \le 18$: $\lceil N/9 \rceil$<br>$N \le 36$: $\lceil N/18 \rceil$ |
| Multiplexer                          | LE       | $N \cdot \mathrm{round}(0.67K)$                                     |

N and K represent the bitwidth and the number of input operands, respectively.

**Target Altera Stratix FPGAs in this work** 
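A minimal sketch of how such characterization formulas might be used inside an estimator, covering only the add/subtract, compare, and multiplexer rows. The function names are hypothetical, and rounding is done half-up on non-negative values, which is an assumption about what `round` means in the table.

```c
#include <assert.h>

/* Half-up rounding for non-negative values (assumed meaning of "round"). */
static int round_nonneg(double x)
{
    assert(x >= 0.0);
    return (int)(x + 0.5);
}

/* Logic elements for an n-bit adder/subtractor: N. */
int le_add_sub(int n)
{
    assert(n > 0);
    return n;
}

/* Logic elements for an n-bit comparator: round(0.67*N + 0.62). */
int le_compare(int n)
{
    assert(n > 0);
    return round_nonneg(0.67 * n + 0.62);
}

/* Logic elements for an n-bit, k-input multiplexer: N * round(0.67*K). */
int le_mux(int n, int k)
{
    assert(n > 0 && k > 0);
    return n * round_nonneg(0.67 * k);
}
```

For example, a 32-bit comparator evaluates to 22 LEs and a 32-bit 8-input multiplexer to 160 LEs under these formulas.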

## **Delay Characterization**

#### Delay characterization to study power/delay tradeoff

| Operation                            | Delay (ns)                                                               |
|--------------------------------------|--------------------------------------------------------------------------|
| Add/Subtract                         | $0.024N + 1.83$                                                          |
| Bitwise and/or/xor                   | $< 2$                                                                    |
| Compare (=, >, ≥)                    | $0.014N + 2.14$                                                          |
| Shift (with variable shift distance) | $4.3 \times 10^{-5} N^3 - 5 \times 10^{-3} N^2 + 0.24N + 0.93$           |
| Multiply                             | $N \le 9$: 3.05<br>$N \le 18$: 3.83<br>$N \le 36$: 7.69                  |
| Multiplexer (8-to-1)                 | $9.8 \times 10^{-5} N^3 - 7.4 \times 10^{-3} N^2 + 0.2N + 1.07$          |
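The linear and piecewise rows of the delay table translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and `max_chained_adds` ignores register setup and routing delay, a simplifying assumption that is not part of the characterization.

```c
#include <assert.h>

/* Delay in ns of an n-bit adder/subtractor: 0.024*N + 1.83. */
double delay_add_sub(int n)
{
    assert(n > 0);
    return 0.024 * n + 1.83;
}

/* Delay in ns of an n-bit comparator: 0.014*N + 2.14. */
double delay_compare(int n)
{
    assert(n > 0);
    return 0.014 * n + 2.14;
}

/* Delay in ns of an n-bit multiplier, piecewise by bitwidth (n <= 36). */
double delay_multiply(int n)
{
    assert(n > 0 && n <= 36);
    if (n <= 9)
        return 3.05;
    if (n <= 18)
        return 3.83;
    return 7.69;
}

/* How many n-bit adders fit back to back in one clock period,
 * ignoring register and routing overhead (simplifying assumption). */
int max_chained_adds(int n, double clock_ns)
{
    assert(clock_ns > 0.0);
    return (int)(clock_ns / delay_add_sub(n));
}
```

A 32-bit add is estimated at roughly 2.6 ns, so a 10 ns clock leaves room for three chained adds under these assumptions, while a 5 ns clock allows only one.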

# **Design Space Exploration**




### **Power and Performance Comparison**





- Schedule to maximize the gating/shutdown opportunities.
- Use constraints to enforce node orders?

### **Slack Optimization**



- Slack within a clock cycle is desirable.
  - Add a constraint to separate nodes when slack is too small?
    - What if latency constraint is very tight?




Our approach provides

- **33.9% power reduction** compared to baseline on average
- 17.1% power reduction compared to Chen's method on average


Results are close to those of the ILP method

### Case 3: Process Variation and Its Effect

- Process variation increases as device and interconnect feature sizes are scaled down
  - 30% performance variation and 5X leakage variation
- Traditional guard-banding uses pessimistic worst-case process corners
  - Inefficient as the variability increases with scaling





# **Timing Driven Floorplanner**

- Modified version of the simulated annealing based Parquet floorplanner
- A statistical timing analysis is performed after 5 SA moves
  - Minimize the sum of the mean and standard deviation
- Cost function:

$$Z \sim N(\mu_z, \sigma_z) = \max\bigl(reg_1(\mu_1, \sigma_1), reg_2(\mu_2, \sigma_2), \ldots, reg_n(\mu_n, \sigma_n)\bigr)$$

$$T_R = \frac{\mu_z + \sigma_z}{\mu_{best} + \sigma_{best}}$$

$$Cost = \alpha \cdot area + \beta \cdot T_R$$

- PCA based SSTA
  - Interconnects modeled as 2 pin nets with Manhattan distance length.
  - Unit correlation model
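The cost function for the floorplanner can be evaluated as in the sketch below. It assumes the statistical maximum $Z$ has already been computed by the SSTA step, so only its mean and standard deviation are passed in. The function names are hypothetical, and the talk does not state the weights $\alpha$, $\beta$ or how the area term is normalized.

```c
#include <assert.h>

/* T_R = (mu_z + sigma_z) / (mu_best + sigma_best):
 * timing of the current floorplan relative to the best seen so far. */
double timing_ratio(double mu_z, double sigma_z,
                    double mu_best, double sigma_best)
{
    assert(mu_best + sigma_best > 0.0);
    return (mu_z + sigma_z) / (mu_best + sigma_best);
}

/* Cost = alpha * area + beta * T_R.
 * alpha, beta and the area scale are caller-supplied assumptions. */
double floorplan_cost(double alpha, double area, double beta, double t_r)
{
    return alpha * area + beta * t_r;
}
```

For instance, a candidate with $\mu_z = 6.0$, $\sigma_z = 0.3$ against a best of $\mu_{best} = 5.0$, $\sigma_{best} = 0.25$ gives $T_R = 1.2$, so the move is penalized in proportion to $\beta$.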

## **Unit Correlation Model**

- Correlation is based on the distance between the unit centerpoints
- Matches high level unit characterization
- Correlation matrix used in PCA SSTA with  $\sigma_{\text{inter}}$



**One benchmark -** *chem* 



- Improvement of FastYield comes from two factors:
  - the mean of the pdf has been shifted to a lower clock value.
  - the variance has been reduced.
- A significant PY jump for a relatively minor change in the mean clock period
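Why a small shift in the mean can move performance yield (PY) so much follows from the shape of the Gaussian. As a worked illustration, assume the critical delay is Gaussian, $Z \sim N(\mu_z, \sigma_z)$, and take PY at a target clock period $T$ to be the probability that the design meets it:

$$PY(T) = P(Z \le T) = \Phi\!\left(\frac{T - \mu_z}{\sigma_z}\right)$$

where $\Phi$ is the standard normal CDF. With hypothetical numbers $T = 6.0$ ns and $\sigma_z = 0.1$ ns (not taken from the benchmark data):

$$\mu_z = 6.1:\ \Phi(-1) \approx 15.9\% \qquad \mu_z = 6.0:\ \Phi(0) = 50\% \qquad \mu_z = 5.9:\ \Phi(1) \approx 84.1\%$$

Each step changes the mean by under 2%, yet PY moves by more than 30 percentage points, because $T$ sits on the steep part of the CDF. Reducing $\sigma_z$ steepens the curve further.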

## **FastYield Results**

| Benchmark | BindBWM 85% yield clk (ns) | BindBWM PY at FY Rebind 85% clk (%) | FY Initial 85% yield clk (ns) | FY Initial PY at FY Rebind 85% clk (%) | FY Rebind 85% yield clk (ns) | Total FY run time (min) | FY Rebind clk reduction over BindBWM (%) | FY Rebind 85% PY gain over BindBWM (%) | FY Rebind clk reduction over FY Initial (%) | FY Rebind 85% PY gain over FY Initial (%) |
|-----------|----------------------------|-------------------------------------|-------------------------------|----------------------------------------|------------------------------|-------------------------|------------------------------------------|----------------------------------------|---------------------------------------------|-------------------------------------------|
| chem          | 6.9                         | 12.5                                 | 6.1                         | 67.7                                 | 6.0                         | 75                               | 14.17                                                   | 72.5                                               | 2.35                                                       | 17.3                                                  |
| dir           | 5.8                         | 1.5                                  | 4.9                         | 70.9                                 | 4.8                         | 43                               | 16.71                                                   | 83.5                                               | 1.76                                                       | 14.1                                                  |
| honda         | 5.7                         | 8.1                                  | 4.9                         | 82.6                                 | 4.9                         | 28                               | 14.39                                                   | 76.9                                               | 0.32                                                       | 2.4                                                   |
| mcm           | 4.9                         | 11.4                                 | 4.3                         | 78.0                                 | 4.2                         | 40                               | 14.57                                                   | 73.6                                               | 3.34                                                       | 7.0                                                   |
| pr            | 5.2                         | 0.1                                  | 4.5                         | 70.1                                 | 4.3                         | 24                               | 16.47                                                   | 84.9                                               | 3.04                                                       | 14.9                                                  |
| steam         | 6.2                         | 7.6                                  | 5.5                         | 76.3                                 | 5.5                         | 64                               | 11.88                                                   | 77.4                                               | 1.14                                                       | 8.7                                                   |
| wang          | 5.3                         | 1.6                                  | 4.7                         | 80.8                                 | 4.6                         | 16                               | 13.29                                                   | 83.4                                               | 0.95                                                       | 4.2                                                   |
| Avg.          |                             |                                      |                             |                                      |                             |                                  | 14.50                                                   | 78.9                                               | 1.84                                                       | 9.8                                                   |

## Conclusions

- This paper identified a set of critical needs and key challenges in ESL design automation with special focus on HLS
  - software-centric ESL modeling
  - optimizations of memory hierarchy and access
  - power and performance analysis and optimization
  - process variation-aware HLS
- These needs and challenges have created many new and important research directions as well as business opportunities in the EDA community

## Acknowledgement

- Students at UIUC and UCLA
- Researchers at AutoESL

- Various funding agencies
  - NSF, SRC, GSRC, Altera, Intel, Magma, Xilinx

