# **Myths of Multicore**

Ian Rickards, CPU Product Manager, ARM Inc, San Diego

EDP Seminar Monterrey April 16th 2008

THE ARCHITECTURE FOR THE DIGITAL WORLD®

# Myth #1: Multicores are Tomorrow's Technology





## 2<sup>nd</sup> Generation: Cortex-A9 MPCore



# Myth #2: Aren't two cores twice the size?


# Multicore for Implementation Efficiencies



### Increasing MHz increases power and area exponentially

- Synthesis area increases significantly for the last few MHz
- High MHz requires high-speed libraries/process, which draw more dynamic/leakage power
- Higher voltages for higher MHz
- Complicates SoC design and extends time to market


## **Multicore: Physical Implementation**



Multiprocessing and library choice together provide huge implementation flexibility

|                     | Area opt | Speed opt |
|---------------------|----------|-----------|
| CPUs                | 1        | 1         |
| Std cells           | Metro    | Adv-HS    |
| Freq/MHz            | 320      | 620       |
| Area with cache/mm2 | 1.46     | 2.54      |

| ARM11 MPCore |       |        |       |                      |
|--------------|-------|--------|-------|----------------------|
| CPU #        | 1     | 1      | 2     | 2                    |
| Priority     | Power | Speed  | Power | Speed                |
| Library      | Metro | Adv-HS | Metro | Adv-HS               |
| DMIPS        | 380   | 760    | 760   | 1520                 |
| MHz          | 304   | 608    | 304   | 608                  |
| mW/MHz       | 0.23  | 0.32   | 0.46  | 0.64                 |
| Static mW    | 17    | 27     | 34    | 55                   |
| Total mW     | 87    | 222    | 174   | 444                  |

Leakage at 85°
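The totals in the ARM11 MPCore table appear to follow a simple model: dynamic power (MHz × mW/MHz) plus static power. The sketch below checks that assumption against the four columns and compares the two configurations that deliver 760 DMIPS. The formula and the helper name `total_mw` are assumptions inferred from the table, not something the slide states.

```python
# Columns of the ARM11 MPCore table:
# (CPUs, library, DMIPS, MHz, mW/MHz, static mW, total mW as printed)
rows = [
    (1, "Metro", 380, 304, 0.23, 17, 87),
    (1, "Adv-HS", 760, 608, 0.32, 27, 222),
    (2, "Metro", 760, 304, 0.46, 34, 174),
    (2, "Adv-HS", 1520, 608, 0.64, 55, 444),
]


def total_mw(mhz, mw_per_mhz, static_mw):
    """Assumed model: dynamic power (MHz x mW/MHz) plus static power."""
    return mhz * mw_per_mhz + static_mw


# Recompute the 'Total mW' row from the other rows.
computed = [round(total_mw(mhz, dyn, static)) for _, _, _, mhz, dyn, static, _ in rows]
printed = [row[6] for row in rows]

# Two ways to reach 760 DMIPS: one fast core, or two slower cores.
one_fast_core = rows[1][6]   # 1 CPU, Adv-HS, 608 MHz
two_slow_cores = rows[2][6]  # 2 CPUs, Metro, 304 MHz
saving = 1 - two_slow_cores / one_fast_core

print(computed, printed)
print(f"Dual-core saving at equal DMIPS: {saving:.0%}")
```

Under this model the recomputed totals match the printed row, and the dual-core Metro configuration reaches the same 760 DMIPS as the single fast core for roughly a fifth less power.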


# Myth #3: Multiple cores use more power

### **Controlling Power Consumption**

[Screenshots: FFplay video playback alongside an ARM Monitor CPU Info window, running Linux 2.6.7-arm1-smp]

Single CPU

- For a given workload requirement
- Unused processors are 'turned off'
- Single CPU @ 260MHz, testchip consuming ~160mW

Dual CPU (same MHz, same voltage)

- Same workload level leaves headroom on CPUs
- Alternatively, up to 50% energy saving potential with voltage and frequency scaling

Multiprocessing offers more performance at lower MHz
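To illustrate where a saving of around 50% could come from, here is a minimal sketch using the common CMOS approximation that dynamic power scales with C × V² × f. The 520 MHz single-core baseline and the 0.7× voltage at half frequency are illustrative assumptions, not figures from the testchip, and the model ignores leakage and assumes the workload splits evenly across both cores.

```python
def dynamic_power(c_eff, voltage, freq_mhz):
    """Dynamic power approximation P = C * V^2 * f, in arbitrary units."""
    return c_eff * voltage ** 2 * freq_mhz


# Hypothetical baseline: one core at full frequency and nominal voltage.
single = dynamic_power(c_eff=1.0, voltage=1.0, freq_mhz=520)

# Two cores at half the frequency, with voltage assumed to scale to 0.7x.
dual = 2 * dynamic_power(c_eff=1.0, voltage=0.7, freq_mhz=260)

saving = 1 - dual / single
print(f"Dual-core dynamic power saving: {saving:.0%}")
```

With these assumed numbers the two slower cores deliver the same aggregate MHz for about half the dynamic power, because the voltage term is squared.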


## **Performance and Power Scalability**





# Myth #4: Multicores have More Overhead


### **MPCore: Reduction in bus contention**

Traditional bus-based coherence scheme

[Figure: CPU-0 and CPU-n connected through the SoC interconnect, annotated with steps (1) to (3)]

- (1) Primary CPU snoops into other CPUs
- (2) Existence of data causes write-back
- (3) Primary CPU may retrieve data by 'watching' the write-back, or from main memory

Sharing of data causes increased loading on system bus due to 'unnecessary' cache write-backs



bandwidth within processor core to shield system bus from any additional loads due to sharing of data


### **Enhanced Accelerator SoC Integration**

### ARM MPCore: Accelerator Coherence Port (ACP)

- Sharing benefits of the ARM MPCore optimized coherency design
- Accelerators gain access to CPU cache hierarchy, increasing system performance and reducing overall power
- Uses AMBA<sup>®</sup> 3 AXI<sup>™</sup> technology for compatibility with standard un-cached peripherals and accelerators



## **ACP - Access to Shared Caches**

- Example: CRC engine for TCP packet forwarding on a 64-byte packet
  - Using typical system latency, ignoring common processing overhead
  - Assumed writes are fully buffered

| Algorithm stage                                 | Traditional shared memory with mailbox communication (approx. cycles) | ACP-attached accelerator with synchronous event (approx. cycles) |
|-------------------------------------------------|-----------------------------------------------------------------------|------------------------------------------------------------------|
| Packet received and processed by CPU            | 0                                                                     | 0                                                                |
| Flush cache to make data visible to accelerator | 20                                                                    | 0                                                                |
| Accelerator notified of data availability       | > 4 [write to mailbox GPIO]                                           | 1 [Send Event]                                                   |
| Accelerator reads data                          | 120 [read data from off-chip]                                         | 10 [read from L1/L2]                                             |
| Accelerator writes data (assuming buffered)     | 8                                                                     | 8                                                                |
| Processor reads data                            | 120                                                                   | 12 [from L2]                                                     |
| Total latency overhead                          | ~272 cycles                                                           | ~31 cycles                                                       |

The ACP solution suits cycle-offload accelerators that execute in hundreds of cycles on cache-resident workloads, for example the low-latency processing required by audio echo cancellation.
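The totals in the latency table can be reproduced by summing the per-stage cycle counts. The short sketch below does that, taking the "> 4" mailbox entry at its lower bound of 4 (an assumption, since the slide gives only a bound).

```python
# (stage, traditional cycles, ACP cycles), copied from the table.
# The "> 4" mailbox write is taken as 4, its lower bound.
stages = [
    ("Packet received and processed by CPU", 0, 0),
    ("Flush cache to make data visible to accelerator", 20, 0),
    ("Accelerator notified of data availability", 4, 1),
    ("Accelerator reads data", 120, 10),
    ("Accelerator writes data (assuming buffered)", 8, 8),
    ("Processor reads data", 120, 12),
]

traditional_total = sum(trad for _, trad, _ in stages)
acp_total = sum(acp for _, _, acp in stages)
ratio = traditional_total / acp_total

print(traditional_total, acp_total, f"{ratio:.1f}x")
```

Most of the difference comes from the two 120-cycle off-chip reads, which the ACP path replaces with L1/L2 accesses.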




# Myth #5: Multicores are More Work

(for the software engineers)


## **ARM Cortex Family of Processors**

Bringing the benefits of architectural innovation across the spectrum

- ARM Cortex-A Series:
  - Applications processors for complex OS and user applications
- ARM Cortex-R Series:
  - Embedded processors for real-time systems
- ARM Cortex-M Series:
  - Deeply embedded processors optimized for microcontroller and low-power applications




## **Cortex: ARMv7 Architecture (A profile)**

### Thumb-2: Power Efficient Integer Execution

- 30% smaller when starting from ARM code
- 30% faster when starting from Thumb code

### TrustZone: Trusted Secure Environment

- Device integrity, Digital Rights Management, Electronic payment, etc
- Wide industry support

### Jazelle-RCT: Run Time Compilation Target

- Efficient target for Java, Microsoft .NET MSIL, Perl, Python etc
- Optional DBX Java byte code acceleration
- Early Adopters include Sun Microsystems, Aplix and Esmertec

### NEON: Multimedia and Signal Processing Architecture

- Significant performance uplift from ARMv6 SIMD
- Supports both Integer and Floating Point SIMD
- Accelerated software development with OpenMAX, compiler, ...


### What is a 'thread'?

- The term "thread" applies to both MP and MT systems and means 'thread of execution'
- Key to parallelism is finding independent operations
  - SMP OS will naturally run separate processes on different cores
  - SMP OS processes and device drivers run on any core
  - Application can be "threaded" if required



Both examples above make full use of a dual-core system



### **Threading 1: Task decomposition**

- Application is a pipeline, with each stage in a separate thread
- E.g. video recompress for PMP, or laser printer



Shared data and semaphores used to pass data around
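A minimal sketch of task decomposition follows: each pipeline stage runs in its own thread and hands items to the next stage. It uses Python's `queue.Queue` as a stand-in for the shared buffers and semaphores mentioned above, and the stage functions are hypothetical placeholders for steps such as decode, scale, and encode. In standard CPython builds the global interpreter lock limits parallel speedup for CPU-bound pure-Python stages, so this shows the structure rather than the performance.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream; real items must not be None


def stage(func, inbox, outbox):
    """Run one pipeline stage: apply func to each item until the sentinel arrives."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            break
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Push items through funcs, one thread per stage, preserving order."""
    # Bounded queues between stages apply back-pressure; the final queue is
    # unbounded so the feeding loop below can never deadlock.
    queues = [queue.Queue(maxsize=4) for _ in funcs] + [queue.Queue()]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for thread in threads:
        thread.start()

    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)

    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results


# Hypothetical three-stage pipeline standing in for decode -> scale -> encode.
frames = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2, str])
print(frames)
```

Because each stage is a single thread reading from a FIFO queue, items leave the pipeline in the order they entered it.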


## **Threading 2: Data decomposition**

- Subdividing a data processing operation into several threads executing in parallel on smaller chunks of data
  - Block by block: 1 block per thread
  - Line by line: 1 line per thread
  - Section by section: 1/4 of picture per thread
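The line-by-line and section-by-section splits above can be sketched with a thread pool. The image is modelled as a list of pixel rows, and `invert_line` is a hypothetical per-line operation chosen only for illustration. As with any CPython threading example, the global interpreter lock in standard builds limits speedup for pure-Python work, so treat this as a sketch of the decomposition, not a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def invert_line(line):
    """Hypothetical per-line operation: invert 8-bit pixel values."""
    return [255 - pixel for pixel in line]


def process_by_line(image, workers=4):
    """Line by line: every row is an independent work item."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(invert_line, image))


def process_by_section(image, parts=4):
    """Section by section: split the picture into up to `parts` bands of rows."""
    size = max(1, -(-len(image) // parts))  # ceiling division, at least 1 row
    sections = [image[i:i + size] for i in range(0, len(image), size)]

    def work(section):
        return [invert_line(line) for line in section]

    with ThreadPoolExecutor(max_workers=parts) as pool:
        # pool.map returns results in submission order, so rows stay in place.
        return [line for done in pool.map(work, sections) for line in done]


image = [[0, 255], [10, 20], [30, 40], [50, 60], [70, 80]]
by_line = process_by_line(image)
by_section = process_by_section(image)
print(by_line == by_section)
```

Coarser sections mean fewer hand-offs between threads, while finer lines or blocks balance the load more evenly when rows differ in cost.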





### **ARM MPCore technology dispels the Myths**

- #1: Multicore is here now for embedded
- #2: Two cores are not necessarily twice as big as one
- #3: Multiple cores can use less power
- #4: Multicores can have low overheads
- #5: Multicore software is not difficult


