



"Threads are dead! Long live threads!"

Many Core, Massively Threaded Processing:

A Revolution in High Performance Computing

### **EDA Requirements Exceed Moore's Law**






## **EDA: Poor Scaling with CPU Cores**



Figure: Synopsys HSPICE scaling, relative performance versus number of cores: 1.4x for 2 cores, 2.1x for 4 cores.

# EDA applications are not scaling with multi-core CPUs

Memory bandwidth limited on CPUs

Takes hours to days to work on large designs
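To put the HSPICE figures in perspective, the short sketch below computes the parallel efficiency they imply. It also inverts Amdahl's law to estimate an equivalent serial fraction. Amdahl's law is an assumption brought in for illustration; the slide itself attributes the poor scaling to memory bandwidth, which behaves like a serial bottleneck in this simple model. The function names are hypothetical.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


def amdahl_serial_fraction(speedup, cores):
    """Serial fraction s implied by Amdahl's law.

    Amdahl's law: speedup = 1 / (s + (1 - s) / cores).
    Solving for s gives s = (cores / speedup - 1) / (cores - 1).
    """
    return (cores / speedup - 1) / (cores - 1)


# Speedups quoted on the slide for HSPICE (cores -> speedup).
measured = {2: 1.4, 4: 2.1}

for cores, speedup in measured.items():
    eff = parallel_efficiency(speedup, cores)
    s = amdahl_serial_fraction(speedup, cores)
    print(f"{cores} cores: efficiency {eff:.3f}, implied serial fraction {s:.3f}")
```

Under this model, roughly a third or more of the run time behaves as if it were serial, so adding CPU cores gives quickly diminishing returns.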

### What's a Programmer to Do?





## The "New" Reality



Computers no longer get faster, just wider

You must rethink your algorithms to be parallel!

Power sets bounds on what's possible

Performance = parallelism

Efficiency = locality

#### **Moore's Law**



#### Moore, Electronics 38(8) April 19, 1965



- In 1965 Gordon Moore predicted the *number of transistors* on an integrated circuit would double every year.
  - Later revised to a doubling every two years (often quoted as 18 months)
- Also predicted L<sup>3</sup> energy scaling
- No prediction of processor performance
- Advances in architecture turn device scaling into *performance* scaling
- Applications turn performance into value
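The difference between the doubling periods is easy to underestimate. As a small worked example, the sketch below compares transistor-count growth over one decade for a one-year and an 18-month doubling period. The ten-year horizon and the function name are illustrative assumptions, not figures from the slide.

```python
def transistor_growth(years, doubling_period_years):
    """Growth factor after `years` if the count doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


# One decade of scaling under each doubling period.
yearly = transistor_growth(10, 1.0)          # doubling every year
eighteen_month = transistor_growth(10, 1.5)  # doubling every 18 months

print(f"Doubling every year:      {yearly:.0f}x in 10 years")
print(f"Doubling every 18 months: {eighteen_month:.1f}x in 10 years")
```

Either rate describes device counts only. As the bullets note, turning that growth into performance is the job of the architecture.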

#### Discontinuity 1: The End of ILP Scaling





Dally et al. The Last Classical Computer, ISAT Study, 2001

## **Single-Thread Processor Performance**





Source: Hennessy & Patterson, CAAQA, 4th Edition



### Discontinuity 2: The End of L<sup>3</sup> Power Scaling



#### Gordon Moore, ISSCC 2003



## Performance = Parallelism, Efficiency = Locality



## An optimist's view – finding opportunity in adversity



Dally et al. The Last Classical Computer, ISAT Study, 2001


#### **Stream Processing: Efficient Throughput Computing**

- Technology, applications, and discontinuities show
  - Performance = parallelism
  - Efficiency = locality
  - Latency-optimized processors won't improve much anymore
- Stream processing
  - Many ALUs exploit parallelism
  - Rich, exposed storage hierarchy enables locality
  - Simple control and synchronization reduces overhead
- Stream programming: explicit data movement, bulk operations
  - Exposes parallelism (bulk operations) and locality
  - Enables strategic optimization
  - Predictability enables static optimization
  - **Result: performance and efficiency** 
    - TFLOPs on a chip
    - 20-30x efficiency of conventional processors.
    - Performance portability
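The stream programming style described above can be sketched in a few lines. This is a minimal, sequential illustration in plain Python, not a real stream language or GPU API: `stream_map`, `scale_and_shift`, and the tile size are all hypothetical names and values. It shows the two ideas from the bullets: data is moved explicitly between a large "global" array and a small "local" buffer, and the computation is expressed as one bulk operation over independent elements.

```python
def stream_map(kernel, global_in, tile_size):
    """Apply `kernel` to every element of `global_in`, one tile at a time."""
    global_out = [None] * len(global_in)
    for start in range(0, len(global_in), tile_size):
        # Explicit movement: stage one tile in a small local buffer
        # (standing in for on-chip storage near the ALUs).
        local = global_in[start:start + tile_size]

        # Bulk operation: the same kernel runs on every element, and the
        # elements are independent, so many ALUs could work in parallel.
        local = [kernel(x) for x in local]

        # Explicit movement: write the finished tile back to global storage.
        global_out[start:start + len(local)] = local
    return global_out


def scale_and_shift(x, a=2.0, b=1.0):
    """Example kernel: a * x + b on a single element."""
    return a * x + b


result = stream_map(scale_and_shift, list(range(10)), tile_size=4)
print(result)
```

Because the movement and the bulk operation are both visible in the program, a compiler or runtime can schedule them statically, which is the predictability the bullets refer to.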







## **Generic Stream Processing Architecture**





**Global Memory** 

- Many processors each supporting many hardware threads
- On-chip memory near processors (cache, RAM, or both)
- Shared global memory space (external DRAM)
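One common way to program such a machine is to launch a grid of thread blocks, where each thread derives a unique global index from its block and thread numbers. The sketch below emulates that indexing sequentially in Python. The `launch` helper and the kernel signature are hypothetical and only mirror the index arithmetic; on real hardware the blocks would be distributed across processors and the threads would run concurrently.

```python
def launch(kernel, num_blocks, threads_per_block, *args):
    """Sequentially emulate a grid of blocks, each with a fixed number of threads."""
    for block in range(num_blocks):              # one block per processor, conceptually
        for thread in range(threads_per_block):  # hardware threads within a processor
            kernel(block, thread, threads_per_block, *args)


def scale_kernel(block, thread, threads_per_block, data, factor):
    """Each thread handles exactly one element of the array in global memory."""
    i = block * threads_per_block + thread  # unique global index for this thread
    if i < len(data):                       # guard: the grid may overshoot the array
        data[i] *= factor


data = list(range(10))
# 3 blocks of 4 threads cover 12 indices, enough for 10 elements.
launch(scale_kernel, 3, 4, data, 3)
print(data)
```

The bounds check matters because the number of launched threads is rounded up to a whole number of blocks.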

© NVIDIA Corporation 2009

### **Modern GPUs: Unified Design**





Vertex shaders, pixel shaders, etc. become *threads* running different programs on a flexible core

#### **Modern GPU Architecture**





Figure: block diagram of processors with L1 caches, fixed-function acceleration, a communication fabric, and memory.

#### NVIDIA Tesla 10-Series GPU

Massively parallel, many core architecture

240 Processor Cores

1 teraflops: 1,000 times the performance of a Cray X-MP

**IEEE Compliant Double Precision Floating Point** 

### Why is this different from a CPU?



- Different goals produce different designs
  - GPU assumes the workload is highly parallel
  - CPU must be good at everything, parallel or not
- CPU: minimize latency experienced by 1 thread
  - lots of big on-chip caches
  - extremely sophisticated control
- GPU: maximize throughput of all threads
  - lots of big ALUs
  - multithreading can hide latency, so skip the big caches
  - simpler control, cost amortized over ALUs via SIMD
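The claim that multithreading can hide latency follows from simple arithmetic. The sketch below uses an idealized model in which each thread alternates a fixed number of compute cycles with a fixed memory wait, and the ALU switches to another ready thread at no cost. The model, the function names, and the cycle counts (400 cycles of memory latency, 20 cycles of compute) are illustrative assumptions, not measurements of any particular GPU.

```python
import math


def alu_utilization(num_threads, memory_latency_cycles, compute_cycles):
    """Fraction of cycles the ALU is busy under the idealized model.

    Each thread computes for `compute_cycles`, then waits
    `memory_latency_cycles` for memory before it can compute again.
    """
    busy = num_threads * compute_cycles
    period = compute_cycles + memory_latency_cycles
    return min(1.0, busy / period)


def threads_to_hide_latency(memory_latency_cycles, compute_cycles):
    """Smallest thread count that keeps the ALU busy every cycle."""
    return math.ceil((compute_cycles + memory_latency_cycles) / compute_cycles)


latency, compute = 400, 20  # hypothetical cycle counts
needed = threads_to_hide_latency(latency, compute)

print(f"1 thread:   utilization {alu_utilization(1, latency, compute):.3f}")
print(f"{needed} threads: utilization {alu_utilization(needed, latency, compute):.3f}")
```

With enough resident threads the memory wait overlaps entirely with other threads' compute, which is why the design can spend its transistors on ALUs rather than on large caches.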


## **Heterogeneous Computing**





#### **Computing with CPU + GPU**

## Not 2x or 3x: Speedups are 20x to 150x





#### **HPC: Accelerating Time to Insight**





## FFT Performance: CPU vs GPU (8-Series)





Source for Intel data: http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm

EDP 2009

## **BLAS: CPU vs GPU (10-series)**





#### **Results: Sparse Matrix-Vector Multiplication (SpMV) on CUDA**





CPU results from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al., Supercomputing 2007
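For readers unfamiliar with the operation being benchmarked, here is a minimal sketch of sparse matrix-vector multiplication. It assumes the compressed sparse row (CSR) format, which is one common storage scheme; the slide does not say which formats were measured. The code is a sequential Python reference, and the function and variable names are hypothetical.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[r] .. row_ptr[r + 1] delimits the nonzeros of row r;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    # Rows are independent, so a GPU could assign one thread to each row.
    for row in range(num_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)
```

Each nonzero is read once and used for a single multiply-add, so the operation is dominated by memory traffic rather than arithmetic.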


#### **Different Programming Styles**



#### Two programming models available today for CUDA

- High level programming model
  - Native support for C, C++, Fortran, Java, Python, Perl, OpenCL, DirectX Compute
- Device level (low level) programming model
  - Use CUDA Driver API directly

#### OpenCL Programming model (to be available soon)

- A new compute API for parallel programming of heterogeneous systems
- Allows developers to harness the compute power of BOTH the GPU and the CPU
- A multi-vendor standards effort managed through the Khronos Group

#### DirectX Compute

- New GPU computing model introduced by Microsoft
- Integrated with Direct3D under Windows 7
- Enables more general constructs, more general data structures and more general algorithms





#### A Revolutionary Parallel Computing Architecture for NVIDIA GPUs

## Supports standard languages and APIs

- C
- C++
- Fortran
- OpenCL
- DX Compute

## Supported by standard operating systems

- Windows
- Mac OS
- Linux

| Application       |     |         |        |            |
|-------------------|-----|---------|--------|------------|
| C                 | C++ | Fortran | OpenCL | DX Compute |
| CUDA Architecture |     |         |        |            |



## **NVIDIA: Leadership in GPU computing**



#### 200+ Apps on CUDA Zone



#### 30+ CUDA GPU clusters



#### 110M+ CUDA-enabled GPUs, 60,000+ active developers



#### 150K CUDA compiler downloads



#### 115+ Universities Teaching CUDA, 900+ research papers

Duke, Erlangen, ETH Zurich, Georgia Tech, Grove City College, Harvard, **IISc Bangalore**, **IIIT Hyderabad**, Illinois, INRIA, Iowa, ITESM, Johns Hopkins, Kent State, Kyoto, Lund, Maryland, McGill, MIT, North Carolina

Northeastern, **Oregon State**, Pennsylvania, Polimi, Purdue, Santa Clara, Stanford, Stuttgart, SUNY, Tokyo, **TU-Vienna**, USC, Utah, Virginia, Washington, Waterloo, Western Australia, Williams College, Wisconsin, Yonsei

## 5000+ Customers / ISVs



| Life Sciences &<br>Medical Equipment | | Productivity<br>/ Misc | Oil and<br>Gas | EDA | Finance | CAE /<br>Mathematical | Communication |
|---|---|---|---|---|---|---|---|
| Max Planck<br>FDA<br>Robarts Research<br>Medtronic<br>AGC<br>Evolved Machines<br>Smith-Waterman<br>DNA sequencing<br>AutoDock<br>NAMD/VMD<br>Folding@Home<br>Howard Hughes<br>Medical<br>CRIBI Genomics | GE Healthcare<br>Siemens<br>Techniscan<br>Boston Scientific<br>Eli Lilly<br>Silicon Informatics<br>Stockholm<br>Research<br>Harvard<br>Delaware<br>Pittsburgh<br>ETH Zurich<br>Institute Atomic<br>Physics | CEA<br>NCSA<br>WRF Weather<br>Modeling<br>OptiTex<br>Tech-X<br>Elemental<br>Technologies<br>Dimensional<br>Imaging<br>Manifold<br>Digisens<br>General Mills<br>Rapidmind<br>Rhythm & Hues<br>xNormal<br>Elcomsoft<br>LINZIK | Hess<br>TOTAL<br>CGG/Veritas<br>Chevron<br>Headwave<br>Acceleware<br>Seismic City<br>P-Wave<br>Seismic<br>Imaging<br>Mercury<br>Computer<br>ffA | Synopsys<br>Nascentric<br>Gauda<br>CST<br>Agilent | Symcor<br>Level 3<br>SciComp<br>Hanweck<br>Quant<br>Catalyst<br>RogueWave<br>BNP Paribas | AccelerEyes<br>MathWorks<br>Wolfram<br>National<br>Instruments<br>Ansys<br>Access Analytics<br>Tech-X<br>RIKEN<br>SOFA<br>Renault<br>Boeing | Nokia<br>RIM<br>Philips<br>Samsung<br>LG<br>Sony<br>Ericsson<br>NTT DoCoMo<br>Mitsubishi<br>Hitachi<br>Radio<br>Research<br>Laboratory<br>US Air Force |

## **Current Status at EDA Companies**



Numerous examples of shipping products based on CUDA / GPU computing

All major EDA companies are exploring CUDA for accelerating tools

Several EDA startups forming around CUDA / GPU computing

More tool announcements coming soon

## **The Opportunity**



- We are at a historic inflection point
- Heterogeneous computing
  - Works
  - Saves money and energy
  - Is available broadly
  - Will be pervasive
  - Solves a real problem
- For EDA, it represents a truly disruptive technology

#### Customers with **\$\$** want it!!





## **More Information**



#### Tesla main page

- http://www.nvidia.com/tesla
- Product Information
- Industry Solutions

#### CUDA Zone

- http://www.nvidia.com/cuda
- Applications, Papers, Videos

#### Hear from Developers

http://www.youtube.com/nvidiatesla