

MOFFETT AI

# Challenges and Opportunities to Accelerate ML Inference with Sparsity

Dr. Zhibin Xiao, Co-founder and Chief Architect, Moffett AI

EDPS 2023, Oct 5<sup>th</sup>, 2023



### Outline

- Introduction to ML Inference
- Sparsity in ML Inference
- Hardware Software Co-design for Sparsity
- Case Studies: Sparsity Support in CPU, GPU and AI Chips
- Summary



## **A Brief History of AI Models**





## **Characteristics of Vision and Language Models**



#### **Vision Models**

- + Small models (millions of parameters)
- + Large Input Size (4k/8k images)
- + Throughput-sensitive within latency constraint
- + From convolution to transformer
- + Non-AI functions (image/video pre/post-processing)
- + Higher Parallelism and Computation-bounded

#### Language Models

- + Large Models (billions of parameters)
- + Small input size (context window of 128 to 32K tokens)
- + Latency and Throughput Sensitive
- + Data-dependent computation (token by token)
- + Pre-post processing: Tokenizer, Beam Search (etc.)
- + Single-card to multiple-card inference
- + Memory or I/O bounded

**Sparsity benefits both Vision and Language Models:**

- Reduced memory capacity and bandwidth requirements
- Faster computation






## **Introduction to ML Inference**



- + ML model operations converge to a small subset of operators
  - ONNX v1.15.0 (192 operators)
  - Key operators:
    - Account for >90% of parameters and computation FLOPs
    - Convolution, Matrix Multiplication, Inner Product, Element-wise Addition, Mean, Reshape, etc.



ResNet50: Conv, Matrix Multiplication, Pooling, ReLU



Figure 1: The Transformer - model architecture.

**Transformer:** Matrix Multiplication, Elementwise Operations, GELU, Softmax, Embedding Lookup, etc.

## **Sparsity in ML Inference**

- + The core of ML inference is **Tensor Algebra**
- + Zeros naturally exist or can be induced in Tensors
- + No need to store zero or compute zero in a tensor
  - Huge benefits: less storage, less computation time, less memory bandwidth, and lower power
  - "Sparsity Tax": extra hardware cost for compression, decompression, and scheduling (limits throughput and adds power/area overhead)





Weight/Activation





**Sparse Matrix Multiplication** 
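As a rough illustration of why skipping zeros pays off, the sketch below counts the multiply-accumulates (MACs) of a dense matrix product and the subset that are effectual, meaning both operands are non-zero. The function name `count_macs` and the toy matrices are assumptions made for this example; they do not come from the talk.

```python
def count_macs(weights, activations):
    """Return (dense_macs, effectual_macs) for weights (M x K) times activations (K x N)."""
    rows, inner, cols = len(weights), len(activations), len(activations[0])
    dense = rows * inner * cols
    effectual = 0
    for k in range(inner):
        # A MAC (i, k, j) does useful work only if W[i][k] and A[k][j] are both non-zero.
        nz_w = sum(1 for i in range(rows) if weights[i][k] != 0)
        nz_a = sum(1 for j in range(cols) if activations[k][j] != 0)
        effectual += nz_w * nz_a
    return dense, effectual


W = [[1, 0], [0, 2]]   # 50% weight sparsity
A = [[0, 3], [4, 0]]   # 50% activation sparsity
print(count_macs(W, A))  # (8, 2)
```

With both operands half empty, only a quarter of the dense MACs remain. Whether hardware can turn that into time or power savings depends on the overhead discussed above.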

### **Sparsity is an Active Algorithm Research Area**





Google & DeepMind paper, "Fast Sparse ConvNets"

- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (MIT), ICLR 2019 Best Paper

 $\rightarrow$  A dense neural network contains a sparse subnetwork that, trained in isolation, can reach comparable accuracy

## **Type of Sparsity in ML Inference**

- + Static and Dynamic Sparsity
  - + Static Sparsity
    - + Static Weight Sparsity (Pruning)
  - + Dynamic Sparsity
    - + Activation Sparsity
    - + Conditional Sparsity
      - + Contextual/Attention Sparsity
      - + Mixture of Experts (MoE)
- + Sparsity Granularity
  - + Coarse-granularity Sparsity
  - + Fine-grained Sparsity
- + Sparsity Pattern
  - + Structured sparsity
  - + Unstructured sparsity
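The taxonomy above can be made concrete with a small sketch. It contrasts static fine-grained pruning (individual weights), static coarse-grained pruning (whole rows), and dynamic activation sparsity produced by ReLU. The magnitude thresholds and helper names are illustrative assumptions, not a method from the talk.

```python
def prune_fine_grained(weights, threshold):
    """Static, unstructured: zero each weight whose magnitude is below threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def prune_coarse_grained(weights, threshold):
    """Static, structured: zero a whole row when its mean magnitude is below threshold."""
    pruned = []
    for row in weights:
        keep = sum(abs(w) for w in row) / len(row) >= threshold
        pruned.append(list(row) if keep else [0.0] * len(row))
    return pruned


def relu(values):
    """Dynamic activation sparsity: which entries become zero depends on the input."""
    return [max(0.0, v) for v in values]


W = [[0.9, -0.05, 0.4],
     [0.02, 0.5, -0.01]]
print(prune_fine_grained(W, 0.1))    # zeros scattered inside both rows
print(prune_coarse_grained(W, 0.3))  # second row removed as a unit
print(relu([-1.0, 2.0, 0.0, -3.0]))  # zeros appear only at run time
```

Fine-grained pruning keeps more of the important weights but leaves an irregular pattern; coarse-grained pruning is easier for hardware to exploit because entire rows disappear.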







**Conditional Sparsity** 



## **Sparse Matrix Storage Format**



| Format | Encoding of the dense vector [0,2,0,0,3,4,0,0,0,0,0,5] |
|--------|--------------------------------------------------------|
| dense | [0,2,0,0,3,4,0,0,0,0,0,5] |
| bitmap | [010011000001 2345] |
| run-length / delta | [1 2, 2 3, 0 4, 5 5] |
| compressed sparse row / column | [1] [1 2, 2 3, 0 4, 5 5 |
| coordinate offset | [1 2, 5 3, 6 4, 12 5] |

Sparsity scale (tick marks): 0%, 10%, 70%, 90%, 99.9%, 99.99999%. Regions in order of increasing sparsity: dense, low sparsity, medium sparsity, moderate sparsity, high sparsity, extreme.

- + Bitmap
- + Run-length / delta
- + Compressed Sparse Row / Column (CSR/CSC)
- + Coordinate Offset (index, value)
- + Hierarchical Hybrid Sparse Format
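To make the first few formats concrete, here is a minimal sketch that encodes the 12-element example vector from this slide as a bitmap, as run-length pairs, and as coordinate pairs. The function names are hypothetical, and the coordinate encoder uses 0-based indices, which is a convention chosen for this sketch.

```python
def to_bitmap(dense):
    """One mask bit per element plus the packed non-zero values."""
    mask = "".join("1" if v != 0 else "0" for v in dense)
    values = [v for v in dense if v != 0]
    return mask, values


def to_run_length(dense):
    """(zeros_skipped, value) pairs: zeros since the previous non-zero, then the value."""
    pairs, zeros = [], 0
    for v in dense:
        if v == 0:
            zeros += 1
        else:
            pairs.append((zeros, v))
            zeros = 0
    return pairs


def to_coordinate(dense):
    """(index, value) pairs with 0-based indices (an assumption of this sketch)."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]


dense = [0, 2, 0, 0, 3, 4, 0, 0, 0, 0, 0, 5]
print(to_bitmap(dense))      # ('010011000001', [2, 3, 4, 5])
print(to_run_length(dense))  # [(1, 2), (2, 3), (0, 4), (5, 5)]
print(to_coordinate(dense))  # [(1, 2), (4, 3), (5, 4), (11, 5)]
```

The bitmap costs one bit per element regardless of sparsity, while the index-based formats grow with the number of non-zeros, which is why the preferred format shifts as sparsity increases.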

## Sparse Matrix Format: CSR and CSC Format





#### **CSR Format**

- Data: an array of all non-zero values, stored row by row
- Column\_offsets[i]: the column index of data[i]
- Row\_pointers[i]: the total number of non-zeros in rows 0 through i-1, i.e. where row i starts in the data array

#### **CSC Format**

- Data: an array of all non-zero values, stored column by column
- Row\_offsets[i]: the row index of data[i]
- Column\_pointers[i]: the total number of non-zeros in columns 0 through i-1, i.e. where column i starts in the data array
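The CSR description can be sketched as follows: build the three arrays from a dense matrix, then use them for a matrix-vector product that touches only non-zeros. The array names follow the slide; the helper names `to_csr` and `csr_matvec` are assumptions for this example. CSC is the same construction with the roles of rows and columns swapped.

```python
def to_csr(matrix):
    """Return (data, column_offsets, row_pointers) for a dense list-of-lists matrix."""
    data, column_offsets, row_pointers = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                data.append(v)
                column_offsets.append(j)
        # row_pointers[i + 1] = number of non-zeros in rows 0..i
        row_pointers.append(len(data))
    return data, column_offsets, row_pointers


def csr_matvec(data, column_offsets, row_pointers, x):
    """Compute y = M @ x while visiting only the stored non-zeros."""
    y = []
    for i in range(len(row_pointers) - 1):
        acc = 0
        for p in range(row_pointers[i], row_pointers[i + 1]):
            acc += data[p] * x[column_offsets[p]]
        y.append(acc)
    return y


M = [[5, 0, 0, 1],
     [0, 0, 0, 0],
     [0, 2, 3, 0]]
data, cols, ptrs = to_csr(M)
print(data, cols, ptrs)                          # [5, 1, 2, 3] [0, 3, 1, 2] [0, 2, 2, 4]
print(csr_matvec(data, cols, ptrs, [1, 2, 3, 4]))  # [9, 0, 13]
```

An empty row shows up as two equal consecutive pointers, so it costs no data storage and no inner-loop iterations.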







**Coordinate Index** Structured and Unstructured Sparsity

#### **Hierarchical Hybrid Format**

**Top-level:** bit-vector format, e.g. (0, 1, 1, 0)

**Block-level:** CSR/CSC/Coordinate Offset
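A two-level layout of this kind might look like the sketch below: a top-level bit vector marks which fixed-size blocks contain any non-zero, and each non-empty block stores coordinate-offset pairs. The block size, the choice of coordinate offsets at block level, and the function names are all assumptions for illustration. The input length is assumed to be a multiple of the block size.

```python
def to_hierarchical(dense, block_size):
    """Return (top_bits, blocks); blocks holds (offset, value) lists for non-empty blocks only."""
    top_bits, blocks = [], []
    for start in range(0, len(dense), block_size):
        block = dense[start:start + block_size]
        entries = [(off, v) for off, v in enumerate(block) if v != 0]
        top_bits.append(1 if entries else 0)
        if entries:
            blocks.append(entries)
    return top_bits, blocks


def from_hierarchical(top_bits, blocks, block_size):
    """Rebuild the dense vector; empty blocks are skipped without reading any payload."""
    dense = [0] * (len(top_bits) * block_size)
    stored = iter(blocks)
    for b, flag in enumerate(top_bits):
        if flag:
            for off, v in next(stored):
                dense[b * block_size + off] = v
    return dense


vec = [0, 0, 0, 0,  0, 7, 0, 8,  9, 0, 0, 0,  0, 0, 0, 0]
top, blocks = to_hierarchical(vec, block_size=4)
print(top)     # [0, 1, 1, 0]
print(blocks)  # [[(1, 7), (3, 8)], [(0, 9)]]
```

The top level lets a scheduler skip entire empty blocks cheaply, while the block level keeps index widths small because offsets are local to a block.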






### **Sparsity Support on Hardware Devices**





- Highly sparse matrix/vector workloads (HPC field)
- Coarse-grained sparsity; fine-grained 2:4 structured sparsity
- All sparsity types (dynamic, static, structured, unstructured, fine-grained, coarse-grained, conditional execution)

## **Challenges in Designing Sparse Accelerators**



- TCO Saving
- Wall-clock speedup
- Power saving
- Area Saving (memory, die size)



**General AI Accelerator Architecture** 

Sparse AI Accelerator Design Trade-off

### **Example of End-to-End Compile-Aware Architecture Simulator**





- + Rapid architectural exploration
  - Hardware Architecture Models
  - Workload Models
- + Adaptive compiler and simulator
  - Sparsity ratio and overhead in the loop
  - Results in seconds
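To show what "sparsity ratio and overhead in the loop" can mean at its simplest, here is a toy first-order model. It is an assumption made for this write-up and is far cruder than a real architecture simulator: useful work shrinks with the sparsity ratio, and a fixed "sparsity tax" (decompression, scheduling) is charged as a fraction of the dense runtime.

```python
def sparse_speedup(sparsity, tax):
    """Estimated wall-clock speedup over dense execution.

    sparsity: fraction of dense MACs that are skipped (0.0 to 1.0)
    tax:      sparsity overhead as a fraction of dense runtime (hypothetical parameter)
    """
    remaining = (1.0 - sparsity) + tax
    if remaining <= 0.0:
        raise ValueError("model needs some remaining work or overhead")
    return 1.0 / remaining


for s in (0.5, 0.75, 0.9):
    print(f"sparsity={s:.2f}  speedup={sparse_speedup(s, tax=0.05):.2f}x")
```

Under this model, sparsity produces a net gain only when the skipped fraction exceeds the tax, which is one way to read the design trade-off between sparsity benefits and sparsity overhead.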

Screenshot: Arch Simulator GUI (Script, Config, System, and Workloads tabs) listing ResNet conv2d workload nodes alongside per-node parameters such as dilations, strides, and prior nodes.






## An Overview of Mainstream AI Accelerator Architecture



| Popular AI<br>Accelerators | <ul> <li>CPU (X86, RISC-V): Vector/Matrix Instruction Extension</li> <li>Nvidia Tensor Core: 4x4 GEMM</li> <li>Huawei Ascend: 16x16 GEMM + VPU</li> <li>Google TPU: Systolic Array + VPU</li> <li>Graphcore: Massively Parallel BSP Cores</li> <li>SambaNova: Dataflow RDU</li> <li>Cerebras: Wafer-scale many-core architecture</li> <li>Habana Labs/Intel Spring Hill: DSP Array + GEMM</li> <li>Cambricon/Hanguang 800/NVDLA/Tesla FSD: DSA Accelerators</li> </ul> |
|---|---|
| Special<br>Technology<br>AI Accelerators | <ul> <li>Spiking Neural Nets and Neuromorphic Architectures</li> <li>Resistor/Memristor matrices and Analog Computing</li> <li>Optical and Spintronics Implementations</li> </ul> |

**Key buzzwords:** Systolic Array, Tensor Core, Vector Core, Many-core, DSA, Dataflow

> Special Technology AI Accelerators: Very efficient for specific applications, limited operator support

## **Sparsity Support on CPUs**



- + CPUs offer thread-level parallelism and dense vector/matrix extensions
  - Limited by the low peak MAC performance of CPUs
  - Limited by software overhead for sparse-matrix compression and decompression
    - Limited speedup for sparse matrices
    - Intel Sparse BLAS support is available



Neural Magic's DeepSparse inference runtime on CPU

## **Nvidia Ampere/Hopper Sparse Tensor Core**



#### + Fine-grained Structured 2:4 Sparsity

- Minimal change to the Tensor Core design
- Strong constraint on the sparsity distribution (at most 2 non-zeros in every group of 4 weights)
- 1.44x to 1.85x speedup on Matmul/Conv kernels
- 1.3x to 1.5x speedup for end-to-end applications (BERT/ResNeXt)
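The 2:4 structured pattern used by Ampere/Hopper Sparse Tensor Cores can be sketched in a few lines: in every group of four consecutive weights, keep the two with the largest magnitude and record their positions. The magnitude criterion and the function names are assumptions for illustration; this does not reproduce Nvidia's pruning tools or on-chip metadata layout.

```python
def compress_2_of_4(row):
    """Return (values, indices): two kept values per group of 4 and their in-group positions."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        # Positions of the two largest magnitudes, restored to ascending order.
        keep = sorted(sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2])
        values.extend(group[i] for i in keep)
        indices.extend(keep)  # each index is 0..3, so it fits in 2 bits
    return values, indices


def decompress_2_of_4(values, indices):
    """Expand back to the pruned dense row (dropped weights become 0)."""
    dense = [0.0] * (len(values) * 2)
    for n, (v, i) in enumerate(zip(values, indices)):
        dense[(n // 2) * 4 + i] = v
    return dense


row = [0.1, -0.9, 0.3, 0.05, 2.0, 0.0, -1.0, 0.5]
vals, idx = compress_2_of_4(row)
print(vals, idx)                     # [-0.9, 0.3, 2.0, -1.0] [1, 2, 0, 2]
print(decompress_2_of_4(vals, idx))  # [0.0, -0.9, 0.3, 0.0, 2.0, 0.0, -1.0, 0.0]
```

Because every group keeps exactly two values, the compressed row is always half the original length, which is what keeps the hardware change small and also why the pattern constrains where zeros may fall.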



## MIT Eyeriss Project – Eyeriss v1 (2016 ISSCC)



#### + One of the earliest AI accelerator chips

- A spatial multi-PE architecture
- Supports weight sparsity by reducing memory footprint and bandwidth
- Saves power by clock-gating PEs on zero operands
- No wall-clock speedup





Fig. 12. PE architecture. The datapaths in red show the data gating logic to skip the processing of zero ifmap data.

## MIT Eyeriss Project – Eyeriss v2 (2018)

### + Compared to Eyeriss v1

- A Scalable Architecture
- Changed compressed matrix format
- Dual-sparsity Support
- Wall-clock speedup





| data vector:    | { <b>a</b> , b, <b>c</b> , d, e, <b>f</b> , <b>g</b> , <b>h</b> , i, <b>j</b> , k, l} |
|-----------------|---------------------------------------------------------------------------------------|
| count vector:   | $\{1, 0, 0, 0, 1, 2, 3, 1, 1, 0, 0, 0\}$                                              |
| address vector: | {0, 2, 5, 6, 6, 7, 9, 9, 12}                                                          |
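The three vectors above can be decoded with a short sketch, assuming a CSC-style reading: the address vector gives where each column starts in the data vector, and each count entry gives the number of zeros skipped in that column before the corresponding value. That interpretation, the matrix height of 4 rows, and the function name are assumptions of this example.

```python
def decode_csc(data, count, address, num_rows):
    """Rebuild a dense matrix from (data, count, address) column-compressed vectors."""
    num_cols = len(address) - 1
    dense = [[0] * num_cols for _ in range(num_rows)]
    for col in range(num_cols):
        row = 0
        for p in range(address[col], address[col + 1]):
            row += count[p]          # skip the zeros that precede this value
            dense[row][col] = data[p]
            row += 1                 # move past the value just placed
    return dense


data = list("abcdefghijkl")
count = [1, 0, 0, 0, 1, 2, 3, 1, 1, 0, 0, 0]
address = [0, 2, 5, 6, 6, 7, 9, 9, 12]
for line in decode_csc(data, count, address, num_rows=4):
    print(line)
```

Repeated entries in the address vector (6, 6 and 9, 9) denote empty columns, so a column of zeros costs one address entry and no data.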




### EIE: Efficient Inference Engine on Compressed Deep Neural Network (2016)



#### + One of the earliest AI accelerator research projects

- A spatial multi-PE architecture
- Supports dual sparsity (weights and activations), reducing memory footprint and bandwidth and delivering wall-clock speedup
- Weight matrices: CSC format
- Proposed an activation buffer in front of the PEs for workload balancing
- Uses activations to look up the compressed weights

$\vec{a} = (0,\ 0,\ a_2,\ 0,\ a_4,\ a_5,\ 0,\ a_7)$

| PE  | $W$        |            |            |            |            |            |            |            | $\vec{b}$ | ReLU($\vec{b}$) |
|-----|------------|------------|------------|------------|------------|------------|------------|------------|-----------|-----------------|
| PE0 | $w_{0,0}$  | 0          | $w_{0,2}$  | 0          | $w_{0,4}$  | $w_{0,5}$  | $w_{0,6}$  | 0          | $b_0$     | $b_0$           |
| PE1 | 0          | $w_{1,1}$  | 0          | $w_{1,3}$  | 0          | 0          | $w_{1,6}$  | 0          | $b_1$     | $b_1$           |
| PE2 | 0          | 0          | $w_{2,2}$  | 0          | $w_{2,4}$  | 0          | 0          | $w_{2,7}$  | $-b_2$    | 0               |
| PE3 | 0          | $w_{3,1}$  | 0          | 0          | 0          | $w_{3,5}$  | 0          | 0          | $b_3$     | $b_3$           |
| PE0 | 0          | $w_{4,1}$  | 0          | 0          | $w_{4,4}$  | 0          | 0          | 0          | $-b_4$    | 0               |
| PE1 | 0          | 0          | 0          | $w_{5,4}$  | 0          | 0          | 0          | $w_{5,7}$  | $b_5$     | $b_5$           |
| PE2 | 0          | 0          | 0          | 0          | $w_{6,4}$  | 0          | $w_{6,6}$  | 0          | $b_6$     | $b_6$           |
| PE3 | $w_{7,0}$  | 0          | 0          | $w_{7,4}$  | 0          | 0          | $w_{7,7}$  | 0          | $-b_7$    | 0               |
| PE0 | $w_{8,0}$  | 0          | 0          | 0          | 0          | 0          | 0          | $w_{8,7}$  | $-b_8$    | 0               |
| PE1 | $w_{9,0}$  | 0          | 0          | 0          | 0          | 0          | $w_{9,6}$  | $w_{9,7}$  | $-b_9$    | 0               |
| PE2 | 0          | 0          | 0          | 0          | $w_{10,4}$ | 0          | 0          | 0          | $b_{10}$  | $b_{10}$        |
| PE3 | 0          | 0          | $w_{11,2}$ | 0          | 0          | 0          | 0          | $w_{11,7}$ | $-b_{11}$ | 0               |
| PE0 | $w_{12,0}$ | 0          | $w_{12,2}$ | 0          | 0          | $w_{12,5}$ | 0          | $w_{12,7}$ | $-b_{12}$ | 0               |
| PE1 | $w_{13,0}$ | $w_{13,2}$ | 0          | 0          | 0          | 0          | $w_{13,6}$ | 0          | $b_{13}$  | $b_{13}$        |
| PE2 | 0          | 0          | $w_{14,2}$ | $w_{14,3}$ | $w_{14,4}$ | $w_{14,5}$ | 0          | 0          | $b_{14}$  | $b_{14}$        |
| PE3 | 0          | 0          | $w_{15,2}$ | $w_{15,3}$ | 0          | $w_{15,5}$ | 0          | 0          | $-b_{15}$ | 0               |

Figure 2. Matrix W and vectors a and b are interleaved over 4 PEs. Elements of the same color are stored in the same PE.

| Virtual<br>Weight     | W<sub>0,0</sub> | W<sub>8,0</sub> | W<sub>12,0</sub> | W<sub>4,1</sub> | W<sub>0,2</sub> | W<sub>12,2</sub> | W<sub>0,4</sub> | W<sub>4,4</sub> | W<sub>0,5</sub> | W<sub>12,5</sub> | W<sub>0,6</sub> | W<sub>8,7</sub> | W<sub>12,7</sub> |
|-----------------------|-----------------|-----------------|------------------|-----------------|-----------------|------------------|-----------------|-----------------|-----------------|------------------|-----------------|-----------------|------------------|
| Relative<br>Row Index | 0               | 1               | 0                | 1               | 0               | 2                | 0               | 0               | 0               | 2                | 0               | 2               | 0                |
| Column<br>Pointer     | 0               | 3               | 4                | 6               | 6               | 8                | 10              | 11              | 13              |                  |                 |                 |                  |

Figure 3. Memory layout for the relative indexed, indirect weighted and interleaved CSC format, corresponding to  $PE_0$  in Figure 2.
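The idea of using activations to look up compressed weights can be sketched on a single PE: weights are stored column by column (CSC), and only columns that match a non-zero activation are ever fetched. This simplified version uses absolute row indices and omits the interleaving across PEs and the relative indexing shown in Figure 3; the function names are assumptions for illustration.

```python
def to_csc(matrix):
    """Return (data, row_offsets, column_pointers) for a dense list-of-lists matrix."""
    rows, cols = len(matrix), len(matrix[0])
    data, row_offsets, column_pointers = [], [], [0]
    for j in range(cols):
        for i in range(rows):
            if matrix[i][j] != 0:
                data.append(matrix[i][j])
                row_offsets.append(i)
        column_pointers.append(len(data))
    return data, row_offsets, column_pointers


def dual_sparse_matvec_relu(csc, num_rows, activations):
    """Compute ReLU(W @ a), skipping zero activations and zero weights. Returns (b, macs)."""
    data, row_offsets, column_pointers = csc
    b = [0] * num_rows
    macs = 0
    for j, a_j in enumerate(activations):
        if a_j == 0:
            continue  # zero activation: the whole weight column is never read
        for p in range(column_pointers[j], column_pointers[j + 1]):
            b[row_offsets[p]] += data[p] * a_j
            macs += 1
    return [max(0, v) for v in b], macs


W = [[1, 0, 2],
     [0, 3, 0],
     [-4, 0, 0]]
out, macs = dual_sparse_matvec_relu(to_csc(W), num_rows=3, activations=[1, 5, 0])
print(out, macs)  # [1, 15, 0] 3
```

A dense implementation would issue 9 MACs for this product; here only 3 are performed, and the ReLU at the end creates fresh zeros for the next layer.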

## Alibaba Hanguang-800 Sparsity Engine (2020)



### + A High-performance Commercial Data-center Inference Chip

- DSA architecture
- Support weight compression in memory to reduce memory footprint
- No external DDR; all memory is on-chip
- Weight matrices: bit-vector representation for low to medium sparsity
- No wall-clock speedup



## **Compressed and Quantized Storage/Processing**

Source: Hanguang 800 NPU – The Ultimate AI Inference Solution for Data Centers, Hotchips 2020

## SambaNova RDU Sparsity Support



### + A Reconfigurable Dataflow Tiled Architecture (RDU series)

- Scalable design: on-chip switches connect an array of RDUs and memory units
- Scale-out support
- Supports CSR-like matrix compression
- Wall-clock speedup

### Sparse Matrix Multiply on RDU




## **Cerebras Sparsity Support**



- + A commercial dataflow wafer-scale spatial architecture
  - Fine-grained, fully unstructured sparse MatMul
- + 10x sparse utilization vs. GPU
  - The weight sparse storage format is not clear

## GEMM with Sparse Input

### Dataflow scheduling enables fully unstructured sparse MatMul with low overhead

- Executed as a series of AXPY operations per row
- Row of non-zero weights broadcast over columns of cores
- Each individual weight triggers FMACs
- No compute for zero weights; they are not streamed in at all
- No memory used for weights; they are not even stored temporarily
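In software terms, the scheme in the list above resembles the following sketch: only non-zero weights are streamed, and each one triggers an AXPY (`y += alpha * x`) on a row of the input. This is an analogy written for this write-up with hypothetical names; it does not model how the work is actually mapped onto the wafer.

```python
def axpy(alpha, x, y):
    """In-place y += alpha * x, one fused multiply-add per element."""
    for j in range(len(x)):
        y[j] += alpha * x[j]


def streamed_sparse_gemm(weight_stream, inputs, num_out_rows):
    """Compute Y = W @ X from a stream of (out_row, in_row, weight) triples.

    Zero weights never appear in the stream, so they cost neither compute nor storage.
    """
    width = len(inputs[0])
    outputs = [[0] * width for _ in range(num_out_rows)]
    for out_row, in_row, weight in weight_stream:
        axpy(weight, inputs[in_row], outputs[out_row])
    return outputs


# Dense W would be [[2, 0], [0, 0], [1, 3]]; only its three non-zeros are streamed.
stream = [(0, 0, 2), (2, 0, 1), (2, 1, 3)]
X = [[1, 2],
     [3, 4]]
print(streamed_sparse_gemm(stream, X, num_out_rows=3))  # [[2, 4], [0, 0], [10, 14]]
```

Because the work is driven entirely by arriving weights, the pattern of zeros can be arbitrary, which is what makes fully unstructured sparsity practical in this style of execution.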




### **Moffett Deep-Sparse AI Inference Cards**



- Complete system-on-chip with deep sparse processing units supporting up to **32x sparsity**
- One chip, multiple PCIe products
- Complete end-to-end software toolchain (SparseOPT, SparseRT, SOLA runtime); see http://docs.moffettai.com
- Performance results validated through MLCommons AI benchmarks; see http://mlcommons.org



## **Summary**



### + Sparsity is an active research area

- Promising direction for both Vision and LLM
- Save computation, memory bandwidth/capacity and power
- Reduce TCO

### + The memory storage format is the key

- Affected by the algorithm (sparsity ratio, accuracy)
- Impacts memory, datapath, and scheduler design

### + Sparse AI accelerators must trade off along more dimensions

- Model accuracy, sparsity overhead, and sparsity benefits

### + Research and commercial AI accelerators are embracing sparsity




## Thank you and Questions?