![](_page_0_Picture_0.jpeg)

### DARKSIDE: 2.6GFLOPS, 8.7mW Heterogeneous RISC-V Cluster for Extreme-Edge On-Chip DNN Inference and Training

<u>A. Garofalo</u>, M. Perotti, L. Valente, Y. Tortorella, A. Nadalini, L. Benini, D. Rossi and F. Conti

DEI, University of Bologna, Italy & IIS, ETH Zurich, Switzerland

angelo.garofalo@unibo.it

![](_page_1_Picture_1.jpeg)

- Introduction and Motivation
- Darkside: Heterogeneous SoC Architecture
  - RVNN Cores
  - Depth-wise Engine
  - DataMover
  - Tensor Product Engine
- Chip Results Summary
  - Implementation Results
  - Benchmarking
  - Comparison with State-of-the-Art
- Conclusion

# Introduction and Motivation

### TinyML: Deploy DL and ML at the Extreme-Edge of IoT

- AI-enhanced IoT Applications;
- Reduced privacy issues, lower transmission power ...

### **Challenges**

- High computational and memory requirements (ML + DL);
- On-chip inference and training within the power budget typical of MCU-class devices (a few hundred mW).

### Opportunities

- Reduced precision ML & DL models (both integer and floating-point);
- Low-Bitwidth Mixed-Precision integer computation;
- Specialized acceleration solutions (MAC, SIMD, vectors, systolic arrays, ...).

![](_page_2_Picture_13.jpeg)

![](_page_2_Picture_14.jpeg)

![](_page_2_Picture_15.jpeg)


![](_page_2_Picture_16.jpeg)

# **Reduced Precision DL Models**

![](_page_3_Picture_1.jpeg)

![](_page_3_Figure_2.jpeg)

### **DNN Inference**: <u>Mixed-precision integer arithmetic</u>

| Quantization Method | Top-1 Accuracy | Accuracy Drop | Weight Memory Footprint | Footprint Reduction |
|---------------------|----------------|---------------|-------------------------|---------------------|
| Full-Precision      | 70.9%          | -             | 16.27 MB                | -                   |
| INT-8               | 70.1%          | 0.8%          | 4.06 MB                 | **4x**              |
| INT-4               | 66.46%         | 4.4%          | 2.35 MB                 | **7x**              |
| Mixed-Precision     | 68%            | 2.9%          | 2.09 MB                 | **8x**              |

Courtesy of M. Rusci. Example on MobileNetV1\_224\_1.0.
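The accuracy-drop and footprint-reduction figures follow directly from the accuracy and footprint columns. The short Python sketch below recomputes them from the values in the table; the function and variable names are illustrative only and do not come from the original slides.

```python
# Values copied from the MobileNetV1_224_1.0 table: (Top-1 accuracy in %, weight footprint in MB).
models = {
    "Full-Precision": (70.9, 16.27),
    "INT-8": (70.1, 4.06),
    "INT-4": (66.46, 2.35),
    "Mixed-Precision": (68.0, 2.09),
}


def summarize(table, baseline="Full-Precision"):
    """Return {name: (accuracy drop in points, footprint reduction factor)} vs. the baseline."""
    base_acc, base_mb = table[baseline]
    result = {}
    for name, (acc, mb) in table.items():
        drop = round(base_acc - acc, 2)   # percentage points lost vs. full precision
        ratio = round(base_mb / mb, 1)    # how many times smaller the weights are
        result[name] = (drop, ratio)
    return result


summary = summarize(models)
for name, (drop, ratio) in summary.items():
    print(f"{name:16s} drop = {drop:.2f} pt, footprint reduction = {ratio:.1f}x")
```

The slide rounds the reduction factors to 4x, 7x and 8x.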

#### On-Chip Training (Continual, Federated Learning, ...):

![](_page_3_Figure_7.jpeg)

<u>Floating-point arithmetic (16-bit) required</u>

(\*) Bianco, Simone, Remi Cadene, Luigi Celona, and Paolo Napoletano. "Benchmark analysis of representative deep neural network architectures." IEEE Access 6 (2018): 64270-64277.

# Extreme-Edge AI Computing Platforms

|                             | ASICs        | FPGAs    | MCUs    |
|-----------------------------|--------------|----------|---------|
| Throughput [Gop/s]          | 1 K – 50 K   | 10 - 200 | 0.1 – 2 |
| Energy Efficiency [Gop/s/W] | 10 K – 100 K | 1 - 10   | 1 – 50  |
| Flexibility/Programmability | Low          | Medium   | High    |

#### IoT End-Nodes scenario (MCUs):

- Lack of support for mixed-precision integer arithmetic at the ISA level (RISC-V, ARM); → Huge Overhead!
- Missing low-power specialized solutions to speed up low-reuse kernels and compute-intensive floating-point workloads.

#### This Work:

- Heterogeneous Compute Cluster:
  - Enhanced RISC-V cores with advanced integer mixed-precision capabilities;
  - Tightly-coupled specialized accelerators to boost the heavy kernels dominating the workload.

#### Mixed-Precision kernel: RISC-V Assembly

- `p.lw x10, 4(x4!)`
- `p.lw x11, 4(x5!)`
- → `p.extract x5, x11, 4, 0`
- → `p.extract x6, x11, 4, 4`
- → `p.extract x7, x11, 4, 8`
- → `p.extract x8, x11, 4, 12`
- `pv.packlo.b x15, x5, x6`
- → `pv.packhi.b x15, x7, x8`
- **`pv.sdotsp.b x20, x15, x10`**
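To make the unpacking overhead concrete, here is a minimal behavioral sketch in Python of what the sequence above computes: four signed 4-bit fields are extracted from one word, widened to 8-bit lanes, and fed to a 4-lane signed dot product with accumulation. The helper names, the operand roles (4-bit weights, 8-bit activations) and the example words are assumptions made for illustration; this is not a cycle-accurate or bit-exact model of the instructions.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def extract_nibbles(word, count=4):
    """Model of the extract steps: pull `count` signed 4-bit fields from the low bits of a word."""
    return [sign_extend(word >> (4 * i), 4) for i in range(count)]


def unpack_bytes(word):
    """Split a 32-bit word into four signed 8-bit lanes, lowest byte first."""
    return [sign_extend(word >> (8 * i), 8) for i in range(4)]


def sdot4(acc, a_lanes, b_lanes):
    """Model of a 4-lane signed sum-of-dot-product that accumulates into `acc`."""
    return acc + sum(a * b for a, b in zip(a_lanes, b_lanes))


# Hypothetical example words (assumption: 4-bit weights, 8-bit activations).
weights_word = 0x0000F271   # nibbles, low to high: 1, 7, 2, -1
acts_word = 0x04FD0201      # bytes, low to high: 1, 2, -3, 4

weights = extract_nibbles(weights_word)
acts = unpack_bytes(acts_word)
result = sdot4(0, weights, acts)   # 1*1 + 7*2 + 2*(-3) + (-1)*4
print(weights, acts, result)
```

On a core without mixed-precision support, every dot-product instruction is preceded by this extract-and-pack work, which is the overhead the slide highlights.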

![](_page_5_Picture_1.jpeg)


## Darkside Architecture

![](_page_6_Picture_1.jpeg)

![](_page_6_Figure_2.jpeg)

- 8 **RVNN** cores (32-bit custom RISC-V ISA);
- Depth-wise Engine (**DWE**) to boost low-reuse depth-wise convolutions;
- Tensor Product Engine (**TPE**) to boost IEEE FP16 MatMuls;
- DataMover for efficient data marshalling;
- Accelerators encapsulated within a standardized Hardware Processing Engine (HWPE) interface;
- Heterogeneous Cluster Interconnect (HCI) for tightly-coupled integration;
- DMA controller for double buffering and DNN model tiling;
- HW synchronization unit for efficient parallelization and event-based execution.

## Darkside Execution Model

![](_page_7_Picture_1.jpeg)

![](_page_7_Figure_2.jpeg)

- TCDM as a tightly-coupled memory buffer for all the compute units of the cluster:
  - Efficient cooperation among hardware compute units;
  - Support for complex ML and DL execution models (e.g. full MobileNetV2, FC Autoencoder).

![](_page_7_Figure_6.jpeg)

## **RVNN Cores µArchitecture**

![](_page_8_Picture_1.jpeg)

![](_page_8_Figure_2.jpeg)

#### Mac-Load: MatMul Inner Kernel

![](_page_9_Picture_1.jpeg)

Up to **94%** SIMD Dotp unit utilization on MatMul kernels

Up to **12.7x** performance improvements over RI5CY on mixed-precision MatMul kernels

MILAN 2022

### Depth-Wise Engine & DataMover

![](_page_10_Figure_1.jpeg)

#### **DataMover:**

- 1b-32b configurable-precision, on-the-fly efficient data transposition;
- Up to **100x** less transposition time than SW (scales with the precision of the data to transpose);
- 54 kGE.

#### **Depth-Wise Engine (DWE):**

- Boosts low-reuse 8-bit (integer) 3x3 depth-wise convolutions;
- Weight-stationary data flow to maximize data reuse;
- Fully exploits the 36 B/cycle memory bandwidth through the shallow HCI branch;
- Peak performance of **30 MAC/cycle**;
- 131 kGE.
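As a reference for the kernel the DWE targets, the sketch below is a plain-Python 3x3 depth-wise convolution (stride 1, no padding). Each channel is filtered only by its own 3x3 kernel, so weights are never shared across channels, which is why the data reuse is low compared with a standard convolution. The function name and the toy tensors are illustrative assumptions; the real engine works on 8-bit data with its own dataflow.

```python
def depthwise_conv3x3(x, w):
    """Depth-wise 3x3 convolution, stride 1, no padding.

    x: input as [C][H][W] nested lists of ints
    w: weights as [C][3][3] nested lists of ints (one filter per channel)
    Returns the output as [C][H-2][W-2].
    """
    out = []
    for x_c, w_c in zip(x, w):            # each channel uses only its own filter
        height, width = len(x_c), len(x_c[0])
        plane = []
        for i in range(height - 2):
            row = []
            for j in range(width - 2):
                acc = 0
                for ki in range(3):
                    for kj in range(3):
                        acc += x_c[i + ki][j + kj] * w_c[ki][kj]   # 9 MACs per output
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy example: two channels, 3x3 input, so one output pixel per channel.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
w = [
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],   # picks the centre pixel
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],   # sums the 3x3 window
]
y = depthwise_conv3x3(x, w)
print(y)
```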

![](_page_10_Figure_12.jpeg)

Angelo Garofalo

# **Tensor Product Engine**

![](_page_11_Picture_1.jpeg)

![](_page_11_Figure_2.jpeg)

Boosts IEEE FP16 matrix multiplications:

- Array of 32 FMA units, organized in 8 rows and 4 columns;
- FMAs cascaded along the rows;
- Trade-off between performance and area.

#### Execution scheduling optimized:

- Streaming always overlaps computation;
- Data reuse is maximized;
- **98%** FMA utilization;
- Near-ideal performance: **31.6 MAC/cycle** (ideal is 32 MAC/cycle);
- Up to 22x speed-up over SW MatMul execution.
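A quick sanity check of the TPE figures, as a Python sketch. It reads the throughput figure as MACs per cycle (one per FMA unit), counts each MAC as two floating-point operations, and assumes the 290 MHz clock quoted on the benchmark slides; that frequency is an assumption here, since this slide does not state an operating point.

```python
N_FMA_UNITS = 32          # 8 rows x 4 columns
MAC_PER_CYCLE = 31.6      # sustained throughput from the slide
FLOP_PER_MAC = 2          # one multiply plus one add
FREQ_HZ = 290e6           # assumption: clock taken from the benchmark slides

utilization = MAC_PER_CYCLE / N_FMA_UNITS
gflops = MAC_PER_CYCLE * FLOP_PER_MAC * FREQ_HZ / 1e9

print(f"FMA utilization: {utilization:.1%}")
print(f"Throughput at {FREQ_HZ / 1e6:.0f} MHz: {gflops:.1f} GFLOPS")
```

Under these assumptions the result is about 18.3 GFLOPS, in the same range as the 18.2 GFLOPS peak FP16 performance reported in the comparison table.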

## Bottleneck Layer

![](_page_12_Picture_1.jpeg)

![](_page_12_Figure_2.jpeg)

![](_page_13_Picture_1.jpeg)


# Chip Results Summary

![](_page_14_Picture_1.jpeg)

![](_page_14_Figure_2.jpeg)

## Performance vs. Energy Efficiency

![](_page_15_Picture_1.jpeg)

![](_page_15_Figure_2.jpeg)

- ASIC-like efficiency on low-bitwidth integer workloads;
- IEEE FP16 MatMuls on the TPE deliver **17.7x** better performance and **21.8x** better energy efficiency than SW execution.

![](_page_16_Picture_0.jpeg)

## TinyML Benchmarks

#### **End-to-End Inference**

- Mixed-precision MobileNetV2:
  - ~1 MB footprint;
  - 69.4% Top-1 accuracy;
- Optimized execution flow for efficient L2-L1 data movements [Burr21];
- Performance: 20 frames/s (@290 MHz);
- Energy per inference: **9.1 mJ** (65nm).

### **On-Chip Training**

- FC TinyML AutoEncoder;
- One training epoch benchmarked:
  - Full forward and backward steps and weight updates;
- Latency: **1.8 ms** (@290 MHz);
- Energy: **345 μJ**.
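The energy and latency figures imply an average power for each workload. The Python sketch below derives it, assuming that 20 frames/s means back-to-back inferences (50 ms per frame) and that the quoted energies cover the whole run; both are assumptions, and the helper name is illustrative.

```python
def average_power_mw(energy_mj, latency_ms):
    """Average power in mW from energy in mJ and latency in ms (mJ / ms = W)."""
    return energy_mj / latency_ms * 1e3


# End-to-end inference: 9.1 mJ per frame at 20 frames/s (assumed 50 ms per frame).
inference_latency_ms = 1000 / 20
p_inference = average_power_mw(9.1, inference_latency_ms)

# On-chip training: 345 uJ (0.345 mJ) for one epoch taking 1.8 ms.
p_training = average_power_mw(0.345, 1.8)

print(f"Inference: {p_inference:.0f} mW, training: {p_training:.0f} mW")
```

Under these assumptions both workloads average a little under 200 mW at 290 MHz.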

[Burr21]: Burrello, A., et al. DORY: Automatic end-to-end deployment of real-world DNNs on low-cost IoT MCUs. *IEEE Transactions on Computers*, 2021.

![](_page_16_Figure_16.jpeg)

### Comparison with SoA

![](_page_17_Picture_1.jpeg)

|                                                         | SleepRunner                          | SamurAI                                | VEGA                                          | Dustin                                 | This work                                     |
|---------------------------------------------------------|--------------------------------------|----------------------------------------|-----------------------------------------------|----------------------------------------|-----------------------------------------------|
| Technology                                              | 28nm                                 | 28nm                                   | 22nm                                          | 65nm                                   | 65nm                                          |
| CPU                                                     | 1xCM0DS                              | 1xRI5CY                                | **10x** RI5CY                                 | **16x** MPIC (RV)                      | 8xRV-NN (RV)                                  |
| INT Precision                                           | 32b                                  | 8b-32b                                 | 8b-32b                                        | 2b-32b<br>mixed-precision              | 2b-32b<br>mixed-precision                     |
| FP Precision                                            |                                      |                                        | FP32, FP16,<br>bfloat                         |                                        | FP32, FP16                                    |
| Best INT Perf.,<br>Best INT Eff.<br>@ Perf. (**8-bit**) | 31 MOPS,<br>97 MOPS/mW<br>@ 18 MOPS  | 1.5 GOPS,<br>230 GOPS/W<br>@ 110 MOPS  | 15.6 GOPS,<br>614 GOPS/W<br>@ 7.6 GOPS        | 15 GOPS,<br>303 GOPS/W<br>@ 4.4 GOPS   | 17 GOPS,<br>191 GOPS/W<br>@ 2.4 GOPS          |
| Best FP32 Perf.,<br>Best FP32 Eff.<br>@ Perf.           |                                      |                                        | 2 GFLOPS,<br>79 GFLOPS/W<br>@ 1 GFLOPS        |                                        | 1.03 GFLOPS,<br>12 GFLOPS/W<br>@ 0.4 GFLOPS   |
| Best IEEE **FP16** Perf.,<br>Best IEEE **FP16** Eff.<br>@ Perf. |                              |                                        | 3.3 GFLOPS,<br>129 GFLOPS/W<br>@ 1.27 GFLOPS  |                                        | 18.2 GFLOPS,<br>300 GFLOPS/W<br>@ 2.6 GFLOPS  |
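Each "Eff. @ Perf." entry pairs a best efficiency with the throughput at which it was measured, so dividing the two gives the power at that operating point. The Python sketch below does this for the FP16 row; reading the entries this way is an assumption, and the helper name is illustrative.

```python
def power_mw(perf_gflops, eff_gflops_per_w):
    """Power in mW at an operating point, from throughput and energy efficiency."""
    return perf_gflops / eff_gflops_per_w * 1e3


# FP16 best-efficiency operating points copied from the table.
darkside_fp16_mw = power_mw(2.6, 300)    # This work: 300 GFLOPS/W @ 2.6 GFLOPS
vega_fp16_mw = power_mw(1.27, 129)       # VEGA: 129 GFLOPS/W @ 1.27 GFLOPS

print(f"Darkside FP16: {darkside_fp16_mw:.1f} mW")
print(f"VEGA FP16:     {vega_fp16_mw:.1f} mW")
```

For Darkside this gives roughly 8.7 mW, which matches the 2.6 GFLOPS, 8.7 mW figure in the talk title.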

![](_page_18_Picture_1.jpeg)

# Conclusion

![](_page_19_Picture_1.jpeg)

- Darkside: low-power heterogeneous compute cluster in **65nm**;
- RVNN cores with 2b-to-32b mixed-precision integer computing capabilities and Mac-Load instructions (2x to 12.7x speed-up over RI5CY on linear kernels);
- Tightly-coupled accelerators to boost depth-wise kernels (up to **10x** speed-up over SW) and data marshalling operations (up to **100x** speed-up over SW);
- Tensor Product Engine (TPE) to boost FP16 MatMuls, achieving 300 GFLOPS/W within only 8.7 mW;
- End-to-end inference and training workloads (full MobileNetV2, FC AutoEncoder) with metrics better than or comparable to SoA solutions;
- Darkside is competitive with IoT end-nodes built in much more scaled technology nodes (Peak Integer Perf. 65 GOPS, En. Eff. 835 GOPS/W; Peak FP Perf. 18.2 GFLOPS, En. Eff. 300 GFLOPS/W).