

## A 1.15 TOPS/W, 16-Cores Parallel Ultra-Low Power Cluster with 2b-to-32b Fully Flexible Bit-Precision and Vector Lockstep Execution Mode

<u>A. Garofalo</u>, G. Ottavi, A. Di Mauro, F. Conti, L. Benini, D. Rossi

DEI, University of Bologna, Italy

angelo.garofalo@unibo.it

September 6-9, 2021







# Outline

- Introduction & Motivation
- Dustin Architecture Overview
  - Tunable Mixed-Precision Computation
  - Vector Lockstep Execution Mode
- Chip Results Summary
- Comparison with the State-of-the-art
- Conclusion

# Introduction & Motivation

## **Extreme Edge AI and TinyML**

- Lower latency and network load than cloud ML;
- Eases privacy concerns.

## Challenges

- High computational and memory requirements (ML + DL);
- Limited resources on IoT End-Nodes (microcontrollers).

## Opportunities

- Reduce DL/ML model size;
- Low-bitwidth mixed-precision computation;
- Reduce instruction fetch and decode overhead by exploiting data-parallel execution.









# Quantized Neural Networks (QNNs)





| Quantization Method | Top-1 Accuracy | Accuracy Drop | Weight Memory Footprint | Footprint Reduction |
|---------------------|----------------|---------------|-------------------------|---------------------|
| Full-Precision      | 70.9%          |               | 16.27 MB                |                     |
| INT-8               | 70.1%          | 0.8%          | 4.06 MB                 | 4x                  |
| INT-4               | 66.46%         | 4.4%          | 2.35 MB                 | 7x                  |
| Mixed-Precision     | 68%            | 2.9%          | 2.09 MB                 | 8x                  |

Courtesy of M. Rusci. Example on MobilenetV1\_224\_1.0.

Mixed-precision Quantized Neural Networks (QNNs) are the natural target for execution on constrained edge platforms.
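The accuracy drops and footprint reductions in the table above follow directly from the listed accuracies and sizes. The short check below recomputes them; the dictionary layout and the function name `compare_to_baseline` are illustrative choices, not part of the original slides.

```python
# Top-1 accuracy (%) and weight footprint (MB), copied from the table above.
models = {
    "Full-Precision": (70.9, 16.27),
    "INT-8": (70.1, 4.06),
    "INT-4": (66.46, 2.35),
    "Mixed-Precision": (68.0, 2.09),
}


def compare_to_baseline(models, baseline="Full-Precision"):
    """Return {name: (accuracy_drop_in_points, footprint_reduction_factor)}."""
    base_acc, base_mb = models[baseline]
    return {
        name: (round(base_acc - acc, 2), round(base_mb / mb, 2))
        for name, (acc, mb) in models.items()
    }


summary = compare_to_baseline(models)
for name, (drop, factor) in summary.items():
    print(f"{name:16s} drop = {drop:5.2f} points, reduction = {factor:5.2f}x")
```

The recomputed factors round to the 4x, 7x and 8x figures reported on the slide.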

(\*) Bianco, Simone, Remi Cadene, Luigi Celona, and Paolo Napoletano. "Benchmark analysis of representative deep neural network architectures." IEEE Access 6 (2018): 64270-64277.

# Edge AI Computing Platforms



|                             | ASICs        | FPGAs    | MCUs    |
|-----------------------------|--------------|----------|---------|
| Throughput [Gop/s]          | 1 K – 50 K   | 10 - 200 | 0.1 – 2 |
| Energy Efficiency [Gop/s/W] | 10 K – 100 K | 1 - 10   | 1 – 50  |
| Flexibility                 | Low          | Medium   | High    |
| Cost                        | High         | Medium   | Low     |

#### IoT End-Nodes scenario:

- Must be inexpensive and software programmable (MCUs);
- State-of-the-art ISAs (RISC-V (\*), ARM (\*\*)) support only uniform-precision integer arithmetic (with SIMD);
- Mixed-precision computation incurs a large overhead for data casting and packing.
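To make the casting and packing overhead concrete, here is a minimal Python model of an 8-bit by 4-bit dot product on a core that only offers uniform 8-bit SIMD. It is a sketch under stated assumptions: the helper names (`pack`, `unpack`, `dot8`, `mixed_dot_on_uniform_isa`) and the little-endian subword order are hypothetical and do not describe any specific ISA.

```python
def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def pack(values, bits):
    """Pack signed integers into one 32-bit word, element 0 in the low bits."""
    assert len(values) * bits == 32
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << bits) - 1)) << (i * bits)
    return word


def unpack(word, bits):
    """Split a 32-bit word into 32 // bits signed subwords."""
    return [to_signed(word >> (i * bits), bits) for i in range(32 // bits)]


def dot8(word_a, word_b):
    """Model of a uniform 8-bit SIMD dot product on two 32-bit words."""
    return sum(a * b for a, b in zip(unpack(word_a, 8), unpack(word_b, 8)))


def mixed_dot_on_uniform_isa(act_words, weight_word):
    """8-bit activations times 4-bit weights using only 8-bit SIMD.

    The eight 4-bit weights are first sign-extended and repacked into two
    words of 8-bit lanes; this extra step is the casting/packing overhead.
    """
    w4 = unpack(weight_word, 4)
    w8_words = [pack(w4[0:4], 8), pack(w4[4:8], 8)]
    return sum(dot8(a, w) for a, w in zip(act_words, w8_words))


activations = [1, -2, 3, -4, 5, -6, 7, -8]   # 8-bit values
weights = [7, -8, 1, 0, -1, 2, -3, 4]        # 4-bit values
act_words = [pack(activations[0:4], 8), pack(activations[4:8], 8)]
weight_word = pack(weights, 4)
result = mixed_dot_on_uniform_isa(act_words, weight_word)
```

In this model the unpack and repack steps do no useful arithmetic; they exist only because the dot product accepts a single operand width.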

### This Work:

- Low-power IoT end-node with a fully programmable RISC-V accelerator cluster;
- Mixed-precision 2b-to-32b SIMD instructions in the ISA of the cores;
- Vector Lockstep Execution Mode to boost efficiency on data-parallel DL/ML algorithms.

(\*) Garofalo et al. "XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Networks on RISC-V based IoT End Nodes." IEEE Transactions on Emerging Topics in Computing (2021).

(\*\*) Joseph Yiu, "Introduction to the Arm Cortex-M55 Processor," Feb. 2020. <u>https://pages.arm.com/cortex-m55-introduction.html</u>






## Dustin: Architecture Overview





#### The SoC is a Low-Power IoT End-Node with AI edge computing capabilities

- Microcontroller
  - 1 RISC-V core
  - 112 kB of L2 memory
  - Rich set of peripherals (UART, I2C, camera interface, ...)
  - JTAG (debug), GPIOs, ROM
  - Interrupt controller
- Parallel cluster accelerator of fully programmable RISC-V cores

## Dustin: Cluster





#### Accelerator Cluster

- 16 RI5CY (\*) cores augmented with 2b-to-32b SIMD instructions;
- Software-configurable Vector Lockstep Execution Mode (VLEM);
- Single-cycle latency TCDM interconnect, based on a req/gnt protocol and a word-level interleaved scheme;
- 128 kB of shared Tightly-Coupled L1 Data Memory (TCDM);
- Hierarchical instruction cache;
- High-performance DMA (L2 <-> L1);
- Event Unit supporting efficient synchronization among the cores.
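The word-level interleaving and req/gnt arbitration of the TCDM interconnect can be illustrated with a small Python model. This is a sketch only: the number of banks (`N_BANKS = 32`) and the fixed-priority arbitration are assumptions made for the example, since the slides state neither.

```python
WORD_BYTES = 4
N_BANKS = 32   # assumption: the bank count is not given in the slides
N_CORES = 16


def bank_of(byte_addr, n_banks=N_BANKS):
    """Word-level interleaving: consecutive 32-bit words map to consecutive banks."""
    return (byte_addr // WORD_BYTES) % n_banks


def grant(requests, n_banks=N_BANKS):
    """One arbitration cycle of a req/gnt scheme: at most one grant per bank.

    `requests` maps core id -> byte address. The lowest core id wins a
    contended bank (a fixed priority chosen only for this sketch); the
    remaining cores would have to retry in a later cycle.
    """
    granted, busy = {}, set()
    for core in sorted(requests):
        bank = bank_of(requests[core], n_banks)
        if bank not in busy:
            busy.add(bank)
            granted[core] = bank
    return granted


# Unit stride: core i reads word i, so all 16 requests hit distinct banks.
unit_stride = {core: core * WORD_BYTES for core in range(N_CORES)}
# Stride of N_BANKS words: every request falls in bank 0, so only one is granted.
bank_stride = {core: core * N_BANKS * WORD_BYTES for core in range(N_CORES)}

print(len(grant(unit_stride)), "grants with unit stride")
print(len(grant(bank_stride)), "grant with a stride of N_BANKS words")
```

With unit-stride accesses every core is served in the same cycle, which is the case the interleaved scheme is meant to favour.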

# **Core Enhancements**





- **RI5CY**: 4-stage, in-order, single-issue pipeline
  - **ISA**: RV32IMCXpulpV2
  - XpulpV2 extensions:
    - HW loops;
    - Post-increment LD/ST;
    - 16-/8-bit SIMD instructions;
    - Bit-manipulation instructions.
- **Goal**
  - HW support for mixed-precision SIMD instructions;
- **Challenge**
  - Enormous number of instructions to be encoded in the ISA;
- **Solution**
  - Dynamic Bit-Scalable Precision.

## Extended Dot-Product Unit





## Mixed-Precision Controller





Mixed-precision operations require a controller that selects the correct subword of the lowest-precision operand (Vector B) to be used in the current SIMD operation.

The controller is programmed through control and status registers (CSRs).
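The subword selection described above can be sketched as a small behavioural model. The class `MixedPrecisionController`, its fields, and the choice of consuming contiguous groups of Vector B in order are all assumptions made for illustration; the slides do not specify the actual selection order or CSR layout.

```python
def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def pack(values, bits):
    """Pack signed integers into one 32-bit word, element 0 in the low bits."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << bits) - 1)) << (i * bits)
    return word


def unpack(word, bits):
    """Split a 32-bit word into 32 // bits signed subwords."""
    return [to_signed(word >> (i * bits), bits) for i in range(32 // bits)]


class MixedPrecisionController:
    """Toy model: picks which subwords of the lower-precision operand B
    are paired with the lanes of the higher-precision operand A."""

    def __init__(self, bits_a, bits_b):
        # In hardware these settings would come from CSRs (names hypothetical).
        assert bits_a >= bits_b and bits_a % bits_b == 0
        self.bits_a, self.bits_b = bits_a, bits_b
        self.groups = bits_a // bits_b   # SIMD ops served by one word of B
        self.index = 0                   # group of B used by the next op

    def dotp(self, word_a, word_b, acc=0):
        """Accumulate one SIMD dot product and advance the subword selector."""
        lanes = 32 // self.bits_a
        a = unpack(word_a, self.bits_a)
        b_all = unpack(word_b, self.bits_b)
        b = b_all[self.index * lanes:(self.index + 1) * lanes]
        self.index = (self.index + 1) % self.groups
        return acc + sum(x * y for x, y in zip(a, b))


activations = [1, -2, 3, -4, 5, -6, 7, -8]   # 8-bit, two 32-bit words
weights = [7, -8, 1, 0, -1, 2, -3, 4]        # 4-bit, one 32-bit word
ctrl = MixedPrecisionController(bits_a=8, bits_b=4)
weight_word = pack(weights, 4)
acc = ctrl.dotp(pack(activations[0:4], 8), weight_word)
acc = ctrl.dotp(pack(activations[4:8], 8), weight_word, acc)
```

In this model the same 4-bit weight word serves two consecutive SIMD operations, so no unpacking instructions are needed between them.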

# **Dynamic Bit-Scalable Execution**



**Standard instructions**: one instruction per format and type.

| Instruction types | Operand variants | SIMD formats |
|-------------------|------------------|--------------|
| `pv.dotsp`, `pv.dotup`, `pv.dotusp`, `pv.sdotsp`, `pv.sdotup`, `pv.sdotusp` | vector (no suffix), `.sc`, `.sci` | `.h`, `.b`, `.n`, `.c`, `.m4x2`, `.m8x2`, `.m8x4`, `.m16x8`, `.m16x4`, `.m16x2` |

Every combination of type, variant and format is a separate instruction (for example `pv.sdotsp.sc.m8x4`). Example sequence: `pv.packlo.b x15, x5, x6`, `pv.packhi.b x15, x7, x8`, `pv.sdotsp.b x20, x15, x10`.

**Virtual SIMD instructions**: one instruction per type.

| Instruction types | Virtual instructions |
|-------------------|----------------------|
| `pv.dotsp`, `pv.dotup`, `pv.dotusp`, `pv.sdotsp`, `pv.sdotup`, `pv.sdotusp` | `pv.<type>.v`, `pv.<type>.sc.v`, `pv.<type>.sci.v` |

Example sequence (operands partly lost in extraction): two post-increment loads, `…,4(x4!)` and `…,4(x5!)`, followed by one virtual dot product on `x20, x11, x10`.

The SIMD format is selected from software at run time, as in this `main()` fragment: `SIMD_FMT(M8x4); convolution(A, W, Res); SIMD_FMT(M8x2); convolution(A, W, Res);`
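The encoding-space argument can be checked by enumerating the mnemonics. The sketch below builds both instruction lists from the types, variants and formats listed above, and models a virtual instruction that takes its format from a status register. The `Core` class, its `simd_fmt` field and its reset value are hypothetical names introduced only for this illustration.

```python
from itertools import product

TYPES = ["dotsp", "dotup", "dotusp", "sdotsp", "sdotup", "sdotusp"]
VARIANTS = ["", ".sc", ".sci"]
FORMATS = ["h", "b", "n", "c", "m4x2", "m8x2", "m8x4", "m16x8", "m16x4", "m16x2"]

# One standard instruction per (type, variant, format) combination.
standard = [f"pv.{t}{v}.{f}" for t, v, f in product(TYPES, VARIANTS, FORMATS)]
# One virtual instruction per (type, variant); the format is not encoded.
virtual = [f"pv.{t}{v}.v" for t, v in product(TYPES, VARIANTS)]


class Core:
    """Toy decoder: a virtual instruction takes its format from a register."""

    def __init__(self):
        self.simd_fmt = "b"   # hypothetical reset value

    def set_simd_fmt(self, fmt):
        """Plays the role of SIMD_FMT(...) in the C fragment above."""
        assert fmt in FORMATS
        self.simd_fmt = fmt

    def resolve(self, mnemonic):
        """Return the standard instruction a virtual one stands for."""
        assert mnemonic in virtual
        return mnemonic[:-1] + self.simd_fmt


core = Core()
core.set_simd_fmt("m8x4")
first = core.resolve("pv.sdotsp.v")
core.set_simd_fmt("m8x2")
second = core.resolve("pv.sdotsp.v")
print(len(standard), "standard vs", len(virtual), "virtual dot-product instructions")
```

Under this reconstruction of the grid, the same virtual opcode covers ten formats, which is why the instruction count drops by a factor of ten.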

## Vector Lockstep Exec. Mode (VLEM)





# VLEM: Broadcast Unit










## Chip Results Summary





## Voltage vs. Frequency





- Measurements of the cluster;
- Maximum frequency: 205 MHz @ 1.2 V;
- ~45% energy saving with the cluster in VLEM w.r.t. MIMD mode.

## Performance on MatMul kernels





## Energy Efficiency on MatMul kernels










# Comparison with the SoA



|                                          | SleepRunner [6]                 | SamurAI [7]                               | Mr.Wolf [8]                   | Vega [9]                       | Dustin<br>(this work)                                                               |
|------------------------------------------|---------------------------------|-------------------------------------------|-------------------------------|--------------------------------|-------------------------------------------------------------------------------------|
| Technology                               | CMOS 28nm<br>FDSOI              | CMOS 28nm<br>FDSOI                        | CMOS 40nm<br>LP               | CMOS 22nm<br>FDSOI             | CMOS<br>65nm                                                                        |
| Die Area                                 | 0.68 mm <sup>2</sup>            | 4.5 mm <sup>2</sup>                       | 10 mm <sup>2</sup>            | 12 mm <sup>2</sup>             | 10 mm <sup>2</sup>                                                                  |
| Applications                             | IoT GP                          | IoT GP + DNN                              | IoT GP + DNN                  | IoT GP + NSA+DNN               | IoT GP + DNN + QNNs                                                                 |
| CPU/ISA                                  | CM0DS<br>Thumb-2 subset         | 1x RI5CY<br>RVC32IMFXpulp                 | 9 x RI5CY<br>RVC32IMFXpulp    | 10 x RI5CY<br>RVC32IMFXpulp+SF | 16 x MPIC CORES (RISC-V)                                                            |
| Int Precision<br>(bits)                  | 32                              | 8, 16, 32                                 | 8, 16, 32                     | 8, 16, 32                      | 2, 4, 8, 16, 32<br>(plus Mixed-Precision)                                           |
| Supply Voltage                           | 0.4 - 0.8 V                     | 0.45 - 0.9 V                              | 0.8 - 1.1 V                   | 0.5 - 0.8 V                    | 0.8 - 1.2 V                                                                         |
| Max Frequency                            | 80 MHz                          | 350 MHz                                   | 450 MHz                       | 450 MHz                        | 205 MHz                                                                             |
| Power<br>Envelope                        | 320 μW                          | 96 mW                                     | 153 mW                        | 49.4 mW                        | 156 mW                                                                              |
| <sup>1</sup> Best Integer<br>Performance | 31 MOPS (32b)                   | 1.5 GOPS (8b) <sup>2</sup>                | 12.1 GOPS (8b)                | 15.6 GOPS (8b)                 | 15 GOPS (8b)<br>30 GOPS (4b)<br>58 GOPS (2b)                                        |
| <sup>1</sup> Best Integer<br>Efficiency  | 97 MOPS/mW @<br>18.6 MOPS (32b) | 230 GOPS/W<br>@ 110 MOPS (8b) <sup>2</sup> | 190 GOPS/W<br>@ 3.8 GOPS (8b) | 614 GOPS/W<br>@ 7.6 GOPS | 303 GOPS/W @ 4.4 GOPS (8b)<br>570 GOPS/W @ 8.8 GOPS (4b)<br>1152 GOPS/W @ 17.3 GOPS (2b) |

Dustin supports mixed-precision computation in hardware.

Better efficiency than the solutions in 28 nm and 40 nm technology nodes.

Comparable efficiency w.r.t. Vega (22 nm).

<sup>1</sup> 2 OPs = 1 8-bit (or 4-bit or 2-bit) MAC on the MatMul benchmark, unless otherwise specified.

<sup>2</sup> For a fair comparison, we consider execution on the software-programmable cores.
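The efficiency entries pair a GOPS/W figure with the throughput at which it was obtained, so the power at each operating point follows by division. The sketch below does that arithmetic for the Dustin column; the resulting power values are derived from the table, not separately reported measurements, and the function names are illustrative.

```python
# (efficiency in GOPS/W, throughput in GOPS) from the "Best Integer Efficiency" row.
dustin_best_efficiency = {
    "8b": (303.0, 4.4),
    "4b": (570.0, 8.8),
    "2b": (1152.0, 17.3),
}


def implied_power_mw(gops_per_watt, gops):
    """Power = throughput / efficiency, converted from W to mW."""
    return 1000.0 * gops / gops_per_watt


def macs_per_second(gops):
    """Footnote 1: 2 OPs = 1 MAC."""
    return gops * 1e9 / 2


power = {
    fmt: round(implied_power_mw(eff, thr), 1)
    for fmt, (eff, thr) in dustin_best_efficiency.items()
}
for fmt, mw in power.items():
    print(f"{fmt}: about {mw} mW at the best-efficiency point")
```

The three operating points imply roughly the same power (about 15 mW), so the efficiency gain at lower precision comes from the higher throughput rather than from lower power.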






## Conclusion



- Dustin SoC: IoT end-node with AI computing capabilities in TSMC 65 nm technology;
- RISC-V cores featuring 2b-to-32b bit-precision instruction set architecture (ISA) extensions, enabling fine-grained tunable mixed-precision computation (**2x** to **7x** speed-up w.r.t. RI5CY);
- Software-reconfigurable cluster with Vector Lockstep Execution Mode (~45% energy saving w.r.t. MIMD mode);
- Dustin is competitive with IoT end-nodes built in much more scaled technology nodes (peak performance 58 GOPS, peak efficiency 1.15 TOPS/W);
- Despite a less scaled technology node, energy efficiency reaches the order of **TOPS/W**, comparable with ASIC solutions.