

**PULP PLATFORM** Open Source Hardware, the way it should be!

# **In-Sensor Machine Learning**

Heterogeneous computing in a mW

### Luca Benini <lbenini@iis.ee.ethz.ch,luca.Benini@unibo.it>



Horizon 2020 European Union funding for Research & Innovation



Fonds national suisse Schweizerischer Nationalfonds Fondo nazionale svizzero Swiss National Science Foundation





Prof. of Digital Circuit and Systems @ ETHZ and UNIBO. h-index=109, 53'000+ citations, 1'000+ publications, fellow IEEE, ACM, Chief Architect in STMicroelectronics (2009-2012) Group of 80+ people



European Research Council





Cloud $\rightarrow$ Edge $\rightarrow$ Near-Sensor AI a.k.a. TinyML

**Cloud Computing** 







E. Gousev, Qcomm research THE SILENT INTELLIGENCE

#### #1 Customer Question on Amazon.com (out of 1,000+):

1. I don't want any of my (private, personal) videos on any servers not in my control. Is this possible?

#### #2 Customer Question on Amazon.com (out of 1,000+):

2. How long does the battery charge last?

CAGR: 27.3%

**Near-Sensor AI challenge** AI capabilities in the power envelope of an MCU: 100mW peak (10mW avg)

# AI Workloads - DNNs

H Pham 2021(Google) arXiv:2003.10580v3



# Energy efficiency @ GOPS is THE Challenge



ETH Zürich

# **RI5CY – An Open MCU-class RISC-V Core for EE-AI**

3-cycle ALU-OP, 4-cycle MEM-OP → IPC loss: LD-use, Branch





V3

40 kGE

Lightweight fixed point


#### XPULP extensions: 25 kGE $\rightarrow$ 40 kGE (1.6x)

### **PULP-NN: Xpulp ISA exploitation**



P↑ T↓↓↓ so, E=P\*T↓↓ Nice! But what about the GOPS? Faster+Superscalar is not efficient!

- M7: 5.01 CoreMark/MHz, 58.5 μW/MHz
- M4: 3.42 CoreMark/MHz, 12.26 μW/MHz
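The two figures quoted for each core can be combined into a single efficiency number. The short sketch below only does that arithmetic on the values from the slide; the function and variable names are illustrative, and nothing here is a new measurement.

```python
# Figures quoted on the slide: (CoreMark/MHz, microwatt/MHz).
cores = {"M7": (5.01, 58.5), "M4": (3.42, 12.26)}

def coremark_per_uw(perf_per_mhz, power_per_mhz):
    """MHz cancels out: (CoreMark/MHz) / (uW/MHz) = CoreMark per uW."""
    return perf_per_mhz / power_per_mhz

efficiency = {name: coremark_per_uw(*vals) for name, vals in cores.items()}

# How much faster the superscalar core is per MHz, and what it costs.
speedup = cores["M7"][0] / cores["M4"][0]
power_ratio = cores["M7"][1] / cores["M4"][1]

for name, value in efficiency.items():
    print(f"{name}: {value:.3f} CoreMark/uW")
print(f"M7 vs M4: {speedup:.2f}x performance for {power_ratio:.2f}x power")
```

The M7 gains less than 1.5x in CoreMark/MHz while drawing close to 5x the power per MHz, which is the point behind "Faster+Superscalar is not efficient".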


# ML & Parallel, Near-threshold: a Marriage Made in Heaven

- As VDD decreases, operating speed decreases
- However efficiency increases → more work done per Joule
- Until leakage effects start to dominate
- Put more units in parallel to get performance up and keep them busy with a parallel workload
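The trade-off in the list above can be illustrated with a toy model. This is a sketch under stated assumptions: the alpha-power law for delay, a normalized switched capacitance, and the constants `VT`, `ALPHA` and `K_LEAK` are all illustrative choices, not data from any PULP chip.

```python
import math

# Toy near-threshold model. All constants are illustrative assumptions.
VT = 0.3        # threshold voltage [V] (assumed)
ALPHA = 1.5     # alpha-power-law exponent (assumed)
K_LEAK = 0.05   # leakage weight relative to switched capacitance (assumed)

def frequency(vdd):
    """Normalized clock frequency from the alpha-power law."""
    return (vdd - VT) ** ALPHA / vdd

def energy_per_op(vdd):
    """Normalized energy per operation: dynamic (C*V^2) plus leakage."""
    dynamic = vdd ** 2
    leakage = K_LEAK * vdd / frequency(vdd)  # leakage power * time per op
    return dynamic + leakage

# Sweep VDD from 0.35 V to 1.00 V in 10 mV steps.
voltages = [round(0.35 + 0.01 * i, 2) for i in range(66)]
v_opt = min(voltages, key=energy_per_op)

# Units needed at v_opt to match the throughput of one unit at 1.0 V,
# assuming a perfectly parallel workload.
units = math.ceil(frequency(1.0) / frequency(v_opt))

print(f"minimum-energy VDD ~ {v_opt:.2f} V, parallel units needed: {units}")
```

In this model energy per operation falls as VDD drops, reaches a minimum, and rises again once the slow clock lets leakage dominate; the lost speed is then recovered by running several units in parallel.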

ML is massively parallel and scales well (P/S ↑ with NN size)




# **Multiple RI5CY Cores (1-16)**

[Figure: **CLUSTER** containing multiple RISC-V cores]

# **Low-Latency Shared TCDM**






### DMA for data transfers from/to L2




## Shared instruction cache with private "loop buffer"






# Results: RV32IMCXpulp vs RV32IMC

- 8-bit convolution
  - Open source DNN library
- **10x** through xPULP
  - Extensions bring real speedup
- Near-linear speedup
  - Scales well for regular workloads.
- 75x overall gain
  - Sub-byte: 2-4x better
  - Mixed precision supported
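Much of the gain from ISA extensions on 8-bit convolutions comes from packed-SIMD dot products, where one instruction performs four int8 multiply-accumulates on a 32-bit word. The sketch below emulates that idea in plain Python. The helper names (`pack4_int8`, `sdot4`, `dot_int8`) are hypothetical and this is not PULP-NN code or the exact semantics of any Xpulp instruction.

```python
def pack4_int8(vals):
    """Pack four signed 8-bit values into one 32-bit word (lane 0 = LSB)."""
    assert len(vals) == 4 and all(-128 <= v <= 127 for v in vals)
    word = 0
    for lane, v in enumerate(vals):
        word |= (v & 0xFF) << (8 * lane)
    return word

def lane_int8(word, lane):
    """Extract one lane and sign-extend it back to a Python int."""
    byte = (word >> (8 * lane)) & 0xFF
    return byte - 256 if byte >= 128 else byte

def sdot4(acc, a_word, b_word):
    """Accumulate the four lane-wise int8 products (one emulated instruction)."""
    return acc + sum(lane_int8(a_word, l) * lane_int8(b_word, l) for l in range(4))

def dot_int8(a, b):
    """Dot product of two int8 vectors whose length is a multiple of 4.

    Returns the result and the number of emulated SIMD steps.
    """
    acc, steps = 0, 0
    for i in range(0, len(a), 4):
        acc = sdot4(acc, pack4_int8(a[i:i + 4]), pack4_int8(b[i:i + 4]))
        steps += 1
    return acc, steps

a = [1, -2, 3, -4, 5, -6, 7, -8]
b = [8, 7, -6, 5, -4, 3, 2, -1]
result, steps = dot_int8(a, b)
print(result, steps)
```

Eight multiply-accumulates complete in two emulated steps instead of eight scalar ones; in hardware, hardware loops and post-increment loads would remove further overhead around this inner loop.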




### An additional I/O controller is used for IO




# Successful product development: GWT's GAP8



#### The evolution of the PULP species

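| Year | Chip          | Technology  | Efficiency   | Notes                      |
|------|---------------|-------------|--------------|----------------------------|
| 2017 | GAP-8         | 55nm (TSMC) | 50 MOPS/mW   | 20pJ/OP @32bit, 3.5GOPS    |
| 2018 | Wolf(8)       | 40nm (TSMC) | 120 MOPS/mW  | 8pJ/OP @32bit, +FP, 7GOPS  |
| 2019 | Vega(8)       | 22FDX       | 500 MOPS/mW  | 2pJ/OP @32bit, +FP, 10GOPS |
| 2020 | Marsellus(16) | 22FDX       | 500+ MOPS/mW | pre-tapeout, **30GOPS**    |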

**2x GOPS/W** Y/Y
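The MOPS/mW and pJ/OP figures quoted for each chip are two views of the same quantity. The sketch below does the unit conversion and, as an additional assumption, estimates the power implied if peak throughput and peak efficiency were reached at the same operating point, which the slide does not state.

```python
def pj_per_op(mops_per_mw):
    """1 MOPS/mW = 1e6 op/s / 1e-3 W = 1e9 op/J, i.e. 1000 pJ per op."""
    return 1000.0 / mops_per_mw

def active_power_mw(gops, mops_per_mw):
    """Convert GOPS to MOPS, then divide by MOPS per mW."""
    return gops * 1000.0 / mops_per_mw

# (efficiency in MOPS/mW, throughput in GOPS), as quoted on the slide.
chips = {"GAP-8": (50, 3.5), "Wolf": (120, 7), "Vega": (500, 10)}

for name, (eff, gops) in chips.items():
    print(f"{name}: {pj_per_op(eff):.1f} pJ/OP, "
          f"~{active_power_mw(gops, eff):.0f} mW at {gops} GOPS")
```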


1. specification & dataset selection
2. training
3. quantization & pruning
4. graph optimization
5. memory-aware deployment
6. optimized DNN primitives
7. optimized HW & architecture


- QuantLab: Quantization Laboratory, Automatic Mixed Prec
- **NEMO**: **NE**ural **M**inimization for pyt**O**rch, https://github.com/pulp-platform/nemo
- DORY: **D**eployment **O**riented to memo**RY**, https://github.com/pulp-platform/dory
- PULP-NN: **PULP Neural Network backend**, https://github.com/pulp-platform/pulp-nn

Deploying DNNs on PULP (ONNX, PyTorch)

# What's next? Architecture: Sub-pJ/OP Accelerators



# Sub-pJ/OP Accelerator: Tightly-coupled HW Compute Engine




# Hardware Processing Engines (HWPEs)



#### **HWPE efficiency**


1. Specialized datapath (e.g. systolic MAC) & internal storage (e.g. linebuffer, accum-regs)
2. Dedicated control (no I-fetch) with shadow registers (overlapped config-exec)
3. Specialized high-BW interconnect into L1 (on data-plane)

### More HWPE Efficiency: Extreme Quantization

| Model         | Bit-width   | Top-1 error | Notes (SoA INQ retraining)                     |
|---------------|-------------|-------------|------------------------------------------------|
| ResNet-18 ref | 32          | 31.73%      |                                                |
| INQ           | 5           | 31.02%      |                                                |
| INQ           | 4           | 31.11%      |                                                |
| INQ           | 3           | 31.92%      |                                                |
| INQ           | 2 (ternary) | 33.98%      | 2.2% loss $\rightarrow$ 0% with 20% larger net |

#### Low(er) precision: $8 \rightarrow 4 \rightarrow 2$
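To make the effect of shrinking the bit-width concrete, here is a minimal sketch of a symmetric uniform quantizer. It is a generic textbook scheme chosen for illustration, not INQ and not the NEMO or QuantLab flow; the weight values and function names are made up.

```python
def quantize(values, n_bits):
    """Symmetric uniform quantization to signed n_bits codes.

    Returns the integer codes and the scale that maps codes back to reals.
    """
    qmax = 2 ** (n_bits - 1) - 1          # 127 for 8 bits, 1 for 2 bits
    scale = max(abs(v) for v in values) / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.0, -0.12, 0.64]  # illustrative

for n_bits in (8, 4, 2):
    codes, scale = quantize(weights, n_bits)
    approx = dequantize(codes, scale)
    max_err = max(abs(w - a) for w, a in zip(weights, approx))
    print(f"{n_bits}-bit codes: {codes}, max error {max_err:.3f}")
```

With this scheme the 2-bit case uses only the codes -1, 0 and +1, which is why 2-bit weights are commonly called ternary.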



1 MAC Op = 2 Op (1 Op for the "sign-reverse", 1 Op for the add).

# From +/-1 Binarization to XNORs

$$\mathbf{y}(k_{out}) = \mathrm{binarize}_{\pm 1}\left( \mathbf{b}_{k_{out}} + \sum_{k_{in}} \left( \mathbf{W}(k_{out}, k_{in}) \otimes \mathbf{x}(k_{in}) \right) \right)$$

$$\mathrm{binarize}_{\pm 1}(t) = \mathrm{sign}\left( \gamma \frac{t - \mu}{\sigma} + \beta \right)$$

Writing $\gamma \frac{t - \mu}{\sigma} + \beta = \lambda t + \kappa$ with $\lambda = \gamma/\sigma$ and $\kappa = \beta - \gamma\mu/\sigma$, normalization and sign collapse into a single threshold comparison:

$$\mathrm{binarize}_{0,1}(t) = \begin{cases} 1 \text{ if } t \ge -\kappa/\lambda \doteq \tau, \text{ else } 0 & (\text{when } \lambda > 0) \\ 1 \text{ if } t \le -\kappa/\lambda \doteq \tau, \text{ else } 0 & (\text{when } \lambda < 0) \end{cases}$$

**Binary product $\rightarrow$ XNOR**

| $A$  | $B$  | $A \cdot B$ ($\pm 1$) | XNOR (with $-1 \mapsto 0$, $+1 \mapsto 1$) |
|------|------|-----------------------|--------------------------------------------|
| $-1$ | $-1$ | $+1$                  | 1                                          |
| $-1$ | $+1$ | $-1$                  | 0                                          |
| $+1$ | $-1$ | $-1$                  | 0                                          |
| $+1$ | $+1$ | $+1$                  | 1                                          |
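The mapping from a $\pm 1$ product to an XNOR gate can be checked numerically. The sketch below assumes the usual encoding ($-1$ as bit 0, $+1$ as bit 1) and uses the standard identity dot = 2 * popcount(XNOR) - N; function names are illustrative and this is not the XNE datapath.

```python
import random

def encode_pm1(values):
    """Pack a list of +1/-1 values into an int: -1 -> bit 0, +1 -> bit 1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two n-element +/-1 vectors stored as bit masks."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask       # 1 where the signs agree
    matches = bin(xnor).count("1")         # popcount
    return 2 * matches - n                 # matches - mismatches

rng = random.Random(42)
n = 64
a = [rng.choice((-1, 1)) for _ in range(n)]
b = [rng.choice((-1, 1)) for _ in range(n)]

reference = sum(x * y for x, y in zip(a, b))
fast = xnor_popcount_dot(encode_pm1(a), encode_pm1(b), n)
print(reference, fast)
```

A whole vector of multiply-accumulates thus becomes one wide XNOR followed by a popcount, which is what makes binary networks so cheap in hardware.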

### **XNE: XNOR Neural Engine**





BINCONV: Binary dot-product and thresholding logic array

### **XNE Energy Efficiency**



L1 SCM, L2 high-density, low-leakage SRAM (activations), MRAM (weights)




But... accuracy loss is high even with retraining (10%+). Need flexible precision tuning!



- Many $M \times N$-bit products...
- ... but one $M \times N$-bit product is the superposition of $M \times N$ 1-bit products!

$$\mathbf{y}(k_{out}) = \mathrm{quant}\left(\sum_{i=0}^{N-1}\sum_{j=0}^{M-1}\sum_{k_{in}} 2^{i}2^{j}\left(\mathbf{W}_{bin}(k_{out},k_{in}) \otimes \mathbf{x}_{bin}(k_{in})\right)\right)$$
  
Q-bit output fmaps  
1-bit weights

One quantized NN can be emulated by superposition of power-of-2 weighted  $M \times N$  binary NN
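The superposition idea can be verified with a few lines of code. This sketch assumes unsigned operands (so each 1-bit product is an AND), assigns $N$ bits to the weights and $M$ bits to the activations as an illustrative choice, and leaves out the final quant() step; it is not the Reconfigurable Binary Engine's implementation.

```python
def bit_plane(values, bit):
    """Extract one bit position from every element (a binary vector)."""
    return [(v >> bit) & 1 for v in values]

def binary_dot(w_plane, x_plane):
    """1-bit dot product: AND per element, then a popcount-style sum."""
    return sum(w & x for w, x in zip(w_plane, x_plane))

def bitserial_dot(w, x, n_bits_w, m_bits_x):
    """Dot product built from n_bits_w * m_bits_x binary dot products,
    each weighted by the power of two of its bit positions."""
    acc = 0
    for i in range(n_bits_w):
        for j in range(m_bits_x):
            acc += (1 << (i + j)) * binary_dot(bit_plane(w, i), bit_plane(x, j))
    return acc

w = [3, 1, 2, 0]     # 2-bit unsigned weights (illustrative)
x = [5, 7, 2, 9]     # 4-bit unsigned activations (illustrative)

direct = sum(wi * xi for wi, xi in zip(w, x))
serial = bitserial_dot(w, x, n_bits_w=2, m_bits_x=4)
print(direct, serial)
```

Because the loop bounds are just parameters, the same binary engine can serve any weight and activation precision, trading cycles for accuracy.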

# **Reconfigurable Binary Engine**




# Towards In-Sensor: Achieving sub-mW average power?

1mW average power with 10mW active power (10GOPS @ 1pJ/OP) → sub mW sleep
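The power budget above follows from simple duty-cycle arithmetic. The sketch below works it through; the 0.5 mW sleep power in the second case is an assumed value for illustration only.

```python
def active_power_mw(gops, pj_per_op):
    """GOPS * pJ/OP = 1e9 op/s * 1e-12 J/op = 1e-3 W, i.e. the result is in mW."""
    return gops * pj_per_op

def max_duty_cycle(p_avg_mw, p_active_mw, p_sleep_mw):
    """Largest active fraction d with d*P_active + (1-d)*P_sleep <= P_avg."""
    return (p_avg_mw - p_sleep_mw) / (p_active_mw - p_sleep_mw)

p_active = active_power_mw(gops=10, pj_per_op=1)   # 10 mW
d_ideal = max_duty_cycle(1.0, p_active, 0.0)       # zero-power sleep
d_half = max_duty_cycle(1.0, p_active, 0.5)        # 0.5 mW sleep (assumed)

print(f"active: {p_active} mW, duty cycle: {d_ideal:.1%} (ideal), "
      f"{d_half:.1%} with 0.5 mW sleep")
```

Even with an ideal sleep state the system can be active only a tenth of the time, and every fraction of a mW spent sleeping cuts that further, hence the requirement for sub-mW sleep.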



Duty cycling is not acceptable when input events are asynchronous → watchful sleep

Power levels (log scale):

- Stream → 100mW
- Detect & Compress → 1-10mW
- Watchful sleep $\rightarrow$ <1mW

### Need µW-range always-on Intelligence












# Not Only CNNs: Hyper-Dimensional Computing



Highly parallel, fault-tolerant binary operators, assoc-min-distance search

Merge storage & computation i.e. **In-memory computing** 
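A minimal sketch of the associative min-distance search at the heart of hyperdimensional classification: each class is stored as a binary hypervector, and a query is assigned to the class at the smallest Hamming distance. The dimension, class labels and noise level are assumptions chosen for illustration, and the encoding of sensor data into hypervectors is omitted.

```python
import random

D = 1024  # hypervector dimension in bits (assumed)

def random_hv(rng):
    """Random binary hypervector stored as a Python int."""
    return rng.getrandbits(D)

def hamming(a, b):
    """Number of differing bits: XOR followed by a popcount."""
    return bin(a ^ b).count("1")

def flip_bits(hv, n, rng):
    """Flip n distinct bit positions to model noise or faults."""
    for pos in rng.sample(range(D), n):
        hv ^= 1 << pos
    return hv

def classify(query, prototypes):
    """Associative search: one distance computation per stored class."""
    return min(prototypes, key=lambda label: hamming(query, prototypes[label]))

rng = random.Random(0)
prototypes = {label: random_hv(rng) for label in ("rest", "fist", "open")}

# Corrupt 10% of the bits of one prototype and classify the result.
noisy = flip_bits(prototypes["fist"], D // 10, rng)
predicted = classify(noisy, prototypes)
print(predicted, hamming(noisy, prototypes["fist"]))
```

Unrelated random hypervectors differ in roughly half of their bits, so a query with 10% of its bits flipped is still far closer to its own class than to any other; this is where the fault tolerance comes from.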

# **In-memory Hyperdimensional Computing**





 $N_{CLASS}$  cycles

05.02.21



**Design (post P&R)**

| Parameter      | Value    | Note                                                |
|----------------|----------|-----------------------------------------------------|
| Technology     | GF22 UHT | Implemented with lowest leakage cell library (UHVT) |
| Area           | 670 kGE  |                                                     |
| Max. frequency | 3 MHz    |                                                     |

| $f_{clk}$                     | 32 kHz          | 200 kHz        |
|-------------------------------|-----------------|----------------|
| Max. sampling rate            | 150 SPS/channel | 1 kSPS/channel |
| $P_{SWU,dynamic}$             | 0.99 µW         | 6.21 µW        |
| $P_{SWU,leakage}$             | 0.7 µW          | 0.7 µW         |
| $P_{SPI,dynamic}$             | 1.28 µW         | 8.00 µW        |
| $P_{SWU,total}$ (measured)    | 2.97 µW         | 14.9 µW        |

To be open sourced in a few days!!


# When you count mWatts, everything matters!

What about IO power? (Mem, Sensor)

- SPIs
  - I/O VDD=1.8V
  - fspi-max=50MHz
  - Assuming duty-cycled operation @ various bandwidths
- ULP serial link (duty-cycled)
  - 10.2x less energy and 15.7x higher maximum BW compared to single SPI
  - 2.56x higher efficiency than the DDR Octal SPI @787Mbps
  - 5 → 3pJ/bit
  - However, it is still 2mW @ 500Mbps
- 3D integration: 0.15pJ/bit and below
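The link power figures in the list above follow from multiplying energy per bit by bandwidth. The sketch below does that conversion for the quoted energies at a sustained 500 Mbps; treating the link as continuously active is an assumption made for simplicity.

```python
def link_power_mw(pj_per_bit, mbps):
    """pJ/bit * Mbit/s = 1e-12 J * 1e6 /s = 1e-6 W, so scale by 1e-3 for mW."""
    return pj_per_bit * mbps * 1e-3

bandwidth_mbps = 500
for label, energy in (("ULP serial link, 5 pJ/bit", 5.0),
                      ("ULP serial link, 3 pJ/bit", 3.0),
                      ("3D integration, 0.15 pJ/bit", 0.15)):
    print(f"{label}: {link_power_mw(energy, bandwidth_mbps):.3f} mW "
          f"@ {bandwidth_mbps} Mbps")
```

At 3 to 5 pJ/bit the link alone costs on the order of the 2 mW quoted above, more than an entire 1 mW platform budget, while 0.15 pJ/bit brings the same traffic down to tens of µW.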





# Closing thoughts – Open Platform for near-sensor AI

#### **Open Platform**

- For science ... fundamental "research infrastructure"
  - Reduces "getting up to speed" overhead for partners
  - Enables fair and well-controlled benchmarking
- For business ... it is truly disruptive
  - Reduces the NRE, faster innovation path for startups
  - New business models (for-profit and not-for-profit)
  - Helps exchange of information across NDA walls
  - Great for marketing & training
  - More secure, safe, auditable HW
  - Exemplary collaboration with GF (Quentin, Arnold, Vega...)



#### Posh Open Source Hardware (<u>POSH</u>):

An open source System on Chip (SoC) design and verification eco-system that enables cost effective design of ultra-complex SoCs

#### Heterogeneous & Flexible

1-2 orders of magnitude improvement by acceleration

Various flavors: number-crunching, always-on, reconfigurable

 2 orders of magnitude improvement on IO energy (memory, sensor) needed to achieve pJ/OP @ full platform
 3D-IC technology is a key enabler



# Parallel Ultra Low Power

Luca Benini, Davide Rossi, Andrea Borghesi, Michele Magno, Simone Benatti, Francesco Conti, Francesco Beneventi, Daniele Palossi, Giuseppe Tagliavini, Antonio Pullini, Germain Haugou, Lukas Cavigelli, Manuele Rusci, Florian Glaser, Renzo Andri, Fabio Montagna, Bjoern Forsberg, Pasquale Davide Schiavone, Alfio Di Mauro, Victor Javier Kartsch Morinigo, Tommaso Polonelli, Fabian Schuiki, Stefan Mach, Andreas Kurth, Florian Zaruba, Manuel Eggimann, Philipp Mayer, Marco Guermandi, Xiaying Wang, Michael Hersche, Robert Balas, Antonio Mastrandrea, Matheus Cavalcante, Angelo Garofalo, Alessio Burrello, Gianna Paulin, Georg Rutishauser, Andrea Cossettini, Luca Bertaccini, Maxim Mattheeuws, Samuel Riedel, Sergei Vostrikov, Vlad Niculescu, Frank K. Gurkaynak, and many more that we forgot to mention



http://pulp-platform.org



@pulp\_platform