

**PULP PLATFORM** Open Source Hardware, the way it should be!

#### **RISC-V** for IoT, the **PULP** experience



Frank K. Gürkaynak <kgf@ee.ethz.ch>

Digital Circuits and Systems Group ETH Zürich









### How did we start in 2013?

- We wanted to design energy efficient computing systems
  - Equally efficient for IoT and HPC over a wide range

#### Key points


- Parallel processing
- Near threshold computing
- Efficient switching between operating modes
- Making best use of technology
- Heterogeneous acceleration
- Parallel Ultra Low-Power (PULP) platform was born







### IoT design @ Digital Circuits & Systems Group



2015 VivoSoC SMIC 130



### Who is behind PULP?




#### Prof. Luca Benini

In total about 60 people work on projects related to PULP in Zurich and Bologna https://pulp-platform.org/team.html

- Chief Architect at STMicroelectronics (2009-2012)




2016 Patronus UMC65



### Too much to do and not enough resources

Signal, data, knowledge








#### **Cluster based PULP systems**



2018 Atomario UMC65




### ST28 FDSOI and GF22FDX designs with BB

- SoC partitioned in separate clock, power and body bias regions
- Cluster 1 Vdd, 10 BB regions
  - **Boost mode**: active + FBB
  - Normal mode: active + NO BB
  - Idle mode: clock gated + NO BB (in LVT) or RBB (in RVT)
- SoC has 3 Vdd regions
  - Cluster, L2, Always on, IOs
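
The three cluster modes listed above can be read as a small lookup from mode to clock and body-bias settings. The following is only a sketch of that mapping: the type and function names (`cluster_mode_t`, `region_config`, and so on) are hypothetical and do not come from any PULP header or driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Body bias settings named on the slide: none, forward (FBB), reverse (RBB). */
typedef enum { BB_NONE, BB_FORWARD, BB_REVERSE } body_bias_t;

/* Transistor flavours mentioned for idle mode. */
typedef enum { VT_LVT, VT_RVT } vt_flavor_t;

/* Operating modes of the cluster. */
typedef enum { MODE_BOOST, MODE_NORMAL, MODE_IDLE } cluster_mode_t;

typedef struct {
    bool clock_gated;      /* true when the region's clock is gated */
    body_bias_t body_bias; /* bias applied to the region */
} region_config_t;

/* Hypothetical helper: setting of one body-bias region in a given mode. */
region_config_t region_config(cluster_mode_t mode, vt_flavor_t vt) {
    region_config_t cfg = { false, BB_NONE };
    switch (mode) {
    case MODE_BOOST:   /* active + FBB */
        cfg.body_bias = BB_FORWARD;
        break;
    case MODE_NORMAL:  /* active + no BB */
        break;
    case MODE_IDLE:    /* clock gated; RBB only for RVT regions */
        cfg.clock_gated = true;
        cfg.body_bias = (vt == VT_RVT) ? BB_REVERSE : BB_NONE;
        break;
    default:
        assert(0 && "unknown mode");
    }
    return cfg;
}
```

Keeping the mapping in one function makes the asymmetry of idle mode explicit: it is the only mode where the result depends on the transistor flavour.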

D. Rossi et al., «A 60 GOPS/W, -1.8V to 0.9V body bias ULP cluster in 28nm UTBB FD-SOI technology», Solid-State Electronics, 2016.



2014 Diana UMC65

#### Scaling proportional to computing demand

1 to 50 µW


**Duty Cycling** 

**Coarse Grain** Classification



**Full Blown Analysis** 




- Ultra fast switching time from one mode to another
- Ultra fast voltage and frequency change time

#### MCU sleep mode

- Low quiescent LDO ✓
- Real Time Clock 32kHz only ✓
- L2 Memory partially retentive

#### **MCU** active mode

- Embedded DC/DC, high current
- Voltage can dynamically change (0.5 to …)
- One clock gen active, frequency can dynamically change
  - Systematic clock gating

#### MCU + Parallel processor active mode

- Embedded DC/DC, high current
- Voltage can dynamically change
- Two clock gen active, frequencies can dynamically change
  - Systematic clock gating

Highly optimized system level power consumption

2017 GAP8 TSMC55


### How PULP and RISC-V come together

- Initially we did not want to design our own processors
  - Wanted to use available processors (ARC, ARM, ...)
  - It proved difficult to design systems that we could share with our collaborators
- Then we used OpenRISC cores (2013-2015)
  - We had to completely redesign and optimize these cores
- We moved to RISC-V starting in 2015
  - Adapted the decoder of our optimized OpenRISC core
  - Made use of a growing SW development environment
  - ETH is one of the founding members of the RISC-V Foundation





2013 Sir10us UMC180

### Our research is not implementing RISC-V cores

- We develop efficient programmable architectures
  - Processor cores of various capabilities are required for that
- We need efficient implementations of cores for our research
  - To produce relevant results, our cores have to perform as well as other solutions
  - We ended up spending quite an effort to make sure they perform really well
  - Processor core alone is not enough
    - You need peripherals, interconnect solutions, programming support...
- PULP Platform provides us a playground for our research
  - And we share it as open source

2019 PLINK UMC6




#### **RISC-V** cores developed by PULP team

- 32 bit
  - **Ibex** (formerly Zero-riscy): low cost core, comparable to Cortex-M0+
  - **RI5CY** (RV32-ICMXF): core with DSP support, comparable to Cortex-M4F
  - **Snitch** (RV32-IMAFD): streaming core, novel
- 64 bit
  - Ariane (RV64-ICMAF): Linux capable core, comparable to Cortex-A55

2018 Kosmodrom GF22FDX



<pre>for (i = 0; i < 100; i++)
    d[i] = a[i] + b[i];</pre>

#### Baseline

<pre>
mv x5, 0
mv x4, 100
Lstart:
    lb x2, 0(x10)
    lb x3, 0(x11)
    addi x10,x10, 1
    addi x11,x11, 1
    add x2, x3, x2
    sb x2, 0(x12)
    addi x4, x4, -1
    addi x12,x12, 1
    bne x4, x5, Lstart
</pre>

**11 cycles/output**

#### Auto-incr load/store

<pre>
mv x5, 0
mv x4, 100
Lstart:
    lb x2, 0(x10!)
    lb x3, 0(x11!)
    addi x4, x4, -1
    add x2, x3, x2
    sb x2, 0(x12!)
    bne x4, x5, Lstart
</pre>

8 cycles/output

#### HW Loop

<pre>
lp.setupi 100, Lend
    lb x2, 0(x10!)
    lb x3, 0(x11!)
    add x2, x3, x2
Lend: sb x2, 0(x12!)
</pre>

5 cycles/output

#### Packed-SIMD

<pre>
lp.setupi 25, Lend
    lw x2, 0(x10!)
    lw x3, 0(x11!)
    pv.add.b x2, x3, x2
Lend: sw x2, 0(x12!)
</pre>

1.25 cycles/output

2015 Fulmine UMC65
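
The packed-SIMD variant above processes four 8-bit elements per 32-bit word, which is why the loop count drops from 100 to 25. As a rough illustration of that idea, the sketch below emulates a lane-wise byte add in portable C. It is not the xPULP instruction and not PULP SDK code; the function names `add4_u8` and `vec_add_packed` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Adds four 8-bit lanes packed in 32-bit words; each lane wraps on overflow. */
static uint32_t add4_u8(uint32_t a, uint32_t b) {
    /* Add the low 7 bits of every lane, so no carry can cross into the next lane. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Fold in the top bit of each lane without propagating its carry. */
    return low7 ^ ((a ^ b) & 0x80808080u);
}

/* d[i] = a[i] + b[i] for n bytes, four elements per iteration.
   n must be a multiple of 4 (100 elements give 25 iterations). */
void vec_add_packed(uint8_t *d, const uint8_t *a, const uint8_t *b, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint32_t wa, wb, wd;
        memcpy(&wa, a + i, sizeof wa); /* one word load instead of four byte loads */
        memcpy(&wb, b + i, sizeof wb);
        wd = add4_u8(wa, wb);
        memcpy(d + i, &wd, sizeof wd); /* one word store */
    }
}
```

On a core with a packed add instruction, the masking in `add4_u8` collapses into a single operation; the loop structure (one word load per operand, one add, one word store) stays the same.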




### The extensions translate to real speed-ups

- 8-bit convolution
  - Open source DNN library

#### 10x through xPULP

- Extensions bring real speedup
- Near-linear speedup
  - Scales well for regular workloads.
- 75x overall gain
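
The inner loop of an 8-bit convolution is a multiply-accumulate over int8 values. The sketch below shows that kernel in plain C, grouped four elements at a time, which is the granularity a packed 8-bit dot-product instruction would cover in a 32-bit register. Which instructions actually produce the speedups quoted above is not stated on the slide, so treat the grouping as an assumption; `dot4_i8` and `dot_i8` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply-accumulate of four signed 8-bit pairs into a 32-bit sum.
   A packed-SIMD dot-product instruction could do this in one step. */
static int32_t dot4_i8(const int8_t *x, const int8_t *w, int32_t acc) {
    for (int k = 0; k < 4; k++)
        acc += (int32_t)x[k] * (int32_t)w[k];
    return acc;
}

/* Dot product of two int8 vectors of length n; n must be a multiple of 4. */
int32_t dot_i8(const int8_t *x, const int8_t *w, size_t n) {
    assert(n % 4 == 0);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 4)
        acc = dot4_i8(x + i, w + i, acc);
    return acc;
}
```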





#### **RISC-V ISA Extensions for extreme quantization**



- Overheads (28nm FDX PULPissimo impl.):
  - Area: ~11% (vs. RI5CY)
  - Timing overhead: negligible (integrated in PULPissimo)
  - 8-bit MatMul power overhead: 1.8% (integrated in PULPissimo)
  - GP-app power overhead: 3.5% (integrated in PULPissimo)

2014 Selene UMC65




### PULPino: Our first single core platform

- Simple design
  - Meant as a quick release

#### Separate data and inst. mem

- Makes it easy in HW
- Not meant as a Harvard arch.
- Can use all our 32bit cores
  - RI5CY, Zero/Micro-Riscy (Ibex)
- Peripherals from other projects
  - Any AXI and APB peripherals could be used



### What kind of acceleration?

- Standard peripheral talking over AXI/APB (standard)
- Instruction set extensions (already discussed)
- Shared functional units
  - Amortizes expensive extensions (FPU/DIV) between multiple units
- Shared memory accelerators
  - Our bread and butter, PULPopen, NTX
- Cluster as an accelerator
  - HERO, BigPULP, etc.


### **PULPissimo: The better single core platform**

- Shared memory
  - Unified Data/Instruction Memory
- Support for Accelerators
  - Direct shared memory access
  - Programmed through APB bus
  - uDMA for I/O subsystem
    - Can copy data directly from I/O to memory without involving the core
- Used as controller in larger systems





2019 Xavier UMC65




### Mr. Wolf (TSMC 40): 8+1 core IoT Processor

- One cluster with
  - 8 RISC-V cores
  - 2x shared FPU units
  - 64 kByte of TCDM

- One controller with
  - 512 kByte L2 RAM
  - Peripherals
- On chip voltage regulators
  - By Dolphin Integration



2017 Mr. Wolf TSMC40

### **PULP uses a permissive open source license**

#### All our development is on GitHub

HDL source code, testbenches, software development kit, virtual platform

https://github.com/pulp-platform

#### PULP is released under the permissive Solderpad license

Allows anyone to use, change, and make products without restrictions.

Pinned repositories on the GitHub organization page:

- pulpissimo: This is the top-level project for the PULPissimo platform. It instantiates a PULPissimo open-source system with a PULP SoC domain, but no cluster.
- ariane: Ariane is a 6-stage RISC-V CPU capable of booting Linux.
- pulp: This is the top-level project for the PULP Platform. It instantiates a PULP open-source system with a PULP SoC (microcontroller) domain accelerated by a PULP cluster with 8 cores.
- riscv: RI5CY is an in-order 4-stage RISC-V RV32IMFCXpulp CPU.
- bigpulp: RISC-V manycore accelerator for HERO, bigPULP hardware platform.
- pulp-dronet: A deep learning-powered visual navigation engine that enables autonomous navigation of pocket-size quadrotors, running on PULP.



2016 VivoSoC v2 SMIC 130



#### **Open source collaboration scheme explained**










#### PULP cluster+MCU+HWCE: GWT's GAP8

Two independent clock and voltage domains, from 0-133MHz/1V up to 0-250MHz/1.2V



| What                | Freq [MHz] | Exec time [ms] | Cycles     | Power [mW] |
|---------------------|------------|----------------|------------|------------|
| 40nm Dual Issue MCU | 216        | 99.1           | 21 400 000 | 60         |
| GAP8 @1.0V          | 15.4       | 99.1           | 1 500 000  | 3.7        |
| GAP8 @1.2V          | 175        | 8.7            | 1 500 000  | 70         |
| GAP8 @1.0V w HWCE   | 4.7        | 99.1           | 460 000    | 0.8        |

GAP8 @1.2V finishes 11x faster than the MCU (8.7 ms vs. 99.1 ms). GAP8 @1.0V needs 16x less power than the MCU at the same execution time (3.7 mW vs. 60 mW).

#### 4x More efficiency at less than 10% area cost
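
The columns of the table are tied together by two simple relations: cycles = frequency x execution time, and energy = power x execution time. The helpers below are a small sketch for checking the rows; the function names are invented for this example, and the energy figures that follow are derived from the table rather than quoted from the slide.

```c
#include <assert.h>

/* Cycles = frequency x time. MHz x ms yields thousands of cycles. */
double kilocycles(double freq_mhz, double time_ms) {
    assert(freq_mhz >= 0.0 && time_ms >= 0.0);
    return freq_mhz * time_ms;
}

/* Energy = power x time. mW x ms yields microjoules. */
double energy_uj(double power_mw, double time_ms) {
    assert(power_mw >= 0.0 && time_ms >= 0.0);
    return power_mw * time_ms;
}
```

With the table's numbers, `energy_uj(60, 99.1)` gives about 5946 µJ for the MCU, `energy_uj(3.7, 99.1)` about 367 µJ for GAP8 @1.0V, `energy_uj(70, 8.7)` 609 µJ for GAP8 @1.2V, and `energy_uj(0.8, 99.1)` about 79 µJ with the HWCE. The ratio between the last two GAP8 @1.0V figures is roughly 4.6, in line with the "4x" claim.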

2018 Scarabaeus UMC65


#### **Complete Application: DroNET on NanoDrone**



Only onboard computation for autonomous flight + obstacle avoidance: no human operator, no ad-hoc external signals, and no remote base-station!


2019 Baikonur GF22FDX




#### PULP has released a large number of IPs





#### **Shared Memory Accelerators**




#### **Accelerator: HW Convolution Engine**

- Convolve-accumulate in streaming fashion
- Weights for each convolution filter are stored privately
- Fine-grain clock gating to minimize dynamic power
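
For reference, the operation such an engine offloads can be written as a plain software loop. The sketch below is a CNN-style 5x5 convolve-accumulate (no kernel flip) over a row-major image. It says nothing about the engine's actual datapath; the 16-bit input and 32-bit accumulator types and the name `conv5x5_acc` are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define K 5u /* filter size, matching the 5x5 case */

/* Valid 2D convolve-accumulate: out[y][x] += sum over a KxK window of in * weights.
   in is h x w, weights is K x K, out is (h-K+1) x (w-K+1); all row-major. */
void conv5x5_acc(int32_t *out, const int16_t *in, const int16_t *weights,
                 size_t h, size_t w) {
    assert(h >= K && w >= K);
    size_t oh = h - K + 1, ow = w - K + 1;
    for (size_t y = 0; y < oh; y++) {
        for (size_t x = 0; x < ow; x++) {
            int32_t acc = out[y * ow + x]; /* accumulate onto the existing output */
            for (size_t i = 0; i < K; i++)
                for (size_t j = 0; j < K; j++)
                    acc += (int32_t)in[(y + i) * w + (x + j)] * weights[i * K + j];
            out[y * ow + x] = acc;
        }
    }
}
```

Accumulating onto `out` rather than overwriting it lets the same routine sum the contributions of several input channels into one output map.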



F. Conti and L. Benini, "A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015, pp. 683-688.

2015 Diego ALP180

#### HW Convolution Engine Performance

**Cluster performance and energy efficiency on a 64x64 CNN layer (5x5 conv)** 

Scaled to ST FD-SOI 28nm @ Vdd=0.6V, f=115MHz



2016 VivoSoC 2.001 SMIC130

#### Arnold: a Collaboration with Quicklogic



- Chip in 22nm FDX
  - Combines e-FPGA (Quicklogic) with PULPissimo (single core uC)

#### Multiple operation modes

- Configurable peripheral
- Accelerator for core
- Accelerator for independent I/O

2018 Arnold GF22FDX


#### Cluster as heterogeneous accelerator (HERO)






### **FPGA Prototyping Platforms**

#### Available:

#### Digilent Genesys2

- \$999 (\$600 academic)
- 1-2 cores at 66MHz

#### Xilinx VC707

- \$3500
- 1-4 cores at 60MHz

#### Digilent Nexys Video

- \$500 (\$250 academic)
- 1 core at 30MHz



#### In progress:

#### Xilinx VCU118, BittWare XUPP3R

- \$7000-8000
- >100MHz

#### Amazon AWS F1

- Rent by the hour









### How about SW support?

#### **PULP Software Development Kit**

- Package for compiling, running, debugging and profiling applications on PULP platforms
- Supports all recent and upcoming PULP chips: Mr.Wolf, GAP, Vega, ...
- Supports all targets: Virtual Platform, RTL platform, FPGA
- RISC-V GCC with support for PULP extensions
- Basic OpenMP support
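
As an illustration of what basic OpenMP support allows, the vector addition used earlier in the talk can be spread across the cores of a cluster with a single pragma. This is a generic OpenMP sketch, not code from the PULP SDK: the function name is made up, and the build flags needed on a PULP target are not covered here (on a standard GCC, OpenMP is enabled with `-fopenmp`).

```c
#include <assert.h>
#include <stdint.h>

/* d[i] = a[i] + b[i], with loop iterations shared among the threads of the team.
   If OpenMP is not enabled, the pragma is ignored and the loop runs serially. */
void vec_add_omp(int32_t *d, const int32_t *a, const int32_t *b, int n) {
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        d[i] = a[i] + b[i];
}
```

Each iteration writes a distinct element of `d`, so the loop needs no synchronization beyond the implicit barrier at the end of the parallel region.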


https://github.com/pulp-platform/pulp-sdk



### What is PULP doing to maintain our cores?

#### We (ETHZ and University of Bologna) are research groups

- Motivated to develop new architectures and systems
- We needed efficient RISC-V cores (and peripherals) for our work
- Not so good (or interested) in providing industrial level support for these cores

#### We need help to







### Academic open source $\rightarrow$ Industrial open source

- Rick O'Connor (OpenHW CEO, former RISC-V Foundation director)
- OpenHW Group is a not-for-profit, global organization (EU,NA,Asia) driven by its members and individual contributors where HW and SW designers collaborate in the development of open-source cores, related IP, tools and SW such as the Core-V family of cores.
- OpenHW Group provides an infrastructure for hosting high quality open-source HW developments in line with industry best practices.



## A vertical, application-focused open approach



- OpenTitan is the first open source silicon project building a transparent, high-quality reference design for silicon root of trust (RoT) chips.
- Founding Partners






#### Feel the momentum!

Commits Per Month

Ibex RISC-V core, flash interface, communications ports, cryptography accelerators, and more.

#### **Vibrant repository**









Luca Benini, Davide Rossi, Andrea Borghesi, Michele Magno, Simone Benatti, Francesco Conti, Francesco Beneventi, Daniele Palossi, Giuseppe Tagliavini, Antonio Pullini, Germain Haugou, Lukas Cavigelli, Manuele Rusci, Florian Glaser, Renzo Andri, Fabio Montagna, Bjoern Forsberg, Pasquale Davide Schiavone, Alfio Di Mauro, Victor Javier Kartsch Morinigo, Tommaso Polonelli, Fabian Schuiki, Stefan Mach, Andreas Kurth, Florian Zaruba, Manuel Eggimann, Philipp Mayer, Marco Guermandi, Xiaying Wang, Michael Hersche, Robert Balas, Antonio Mastrandrea, Matheus Cavalcante, Angelo Garofalo, Alessio Burrello, Gianna Paulin, Georg Rutishauser, Andrea Cossettini, Luca Bertaccini, Maxim Mattheeuws, Samuel Riedel, Sergei Vostrikov, Vlad Niculescu, Frank K. Gürkaynak, and many more that we forgot to mention



http://pulp-platform.org



@pulp\_platform



**FOSSistanbul** will bring together enthusiasts and members of industry and academia who are working on open source hardware design, in a lively and attractive city.

With keynotes by: Luca Benini, Nele Mentens, Onur Mutlu

#### **Register for FREE**

https://fossi-foundation.org/fossistanbul/

