

### The Parallel Ultra Low Power Platform

RISC-V Tutorial at HotChips 2019

18 Aug 2019



<sup>1</sup>Department of Electrical, Electronic and Information Engineering

<sup>2</sup>Integrated Systems Laboratory, ETH Zürich

Fabian Schuiki

and the entire PULP team

pulp-platform.org

#### Parallel Ultra Low Power (PULP)

- Project started in 2013 by Luca Benini
- A collaboration between University of Bologna and ETH Zürich
  - Large team: about 60 people in total, not all of whom work on PULP
- Key goal:

**How to get the most BANG for the ENERGY consumed in a computing system**

We were able to start with a clean slate, with no need to remain compatible with legacy systems.





#### How we started with open source processors

- Our research was not developing processors...
- ... but we needed good processors for systems we build for research
- Initially (2013) our options were
  - Build our own (support for SW and tools)
  - Use a commercial processor (licensing, collaboration issues)
  - Use what is openly available (OpenRISC, ...)
- We started with OpenRISC
  - First chips until mid-2016 were all using OpenRISC cores
  - We spent time improving the microarchitecture
- Moved to RISC-V later
  - Larger community, more momentum
  - Transition was relatively simple (new decoder)








### Motivation: Cloud $\rightarrow$ Edge $\rightarrow$ Extreme Edge AI





Extreme edge AI challenge:

- AI capabilities below 1 pJ/op (MCU power envelope)
- Mops to Tops
- Beyond fp32/fp64








### 2013: Parallel Ultra Low Power $\rightarrow$ PULP!



Near-Threshold Computing (NTC):

1. Don't waste energy pushing devices into strong inversion
2. Recover performance with parallel execution
3. Core with 'naked' L1 interface to create a cluster coupled at L1 level
4. Manage leakage, PVT variability and SRAM limiting NT!

Need a strong ISA, need full access to "deep" core interfaces, need to tune the pipeline!

Open ISA: RISC-V RV32IMC + new, open microarchitecture $\rightarrow$ RI5CY!
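As a rough illustration of points 1 and 2 (a standard first-order CMOS model, not a figure from the talk; $C_{\mathrm{eff}}$, $N$ and $f$ are symbols introduced here), dynamic energy per operation scales with the square of the supply voltage, and throughput lost to the lower clock frequency can be bought back with $N$ parallel cores:

$$
E_{\mathrm{op}} \approx C_{\mathrm{eff}} \, V_{DD}^{2},
\qquad
\text{throughput} \approx N \cdot f(V_{DD})
$$

Halving $V_{DD}$ would cut $E_{\mathrm{op}}$ to roughly a quarter in this model. Leakage, which the model ignores, is why point 4 matters near threshold.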





#### Bespoke ISA needed! Enter Xpulp extensions

**<32-bit precision → SIMD2/4 → x2, x4 efficiency & memory size**

RISC-V ISA is extensible by construction (great!)

- V1
  - Baseline RISC-V RV32IMC
  - HW loops
- V2
  - Post-modified load/store
  - MAC
- V3
  - SIMD 2/4 + dot product + shuffling
  - Bit manipulation unit
  - Lightweight fixed point (EML centric)



#### 25 kGE → 40 kGE (1.6x)



M. Gautschi et al., "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices," in IEEE TVLSI, Oct. 2017.





#### RI5CY – are Xpulp ISA Extensions (1.6x) worthwhile?

Example: a loop of 100 iterations (`i < 100`) computing `a[i] + b[i]` on 8-bit data.

| Variant              | What changes in the loop                                                                              | Cycles/output |
|----------------------|-------------------------------------------------------------------------------------------------------|---------------|
| Baseline             | `lb`, `lb`, `add`, `sb`, plus separate `addi` pointer and counter updates and a `bne` back to `Lstart` | 11            |
| Auto-incr load/store | Post-incrementing accesses such as `lb x2, 0(x10!)` replace the pointer `addi` instructions            | 8             |
| HW loop              | A hardware loop ending at `Lend` removes the counter `addi` and the `bne`                              | 5             |
| Packed SIMD          | `lp.setupi 25, Lend` with word accesses and `pv.add.b x2, x3, x2` handles four bytes per iteration     | 1.25          |

Are they worthwhile? YES! 10x on 2D convolutions.
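The step from 5 to 1.25 cycles/output comes from handling four 8-bit elements per 32-bit word. The sketch below models that in plain Python; it is not PULP code. The helper names are made up for illustration, and the wrap-around behaviour of the 8-bit add is an assumption of this model.

```python
# Cycle counts per output element, as quoted on the slide.
CYCLES_PER_OUTPUT = {
    "baseline": 11,
    "auto-incr load/store": 8,
    "hw loop": 5,
    "packed simd": 1.25,
}


def speedup_over_baseline(variant):
    """Ratio of baseline cycles/output to the given variant's cycles/output."""
    return CYCLES_PER_OUTPUT["baseline"] / CYCLES_PER_OUTPUT[variant]


def add_bytes_scalar(a, b):
    """One 8-bit add per iteration, as in the scalar loop variants."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def pack4(values):
    """Pack four 8-bit values into a 32-bit word, lane 0 in the low byte."""
    return values[0] | (values[1] << 8) | (values[2] << 16) | (values[3] << 24)


def unpack4(word):
    """Split a 32-bit word into its four 8-bit lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(4)]


def simd_add_b(wa, wb):
    """Lane-wise 8-bit add of two 32-bit words; no carry crosses a lane."""
    return pack4([(x + y) & 0xFF for x, y in zip(unpack4(wa), unpack4(wb))])


def add_bytes_simd(a, b):
    """Same result as add_bytes_scalar, but four elements per iteration."""
    out = []
    for i in range(0, len(a), 4):  # 100 elements -> 25 iterations
        out += unpack4(simd_add_b(pack4(a[i:i + 4]), pack4(b[i:i + 4])))
    return out
```

With 100 elements the packed loop runs 25 times, which matches the `lp.setupi 25` count on the slide, and `speedup_over_baseline("packed simd")` evaluates to 8.8 for this simple loop.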





#### Results: RV32IMCXpulp vs RV32IMC

#### **8-bit Convolution Results**



**PULP-NN:** an open-source library for DNN inference on PULP cores





Overall Speedup of 75x

#### The Evolution of the 'Species'









### Mr. Wolf Chip Results: Heterogeneous Computing Works

| Technology         | CMOS 40nm LP       |
|--------------------|--------------------|
| Chip area          | 10 mm<sup>2</sup>  |
| VDD range          | 0.8V - 1.1V        |
| Memory Transistors | 576 Kbytes         |
| Logic Transistors  | 1.8 Mgates         |
| Frequency Range    | 32 kHz – 450 MHz   |
| Power Range        | 72 µW – 153 mW     |

| Power Management (DC/DC + LDO) | VDD [V]   | Freq.            | Power         |
|--------------------------------|-----------|------------------|---------------|
| Deep Sleep                     | 0.8       | n.a.             | 72 µW         |
| Ret. Deep Sleep                | 0.8       | n.a.             | 76.5 – 108 µW |
| SoC Active                     | 0.8 - 1.1 | 32 kHz – 450 MHz | 0.97 – 38 mW  |
| Cluster Active                 | 0.8 - 1.1 | 32 kHz – 350 MHz | 1.6 – 153 mW  |



A. Pullini, D. Rossi, I. Loi, A. Di Mauro, L. Benini, "Mr.Wolf: a 1 GFLOP/S Energy-Proportional Parallel Ultra Low Power SoC for IoT Edge Processing", ESSCIRC 2018.





### What Kind of Acceleration: Shared memory accelerators

#### **Coarse-Grained Shared-Memory Accelerators**

- DFGs mapped in hardware (ILP + DLP)  $\rightarrow$ highest efficiency, low flexibility
- Sharing data memory with processor for fast communication  $\rightarrow$  low overhead
- Controlled through a memory-mapped interface
- Typically one/two accelerators shared by multiple cores
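To make the control flow concrete, here is a toy model of a shared-memory accelerator. The core writes a few memory-mapped registers, and the accelerator then works directly on the memory the cores also see. The register offsets, the class name and the "double every element" function are all hypothetical, not a real PULP register map.

```python
class ToyAccelerator:
    """Toy fixed-function accelerator: dst[i] = 2 * src[i]."""

    # Hypothetical register offsets, chosen only for this sketch.
    SRC, DST, LEN, START, STATUS = 0x00, 0x04, 0x08, 0x0C, 0x10

    def __init__(self, shared_mem):
        # Same list object the "cores" use, so no data is copied.
        self.mem = shared_mem
        self.regs = {self.SRC: 0, self.DST: 0, self.LEN: 0, self.STATUS: 0}

    def write_reg(self, offset, value):
        """Memory-mapped write; writing START triggers the job."""
        if offset == self.START:
            self._run()
        else:
            self.regs[offset] = value

    def read_reg(self, offset):
        """Memory-mapped read of a configuration or status register."""
        return self.regs[offset]

    def _run(self):
        src = self.regs[self.SRC]
        dst = self.regs[self.DST]
        n = self.regs[self.LEN]
        for i in range(n):
            self.mem[dst + i] = 2 * self.mem[src + i]
        self.regs[self.STATUS] = 1  # done flag


def offload_double(acc, src, dst, n):
    """What a core would do: program the registers, start, check status."""
    acc.write_reg(acc.SRC, src)
    acc.write_reg(acc.DST, dst)
    acc.write_reg(acc.LEN, n)
    acc.write_reg(acc.START, 1)
    return acc.read_reg(acc.STATUS) == 1
```

Only three small register writes and one start command cross the control interface. The data itself never moves, which is the low-overhead property the list above refers to.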







### What About Floating Point Support?

- F (single precision) and D (double precision) extensions in RISC-V
- Uses a separate floating-point register file
  - specialized float loads (also compressed)
  - float moves from/to integer register file
- Fully IEEE compliant
- Alternative FP format support (<32 bit)



#### Packed-SIMD support for all formats

| FP64 |      |      |      |      |      |      |      |
|------|------|------|------|------|------|------|------|
| FP32 |      |      |      | FP32 |      |      |      |
| FP16 |      | FP16 |      | FP16 |      | FP16 |      |
| FP8  | FP8  | FP8  | FP8  | FP8  | FP8  | FP8  | FP8  |
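A small sketch of what the diagram implies, assuming a 64-bit register (read off the FP64 row) and lane 0 in the low bits. The function names are illustrative. Only FP16 is packed here, because Python's `struct` module has a half-precision format code (`e`) but none for FP8.

```python
import struct

REG_BITS = 64  # assumed register width, taken from the FP64 row


def lanes_per_register(format_bits):
    """Number of packed-SIMD lanes for a given element width."""
    return REG_BITS // format_bits


def pack_fp16x4(values):
    """Pack four FP16 values into one 64-bit integer, lane 0 in the low bits."""
    return int.from_bytes(struct.pack("<4e", *values), "little")


def unpack_fp16x4(reg):
    """Inverse of pack_fp16x4."""
    return list(struct.unpack("<4e", reg.to_bytes(8, "little")))
```

For example, `lanes_per_register(16)` gives 4 and `lanes_per_register(8)` gives 8, matching the FP16 and FP8 rows.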

Unified FP/Integer register file

- Not standard
- up to **15 %** better performance
  - Re-use integer load/stores (post incrementing ld/st)
  - Less area overhead
  - Useful if pressure on register file is not very high (true for a lot of applications)





### PULP cluster+MCU+HWCE(V1) → GWT's GAP8 (55 TSMC)

#### Two independent clock and voltage domains, from 0-133MHz/1V up to 0-250MHz/1.2V



| What                | Freq MHz | Exec Time ms | Cycles     | Power mW |
|---------------------|----------|--------------|------------|----------|
| 40nm Dual Issue MCU | 216      | 99.1         | 21 400 000 | 60       |
| GAP8 @1.0V          | 15.4     | 99.1         | 1 500 000  | 3.7      |
| GAP8 @1.2V          | 175      | 8.7          | 1 500 000  | 70       |
| GAP8 @1.0V w HWCE   | 4.7      | 99.1         | 460 000    | 0.8      |

Slide annotations: **11 X** (execution time) and **16 X** (power).



#### 4x More efficiency at less than 10% area cost
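The table can be cross-checked with two relations: execution time is cycles divided by frequency, and energy per run is power times execution time. The sketch below uses only the numbers from the table; the dictionary layout and function names are illustrative.

```python
# Rows copied from the table: frequency in MHz, cycle count, power in mW.
ROWS = {
    "40nm Dual Issue MCU": {"freq_mhz": 216, "cycles": 21_400_000, "power_mw": 60},
    "GAP8 @1.0V": {"freq_mhz": 15.4, "cycles": 1_500_000, "power_mw": 3.7},
    "GAP8 @1.2V": {"freq_mhz": 175, "cycles": 1_500_000, "power_mw": 70},
    "GAP8 @1.0V w HWCE": {"freq_mhz": 4.7, "cycles": 460_000, "power_mw": 0.8},
}


def exec_time_ms(row):
    """cycles / (MHz * 1e3) gives milliseconds."""
    return row["cycles"] / (row["freq_mhz"] * 1e3)


def energy_uj(row):
    """mW * ms gives microjoules per run."""
    return row["power_mw"] * exec_time_ms(row)


def energy_ratio(reference, candidate):
    """How many times less energy the candidate needs than the reference."""
    return energy_uj(ROWS[reference]) / energy_uj(ROWS[candidate])
```

With these figures, `energy_ratio("GAP8 @1.0V", "GAP8 @1.0V w HWCE")` comes out between 4 and 5, consistent with the "4x more efficiency" claim. The computed times land within about 2 ms of the table's rounded 99.1 ms.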





#### New Application Frontiers: DroNET on NanoDrone



Only onboard computation for autonomous flight + obstacle avoidance: no human operator, no ad-hoc external signals, and no remote base station!

https://youtu.be/57Vy5cSvnaA


# The Cores





#### RI5CY – Our workhorse 32-bit core



- 4-stage pipeline, optimized for energy efficiency
- 40 kGE, 30 logic levels, 3.19 CoreMark/MHz
- Includes various extensions (Xpulp) to RISC-V for DSP applications





### Our extensions to RI5CY (with additions to GCC)

- Post-incrementing load/store instructions
- Hardware Loops (lp.start, lp.end, lp.count)
- ALU instructions
  - Bit manipulation (count, set, clear, leading bit detection)
  - Fused operations: (add/sub-shift)
  - Immediate branch instructions
- Multiply Accumulate (32x32 bit and 16x16 bit)
- SIMD instructions (2x16 bit or 4x8 bit) with scalar replication option
  - add, min/max, dotproduct, shuffle, pack (copy), vector comparison

For 8-bit values the following can be executed in a single cycle (**pv.dotup.b**)  $Z = D_1 \times K_1 + D_2 \times K_2 + D_3 \times K_3 + D_4 \times K_4$ 
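A plain-Python model of that operation, assuming the four 8-bit operands are unsigned and packed into 32-bit words with lane 0 in the low byte. The function name is illustrative, and this is a behavioural sketch rather than a specification of the instruction.

```python
def dotup_b(d_word, k_word):
    """Unsigned dot product of the four 8-bit lanes of two 32-bit words.

    Computes Z = D1*K1 + D2*K2 + D3*K3 + D4*K4.
    """
    total = 0
    for lane in range(4):
        d = (d_word >> (8 * lane)) & 0xFF
        k = (k_word >> (8 * lane)) & 0xFF
        total += d * k
    return total & 0xFFFFFFFF  # result fits easily in a 32-bit register
```

For example, `dotup_b(0x04030201, 0x01010101)` sums the lanes 1, 2, 3 and 4 and returns 10. Without the extension, the same result takes four multiplies, three adds and the byte extraction.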





### Enter Zero/Micro-riscy (Ibex), small core for control



- Only 2-stage pipeline, simplified register file
- Zero-Riscy (RV32-ICM), 19 kGE, 2.44 CoreMark/MHz
- Micro-Riscy (RV32-EC), 12 kGE, 0.91 CoreMark/MHz
- Used as SoC level controller in newer PULP systems
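One hedged way to compare the three 32-bit cores is CoreMark/MHz per kGE, using the area and CoreMark figures quoted on the slides. This ignores frequency, energy and the DSP extensions, so it is only a rough area-efficiency indicator; the names below are illustrative.

```python
# (area in kGE, CoreMark/MHz) as quoted on the slides.
CORES = {
    "RI5CY": (40, 3.19),
    "Zero-riscy": (19, 2.44),
    "Micro-riscy": (12, 0.91),
}


def coremark_per_kge(name):
    """CoreMark/MHz delivered per kGE of core area."""
    area_kge, coremark_per_mhz = CORES[name]
    return coremark_per_mhz / area_kge


def most_area_efficient():
    """Core with the highest CoreMark/MHz per kGE."""
    return max(CORES, key=coremark_per_kge)
```

By this metric Zero-riscy comes out ahead, at roughly 0.13 versus about 0.08 for the other two, which fits its role as a small control core.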





#### Finally the step into 64-bit cores

- For the first 4 years of the PULP project we used only 32bit cores
  - Luca once famously said "We will never build a 64bit core".
  - Most IoT applications work well with 32bit cores.
  - A typical 64bit core is much more than 2x the size of a 32bit core.

#### But times change:

- Using a 64bit Linux-capable core allows you to share the same address space as mainstream processors.
  - We are involved in several projects where we (are planning to) use this capability
- There is a lot of interest in the security community for working on a contemporary open source 64bit core.
- Open research questions on how to build systems with multiple cores.





#### **ARIANE: Our Linux Capable 64-bit core**



#### Main properties of Ariane

- Tuned for high frequency, 6-stage pipeline, integrated cache
  - In-order issue, out-of-order write-back, in-order commit
  - Supports privilege spec 1.11, M, S and U modes
  - Hardware Page Table Walker
- Implemented in GF 22FDX (Poseidon, Kosmodrom, Baikonur), and UMC65 (Scarabaeus)
  - In 22nm: ~1 GHz worst case conditions (SSG, 125/-40C, 0.72V)
  - 8-way 32kByte Data cache and 4-way 32kByte Instruction Cache
  - Core area: 175 kGE







#### Ariane booting Linux on a Digilent Genesys 2 board









#### Extreme FP Performance: The "V" Extension






# The Platforms





#### Making PULP: Cores



#### Making PULP: Cores + Peripherals/Acc.





#### Making PULP: Cores + Peripherals/Acc. = Platforms



#### The PULP platforms put everything together



#### **Platforms**



#### **Single Core**

- PULPino
- PULPissimo



### PULPino: Our first single core platform

- Simple design
  - Meant as a quick release
- Separate Data and Instruction memory
  - Makes it easy in HW
  - Not meant as a Harvard arch.
- Can be configured to work with all our 32bit cores
  - RI5CY, Zero/Micro-Riscy (Ibex)
- Peripherals copied from its larger brothers
  - Any AXI and APB peripherals could be used







### PULPissimo: The improved single core platform

- Shared memory
  - Unified Data/Instruction Memory
  - Uses the multi-core infrastructure
- Support for Accelerators
  - Direct shared memory access
  - Programmed through APB bus
  - Number of TCDM access ports determines max. throughput
- uDMA for I/O subsystem
  - Can copy data directly from I/O to memory without involving the core
- Used as a SoC/fabric controller in larger systems







#### The main PULP systems we develop are cluster based



#### **Platforms**



#### **Accelerators**

- HWCE (convolution)
- Neurostream (ML)
- HWCrypt (crypto)
- PULPO (1st order opt)

#### PULP cluster contains multiple RISC-V cores







#### All cores can access all memory banks in the cluster
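A minimal sketch of how several cores can share multi-banked memory, assuming word-interleaved banks where each bank serves one request per cycle. The word size, the bank count and the function names are assumptions for illustration, not parameters taken from the slides.

```python
from collections import Counter

WORD_BYTES = 4   # assumed interleaving granularity
NUM_BANKS = 16   # hypothetical bank count


def bank_of(addr):
    """Bank holding a byte address under word-level interleaving."""
    return (addr // WORD_BYTES) % NUM_BANKS


def cycles_to_serve(addrs):
    """Cycles needed for one request per core if a bank serves one per cycle."""
    if not addrs:
        return 0
    return max(Counter(bank_of(a) for a in addrs).values())
```

Eight cores reading consecutive words hit eight different banks and finish in one cycle in this model. Eight cores striding by `NUM_BANKS * WORD_BYTES` bytes all hit the same bank and take eight cycles.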







#### Data is copied from a higher level through DMA







#### There is a (shared) instruction cache that fetches from L2







#### Hardware Accelerators can be added to the cluster









#### Event unit to manage resources (fast sleep/wakeup)









#### An additional microcontroller system (PULPissimo) for I/O









#### Finally multi-cluster PULP systems for HPC applications



### Heterogeneous Research Platform



- First released in 2018
- Allows a PULP cluster to be connected to a host system





#### OpenPiton and Ariane together, the many-core system

#### **OpenPiton**

- Developed by Princeton
- Originally OpenSPARC T1
- Scalable NoC with coherent LLC
- Tiled Architecture
- Still work in progress
  - Bare-metal released in Dec '18
  - Update with support for SMP Linux will be released soon







#### **OpenPiton+Ariane mapped to FPGA**

Console screenshot: four harts (0 to 3), each reported as `rv64imac`, `sv39`, `eth, ariane`.

#### Digilent Genesys 2

- Core: 66 MHz
- Up to 2 cores
- 8 GiB DDR3
- 1 core config:
  - 85k LUT (42%)
  - 67 BRAM (15%)

#### Xilinx VCU 118

- Core: 100 MHz
- Up to 16 cores
- 32 GiB DDR4
- (Available soon)










# The Chips





#### We have designed more than 25 ASICs based on PULP



#### ASICs meant to go on IC Tester

- Mainly characterization
- Not so many peripherals



#### **ASICs** meant for applications

- More peripherals (SPI, Camera)
- More on-chip memory







#### You can buy development boards with PULP technology

#### VEGA board from open-isa.org

- Micro-controller board with RI5CY and zero-riscy

#### **GAPUINO** from GreenWaves

- PULP cluster system with 8+1 RI5CY cores





#### Brief illustrated history of selected ASICs



- All are 28 FDSOI technology, RVT, LVT and RVT flavor
- Uses OpenRISC cores
- Chips designed in collaboration with STM, EPFL, CEA/LETI
- PULPv3 has ABB control





### The first system chips, meant for applications



- First multi-core systems that were designed to work on development boards. Each has several peripherals (SPI, I2C, GPIO)
- Mia Wallace and Fulmine (UMC65) use OpenRISC cores
- Honey Bunny (GF28 SLP) uses RISC-V cores
- All chips also have our own FLL designs.





### Combining PULP with analog front-end for Biomedical apps



- Designed in collaboration with the Analog group of Prof. Huang at ETH
- All chips with SMIC130 (because of analog IPs)
- First three with OpenRISC, VivoSoC3 with RISC-V





#### The new generation chips from 2018



- System chips in TSMC40 (Mr. Wolf) and UMC65
- Mr. Wolf: IoT Processor with 9 RISC-V cores (Zero-riscy + 8x RI5CY)
- Atomario: Multi cluster PULP (2x clusters with 4x RI5CY cores each)
- Scarabaeus: Ariane based microcontroller





### The large system chips from 2018



- All are Globalfoundries 22FDX, around 10 mm<sup>2</sup>, 50-100 Mtrans
- **Poseidon**: PULPissimo (RI5CY) + Ariane
- Kosmodrom: 2x Ariane + NTX (FP streaming) accelerator
- Arnold: PULPissimo (RI5CY) + Quicklogic eFPGA





#### The next frontier from 2019



- UMC 65nm and Globalfoundries 22FDX
- Billywig: Streaming-enhanced RV32 cores for max. throughput, 3mm<sup>2</sup>
- Urania: Ariane+PULP Het. SoC, plus custom DRAM controller, 16mm<sup>2</sup>
- Baikonur: 2x Ariane + streaming-enhanced RV32 cores, 10mm<sup>2</sup>





#### We firmly believe in the Open Source movement



#### First launched in February 2016 (Github)

#### All our development is on open repositories



#### **Get PULP now!**

- You can get the source code for PULP-based systems, released under the permissive SolderPad open-source license, from Github now.
- If you want to program with PULP you can get the SDK and use the virtual platform.

#### Contributions from many groups







### Open Hardware is a necessity, not an ideological crusade

- The way we design ICs has changed; a big part is now infrastructure
  - Processors, peripherals, memory subsystems are now considered infrastructure
  - Very few (if any) groups design a complete IC from scratch
  - High quality building blocks (IP) needed
- We need an easy and fast way to collaborate with people
  - Currently complicated agreements have to be made between all partners
  - In many cases, too difficult for academia and small enterprises
- Hardware is critical for security, we need to ensure it is secure
  - Being able to see what is really inside will improve security
  - Having a way to design open HW will not prevent people from keeping secrets.





#### Silicon and Open Hardware fuel PULP success

- Many companies (we know of) are actively using PULP
  - They value that it is silicon proven
  - They like that it uses a permissive open source license

**Companies**

Google, GreenWaves Technologies, Embecosm, STMicroelectronics, GlobalFoundries, IQ-Analog, Dolphin Integration, CEVA, Cadence, OneSpin, Imperas, Silicon Labs, Ashling, Advanced Circuit Pursuit, Valtrix Systems, Antmicro, Wittenstein

**Direct research collaborators on PULP**

- Politecnico di Torino
- IBM Research Zurich
- Technische Universität Graz
- University of Cambridge
- EPF Lausanne
- CEA-Leti Grenoble
- USI Lugano
- CSEM Neuchatel
- Fraunhofer-Gesellschaft
- TU Kaiserslautern
- Princeton University
- Sapienza Università di Roma
- University of Cagliari

**Academic users we are aware of**

- Università di Genova
- Stanford University
- Universitat Bar-Ilan
- Politecnico di Milano
- UC Los Angeles
- İstanbul Teknik Üniversitesi
- Fondazione Bruno Kessler
- UC San Diego
- NCTU Hsinchu
- Lund University
- Columbia University
- University of Zagreb, FER
- TUT Tampere
- TU Darmstadt
- LIRMM Montpellier
- RWTH Aachen
- Universität Bremen
- University of Stuttgart
- IST University of Lisboa
- Hongik University Seoul
- University of Tübingen
- UFRN Rio Grande do Norte
- IIT Kharagpur
- TU Munich
- FORTH Hellas
- Chalmers Göteborg
- NTNU Trondheim
- Kyoto University



#### Micro/Zero-riscy is now Ibex

- lowRISC has agreed to maintain micro/zero-riscy
  - Interested in using the core in their projects
  - They have a team that can provide support
  - ETH Zürich and University of Bologna will continue to contribute to Ibex
- Our core has grown and left the house
  - Alpine Ibex (Capra ibex) is a mountain goat that is typical in the mountains of Switzerland









#### Non-Profit Open Hardware Group




#### **OpenHW Group Charter**

**OpenHW Group** is a not-for-profit, global organization driven by its members and individual contributors where hardware and software designers collaborate in the development of open-source cores, related IP, tools and software such as the **CORE-V Family of cores**. OpenHW provides an infrastructure for hosting high quality open-source HW developments in line with industry best practices.



R. O'Connor (OpenHW CEO, former RISC-V foundation director)






