

**APROPOS Winter School** IBM Research Zurich, February 14-16, 2023

#### **Transprecision Floating-Point Units**

Luca Bertaccini

IIS, ETH Zurich, Switzerland,

**PULP Platform** Open Source Hardware, the way it should be!





TRISTAN has received funding from the Key Digital Technologies Joint Undertaking (KDT JU) under grant agreement nr. 101095947. The KDT JU receives support from the European Union's Horizon Europe's research and innovation programmes and participating states are Austria, Belgium, Bulgaria, Croatia, Cyprus, Czechia, Germany, Denmark, Estonia, Greece, Spain, Finland, France, Hungary, Ireland, Israel, Iceland, Italy, Lithuania, Luxembourg, Latvia, Malta, Netherlands, Norway, Poland, Portugal, Romania, Sweden, Slovenia, Slovakia, Turkey.



@pulp platform pulp-platform.org



youtube.com/pulp platform

# Transprecision Computing: More than just approximation





- Floating-Point is everywhere
- Conservative with precision
  - Use largest precision everywhere & always
- Approximate Computing?
- Transprecision Computing!
  - Just right precision anywhere & anytime
  - Potential energy savings & speedup
- Holistic approach necessary

Enable energy-proportional transprecision computing in hardware!









$(-1)^{s} \times 2^{exp-BIAS} \times 1.mant$

$BIAS = 2^{e-1} - 1$
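A minimal sketch of how the value formula above can be applied to a raw bit pattern. The function name `decode` and its parameters are illustrative, not part of any library. The slide formula covers normal numbers; the subnormal and Inf/NaN branches follow the usual IEEE 754 conventions and are added here as an assumption about the intended formats.

```python
def decode(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode a (1, exp_bits, man_bits) floating-point bit pattern to a Python float."""
    bias = (1 << (exp_bits - 1)) - 1                 # BIAS = 2^(e-1) - 1
    sign = (bits >> (exp_bits + man_bits)) & 1
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mant = bits & ((1 << man_bits) - 1)
    frac = mant / (1 << man_bits)                    # mantissa bits as a fraction in [0, 1)

    if exp == (1 << exp_bits) - 1:
        # All-ones exponent: infinity or NaN (IEEE 754 convention)
        value = float("inf") if mant == 0 else float("nan")
    elif exp == 0:
        # Subnormal: no implicit leading one, exponent fixed at 1 - BIAS
        value = frac * 2.0 ** (1 - bias)
    else:
        # Normal: (-1)^s * 2^(exp - BIAS) * 1.mant
        value = (1.0 + frac) * 2.0 ** (exp - bias)
    return -value if sign else value


print(decode(0x3C00, 5, 10))   # FP16 pattern
print(decode(0x3F80, 8, 7))    # bfloat16 pattern
```

The same function handles any format described by an (exponent bits, mantissa bits) pair, which is the property the "smallFloat" formats rely on.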



- Precision Tuning
  - "smallFloat" formats







• Less precision than FP16

| Format | Maximum value | Minimum value (subnormal) | Minimum value (normal) | # Representable values |
|---|---|---|---|---|
| **FP32** (1,8,23) | $\approx 3.40 \times 10^{38}$ | $\approx 1.40 \times 10^{-45}$ | $\approx 1.18 \times 10^{-38}$ | $4.29 \times 10^{9}$ |
| **Bfloat16** (1,8,7) | $\approx 3.39 \times 10^{38}$ | $\approx 9.2 \times 10^{-41}$ | $\approx 1.18 \times 10^{-38}$ | 65536 |
| **FP16** (1,5,10) | 65504 | $\approx 5.96 \times 10^{-8}$ | $\approx 6.10 \times 10^{-5}$ | 65536 |
| **FP8** (1,5,2) | 57344 | $\approx 1.53 \times 10^{-5}$ | $\approx 6.10 \times 10^{-5}$ | 256 |
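The table entries follow directly from the exponent and mantissa widths. The sketch below recomputes them, assuming IEEE 754-style encodings (top exponent code reserved for Inf/NaN, subnormals supported). The helper name `format_limits` is illustrative, and "# representable values" is interpreted here as the number of bit patterns.

```python
def format_limits(exp_bits: int, man_bits: int) -> dict:
    """Range limits of a (1, exp_bits, man_bits) IEEE 754-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias   # all-ones exponent code is reserved
    min_exp = 1 - bias                     # exponent of the smallest normal number
    return {
        "max": (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp,
        "min_normal": 2.0 ** min_exp,
        "min_subnormal": 2.0 ** (min_exp - man_bits),
        "encodings": 2 ** (1 + exp_bits + man_bits),
    }


FORMATS = {"FP32": (8, 23), "Bfloat16": (8, 7), "FP16": (5, 10), "FP8": (5, 2)}

for name, (e, m) in FORMATS.items():
    lim = format_limits(e, m)
    print(f"{name:9s} max={lim['max']:.3g} "
          f"min_sub={lim['min_subnormal']:.3g} "
          f"min_norm={lim['min_normal']:.3g} "
          f"encodings={lim['encodings']}")
```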










ALMA MATER STUDIORUM Università di Bologna

- SIMD vectors

**ETH** zürich

Native hardware support

- Several benefits:
  - Higher Performance
  - Higher energy efficiency
  - Lower memory footprint



# Boosting the Energy Efficiency through ISA extension

 General purpose: tune precision & performance for high energy-efficiency

- General-purpose **RISC-V** CPUs can be extended with domain-specific ISA extensions:
  - Opportunities for deeply optimizing the efficiency of the cores
  - Without compromising the baseline standardized ISA (still general-purpose)









# **A Transprecision FPU**

#### General purpose: tune precision & performance for high energy-efficiency





#### An Open-Source Transprecision FPU



What is needed?



- Support standard IEEE 754 FP
- Optimize **performance** and **energy-efficiency** 
  - SIMD vectors
  - Special operations
- Many target domains
  - Operations & Formats
  - Technology
  - Architecture
- Enable new research





- Hierarchical approach
  - Grouped by operation
    - (i) additions and multiplications carried out on an FMA unit (FMA = a \* b + c)
    - (ii) division and square root
    - (iii) comparisons
    - (iv) conversions (FP<->FP or INT<->FP)
- Datapath width









- Hierarchical approach:
  - 1. Grouped by **operation**
  - 2. Sliced by **format**

- Datapath width
- Any formats (new formats can be defined in a package at design time)
- Stand-alone or multi-format or none



[Figure: FPU top level containing operation group blocks (e.g. ADD/MUL); each block is sliced by format (Format 2 ... Format n) or implemented as a multi-format slice]











- Hierarchical approach:
  - 1. Grouped by **operation**
  - 2. Sliced by **format**
  - 3. Lanes for SIMD
  - 4. Execution unit
- Datapath width
- Any formats
- Stand-alone or multi-format or none
- SIMD support or scalar only
- Pipelining for execution unit











- Any formats
- Implementation
- SIMD
- Pipelining
- **Open-Source**
- Synthesizable
- Hackable



# **CVFPU: Highly Configurable**

CVFPU is highly parametrized to fit into a large variety of architectures:

- Width
- Formats
- Implementation
- SIMD
- Pipelining
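To make the interplay between datapath width, formats, and SIMD concrete, here is a small Python model. It is a sketch under stated assumptions: the class and function names are hypothetical and do not correspond to the actual CVFPU parameters, and it only assumes that SIMD lanes are obtained by packing as many elements of a format as fit into the datapath.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FpFormat:
    """Hypothetical description of a floating-point format."""
    name: str
    exp_bits: int
    man_bits: int

    @property
    def width(self) -> int:
        return 1 + self.exp_bits + self.man_bits  # sign + exponent + mantissa


def simd_lanes(datapath_width: int, fmt: FpFormat) -> int:
    """Number of elements of fmt that fit in the datapath (0 if fmt is too wide)."""
    return datapath_width // fmt.width


FORMATS = [
    FpFormat("FP64", 11, 52),
    FpFormat("FP32", 8, 23),
    FpFormat("FP16", 5, 10),
    FpFormat("BF16", 8, 7),
    FpFormat("FP8", 5, 2),
]


def lane_table(datapath_width: int) -> dict:
    return {f.name: simd_lanes(datapath_width, f) for f in FORMATS}


print(lane_table(64))
print(lane_table(32))
```

With a 64-bit datapath every format narrower than FP64 gets at least two lanes, while a 32-bit datapath cannot host FP64 at all, which is why width and format selection have to be configured together.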

[Figure: target architectures ranging from embedded to HPC]





### Programmability





# Embedded Transprecision Multi-Core Cluster



- Edge Computing for signal processing
- Build a multi-core cluster of RI5CY cores
  - FP32, FP16, bfloat16
  - Share TP-FPUs
- Design Space Exploration
  - # Cores
  - FPU/Core Ratio
  - # Pipeline stages
- Implemented in 22FDX
  - Post-layout power
- Implemented on FPGA
  - Benchmarks





#### **Multi-Core Cluster: Selected Results**

- 8x 4 Benchmarks, 18 architectures
- Best architectures
  - Performance: **3.4 spGflop/s**
  - Energy Efficiency: 99 spGflop/s/W
  - Area Efficiency: 2 spGflop/s/mm<sup>2</sup>
- Compared to SoA:
  - Outperforming SoA low-power embedded systems

Our TP-FPU architecture opens a huge design space for embedded TP computing systems. What about high-performance?

[Figure: application performance [spGflop/s], energy efficiency [spGflop/s/W], and area efficiency [spGflop/s/mm<sup>2</sup>] compared with low-power embedded systems]

# Kosmodrom: Application-class transprecision computing





Architecture Precision Voltage




- Dual-core Ariane **ASIC** in 22nm FDX
  - Ariane: 64-bit, 6-stage RISC-V with CVFPU
    - FP64, FP32, FP16, BF16, FP8
    - SIMD for FP32 down to FP8



• TP-FPU is ~30% of the core, only +9% area



# Kosmodrom: TP-FPU Energy Measurements (in Silicon)



- Scalar: equal performance @ reduced energy
  - FP64: **13.4 pJ/flop**
  - FP8: 1.27 pJ/flop (10.5x more efficient)
  - Superlinear scaling
- Vector: improved performance @ equal energy
  - FP64: **13.4 pJ/flop**
  - FP8: 0.80 pJ/flop (16.8x more efficient)
  - Superlinear scaling
- Improved performance at improved energy
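The efficiency factors quoted above can be reproduced from the measured energies. In this sketch, "superlinear" is read as "the efficiency gain exceeds the 8x reduction in operand width from FP64 to FP8"; that reading is an assumption, not a statement taken from the measurements.

```python
FP64_PJ_PER_FLOP = 13.4
FP8_SCALAR_PJ_PER_FLOP = 1.27
FP8_VECTOR_PJ_PER_FLOP = 0.80
WIDTH_RATIO = 64 / 8  # FP64 operand width divided by FP8 operand width


def gain(reference_pj: float, reduced_pj: float) -> float:
    """Energy-efficiency gain of the reduced-precision operation."""
    return reference_pj / reduced_pj


scalar_gain = gain(FP64_PJ_PER_FLOP, FP8_SCALAR_PJ_PER_FLOP)
vector_gain = gain(FP64_PJ_PER_FLOP, FP8_VECTOR_PJ_PER_FLOP)

print(f"scalar FP8 gain: {scalar_gain:.2f}x")
print(f"vector FP8 gain: {vector_gain:.2f}x")
print(f"width ratio:     {WIDTH_RATIO:.0f}x")
print("superlinear:", scalar_gain > WIDTH_RATIO and vector_gain > WIDTH_RATIO)
```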







# A Transprecision FPU enhanced for NN training







### The Rapid Growth of AI





S. Lie, "Thinking outside the die: Architecting the ML accelerator of the future"

- NN models' memory and compute requirements are growing **rapidly**
- Technology scaling is not sufficient
- **Algorithmic** and **architectural** advancements are required





# **Algorithmic Advancements**

- New low-precision data types:
  - 32-bit → 19-bit → 16-bit → 8-bit floating-point (FP) data types
- Lower memory requirements
- Opportunities for more efficient hardware architectures
- Wide interest in standardization (RISC-V FP SIG, IEEE P3109)
- New mixed- and low-precision training algorithms have been developed to exploit the resilience of NN models to noise
- Expanding/widening **operations**, in which the accumulation is performed in higher precision
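A small experiment illustrates why expanding operations accumulate in a higher precision. The sketch uses Python's `struct` module to round values to FP16 (`"e"`) and FP32 (`"f"`); the helper names are illustrative. Summing 4096 ones in an FP16 accumulator stalls once the spacing between adjacent FP16 values exceeds the addend, while an FP32 accumulator keeps counting.

```python
import struct


def round_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_fp32(x: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(values, round_acc):
    """Sum values, rounding the accumulator to the given format after every add."""
    acc = 0.0
    for v in values:
        acc = round_acc(acc + v)
    return acc


ones = [round_fp16(1.0)] * 4096          # FP16 inputs
narrow = accumulate(ones, round_fp16)    # FP16 inputs, FP16 accumulator
wide = accumulate(ones, round_fp32)      # FP16 inputs, FP32 accumulator

print("FP16 accumulator:", narrow)
print("FP32 accumulator:", wide)
```

The inputs are identical in both runs; only the accumulator format changes, which is exactly the degree of freedom an expanding operation provides.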



# Efficient RISC-V Compute Clusters: Scalar + Parallel



- Manticore cluster: Snitch compute cluster\*
- Small integer scalar 32b cores coupled with large SIMD FPUs (FP64, FP32) sharing a scratchpad memory
- **ISA extensions** that implicitly encode loads/stores as register reads/writes + loops handled in HW → ~90% FPU utilization
- Need for narrow FP formats and expanding instructions to efficiently tackle mixed, low-precision NN training

\*https://github.com/pulp-platform/snitch

#### Academic Architecture for HPC - Manticore





architecture for ultraefficient floating-point computing)

- Manticore: a chiplet-based hierarchically-scalable architecture
- Linux-capable host + large manycore accelerator built upon the replication of efficient L1-shared **compute clusters**





#### Without Mixed-Precision Capabilities



In the case of **no mixed-precision** support:

If the application can work on FP16/FP16ALT inputs but accumulation in FP32 is required:

- FP16 with accumulation in FP32 requires additional conversions, which hurt performance
- Alternatively, an FP32 kernel can be used








• SIMD FP32 FMA








# SIMD Expanding FMA


- SIMD Expanding FMA: Unbalanced
- Consumes half of ft0, ft1 and the whole fa0



| `<configure loop>`           |     |
|------------------------------|-----|
| `Loop:`                      |     |
| `simd_exfma.a fa0, ft0, ft1` | N/2 |
| `simd_exfma.b fa1, ft0, ft1` | N/2 |
| `EndLoop:`                   |     |
| `<reduction>`                |     |









- Multiple instructions to cover all possible source locations needed
- The unbalanced ExFMA underutilizes the **FPU bandwidth** (2\*32+64 bits instead of 3\*64 bits)
- We can provide the FPU with more data and compute more every cycle
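The unbalanced operand usage can be sketched in Python as follows. The model assumes 64-bit registers holding four FP16 source lanes or two FP32 accumulator lanes, and assumes that the `.a`/`.b` variants select the lower and upper half of the source registers; both points are assumptions made for illustration, as is the function name `simd_exfma`.

```python
import struct

REGISTER_BITS = 64


def round_fp32(x: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def simd_exfma(acc, src_a, src_b, half):
    """Behavioral model of a SIMD expanding FMA.

    acc:   two FP32 accumulator lanes (whole destination register)
    src_a: four FP16 lanes, src_b: four FP16 lanes
    half:  0 uses lanes 0-1 of the sources, 1 uses lanes 2-3 (assumed .a/.b split)
    """
    lo = 2 * half
    # Products of FP16 values are exact in double precision; the result is
    # rounded once to FP32, approximating a fused multiply-add.
    return [round_fp32(acc[i] + src_a[lo + i] * src_b[lo + i]) for i in range(2)]


# Operand bits actually consumed per instruction versus what three full
# source registers could deliver
exfma_input_bits = 2 * (REGISTER_BITS // 2) + REGISTER_BITS
full_input_bits = 3 * REGISTER_BITS

ft0 = [1.0, 2.0, 3.0, 4.0]
ft1 = [1.0, 1.0, 1.0, 1.0]
fa0 = simd_exfma([0.0, 0.0], ft0, ft1, half=0)
fa1 = simd_exfma([0.0, 0.0], ft0, ft1, half=1)
print(fa0, fa1, exfma_input_bits, full_input_bits)
```

Two instructions are needed to consume one pair of source registers, which is the inefficiency that motivates a wider operation.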





# SIMD Expanding Sum of Dot Product for Mixed Precision

- With SIMD ExSdotp, 2x FLOP per cycle
- Same perf as SIMD FP16 FMA but more accurate
















### Why Fused ExSdotp Units?



- A cascade of two ExFMAs computes an expanding dot product
  - Non-associative FP addition
- Fused ExSdotp unit
  - Single normalization and rounding step
  - Opportunity to mitigate issues related to the non-associativity of the two consecutive additions
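The effect of the extra rounding step can be shown with a short numerical example. This is a behavioral sketch, not the hardware algorithm: FP16 inputs are multiplied exactly in double precision, and `round_fp32` models the rounding to the FP32 destination. The function names and the chosen input values are illustrative.

```python
import struct


def round_fp32(x: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def cascade_exfma(a, b, c, d, e):
    """Two chained expanding FMAs: the intermediate result is rounded to FP32."""
    t = round_fp32(a * b + e)
    return round_fp32(c * d + t)


def fused_exsdotp(a, b, c, d, e):
    """Fused expanding sum of dot product: a single rounding at the end."""
    return round_fp32(a * b + c * d + e)


# FP16-representable inputs whose products are each below half an FP32 ULP of e
a = c = 1.5 * 2.0 ** -12
b = d = 2.0 ** -13
e = 1.0

exact = a * b + c * d + e            # exactly representable in double precision
cascade = cascade_exfma(a, b, c, d, e)
fused = fused_exsdotp(a, b, c, d, e)

print("cascade error:", abs(cascade - exact))
print("fused error:  ", abs(fused - exact))
```

In the cascade each small product is rounded away on its own, while the fused version sees their sum and rounds the accumulator up, ending closer to the exact result.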



# **Targeted Floating-Point Formats**



Many formats + mixed-precision would result in a large ISA extension

Alternate formats are enabled by FCSRs to reduce the number of instructions

- A parametric design to enable fast exploration of new FP formats
- ExSdotp source formats:
  - FP16alt (1, 8, 7)
  - FP16 (1, 5, 10)
  - FP8alt (1, 5, 2)
  - FP8 (1, 4, 3)

- ExSdotp destination formats:
  - FP32 (1, 8, 23)
  - FP16alt (1, 8, 7)
  - FP16 (1, 5, 10)





\*Subnormals handled for all combinations of formats

# Expanding Sum of Dot Product Unit

- w = source format bitwidth; ps = source format mant bits
- 2w = destination format bitwidth; pd = destination format mant bits
- ExSdotp =  $a_w * b_w + c_w * d_w + e_{2w}$








- ExVsum =  $a_w * 1 + c_w * 1 + e_{2w}$
- Vsum  $= a_{2w} + c_{2w} + e_{2w}$
- ExVsum/Vsum to reduce and accumulate the results packed in a register after SIMD ExSdotp executions
- Support for non-expanding three-term sum added by bypassing the multiplications
- All the necessary logic is already present as the targeted ExSdotp operations were expanding
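The three operations can be modeled with one rounding helper. The sketch below assumes FP16 sources and an FP32 destination, computes the sums in double precision before a single rounding, and uses a hypothetical lane pairing (elements 0-1 feed one accumulator lane, elements 2-3 the other) to build a dot-product kernel. None of the names come from the CVFPU code base.

```python
import struct


def round_fp32(x: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def exsdotp(a, b, c, d, e):
    """ExSdotp = a_w * b_w + c_w * d_w + e_2w, rounded once to the 2w-bit format."""
    return round_fp32(a * b + c * d + e)


def exvsum(a, c, e):
    """ExVsum = a_w * 1 + c_w * 1 + e_2w: ExSdotp with both multiplicands set to 1."""
    return exsdotp(a, 1.0, c, 1.0, e)


def vsum(a, c, e):
    """Vsum = a_2w + c_2w + e_2w: non-expanding three-term sum."""
    return round_fp32(a + c + e)


def dot_fp16(x, y):
    """Dot product of FP16 vectors (length a multiple of 4) with FP32 accumulation."""
    acc = [0.0, 0.0]  # two FP32 lanes packed in one 64-bit register
    for i in range(0, len(x), 4):
        acc[0] = exsdotp(x[i], y[i], x[i + 1], y[i + 1], acc[0])
        acc[1] = exsdotp(x[i + 2], y[i + 2], x[i + 3], y[i + 3], acc[1])
    # Final reduction of the packed partial sums
    return vsum(acc[0], acc[1], 0.0)


print(dot_fp16([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]))
```

The final `vsum` call plays the role described above: it reduces the partial results that SIMD ExSdotp leaves packed in a register.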









- **CVFPU** is a highly-parameterized open-source **modular** energy-efficient multi-format FPU







- SIMD FMA unit
- As proposed in https://iis-git.ee.ethz.ch/smach/smallFloat-spec







- SIMD ExSdotp unit integrated into CVFPU as a new operation group block
- SIMD SDOTP: two 16-to-32-bit units and two 8-to-16-bit units
- Up to two 16-to-32-bit ExSdotp and four 8-to-16-bit ExSdotp per cycle
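A back-of-the-envelope model of peak throughput helps relate these figures to the earlier claims. It assumes a 64-bit datapath, counts each multiplication and each addition as one FLOP, and ignores pipeline stalls; the function names are illustrative.

```python
DATAPATH_BITS = 64


def fma_flops_per_cycle(src_bits: int) -> int:
    """SIMD FMA: one multiply and one add per lane."""
    lanes = DATAPATH_BITS // src_bits
    return 2 * lanes


def exfma_flops_per_cycle(src_bits: int) -> int:
    """SIMD expanding FMA: lanes are limited by the 2x wider destination format."""
    lanes = DATAPATH_BITS // (2 * src_bits)
    return 2 * lanes


def exsdotp_flops_per_cycle(src_bits: int) -> int:
    """SIMD ExSdotp: two multiplies and two adds per destination element."""
    units = DATAPATH_BITS // (2 * src_bits)
    return 4 * units


for w in (16, 8):
    print(f"{w}-bit sources: ExFMA {exfma_flops_per_cycle(w)}, "
          f"ExSdotp {exsdotp_flops_per_cycle(w)}, "
          f"non-expanding FMA {fma_flops_per_cycle(w)} FLOP/cycle")
print("FP64 FMA:", fma_flops_per_cycle(64), "FLOP/cycle")
```

Under these assumptions ExSdotp doubles the peak of the expanding FMA and matches the non-expanding SIMD FMA on the same source format, in line with the statements on the earlier slides.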





- Parametrizable number of pipeline levels. In our specific case, we selected:
  - SDOTP: 3 levels of pipeline registers
  - ADDMUL: 3 levels of pipeline registers
  - CAST: 2 levels of pipeline registers
  - COMP: 1 level of pipeline registers

#### ExSdotp & CVFPU: Area and Timing

- Implemented in GlobalFoundries 12nm FinFET technology
- Max frequency: 1.24 GHz (typical corner, 0.8 V, 25 °C)
- The fused ExSdotp unit allows for around **30%** area and critical path reduction with respect to a cascade of ExFMA modules.
- The SIMD SDOTP unit occupies 44.5 kGE, amounting to 27% of the enhanced FPU area (overall FPU area = 165 kGE).
- The full extension introduces less than 15% area overhead at the cluster level

















#### Same performance at a higher accumulation precision















More than **7x performance** and **energy efficiency** improvement with respect to FP64 computation (<15% area overhead on the entire cluster) + **reduced memory footprint** 









#### Conclusion

- Transprecision computing enables high efficiency and performance gains
- Transprecision support allows for exploiting a lower memory footprint
- Open-source, highly efficient and flexible FPU enabling transprecision computing on general-purpose architectures

https://github.com/pulp-platform/fpnew



#### 10+ ASICs including FPnew









# Not only general-purpose computing...







#### CVFPU Functional Units as Building Blocks for DSA Datapath

- CVFPU is a modular design
- The execution units can be reused as building blocks for domain-specific accelerator (DSA) datapaths.
- Example → RedMulE: floating-point GEMM accelerator (https://github.com/pulp-platform/redmule)
- RedMulE: HW accelerator for FP8/FP16 GEMMs
  - **2D** array of Computing Elements (**CEs**) operating in **lockstep** and distributed in rows and columns
  - CEs have private copies of their **X**-matrix elements; **W**-matrix elements are broadcast among rows of CEs.
- Input/output cast units (FP8 ↔ FP16) for high computing accuracy

GEMM: $\mathbf{Z} = (\mathbf{X} \times \mathbf{W}) + \mathbf{Y}$
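As a reference for the operation itself, the sketch below computes the GEMM in plain Python. It assumes that every accumulation step is rounded to FP16, which is a modeling choice for illustration and not a description of the RedMulE datapath; the function name `gemm_fp16` is hypothetical.

```python
import struct


def round_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def gemm_fp16(X, W, Y):
    """Compute Z = (X x W) + Y with FP16 rounding after every multiply-accumulate.

    X is rows x inner, W is inner x cols, Y is rows x cols (lists of lists).
    """
    rows, inner, cols = len(X), len(W), len(W[0])
    Z = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = Y[i][j]                      # start from the bias matrix Y
            for k in range(inner):
                acc = round_fp16(acc + X[i][k] * W[k][j])
            Z[i][j] = acc
    return Z


X = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
Y = [[0.5, 0.5], [0.5, 0.5]]
print(gemm_fp16(X, W, Y))
```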










## Thank you for your attention!







