# A RISC-V ISA Extension for Chaining in Scalar Processors



Luca Colagrande<sup>1</sup>, Jayanth Jonnalagadda<sup>2</sup>, Luca Benini<sup>1,3</sup>

<sup>1</sup>Integrated Systems Laboratory (IIS), ETH Zürich; <sup>2</sup>D-ITET, ETH Zürich; <sup>3</sup>DEI, University of Bologna

#### 1 Introduction

Modern general-purpose accelerators integrate a large number of programmable, area- and energy-efficient processing elements (PEs) to deliver high performance while meeting stringent power delivery and thermal dissipation constraints. In this context, PEs are often implemented as scalar in-order cores, which are highly sensitive to **pipeline stalls**. Traditional software techniques, such as loop unrolling, mitigate the issue at the cost of increased register pressure, limiting flexibility. We propose **scalar chaining**, a novel hardware-software solution that addresses this issue without incurring the drawbacks of traditional software-only techniques. We demonstrate our solution on register-limited stencil codes, achieving >93% FPU utilization, a 4% **speedup**, and 10% higher energy efficiency, on average, over highly optimized baselines.

#### 2 Implementation

We implement **dataflow** (or **FIFO**) **semantics** in the scalar in-order Snitch<sup>[1]</sup> core, to **chain** functional units (FUs) through the register file (RF) and ensure that values from the producer FU are not overwritten until they are used by the consumer FU.

The producer FU's pipeline registers form a **logical FIFO**, which can be effectively used to store intermediate results from **loop unrolling**, without adding **pressure** on the architectural RF. The RF is augmented with a valid bit per register to implement head-of-line blocking at the consumer's side of the logical FIFO.

| Naive | Loop unrolling | Loop unrolling + Scalar chaining |
|-------|----------------|----------------------------------|
| 4-cycle RAW stall | No stall, increased register pressure | No stall, no register pressure increase |
| <pre>fadd.d ft3, ft0, ft1<br/>fmul.d ft2, ft3, %[b]<br/>addi %[i], %[i], 1<br/>bneq %[i], %[len], -12</pre> | <pre>fadd.d ft3, ft0, ft1<br/>fadd.d ft4, ft0, ft1<br/>fadd.d ft5, ft0, ft1<br/>fadd.d ft6, ft0, ft1<br/>fmul.d ft2, ft3, %[b]<br/>fmul.d ft2, ft4, %[b]<br/>fmul.d ft2, ft5, %[b]<br/>fmul.d ft2, ft6, %[b]<br/>addi %[i], %[i], 4<br/>bneq %[i], %[len], -36</pre> | <pre>li %[mask], 8<br/>csrs 0x7C3, %[mask]<br/>fadd.d ft3, ft0, ft1<br/>fadd.d ft3, ft0, ft1<br/>fadd.d ft3, ft0, ft1<br/>fadd.d ft3, ft0, ft1<br/>fmul.d ft2, ft3, %[b]<br/>fmul.d ft2, ft3, %[b]<br/>fmul.d ft2, ft3, %[b]<br/>fmul.d ft2, ft3, %[b]<br/>addi %[i], %[i], 4<br/>bneq %[i], %[len], -36<br/>csrs 0x7C3, x0</pre> |

In the scalar chaining example, the fadd.d and fmul.d instructions are chained through ft3 (the FPU is both the consumer and producer FU). Stream semantics<sup>[2]</sup> are assigned to ft0, ft1 and ft2.

#### 3 Results and Discussion

On a Snitch cluster implemented in GlobalFoundries' 12LP+ FinFET technology using Fusion Compiler 2023.12, with a target clock frequency of 1 GHz, our extensions introduce **negligible area and timing overheads**, on the order of synthesis process variability margins.

We evaluate our implementation on two register-limited stencil codes<sup>[3]</sup>, box3d1r and j3d27pt. By applying chaining, we can free enough registers to fully store the stencil coefficients in the RF, achieving a 4% speedup and 10% higher energy efficiency, on average, over the highly optimized baselines in [3], and >93% FPU utilization.
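The FIFO semantics used in the chaining listing of Section 2 can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the Snitch RTL. The names `ChainedRegister`, `unrolled`, `unrolled_chained` and `chain_mask` are hypothetical, as is the `depth` parameter standing in for the number of producer pipeline registers. The mask helper assumes one enable bit per floating-point register, which is consistent with the value 8 written for ft3 (f3) in the listing but is an inference rather than something the poster states.

```python
from collections import deque


class ChainedRegister:
    """Toy model of one register with FIFO (chaining) semantics."""

    def __init__(self, depth):
        # Assumption: 'depth' abstracts the producer FU's pipeline registers.
        self.depth = depth
        self.fifo = deque()

    @property
    def valid(self):
        # Valid bit: set while an unread value is waiting for the consumer.
        return len(self.fifo) > 0

    def write(self, value):
        # In hardware the producer would stall; the model raises instead.
        if len(self.fifo) >= self.depth:
            raise RuntimeError("producer must stall: FIFO full")
        self.fifo.append(value)

    def read(self):
        # Head-of-line blocking: the consumer cannot proceed without a valid value.
        if not self.valid:
            raise RuntimeError("consumer must stall: no valid value")
        return self.fifo.popleft()


def unrolled(a, b, scale):
    """Plain unrolling: each sum occupies its own register (ft3..ft6)."""
    sums = [x + y for x, y in zip(a, b)]
    return [s * scale for s in sums]


def unrolled_chained(a, b, scale, depth=4):
    """Unrolling with chaining: every sum is written to, and read from, ft3."""
    ft3 = ChainedRegister(depth)
    for x, y in zip(a, b):
        ft3.write(x + y)                      # fadd.d ft3, ft0, ft1
    return [ft3.read() * scale for _ in a]    # fmul.d ft2, ft3, %[b]


def chain_mask(reg_indices):
    """Assumed encoding: bit i enables chaining on FP register f<i>."""
    mask = 0
    for i in reg_indices:
        mask |= 1 << i
    return mask
```

In this model, `unrolled_chained([1, 2, 3, 4], [10, 20, 30, 40], 2)` returns the same list as `unrolled` with the same arguments, while naming only one intermediate register instead of four. Values leave the FIFO in the order they were produced, which is why the four `fmul.d` instructions can all read ft3.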

#### References

[1] F. Zaruba et al., "Snitch: A tiny pseudo dual-issue processor for area and energy efficient execution of floating-point intensive workloads," IEEE Transactions on Computers, vol. 70, no. 11, pp. 1845–1860, 2021.

[2] F. Schuiki et al., "Stream semantic registers: A lightweight RISC-V ISA extension achieving full compute utilization in single-issue cores," IEEE Transactions on Computers, vol. 70, pp. 212–227, 2021.

[3] P. Scheffler et al., "SARIS: Accelerating stencil computations on energy-efficient RISC-V compute clusters with indirect stream registers," in DAC'24: Proceedings of the 61st ACM/IEEE Design Automation Conference.

[Figure: FPU utilization and mean power consumption [mW]]

#### 4 Conclusion

We presented a novel hardware-software solution to hide FU latencies in scalar in-order processors without the increased register pressure incurred by traditional software-only techniques. With negligible area and timing cost, our solution is lightweight and suited for integration into highly area- and energy-efficient cores.