

# ETH Zürich · DATE

## SpikeStream: Accelerating Spiking Neural Network Inference on RISC-V Clusters with Sparse Computation Extensions

S. Manoni<sup>1</sup>, P. Scheffler<sup>2</sup>, L. Zanatta<sup>3</sup>, A. Acquaviva<sup>1</sup>, L. Benini<sup>1,2</sup>, A. Bartolini<sup>1</sup>

<sup>1</sup>Department of Electrical, Electronic, and Information Engineering (DEI), University of Bologna, Italy

<sup>2</sup>Integrated Systems Lab (IIS), ETH Zurich, Switzerland

#### 1. Introduction and Motivation

- The pursuit of efficient, low-latency machine intelligence has led to neuromorphic systems inspired by the brain. These systems rely on computations and algorithms based on **Spiking Neural Networks (SNNs)**
- **SNNs** incorporate **spike**-based communication between neurons, activation sparsity, and complex neuronal dynamics while maintaining traditional neural network topologies
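As a rough illustration of spike-based communication and activation sparsity, the sketch below simulates a single leaky integrate-and-fire (LIF) neuron. The LIF model, the decay factor `beta`, the threshold, and the soft reset are assumptions chosen for illustration; the poster does not state which neuron model is used.

```python
def lif_step(v, input_current, beta=0.9, threshold=1.0):
    """Advance one leaky integrate-and-fire neuron by one timestep.

    Returns (new_membrane_potential, spike) with spike in {0, 1}.
    All parameters are illustrative assumptions.
    """
    v = beta * v + input_current        # leak, then integrate the input
    spike = 1 if v >= threshold else 0  # binary output: fire or stay silent
    if spike:
        v -= threshold                  # soft reset after firing
    return v, spike


def run_lif(inputs, beta=0.9, threshold=1.0):
    """Feed a sequence of input currents and collect the binary spike train."""
    v, spikes = 0.0, []
    for x in inputs:
        v, s = lif_step(v, x, beta, threshold)
        spikes.append(s)
    return spikes
```

The output is a binary sequence that is mostly zeros for weak inputs, which is the sparsity that a spike-aware kernel can exploit.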



- Traditional CPUs and GPUs struggle to deliver high efficiency in the presence of spikes and sparsity
- Dedicated accelerators are often designed exclusively for SNN models, making them an **expensive** and **inflexible** solution

#### 3. SpikeStream Software Architecture

- We introduce target architecture-aware optimizations
  - **Tensor compression**: CSR-derived fibre tree format storing binary activations as spike positions using indices and spatial pointers
  - **Task parallelization**: Computation is parallelized across the Snitch Cluster cores (one receptive field per core). Workload stealing with atomic tagging balances the irregular per-core load
  - **Data parallelization**: Batched HWC weight layout enables output channel parallelism across FPU lanes
  - **Double buffering**: DMA core used for sparse activation tiling
  - **Streaming Acceleration (SA)**: Indirect weight loads mapped to indexed streams (SR-managed address generation and memory operations), decoupling the FPU via hardware-loop control
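To make the compression format and the indirect weight accesses more concrete, here is a minimal Python reference sketch; it is not the SpikeStream kernel itself. It assumes a CSR-like layout with one pointer per pixel and a flat array holding the indices of the spiking channels. The names `compress_spikes` and `accumulate_pixel`, and the weight layout `weights[input_channel][output_channel]`, are hypothetical choices made for illustration.

```python
def compress_spikes(dense):
    """Compress a binary H x W x C activation tensor given as nested lists.

    Returns (ptr, idx). For the flattened pixel p = h * W + w, the channels
    that spiked are idx[ptr[p]:ptr[p + 1]], analogous to CSR row pointers
    and column indices.
    """
    ptr, idx = [0], []
    for row in dense:
        for pixel in row:
            idx.extend(c for c, s in enumerate(pixel) if s)
            ptr.append(len(idx))
    return ptr, idx


def accumulate_pixel(ptr, idx, p, weights, acc):
    """Add the weight row of every spiking input channel of pixel p to acc.

    weights[c] is a contiguous row over output channels (assumed layout),
    so the inner loop is the part that could map onto SIMD lanes.
    """
    for k in range(ptr[p], ptr[p + 1]):
        w_row = weights[idx[k]]       # indirect (gather) access via spike index
        for oc in range(len(acc)):
            acc[oc] += w_row[oc]      # no multiply needed: activations are binary
    return acc
```

For the 1 x 2 x 4 input `[[[0, 1, 0, 1], [0, 0, 0, 0]]]`, `compress_spikes` returns `ptr = [0, 2, 2]` and `idx = [1, 3]`: the first pixel has spikes on channels 1 and 3, and the second pixel has none, so it contributes no work at all.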



- Stream Registers (SRs) emerged as a CPU extension to overcome memory bottlenecks, enabling hardware-managed data streaming with **support for indirect (gather/scatter) access** for sparse workloads
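The idea of an indirect read stream can be modelled in a few lines of software. The sketch below is a toy model only: the class `IndirectStream` and its `read` method are hypothetical names, not an actual SR programming interface, and the real hardware would generate addresses and issue loads without any of this running on the core.

```python
class IndirectStream:
    """Toy software model of an indirect (gather) read stream.

    Once configured with a base array and an index array, every read()
    returns base[indices[i]] for the next i, so the consumer performs
    no address arithmetic of its own.
    """

    def __init__(self, base, indices):
        self.base = base
        self.indices = indices
        self.pos = 0

    def read(self):
        value = self.base[self.indices[self.pos]]
        self.pos += 1
        return value


def stream_sum(base, indices):
    """Sum the gathered elements by draining the stream."""
    stream = IndirectStream(base, indices)
    total = 0.0
    for _ in range(len(indices)):  # stands in for a fixed-count hardware loop
        total += stream.read()     # the consumer only sees a sequence of values
    return total
```

The point of the model is the split of responsibilities: the loop body contains only the arithmetic, while the index lookup and the data fetch live entirely inside the stream.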

#### 2. Target Platform: Snitch Cluster

- 8 RV32G Cores enhanced with
  - Double-precision FPU
  - Indirect SRs
  - Floating-point HW loops
  - **FP SIMD extension**
- 1 DMA Core
- 128 KiB Low-Latency Scratchpad Memory (SPM)
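With eight compute cores sharing one scratchpad, irregular sparse work has to be distributed dynamically, as noted under task parallelization in Section 3. The sketch below shows a simplified stand-in: dynamic self-scheduling through a shared fetch-and-add counter, which is not necessarily the poster's workload-stealing scheme with atomic tagging. Python threads and a lock emulate the cores and the hardware atomic; `AtomicCounter` and `process_all` are hypothetical names.

```python
import threading

NUM_CORES = 8  # compute cores in the cluster, per the list above


class AtomicCounter:
    """Lock-protected fetch-and-add, standing in for a hardware atomic."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, n=1):
        with self._lock:
            old = self._value
            self._value += n
            return old


def process_all(num_fields):
    """Let each worker repeatedly claim the next unprocessed receptive field.

    Returns one list per worker with the field indices it claimed.
    """
    counter = AtomicCounter()
    claimed = [[] for _ in range(NUM_CORES)]

    def worker(core_id):
        while True:
            i = counter.fetch_add()
            if i >= num_fields:
                return                  # nothing left to claim
            claimed[core_id].append(i)  # real kernel work would go here

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(NUM_CORES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

Whichever worker finishes a cheap receptive field first simply claims the next one, so a core that drew fields with few spikes does not sit idle while others finish dense ones. Which worker ends up with which field is not deterministic; only the guarantee that every field is claimed exactly once is.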



Contact: s.manoni@unibo.it