Introduction

Figure 1. A high-level overview of the Saturn Vector Unit

This manual describes the Saturn Vector Unit, a parameterized and extensible vector microarchitecture implementing the RISC-V Vector (RVV) extension. Saturn was developed to address an academic need for a representative, compliant, and flexible generator of RISC-V vector units targeting deployment in domain-specialized cores. Saturn is implemented as a parameterized Chisel RTL generator, enabling a range of possible Saturn configurations across many target deployment scenarios. This document discusses the microarchitectural details of all Saturn components.

  • Chapter 1 describes the motivation for Saturn and compares Saturn’s design approach to those of existing data-parallel microarchitecture archetypes

  • Chapter 2 discusses the system organization of Saturn

  • Chapter 3 describes the microarchitecture of Saturn’s vector frontend unit

  • Chapter 4 describes the microarchitecture of Saturn’s vector load-store unit

  • Chapter 5 describes the microarchitecture of Saturn’s datapath and vector instruction sequencers

  • Chapter 6 provides guidance on writing efficient vector code for Saturn

  • Chapter 7 discusses the historical context of Saturn within past academic vector units

Objectives

Saturn was developed with the following objectives:

  • Provide a representative baseline implementation of the RISC-V Vector specification

  • Support full compliance with the RVV specification, including support for virtual memory and precise faults

  • Target performant ASIC implementations, rather than FPGA deployments

  • Be sufficiently parameterized to support configurations across a wide power/performance/area design space

  • Demonstrate efficient scheduling of vector operations on a microarchitecture with a short hardware vector length

  • Implement a SIMD-style microarchitecture, comparable to existing SIMD datapaths in DSP microarchitectures

  • Integrate with existing efficient, area-compact scalar cores, rather than high-IPC general-purpose cores

  • Support extensibility with custom vector instructions, functional units, and accelerators that leverage the baseline capabilities of the standard RVV ISA

  • Target deployment as part of a DSP core or similarly domain-specialized core, instead of general-purpose systems

Questions, bug reports, or requests for further documentation can be made to jzh@berkeley.edu.

1. Background

This chapter presents a background discussion of deployment scenarios for data-parallel systems and the dominant architectures in the commercial space for such systems. Saturn, as an implementation of a modern scalable vector ISA targeting deployment in specialized cores, fills an underexplored niche in the space of open-source data-parallel microarchitectures. Saturn’s microarchitectural philosophy is also compared to alternative vector approaches.

1.1. Data-parallel Workloads

We broadly categorize the range of deployment scenarios and workloads for programmable data-parallel systems into four domains. These categories are differentiated by power/area constraints and workload characteristics. Figure 2 outlines the four deployment domains where DLP-enabled architectures are critical, and the salient architectural and microarchitectural archetypes that have come to dominate each domain.

Figure 2. Taxonomy of data-parallel domains and their dominant architectures

HPC, scientific computing, and ML training workloads are typically characterized by very long application vector lengths, exhibiting a high degree of data-level parallelism. Systems targeting these workloads are generally designed and deployed at datacenter scale, with available data-level parallelism exploited across the many processors in a multi-node cluster. While single-core efficiency and performance remain important for this domain, scalability towards multi-core, multi-node systems is of more critical concern.

General-purpose server, desktop, and mobile workloads intersperse data-level-parallel codes with sequential, control-flow-heavy non-data-parallel blocks. The implications of Amdahl’s Law dictate the design of these systems; exploiting data-level parallelism must not come at the expense of single-thread performance or thread-level parallelism. Compared to the other domains, compiler- and programmer-friendly ISAs are particularly important for these systems, as they must remain performant on the greatest diversity of workloads.

Digital signal processing (DSP) and similar domain-specialized workloads are fundamentally data-parallel algorithms. These workloads are often offloaded in modern SoCs to specialized "DSP cores" optimized towards power and area-efficient exploitation of data-level parallelism, with less consideration towards general-purpose performance. "DSP cores" usually run optimized kernels written by programmers at a low level to maximize efficiency. Large SoCs typically integrate many such "DSP cores" across various design points to provide optimized compute for different subsystems.

Applications executing on embedded devices face extreme constraints around power, area, and code size. Exploiting data-level parallelism can be critical for these workloads as well, but might come as a secondary concern compared to power and area. As these cores often provide a self-contained memory system, code and data size are more critical architectural considerations.

1.2. Data-parallel Architectures

The requirements of the common data-parallel workload domains have induced the commercial space to converge towards several fundamental architectural and microarchitectural archetypes.

1.2.1. Long-vector Implementations

Long-vector ISAs and their many-lane vector-unit implementations trace their heritage most clearly to the early Cray-style vector units. Modern incarnations of such systems primarily target the HPC and datacenter space, where long application vector lengths map cleanly into high-capacity vector register files.

Microarchitecturally, these implementations distribute the vector register file, functional units, and memory access ports across many vector lanes, with scalability towards many lanes being a key requirement. A single core of such a vector unit might have upwards of 32 parallel vector lanes, with a hardware vector length of over 16384 bits.

Notably, a lane-oriented microarchitecture with distributed memory ports and a high-radix memory interconnect enables high performance on codes which require high address generation throughput. Indexed loads and stores can leverage the distributed memory access ports to maintain high memory utilization, while register-register scatters and gathers can leverage a complex yet scalable cross-lane interconnect.

The NEC Aurora [3] is a modern example of a commercial long-vector microarchitecture, while Hwacha [11], Ara [14], and Vitruvius [12] are examples of academic long-vector implementations.

1.2.2. SIMT GPGPUs

GPGPUs execute a specialized SIMT ISA across many multi-threaded cores. GPGPUs evolved from the need for increasingly programmable shader pipelines in graphics workloads. Programming models for shaders developed into general-purpose SIMT frameworks like CUDA and OpenCL. GPU microarchitectures similarly became more general-purpose, eventually maturing into modern GPGPU microarchitectures.

Notably, the multi-core SIMT archetype in GPGPUs naturally adapts to a range of deployment domains. By scaling the number of multi-threaded cores, GPGPU microarchitectures can target integrated GPUs in general-purpose SoCs, graphics accelerator cards for desktop/consumer systems, as well as interconnected compute cards for datacenter-scale HPC systems. The rise of GPGPUs is a contributing factor to the decline of specialized long-vector microarchitectures for datacenter HPC systems.

In the commercial space, NVIDIA and AMD dominate the GPGPU market. Their HPC and datacenter products, while originally derived from consumer GPU architectures, have since diverged into more differentiated, yet still multi-core-SIMT, microarchitectures.

Within the general-purpose mobile consumer space, AMD, Intel, Apple, Qualcomm, and Huawei all deploy embedded GPGPU systems within their SoCs. These vendors all provide developer SDKs for supporting general purpose applications on their GPGPU microarchitectures.

1.2.3. General-purpose ISAs with SIMD

General-purpose ISAs with SIMD extensions enable exploiting data-level parallelism in a microarchitecture already optimized towards scalar IPC. SIMD extensions evolved from subword-SIMD ISAs designed to pack a small fixed-length vector of sub-register-width elements within a single scalar register. Subword-SIMD is a cheap way to integrate limited forms of data-parallel compute, particularly for multimedia applications, into a general-purpose processor, without requiring substantial redesign of complex microarchitectural components. Modern SIMD ISAs have generally moved beyond subword-SIMD (except in the embedded space) towards wider SIMD registers.

Microarchitectures of general-purpose cores perform instruction scheduling of SIMD instructions using similar mechanisms as they do for scalar instructions. Existing features like superscalar issue, out-of-order execution, and register renaming can be reused to maintain high instruction throughput into the SIMD functional units. This paradigm has largely held even in modern general-purpose-SIMD cores, which feature much wider SIMD registers than scalar registers for improved performance on data-parallel workloads.

Practically all widely deployed commercial general-purpose cores ship with SIMD extensions. Intel and AMD out-of-order cores support some form of the SSE or AVX extensions [6], while ARM’s A-profile [1] architecture requires NEON.

1.2.4. VLIW ISAs with SIMD

VLIW ISAs with SIMD extensions have traditionally dominated the space of DSP and similar domain-specialized cores, as well as embedded cores. For these deployment scenarios, power and area constraints preclude the deployment of costly microarchitectural features towards high instruction throughput in a general-purpose ISA, such as speculative execution and register renaming. The structure of DSP kernels presents a good fit for the VLIW paradigm, since regular loops, rather than branch-dense control code, are easier to optimize on a VLIW microarchitecture.

Furthermore, DSP applications often require more regularly behaved memory systems to achieve strict QoS requirements, leading DSP cores to integrate low- or fixed-latency software-managed scratchpad memories, rather than caches with dynamic, unpredictable behavior. Applications and microarchitectures which prefer statically predictable memory systems are especially well-suited for VLIW ISAs.

However, VLIW-based ISAs are notoriously difficult to program compared to general-purpose ISAs or vector ISAs. Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelined loops. Nonetheless, specialized VLIW ISAs provide a microarchitecturally simple and efficient programmer-visible mechanism for maintaining high instruction throughput into SIMD functional units.

Cadence, Synopsys, CEVA, and Qualcomm all ship commercial VLIW DSPs with SIMD extensions. Cadence, Synopsys, and CEVA cores are IP products typically integrated into a customer’s SoC as an embedded core, while Qualcomm’s Hexagon DSP Cores are integrated throughout their SoC line to provide DSP compute.

1.2.5. Scalable Vector ISAs

In contrast to the above patterns, modern scalable vector ISAs aspire to provide a common unified ISA that can support a range of microarchitectural implementation styles, supporting long-vector Cray-like machines, general-purpose out-of-order machines with vector extensions, specialized DSP cores with vector extensions, as well as ultra-compact embedded vector units. The dominant examples of such ISAs include ARM’s proprietary SVE and MVE [2] extensions, as well as the open RISC-V Vector extension [4].

Existing academic implementations of RVV have broadly targeted the HPC and general-purpose deployment scenarios with long-vector or out-of-order-core-integrated microarchitectures. Compared to most prior academic implementations, Saturn targets DSP and domain-specialized cores, and represents a class of designs we call "short-vector".

Existing open-source "short-vector" implementations like Spatz [9], Vicuna [15], and RISCV2 [13] rely on either register renaming or a constrained memory system. RISCV2 requires register renaming, while Spatz and Vicuna assume a low-latency memory system; Vicuna further requires a global stall to adapt to variable memory latencies. Unlike these designs, Saturn demonstrates that a "short-vector" design can achieve high performance and efficiency without any architectural or microarchitectural sacrifices.

1.3. The RISC-V Vector ISA

The RISC-V Vector ISA is the standard extension in RISC-V for exploiting data-level parallelism. A full discussion of the ISA design can be found in its specification [4]. This section highlights several properties of RVV that pose notable challenges to implementation or distinguish it from other vector ISAs.

1.3.1. Dynamic Vector Type and Length

Stripmine loops in RVV use vset instructions to dynamically adjust vector configuration state in the body of the loops. These instructions set the dynamic vl vector length register in addition to the vtype register, which sets the element width, register grouping, and mask/tail agnosticity for subsequent operations.
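As a concrete illustration, the following sketch of a stripmined vector add uses the RVV C intrinsics; the intrinsic names assume a recent release of the RVV intrinsics specification and are not Saturn-specific. The key point is that a vset executes in every iteration of the hot loop:

```c
#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

// Stripmined vector add: every loop iteration re-executes a vset
// (via __riscv_vsetvl_e32m1), updating vl and vtype inside the hot loop.
void vec_add(const int32_t *a, const int32_t *b, int32_t *c, size_t n) {
  while (n > 0) {
    size_t vl = __riscv_vsetvl_e32m1(n);           // sets vl and vtype
    vint32m1_t va = __riscv_vle32_v_i32m1(a, vl);  // vector loads
    vint32m1_t vb = __riscv_vle32_v_i32m1(b, vl);
    vint32m1_t vc = __riscv_vadd_vv_i32m1(va, vb, vl);
    __riscv_vse32_v_i32m1(c, vc, vl);              // vector store
    a += vl; b += vl; c += vl; n -= vl;
  }
}
```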

A naive implementation of RVV might treat the vtype as a single system-wide register, owing to its effect on the behavior of many components of the vector datapath. However, such an approach would substantially degrade performance, as vset is used in the inner loops of vector kernels to effect stripmining or to enable mixed-precision kernels.

As a result, performant implementations must maintain copies of the vtype and vl registers, instead of maintaining a single global status. Since neither vtype nor vl requires many bits to encode, this state can be renamed into a control bundle that propagates alongside each vector instruction in the datapath.

Furthermore, since vtype and vl affect the generation of precise faults by vector memory instructions, it is insufficient to update these registers only at commit, as precise faults must be generated ahead of commit. Updating them only at commit would introduce an interlock between a vset and a subsequent vector memory operation, which would have to stall until the vset commits before using the updated vtype/vl to check for precise faults. Instead, performant scalar core implementations should bypass updates of vtype and vl to an early stage in the pipeline.

1.3.2. Memory Disambiguation

RVV mandates that vector memory operations appear to execute in instruction order with respect to all other instructions on the same hart, including scalar memory instructions. While an alternative ISA design may have relaxed this ordering requirement, such an approach would necessitate costly, precisely placed programmer-inserted fences to enforce scalar-vector memory ordering.

This requirement for scalar-vector memory disambiguation poses a challenge to decoupled post-commit vector unit implementations, in which vector loads and stores might run behind scalar loads and stores. Stalling scalar loads and stores until the vector loads and stores drain could have costly implications for kernels which would naturally benefit from overlapping scalar and vector memory operations. For instance, in an input-stationary matrix multiplication, the inner loop interleaves scalar loads of an input tile with vector loads and stores of the output. Such a kernel naturally requires efficient scalar-vector memory disambiguation.
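A minimal sketch of such a loop, written with the RVV C intrinsics, is shown below; the kernel shape, names, and intrinsic spelling are illustrative assumptions rather than Saturn-specific code. The scalar load of the stationary input interleaves with vector loads and stores of the output row:

```c
#include <riscv_vector.h>
#include <stddef.h>

// Inner loops of an input-stationary matmul tile: the scalar load of a[k]
// interleaves with vector loads and stores of the output row, so scalar
// and vector memory accesses must be disambiguated to overlap.
void matmul_row(const float *a, const float *b, float *out,
                size_t k_dim, size_t n, size_t ldb) {
  for (size_t k = 0; k < k_dim; k++) {
    float a_k = a[k];                    // scalar load of the stationary input
    const float *b_row = b + k * ldb;
    float *o = out;
    for (size_t rem = n; rem > 0; ) {
      size_t vl = __riscv_vsetvl_e32m1(rem);
      vfloat32m1_t vb = __riscv_vle32_v_f32m1(b_row, vl);  // vector load
      vfloat32m1_t vo = __riscv_vle32_v_f32m1(o, vl);      // vector load of output
      vo = __riscv_vfmacc_vf_f32m1(vo, a_k, vb, vl);       // out += a_k * b_row
      __riscv_vse32_v_f32m1(o, vo, vl);                    // vector store of output
      b_row += vl; o += vl; rem -= vl;
    }
  }
}
```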

Performant implementations should allow concurrent execution of scalar and vector memory operations by performing precise early-stage memory disambiguation of vector memory accesses.

1.3.3. Precise Faults

RVV mandates precise faults for vector memory operations. Faulting vector loads and stores must execute up until the element which causes the fault, report the element index that generated the fault, and block commit of any younger scalar or vector instructions. This implies that implementations must check for precise faults ahead of commit.

However, performing all address generation ahead of commit would have significant negative performance consequences, as this would stall unrelated scalar instructions even in the common case where instructions do not fault. Performant implementations should expediently commit vector memory instructions in the common case where they do not fault, and only interlock the scalar core in the uncommon case where a fault is present.

1.3.4. Vector Register Grouping

The LMUL (length multiplier) register grouping field of vtype enables grouping consecutive vector registers into a single longer vector register. In addition to enabling mixed-precision operations, this feature allows kernels that do not induce vector register pressure to access an effectively longer hardware vector length. Generally, performance programmers for RISC-V will use this feature to reduce the dynamic instruction count of their loops and potentially improve the utilization of hardware compute resources. For example, vector memcpy induces no register pressure and can trivially set a high LMUL to reduce dynamic instruction count. Since higher LMUL settings effectively unroll instructions in hardware, LMUL also reduces static code size by reducing the need to unroll loops in software.
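As a hedged sketch (RVV C intrinsics, with names per a recent intrinsics release), a vector memcpy at LMUL=8 moves eight registers' worth of data per load/store pair:

```c
#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

// Vector memcpy at LMUL=8: each vle8/vse8 pair moves eight vector
// registers' worth of bytes, so far fewer instructions are fetched and
// issued per loop iteration than an LMUL=1 version would require.
void vmemcpy(uint8_t *dst, const uint8_t *src, size_t n) {
  while (n > 0) {
    size_t vl = __riscv_vsetvl_e8m8(n);
    vuint8m8_t v = __riscv_vle8_v_u8m8(src, vl);
    __riscv_vse8_v_u8m8(dst, v, vl);
    src += vl; dst += vl; n -= vl;
  }
}
```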

Thus, implementations should not penalize code which uses high LMUL to reduce instruction fetch pressure. The general intuition around vector code should be to use the highest LMUL setting while avoiding register spills. Implementations should strive to support this intuition.

One tempting approach to implementing register grouping behavior would be to crack LMUL > 1 instructions early in the pipeline and implement the backend instruction scheduling around LMUL = 1. While this strategy is straightforward to implement as it simplifies the instruction scheduling, it may cause performance issues due to pressure on datapath scheduling resources from the many micro-ops generated by cracked high-LMUL instructions. Alternatively, the addition of queueing resources to reduce this pressure would add significant area and power overhead.

1.3.5. Segmented Memory Instructions

Segmented memory instructions enable a "transpose" of an "array-of-structs" data representation in memory into a "struct-of-arrays" in consecutive vector registers. Such instructions, while very complex behaviorally, are fundamental to many algorithms and datatypes. For instance, complex numbers and image pixel data are conventionally stored in memory as "arrays-of-structs".
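As a behavioral reference (plain C, not an intrinsics or Saturn-specific example), the scalar loop below performs the repacking that a single two-field segmented load such as vlseg2e32.v accomplishes per stripmine iteration:

```c
#include <stddef.h>

// Memory holds an array-of-structs of {re, im} pairs (e.g. complex floats).
struct cplx { float re, im; };

// Behavioral equivalent of one two-field segmented load (vlseg2e32.v):
// vl consecutive segments are read from memory and their fields are
// split into two destination "vector registers", modeled here as arrays.
void seg2_load_ref(const struct cplx *mem, float *v_re, float *v_im,
                   size_t vl) {
  for (size_t i = 0; i < vl; i++) {
    v_re[i] = mem[i].re;  // field 0 -> first destination register group
    v_im[i] = mem[i].im;  // field 1 -> second destination register group
  }
}
```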

These instructions are critical for repacking data in memory into element-wise format for vector instructions. Compared to other vector or SIMD ISAs, RVV provides few facilities for register-register repacking, instead relying on segmented memory instructions to perform "on-the-fly" repacking between memory and registers.

Given the importance of these instructions, performant RVV implementations should not impose excessive performance overhead on their execution. Vector code which uses these memory operations should perform no worse than equivalent code which explicitly transforms the data over many vector instructions.

1.4. Short-Vector Execution

Saturn’s instruction scheduling mechanism differentiates it from the relevant comparable archetypes for data-parallel microarchitectures. Fundamentally, Saturn depends on efficient dynamic scheduling of short-chime short vectors, without costly register renaming. When LMUL is low (1 or 2), vector chimes may be only 2 to 4 cycles long, requiring higher-throughput instruction scheduling than in a long-chime machine.
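For example, in an illustrative (not Saturn-specific) configuration with a 256-bit vector length and a 128-bit datapath, an LMUL=1 instruction occupies its functional unit for only 256/128 = 2 cycles, and an LMUL=2 instruction for 4 cycles.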

Figure 3. Pipeline diagram of instruction execution with short vector lengths, zero dead time, limited out-of-order execution, and chaining. Instructions in the X and M pipelines can execute out-of-order w.r.t. each other.

Figure 3 depicts a simplified pipeline visualization of a short vector loop, consisting of a load and a dependent arithmetic instruction, executing on a Saturn-like short-vector datapath. In this example, each vector chime is 2 cycles.

A short-vector machine should fully saturate both the arithmetic and memory pipelines with such short vector lengths and chimes. Instruction throughput requirements are moderate, but can still be fulfilled with a modest superscalar in-order scalar core. Notably, some degree of out-of-order execution, beyond just chaining, is necessary to enable saturating both the memory and arithmetic pipelines.

Figure 4. Pipeline diagram of instruction execution with short vector lengths, 1-cycle dead time, limited out-of-order execution, and chaining. Instructions in the X and M pipelines can execute out-of-order w.r.t. each other.

Figure 4 highlights the importance of zero dead time for short-vector microarchitectures. Unlike in Figure 3, the machine in this example has a 1-cycle dead time for each functional unit, perhaps due to an inefficiency in freeing structural resources as instructions are sequenced. Notably, a single cycle of dead time in a short-chime machine substantially degrades the utilization of the datapaths, as the dead time cannot be amortized.

Figure 5. Pipeline diagram of instruction execution with short vector lengths, zero-cycle dead time, strict in-order execution, and chaining. Instructions in the X and M pipelines must begin execution in-order.

Figure 5 highlights the importance of limited out-of-order execution for short-vector microarchitectures like Saturn. Unlike in Figure 3, the machine in this example requires instructions to enter the datapath in order. Requiring strict in-order execution would substantially degrade performance for suboptimally scheduled vector code. Despite the same zero-cycle dead time, the restriction to in-order execution prevents the machine from aggressively issuing instructions to hide the latency in the M pipe.

1.4.1. Compared to Long-Vector Units

Long-vector microarchitectures feature very long vector lengths distributed across many parallel vector lanes. Such implementations typically store their vector register files in dense, lane-distributed SRAM banks.

Given the very long vector lengths, vector instructions are executed in a deeply temporal manner, even across many parallel vector lanes. Thus, instruction throughput is less critical for maintaining high utilization of functional units. Instead, long-vector microarchitectures typically derive efficiency and high utilization by amortizing overheads over fewer long-chime inflight instructions.

Figure 6. Pipeline diagram of instruction execution in a deeply-temporal long-vector machine with zero dead time.

Figure 6 shows an example pipeline diagram of a vector loop in a deeply temporal long-vector machine. Unlike in the short-vector example in Figure 3, instruction throughput requirements are minimal, and strict in-order execution is sufficient for maintaining high utilization of the datapaths.

Figure 7. Pipeline diagram of instruction execution in a deeply-temporal long-vector machine with 1-cycle dead time.

Figure 7 highlights how dead time in a deeply-temporal vector-unit is amortized over many cycles of temporal execution per instruction. This contrasts with the Saturn-like short-vector machine, in which short chimes cannot hide dead time.

For DSP deployments, the long-vector paradigm is particularly ill-suited compared to short-vector, Saturn-like cores:

  • Many DSP applications feature short and/or widely varying application vector lengths. This makes it difficult for long-vector machines to effectively utilize their resources as it precludes deep temporal execution. Short-vector machines can achieve higher utilization on these shorter application vector lengths.

  • Short-vector machines use an inherently lower capacity vector register file, which has positive implications in terms of silicon area and power consumption.

  • Short-vector machines can still reduce IPC requirements and dynamic instruction counts on applications with long vector lengths by leveraging the register grouping capabilities of modern vector ISAs.

1.4.2. Compared to General-purpose SIMD Cores

SIMD datapaths in general-purpose cores are typically deeply integrated into the scalar instruction execution pipeline. In these designs, existing capabilities for out-of-order execution, speculative execution, superscalar fetch, and register renaming are leveraged to maximize SIMD datapath utilization. While these features are costly in power and area, they are necessary components of modern general-purpose cores, and thus are also leveraged when executing SIMD code.