Building an open source ZSTD decoder in the XLS Mid-Level Synthesis toolchain

Published:

Topics: Open ASICs, Open FPGA, Open source tools

In a long-running collaboration with Google, Antmicro has been working on demonstrating how the XLS Mid-Level Synthesis toolchain can increase productivity when developing highly parallel ASIC solutions.

The initial focus of the project covered compression algorithms such as Run-Length Encoding (RLE) and Dictionary Based Encoding (DBE) described in the previous article, but more recently, Antmicro has developed a ZSTD decoder illustrating how the toolchain can be used for building real-world hardware blocks. The development of those blocks was done in DSLX, a Rust-like domain-specific language which is the default frontend to XLS.

This article describes Antmicro’s implementation of the ZSTD decoder in XLS, and showcases the toolchain’s ability to quickly develop advanced cores for state-of-the-art designs in a software-driven, CI-friendly way.

Building a ZSTD decoder in XLS

Implementing ZSTD decompression

Zstandard (ZSTD) is a lossless data compression mechanism documented in RFC 8878, Zstandard Compression and the ‘application/zstd’ Media Type. The XLS DSLX implementation reuses elements of the previously described RLE and DBE modules.

The ZSTD format follows a hierarchical structure, where each building block has a header containing the metadata required to decompress it correctly. A ZSTD file consists of one or more frames. Each frame begins with a magic number that distinguishes it from other formats, followed by a frame header. Frames are divided into blocks of three types: raw (uncompressed data), RLE (a single repeated symbol) and compressed (using FSE and optionally Huffman coding). Each block starts with a header indicating its type. Within compressed blocks, literals and sequences store non-repeating and repeating data respectively, similarly to the LZ77 algorithm, but encoded in different ways. Structuring the input data this way encourages a modular decoder design, where specialized units are selected independently based on header information.
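To make the header-driven dispatch more concrete, below is a minimal Python sketch of parsing a block header and picking a decoding unit. The magic number and the 3-byte block header layout follow RFC 8878. The unit names returned by `select_decoder` are hypothetical and do not come from the XLS implementation.

```python
from enum import IntEnum
from typing import NamedTuple

# Frame magic number from RFC 8878, stored little-endian in the stream.
ZSTD_MAGIC = 0xFD2FB528


class BlockType(IntEnum):
    RAW = 0
    RLE = 1
    COMPRESSED = 2
    RESERVED = 3


class BlockHeader(NamedTuple):
    last_block: bool
    block_type: BlockType
    block_size: int


def has_zstd_magic(data: bytes) -> bool:
    """Check whether the data starts with the ZSTD frame magic number."""
    return len(data) >= 4 and int.from_bytes(data[:4], "little") == ZSTD_MAGIC


def parse_block_header(raw: bytes) -> BlockHeader:
    """Decode the 3-byte little-endian block header.

    Bit 0 is Last_Block, bits 1-2 are Block_Type, bits 3-23 are Block_Size.
    """
    if len(raw) < 3:
        raise ValueError("a block header needs 3 bytes")
    word = int.from_bytes(raw[:3], "little")
    return BlockHeader(bool(word & 1), BlockType((word >> 1) & 0x3), word >> 3)


def select_decoder(header: BlockHeader) -> str:
    """Return the name of a (hypothetical) unit that handles this block type."""
    units = {
        BlockType.RAW: "raw_block_decoder",
        BlockType.RLE: "rle_block_decoder",
        BlockType.COMPRESSED: "compressed_block_decoder",
    }
    if header.block_type not in units:
        raise ValueError("reserved block type")
    return units[header.block_type]
```

For example, the header bytes `2b 00 00` describe the last block of a frame, of type RLE, with a size of 5, so this sketch would route it to the RLE unit.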

Another key design consideration was imposed by the FSE algorithm used for storing compressed blocks. This algorithm maintains a state that is used together with decoding lookup tables, which forces compression and decompression to proceed in opposite directions. As a consequence, the FSE decoder in the ZSTD format needs to read data backwards. To minimize resource usage, instead of reversing the data in memory, the decoder uses direct memory access, which allows data to be read from arbitrary locations in memory. These two factors result in the distinct decoder structure presented below:

ZSTD decoder block diagram
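The backward read required by FSE can be illustrated with a small software model. The sketch below is a simplified Python model, not the DSLX code: a writer appends bit fields in forward order and terminates the stream with a single set bit, and a reader locates that bit in the last byte and consumes fields from the end towards the beginning. The class names are illustrative only.

```python
class ForwardBitWriter:
    """Appends bit fields starting from the least significant bit."""

    def __init__(self) -> None:
        self._acc = 0
        self._nbits = 0

    def write(self, value: int, width: int) -> None:
        if value < 0 or value >> width:
            raise ValueError("value does not fit in the given width")
        self._acc |= value << self._nbits
        self._nbits += width

    def finish(self) -> bytes:
        # A single set bit marks the end of the stream; any bits above it
        # in the last byte are zero padding.
        acc = self._acc | (1 << self._nbits)
        return acc.to_bytes(self._nbits // 8 + 1, "little")


class BackwardBitReader:
    """Reads bit fields starting just below the end marker, moving backwards."""

    def __init__(self, data: bytes) -> None:
        if not data or data[-1] == 0:
            raise ValueError("the last byte must contain the end marker")
        self._acc = int.from_bytes(data, "little")
        # Position of the end marker, which equals the number of payload bits.
        self._pos = self._acc.bit_length() - 1

    @property
    def bits_left(self) -> int:
        return self._pos

    def read(self, width: int) -> int:
        if width > self._pos:
            raise ValueError("not enough bits left in the stream")
        self._pos -= width
        return (self._acc >> self._pos) & ((1 << width) - 1)
```

Writing the 7-bit value 52 followed by the 3-bit value 5 produces the bytes `b4 06`. The reader then returns 5 first and 52 second, that is, the fields come back in the opposite order to the one in which they were written.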

Developing the ZSTD decoder design with DSLX

The XLS toolchain allowed Antmicro to create the design at a higher level of abstraction. This made it possible to focus on the algorithm and architecture rather than on minute hardware implementation details, making the design process more software-driven and iterative than traditional SystemVerilog development. DSLX structures the code into multiple processes (procs) that communicate with each other over channels, effectively forming a Network on Chip (NoC). The channels are later translated into streaming interfaces with a handshake, which provides backpressure and synchronization between procs.
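As a rough software analogy of procs connected by channels, the Python sketch below models each proc as a thread and each channel as a bounded queue. It is not DSLX and does not reflect how XLS schedules hardware. The depth-1 queues are an assumption chosen to imitate a handshake: a sender blocks until the receiver has taken the previous item, which is the backpressure behavior described above. The RLE expansion step is used only as a familiar example workload.

```python
import queue
import threading

DONE = object()  # end-of-stream token used only by this model


def rle_source(out_ch: queue.Queue, pairs) -> None:
    """Proc 1: emits (symbol, count) pairs."""
    for pair in pairs:
        out_ch.put(pair)  # blocks while the channel is full (backpressure)
    out_ch.put(DONE)


def rle_expander(in_ch: queue.Queue, out_ch: queue.Queue) -> None:
    """Proc 2: expands each (symbol, count) pair into repeated symbols."""
    while True:
        item = in_ch.get()
        if item is DONE:
            out_ch.put(DONE)
            return
        symbol, count = item
        for _ in range(count):
            out_ch.put(symbol)


def run_pipeline(pairs) -> bytes:
    """Wire the procs together and collect the output stream."""
    pairs_ch: queue.Queue = queue.Queue(maxsize=1)
    symbols_ch: queue.Queue = queue.Queue(maxsize=1)
    procs = [
        threading.Thread(target=rle_source, args=(pairs_ch, pairs)),
        threading.Thread(target=rle_expander, args=(pairs_ch, symbols_ch)),
    ]
    for proc in procs:
        proc.start()
    out = bytearray()
    while True:
        item = symbols_ch.get()
        if item is DONE:
            break
        out.append(item)
    for proc in procs:
        proc.join()
    return bytes(out)
```

Calling `run_pipeline([(0x41, 3), (0x42, 2)])` returns `b"AAABB"`. Neither proc knows how fast the other one runs; the bounded channels alone keep them synchronized.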

The toolchain facilitates creating small processes that encapsulate a simple functionality, which can be used to build up more convenient abstractions used in other parts of the design. In this case, multiple specialized blocks were created for handling different parts of the ZSTD bitstream. To make the integration with other systems easier, the team created blocks that expose simple interfaces for writing and reading the memory, while handling the complexities of the AXI protocol.
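One example of the protocol detail that such a memory interface block typically hides is burst splitting. The article does not describe the internals of Antmicro's blocks, so the Python sketch below is only an illustration of the general idea under stated assumptions: a 4-byte data bus, beat-aligned requests, and the AXI4 rules that an incrementing burst has at most 256 beats and must not cross a 4 KB address boundary. The function name is hypothetical.

```python
AXI_BOUNDARY = 4096  # AXI bursts must not cross a 4 KB address boundary


def split_into_bursts(addr: int, length: int,
                      bytes_per_beat: int = 4, max_beats: int = 256):
    """Split a simple (address, length) read into legal burst requests.

    Returns a list of (start_address, number_of_beats) tuples.
    Assumes the address and length are multiples of the bus width.
    """
    if addr % bytes_per_beat or length % bytes_per_beat:
        raise ValueError("this sketch only handles beat-aligned requests")
    bursts = []
    while length > 0:
        to_boundary = AXI_BOUNDARY - (addr % AXI_BOUNDARY)
        chunk = min(length, to_boundary, max_beats * bytes_per_beat)
        bursts.append((addr, chunk // bytes_per_beat))
        addr += chunk
        length -= chunk
    return bursts
```

A user of the simple interface would request 0x300 bytes starting at address 0xF00 in a single call, while the bus would see two bursts: 64 beats at 0xF00 and 128 beats at 0x1000.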

The toolchain translates the DSLX code into an intermediate representation (IR), which is then automatically optimized and scheduled. This conceptually separates the algorithm from its hardware representation, so the two can be adjusted and fine-tuned largely independently. Where an algorithm needs to be redesigned to better match the hardware, XLS provides tools for examining the critical paths of the design and understanding the cost of each operation for the target process technology. These features were particularly valuable when implementing the FSE and Huffman decoding algorithms.

An important factor for Antmicro was that XLS allows tests to be written in the same DSLX language used for the design. It also provides internal mechanisms to ensure the equivalence of the IR at later stages, verifying that the behavior of the generated SystemVerilog or Verilog matches the behavior described in DSLX. Additionally, XLS offers a C++ API that allowed Antmicro to write a variant of the simulation in C++ and compare the results directly with the original ZSTD library.

You can track the current stage of the ZSTD codec implementation in the XLS project in this GitHub Pull Request.

Based on the experience of creating real-world hardware block implementations, the Antmicro team continues to work with Google on improving the XLS toolchain itself – more developments will be described in future blog articles.

Software-driven digital block design with Antmicro

With their software-driven approach to developing I/O and processing cores, Antmicro can help you benefit from the software/hardware co-development capabilities offered by XLS and DSLX, as well as further improve the toolchain to better address the challenges of modern, high-bandwidth ASIC design and your specific use case.

If you would like to learn more about Antmicro’s engineering services for RTL and tooling development at every stage of an ASIC or FPGA project, explore the ASIC & FPGA section of our offering or reach out to us directly at contact@antmicro.com.
