Open source FPGA NVMe accelerator platform for BPF driven ML processing with Linux and Zephyr


Topics: Open FPGA, Open source tools

Machine learning typically operates on large amounts of data which often has to be moved back and forth between processing nodes and storage, generating bottlenecks and costs in terms of both power and bandwidth. One trend addressing this is giving compute units (CPUs, GPUs, FPGAs) more built-in, dedicated memory, but there is a limit to how far this can be pushed; another is interconnecting compute with remote memory and storage more efficiently using interfaces such as CXL. Yet another, described in this note, is processing data directly on storage devices - computational storage, or in-situ / in-storage computing - which lets you process data on the fly, or perform computations directly on the stored data, without consuming host compute resources and without the associated back-and-forth transfers.

Computational storage is particularly useful for unstructured data, i.e. images, videos or audio. With computational storage devices (CSD), deep learning models can be deployed for incoming data to e.g. detect, annotate and track objects present in images or video files. With the metadata coming from deep learning models, analysis and management of unstructured data can become much easier and faster.

One project Antmicro is involved with in this area has resulted in an open source FPGA in-storage compute development platform based on Non-Volatile Memory Express (NVMe) for Berkeley Packet Filter (BPF) driven ML processing, employing Linux and Zephyr. The platform, codenamed Alkali, has already been presented at Linux Plumbers Conference and, now open source, can be a great starting point for developing a custom computational storage device.

Computational storage illustration

The NVMe US+ ML platform

NVMe is a high-performance, highly scalable protocol used for accessing high-speed storage media like SSDs. It was designed from the ground up for non-volatile memory media directly connected to the CPU via a PCIe interface which utilizes high speed lanes, and thus can offer transfer speeds more than two times higher than a SATA interface. The NVMe standard specification describes, among other things, all the basic commands required for communication with the storage device. The standard also supports defining vendor-specific commands that use a separate range of opcodes, which opens the path for adding new application-specific commands.

The framework described in this note targets a couple of Xilinx Zynq UltraScale+ MPSoC XCZU7EV based platforms, including the off-the-shelf ZCU106 development board. The XCZU7EV combines Xilinx's UltraScale+ FPGA fabric with a Linux-capable quad-core ARM Cortex-A53, a dual-core ARM Cortex-R5 ideal for running an RTOS such as Zephyr, and a PCIe Gen3 endpoint which enables connecting to a PC host. Details about generating hardware description files and bitstreams can be found in a dedicated submodule of the Alkali project.

NVMe system architecture diagram

The framework itself consists of 4 major components:

  • A PCIe core, which enables communication with the NVMe target controller,
  • An NVMe target controller, consisting of software running on a Real-time Processing Unit (RPU) core of the UltraScale+ MPSoC, an NVMe control registers IP and a PCIe DMA. The controller is responsible for handling the base NVMe commands,
  • Software running on an Application Processing Unit (APU) core, which communicates with the RPU via a Processing System Inter-processor Interrupt (PS IPI) core. This software is responsible for handling custom NVMe commands and running the accelerators,
  • Shared DDR memory, used by both the APU and RPU cores, which serves as a buffer for NVMe data transfers to and from the host.

To enable access from the host to the AXI interface of our target controller over PCIe, we needed PCIe-AXI bridge IPs of the kind used in projects which require a high-performance datapath. One such project, Corundum, focuses on creating a high-performance FPGA-based Network Interface Card (NIC).

At the base of the NVMe target controller is a Zephyr RTOS based application running on a Cortex-R5 RPU core. Since the accelerators are run on the APU, the RPU core needs to communicate with it. To achieve that, Antmicro used Open Asymmetric Multi-Processing (OpenAMP), which is an open source framework for developing software applications for AMP systems, hosted by the Linaro Community Projects division.

OpenAMP schematic

The APU runs a Linux system and is responsible for handling the custom NVMe commands dealing with the actual in-storage processing. To achieve the goal, the base NVMe command set had to be extended by several Admin and I/O commands to enable controlling the process of uploading, selecting, executing and collecting the results from the accelerators.

Accelerator firmware

To simplify the development of the accelerated processing code running on the platform, the APU runs an eBPF virtual machine.

Originally, the Berkeley Packet Filter (BPF) project was created with network packet filtering in mind - it allowed a user-space program to attach a filter to any socket in order to decide, at the kernel level, whether a packet should be allowed through that socket. BPF filter definitions are compiled to bytecode that is later executed in a sandboxed environment, so they cannot harm the kernel.

Extended BPF (eBPF) builds on classic BPF to make it possible to safely add new features to the operating system at runtime, for use cases beyond packet filtering. eBPF programs are nowadays used in system security, data tracing, metrics collection, packet processing, load balancing and more.

BPF programs can be written by hand, generated by packet filtering software, or compiled from C using the Clang compiler.

In this project Antmicro used a user-space BPF virtual machine, uBPF, to provide BPF features on the APU core. BPF here is used to filter data coming from the host to the CSD and perform additional computations on it; computation results are later stored in a specified place in the storage. uBPF can easily be extended with new functions, so the BPF code can delegate computations to the FPGA-based accelerator.

VTA deep learning accelerator

For accelerating deep learning operations on the NVMe US+ ML platform, we used the Versatile Tensor Accelerator (VTA) - an open source deep learning accelerator deployable on FPGAs. It is a parameterizable design that accelerates dense linear algebra operations. The VTA accelerator has four instructions:

  • LOAD - loads a two-dimensional tensor from DRAM into buffers,
  • GEMM (General Matrix Multiplication) - performs matrix multiplications,
  • ALU (Arithmetic Logic Unit) - performs various vector operations, such as MIN, MAX, ADD, MULTIPLY or SHIFT,
  • STORE - stores a two-dimensional tensor from the output buffer to DRAM.

This instruction set allows the deep learning runtime to accelerate the most computationally demanding deep learning layers, such as convolutional and fully-connected layers.

Besides the IP core design, VTA also provides drivers for the host to communicate with the accelerator.

VTA is tightly coupled with the Apache TVM deep learning compiler - a framework for deploying optimized deep neural network models on various target devices, including edge devices. Together, TVM and VTA form an end-to-end hardware-software deep learning deployment flow.

TVM takes the deep learning model, optimizes it, and transforms its operations into a set of calls to the driver methods - those calls act as a JIT runtime.

Enhancing accelerator firmware with TensorFlow Lite

As a runtime for deep learning models in BPF VM, we decided to use TensorFlow Lite, with which we also work in other contexts.

TensorFlow Lite is a set of tools and libraries for compiling and running machine learning and deep learning models on mobile, embedded and IoT devices. It provides a minimal runtime in various programming languages (C++, Objective-C, Java, Swift) with a minimal set of dependencies, aiming for low latency, power consumption, model and runtime binary size, and high extensibility. In TensorFlow Lite, it is possible to move computations of some demanding operations to the appropriate accelerators using delegates.

Owing to TensorFlow Lite being light on dependencies and small in size, it was easy to embed it into the uBPF VM. And with TFLite available, we were able to add new functions to the BPF VM, allowing the user to write calls involving deep learning processing in their BPF C code.

Based on the VTA drivers and an analysis of the compiled TVM models for the VTA target, Antmicro’s team wrote a simple delegate for TensorFlow Lite to execute selected deep learning operations on the VTA accelerator running on the FPGA.

In the end, the incoming data can be processed by the BPF program that runs inference on that data using a TFLite model to e.g. perform classification.

The Alkali framework in practice

To demonstrate the project, Antmicro created a simple demo application that offloads the host machine by delegating the vector addition operation to the Alkali CSD.

The demo consists of a C program targeting the uBPF machine on the storage, a binary file with an input data vector, a TFLite model describing the data operations, and a host application responsible for managing the communication with CSD.

First, the host application sends the compiled eBPF binary, input vector, and TFLite model to a dedicated region of the NVMe storage over PCIe. Then, custom NVMe commands are used to launch the accelerator on the storage device. As a result, the eBPF machine on CSD gets initialized with the program binary and fed with the input data and TFLite model. The program uses the TFLite framework to perform the addition operation delegated, in turn, to the VTA accelerator.

Finally, the output obtained from the accelerator is stored back in the CSD memory, from which the host application can retrieve the calculated data without using its own processing power for computations.

The demo is available in the Alkali repository, along with instructions on how to use it, and can be run on both supported boards.

Accelerate your ML solution with Antmicro

The efforts described in this note provide just a small glimpse into the vast array of comprehensive services offered by Antmicro for hardware ML acceleration, which include open source IP cores, software, FPGA and hardware development. Starting with a dedicated AI framework, Kenning, which can be used for optimizing AI models for edge deployment, through optimizing all levels of the FPGA development stack including the tools themselves, dedicated co-simulation tooling for ML-focused silicon development up to dedicated hardware such as the AMD Xilinx Kria UltraScale+ SoM baseboard, we’re always happy to put this knowledge to good use to create advanced, custom solutions for our customers.

If you’re interested in learning more about Antmicro’s experience with ML acceleration, don’t hesitate to contact us at
