Speeding up SoC interconnects with test-driven FPGA development using Cocotb

Topics: Open FPGA, Open source tools

Antmicro works on a number of FPGA and ASIC projects where the constraints of the hardware platform require efficient data transfer mechanisms. The full control over the entire design process offered by the open source IPs and tools that we use and develop allows us to optimize performance as needed. Speeding up FPGA designs is best done using open source test-driven design methodologies, which enable us to continuously benchmark and monitor the results. Cocotb is a great Python-based tool offering a software-driven approach to hardware testbench development, and we have recently used it to improve the performance of an FPGA-oriented system used in the CFU Playground project from Google. In this note we describe how to use test-driven FPGA development with Cocotb to significantly speed up operations on the interconnect bus, based on a real-life example of implementing burst mode.

SoC illustration

Optimizing communication on the system bus

We have been collaborating with the Google team behind CFU Playground for a while now, helping them build a versatile prototyping framework for power-efficient solutions for ML in FPGA. Especially in such constrained use cases, where the design is limited with regard to the maximum clock speed or energy usage, you need a focused approach to gateware optimization. One way to find potential optimizations is to locate bottlenecks in the most frequently used data flow paths.

The system bus, which is responsible for transferring data between the components of an SoC, is a logical place to look for potential low-hanging fruit. One of the standards defining interconnect schemes for SoC components is the open source hardware Wishbone interconnect maintained by the FOSSi Foundation. Due to its permissive license it is popular in many open hardware systems, especially on FPGAs. The Wishbone specification defines “Registered Feedback” bus cycles, which enhance the synchronous cycle termination scheme with a pipelining mode. Because transfers of multiple data words then take fewer clock cycles, a higher throughput can be achieved. Typical uses of such pipelined operations include filling CPU cache lines and DMA transfers between memory and high-speed peripherals.

According to the Wishbone documentation, burst mode needs to be implemented on both sides of the connection (e.g. the CPU and a peripheral such as a memory controller) by adding transfer type signaling, an address generator block inside the peripheral as well as synchronization logic. After starting the transfer, the address generator in the peripheral is used for preparing the next accessed address on each clock cycle, allowing for sequential operations without waiting for each subsequent word transfer to be acknowledged.
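To make the role of the address generator more concrete, here is a minimal behavioral sketch in Python. It is not the gateware used in the project; it only models how the next address could be derived from the cycle type (CTI) and burst type extension (BTE) signals of the Registered Feedback scheme. The function names and the default word size of 4 bytes are assumptions made for illustration.

```python
from enum import IntEnum


class CTI(IntEnum):
    """Cycle Type Identifier values used by Wishbone Registered Feedback cycles."""
    CLASSIC = 0b000
    CONSTANT_BURST = 0b001
    INCREMENTING_BURST = 0b010
    END_OF_BURST = 0b111


# Burst Type Extension: 0b00 is a linear burst, the other values wrap
# the address after 4, 8 or 16 beats.
WRAP_BEATS = {0b00: None, 0b01: 4, 0b10: 8, 0b11: 16}


def next_address(adr: int, cti: CTI, bte: int, step: int = 4) -> int:
    """Return the address to prepare for the next beat (hypothetical helper).

    `step` is the address increment per data word (assumed 4 bytes here).
    """
    if cti != CTI.INCREMENTING_BURST:
        # Classic, constant address and end-of-burst cycles keep the address.
        return adr
    beats = WRAP_BEATS[bte]
    if beats is None:
        return adr + step  # linear burst
    window = beats * step  # size of the wrap window in address units
    base = adr - (adr % window)  # start of the aligned window
    return base + (adr + step - base) % window


def burst_addresses(start: int, length: int, bte: int, step: int = 4) -> list:
    """List the addresses visited by an incrementing burst of `length` beats."""
    adr, visited = start, []
    for _ in range(length):
        visited.append(adr)
        adr = next_address(adr, CTI.INCREMENTING_BURST, bte, step)
    return visited
```

For example, `burst_addresses(0x8, 4, 0b01)` walks a 4-beat wrapping burst that starts in the middle of its window and wraps back to the window base, while `bte=0b00` simply keeps incrementing.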

So, in our example, while implementing burst mode looks like a good idea, the system needs several changes to get there. How can this be done in a structured manner, without introducing bugs into an otherwise functioning system?

Test-driven development for FPGA with Cocotb

Introducing tests before you make implementation changes is a common practice in software, but efficient TDD (Test-Driven Development) in FPGA presents some challenges due to several factors: tool licensing, long build times, and the relative complexity of the combination of hardware, FPGA bitstreams and software, where issues can appear on many levels.

Throughout the years, Antmicro has been building up many clever and open elements which together allow our work with FPGAs to be much more efficient and software-driven. This was achieved by creating open source toolchains as part of the CHIPS Alliance-hosted F4PGA Workgroup (the CFU Playground also uses an open toolchain), building cloud-based CI systems for FPGA development that allow us to offload processing to CI servers (including video recordings of FPGA boards in our server rooms), as well as adopting new development paradigms such as Cocotb. Cocotb provides an excellent Python-based approach to testbench creation and integrates seamlessly with traditional software development methodologies.

In this effort, we decided that it makes sense to implement burst mode for two peripherals. The first one is an I/O module containing two FIFO queues, which was used as a test module in the process of implementing the Wishbone interface. The second one is an SRAM module, which is used in designs as an internal SoC RAM for storing data and code before initializing larger RAM modules.

During our work on adding burst mode support we needed to verify that the burst mode logic generates the addresses correctly and that the peripherals work properly regardless of the bus cycle type. For this purpose a set of unit tests was written using Python and the Cocotb library. Cocotb was used for writing test benches for each type of tested peripheral (FIFO transmitter, RAM) and each type of bus cycle (classic, constant address burst, incrementing address burst). Two additional libraries, cocotb-test and pytest, were used for test parameterization, which allowed for running the same tests with different parameters and input data. For example, a parameterized Cocotb test for incrementing burst mode support in SRAM looks as follows:

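import pytest
from cocotb_test.simulator import run

# mem_regs (the register map) and no_waves (waveform dump switch)
# are defined elsewhere in the test suite.

# Stacked parametrize decorators produce every combination of the values.
@pytest.mark.parametrize("offset", range(0, 16, 4))
@pytest.mark.parametrize("length", [1, 2, 4, 8, 16])
@pytest.mark.parametrize("bte", range(4))
def test_sram_incrementing(offset, length, bte):
    reg = mem_regs["sram"]
    # Parameters reach the Cocotb test module as environment variables,
    # so every value is passed as a string.
    extra_env = {
        "adr_base": str(reg["base_address"]),
        "adr_offset": str(offset),
        "adr_inc": str(4),
        "length": str(length),
        "bte": str(bte),
        "sram_fill": str(1),
    }
    # Build and run the simulation for this parameter combination.
    run(
        verilog_sources=["dut.v", "tb.v"],
        toplevel="tb",
        module="tests.test-incrementing",
        extra_env=extra_env,
        waves=not no_waves,
    )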

In this example, the clean and familiar pytest decorator syntax is used to parametrize the test scenario and generate 80 test cases (4 offsets × 5 lengths × 4 burst type extensions). In general, the resulting code is easy to read, modify and extend, as well as to integrate into a bigger whole such as an automated CI.
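The listing above hands its parameters to the simulation through `extra_env`, so the Cocotb test module has to read them back from the environment. The sketch below shows one way this could look, together with a quick check of the size of the parameter grid. The `BurstParams` class and the `load_params` and `parameter_grid` helpers are hypothetical names introduced for illustration; only the environment variable names come from the listing.

```python
import os
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BurstParams:
    """Test parameters decoded from environment variables (hypothetical)."""
    adr_base: int
    adr_offset: int
    adr_inc: int
    length: int
    bte: int
    sram_fill: bool

    @property
    def start_address(self) -> int:
        # First address accessed by the burst.
        return self.adr_base + self.adr_offset


def load_params(env=None) -> BurstParams:
    """Convert the string values passed via extra_env back to typed values."""
    env = os.environ if env is None else env
    return BurstParams(
        adr_base=int(env["adr_base"]),
        adr_offset=int(env["adr_offset"]),
        adr_inc=int(env["adr_inc"]),
        length=int(env["length"]),
        bte=int(env["bte"]),
        sram_fill=env["sram_fill"] == "1",
    )


def parameter_grid() -> list:
    """Enumerate the (offset, length, bte) combinations pytest generates."""
    return list(product(range(0, 16, 4), [1, 2, 4, 8, 16], range(4)))
```

Accepting a plain dictionary in `load_params` keeps the decoding logic testable without starting a simulator, and `len(parameter_grid())` gives the number of generated test cases.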

An extension called cocotbext-wishbone helps abstract operations on the interface to avoid reimplementing the logic required for driving the Wishbone interface of peripheral modules. Because burst mode depends on additional signals on the Wishbone interface, we modified this extension to drive additional signals and register their state after each operation.

Cocotb testbench diagram

The test suite consists of three test groups, one for each bus cycle type. Each group contains unit tests that check basic read and write operations in different combinations, which was important when testing new bus cycle types. With CI integration, these tests were executed after every commit, so regressions could be detected early.

Benchmark results

To be able to measure the impact of these changes in an environment closer to real-world systems, a test environment was created using LiteX. The same software was used both for running on physical FPGAs and for simulation in Verilator. This allowed for more reproducible results by benchmarking on real, unmodified software binaries that can also be deployed on hardware.

The test SoC was designed around the popular VexRiscv softcore implementing the RV32 instruction set. A configuration with L1 instruction and data caches enabled was chosen to exercise burst mode support in this core, allowing us to focus on developing and testing burst mode support in the peripherals. Interaction with the BIOS and the benchmark program took place over UART, while Ethernet was used for debugging the SoC through the Wishbone bus bridge.

The benchmark program was based on memory speed test routines from LiteX libraries, allowing both sequential and random memory reads/writes to be executed on selected memory blocks - the previously modified SRAM module in this case. Memory speed tests running on the VexRiscv core with L1 data cache showed that sequential memory reads were 26% faster due to faster filling of cache lines.

Benchmark results table

The Wishbone Interconnect Burst Mode Benchmark has been released on GitHub so that you can grab it and reproduce the results for yourself in case you are working on a similar problem.

Future developments

Thanks to the generic nature of the address generator module, it can be reused in any peripheral using the Wishbone Interconnect for communication. Among the cores from the LiteX suite, those that would benefit most from burst mode support due to their importance in a typical System on Chip design are peripherals such as the DRAM controller, DMA core and L2 cache module. Due to the fallback mechanism described in the Wishbone Interconnect specification, changes to a core can be implemented incrementally, falling back to the classic bus cycle synchronization scheme for burst modes that are not yet supported by the peripheral.
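The fallback idea can be illustrated with a small Python model. This is a sketch under stated assumptions rather than the actual peripheral logic: a peripheral declares which cycle types it supports and treats every other cycle type as a classic cycle. The `effective_cycle_type` helper is a hypothetical name; the cycle type encodings follow the Wishbone Registered Feedback scheme.

```python
# Cycle Type Identifier (CTI) encodings.
CLASSIC = 0b000
CONSTANT_BURST = 0b001
INCREMENTING_BURST = 0b010
END_OF_BURST = 0b111


def effective_cycle_type(cti: int, supported: frozenset) -> int:
    """Return the cycle type the peripheral actually uses (hypothetical helper).

    Any cycle type missing from `supported` is handled as a classic cycle,
    so each word is acknowledged individually.
    """
    return cti if cti in supported else CLASSIC


# A peripheral that so far implements only incrementing bursts.
SRAM_SUPPORTED = frozenset({CLASSIC, INCREMENTING_BURST, END_OF_BURST})
```

With this set, `effective_cycle_type(CONSTANT_BURST, SRAM_SUPPORTED)` yields the classic cycle type, so support for further burst types can be added later by extending the set, without changing the bus master.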

If you are interested in the benefits open source toolchains and Cocotb can bring to FPGA development, or want to create your next SoC using tools such as the ones described in this note, don’t hesitate to contact us at contact@antmicro.com. We offer comprehensive engineering services and help our customers adopt new FPGA/ASIC development flows and optimize existing ones, providing expertise and support at every step of the way.
