Scaling Verilator for very large designs

Verilator is a fast, open source simulator widely used in the ASIC and FPGA ecosystem, offering state-of-the-art (or better) results in contexts otherwise dominated by proprietary offerings. Its open source nature and the promise of infinite scaling using cloud resources make it a high-value part of the open source hardware ecosystem. As such, it’s supported by the CHIPS Alliance and is actively used and contributed to by Antmicro in a variety of ASIC-related customer projects. Additions made through those engagements so far include dynamic scheduling, co-simulation with Renode and integration with cocotb.

The ever-increasing size and complexity of FPGA and ASIC designs require continuous improvements in Verilator’s capability to handle them, and through its customer projects Antmicro is heavily involved in the effort to enable state-of-the-art design support in the open source simulator. In this note, we’ll describe several memory usage and speed-related improvements Antmicro has introduced to Verilator in collaboration with the project maintainers, allowing us to help our customers optimize and scale it even for very large designs.

Analyzing the baseline

The most obvious problem users run into when simulating a very large design is likely to be memory usage: the bigger the design, the more resources it consumes. While cloud instances with substantial memory capacity are available, they are also costly, so decreasing the memory footprint can noticeably reduce cost, especially if you run many simulations, as is necessary for state-of-the-art ASIC designs.

To reduce the memory footprint, however, we first need to understand the initial state and the implications of any changes to be made, i.e. which objects and operations take up the most memory and processing time.

Verilator itself provides resource usage and performance statistics that can give you an overview of your design, e.g. elapsed time at each stage, memory used by Verilator processes at the end of each stage, or node count. A helpful indicator of memory usage is Resident Set Size (RSS), which provides information about the amount of RAM occupied by a specific process. Peak RSS values can give us insight into the most memory-consuming processes but also serve as a reference point for analyzing the effects of the improvements introduced to Verilator.
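
To make the peak RSS idea concrete, here is a minimal sketch of how a process can query its own peak RSS on a POSIX system using getrusage(). It is not how Verilator collects its statistics; the helper names peakRss(), allocatingStage() and peakRssGrowth() are made up for this illustration, and the unit of the reported value depends on the platform (kilobytes on Linux, bytes on macOS).

```cpp
#include <sys/resource.h>  // POSIX getrusage()

#include <cassert>
#include <cstddef>
#include <vector>

// Peak resident set size of the calling process, or -1 on failure.
// Linux reports ru_maxrss in kilobytes; macOS reports bytes.
long peakRss() {
    struct rusage usage {};
    if (getrusage(RUSAGE_SELF, &usage) != 0) return -1;
    return usage.ru_maxrss;
}

// Hypothetical stand-in for a processing stage: allocates a block and
// writes to every byte so the pages really become resident.
void allocatingStage(std::size_t bytes) {
    assert(bytes > 0);
    std::vector<char> block(bytes, 1);
    volatile char sink = block[bytes / 2];  // keep the block from being optimized away
    (void)sink;
}

// Runs the stand-in stage and returns how much the process-wide peak grew.
long peakRssGrowth(std::size_t bytes) {
    const long before = peakRss();
    allocatingStage(bytes);
    const long after = peakRss();
    assert(after >= before);  // a peak value never decreases
    return after - before;
}
```

Because the peak only ever grows, comparing the value recorded after each stage shows which stage pushed it up, which is the kind of reference point described above.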

Massif, a heap profiler provided with the Valgrind framework, is another tool useful for memory usage analysis - it allows you to view results in a visual form with Massif Visualizer, or to convert them into a textual report using the ms_print command.

Using detailed data about performance, memory usage and peak RSS values, a systematic and measurable approach can be taken towards decreasing Verilator memory consumption, which has been the focus of some of Antmicro’s projects in recent months.

Optimizing memory usage

Among the many improvements Antmicro has introduced to Verilator as part of this effort, one of the most important concerns reducing the size of V3Number. V3Number is an object that stores a value used in SystemVerilog code, such as a number, bit vector or string. Its significance stems from the fact that it’s usually present in a large number of nodes. For example, one of the designs we worked on used about 70GB of RAM and included 80 million “Const” nodes, each containing one V3Number. In this case, cutting just a few bytes from every V3Number object noticeably reduces the total memory usage.

The implemented improvement involved two parts. First, two mutually exclusive data members in the V3Number class were identified, and instead of always storing both (one of them being empty), a union was used to store only one of them. Second, a few properties were identified that can be inferred from the stored value type or from other properties, which allowed for removing the data members storing those properties. As a result, we were able to reduce the V3Number size from 96 to 56 bytes. In the example mentioned above, saving 40 bytes in each of the 80 million V3Number objects comes to roughly 3.2GB, in line with the reduction in memory usage of about 3GB that we observed.
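
The first part of that change can be illustrated with a small, self-contained sketch. The types below are hypothetical and are not the actual V3Number layout: WideNumber always carries both payloads, while CompactNumber keeps the two mutually exclusive payloads in a union and uses a flag to remember which one is active.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Words = std::vector<std::uint32_t>;  // hypothetical bit-vector payload
using Text = std::string;                  // hypothetical string payload

// "Before" (illustrative only): both payloads always exist, one stays empty.
struct WideNumber {
    Words words;
    Text text;
    bool isString = false;
};

// "After" (illustrative only): the payloads share storage through a union.
class CompactNumber {
public:
    explicit CompactNumber(Words words) : m_words(std::move(words)), m_isString(false) {}
    explicit CompactNumber(Text text) : m_text(std::move(text)), m_isString(true) {}
    // Copying would need to dispatch on the active member; omitted in this sketch.
    CompactNumber(const CompactNumber&) = delete;
    CompactNumber& operator=(const CompactNumber&) = delete;
    ~CompactNumber() {
        // Only the active member was constructed, so only it is destroyed.
        if (m_isString) {
            m_text.~Text();
        } else {
            m_words.~Words();
        }
    }
    bool isString() const { return m_isString; }
    const Text& text() const {
        assert(m_isString);
        return m_text;
    }
    const Words& words() const {
        assert(!m_isString);
        return m_words;
    }

private:
    union {
        Words m_words;
        Text m_text;
    };
    bool m_isString;
};

// The union needs room for the larger payload only, not for both.
static_assert(sizeof(CompactNumber) < sizeof(WideNumber),
              "sharing storage should shrink the object");
```

The exact sizes depend on the standard library and platform, so the sketch only checks that the union-based layout is smaller. The price is manual lifetime management of the active member, which is why the destructor has to dispatch on the flag.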

Another improvement introduced by Antmicro targeted peak RSS, described in the previous section as a good indicator of memory usage, by partially merging stages to avoid node duplication. If one stage creates multiple nodes from a single node and the next stage reduces the number of nodes again, the two can be merged so that the intermediate nodes never all exist at the same time. This lowers the number of nodes alive at the peak, and hence the peak RSS.
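
The effect of merging an expanding stage with a reducing one can be shown with a toy model. Nothing below reflects Verilator's actual passes or AST: nodes are plain integers, expand() and keep() are invented stand-ins for the two stages, and "live nodes" is simply the number of elements held at once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stage 1: turns one node into four.
std::vector<int> expand(int node) { return {node, node + 1, node + 2, node + 3}; }

// Hypothetical stage 2: decides which nodes survive.
bool keep(int node) { return node % 2 == 0; }

struct PassResult {
    std::vector<int> nodes;  // nodes left after both stages
    std::size_t peakLive;    // largest number of nodes alive at any point
};

// Stages run back to back: every expanded node exists before any is removed.
PassResult runSeparately(const std::vector<int>& input) {
    std::vector<int> live;
    for (int n : input) {
        const std::vector<int> e = expand(n);
        live.insert(live.end(), e.begin(), e.end());
    }
    assert(live.size() == 4 * input.size());
    const std::size_t peak = live.size();
    live.erase(std::remove_if(live.begin(), live.end(), [](int n) { return !keep(n); }),
               live.end());
    return {live, peak};
}

// Stages merged: each node's expansion is reduced right away.
PassResult runMerged(const std::vector<int>& input) {
    std::vector<int> live;
    std::size_t peak = 0;
    for (int n : input) {
        const std::vector<int> e = expand(n);
        peak = std::max(peak, live.size() + e.size());
        for (int m : e) {
            if (keep(m)) live.push_back(m);
        }
    }
    return {live, peak};
}
```

Both functions produce the same final nodes, but in the merged version only one node's expansion is alive at a time on top of the already reduced result, so the peak is lower whenever the input has more than one node and the second stage actually removes something.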

Speed optimization

Reducing memory usage is, of course, critical to get large designs to run in the first place, but once the workload you are trying to execute in Verilator fits in the memory available on your machine, the next parameter to focus on is speed.

There are two stages that can be optimized here: verilation of the HDL code (i.e. compiling it into executable C++/SystemC) and execution of the resulting binary. In this note, we will focus on the former, with more work on the latter to be described in future follow-ups.

As mentioned, Verilator processes the input HDL into simulation code that can later be compiled. While benchmarking one of the larger designs we’re working on, we discovered that as much as 6% of total execution time was taken up by formatting the code to be more human-readable, which isn’t always necessary, especially when running Verilator in CI. Making the formatting of the output simulation code optional, so that users can disable it, translates to a noticeable drop in verilation time, especially for large designs.

At the end of verilation, each node is converted into a C++ code fragment that will later serve as the source for simulation. This process can be optimized by using a suitably sized memory buffer. The default buffer sizes of buffered I/O streams (e.g. those opened with fopen()) are not always optimal for writing large amounts of data. Small buffers require more frequent calls to the write() system call, which are relatively slow, since each call has some fixed overhead. On the other hand, buffers that are too large don’t fit in the CPU cache. Benchmarking has shown that the optimal buffer size on relatively modern hardware is 128kB, which is also the optimal size according to GNU coreutils. In one of the designs we tested, optimizing the buffer size resulted in a ~1.5% shorter verilation time (from 245 to 241 s). On its own, it may not look like much, but multiple optimizations add up, resulting in significant improvements.
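
As a rough illustration of the buffering technique, the sketch below gives a C stdio stream an explicit buffer with setvbuf() before writing many small fragments to it. It is not taken from Verilator: emitFragments() and kEmitBufferSize are names invented for this example, and the 128kB figure is interpreted here as 128 KiB, which is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Assumption: "128kB" is read as 128 KiB for the purpose of this sketch.
constexpr std::size_t kEmitBufferSize = 128 * 1024;

// Writes all fragments to `path` through a stream with an explicit,
// fully buffered I/O buffer. Returns the number of bytes written,
// or 0 if the file could not be opened.
std::size_t emitFragments(const std::string& path,
                          const std::vector<std::string>& fragments,
                          std::size_t bufSize = kEmitBufferSize) {
    assert(bufSize > 0);
    std::FILE* fp = std::fopen(path.c_str(), "wb");
    if (!fp) return 0;

    // setvbuf() has to be called before any other operation on the stream,
    // and the buffer must stay alive until the stream is closed.
    std::vector<char> buffer(bufSize);
    if (std::setvbuf(fp, buffer.data(), _IOFBF, buffer.size()) != 0) {
        // Fall back to the default buffer chosen by the C library.
    }

    std::size_t written = 0;
    for (const std::string& fragment : fragments) {
        written += std::fwrite(fragment.data(), 1, fragment.size(), fp);
    }
    std::fclose(fp);  // flushes the remaining data while `buffer` still exists
    return written;
}
```

With a larger buffer, many short fragments are combined into fewer write() calls. The best size is something to measure on the target machine rather than assume, which is what the benchmarking mentioned above did.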

While the generated simulation can be multi-threaded, Verilator itself runs in a single thread. Adapting Verilator to run in multiple threads can significantly improve execution time. However, there are some challenges with this approach. Writing code that is multi-thread safe is not always easy. There are some static analysis tools that can help, but they require proper setup. Verilator also requires the functions used in a multi-threaded context to be marked as VL_MT_SAFE. For these reasons, implementation of this improvement involved multiple changes to Verilator, which included removing the usage of user2 in EmitCLazyDecls and user1 in EmitCTrace, preparing EmitCImp for parallel emit, adding build-jobs options, and more.
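
The general shape of a parallel emit step can be sketched as follows. This is a generic illustration, not Verilator's implementation: emitUnit() and emitAll() are hypothetical, and SKETCH_MT_SAFE is a made-up marker that expands to nothing and only documents intent, standing in for the kind of annotation VL_MT_SAFE provides.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Made-up marker for this sketch; it expands to nothing.
#define SKETCH_MT_SAFE

// Hypothetical per-unit emitter: it reads only its argument and touches
// no shared mutable state, so it can be called from several threads.
std::string emitUnit(const std::string& name) SKETCH_MT_SAFE {
    return "// generated from " + name + "\n";
}

// Emits every unit using `jobs` worker threads. Results are stored by index,
// so the output order does not depend on thread scheduling.
std::vector<std::string> emitAll(const std::vector<std::string>& units, unsigned jobs) {
    assert(jobs > 0);
    std::vector<std::string> out(units.size());
    std::atomic<std::size_t> next{0};

    auto worker = [&]() {
        for (;;) {
            const std::size_t i = next.fetch_add(1);
            if (i >= units.size()) return;
            out[i] = emitUnit(units[i]);  // each slot is written by exactly one thread
        }
    };

    std::vector<std::thread> pool;
    for (unsigned j = 0; j < jobs; ++j) pool.emplace_back(worker);
    for (std::thread& t : pool) t.join();
    return out;
}
```

The sketch works only because emitUnit() shares nothing between calls. Shared per-node scratch state is exactly what gets in the way of this pattern, which is presumably why the changes listed above include removing such state from the emit code before running it in parallel. On some toolchains, code using std::thread has to be built with -pthread.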

Customizing and extending Verilator capabilities with Antmicro

The efforts to make even more designs and workflows compatible with Verilator continue, and Antmicro is part of the effort to develop multiple further improvements, including a custom tool that parses function annotations to ensure all functions used in a parallel context are marked properly, which will further speed up the verilation process. There are also many runtime performance optimizations that we have not discussed in this note; they are a broad and fascinating topic in themselves, to be tackled at another time.

Antmicro helps customers adopt and expand Verilator’s capabilities for their unique use cases, even those involving very large designs. With broad experience in both ASIC and FPGA, we offer comprehensive engineering support allowing you to have complete control over the entire design workflow. If you would like to adapt Verilator - or another open source tool - for your next project, and benefit from the scalability and customization options it provides in combination with our services, make sure to reach out to us at contact@antmicro.com.
