Verilator model generation performance improvements and initial multithreaded verilation support


Topics: Open source tools, Open ASICs, Open FPGA

Verilator is one of the most widely used free and open source digital design tools for ASIC and FPGA development. To keep up with the ever-increasing complexity of ASIC and FPGA devices, Antmicro, as both a user and a contributor, has been actively working on improving the tool and its ecosystem, including adding co-simulation capabilities with Renode, adding support for SystemVerilog UVM testbenches to Verilator, and improving scalability for very large designs.

Even though Verilator is most likely already the fastest open source Verilog/SystemVerilog simulator out there, generating and compiling simulation models for large designs can still be very time-consuming. This note presents the improvements we have introduced to model generation (aka verilation) in terms of memory usage and execution time optimizations, as well as an ongoing effort aimed at parallelizing Verilator passes by enabling multithreading in the verilation process.

Accelerating model generation in Verilator

Execution time and memory usage improvements

In recent months we have added a number of optimizations across Verilator’s codebase that reduce the execution time and memory usage of model generation. They are listed below, each followed by a table with the measured improvements.

  • We removed the m_name field from AstVarRef (“variable reference”), one of the most frequently occurring AST nodes. The name stored in AstVarRef was always identical to the name of the variable it refers to, which made the field redundant, so the node now uses the variable’s name directly. Dropping this single field reduced memory usage by 8-12% and verilation time by 4-5% for our test designs:

Performance comparison 1

  • We replaced C++’s dynamic_cast with a custom runtime type information solution for checking graph types, similar to what was already done for the AST and data-flow graphs. This type information is used for downcasting generic graph vertices and edges to concrete types, and it is significantly faster than dynamic_cast. On top of this change, we removed the type checks for graph vertices and edges where the target type of the downcast is obvious, resulting in further performance improvements for our test designs:

Performance comparison 2

  • Another improvement was to rework the m_selfPointer field in AstVarRef and AstCCall (C/C++ function call). This field holds a string that needs to be put before a given variable reference or function call, and represents a reference to a “self” object, such as this. Previously, each instance of these nodes held its own copy of the “self pointer” string. After our change, these strings are shared across multiple instances. Apart from bringing slight memory usage and performance benefits, the change cleans up the code a bit by limiting the number of ad-hoc string operations:

Performance comparison 3

  • We also replaced the costly VN_AS typecast in child node getters with C++’s reinterpret_cast in release builds (this remains unchanged for debug builds for error checking purposes). The type assertion is unnecessary in this particular case, as most of the time the pointers can only be set through properly typed setters. This resulted in 4-5% time improvements for the test designs:

Performance comparison 4
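
The custom type information and the debug-only type check described in the list above can be illustrated with a minimal sketch. All names below (Vertex, VertexType, cast, as) are hypothetical and are not Verilator’s actual classes or macros. The sketch also uses static_cast and a plain assert instead of the reinterpret_cast and VN_AS mentioned above, and it assumes an exact tag match, which only works for leaf classes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type tags, one per concrete vertex class.
enum class VertexType : uint8_t { Generic, Logic, Var };

class Vertex {
    const VertexType m_type;  // set once by the constructor

public:
    explicit Vertex(VertexType type = VertexType::Generic) : m_type{type} {}
    virtual ~Vertex() = default;
    VertexType type() const { return m_type; }

    // Checked downcast: returns nullptr on a type mismatch, like
    // dynamic_cast, but costs only an integer comparison.
    template <typename T>
    T* cast() {
        return m_type == T::s_type ? static_cast<T*>(this) : nullptr;
    }

    // Downcast for call sites where the type is already known: verified by
    // an assert in debug builds, unchecked when NDEBUG is defined.
    template <typename T>
    T* as() {
        assert(m_type == T::s_type);
        return static_cast<T*>(this);
    }
};

class LogicVertex final : public Vertex {
public:
    static constexpr VertexType s_type = VertexType::Logic;
    LogicVertex() : Vertex{s_type} {}
    int cost() const { return 3; }
};

class VarVertex final : public Vertex {
public:
    static constexpr VertexType s_type = VertexType::Var;
    VarVertex() : Vertex{s_type} {}
};
```

With a Vertex* p pointing at a LogicVertex, p->cast<VarVertex>() yields nullptr, while p->as<LogicVertex>() skips the comparison entirely in release builds.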

Path towards multithreaded verilation

While all the changes listed above offer tangible speed and memory usage improvements to the model generation process, the most significant potential for gains lies in introducing multithreading to the currently single-threaded process. Model generation in Verilator consists of multiple stages that mutate a Verilog/SystemVerilog Abstract Syntax Tree (AST) representing the simulated design and then output the model as C++ code, ready to be compiled. The branches of the AST that are independent of each other can, in theory, be processed in parallel. However, since Verilator is a very extensive project, introducing multithreading to the entire verilation process at once is not feasible, so we are introducing the changes incrementally.

Verilator’s functions can be divided into thread-safe and non-thread-safe ones. Each function needs to be thoroughly investigated and appropriately marked by hand, which leaves room for human error that could be detrimental to the overall functioning of the tool. To make sure thread-safe functions only call other thread-safe functions, we introduced a CI check that uses clang annotations to verify this assumption.
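
To give a rough idea of what marking functions by hand can look like, here is a small sketch built on clang’s annotate attribute. The macro names DEMO_MT_SAFE and DEMO_MT_UNSAFE and all functions are hypothetical, made up for illustration only. The attribute itself changes nothing at compile time; it only leaves a marker that an external checker (such as the CI check mentioned above) could read to verify the call rules.

```cpp
#include <cassert>
#include <string>

// Hypothetical marker macros: under clang they attach an annotation to the
// declaration, with other compilers they expand to nothing.
#if defined(__clang__)
#define DEMO_MT_SAFE __attribute__((annotate("MT_SAFE")))
#define DEMO_MT_UNSAFE __attribute__((annotate("MT_UNSAFE")))
#else
#define DEMO_MT_SAFE
#define DEMO_MT_UNSAFE
#endif

// Touches only its argument, so it can be called from any thread.
DEMO_MT_SAFE inline std::string quote(const std::string& s) {
    return '"' + s + '"';
}

// Mutates unsynchronized global state, so it is marked as unsafe.
int g_counter = 0;
DEMO_MT_UNSAFE inline int nextId() { return ++g_counter; }

// Allowed: a thread-safe function that calls only thread-safe functions.
DEMO_MT_SAFE inline std::string describe(const std::string& name) {
    return "signal " + quote(name);
}

// A checker enforcing the rule would reject the following, because a
// function marked as safe calls one marked as unsafe:
//
//   DEMO_MT_SAFE int bad() { return nextId(); }
```

The compiler accepts the rejected example without complaint, which is why the rule has to be enforced by a separate check rather than by the build itself.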

Spawning threads is a relatively slow process, and controlling the number of threads is crucial in this scenario, so we introduced a thread pool that accepts jobs and queues them up for execution. This keeps the CPU from context switching between them unnecessarily, which would lead to performance loss.
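
A generic thread pool of this kind can be sketched in standard C++ as shown below. This is not Verilator’s implementation; the class name ThreadPool and its enqueue method are assumptions made for the example. A fixed number of workers is spawned once, and jobs wait in a queue until a worker is free.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class ThreadPool {
    std::vector<std::thread> m_workers;
    std::queue<std::function<void()>> m_jobs;
    std::mutex m_mutex;  // protects m_jobs and m_stop
    std::condition_variable m_cv;
    bool m_stop = false;

    void workerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock{m_mutex};
                m_cv.wait(lock, [this] { return m_stop || !m_jobs.empty(); });
                if (m_jobs.empty()) return;  // only reachable when stopping
                job = std::move(m_jobs.front());
                m_jobs.pop();
            }
            job();  // run outside the lock so other workers can dequeue
        }
    }

public:
    // Spawns the workers once; no further threads are created later.
    explicit ThreadPool(std::size_t threads) {
        for (std::size_t i = 0; i < threads; ++i) {
            m_workers.emplace_back([this] { workerLoop(); });
        }
    }
    // Lets the workers drain the remaining jobs, then joins them.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_stop = true;
        }
        m_cv.notify_all();
        for (std::thread& t : m_workers) t.join();
    }
    ThreadPool(const ThreadPool&) = delete;
    ThreadPool& operator=(const ThreadPool&) = delete;

    // Queues a job and returns a future for its result.
    template <typename F>
    auto enqueue(F&& f) -> std::future<decltype(f())> {
        using R = decltype(f());
        auto task = std::make_shared<std::packaged_task<R()>>(std::forward<F>(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_jobs.push([task] { (*task)(); });
        }
        m_cv.notify_one();
        return result;
    }
};
```

A caller would create the pool once, for example ThreadPool pool{2}, and then submit work with pool.enqueue([] { return 42; }), collecting the result from the returned future. The number of running threads never exceeds the size chosen at construction, no matter how many jobs are queued.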

Since multithreaded programming is complex and sometimes unstable (especially at the development phase), it was crucial to make sure that single-threaded verilation is unaffected. To ensure there is no time increase or deadlock while generating models using a single thread, we introduced a switch that can disable thread synchronization when it is not needed.
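
One way to picture such a switch is a mutex wrapper that turns locking into a no-op while a global flag is off. The sketch below is only an illustration under that assumption; OptionalMutex and Counter are hypothetical names and do not describe how Verilator implements the switch.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical mutex wrapper: while the switch is off, lock() and unlock()
// do nothing, so single-threaded runs pay no synchronization cost and
// cannot deadlock on this mutex.
class OptionalMutex {
    static bool s_enabled;
    std::mutex m_mutex;

public:
    // Must be set before worker threads start and never while a lock is
    // held, otherwise a lock() and its unlock() could disagree.
    static void enable(bool flag) { s_enabled = flag; }
    static bool enabled() { return s_enabled; }

    void lock() {
        if (s_enabled) m_mutex.lock();
    }
    void unlock() {
        if (s_enabled) m_mutex.unlock();
    }
    bool try_lock() { return s_enabled ? m_mutex.try_lock() : true; }
};

bool OptionalMutex::s_enabled = false;  // off by default: single-threaded

// Example user: the same code path works with the switch on or off.
class Counter {
    OptionalMutex m_mutex;
    int m_value = 0;

public:
    int increment() {
        std::lock_guard<OptionalMutex> lock{m_mutex};
        return ++m_value;
    }
};
```

Because the wrapper provides lock() and unlock(), it works with std::lock_guard, so code using it does not need separate single-threaded and multithreaded paths.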

We also refactored error reporting to make it thread-safe. The error reporting code is used in every part of Verilator and can be called recursively. This change reduced the total C++ emit time in our large-core based test design from 144 s to 80 s. Error reporting is an interesting example of the type of obstacles one meets when refactoring code to be thread-safe: the reporting itself cannot be parallelized, since there is only one output (e.g. the terminal) and messages need to be printed to it in order.
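
The constraint of a single, ordered output that may be reached recursively can be sketched with a reporter guarded by a recursive mutex. This is a generic illustration, not Verilator’s error reporting code; the Reporter class, the message format and the reportConcurrently helper are all made up for the example, and a string buffer stands in for the terminal.

```cpp
#include <cassert>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

class Reporter {
    std::recursive_mutex m_mutex;  // recursive, because report() re-enters
    std::ostringstream m_out;      // stands in for the single shared output
    int m_errors = 0;

public:
    // Prints an error and, optionally, a follow-up note. The lock is held
    // for the whole message so that other threads cannot interleave lines.
    void report(const std::string& msg, const std::string& note = "") {
        std::lock_guard<std::recursive_mutex> lock{m_mutex};
        m_out << "error: " << msg << '\n';
        ++m_errors;
        if (!note.empty()) reportNote(note);
    }
    // Can be called on its own or from report(); in the latter case the
    // calling thread already owns the mutex and simply locks it again.
    void reportNote(const std::string& note) {
        std::lock_guard<std::recursive_mutex> lock{m_mutex};
        m_out << "  note: " << note << '\n';
    }
    int errors() {
        std::lock_guard<std::recursive_mutex> lock{m_mutex};
        return m_errors;
    }
    std::string text() {
        std::lock_guard<std::recursive_mutex> lock{m_mutex};
        return m_out.str();
    }
};

// Fires `perThread` reports at one reporter from each of `threads` threads.
inline void reportConcurrently(Reporter& r, int threads, int perThread) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&r, perThread] {
            for (int i = 0; i < perThread; ++i) r.report("bad width", "declared here");
        });
    }
    for (std::thread& th : pool) th.join();
}
```

The work done while the lock is held is serialized, which is exactly the limitation described above: threads can prepare messages in parallel, but they have to take turns writing them out.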

With the changes listed above already part of mainline Verilator, we can now focus on the next steps towards making multithreaded verilation a reality. We have identified numerous other areas we need to optimize for full parallelization, e.g. the ability to disable multithreading locally so that thread-safe functions are called only when necessary, or the analysis and adjustment of subsequent passes for multithreading optimization purposes. We will be reporting on this progress in future blog notes.

Integrate Verilator into your project’s workflow with Antmicro

While Antmicro’s efforts to enable multithreading across Verilator along with other improvements are ongoing, we can help you adopt, adapt, and extend Verilator to integrate it into a workflow tailored to your particular project, including scaling it for very large designs and co-simulation of entire systems with Renode.

Reach out to us to find out how your use case can benefit from our comprehensive engineering support, which stems from years of experience in commercial and R&D FPGA and ASIC based projects, and from the flexibility and approachability that open source solutions offer.

To learn more about our contributions to Verilator, join us for one of our talks at ORConf on 15-17 September in Munich.
