10 hrs to 37 mins - Optimizing LLVM for machine-generated C++ code

Topics: Open software libraries, Open source tools

While helping customers automate and optimize their workflows, especially in complex use cases like ASIC design, Antmicro often finds itself building and enhancing multi-layered code generation infrastructure, HLS tools, transpilers and compilers of various kinds. A good example of such a complex flow is simulating Scala-based Chisel HDL with Verilator, which involves going from Chisel through SystemVerilog to C++ and then compiling this down to machine code.

Verilator can use Clang (LLVM), GCC or MSVC++ for the last step, with LLVM generally being the fastest in our customers’ use cases. In this article, we describe improvements aimed at speeding up Clang when compiling C++ code generated with Verilator and provide relevant benchmarks. The optimization effort ultimately made our verilated C++ model compile 20x faster with LLVM, making it feasible to use with even the largest Chisel designs.

Verilator and LLVM logos with a speedometer in the middle

Turning Chisel designs into executable models

CHIPS Alliance’s Chisel is a high-level hardware construction DSL (Domain-Specific Language), embedded in Scala. Chisel designs are compiled down by the Chisel toolchain into a lower-level hardware description language (HDL) such as IEEE standard 1364-2005 (Verilog) for the purposes of synthesis and simulation.

Interestingly, the Chisel framework itself transpiles Scala code using another LLVM project, CIRCT; the resulting Verilog is then used by Verilator to produce C++ simulation runtime code. This is in turn handed over to the C++ compiler - LLVM in our case - to obtain the simulation runtime binary, which allows for checking the correctness of designs.

The entire process is multi-layered, as shown below:

Process of changing source code from a high-level language to C++

For this particular workflow, the main problem stems from the fact that LLVM and Verilator were mainly designed to achieve a high level of performance with human-written code, and generated code - generally more long-winded and prioritizing ease of generation over elegance or human readability - is somewhat of a corner case.

Compiling a verilated design with the Clang compiler

For the use case in question (a very large and fairly involved design), the compilation process hung when building an executable from the emitted C++ code with Clang and the -O1 optimization flag. In particular, compilation looped for many hours in the Scalar Replacement of Aggregates (SROA) pass of the LLVM inlining stage. Compiling with GCC did work - proving that the design itself made sense and the problem wasn’t with Verilator or CIRCT - but took a long time, and the assumption was that LLVM would be able to do the same thing faster, if only we could fix the underlying problem.

After investigation, it was established that the verilated design generated a large number of LLVM alloca instructions, which caused a slowdown in the SROA pass, as this pass transforms struct members into scalar variables.

To illustrate this, consider the following Scala code, created in Chisel:

package xor

import chisel3._
import _root_.circt.stage.ChiselStage

class XOR extends Module {
  val VARS = 16384
  val WIDTH = 1

  val in = IO(Input(UInt(16.W)))
  val out = IO(Output(UInt(WIDTH.W)))

  val test = RegInit(VecInit(Seq.fill(VARS)(WIDTH.U(WIDTH.W))))

  for (i <- test.indices) {
	test(i) := test(i) ^ in
  }

  out := test(in)
}

object XOR extends App {
  ChiselStage.emitSystemVerilogFile(
	new XOR,
	firtoolOpts = Array(
  	"-disable-all-randomization",
  	"-strip-debug-info"
	)
  )
}

After transforming this via Chisel to SystemVerilog, verilating it with Verilator to obtain C++ code, and compiling the generated C++ with Clang with statistics enabled, the statistics reported a very large PromotableAllocas count. Alloca instructions that may be promoted to registers in the subsequent PromoteMemToReg phase were being handled inefficiently in the SROA pass, as a linear search was performed on an array of alloca instructions instead of a more informed lookup.

Optimizing the search process in the LLVM SROA phase

Since we have a guarantee that each alloca is a unique entry, we optimized the search by employing the SetVector data structure, reducing search complexity from O(n^2) to O(n log n) while preserving the order of entries.

This is a purely algorithmic change, so no functional changes were needed: the PromotableAllocas container carries the same entries as before, just structured differently, and the PromoteMemToReg phase is unaffected. In our case, however, the container grew so large that its inefficient handling had a massive and demonstrable impact on overall compilation performance.

With the introduction of this change, the compilation time for one of our projects was reduced from over 10 hours to just 37 minutes. To illustrate the improvement, we created a minimal test case in Chisel (described above). In the process of crafting this test case, as well as debugging LLVM, we used our open source sv-bugpoint tool, which we described in a recent article, once again proving the tool’s utility.

To learn more about the resulting changes Antmicro introduced to LLVM, check out the pull request on GitHub.

Benchmarks and future plans

Using the design code discussed in the previous section, we implemented the following benchmarking scenario:

  1. The design code, written in Scala, is transpiled to SystemVerilog using Chisel.
  2. The emitted SystemVerilog is verilated with Verilator.
  3. The generated C++ code is compiled with both stock and modified LLVM Clang.

Since compilation performance is the main focus of this benchmark, only the last step needs to be measured. To carry out reliable measurements, we used the GNU time utility, performing 10 measurements for each number of SystemVerilog registers in our parametrized Chisel sample. The chart below shows the averaged results.

Benchmarks of compilation time comparing Clang and Clang with SROA optimization

The next step in improving LLVM performance when compiling Verilator-generated models is already underway, with work focusing on optimizations on the Verilator side to generate even more compact C++ code, shortening compilation time further still.

Unlocking seamless hardware design with Antmicro’s cutting-edge custom toolchain enhancements

At Antmicro, improvements in the workflow are an integral part of the work itself. We provide commercial support for all sorts of open source tooling, including performance improvements and new features to accelerate the workflows of our customers using our software-based approach.

Get in touch at contact@antmicro.com to learn more about our services.
