Improved atomic operations for simulating complex multi-core systems in Renode

Renode allows you to emulate multi-core platforms, executing code on each core in isolation from one another, in parallel. However, in Symmetric Multi-Processing (SMP) architectures, the CPUs can communicate using various mechanisms, e.g. atomic instructions. Renode implements these mechanisms, but the extra synchronization between cores comes at an additional cost in processing time.

In this blog article, we discuss how a different Load-Reserved/Store-Conditional (LR/SC) implementation improves the efficiency of Renode’s emulation. We take a closer look at the new CPU synchronization algorithm we introduced in Renode, based on the Hash Table-based Store-Test (HST) scheme, which increases performance while maintaining accurate emulation of multi-core RISC-V and ARMv8-A platforms.

Handling atomic operations in multi-core simulations

While all modern processor architectures provide atomic instructions, not all of them follow the same approach. Some Reduced Instruction Set Computer (RISC) architectures, like RISC-V, have Load-Reserved/Store-Conditional (LR/SC) instructions, also called Load-Linked/Store-Conditional (LL/SC), while common Complex Instruction Set Computer (CISC) architectures, like x86, have Compare-And-Swap (CAS) instructions.

How LR/SC works:

LR loads a value from a memory address, and marks the address as reserved.
A reservation is invalidated whenever any other core writes to that address, or when the same core executes an SC (regardless of success/failure).
SC conditionally stores a value to the reserved address, if and only if the reservation is still valid.
If SC fails, the entire operation can be retried starting from the LR (as this indicates a new value is available and the calculations must be redone).

In a sample case:

Stage	CPU	Action	Description
1	`CPU0`	`lr 0x1000`	CPU 0 loads a value, reserving its address
2	`CPU1`	`lr 0x1000`	CPU 1 does the same
3	`CPU1`	`add 1`	CPU 1 increments the loaded value
4	`CPU1`	`sc 0x1000`	CPU 1 stores the incremented value, succeeding as the reservation is still valid
5	`CPU0`	`add 1`	CPU 0 now increments the outdated value it loaded before
6	`CPU0`	`sc 0x1000`	CPU 0 tries to store the new value and fails, as the reservation was invalidated (by CPU 1’s `sc`)

An LR/SC pair thus behaves like a database transaction: it either completes without interference, or it fails and must be retried.
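The rules and the six-stage table above can be replayed with a small model. This is a minimal, single-threaded sketch for illustration only, not Renode’s implementation: the `ReservationMemory` class and its `lr`, `sc`, and `store` methods are hypothetical names, and the model assumes one reservation per CPU, tracked by exact address.

```python
class ReservationMemory:
    """Toy model of LR/SC: one reservation per CPU, tracked by address."""

    def __init__(self):
        self.mem = {}        # address -> value
        self.reserved = {}   # cpu id -> address currently reserved by that CPU

    def lr(self, cpu, addr):
        """Load the value and mark the address as reserved for this CPU."""
        self.reserved[cpu] = addr
        return self.mem.get(addr, 0)

    def store(self, cpu, addr, value):
        """Write the value and invalidate other CPUs' reservations on addr."""
        self.mem[addr] = value
        for other, held in list(self.reserved.items()):
            if other != cpu and held == addr:
                del self.reserved[other]

    def sc(self, cpu, addr, value):
        """Store only if this CPU's reservation on addr is still valid."""
        # An SC always drops the CPU's own reservation, success or failure.
        ok = self.reserved.pop(cpu, None) == addr
        if ok:
            self.store(cpu, addr, value)
        return ok


def replay():
    """Replay the table, then let CPU 0 retry from the LR."""
    m = ReservationMemory()
    m.mem[0x1000] = 0
    v0 = m.lr(0, 0x1000)               # stage 1
    v1 = m.lr(1, 0x1000)               # stage 2
    v1 += 1                            # stage 3
    ok1 = m.sc(1, 0x1000, v1)          # stage 4: succeeds
    v0 += 1                            # stage 5: works on a stale value
    stage6_ok = m.sc(0, 0x1000, v0)    # stage 6: fails
    retries = 0
    ok0 = stage6_ok
    while not ok0:                     # retry the whole LR ... SC sequence
        v0 = m.lr(0, 0x1000)
        ok0 = m.sc(0, 0x1000, v0 + 1)
        retries += 1
    return ok1, stage6_ok, retries, m.mem[0x1000]
```

In this model, `replay()` returns `(True, False, 1, 2)`: CPU 1’s store succeeds, CPU 0’s first attempt fails, and a single retry leaves both increments applied.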

CAS, on the other hand, loads a value from memory, checks that it is identical to an expected value, and conditionally stores a new value if it is.
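As a rough illustration of these semantics, the sketch below models a CAS on a single memory cell and uses it in the usual retry loop. The `Cell` class, `compare_and_swap`, `atomic_increment`, and `run` are hypothetical names, and the `threading.Lock` merely stands in for the atomicity that hardware provides in a single instruction.

```python
import threading


class Cell:
    """Toy memory cell; the lock stands in for hardware atomicity."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`."""
        with self._lock:
            if self.value != expected:
                return False
            self.value = new
            return True


def atomic_increment(cell):
    """Retry until the CAS observes the same value that was loaded."""
    while True:
        old = cell.value
        if cell.compare_and_swap(old, old + 1):
            return old + 1


def run(n_threads=4, n_iters=1000):
    """Increment one shared cell from several threads and return its value."""
    cell = Cell(0)

    def worker():
        for _ in range(n_iters):
            atomic_increment(cell)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.value
```

Every successful CAS raises the value by exactly one, so `run()` returns `n_threads * n_iters` no matter how the threads interleave. Note that the comparison is purely by value, which is what the ABA problem discussed next exploits.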

LR/SC semantics are more expressive than CAS, meaning they can be used to implement a wider array of behaviors. LR/SC can be used to emulate CAS, but CAS cannot be used to correctly emulate LR/SC. Attempting to do so gives rise to the ABA problem.

The ABA problem arises because CAS only checks whether the current value equals the expected one; it cannot detect that the value was changed and then reverted to its original state. The figure below depicts a lock-free stack in one scenario where the ABA problem manifests.

ABA problem - sample case - graph

In this example, thread 1 performs some stack manipulation, but because it uses CAS, it is unaware that thread 2’s interleaved operations removed node B. Had thread 1 instead loaded the top in step one using an LR, the subsequent SC would have failed, as shown in the table above.

CAS and LR/SC semantics can therefore produce different outcomes, and this is a clear example where CAS fails to emulate LR/SC correctly.
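The difference can be reproduced with a deterministic, single-threaded interleaving. The sketch below is an assumption-laden model rather than real lock-free code: `Node`, `Top`, and `scenario` are hypothetical names, the stack is assumed to hold A, B, C with A on top, and a `writes` counter stands in for a reservation that any store invalidates.

```python
class Node:
    def __init__(self, name, next=None):
        self.name = name
        self.next = next


class Top:
    """Stack top pointer offering both CAS and a modelled LR/SC."""

    def __init__(self, node):
        self.node = node
        self.writes = 0   # bumped on every store; models reservation loss

    def store(self, node):
        self.node = node
        self.writes += 1

    def cas(self, expected, new):
        """Succeeds whenever the pointer equals `expected`, history ignored."""
        if self.node is not expected:
            return False
        self.store(new)
        return True

    def lr(self):
        """Return the pointer plus a token representing the reservation."""
        return self.node, self.writes

    def sc(self, token, new):
        """Succeeds only if no store happened since the matching lr()."""
        if self.writes != token:
            return False
        self.store(new)
        return True


def scenario(use_cas):
    c = Node("C")
    b = Node("B", c)
    a = Node("A", b)
    top = Top(a)

    # Thread 1, step one: read the top and its successor, then get preempted.
    old, token = top.lr()
    new = old.next                 # B

    # Thread 2 interleaves: pop A, pop B, push A back.
    top.store(a.next)              # top = B
    top.store(b.next)              # top = C, node B is now removed
    a.next = top.node              # A now points at C
    top.store(a)                   # top = A again

    # Thread 1 resumes and tries to finish popping A.
    ok = top.cas(old, new) if use_cas else top.sc(token, new)
    return ok, top.node.name
```

Here `scenario(True)` returns `(True, "B")`, meaning the CAS succeeds and leaves the stack pointing at the removed node, while `scenario(False)` returns `(False, "A")`, so the operation fails and can be retried against the current state.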

Emulating LR/SC on CISC platforms with Hash Table-based Store-Test

Our previous approach, a global memory lock, correctly emulated LR/SC on CISC platforms in Renode, but it came with a large performance hit, as it blocks all memory accesses from other cores while an atomic instruction is executing. Fine-grained locking of individual hash table entries instead locks only a very small memory region at a time, which performs far better than a global memory lock.

The Hash Table-based Store-Test (HST) is a scheme proposed by Zhao et al. that enables LR/SC emulation on CISC platforms while preserving good performance. It uses an extremely efficient hash table implementation to keep track of memory accesses.

Before HST, ensuring correctness with the global memory lock required all memory accesses to go through the address translation layer (the softMMU), which meant disabling caching and degrading performance even further. With HST, every store is instrumented, regardless of whether it passes through the softMMU. This ensures that HST is aware of every possible reservation invalidation that can occur. Both RISC-V and ARMv8-A guests use HST to efficiently emulate atomics.
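The general idea of a hash-table-based store test can be sketched as follows. This is a simplified, single-threaded model based on one reading of the scheme, not Renode’s code: `HstMemory`, `HST_SIZE`, `NO_OWNER`, and the hash in `slot` are all assumptions made for illustration, and the per-entry locking that makes the check and store in `sc` atomic on a real host is omitted.

```python
HST_SIZE = 1 << 8   # hypothetical number of hash table entries
NO_OWNER = -1       # marks an entry with no valid reservation


class HstMemory:
    """Toy model: each hash table entry records the last CPU to touch it."""

    def __init__(self):
        self.mem = {}                         # address -> value
        self.table = [NO_OWNER] * HST_SIZE    # entry -> CPU id

    @staticmethod
    def slot(addr):
        # Hypothetical hash: drop the low bits so one word maps to one entry.
        return (addr >> 2) % HST_SIZE

    def lr(self, cpu, addr):
        """Load the value and tag the entry with the reserving CPU."""
        self.table[self.slot(addr)] = cpu
        return self.mem.get(addr, 0)

    def store(self, cpu, addr, value):
        """Instrumented store: overwrite the entry tag, then write memory."""
        self.table[self.slot(addr)] = cpu
        self.mem[addr] = value

    def sc(self, cpu, addr, value):
        """Store only if the entry is still tagged with this CPU."""
        s = self.slot(addr)
        if self.table[s] != cpu:
            return False
        self.mem[addr] = value
        self.table[s] = NO_OWNER   # the reservation is consumed
        return True


def demo():
    m = HstMemory()
    # A store by another CPU to an unrelated entry leaves the reservation valid.
    m.lr(0, 0x1000)
    m.store(1, 0x1004, 7)
    unrelated = m.sc(0, 0x1000, 1)
    # A store by another CPU to the reserved address invalidates it.
    m.lr(0, 0x1000)
    m.store(1, 0x1000, 9)
    same_addr = m.sc(0, 0x1000, 2)
    # A store to a different address that hashes to the same entry also does.
    m.lr(0, 0x1000)
    m.store(1, 0x1000 + 4 * HST_SIZE, 3)
    collision = m.sc(0, 0x1000, 4)
    return unrelated, same_addr, collision
```

In this model, `demo()` returns `(True, False, False)`. The last case is a hash collision: the SC fails although nobody wrote to the reserved address. Such a false failure costs the guest a retry but never lets a stale store through, which is the trade-off that allows the table to stay small and the store instrumentation cheap.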

In synthetic tests, our CPU microbenchmark shows that the efficiency of this implementation scales with the number of cores:

Performance in CPU microbenchmark - LRSC

The PARSEC benchmarks compare the HST implementation against the previous solution with global serial execution turned on, showing 2.7x faster operation in Renode:

Absolute performance in PARSEC - Facesim line plot

Advanced system simulations in Renode for your use case

With the recent improvements to atomic operations and more efficient emulation of SMP platforms, Renode excels at simulating complex multi-core systems. The HST-based atomic operations are both reliable and fast: benchmarks show up to three times faster execution than the solutions used previously, and the improvement grows with the number of cores involved. SMP platforms can now be tested in a time-efficient manner, which further accelerates the development and testing enabled by Renode.

Antmicro has broad experience using Renode to simulate entire systems for purposes such as design and exploration, code coverage analysis, testing and validation, optimization work, and more; we also integrate Renode with external IDEs and other tools to augment these processes.

Email us at contact@antmicro.com to discuss how Antmicro’s Renode services can support your development efforts and the systems you design, operate, and expand.

