Fast development of AI applications in Zephyr with Linkable Loadable Extensions support in Kenning


Topics: Edge AI, Open source tools

Kenning and Kenning Zephyr Runtime enable easy, iterative development of ML models deployed on Zephyr RTOS using various AI inference libraries. When testing models on a device with Kenning, switching between models is usually a matter of running Kenning on the host with a new model to optimize and deploy. However, when any changes in the AI inference library are involved, they require recompiling the binary and flashing the board, increasing turnaround time. This includes situations like:

  • switching between inference libraries (e.g. from microTVM to IREE)
  • changing models in non-interpreter-based inference libraries (where the model is compiled into a tailored library, like microTVM)
  • introducing new operators to interpreter-based inference libraries, such as LiteRT for microcontrollers (formerly TensorFlow Lite Micro)
  • developing or optimizing selected operators in the model and testing them on the target platform.

To enable fast-turnaround model and runtime development in such cases, Kenning now takes advantage of Linkable Loadable Extensions (LLEXT), one of the recent features of Zephyr. Extensions are binaries containing function implementations that can be loaded into a running application, much like a shared library. This lets developers load and replace model data along with the implementations of its operations, and switch between various runtimes by packaging them as LLEXT modules and loading them on demand, while the rest of the application and the basic wrappers around LLEXT modules stay unmodified and keep running on the hardware.
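
For reference, this is roughly how the base application side can look when using Zephyr’s LLEXT API directly — a minimal sketch, not Kenning Zephyr Runtime’s actual wrapper code; the ext_elf buffer, the "runtime" extension name and the runtime_run_model symbol are hypothetical:

#include <errno.h>
#include <zephyr/kernel.h>
#include <zephyr/llext/llext.h>
#include <zephyr/llext/buf_loader.h>

/* ELF image of the extension, e.g. received over UART
 * (ext_elf and ext_elf_size are hypothetical and filled elsewhere) */
extern uint8_t ext_elf[];
extern size_t ext_elf_size;

static struct llext *ext;

int load_and_run_extension(void)
{
    /* wrap the in-memory ELF in a buffer-based loader */
    struct llext_buf_loader buf_loader = LLEXT_BUF_LOADER(ext_elf, ext_elf_size);
    struct llext_load_param ldr_parm = LLEXT_LOAD_PARAM_DEFAULT;
    int ret;

    /* drop the previously loaded module, if any, before swapping in a new one */
    if (ext != NULL) {
        llext_unload(&ext);
    }

    ret = llext_load(&buf_loader.loader, "runtime", &ext, &ldr_parm);
    if (ret != 0) {
        return ret;
    }

    /* resolve the inference entry point exported by the extension by name */
    int (*runtime_run_model)(void) =
        (int (*)(void))llext_find_sym(&ext->exp_tab, "runtime_run_model");

    return runtime_run_model != NULL ? runtime_run_model() : -ENOENT;
}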

In this article, we will go over how LLEXT support and its implementation in Kenning Zephyr Runtime enable easy model and runtime updates on hardware without reflashing, both in lab conditions and directly on production devices. We will also describe demo applications for LiteRT for microcontrollers and microTVM running in Zephyr on boards simulated in Antmicro’s Renode open source simulation framework.

Fast development of AI applications in Zephyr with Linkable Loadable Extensions

Improved build configurability

Kenning Zephyr Runtime allows developers to benchmark and evaluate models on platforms supported in Zephyr, test inference implementations and deploy tailored model implementations in production. It can use either microTVM, LiteRT for microcontrollers or IREE for inference, depending on which works best for a given use case (model, HW, etc.), on a broad selection of platforms (also simulated in Renode), as illustrated by the Renode Zephyr Dashboard.

Originally, Kenning Zephyr Runtime required developers to provide models in the build configuration already converted to the target format, such as TVM-generated code or a TFLite FlatBuffer. This meant performing additional steps in other scripts or in Kenning to convert models from their original format (Keras, PyTorch, ONNX) to the supported variant.

What we’ve introduced now is automatic conversion of models to the target format within the CMake build flow. This allows you to pass models in their original format, like so:

west build -p always -b stm32f746g_disco app -- \
    -DEXTRA_CONF_FILE=tvm.conf \
    -DCONFIG_KENNING_MODEL_PATH=./model.onnx

Here, we build the microTVM runtime, but pass the input model in the ONNX format.
Under the hood, the project’s CMake runs Kenning, which optimizes and compiles the model, so that the user can simply load it from the application under development.

To get an actual working example, once Kenning Zephyr Runtime is configured as described in the project’s README, run west with a URL to a model, e.g.:

west build -p always -b stm32f746g_disco app -- \
    -DEXTRA_CONF_FILE=tvm.conf \
    -DCONFIG_KENNING_MODEL_PATH="https://dl.antmicro.com/kenning/models/classification/magic_wand.h5"

Initial support for quick model and runtime library updates in Kenning Zephyr Runtime with LLEXT

One particular issue the Linkable Loadable Extensions implementation aims to solve is the use of model-specific operations generated by compilers such as TVM. In such cases, compiling a model for a target device produces a library and/or sources for the specific operations used by the model, or changes the entire implementation of the model, including its operations. In these scenarios, the user would typically be forced to recompile the entire application and flash the device with the new model.

With LLEXT, you can create an ELF with the model and runtime implementation and deliver it as a separate LLEXT module which can be easily replaced, e.g. via UART. This minimizes the number of steps necessary to test, optimize and deploy models on target platforms. It could also be used to dynamically update models running in production scenarios without forcing a full restart of the device, and allows for easier dynamic OTA updates, e.g. using RDFM, which added Zephyr support earlier this year.

Within LLEXT, you only need to deliver the implementation of the model and associated functions, along with symbols that are used by the application running on the device to call functions responsible for model inference.
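
On the extension side, the module only needs to compile the relevant functions and export the symbols the base application resolves at load time. Below is a minimal sketch of such a module; the model_invoke name and signature are hypothetical and only illustrate the mechanism, not Kenning’s actual interface:

/* Source of the LLEXT module itself, compiled into a standalone ELF */
#include <zephyr/llext/symbol.h>

/* hypothetical inference entry point resolved by name from the base app */
int model_invoke(const void *input, void *output)
{
    /* runtime- and model-specific inference code (e.g. generated by microTVM)
     * would go here */
    return 0;
}

/* place the symbol in the extension's exported symbol table, so the base
 * application can look it up after loading the module */
LL_EXTENSION_SYMBOL(model_invoke);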

Rebuild runtime implementations with simplified deployment using Kenning’s RuntimeBuilder block

The RuntimeBuilder block is a recent addition to Kenning’s core that supplements the functionality offered by LLEXT by providing a class for building new runtime implementations and delivering them to targets. It allows for seamless switching between runtime implementations, e.g. in Zephyr-based solutions or in situations where the runtime implementation is tightly bound to the implementation of the model (as in IREE or microTVM). This gives developers the ability to tinker with the implementation of the runtime, not just optimize the model, and grants more control over model execution.

To enable automatic Zephyr builds, add ZephyrRuntimeBuilder to the scenario, e.g.:

...
"runtime_builder": {
    "type": "kenning.runtimebuilders.zephyr.ZephyrRuntimeBuilder",
    "parameters": {
        "workspace": ".",
        "board": "stm32f746g_disco",
        "extra_targets": ["board-repl"],
        "output_path": "./output",
    }
},
...

To demonstrate the potential offered by LLEXT and the recent changes in Kenning, let’s start by setting up the Kenning Zephyr Runtime ecosystem. To do so, follow the setup instructions in the Quickstart section of the project’s README (all steps leading up to Building and running demo app). You will then have a Docker container with all necessary dependencies for the demo, including Zephyr, Kenning and Renode.

Now, build the base app for Kenning benchmarking with the llext.conf configuration — it builds Kenning’s inference server for the target device with LLEXT enabled. The app does not contain any runtime — it waits for an LLEXT module with the runtime and the model, and loads it into the application once it arrives via UART. You can build the base app as follows:

west build -p always -b stm32f746g_disco app -- -DEXTRA_CONF_FILE="llext.conf"

To simulate the STM32F746 Discovery, we will use the Renode framework. While building the ELF file, corresponding REPL files for Renode are built as well.

Let’s now simulate the device in Renode — since it’s ready to use in the running container, let’s attach to it in a separate terminal. Open a new terminal, check the ID of the running container (e.g. with docker ps or from the container’s prompt — root@03bb003e0882 means that the ID is 03bb003e0882) and run:

docker exec -it 03bb003e0882 /bin/bash
source .venv/bin/activate
python ./scripts/run_renode.py

In Renode, you may observe the following logs:

Starting Renode simulation. Press CTRL+C to exit.
*** Booting Zephyr OS build v4.0.0-4-g371e3a6769c4 ***
I: Inference server started
E: Error receiving message: 773 (KENNING_PROTOCOL_STATUS_TIMEOUT)
E: Error receiving message: 773 (KENNING_PROTOCOL_STATUS_TIMEOUT)
...

The timeout messages are expected — the server is waiting for activity from the Kenning application running on the host.

In the original container shell, an LLEXT module containing the runtime and the model can be either built and deployed manually, or automatically using Kenning, as shown below:

python -m kenning optimize test \
    --json-cfg kenning-scenarios/zephyr-tflite-llext-magic-wand-inference.json \
    --measurements results.json \
    --verbosity INFO

Later, you can switch to an entirely different runtime (while the same application keeps running on the device), in this case microTVM:

python -m kenning optimize test \
    --json-cfg kenning-scenarios/zephyr-tvm-llext-magic-wand-inference.json \
    --measurements results.json \
    --verbosity INFO

Kenning will now build the LLEXT module, send it to the device and run inference in the loaded runtime.

Kenning Zephyr Runtime demo applications in Zephyr Dashboard

To demonstrate the extent of platform support in Kenning Zephyr Runtime, we introduced example applications for LiteRT for microcontrollers and microTVM to the Renode Zephyr Dashboard. The dashboard lets you see CLI recordings, detailed execution logs and post-test reports from the demos running in Zephyr for each board, as well as download all the files necessary for the simulation.

As of now, Kenning Zephyr Runtime supports 524 different boards with the LiteRT inference engine and 477 boards with the microTVM inference engine. This shows how easily it can be integrated on various hardware platforms with heavily constrained resources.

Speed up development and testing with LLEXT, Kenning and Renode

The implementation of Zephyr’s Linkable Loadable Extensions in Kenning enables developers to hot-swap AI models, their configurations and inference engines without the need to reflash the target platform, leading to quicker turnaround in edge AI platform development. It can also be used to update models on microcontroller platforms on demand in OTA scenarios.

If you are building AI-enabled products, developing edge platforms or inference runtimes and need help with adding support in Zephyr RTOS and Kenning Zephyr Runtime using microTVM, LiteRT or other inference runtimes, or if you want to add support for your boards in Renode for faster prototyping and testing of edge AI systems on physical and simulated targets, feel free to reach out at contact@antmicro.com.
