OTA updates and fleet management for Zephyr-controlled MCUs with RDFM

Published:

Topics: Open cloud systems, Open source tools

Reliable, fast and secure updates are crucial for the uptime and scalability of the physically dispersed industrial setups that Antmicro helps build. To address this need in internal and customer projects, we have been developing Remote Device Fleet Manager (RDFM), Antmicro’s open source framework for modular, configurable OTA updates as well as fleet and ML model management.

So far, work on RDFM has focused on systems running Linux and Android. However, real-world devices often include auxiliary MCUs or MCU-based daughter nodes that also run software, so addressing updates for resource-constrained devices was the natural next step for the framework. Such MCU + application solutions are widely used across many industries, including robotics, manufacturing, medical and space, where real-time, continuous-operation functions are handled by lower-performance devices and user-facing functionality typically runs on Linux or Android. In those mixed-OS scenarios, having a unified management system is especially useful.

As Platinum Members of the Zephyr Project and active contributors, we picked Zephyr as the RTOS to integrate with, enabling support for microcontrollers and other constrained devices in RDFM. Below, we go into the details of the implementation and present a demo of the new functionality working with devices simulated in Renode, which now supports over 500 Zephyr-enabled MCUs.

OTA updates for Zephyr with RDFM, illustration

Secure, efficient update flow with A/B partitioning

Just like the Linux/Android implementation, RDFM for Zephyr uses an A/B partitioning scheme, handled by Zephyr and the MCUboot bootloader. The implementation works with MCUmgr, an open source MCU management library already used in the RTOS. Communication between RDFM and the target device is handled by the MCUmgr RDFM client, which allows remote deployment of Zephyr applications via RDFM to devices running an MCUmgr server that implements the Simple Management Protocol (SMP). The client checks for updates, fetches and validates them, and uploads them to the target device, which minimizes the overhead on the device itself. A single MCUmgr client can act as a proxy for multiple devices at a time, and since MCUmgr is not Zephyr-specific, it also offers potential for extending RDFM support to other RTOSs.
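To make the division of labor concrete, here is a minimal Python sketch of a proxy client running the check, validate and upload steps for several devices. It is an illustration only and does not reproduce the actual MCUmgr RDFM client: `UpdatePackage`, `FakeDevice`, `validate` and `proxy_update` are hypothetical names, the SHA-256 comparison stands in for whatever validation the real client performs, and the device is an in-memory stub rather than an SMP connection.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class UpdatePackage:
    """Hypothetical package record, not RDFM's actual metadata format."""
    version: str
    payload: bytes
    sha256: str


@dataclass
class FakeDevice:
    """Stand-in for an MCU reached over SMP; it only stores what it is sent."""
    name: str
    running_version: str
    uploaded: bytes = b""

    def upload(self, image: bytes) -> None:
        self.uploaded = image


def validate(package: UpdatePackage) -> bool:
    """Check the payload against its expected digest before touching a device."""
    return hashlib.sha256(package.payload).hexdigest() == package.sha256


def proxy_update(devices, available):
    """Run check -> validate -> upload for every device behind one client.

    `available` maps a device name to the package offered for it (or None).
    Returns the names of the devices that received an image.
    """
    updated = []
    for device in devices:
        package = available.get(device.name)
        if package is None or package.version == device.running_version:
            continue  # nothing new for this device
        if not validate(package):
            continue  # corrupted download: never forwarded to the device
        device.upload(package.payload)
        updated.append(device.name)
    return updated
```

In this sketch the devices never see a package that fails validation, which mirrors the idea that the heavier work stays on the client side and the constrained target only receives a ready-to-write image.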

A typical update flow with RDFM looks as follows:

  • The device’s flash memory is partitioned in advance into two slots: primary and secondary
  • The main application runs from the primary slot
  • When an update starts, the new image is uploaded to the secondary slot and marked as a test image
  • Upon reboot, MCUboot swaps the contents of the primary and secondary slots
  • The new image is booted
  • If the boot succeeds, the new image is marked as confirmed, which finalizes the update
  • If an error occurs, the device reboots without confirming the image; MCUboot then treats the new image as rejected, swaps the contents of the primary and secondary slots once more, and boots the old image.
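The test, confirm and revert logic in the list above can be sketched as a small state machine. The Python model below is a toy simulation under simplifying assumptions, not MCUboot code: the `Image` and `Device` classes, the `boots_ok` flag and the `swap_pending` field are hypothetical names introduced for illustration, and a successful start is assumed to confirm the image immediately.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Image:
    version: str
    boots_ok: bool = True    # whether the application starts and confirms itself
    confirmed: bool = False


class Device:
    """Toy model of two-slot test/confirm/revert behaviour (not MCUboot code)."""

    def __init__(self, running: Image):
        running.confirmed = True
        self.primary = running
        self.secondary: Optional[Image] = None
        self.swap_pending = False

    def upload_test_image(self, image: Image) -> None:
        """Write the new image to the secondary slot and mark it for a test boot."""
        self.secondary = image
        self.swap_pending = True

    def reboot(self) -> str:
        """Model one reset and return the version that ends up in the primary slot."""
        if self.swap_pending and self.secondary is not None:
            # A test image is waiting: swap it into the primary slot.
            self.primary, self.secondary = self.secondary, self.primary
            self.swap_pending = False
        elif not self.primary.confirmed and self.secondary is not None:
            # The previous test boot was never confirmed: swap the old image back.
            self.primary, self.secondary = self.secondary, self.primary
        if self.primary.boots_ok:
            # A healthy application confirms itself after starting.
            self.primary.confirmed = True
        return self.primary.version
```

For example, uploading `Image("v1", boots_ok=False)` to a device running `v0` and calling `reboot()` twice models a failed update: the first reset swaps in `v1`, which never confirms itself, and the second reset (for instance one triggered by a watchdog) swaps `v0` back into the primary slot.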

Several kinds of errors will lead MCUboot to revert to the old image, for example:

  • The new image causes the device to crash and reboot (e.g. via a watchdog reset)
  • The image is signed with an incorrect key
  • The image was corrupted during the image swap process.

RDFM update flow diagram

Developing and testing Zephyr-based solutions in Renode

The RDFM repository contains CI tests that build RDFM, the MCUmgr RDFM client and Zephyr-based payloads, and then test the entire update flow in a Renode simulation.

Renode offers vast and ever-growing support for Zephyr-enabled devices, with over 500 supported boards at the time of writing, as illustrated by the Renode Zephyr Dashboard, a massive CI setup that tests all Zephyr targets across a range of demos in simulation. Work on the dashboard has also enabled us to bring many improvements to Zephyr itself, including a stream of updates to devicetree descriptions and other fixes.

Combining RDFM, simulation in Renode, and a CI environment lets developers test their solutions end-to-end in a scalable and reliable way, taking post-deployment scenarios into account very early in the process, and helps ensure that the software keeps functioning properly later on.

RDFM CI workflow for Zephyr

The videos below demonstrate an example workflow for updating an STM32F7 MCU (simulated in Renode) running Zephyr using RDFM, with updates served in a GitLab CI environment. In the first clip, the RDFM server starts, and the update packages are uploaded by the CI system and identified by RDFM:

The second part of the update flow video shows the process of creating a device group in the manager, adding a device to the group, and successfully deploying an update from v0 to v1 on the device via MCUmgr.

Build and control your distributed, multi-OS solutions with Antmicro

With the addition of Zephyr RTOS support, RDFM, along with other parts of Antmicro’s open source portfolio, is not only a tool capable of updating and managing multi-OS fleets of devices but also a development aid that can help test your products throughout their entire lifecycle.

If you are interested in adopting RDFM for managing your existing fleets, or you would like to use Antmicro’s expertise for building such systems from scratch with complete transparency and ownership of the finished product, do not hesitate to contact us at contact@antmicro.com to discuss your needs.
