Over-The-Air streaming updates using RDFM for NVIDIA BSP releases


Topics: Open cloud systems, Open software libraries, Open source tools

When deploying large fleets of devices, the ability to update them remotely, without difficult and costly physical access, is not only convenient but necessary. Today, with widespread LTE or satellite internet access, updates can be deployed in a variety of scenarios, including those that require a lot of flexibility in scheduling the rollout. Implementing reliable OTA updates for complex systems, however, requires a lot of expertise and careful planning, as well as a solid toolkit.

An image showing RDFM in the cloud connecting and downloading updates to a large fleet of NVIDIA devices

To provide a robust implementation of Over-The-Air (OTA) updates for NVIDIA Board Support Packages (BSPs), some time ago Antmicro developed an update solution for NVIDIA-based devices that builds on our previous work with OTA implementations for embedded Linux, Android and other systems. Our solution is compatible with newer NVIDIA JetPack releases through the meta-antmicro Yocto layer. RDFM can be used to remotely update devices built around the Xavier and Orin families of Systems-on-Module (SoMs), such as ones using the Jetson Orin Baseboard designed by Antmicro. Our OTA update solution can also be integrated with the latest Linux4Tegra (L4T) release, which includes a new kernel and bootloaders, ensuring compatibility with new devices without requiring specific software customizations.

Existing NVIDIA JetPack solutions for bootloader updates

When implementing any update system, it is essential to include A/B redundancy to ensure the safety and stability of the boot process. This requires changing boot slots, which is handled by the nvbootctrl tool in NVIDIA BSP releases. This tool operates differently in various versions of the NVIDIA JetPack SDK, as described below:

JetPack 4 kept the boot slot metadata in a separate partition in the QSPI flash memory. The nvbootctrl tool read and modified this partition, and CBoot/U-Boot used the updated metadata. In this implementation, the initial bootloaders are agnostic to any slot changes. When the Jetson SoM boots to either CBoot/U-Boot or UEFI, the Boot Configuration Table stored in the QSPI flash memory is overwritten, resulting in changed data that is provided to the initial bootloaders.

In later versions of JetPack, software access to the QSPI flash has been limited: a hardware firewall prevents access to the QSPI beyond a certain bootloader stage. This invalidated our previous OTA update method, which modified the Boot Configuration Table by manipulating the QSPI flash block device (/dev/mtdblock0). Starting from release 35 of the NVIDIA Jetson Linux BSP, NVIDIA Tegra-based systems use an OTA update approach that is specific to NVIDIA BSPs, based on storing an update payload package at a predetermined location on the rootfs. Updates are then carried out by rebooting into the bootloader.

To provide for OTA updates in JetPack 5 and 6, nvbootctrl modifies the metadata according to the variables exposed by UEFI. When the boot flow reaches UEFI, a specified application is launched based on the variable set by the tool.

Specifically, three NVIDIA-specific variables are modified:

RootfsStatusSlotA-781e084c-a330-417c-b678-38e696380cb9 - provides the boot status of the A slot (bootable/unbootable)
RootfsStatusSlotB-781e084c-a330-417c-b678-38e696380cb9 - provides the boot status of the B slot (bootable/unbootable)
BootChainFwNext-781e084c-a330-417c-b678-38e696380cb9 - is set when a slot switch is requested from the lower bootloader

The next boot slot can be selected by setting the last variable, BootChainFwNext, to a 32-bit integer identifying the slot to boot from on the next reboot: 0 for slot A and 1 for slot B.
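The value layout described above can be sketched in Python. In efivarfs, the content of a variable file is a 4-byte attribute mask followed by the variable data, so a slot request becomes an 8-byte payload. This is only an illustration: the helper names, the choice of attribute bits and the little-endian encoding of the slot number are assumptions, not taken from nvbootctrl or RDFM.

```python
import struct

# Variable name and vendor GUID as listed above.
NVIDIA_GUID = "781e084c-a330-417c-b678-38e696380cb9"
BOOT_CHAIN_FW_NEXT = f"/sys/firmware/efi/efivars/BootChainFwNext-{NVIDIA_GUID}"

# Standard UEFI attribute bits. Which combination the firmware expects
# for this variable is an assumption made for the sketch.
EFI_VARIABLE_NON_VOLATILE = 0x1
EFI_VARIABLE_BOOTSERVICE_ACCESS = 0x2
EFI_VARIABLE_RUNTIME_ACCESS = 0x4
DEFAULT_ATTRS = (
    EFI_VARIABLE_NON_VOLATILE
    | EFI_VARIABLE_BOOTSERVICE_ACCESS
    | EFI_VARIABLE_RUNTIME_ACCESS
)

SLOT_A, SLOT_B = 0, 1


def encode_next_slot(slot: int, attrs: int = DEFAULT_ATTRS) -> bytes:
    """Build efivarfs file content: 4 bytes of attributes + 32-bit slot number."""
    if slot not in (SLOT_A, SLOT_B):
        raise ValueError(f"slot must be 0 (A) or 1 (B), got {slot}")
    return struct.pack("<II", attrs, slot)


def decode_next_slot(raw: bytes) -> int:
    """Return the slot number stored in an 8-byte efivarfs payload."""
    _attrs, slot = struct.unpack("<II", raw[:8])
    return slot


if __name__ == "__main__":
    payload = encode_next_slot(SLOT_B)
    print(BOOT_CHAIN_FW_NEXT)
    print(payload.hex())
```

Actually writing the payload to the variable file requires root privileges on the target and is deliberately left out of the sketch.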

By default, non-standard EFI variables exposed in efivarfs have the immutability bit set. This is done as a safety precaution to protect devices from faulty firmware, which in certain cases could cause the device to fail to boot if a certain firmware-specific EFI variable is accidentally removed, effectively soft-bricking the device. The above UEFI variables are considered non-standard, and as such, any modifications to them must be guarded by a corresponding removal and re-addition of the immutable bit.
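As a rough illustration of guarding a modification this way, the sketch below clears the immutable flag on a file for the duration of a with block and restores the original flags afterwards. It uses the generic Linux inode-flag ioctls, which is also how chattr manipulates the flag. The request numbers are the values from linux/fs.h as they appear on 64-bit Linux, and the helper names are hypothetical; this is not the code used by nvbootctrl or RDFM.

```python
import array
import fcntl
import os
import struct
from contextlib import contextmanager

# ioctl request numbers from <linux/fs.h>, as encoded on 64-bit Linux
# (assumption: other ABIs encode a different size field).
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_IMMUTABLE_FL = 0x00000010


def clear_immutable(flags: int) -> int:
    """Return the flag word with the immutable bit removed."""
    return flags & ~FS_IMMUTABLE_FL


def set_immutable(flags: int) -> int:
    """Return the flag word with the immutable bit set."""
    return flags | FS_IMMUTABLE_FL


def _get_flags(fd: int) -> int:
    buf = array.array("i", [0])
    fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf, True)  # kernel fills buf in place
    return buf[0]


def _set_flags(fd: int, flags: int) -> None:
    fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("i", flags))


@contextmanager
def temporarily_mutable(path: str):
    """Drop the immutable flag on `path`, then restore the original flags."""
    fd = os.open(path, os.O_RDONLY)
    try:
        original = _get_flags(fd)
        _set_flags(fd, clear_immutable(original))
        try:
            yield
        finally:
            _set_flags(fd, original)  # re-adds the bit if it was set before
    finally:
        os.close(fd)
```

On a target device this would wrap the write to the variable file, for example `with temporarily_mutable(path): ...`, and it needs root privileges. Restoring the flags in a finally clause means the variable is protected again even if the write fails.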

By taking advantage of the amendability of these variables, it is possible to manipulate the boot chain of the system and reboot into a chosen slot from within the Linux userspace. When paired with access to the root block device, it is possible to create an A/B update mechanism on top of the Jetson Linux BSP that allows for safe and atomic updates.

An alternative method for OTA updates - Remote Device Fleet Manager

To provide a robust and reliable update method across a fleet of devices, Antmicro has been developing an open source update manager for OTA updates called the Remote Device Fleet Manager (RDFM), which integrates the NVIDIA A/B update mechanism used in Jetson-based devices.

As a generic tool that runs on a wide range of systems, RDFM can be quickly and easily extended to new platforms, such as NVIDIA-based devices with their modern bootloaders, with just a few additional steps. In RDFM, we developed and included tegra-fw-tools for the purpose of switching system slots on NVIDIA devices, which is tied directly to RDFM’s generic slot switching mechanism, so it did not require any changes within RDFM itself. Instead, RDFM installs the update through Linux, and after reboot, the system immediately boots into the new update.

The RDFM update process is enhanced by the ability to use streamed updates, in comparison to the NVIDIA OTA update process which requires that the update package is stored entirely in the filesystem. As newer system OTA updates can exceed the amount of storage space available on a device, requiring the update package to be stored locally effectively blocks a device from updating until additional storage can be added. With the use of RDFM, package data can be streamed directly to the target partition, without requiring it to be stored locally first.
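The idea of streaming can be illustrated with a small Python sketch that copies an incoming byte stream to a target in fixed-size chunks, so that only one chunk is held in memory at a time and nothing is staged on the filesystem. This is not RDFM's actual code: the function name, the chunk size and the use of SHA-256 for integrity checking are assumptions, and in-memory buffers stand in for the network response and the inactive partition's block device.

```python
import hashlib
import io
from typing import BinaryIO, Tuple

CHUNK_SIZE = 1024 * 1024  # 1 MiB, an arbitrary choice for this sketch


def stream_to_target(
    source: BinaryIO, target: BinaryIO, chunk_size: int = CHUNK_SIZE
) -> Tuple[int, str]:
    """Copy source to target chunk by chunk.

    Returns the number of bytes written and the SHA-256 hex digest of the
    data, computed on the fly so the image never has to be re-read.
    """
    digest = hashlib.sha256()
    written = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # end of stream
            break
        target.write(chunk)
        digest.update(chunk)
        written += len(chunk)
    target.flush()
    return written, digest.hexdigest()


if __name__ == "__main__":
    image = io.BytesIO(b"\x5a" * (3 * 1024 + 17))  # stand-in for a download
    partition = io.BytesIO()                        # stand-in for the inactive slot
    size, checksum = stream_to_target(image, partition, chunk_size=1024)
    print(size, checksum)
```

Because the digest is accumulated while writing, the result can be compared against a checksum shipped with the update before the slot switch is requested.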

The RDFM update process also writes to secondary system partitions directly, which means that the device can keep running during the update process, with a reboot only required to switch to the new version of the downloaded package. RDFM is able to do this by manipulating the UEFI variables to switch to the new system slot, similar in operation to nvbootctrl. The other significant improvement that RDFM brings is being able to remotely manage OTA updates using an RDFM server, which provides for a seamless link between the server and the devices being updated.

Ensuring system stability in difficult environmental conditions

Long-range mobile communications such as LTE and WiMAX are not always reliable, particularly during inclement weather or internet outages. As a result, if communication is lost during the update process, the device must seamlessly ‘fall back’ to a previously known working configuration, including restoring communications.

When combining our OTA update process with one of our open hardware devices, such as the Jetson Orin Baseboard, it is possible to use different forms of remote updates to ensure system stability while still updating the system as and when required. For example, if OTA updates over LTE fail, then as a fallback, a WiFi connection can be set up as an alternative update mechanism without requiring physical access to the device.

An image showing RDFM in the cloud connecting to a NVIDIA-based device with two system slots, the inactive one is being updated.

By utilizing an A/B update method, RDFM keeps device updates safe and stable even where a reliable power supply or internet connection may not be available. It uses two slots for system software: one is allocated to the currently running software (“active slot”) and the other holds the previous version of the system or receives updates (“inactive slot”). When the update process is being carried out, we write the new software to the inactive slot, and when this is complete, we reboot the system and switch the slots.

Once the device successfully reboots, the change of system slot is made permanent and subsequent reboots will always boot the newest version. If the new software fails to boot, however, the slot change is reverted and the device returns to the previously used slot. By using this method, even if the update fails, the device is only inaccessible for a short period of time before it automatically recovers. As an example, NVIDIA BSPs automatically restore the previous system after three failed attempts at booting.
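The commit-or-revert behaviour described above can be modelled with a small state machine. The Python sketch below is a toy simulation, not the logic implemented in NVIDIA's bootloaders or in RDFM: the class, the function names and the exact bookkeeping are assumptions, and only the limit of three failed attempts comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BOOT_ATTEMPTS = 3  # three failed attempts, as mentioned above


@dataclass
class SlotState:
    active: str = "A"              # slot committed as the permanent boot target
    pending: Optional[str] = None  # slot being tried after an update, if any
    failed_attempts: int = 0


def other_slot(slot: str) -> str:
    return "B" if slot == "A" else "A"


def request_switch(state: SlotState) -> None:
    """Called after the update has been written to the inactive slot."""
    state.pending = other_slot(state.active)
    state.failed_attempts = 0


def next_boot_slot(state: SlotState) -> str:
    """Slot the device will try on the next boot."""
    return state.pending or state.active


def record_boot_result(state: SlotState, booted_ok: bool) -> None:
    """Commit the pending slot on success, revert after too many failures."""
    if state.pending is None:
        return
    if booted_ok:
        state.active, state.pending = state.pending, None
        state.failed_attempts = 0
        return
    state.failed_attempts += 1
    if state.failed_attempts >= MAX_BOOT_ATTEMPTS:
        state.pending = None  # give up and fall back to the old slot
        state.failed_attempts = 0


if __name__ == "__main__":
    state = SlotState()
    request_switch(state)
    for _ in range(MAX_BOOT_ATTEMPTS):
        record_boot_result(state, booted_ok=False)
    print(next_boot_slot(state))
```

In this model a broken update costs at most three failed boots before the device is back on the slot that was running before the update.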

Robust and reliable OTA updates for fleet management

With this update, we are able to support RDFM-based OTA updates on the Jetson Orin, as well as on other Tegra-based platforms. RDFM, as an open source update framework and fleet manager for embedded devices, can be scaled across different operating systems such as Linux, Android and Zephyr RTOS. You can also check out an example Yocto BSP for the Jetson Orin Baseboard with RDFM support on the Antmicro GitHub.

With the assistance of our collection of open source tools that support the newest NVIDIA BSPs, Antmicro can help you create and maintain an OTA update-based fleet management environment that integrates seamlessly with your existing use cases and solutions. If you want to find out how we can help, get in touch at contact@antmicro.com.
