Implementing Over-The-Air (OTA) updates for embedded Linux and Android systems
Published:
Topics: Edge AI, Open OS
At Antmicro, we often help customers bring their cutting-edge technology ideas to life, which more often than not requires extensive R&D. We work with a range of Linux- and Android-capable embedded platforms, from NVIDIA’s Jetson/Orin through Qualcomm’s Snapdragon, NXP’s i.MX and Layerscape, emerging RISC-V platforms and FPGA SoCs from AMD, Intel, Microsemi, Lattice and QuickLogic, as well as a plethora of MCUs running Zephyr and other RTOSes.
This broad perspective has us look for general solutions to common problems and put a strong emphasis on portability and reusability, which makes open source an obvious choice. Many projects start with designing custom hardware boards based on a wide range of our open source baseboards for off-the-shelf modules, FPGA boards and accessories. On top of those, we build a wide array of software components, starting of course with custom Board Support Packages (BSPs). However, the software is never really done: there are always new features to add, improvements to make and, of course, security and bug fixes to apply. This means we need a reliable and tested way to update the software, regardless of the number or location of the devices, and that is where our Over-The-Air (OTA) update system expertise comes into play.
Managing updates at scale
While developers often tend to use whatever comes pre-installed on their embedded platform during development and prototyping (a desktop distribution such as Debian or Ubuntu, or a specific Android image with some pre-installed applications), this approach can quickly prove insufficient when they need to introduce another person to the project, or reproduce the setup on another hardware unit. That’s why we encourage reproducible builds and full OS update packages early on in our projects.
Initially, the approach to updates is not the most important thing, as the developer has easy access to the device and can reflash it anytime using, for example, a USB serial interface available directly from the development kit lying on their desk. However, when starting to work with more devices, perhaps spread across a few locations, this quickly becomes a critical issue. With 5 or 10 devices we can still get this done pretty fast, e.g. using a dedicated USB stick-based procedure (which is often a useful fallback for service personnel that we sometimes implement for customers) or by connecting directly from a development PC. With hundreds or thousands of devices, however, this is simply not feasible, and that’s why we prefer to architect a scalable methodology from the very beginning.
Over-The-Air updates
OTA updates enable updating devices without physical access, and can happen over different mediums, wired or wireless. Nowadays, a well-developed LTE network means that devices can be accessed and updated even in completely remote locations. While traditionally companies deploying devices in the field would rely on an extensive network of service personnel to keep their devices running, with the fast iteration and flexibility required by modern applications in retail, industry, automotive and other areas, OTA systems are the only way to go. While they take some expertise and infrastructure to implement properly, OTA updates have many other advantages: they can be performed at any time, scheduled for specific dates, or rolled out and monitored gradually, allowing e.g. A/B testing of various configurations.
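As an illustration of a gradual rollout, the sketch below assigns each device to a stable bucket by hashing its identifier, so that raising the rollout percentage only ever adds devices to the updated group. This is a minimal sketch under our own assumptions, not a description of any particular OTA server: the function names, the device ID format and the release label are all hypothetical.

```python
import hashlib


def rollout_bucket(device_id: str, release: str) -> int:
    """Map a device to a stable bucket in the range 0..99 for a given release."""
    # Hashing release and device ID together gives each release its own ordering.
    digest = hashlib.sha256(f"{release}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_update(device_id: str, release: str, rollout_percent: int) -> bool:
    """Return True if the device falls inside the current rollout percentage."""
    return rollout_bucket(device_id, release) < rollout_percent


if __name__ == "__main__":
    # Hypothetical fleet of 1000 devices and a hypothetical release label.
    fleet = [f"device-{n:04d}" for n in range(1000)]
    for percent in (5, 25, 100):
        selected = sum(should_update(d, "release-2", percent) for d in fleet)
        print(f"{percent:3d}% rollout -> {selected} of {len(fleet)} devices")
```

Because the bucket depends only on the device ID and the release, a device selected at 5% stays selected at 25%, which keeps monitoring of the early group consistent as the rollout widens.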
Based on our work with devices spanning a wide array of industries, including robotics, medical, manufacturing, automotive, retail and other fields, we can definitely acknowledge that different devices need different approaches - there is no single approach suitable for all use cases. To decide which one best meets our needs, we should consider several points including:
- resources (disk space) available and the size of the target system image,
- the medium over which the update is performed and, especially in the case of LTE-based updates, the reliability and coverage of that medium,
- existing infrastructure and user preferences - for example, if the customer already has some kind of update server/procedure in place, we may want to provide a familiar OTA update system on the newly designed device, as it could be easier for users / personnel to handle the updates of all devices in the same way,
- the preferred behavior in case of a failure - whether we can afford repeated update attempts, or whether we would rather not re-update and simply stay on the older but stable system image,
- the time window available for performing the update and recovering from a possible failure.
Choosing the right approach
On a high level, there are three main approaches that we can choose for the case we work on: single image, A/B images and main/aux images.
Single image
The most basic approach uses a simple partition layout: a bootloader, kernel, rootfs and maybe a separate data partition. The OTA system can download and then overwrite one of those partitions (or all of them). In this scenario, if the freshly updated system fails, there is no automatic way to get back to the previous system image, and servicing the device in its physical location would most likely be required. Ideally, this should not be a commonly occurring scenario, as the update image can be extensively tested and only then pushed to remote devices. At scale, however, and with the varying conditions occurring in real-world applications, not everything can be predicted. Sometimes a simple change can introduce unexpected consequences and render the update defective.
Moreover, sometimes the conditions the device operates in can affect the update procedure itself. For example, a power loss during the update (e.g. while reflashing the rootfs partition) could completely brick the device. Similarly, temporary unavailability of network coverage could also cause unexpected consequences: the update could work perfectly in a city environment but fail in more remote locations. That’s why we almost always recommend putting some kind of a recovery mode or redundancy mechanism in place.
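One inexpensive safeguard against an interrupted or corrupted download is to verify the image before anything is written to a partition. The sketch below checks the file size and SHA-256 digest against values that we assume are delivered alongside the update (for instance in a manifest); the function names and the manifest idea are our own illustration. It does not protect against power loss during flashing itself, which is what the redundancy mechanisms described next are for.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large images do not need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_safe_to_flash(image: Path, expected_size: int, expected_sha256: str) -> bool:
    """Return True only if the downloaded image is complete and unmodified."""
    if not image.is_file():
        return False
    if image.stat().st_size != expected_size:
        return False  # a truncated download is rejected without hashing
    return sha256_of(image) == expected_sha256.lower()


if __name__ == "__main__":
    # Stand-in for a downloaded image; the contents are arbitrary test bytes.
    with tempfile.TemporaryDirectory() as tmp:
        image = Path(tmp) / "update.img"
        payload = b"rootfs" * 1000
        image.write_bytes(payload)
        digest = hashlib.sha256(payload).hexdigest()
        print("complete image:", is_safe_to_flash(image, len(payload), digest))
        image.write_bytes(payload[:-1])  # simulate a download cut short
        print("truncated image:", is_safe_to_flash(image, len(payload), digest))
```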
A/B images
One of the safest ways to perform OTA updates is to have two system images out of which one stores the running copy of the system and the other is there for redundancy purposes. By system image we mean a set of partitions that store necessary system data and should take part in the update, like bootloader, kernel, rootfs etc. In different devices this may be organized differently, but essentially A and B images include multiple partitions that are a part of set A or B.
During an update, the inactive image is updated and the active one stays untouched. When the update succeeds, the updated image becomes the one in use, and the original one becomes inactive, keeping a copy of the old system image. Keeping the previous system image gives us the possibility to roll back to the system from before the update in case the updated system crashes. In this approach we will always have a fully featured system working, either in the updated or the previously used version. This approach is the default in Android systems and also one that our customers most often choose for their Linux and other systems. It has one major drawback though: in order to have two copies of everything, we need twice as much storage as in the single image scenario.
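The flow described above can be modeled in a few lines. The sketch below is a simplified simulation rather than a real updater: the class name, the version strings and the `health_check` callback are hypothetical, and on an actual device "installing" means writing a whole set of partitions, not setting a dictionary entry.

```python
from dataclasses import dataclass, field


@dataclass
class ABDevice:
    """Toy model of a device with two system image slots, A and B."""
    active: str = "A"
    versions: dict = field(default_factory=lambda: {"A": "1.0", "B": None})

    @property
    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    @property
    def running(self):
        return self.versions[self.active]

    def install(self, version: str) -> str:
        """Write the new version to the inactive slot; the active slot is untouched."""
        slot = self.inactive
        self.versions[slot] = version
        return slot

    def switch(self) -> None:
        """Make the other slot active, provided it holds an image."""
        if self.versions[self.inactive] is None:
            raise RuntimeError("inactive slot holds no image")
        self.active = self.inactive

    def update(self, version: str, health_check) -> bool:
        """Install, switch, and roll back if the new system is unhealthy."""
        self.install(version)
        self.switch()
        if health_check(self):
            return True
        self.switch()  # roll back: the previous slot still holds the old image
        return False


if __name__ == "__main__":
    device = ABDevice()
    ok = device.update("2.0", health_check=lambda d: d.running == "2.0")
    print(ok, device.active, device.running, device.versions)
```

Note that after a successful update the old image is still present in the now inactive slot, which is exactly what makes the rollback possible and what costs the extra storage.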
Main/aux images
Robust updates require a second system image of some kind, but this could instead be a recovery system image only. Such a recovery image might not have all the user-facing functionalities, but is able to perform another system update in case of failure. The auxiliary image approach is based on an asymmetrical layout where we have the main image holding the fully featured system, and another one that is only focused on allowing remote updates. This approach saves storage space on the device, as the auxiliary update system image can be very small, since it only has to handle basic maintenance operations. At the same time this still gives us the opportunity to re-update the device in case of a failed update.
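To make the storage trade-off between the three approaches concrete, the sketch below compares the space taken by system images in each layout. The sizes are purely illustrative assumptions (a 7 GiB main image and a 256 MiB auxiliary image), and the calculation deliberately ignores data partitions and any space needed to stage a downloaded update.

```python
def storage_needed(image_mib: int, layout: str, aux_mib: int = 0) -> int:
    """Return the storage (in MiB) occupied by system images for a given layout."""
    if layout == "single":
        return image_mib               # one copy of the system
    if layout == "ab":
        return 2 * image_mib           # two full, symmetrical copies
    if layout == "main_aux":
        return image_mib + aux_mib     # full system plus a small recovery image
    raise ValueError(f"unknown layout: {layout}")


if __name__ == "__main__":
    main_image = 7 * 1024  # illustrative 7 GiB system image, in MiB
    aux_image = 256        # illustrative recovery image size, in MiB
    for layout in ("single", "ab", "main_aux"):
        print(f"{layout:9s} {storage_needed(main_image, layout, aux_image):6d} MiB")
```

With these example numbers, the A/B layout needs 14336 MiB for system images, while main/aux needs 7424 MiB, only 256 MiB more than the single image layout.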
OTA updates in practice
In our projects we work with Linux, Android and RTOS systems such as Zephyr, and use OTA updates in a vast majority of use cases. NVIDIA’s Jetson family of modules is a common choice in the projects we work on and can serve as an example. The default BSP (Board Support Package) based on Ubuntu for these platforms is provided by NVIDIA and helps customers get running quickly in a familiar environment with a lot of functionalities working out of the box. However, in most cases, as the development cycle progresses, our customers inevitably have to transition to a system that is better tailored to their needs, which we typically build using e.g. Yocto, without unnecessary features that consume resources and increase the attack surface. For our Jetson Nano / Xavier NX Baseboard, we provide a good Yocto-based starting point for the tailored approach we recommend. By default, the Jetson SoMs feature a complex partition layout which needs to be changed to handle a robust OTA updating flow. For example, to implement an A/B image approach, an abbreviated partition layout may look as follows:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
mmcblk0 179:0 0 29.1G 0 disk
|-mmcblk0p1 179:1 0 7G 0 part
.
.
.
|-mmcblk0p26 259:18 0 512K 0 part
|-mmcblk0p27 259:19 0 512K 0 part
.
.
.
|-mmcblk0p29 259:21 0 7G 0 part /
|-mmcblk0p30 259:22 0 3G 0 part /var
`-mmcblk0p31 259:23 0 11.7G 0 part /data
mmcblk0boot0 179:8 0 4M 1 disk
mmcblk0boot1 179:16 0 4M 1 disk
mmcblk0rpmb 179:24 0 4M 0 disk
In this layout, separate partitions for rootfs, bootloader, Linux kernel, devicetree etc. are doubled. For example, the mmcblk0p29 partition is a duplicate of the mmcblk0p1 partition that stores the root filesystem (called “APP” in NVIDIA Jetson platforms).
In this case, the OTA update system needs a partition marker that is used to pick the right partition to boot from and leave the other one inactive. For this purpose, the bootloader has to be modified, as the root filesystem partition is chosen when the bootloader runs. On the running system, there must be an application that handles the download of the update - it can be designed to be run on demand or automatically. Once the update has been written, the application should give us the opportunity to approve or disapprove it, so that the active partition marker changes to point to the other partition only when the update is approved.
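One possible shape for the marker logic is sketched below as a Python simulation. It is not the Jetson bootloader's actual mechanism: the `SlotMarker` fields, the trial boot counter and all function names are our own assumptions, and on a real device this decision would live in the bootloader with the marker kept in persistent storage. The idea is that a freshly written slot is only booted on trial until the update application approves it.

```python
from dataclasses import dataclass
from typing import Optional


def other(slot: str) -> str:
    return "B" if slot == "A" else "A"


@dataclass
class SlotMarker:
    """Hypothetical persistent marker shared by bootloader and update application."""
    active: str = "A"              # slot booted by default
    pending: Optional[str] = None  # freshly updated slot awaiting approval
    tries_left: int = 0            # remaining trial boots of the pending slot


def stage_update(marker: SlotMarker, max_tries: int = 3) -> None:
    """Called by the update application after writing the inactive slot."""
    marker.pending = other(marker.active)
    marker.tries_left = max_tries


def select_boot_slot(marker: SlotMarker) -> str:
    """Bootloader-side choice, evaluated on every boot."""
    if marker.pending is None:
        return marker.active
    if marker.tries_left > 0:
        marker.tries_left -= 1
        return marker.pending
    # Trial boots exhausted without approval: fall back to the known-good slot.
    marker.pending = None
    return marker.active


def approve(marker: SlotMarker) -> None:
    """Make the pending slot the new default."""
    if marker.pending is not None:
        marker.active = marker.pending
        marker.pending = None
        marker.tries_left = 0


def disapprove(marker: SlotMarker) -> None:
    """Discard the pending slot and keep booting the current one."""
    marker.pending = None
    marker.tries_left = 0


if __name__ == "__main__":
    marker = SlotMarker()
    stage_update(marker, max_tries=2)
    print([select_boot_slot(marker) for _ in range(3)], marker)
```

In this sketch, a system that never reaches the point of calling `approve` (for example because it crashes on boot) is abandoned automatically once the trial boots run out.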
Simplifying the update system
OTA updates are a reliable and efficient way to maintain the product throughout the development lifecycle. Along with customized BSPs, we can provide the most suitable OTA update flow for our customers, to help them prevent common problems encountered while managing remote devices, and save both time and resources. Our aim is to prepare an update system which best fits the application, so that updates can be securely performed using minimum bandwidth and without hassle or fear of unpredicted issues.
A robust OTA approach is invaluable for the product’s long-term stability and performance, as it allows fixes and new features to reach deployed devices quickly. Through its engineering services, Antmicro can help in choosing and implementing the most applicable update flow, together with the software, AI and fleet management / cloud processing and CI solutions you need to make your product successful.