Deployment pipelines and enhanced modularity in Kenning


Topics: Edge AI, Open source tools

The field-deployed device systems Antmicro helps its customers build increasingly often use neural networks for tasks such as understanding voice commands, detecting, tracking and telling apart objects in video, or recognizing and processing text.

Until recently, most of the edge “deep learning” applications relied on the trick of delegating the actual inference to the cloud. But as embedded platforms gain more memory and compute power, a significant part of the processing can be moved closer to the data. This approach - if done right - offers better reliability, privacy, security and scalability than purely cloud-based alternatives.

One difficulty of working with deep learning on the edge is the sparse compatibility between deep learning models, training frameworks, deployment frameworks such as TensorFlow Lite or Apache TVM, and edge platforms or accelerators - a change in one element of the solution may require a modification of the whole production flow.

To tackle this diverse framework and hardware ecosystem, we developed Kenning, a highly modular framework that provides a unified API for dataset management, model training, optimization, compilation and runtime on target platforms, allowing developers to create deployment pipelines and applications using Kenning Runtime modules.

Kenning automates the process of evaluating the model to keep track of its quality and speed on target hardware. The evaluation is unified across the models, frameworks and runtimes used, providing a reliable means of ensuring the model’s quality.

Kenning renders plots and summary reports based on the collected measurements, which can be easily embedded in existing CI flows and offer extensive insight into the real performance of the deep learning application. Antmicro is using Kenning both in commercial production-grade edge AI systems and in research - measuring and extending the landscape of edge AI capabilities in projects such as VEDLIoT.

Adding chained optimizers into Kenning

A crucial recent update to Kenning’s original structure enables users to build a chain of self-contained optimizers. Every optimizer takes a model, its format and some metadata as input, compiles it according to the specified parameters and passes it to the next block in the chain. To simplify the process for the user, Kenning negotiates the model format between the optimizers and chooses the most suitable one. During this process the model is converted to the chosen format and passed to the subsequent optimizer to be further optimized.

Frameworks like TF Lite and TVM can mutually benefit from this feature as they can be used together to significantly reduce memory and computational demand of the model. For example, the TensorFlow Model Optimization Toolkit can be used to prune, cluster and quantize the initial model which can then be compiled for a given target using TVM.

Diagram depicting the optimization process

To enable this approach and simplify working with Kenning, Antmicro added a more modular, manageable and traceable way of passing flow definitions to Kenning.

JSON-based definition of the Kenning flow

The original implementation of Kenning was usable either as a Python module, or via scenarios, such as inference_performance (native framework evaluation) or inference_tester (model compilation and evaluation on the target hardware). Those scenarios required lots of arguments to configure the compilation process for the target. This soon became troublesome to work with, which prompted a new configuration mechanism - one that could easily convey configuration parameters and was as flexible and modular as Kenning itself.

To address that, support for JSON-based Kenning object definitions was introduced. Every core-based class comes with a structure describing its parameters, internally called arguments_structure:

arguments_structure = {
    'modelframework': {
        'argparse_name': '--model-framework',
        'description': 'The input type of the model, framework-wise',
        'default': 'onnx',
        'enum': ['onnx', 'keras']
    },
    'inputdims': {
        'argparse_name': '--input-dims',
        'description': 'Dimensions of the input layer',
        'required': True,
        'is_list': True
    }
}

In this example, our class has two arguments. The first, modelframework, has a default value of onnx and a range of possible values consisting of onnx and keras. The second, inputdims, is a required list that has to be specified by the user. If any of these constraints are not satisfied, Kenning will raise an error and inform the user what went wrong.

This JSON schema supports both reading parameters for the previously implemented scenarios (via the argument parser) and extracting parameters from a JSON file with the deployment pipeline configuration.
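For instance, assuming that the JSON parameter names follow the argparse names (with the leading dashes stripped and the remaining dashes turned into underscores), the class described above could be configured in a pipeline definition with a block like this (the values are purely illustrative):

"parameters":
{
    "model_framework": "keras",
    "input_dims": [1, 224, 224, 3]
}

Omitting the required input_dims entry here would make Kenning report an error, as described above.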

Now, with the JSON format as our base, we can easily define and adjust our optimization flow and use it with the inference tester script, which uses the structure described in a previous post. There are two required modules: Dataset and ModelWrapper, as well as an optional list of Optimizer modules and an optional Runtime module.

The Dataset module provides data for training, evaluating and quantizing the model, as well as model-agnostic methods for loading and preprocessing inputs and expected outputs, or providing class names.

The ModelWrapper module provides methods for loading the model in a native framework, as well as methods for model-specific preprocessing of the inputs and postprocessing of the outputs.

Optimizer modules provided in the list are applied sequentially - they take the model from the previous block (ModelWrapper or other Optimizer) as an input, apply various optimizations (quantization, network pruning, or compilation to a target device) and return the optimized model in the form of a file.

The Runtime module takes the optimized model and runs it on target hardware (either locally or remotely). It implements the inference process using the underlying runtime framework.
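Putting these blocks together, a deployment pipeline definition is a single JSON object with up to four top-level entries - model_wrapper and dataset are required, while optimizers and runtime are optional. All of the concrete configurations shown below follow this general shape (the type values stand for module paths of the chosen classes, and the optimizers are applied in the order they appear):

{
    "model_wrapper": { "type": "...", "parameters": {} },
    "dataset": { "type": "...", "parameters": {} },
    "optimizers":
    [
        { "type": "...", "parameters": {} },
        { "type": "...", "parameters": {} }
    ],
    "runtime": { "type": "...", "parameters": {} }
}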

Example use case of classifying pets

Let’s consider a simple scenario, where we want to optimize the inference time and memory usage of a model executed on the CPU. In this example we are going to use the PetDataset Dataset and the TensorFlowPetDatasetMobileNetV2 ModelWrapper - a sample MobileNetV2 model used for classifying dog and cat breeds.

First of all, we want to check how the trained model performs using the native framework on the CPU.

For this, let’s define the following configuration:

{
    "model_wrapper":
    {
        "type": "kenning.modelwrappers.classification.tensorflow_pet_dataset.TensorFlowPetDatasetMobileNetV2",
        "parameters":
        {
            "model_path": "./kenning/resources/models/classification/tensorflow_pet_dataset_mobilenetv2.h5"
        }
    },
    "dataset":
    {
        "type": "kenning.datasets.pet_dataset.PetDataset",
        "parameters":
        {
            "dataset_root": "./build/pet-dataset",
            "download_dataset": true
        }
    }
}

This JSON provides a configuration for running the model natively and evaluating it against the defined Dataset.

For every class in the above JSON file two keys are required: type, which is a module path of our class, and parameters, which is used to provide arguments for the instances of our classes. The parameters are derived from the previously described arguments_structure.

In model_wrapper we specify the model used for evaluation - here it is MobileNetV2 trained on Pet Dataset. The model_path is a path to the saved model. The TensorFlowPetDatasetMobileNetV2 model wrapper provides methods for loading the model, preprocessing the inputs, postprocessing the outputs and running inference using the native framework (TensorFlow in this case).

The dataset provided for evaluation is Pet Dataset - here we specify that we want to download the dataset (download_dataset) to the ./build/pet-dataset directory (dataset_root). The PetDataset class can download the dataset (if necessary), load it, read the inputs and outputs from files and process them, and implement the evaluation methods for the model.

With the above config saved in a file named native.json, run the json_inference_tester scenario:

python -m kenning.scenarios.json_inference_tester blog-configs/native.json build/native-out.json --verbosity INFO

This module runs inference based on the given configuration, evaluates the model and stores the quality and performance metrics in JSON format, saved to the build/native-out.json file.

To visualize the evaluation and benchmark results, run the render_report module:

python -m kenning.scenarios.render_report build/native-out.json 'native' build/benchmarks/native.rst --root-dir build/benchmarks --img-dir build/benchmarks/imgs --verbosity INFO --report-types performance classification

This module takes the output JSON file generated by the json_inference_tester module and creates a report titled native, which is saved to the build/benchmarks/native.rst file. As specified with the --report-types flag, we create sections in the report for the performance and classification metrics (there is also, for example, a detection report type for object detection tasks).

In the build/benchmarks/imgs directory there will be images with the native_* prefix, visualizing the confusion matrix, CPU and memory usage, as well as inference time.

The build/benchmarks/native.rst file is a reStructuredText document containing the full report for the model - apart from linking to the generated visualizations, it provides aggregated information about CPU and memory usage, as well as classification quality metrics such as accuracy, sensitivity, precision and G-Mean.
Such a file can be included in larger, Sphinx-based documentation, which allows for easy, automated report generation, e.g. using CI, as can be seen in the Kenning documentation.

While native frameworks are great for model design, training on GPUs and distributing training across many devices, e.g. in a cloud environment, for pure inference in production there is quite a large variety of inference-focused frameworks that aim to get the most out of the available edge hardware. Kenning helps cherry-pick the best parts of all of them and use them in a production setting.

In the rest of this note, a number of different configurations and optimizations will be presented - to reproduce the results for each alternative, reuse the commands above and simply replace the JSON and RST input/output filenames.

Optimizing the model using TensorFlow Lite

One of the most popular edge ML frameworks is TensorFlow Lite - a lightweight library for inferring networks on the edge. It has a small binary size, which can be reduced even further (by disabling unused operators), and a highly optimized input model format, called FlatBuffers.

Before the TensorFlow Lite Interpreter (runtime for the TensorFlow Lite library) can be used, the model first needs to be optimized and compiled to the .tflite format - so let’s add a TensorFlow Lite Optimizer that will convert our MobileNetV2 model to a FlatBuffer format, as well as TensorFlow Lite Runtime that will execute the model:

{
    "model_wrapper":
    {
        "type": "kenning.modelwrappers.classification.tensorflow_pet_dataset.TensorFlowPetDatasetMobileNetV2",
        "parameters":
        {
            "model_path": "./kenning/resources/models/classification/tensorflow_pet_dataset_mobilenetv2.h5"
        }
    },
    "dataset":
    {
        "type": "kenning.datasets.pet_dataset.PetDataset",
        "parameters":
        {
            "dataset_root": "./build/pet-dataset"
        }
    },
    "optimizers":
    [
        {
            "type": "kenning.compilers.tflite.TFLiteCompiler",
            "parameters":
            {
                "target": "default",
                "compiled_model_path": "./build/fp32.tflite",
                "inference_input_type": "float32",
                "inference_output_type": "float32"
            }
        }
    ],
    "runtime":
    {
        "type": "kenning.runtimes.tflite.TFLiteRuntime",
        "parameters":
        {
            "save_model_path": "./build/fp32.tflite"
        }
    }
}

The only change we made to the previous blocks is the removal of the download_dataset entry in the Dataset object - we have already downloaded the Pet Dataset.

The first new addition is the optimizers list - it allows us to add one or more objects inheriting from the kenning.core.optimizer.Optimizer class. Optimizers read the model from the input file, apply various optimizations to it, and then save the optimized model to a new file.

In our current scenario we will use the TFLiteCompiler class - it reads the model in the Keras-specific format, optimizes it and saves it to the ./build/fp32.tflite file. The following parameters of this particular Optimizer (each Optimizer usually has a different set of parameters) are especially noteworthy:

  • target - indicates what the desired target device (or model type) is; the default is the regular CPU. Another example here could be edgetpu, which compiles models for the Google Coral platform (see the sketch after this list).
  • compiled_model_path - indicates where the model should be saved.
  • inference_input_type and inference_output_type - indicate what the input and output type of the model should be. Usually, all trained models use FP32 weights (32-bit floating point) and activations - using float32 here keeps the weights unchanged.
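For example, a hypothetical variant of this block targeting a Google Coral board would only swap the parameters below - as discussed later in this post, the EdgeTPU requires quantized (INT8) models, and the output path used here is purely illustrative:

{
    "type": "kenning.compilers.tflite.TFLiteCompiler",
    "parameters":
    {
        "target": "edgetpu",
        "compiled_model_path": "./build/edgetpu.tflite",
        "inference_input_type": "int8",
        "inference_output_type": "int8"
    }
}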

The second thing that has been added to the previous flow is the runtime block - it provides a class inheriting from the kenning.core.runtime.Runtime class that is able to load the final model and run inference on target hardware. Usually, each Optimizer has a corresponding Runtime that is able to run its results.

While it depends on the platform used, there should be a significant improvement in both inference time (the model runs around 10-15 times faster than the native one) and memory usage (the output model is around 2 times smaller). What’s worth noting is that we get this improvement with no harm to the quality of the model - the outputs stay the same.

Further improvements of the model with TensorFlow Lite

The question is - can we go further from here? We can at least significantly reduce memory usage by quantizing the model - a process where all weights and activations in the model are calibrated to work with INT8 precision instead of FP32. While this may severely harm the quality of the predictions, with a proper calibration process the quality reduction can be negligible.

Some platforms do not benefit from quantization at all in terms of speed, or even work slower (this can also happen if the runtime library does not utilize some of the platform’s features). Other platforms can significantly benefit from quantizing the weights - not only can we gain a nice memory reduction, but also a massive speed boost (for example on recent NVIDIA Jetson platforms). Lastly, some platforms actually require quantized weights in order to run inference at all (such as Google Coral with its EdgeTPU, or the Apache VTA accelerator).

TensorFlow Lite can quantize weights when provided with a calibration dataset. Let’s create a quantized version of our model with the following configuration:

{
    "model_wrapper":
    {
        "type": "kenning.modelwrappers.classification.tensorflow_pet_dataset.TensorFlowPetDatasetMobileNetV2",
        "parameters":
        {
            "model_path": "./kenning/resources/models/classification/tensorflow_pet_dataset_mobilenetv2.h5"
        }
    },
    "dataset":
    {
        "type": "kenning.datasets.pet_dataset.PetDataset",
        "parameters":
        {
            "dataset_root": "./build/pet-dataset"
        }
    },
    "optimizers":
    [
        {
            "type": "kenning.compilers.tflite.TFLiteCompiler",
            "parameters":
            {
                "target": "int8",
                "compiled_model_path": "./build/int8.tflite",
                "inference_input_type": "int8",
                "inference_output_type": "int8"
            }
        }
    ],
    "runtime":
    {
        "type": "kenning.runtimes.tflite.TFLiteRuntime",
        "parameters":
        {
            "save_model_path": "./build/int8.tflite"
        }
    }
}

The only changes here in comparison to the previous configuration appear in the TFLiteCompiler configuration - we change target, inference_input_type and inference_output_type to int8. What TFLiteCompiler does in the background is fetch a subset of images from the PetDataset object to calibrate the model, so the entire model calibration process happens automatically.

The resulting model is over 7 times smaller compared to the native model, but the speed boost is not as good as in FP32 inference. Accuracy doesn’t seem to suffer significantly in comparison to the native model in this particular use case.

Is there anything that we could do from this point to improve the speed of the model?

Creating a dedicated implementation of the model with Apache TVM

We can use Apache TVM - a framework that optimizes the model and compiles it to an efficient library that can later be used for inference. Apache TVM takes the model and applies various target-agnostic optimizations - like dead code elimination, operation fusing, data layout reorganization to match the preferable memory layout, or replacement of operations in the network with approximate but significantly faster implementations. Later, Apache TVM performs target-specific optimizations, like replacing operations with optimized calls to the accelerator hardware, parallelizing certain parts of the network, or utilizing vector extensions and other SIMD-like instructions.

Since we already have quite a well-performing INT8 model in the TFLite format, let’s use it. The configuration is as follows:

{
    "model_wrapper":
    {
        "type": "kenning.modelwrappers.classification.tensorflow_pet_dataset.TensorFlowPetDatasetMobileNetV2",
        "parameters":
        {
            "model_path": "./kenning/resources/models/classification/tensorflow_pet_dataset_mobilenetv2.h5"
        }
    },
    "dataset":
    {
        "type": "kenning.datasets.pet_dataset.PetDataset",
        "parameters":
        {
            "dataset_root": "./build/pet-dataset"
        }
    },
    "optimizers":
    [
        {
            "type": "kenning.compilers.tflite.TFLiteCompiler",
            "parameters":
            {
                "target": "int8",
                "compiled_model_path": "./build/int8.tflite",
                "inference_input_type": "int8",
                "inference_output_type": "int8"
            }
        },
        {
            "type": "kenning.compilers.tvm.TVMCompiler",
            "parameters": {
                "target": "llvm",
                "opt_level": 3,
                "compiled_model_path": "./build/compiled_model.tar"
            }
        }
    ],
    "runtime":
    {
        "type": "kenning.runtimes.tvm.TVMRuntime",
        "parameters":
        {
            "save_model_path": "./build/compiled_model.tar"
        }
    }
}

The llvm target in TVMCompiler compiles the model to a shared library that will run the model on CPU. The opt_level is an optimization level - the higher the value, the more aggressive optimization passes are used by the TVMCompiler.

The compilation of the TFLite INT8 model using Apache TVM results in a model that is over 5 times faster than the native implementation, and over 3 times faster compared to the TFLite INT8 model alone. The model size increased slightly compared to the TFLite FlatBuffer file - the result is a shared library that implements the whole model and is simply loaded by the Apache TVM Runtime, which provides the common API for I/O preparation and model execution.

But can we beat the 15x speedup of the FP32 model in TFLite, and still keep the quite nice memory reduction of the INT8 models?

Using additional compiler features

Our current target architecture is x86 CPUs - most of the currently available CPUs support AVX2 (Advanced Vector Extensions 2), which can operate on whole vectors of INT8 values with a single instruction. Let’s consider the following configuration:

{
    "model_wrapper":
    {
        "type": "kenning.modelwrappers.classification.tensorflow_pet_dataset.TensorFlowPetDatasetMobileNetV2",
        "parameters":
        {
            "model_path": "./kenning/resources/models/classification/tensorflow_pet_dataset_mobilenetv2.h5"
        }
    },
    "dataset":
    {
        "type": "kenning.datasets.pet_dataset.PetDataset",
        "parameters":
        {
            "dataset_root": "./build/pet-dataset"
        }
    },
    "optimizers":
    [
        {
            "type": "kenning.compilers.tflite.TFLiteCompiler",
            "parameters":
            {
                "target": "int8",
                "compiled_model_path": "./build/int8.tflite",
                "inference_input_type": "int8",
                "inference_output_type": "int8"
            }
        },
        {
            "type": "kenning.compilers.tvm.TVMCompiler",
            "parameters": {
                "target": "llvm -mcpu=core-avx2",
                "opt_level": 3,
                "conv2d_data_layout": "NCHW",
                "compiled_model_path": "./build/compiled_model.tar"
            }
        }
    ],
    "runtime":
    {
        "type": "kenning.runtimes.tvm.TVMRuntime",
        "parameters":
        {
            "save_model_path": "./build/compiled_model.tar"
        }
    }
}

The only changes here are:

  • target - llvm -mcpu=core-avx2 tells the compiler to use AVX2 instructions when possible.
  • conv2d_data_layout - Keras models represent data in the NHWC format (channels-last - the values of all channels for a given pixel are stored next to each other), which is not necessarily the best layout for CPU execution. To better utilize AVX2 instructions, the layout is changed to NCHW (channels-first - each channel is stored as a separate contiguous plane).

While the results may vary between platforms, the speed boost for this particular scenario can be remarkable (depending on the CPU) - we got a model over 40 times faster than the native one, while reducing its size threefold.

This shows how beneficial the usage of multiple optimization frameworks can be, and how easily it can be implemented with Kenning.

New compilers and optimizers in Kenning

The model improvements described above are only the tip of the iceberg of what can be done with Kenning - you can try other optimization algorithms in addition to quantization.

We are actively expanding the existing set of core-based classes in Kenning to support new frameworks and models.

For model optimization algorithms, we have recently added:

  • kenning.compilers.tensorflow_pruning.TensorFlowPruningOptimizer - this Optimizer performs pruning of the network, i.e. removing connections (and sometimes whole neurons) that do not contribute to predictions (they have close-to-zero weights). The underlying framework used for this task is the TensorFlow Model Optimization Toolkit. Pruning can both drastically reduce the size of the model and increase its speed (if the final runtime supports sparse matrix computations) - see the sketch after this list.
  • kenning.compilers.tensorflow_clustering.TensorFlowClusteringOptimizer - this Optimizer uses the K-means algorithm to divide the weights in each layer into K groups, uses the centroids of each group as the new weight values, and stores indices to those centroids in the kernels instead of full weights. Such an optimization reduces the number of bits necessary to represent a single weight to log_2(K), which brings a significant memory reduction. Again, the internals of this Optimizer are based on the TensorFlow Model Optimization Toolkit.
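As a rough sketch of how such an optimizer fits into a scenario, it slots into the optimizers list just like the compilers shown earlier, for example ahead of the TensorFlow Lite conversion. Assuming the pruning Optimizer shares the common compiled_model_path parameter of the other Optimizer classes, the fragment could look as follows - the pruning-specific parameters (such as the target sparsity) come from the class’s arguments_structure and are omitted here, and the file paths are illustrative:

"optimizers":
[
    {
        "type": "kenning.compilers.tensorflow_pruning.TensorFlowPruningOptimizer",
        "parameters":
        {
            "compiled_model_path": "./build/pruned-model.h5"
        }
    },
    {
        "type": "kenning.compilers.tflite.TFLiteCompiler",
        "parameters":
        {
            "target": "default",
            "compiled_model_path": "./build/pruned-fp32.tflite",
            "inference_input_type": "float32",
            "inference_output_type": "float32"
        }
    }
]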

As for compilers and runtimes, on top of TensorFlow Lite and Apache TVM we have added:

  • IREE Compiler and Runtime (kenning.compilers.iree.IREECompiler and kenning.runtimes.iree.IREERuntime) - IREE is an MLIR-based end-to-end framework that uses an Intermediate Representation to unify the internal structure of models, thanks to which many accelerators are supported. Among various targets, it can compile and run models on diverse CPU architectures like ARM, x86 and RISC-V, as well as on GPUs via backends such as CUDA and Vulkan (see the sketch after this list).
  • ONNX Runtime (kenning.runtimes.onnx.ONNXRuntime) - ONNX Runtime benefits from the fact that the vast majority of existing frameworks are able to export their models to the ONNX format. ONNX Runtime actively adds support for new ONNX opsets, making it possible to run most state-of-the-art models. It also supports running certain operations on accelerators, such as GPUs, when they are available, in order to get the most out of the target platform. Operations that cannot be accelerated fall back to the base CPU implementation, making it a very versatile runtime that supports the vast majority of existing models.
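These are configured through the same optimizers and runtime blocks as the frameworks shown earlier. A minimal sketch using IREE could look like the fragment below - note that this assumes the IREE classes follow the common compiled_model_path/save_model_path convention of the other Optimizer and Runtime classes; the backend selection parameters are omitted, the file names are illustrative, and the exact parameter names come from the classes’ arguments_structure:

"optimizers":
[
    {
        "type": "kenning.compilers.iree.IREECompiler",
        "parameters":
        {
            "compiled_model_path": "./build/compiled-model.vmfb"
        }
    }
],
"runtime":
{
    "type": "kenning.runtimes.iree.IREERuntime",
    "parameters":
    {
        "save_model_path": "./build/compiled-model.vmfb"
    }
}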

New level of Kenning modularity

The summary for all of the above experiments is shown below:

Table with results

This shows how we can use Kenning to optimize models for various platforms, and how combining different frameworks can significantly benefit the process of optimizing and deploying a model.

Thanks to Kenning’s modular and seamless nature, additions or replacements of various models, datasets, optimizations and runtimes are very straightforward. Thanks to the JSON configuration, as well as the JSON-based benchmark results and unified report generation, it is also possible to check all kinds of solutions and get a credible result indicating which combination of optimizations best fits the current model and target device.

Reach out to us at contact@antmicro.com if you want to use Kenning in your next project or need Antmicro’s help in creating deployment pipelines for edge, end-to-end AI solutions, including hardware design, tailored system development and optimized AI-based applications.
