Deployment of Machine Learning Algorithms on Resource-Constrained Hardware Platforms for Prosthetics

Motion intent recognition for controlling prosthetic systems has long relied on machine learning algorithms. Artificial neural networks have shown great promise for solving such nonlinear classification tasks, making them a viable method for this purpose. To bring these advanced methods and algorithms beyond the confines of the laboratory and into the daily lives of prosthetic users, self-contained embedded systems are essential. However, embedded systems face constraints in size, computational power, memory footprint, and power consumption, as they must be non-intrusive and discreetly integrated into commercial prosthetic components. One promising approach to tackle these challenges is network quantization, which allows complying with these limitations without significant loss in accuracy. Here, we compare network quantization performance for self-contained systems using TensorFlow Lite and the recently developed QKeras platform. Due to its internal libraries, the use of TensorFlow Lite led to an 8 times higher flash memory usage than that of the unquantized reference network, which is disadvantageous for self-contained prosthetic systems. In response, we offer open-source code solutions that leverage the QKeras platform, effectively reducing flash memory requirements by 24 times compared to TensorFlow Lite. Additionally, we conducted a comprehensive comparison of state-of-the-art microcontrollers. Our results reveal that the adoption of new architectures offers substantial reductions in inference time and power consumption. These improvements pave the way for real-time decoding of motor intent using more advanced machine learning algorithms in daily life, possibly enabling more reliable and precise control for prosthetic users.


I. INTRODUCTION
Laboratory experiments have shown promising strategies to decode human motor intent to control bionic limbs [1], [2], [3], [4], [5], [6]. Mostly, electromyography (EMG) recordings are used to relate the activity of muscles remnant from amputation to a desired prosthesis joint actuation. Recent improvements in computational hardware, especially the computational power-to-size ratio, allowed the acquisition and processing of EMG signals to become more portable and reliable. This led to several prosthetic control studies being carried out outside lab settings, as well as many commercial applications such as commercial prosthetics using advanced myoelectric algorithms [7]. For computational hardware in prosthetics, microcontrollers (MCUs) are commonly favored over other embedded systems like Field Programmable Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs) as the go-to platform because they are less complex to program and often feature extensive library and tool support [8], [9], [10]. Previously, standard machine learning algorithms like support vector machines (SVM) and linear discriminant analysis (LDA) were used on the computational hardware to decode motion intent [1]. Due to the availability of bigger motion intent data sets and the need for higher performance, the field shifted towards neural networks (NNs), especially deep neural networks [2]. Deep learning techniques like deep neural networks enable more complex and more accurate motion intent recognition [11].
However, current state-of-the-art neural networks impose significant demands on memory, computation, and energy [12], [13]. This presents a challenge for portable, self-contained systems that operate all day with limited resources, relying solely on low-power MCUs. To address this issue and accommodate the limited computing capability and storage capacity of MCU-class devices, recent advancements in deep learning training methodologies have introduced novel quantization methods. These methods aim to compress the network's weights or activations into 8-bit or smaller data types while incurring minimal or even negligible accuracy loss [14], [15], [16], [17]. By employing these quantization techniques, known as network quantization, the memory requirements of the models are significantly reduced compared to their full-precision counterparts.
Consequently, both industry and academia have been actively engaged in developing hardware and software platforms dedicated to the efficient execution of quantized neural networks (QNNs) on MCU-class devices. This concerted effort is driven by the recognition that QNNs offer a viable solution to enable the deployment of neural networks on resource-constrained systems, where memory, computation, and energy efficiency are paramount considerations. One of the most used platforms for training and quantization is TensorFlow Lite [18].
This paper compares the advantages of quantization for self-contained systems when utilizing the TensorFlow Lite and QKeras platforms. The latter was made possible by the proposed innovative deployment workflow targeting the newly released QKeras platform. We present a comprehensive QNN comparison tailored to Cortex-M-based microcontrollers and benchmark microcontrollers featuring DSP functionalities. The aim is to enhance the performance of self-contained embedded systems while simultaneously reducing the flash memory footprint. We present the following key contributions in this paper:
1. We compared the quantization and deployment strategies for Cortex-M-based microcontrollers in TensorFlow Lite and QKeras. For the hardware targets, we measured inference time, energy consumption, and memory footprint.
2. We benchmarked different MCUs and compared the inference time of different data types.
3. We show the potential of the newly released ARM Cortex-M55 processor, with and without the ARM Helium Vector Extension (VE), in terms of inference time and memory footprint.
4. We provide open-source scripts to facilitate the integration of our findings and to increase the usability of the novel QKeras platform with the CMSIS library (Common Microcontroller Software Interface Standard) [19].

II. RELATED WORK
Various research efforts have targeted machine learning and deep learning on resource-constrained devices and are presented in this section. Existing research can be grouped into embedded hardware for motion intent recognition and network quantization methods.

A. EMBEDDED HARDWARE FOR MOTION INTENT RECOGNITION (DEEP LEARNING)
While the field of deep learning has been growing in terms of performance, network size, and training run time, the development of embedded hardware to process deep learning algorithms is struggling to keep up [20], [21]. As the demand for higher performance grows, researchers are exploring and investigating various options for implementation, including field-programmable gate array (FPGA) devices on systems-on-chip (SoCs) [22], [23], [24], [25], graphics processing units (GPUs) [26], [27], [28], and neuromorphic chips and processors [29], [30], [31]. However, when considering prosthetic devices that operate on batteries, the power envelope becomes a crucial constraint. FPGA SoCs and GPUs often exceed this power limitation, making them less feasible for such applications. On the one hand, neuromorphic technology offers the promise of exceptionally low power consumption, but several challenges still impede its rapid growth and widespread usage [32]. On the other hand, resource-constrained microcontrollers (MCUs) offer software programmability, affordability, and low power consumption. These unique attributes make MCUs well-suited for the development of portable prosthetic systems that are designed for all-day use [33]. Table 1 showcases numerous scientific papers that have utilized MCUs for EMG motion intent recognition.
The Cortex-M4 is currently the most used CPU in the field (see Table 1). In [34], a Raspberry Pi board is used to process the radial basis function (RBF) neural network classification algorithm. Both [35] and [38] have the highest time values, since they only reported their completion time including the finished movement of the prosthetic system (see Table 1). Our research group uses the Artificial Limb Controller (ALC) embedded system [36]. The system's power consumption depends on the operation mode, from idling (275 mW) to streaming data (495 mW) [36].

B. NETWORK QUANTIZATION
With the success of deep neural networks and their ever-increasing sizes, the quantization of neural networks has emerged as a fundamental technique to reduce model size and memory footprint. Remarkable progress in this area has led to quantized neural networks achieving similar levels of accuracy as their full-precision counterparts [14], [15], [17]. In neural network quantization, three key components can be targeted: weights, activations, and gradients [40]. In this study, our focus is on quantizing weights and activations.

III. METHOD

A. FRAMEWORKS
We tested and compared the newly released QKeras framework [55] with the widely used TensorFlow Lite.
TensorFlow Lite is an open-source framework for running machine learning models on embedded systems. It was designed to provide a unified ML framework, addressing the multitude of embedded platforms and hardware support [10]. The TensorFlow Lite library can run with or without CMSIS support and is therefore hardware independent.
QKeras is a quantization extension to Keras. It allows the developer to choose the quantizer and the quantizer's parameters for each network layer, leading to a customized deep quantized version of the Keras network [55], [56]. In our study, the QKeras implementation always uses CMSIS support via our developed support scripts.

B. TOOLCHAIN
As depicted in Fig. 1, the initial phase involves the model design, i.e., selecting the network type and structure for testing purposes.
Subsequently, the model undergoes training and analysis (referred to as model training and analysis) to assess the baseline performance metrics, namely accuracy and byte size. At this stage, it is possible to examine the complexity of the model by considering the fixed number of parameters and estimated operations. The subsequent step entails optimizing the model to achieve reduced memory usage and reduced inference time (known as model optimization). By this point in the development, the model has been quantized. It is important to consider the hardware architecture of the target device during the quantization process. For instance, current MCU architectures cannot effectively utilize sub-int8 quantization. It should be noted that optimizations made during this phase permanently modify the neural network and have the potential to impact its accuracy. Hence, by conducting on-host evaluation, we can promptly evaluate the performance of the quantized model. Once the model is optimized, it is ready to be deployed on the MCU. The implementation on the MCU makes use of the CMSIS-NN library. This library, comprising optimized kernels, is an open-source solution designed to optimize the performance and minimize the memory usage of neural network applications on Arm Cortex-M processors [13]. These optimized kernels do not change the numerical representation itself but increase the execution efficiency by leveraging target-specific features like single instruction, multiple data (SIMD) units.
The concluding steps involve testing the deployed model and evaluating its performance. Although the NN deployment does not modify the NN itself, the accuracy on the MCU can be altered by different underlying computations, potentially leading to varied classifications [13]. For this reason, it must be verified that the inference on the target hardware is identical to the inference on the host system (target evaluation, see Fig. 1). Furthermore, the inference time and the energy consumption are measured as part of the evaluation process. The tests were conducted on two distinct MCUs, employing different data types, and comparing the two implementations of the neural network, one utilizing TFLite on top of CMSIS-NN and the other without it (QKeras). Subsequently, the iterative development cycle recommences.

C. NETWORK AND EMG DATASET
A representative EMG dataset for motion intent with hand gestures in float32 data format was used for this study [57]. To classify motion intent for prosthetics, a simple feedforward neural network (FFNN) with one hidden layer was chosen. This choice is substantiated by several surveys [58], [59] that highlight the widespread use of this model for motion intent recognition. The dataset used is balanced, which justifies the choice of accuracy as the metric for the NN evaluation. In addition, accuracy is the most popular metric according to recent surveys [58], [59]. The network architecture is described in Table 2.

D. EXPERIMENT DESIGN
The experimental design workflow is depicted in Fig. 2. The comparison involves quantizing the float32 neural network to integer (int) types, once with TensorFlow Lite and once with QKeras.
As shown in Fig. 2, both experiment workflows initially utilize the Keras and QKeras tools for Quantization-Aware Training. In the first experiment, the trained Keras model is then converted into a TFLite model. This conversion process involves quantizing the model with int8 weights and uint8 activations, while keeping biases as int32 data type (mixed quantization). In TFLite, the neural network topology is deployed as microcode, which is interpreted at runtime rather than statically compiled. This process hinders compiler optimizations and leads to increased memory usage [60]. TFLite can run with or without usage of the CMSIS-NN library.
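For reference, a conversion along these lines can be expressed with the standard TFLite converter API; the stand-in model and calibration generator below are placeholders, not the exact scripts used in this work.

```python
import numpy as np
import tensorflow as tf

# Stand-in model and calibration data (placeholders, not the paper's network).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax"),
])

def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 16).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8   # uint8 activations at the interface
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
```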
For the QKeras experiment workflow, we conducted tests using QKeras with two homogeneous quantization methods: int8 (q7) and int16 (q15). Following Quantization-Aware Training, we manipulate the extracted parameters before they are used as input for the CMSIS-NN functions. The manipulation is needed since the tensor layout differs between the machine learning framework and the CMSIS-NN library. The network parameters that the QKeras quantization produces are not integers, which is the format used by CMSIS. Therefore, bias and weight values need to be converted to integers. Additionally, the layout of tensors follows a different convention in QKeras compared to CMSIS-NN, so a reshape of the NN parameters was necessary before using them.
Importantly, when using QKeras, only the parameter values and the CMSIS library contribute significantly to the flash memory footprint on the target hardware, in contrast to TFLite, which by itself occupies a significant part of the flash memory.
We tested the same implementation on three different hardware targets:
1. The TM4C123GHPM from Texas Instruments, based on the Arm Cortex-M4 32-bit RISC core operating at up to 80 MHz, used in the ALC developed by our group.
2. The STM32H7A3ZI from STMicroelectronics, based on the high-performance Arm Cortex-M7 32-bit RISC core operating at up to 280 MHz.
3. The Corstone-300 (SSE-300), which leverages the Cortex-M55, Arm's most AI-capable Cortex-M processor and the first to feature Arm Helium vector processing technology. We tested its performance using a virtual hardware target since it is not released yet.

E. EXPERIMENT HARDWARE SETUP
Hardware tests were performed using the Keil MDK tool, which comprises an Integrated Development Environment called µVision specifically designed for Cortex-M devices, along with the ARM compiler. The tests were conducted with an operating frequency of 80 MHz. For each measurement, both the model and parameters were stored in the flash memory, and the armclang compiler was used. Due to the limited flash memory on the TM4 when using TensorFlow, the Oz compiler optimization had to be used. All other measurements were carried out with the O2 compiler optimization. Energy measurements were conducted using the ARM ULINKplus debugger. This debugger serves as a debug adapter that connects the PC's USB port to the target system through JTAG. It facilitates debugging, tracing, analysis, and the measurement of power consumption on the target hardware. Energy measurements were not done with the SSE-300, since it is only simulated and not yet released as hardware.

F. PRIMARY OUTCOME MEASURES
The three primary outcome measures of the study are inference time, flash memory footprint, and energy per inference, which are measured using Keil and Python.
Inference time: Inference time is defined as the duration required to execute a single motion intent classification on the trained neural network.
Flash memory footprint: The flash memory footprint is determined by the total memory required for the neural network code, library size, and initial parameters necessary for executing the motion intent classification.
Energy per inference: The energy per inference is calculated by multiplying the current consumed during classification by the 3.3 V supply voltage over the inference time. Therefore, E_inference = I_classification × 3.3 V × t_inference.
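As an illustrative calculation with assumed values (not measurements from this study), a draw of 10 mA over a 500 µs inference corresponds to:

```python
current_a = 0.010                                  # assumed current draw [A]
voltage_v = 3.3                                    # supply voltage [V]
t_inference_s = 500e-6                             # assumed inference time [s]
energy_j = current_a * voltage_v * t_inference_s
print(f"{energy_j * 1e6:.2f} uJ per inference")    # 16.50 uJ
```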

Accuracy loss and MCU accuracy loss: The accuracy loss is calculated by comparing the accuracy of the full-precision network to the on-host accuracy of the QNN. The MCU accuracy loss is calculated by comparing the accuracy of the on-host QNN to the MCU accuracy of the deployed QNN.

IV. RESULTS

A. NETWORK SIZE REDUCTION
With TFLite, a 65% on-host network size reduction is obtained (see Table 3). With QKeras, a network size reduction of 50% and 73% is achieved through int16 and int8 quantization strategies, respectively. The QNN, when deployed using CMSIS-NN, displays a decreased accuracy of less than 1% for all implementations ("Acc. loss" column in Table 3). The accuracy drop after deployment is slightly higher in QKeras compared to TFLite.

B. INFERENCE TIME
Using the CMSIS library in TensorFlow Lite results in a threefold inference time improvement compared to the hardware-independent solution without CMSIS (see Table 4 and Fig. 3). Furthermore, when comparing all these results to our starting scenario without quantization (none-float32), all configurations utilizing the CMSIS library demonstrated at least a doubling of the speed and, in some cases, achieved up to three times faster inference.
The new and, therefore, simulated Cortex-M55 processor (SSE-300) exhibits an inference time improvement of approximately 2 to 3 times when operated with Helium. In general, the SSE-300 with Helium is twice as fast as the STM32 and 4 times faster than the TM4.

C. MEMORY FOOTPRINT
TFLite exhibits a substantial increase in flash memory consumption independent of the usage of the CMSIS library, as shown in Table 5 and Fig. 4. TFLite requires up to 24 times more flash memory compared to the QKeras implementation. The SSE-300 shows a 3% to 16% lower flash memory footprint than the TM4 or STM32 chip, and the use of Helium reduces the SSE-300 footprint by a further 4% to 9%.

D. POWER CONSUMPTION
The TM4 chip exhibits a current consumption that is up to 106% higher than the STM32 chip, as depicted in Table 6. On both chips, the q15 quantization of the neural network consumes the highest amount of current. In contrast, the unquantized float32 implementation consumes approximately 3% to 6% less current than the q15 implementation. The q7 implementation demonstrates the lowest power consumption, consuming 14% to 17% less current compared to q15.
The energy consumption per inference can be reduced by implementing quantization, achieving a reduction of 60% on the TM4 chip and 64% on the STM32 chip (see Fig. 5). Moreover, upgrading the chip architecture without applying quantization techniques leads to a noticeable reduction in the energy per inference, ranging from 71% to 73% (see Fig. 5). Notably, the combined utilization of quantization and chip upgrades can yield an energy reduction of up to 90% per inference.

V. DISCUSSION

A. NETWORK SIZE REDUCTION
The quantization methods demonstrate an accuracy loss and MCU accuracy loss of less than 1% while achieving a significant network size reduction of 50-74%, depending on the approach. This highlights the advantage of quantization in embedded neural networks on resource-constrained systems, where notable memory savings can be achieved without sacrificing performance. Specifically, the observed accuracy drop is more pronounced in QKeras compared to TFLite due to the manipulation and conversion of parameters to integers [13]. The observed reduction of the deployed accuracy is not a feature of CMSIS-NN, but a random effect due to specific kernel implementations and a consequence of the different underlying computations [13], [61].

B. INFERENCE TIME
The considerable reductions in inference time, by a factor of 2 to 3, directly result from the quantization of the neural network. TFLite leverages the CMSIS-NN library and exploits the hardware optimizations provided by CMSIS functions. Alternatively, it can operate independently of specific hardware but with a higher inference time, even surpassing that of the unquantized network. On the Cortex-M4-based (TM4) and Cortex-M7-based (STM32) microcontrollers, the q15 inference time is up to 7% faster than the q7 inference time. This is due to a limitation of these microcontrollers: they can only perform dual 16-bit multiply-accumulates and cannot parallelize quad 8-bit signed multiply-accumulates. Consequently, the q7 implementation requires the expansion of q7 vectors into q15 vectors, making it slower than the q15 implementation.

C. MEMORY FOOTPRINT
The significant memory footprint of TensorFlow Lite, which is 7 times higher than the unquantized network condition float32, predominantly stems from the TensorFlow library itself, as evidenced by the neural network model occupying only 13 KB (see Fig. 4). Flash memory consumption can be reduced by removing unused operations from the library and incorporating only the operations required by the current model. Notably, the utilization of QKeras together with our support for the CMSIS library results in a remarkable reduction in memory footprint. Thus, when memory footprint is a critical factor for the system, QKeras emerges as the preferred software solution. For the SSE-300 chip, with and without Helium, low-overhead branch instructions are available, leading to much lower memory consumption than the STM32 and TM4 MCUs (see Table 5) and further improving the reduction of flash memory for resource-constrained embedded systems [62].

D. POWER CONSUMPTION
The q15 quantization leverages the SIMD architecture, and SIMD instructions are likely to consume more power compared to the float32 scalar code. Since the 16-bit operations are nevertheless more efficient overall, the advantages are visible in Fig. 5. With the int8 data type, operations are still performed with 32-bit registers, but half of each register is unused since int8 data are half the int16 size. In this case, the switching activity of the transistors involved in the operations is lower, which leads to lower power consumption.
In the context of prosthetics, minimizing power consumption in self-contained prostheses is crucial to ensure functionality and independence in everyday life. This significance arises from the fact that active prosthetic users typically engage with their prostheses for an average of 11 hours daily, with merely 7 minutes dedicated to actual hand movement [63]. Even with periods of heightened energy consumption during active usage, the prosthetic system idles for a staggering 99% of the usage time, revealing an immense potential for power consumption savings. Therefore, by reducing the general power consumption through chip upgrades and network quantization (Fig. 5), substantial improvements in battery life can be achieved. Even greater reductions can be achieved by incorporating the sleep mode of the chips in the code.

E. GENERAL DISCUSSION
Quantization of neural networks plays a vital role in enabling resource-constrained embedded systems to effectively deploy larger networks with increased complexity, thereby maximizing real-time classification accuracy. In scenarios where flash memory is limited, the QKeras platform emerges as the optimal software choice for network quantization. Therefore, detailed guidelines and scripts are provided in the supplementary material to facilitate the utilization of QKeras for feedforward neural networks.
The newly released Cortex-M55 processor proved to have the best flash memory footprint values in the simulation, due to efficiently exploiting the capabilities of the int8 data type and effectively handling ML workloads thanks to the Helium vector extension. However, the M55's performance should also be tested on hardware as soon as it is available for purchase to enable a correct comparison with the other MCUs.
While this study focuses primarily on the quantization method for network compression, it is important to note that other compression techniques, such as pruning, are commonly employed. Future research should explore the effectiveness of these compression methods, which have shown promise and are now supported by both the TensorFlow and QKeras frameworks.

VI. CONCLUSION
Quantization of neural networks with TensorFlow Lite leads to a massive increase in flash memory consumption. This poses a notable limitation for resource-constrained platforms, such as prosthetics designed for all-day wearability. QKeras with our CMSIS-NN support stands out by exclusively utilizing CMSIS-NN functions for quantization, thereby delivering superior performance compared to TensorFlow Lite and making it usable for resource-constrained platforms.
For the prosthetics community, we have provided open-source scripts and explanations to support the integration of QKeras and facilitate further development.

APPENDIX A
Our integration begins with the definition and training of a neural network via QKeras.
The first step entails establishing the network's architecture and configuring the desired quantization. As in the paper, we use a three-layer feed-forward network as an example (see Table 7).
Notably, QKeras inherently facilitates network quantization, and we elaborate on the process of reducing the network parameters down to 8-bit precision. In Listing 1, we reported the function that defines the NN model. For the sake of conciseness, we only reported the first NN layer. The parameter n_bits corresponds to the bitwidth, defining the precision of the quantization. Additionally, within the function quantized_bits(), the second parameter represents the number of bits allocated for the integer segment of the quantized value. For this instance, we selected 1 bit for the integer component, although alternative selections are viable. It should be noted that the integer bit excludes the sign bit. The count of unsigned bits equals the total number of quantized bits minus the sign bit.
Furthermore, the fractional part of the fixed-point number equals the unsigned bits minus the integer bits.
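As Listing 1 is not reproduced in this version of the text, the following is a minimal sketch of such a create_model() definition; the layer sizes are illustrative assumptions, not the exact Table 7 values.

```python
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

def create_model(n_bits=8, integer=1):
    # Input and layer sizes are placeholder assumptions.
    x = inputs = Input(shape=(16,))
    # Hidden layer: weights and biases quantized to n_bits in total,
    # with `integer` bits for the integer part (the sign bit is excluded).
    x = QDense(32,
               kernel_quantizer=quantized_bits(n_bits, integer),
               bias_quantizer=quantized_bits(n_bits, integer),
               name="fc1")(x)
    x = QActivation(quantized_relu(n_bits, integer))(x)
    # Further layers omitted for conciseness, as in Listing 1.
    outputs = QDense(5,
                     kernel_quantizer=quantized_bits(n_bits, integer),
                     bias_quantizer=quantized_bits(n_bits, integer),
                     name="fc_out")(x)
    return Model(inputs=inputs, outputs=outputs)
```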
The QKeras functions can be used in order to compile the model, fit it, and calculate the accuracy using the test set.
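Since a QKeras model behaves like a regular Keras model, this step can follow the standard Keras pattern; a brief sketch, where x_train, y_train, x_test, y_test, and the loss choice are assumptions about the prepared EMG data:

```python
model = create_model(n_bits=8, integer=1)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # label format is an assumption
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=30, batch_size=32, validation_split=0.1)
_, test_accuracy = model.evaluate(x_test, y_test)
```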
As explained in the paper, our contribution consisted in manipulating the trained weights and exporting them so that they can be included in the MCU code. The following step is then saving the quantized parameters in a way that is compatible with CMSIS. To do that, we changed the QKeras function model_save_quantized_weights() (see Listing 2). Specifically, we changed what quantizer() outputs, as it usually returns a fixed-point quantized value, while CMSIS only works with integer quantized values. In this appendix we take the int8 quantization as a reference. In the function save_parameters(), every parameter needs to be multiplied by an integer value mul, which in our case can be calculated as mul = 2^(unsigned bits) / 2^(integer bits). As an example, let us consider a quantization to 8 bits of nonnegative numbers, where the lowest possible number is lowest = 0. The smallest fixed-point number is number_fixedpoint = 1/2^8 = 1/256 = 0.00390625. As we cannot work with fixed-point values when using CMSIS, we need the smallest number to become the smallest integer (i.e., 1). To achieve this, we multiply number_fixedpoint by mul, where mul = 2^(unsigned bits) / 2^(integer bits) = 2^8 / 2^0 = 2^8. Rewritten: number_integer = number_fixedpoint × mul.
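A minimal sketch of this scaling step is shown below, assuming the signed int8 configuration with 1 integer bit used throughout this appendix; our actual scripts, adapted from model_save_quantized_weights(), handle the general case.

```python
import numpy as np

def save_parameters(params, n_bits=8, integer_bits=1, signed=True):
    """Convert QKeras fixed-point parameters into CMSIS-style integers."""
    unsigned_bits = n_bits - (1 if signed else 0)
    # mul = 2^unsigned_bits / 2^integer_bits, i.e. one over the smallest step
    mul = 2 ** (unsigned_bits - integer_bits)
    return np.clip(np.round(params * mul), -128, 127).astype(np.int8)
```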
The last step to make the QKeras parameters compatible with CMSIS is to reshape the weights. The code in Listing 3, adapted from an example that achieves this for convolutional layers, transforms the weights into a CMSIS-compatible format. After this, the parameters are ready to be used by the CMSIS functions.
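For the fully connected layers used here, the reshape reduces to transposing the Keras (inputs × outputs) kernel into the (outputs × inputs) row-major layout that the CMSIS-NN fully connected kernels expect; a sketch in the spirit of Listing 3:

```python
import numpy as np

def reshape_fc_weights(kernel_int8):
    """Transpose a Keras dense kernel of shape (inputs, outputs) into the
    (outputs, inputs) row-major array used by arm_fully_connected_q7()."""
    return np.ascontiguousarray(kernel_int8.T).ravel()
```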
We now explain the full implementation to run a quantized network using CMSIS.
Below are some additional considerations in case the provided code is to be extended. Notation: we will use the Q notation used by ARM to represent a fixed-point number. This means that Qm.f is a fixed-point number that has m bits in the integer part of the value, counting the sign bit, and f fractional bits.
In the CMSIS library, the data are assumed to be in dynamic fixed-point format. For example, a q7_t input number can be Q4.3, where the actual represented value is the int8 value divided by 2^3, but it can be Q1.7 as well, where the actual represented value is the int8 value divided by 2^7. As mentioned before, we take the int8 configuration as a reference example.
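To make the dynamic fixed-point idea concrete, the following illustrative lines (not part of the paper's scripts) interpret the same int8 bit pattern under both formats:

```python
raw = 96                # one and the same int8 bit pattern
print(raw / 2 ** 3)     # read as Q4.3 -> 12.0
print(raw / 2 ** 7)     # read as Q1.7 -> 0.75
```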
Let us assume that we have the input data in the Q1.7 format and the NN's parameters in the Q2.6 format. The first fully connected layer is taken as an example, where the output of the fully connected layer is chosen to be in the Q2.6 format. The definition of the ARM fully connected function is shown in Listing 4; we later explain the usage of the parameters in bold.
We specify for completeness that our parameters are in the Q2.6 format because the value of ''integer'' in the QKeras create_model() function was chosen to be 1. This means that we have 2 bits in the integer part of the fixed-point number, one of which is the sign bit.
The core operation of the fully connected layer is the matrix multiplication between the weight and input matrices; the result is then summed with the bias matrix (W × I + B = OUT). In fixed-point, W × I becomes Q2.6 × Q1.7, and the result is saved in a 32-bit accumulator in the Q19.13 format.
The reason for this result is that in fixed-point multiplication, the numbers of fractional bits have to be summed to obtain the correct format. In this case, the result will have 6 + 7 = 13 fractional bits, and the rest of the 32-bit accumulator is assigned to the integer part (32 − 13 = 19). This is why Q2.6 × Q1.7 = Q19.13.
Before summing the bias (Q2.6) to the multiplication result (Q19.13), these two addends need to be in the same format.
For this purpose, we have to use the bias_shift parameter of the fully connected CMSIS function; bias_shift should be 7, since a left shift of the bias by 7 aligns it with the format of the multiplication result.
The accumulated result is in the Q19.13 format as well. Since we chose a Q2.6 output format, a right shift of 7 is necessary for the output. Thus, the value 7 has to be assigned to the out_shift parameter of the fully connected CMSIS function.
These operations for the first layer become: out_Q2.6 = ((W_Q2.6 × in_Q1.7) + (bias_Q2.6 << bias_shift) + NN_ROUND(out_shift)) >> out_shift, with bias_shift = out_shift = 7. The arm_fully_connected_q7() function does not only sum the bias to the multiplication result; it also adds NN_ROUND(out_shift) together with the bias, where NN_ROUND(out_shift) equals 2^(out_shift − 1). Even though this term is meant to compensate for the error due to the n-bit shifting of the output, it alters the bias values and could introduce an error.
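For clarity, a plain Python rendering of this arithmetic is sketched below. It mirrors the shifts and rounding just described; the real arm_fully_connected_q7() kernel additionally uses saturating SIMD instructions and a q7-to-q15 expansion buffer, so this is a numerical reference only.

```python
import numpy as np

def fully_connected_q7_reference(x_q1_7, w_q2_6, b_q2_6, bias_shift=7, out_shift=7):
    """Reference for the first layer: Q1.7 input, Q2.6 weights/bias, Q2.6 output."""
    acc = w_q2_6.astype(np.int32) @ x_q1_7.astype(np.int32)  # Q2.6 * Q1.7 -> Q19.13
    acc += b_q2_6.astype(np.int32) << bias_shift             # bias Q2.6 -> Q19.13
    acc += 1 << (out_shift - 1)                              # NN_ROUND(out_shift)
    out = acc >> out_shift                                   # Q19.13 -> Q2.6
    return np.clip(out, -128, 127).astype(np.int8)           # saturate to q7 range
```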

FIGURE 1. Design loop for development, optimization, deployment, and evaluation of neural networks for embedded microcontroller (MCU) use.

TABLE 1. Overview of scientific papers that have utilized MCUs for EMG motion intent recognition.

FIGURE 2. Experiment design using the TensorFlow Lite and QKeras platforms for quantization of neural networks. Evaluation of the quantization performance is done on three MCU boards representing the neural interface market with usage of the CMSIS library.


FIGURE 3. Inference time for the quantized networks in QKeras or TFLite on an STM32 chip with and without the CMSIS library.

TABLE 5. Flash memory footprint [kB] of the neural network on TM4, STM32, and SSE-300 chips with various quantizations and CMSIS library usage. O2 optimization was employed to accommodate the limited flash memory of the MCU.

FIGURE 4. Flash memory footprint [kB] for the quantized neural networks with QKeras and TFLite on an STM32 chip with and without usage of the CMSIS library (only TFLite).

FIGURE 5. Quantization and chip upgrade effects on the primary outcome measure energy per inference, from float32 to q7 and q15, for the TM4 and STM32 chips.

TABLE 2. 3-layer (1 hidden layer) neural network architecture used for this study.

TABLE 3. Memory footprint and accuracy loss of the network due to quantization with TensorFlow Lite (mixed quantization) or QKeras (q7, q15). Number of multiply-accumulates MACs = 9728; number of parameters = 9928.

TABLE 4. Inference time [µs] of the neural network on TM4, STM32, and SSE-300 chips with various quantizations and CMSIS library usage. The O2 optimization was employed to accommodate the limited flash memory of the MCU. SSE-300 was tested with and without Helium usage.

TABLE 6. Current consumption [mA] running the neural network on the TM4 and STM32 chips.

TABLE 7. 3-layer (1 hidden layer) neural network architecture used for this study.