Embedded Machine Learning Using Microcontrollers in Wearable and Ambulatory Systems for Health and Care Applications: A Review

The use of machine learning in medical and assistive applications is receiving significant attention thanks to the unique potential it offers to solve complex healthcare problems for which no other solutions have been found. Particularly promising in this field is the combination of machine learning with novel wearable devices. Machine learning models, however, are computationally demanding, which has typically meant that the acquired data must be transmitted to remote cloud servers for inference. This is not ideal from the point of view of the system's requirements. Recently, efforts to replace the cloud servers with an alternative inference device closer to the sensing platform have given rise to a new area of research: Tiny Machine Learning (TinyML). In this work, we investigate the different challenges and specification trade-offs associated with existing hardware options, as well as recently developed software tools, when trying to use microcontroller units (MCUs) as inference devices for health and care applications. The paper also reviews existing wearable systems incorporating MCUs for monitoring and management in the context of different health and care intended uses. Overall, this work addresses the gap in the literature targeting the use of MCUs as edge inference devices for healthcare wearables, and can thus be used as a starting point for embedding machine learning models on MCUs, with a focus on healthcare wearables.

…machine learning have led to researchers starting to look into combining the benefits of the two (i.e., usability and a high ability to extract information of interest) within the context of health and care applications.

Designing wearable devices for these applications, however, is not a straightforward process. As technological advances provide advantages, they also present different challenges along the way. These challenges are summarized in Fig. 1. Although there is an evident increase in the use of wearables, this reflects the acceptance of only the part of the population which voluntarily uses them. But when wearables are intended to be used for monitoring health status and providing care and assistance, users might not show the same enthusiasm, especially the elderly or disabled. Therefore, user acceptability is a major challenge that must be taken into account…

Lastly, an additional challenge in the design of ML-based healthcare wearables is the choice of ML algorithm and its actual physical implementation. The algorithm will handle the collected big data and dictate the system's performance (which in the context of healthcare will be linked to both safety and intended use), and its implementation location will have an effect on all the above-mentioned challenges: power consumption, privacy and security, and, indirectly, usability. Therefore, a closer look at the application of machine learning algorithms in healthcare, and their integration in wearable devices, is a must.

The application of machine learning algorithms in various medical scenarios is a relatively new, but exponentially growing, field of research.
Machine learning (ML), including deep learning (DL), algorithms have the ability to handle and interpret large amounts of data, identifying patterns and trends not usually noticeable by physicians. This has been demonstrated in a variety of medical areas; namely, oncology [4], [5], [6], [7], [8], pulmonology [9], [10], [11], cardiovascular diseases [12], [13], [14], [15], neurology [16], [17], [18], orthopaedics [19], [20], histopathology [21], [22], and others. Different types of medical data that have been successfully used in the context of ML for healthcare applications include [23]: (1) medical images, such as X-rays, MRI, and CT scans; (2) physiological signals, such as the electrocardiogram (ECG), electroencephalogram (EEG), electromyogram (EMG), and photoplethysmogram (PPG); (3) omics data, such as genomics, transcriptomics, and others [24]; and (4) electronic health records. These various medical data are used in healthcare applications including diagnosis, prognosis, monitoring of disease and physical fitness, as well as in body-machine interfaces for ambient assisted living and patient rehabilitation.

…enables fast inference but at the expense of high power consumption [30]. FPGAs consume less power and can fit ML algorithms with acceptable performance and programmability on hardware, but are less area and energy efficient than ASICs [33] (and this, in turn, can affect the size of the overall healthcare device). ASICs, however, have huge development costs, which are not always affordable when taking into account the commercial route to market of the devices. MCUs, on the other hand, although also requiring higher power consumption than ASICs, can be more area and cost efficient than FPGAs. In addition, they are also easier to reprogram, requiring embedded C rather than specialist knowledge of VHDL, which allows for quick updates.
The performance of ML algorithms on MCUs is dependent on multiple variables, related to both the MCU's specification and the specific application. In general, weighing the pros and cons of the hardware options, MCUs are the most appealing option for use in wearable devices, because of their low power consumption, latency, size, flexibility, and cost. Therefore, this review explores the use of MCUs as edge inference devices in wearable devices for healthcare applications. An example design for a future health and care wearable device using TinyML technology is presented in Fig. 3. Data collected from wearable sensors and related public datasets are used for building the ML model. The process of building the ML model follows the conventional process, requiring pre-processing, data splitting for training, and evaluation of the chosen model. Development of the model is then achieved through error calculations on the validation set and changing of hyperparameters as needed. Once a final optimum model is reached, the actual work related to TinyML begins. The model is optimized through different compression techniques, and then converted to an MCU-compatible format. The process of model optimization through different…

…optimization and deployment of ML models on MCUs, also analyzing their use and compatibility with specific hardware. The paper also covers how these hardware and software tools have been used in the design of the reported healthcare-aimed TinyML systems. Overall, this work can serve as a guide for researchers and system designers when trying to make system architecture decisions in the initial phases of the design process.

Peer-reviewed articles written in English reporting TinyML wearable systems for healthcare applications were included in this review. The articles were searched for using Google Scholar and IEEE Xplore.
Further articles and data repositories were obtained from the citations of found articles. Search terms used included: edge machine learning, embedded machine learning, TinyML, inference on microcontroller, and machine learning on microcontrollers, combined with health and care (healthcare) applications and wearable devices/systems. 1240 records were found in this search. Article titles and abstracts were used for initial screening to identify articles presenting health and care systems with embedded ML algorithms. This led to the exclusion of 1207 records. Selected articles (33 studies) were further screened based on eligibility criteria: articles with missing information related to on-board performance metrics or algorithm training and deployment, not using MCUs, or describing non-wearable systems were excluded. This resulted in the further exclusion of 16 studies. The remaining 17 studies were included in the review.

Data extraction was undertaken on eligible articles using a customized data extraction form. Information related to the system application, components, sensor placement, dataset used for training, ML algorithm and architecture, software tools for deployment, choice of MCU, and performance achieved for each TinyML system was extracted; together with performance metrics focusing on memory usage, time per inference, and power per inference.

…[47]. The choice of ISA alone does not dictate the overall performance of the processor, as the hardware integration/architecture and available peripherals play an important role too. Therefore, MCUs with similar core processor ISAs can still perform differently, and other characteristics of the MCU should be investigated before a decision is made.

Another key aspect when choosing a microcontroller for wearable devices is power consumption, since this plays a vital role in usability (linked to both size and device maintenance). This, in turn, is linked to the minimum accuracy of inference required by the application, as it is going to be one of the factors determining the information loss that can be afforded when transferring ML algorithms from high computational servers to the edge. The latter is also going to be affected by the memory constraints of the chosen MCU. The available memory in an MCU, both Flash and RAM, generates further constraints on the size and architecture of the deployed ML models. There should be enough storage for both collected data and ML model parameters (weights and activation functions), as well as enough RAM to process data and run inference. The speed of the processor is also a factor to take into account, which can be crucial in the context of certain applications, since it will determine how fast an inference and corresponding decision can be made. Size is also important because wearable devices are heavily volume and area constrained. Overall, considering the limitations of MCUs, it is important in the design process to carefully establish the acceptable performance trade-offs as a function of all these different variables, taking into account the device's intended use. Table 1 summarizes the specifications of the different ARM and AVR MCUs that have been used in the context of wearable healthcare devices, including the type of processor, its speed, the available Flash and RAM memory, as well as the operating voltage range and the current consumption when the MCU is in run mode.
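As a rough illustration of how these Flash constraints interact with model size, the sketch below estimates the parameter storage of a small fully connected network before and after 8-bit quantization, and checks it against an assumed Flash budget. The layer shapes and the 256 KB budget are illustrative assumptions, not figures taken from Table 1 or any reviewed system.

```python
# Hypothetical sketch: checking whether a model's parameters fit an
# MCU's Flash budget before attempting deployment.

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def model_size_bytes(layer_shapes, bytes_per_param):
    """Total parameter storage for a stack of dense layers."""
    total = sum(dense_params(i, o) for i, o in layer_shapes)
    return total * bytes_per_param

# A small illustrative MLP: 64 inputs -> 32 -> 16 -> 4 classes.
layers = [(64, 32), (32, 16), (16, 4)]

float32_size = model_size_bytes(layers, 4)  # before compression
int8_size = model_size_bytes(layers, 1)     # after 8-bit quantization

print(float32_size, int8_size)
assert float32_size == 4 * int8_size        # 8-bit weights: 4x smaller

# Compare against an assumed MCU with 256 KB of Flash:
FLASH_BUDGET = 256 * 1024
assert int8_size < FLASH_BUDGET
```

In practice the RAM needed for activations and the storage for application code must be budgeted in the same way, which is why memory is usually the first constraint checked against Table 1.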

From the table, it can be seen that ARM is the most dominant provider for various vendors, including STMicroelectronics [48], [49], [50], [51], [52], and Nordic Semiconductor [53], [54]. ARM provides low power Cortex-M processors suitable for use in embedding neural network architectures for on-device inference [57]. Within the different Cortex-M processors, Cortex-M4 and Cortex-M4F…

…done after training the model for optimum performance (post-training quantization) to provide an up to 4× smaller model. Pruning, on the other hand, eliminates neurons and connections in the model's architecture that do not affect the overall model performance as much as major connections do. The use of quantization and pruning provides an MCU-compatible model with reduced power consumption and latency as a result of reduced memory usage, but at the expense of the model's accuracy. The challenge is to find an acceptable trade-off between performance and power/latency for the embedded ML model. It is possible to use a single technique alone or both combined. This was demonstrated in the work of [64], where both techniques were used to compress a 7-layer convolutional neural network (CNN) and a ResNet-50 model. Both classification models were tested using the CIFAR-10 dataset [65] to classify images into one of the corresponding 10 classes. The dataset is composed of 60,000 images (6000/class) divided into six batches. The sixth batch, containing 1000 images from each class, was kept for testing. First, the trained model's weights were extracted, and then pruned using different sparsity levels. The pruned models were retrained using the same dataset to conclude with the best performing pruned model.

In addition to quantization and pruning, the deployment of ML algorithms onto MCUs requires a certain level of optimization for embedding.
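The two compression techniques discussed above can be illustrated on a raw weight list rather than a real network: magnitude pruning zeroes out small weights, and post-training quantization maps float32 weights to 8-bit integer codes via a scale factor. The weights and the symmetric quantization scheme below are illustrative assumptions, not taken from the cited CNN/ResNet-50 experiment.

```python
# Toy weight vector standing in for one layer of a trained model.
weights = [0.42, -0.03, 0.97, 0.005, -0.61, 0.12, -0.88, 0.02]

# --- Pruning: drop connections whose magnitude is below a threshold,
# keeping only the weights that contribute most to the output.
threshold = 0.1
pruned = [w if abs(w) >= threshold else 0.0 for w in weights]
sparsity = pruned.count(0.0) / len(pruned)

# --- Post-training quantization: map float weights to int8 codes with
# a single scale (symmetric scheme, zero-point = 0, for simplicity).
scale = max(abs(w) for w in pruned) / 127
quantized = [round(w / scale) for w in pruned]   # int8 codes for Flash
dequantized = [q * scale for q in quantized]     # values used at inference

max_err = max(abs(w - d) for w, d in zip(pruned, dequantized))

print(sparsity, quantized)
assert all(-128 <= q <= 127 for q in quantized)
assert max_err <= scale / 2    # error bounded by half a quantization step
```

The bounded per-weight error is why accuracy loss from 8-bit quantization is often small, while the 4× storage saving (and any pruning-induced sparsity) directly relieves the Flash and RAM constraints discussed earlier.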
To achieve this, several software tools, libraries, and frameworks are used to facilitate it, some of which include quantization within their process of optimization. The ones that have been used within the context of the reviewed healthcare systems are covered in the following.

…[66]. The key features of this framework are: its optimized ML model for on-device deployment; its compatibility with several platforms, including mobiles (iOS and Android) and microcontrollers (using TFLite for Microcontrollers); and its support for multiple languages, including Python, C++, Objective-C, Java, and Swift. TFLite models are represented in a FlatBuffers format, which provides reduced size and faster inference compared to TensorFlow's protocol buffer format, due to its smaller code footprint and directly accessible data.

Apart from compression of pre-trained ML models for deployment on MCUs, a ready compact algorithm can be used for direct integration on resource-constrained devices. With this notion, Microsoft developed a library of algorithms (EdgeML) that can be used for direct inference on edge devices [71]. The algorithms are written in Python using TensorFlow and PyTorch and provided as C++ implementations. The four algorithms provided by EdgeML are: … The model size is less than 10 KB.

SeeDot and the Embedded Learning Library (ELL) are frameworks provided by Microsoft for deployment of these algorithms to IoT devices [71]. SeeDot provides compilation of the model, and quantization from floating-point to fixed-point representation, so that it runs efficiently.
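The floating- to fixed-point conversion performed by such tools can be sketched as plain Q-format arithmetic. The Q1.14 format (14 fractional bits in a 16-bit word) below is an assumption for illustration, not SeeDot's actual code path.

```python
# Fixed-point sketch: represent reals as scaled integers so the MCU can
# multiply them with integer hardware only (no FPU required).

FRAC_BITS = 14
SCALE = 1 << FRAC_BITS          # 2**14

def to_fixed(x):
    """Encode a float into the assumed Q1.14 representation."""
    return round(x * SCALE)

def fixed_mul(a, b):
    # The product of two Q1.14 numbers carries 28 fractional bits;
    # shift right to return to 14 fractional bits.
    return (a * b) >> FRAC_BITS

a, b = 0.75, -0.5
fa, fb = to_fixed(a), to_fixed(b)

product = fixed_mul(fa, fb) / SCALE   # decode back to float for inspection

print(product)                        # -0.375, matching a * b
assert abs(product - a * b) < 1 / SCALE
```

The error of each operation is bounded by the resolution 2⁻¹⁴, which is the accuracy/efficiency trade-off such compilers manage when choosing the number of fractional bits per layer.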

Deployment of trained models onto RISC-V based MCUs of the GAP family requires a specific set of neural network tools provided by GAPflow [72]. GAPflow is composed of multiple tools that can automatically take in a trained NN model and produce an MCU-compatible algorithm for deployment and on-board inference. The tools used are NNTool, AutoTiler, and GCC. NNTool translates a TFLite model (quantized or unquantized) into an ''AutoTiler Model'', which is a .c file describing the NN topology and the quantization policy of the different NN layers. The main tool in GAPflow is the AutoTiler, which is responsible for optimizing the execution of convolutional layers and minimizing access to memory, as well as the use of parallel convolutional kernels that leverage the available multi-core cluster of the GAP8 MCU. The optimized model files produced by the AutoTiler tool, defining the application code and the memory allocation of constant parameters, are then compiled using GCC. Finally, the GAP8 executable file and flash file are used to run inference on GAP8 [72].

The choice of a software tool for optimum deployment is not limited to one library or framework. It is possible to use a combination of compatible tools and libraries. An example using both TFLite and X-CUBE-AI for the implementation of a ''Hello World'' ML model on a NUCLEO-F74ZG board was presented in [63]. The CNN model for the recognition of handwritten numbers (MNIST dataset [81] of 70,000 samples for the digits 0-9) was initially trained (60,000 samples for training and 10,000 for testing) in TensorFlow using Keras, having a size of 7.17 MB. It was then converted into a lighter version (2.4 MB) using TFLite and the TFLite Converter tool.
Further compression using the X-CUBE-AI STM32 package applied multiple actions for model reduction, including compression of the weights of the fully connected layers, fusion of layers (combining two layers into one), and optimization of the activation functions, leading to a model size reduction by a factor of 4 (668.97 KB).
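The layer-fusion idea mentioned above can be shown with the simplest case: two adjacent linear layers with no activation between them collapse into a single layer, since W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). The matrices below are illustrative; tools such as X-CUBE-AI apply the same algebra to fusions like convolution plus batch normalization rather than to this toy example.

```python
# Minimal dense-layer fusion sketch with hand-rolled linear algebra.

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def vadd(a, b):
    return [u + v for u, v in zip(a, b)]

# Two illustrative layers (weights chosen so arithmetic is exact).
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [1.0, 2.0]
W2, b2 = [[2.0, 0.0], [1.0, 1.0]], [0.0, -3.0]
x = [3.0, -2.0]

# Unfused: two matrix-vector products at inference time.
y_unfused = vadd(matvec(W2, vadd(matvec(W1, x), b1)), b2)

# Fused offline: one weight matrix and one bias, half the layer count.
Wf = matmul(W2, W1)
bf = vadd(matvec(W2, b1), b2)
y_fused = vadd(matvec(Wf, x), bf)

assert y_unfused == y_fused
print(y_fused)
```

The fused model computes the identical function with fewer stored tensors and fewer memory accesses per inference, which is exactly why fusion contributes to the size and latency reductions reported above.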

The use of ML in the context of health and care applications has so far been mostly dependent on cloud and fog computation. The ability to perform edge inference on MCUs has only recently become feasible due to technological advances allowing the deployment of compressed ML algorithms with minimum loss in model performance. In this work, various proof-of-concept TinyML systems are reviewed, focusing on applications in health and care, including medical use, ambient assisted living, and physical health for rehabilitation and fitness tracking. Related information for each system is summarized in Tables 2-5. System components and prototype placement (if available) are summarized in Table 2, while the datasets used in training and testing of the ML algorithms are given in Table 3, with the corresponding ML architectures in Table 4. Finally, a summary of the embedded ML implementations, including the accuracy of running the algorithms on board, the occupied memory, and the time and power consumption per single inference, is tabulated in Table 5. Most of the reviewed works reported only the accuracy of the embedded algorithm.

…The system required 747 ms for feature extraction before… (9.91 GMAC/s/W), 23× higher than the STM32 using X-CUBE-AI with TFLite, as well as 46.8× faster inference (2.7 ms).

Edge inference was also proposed in [59] in the context of a wearable artificial pancreas system for patients with Type 1 Diabetes (T1D). The system's input readings, acquired from a continuous glucose monitoring (CGM) sensor, were used to predict blood glucose through the use of an RNN model based on LSTM layers. The LSTM-RNN regression model was trained with TensorFlow and Keras using the ''OhioT1DM dataset'' [98], having a single LSTM layer and two dense layers with ReLU activation before the output layer. The final trained model was then deployed on an ARM Cortex-M4F STM32F303RE MCU for edge inference, occupying 34.69 KB of flash memory and 1 KB of RAM. The deployment of the model onto the STM32 MCU made use of the available X-CUBE-AI library for an 8-bit fixed-point representation. As a regression problem, the performance was evaluated by calculating the root mean square error (RMSE) and mean absolute error (MAE) at two prediction horizons (PH): 30-min PH (MAE = 13.59 ± 1.47 mg/dL and RMSE = 19.10 ± 2.04 mg/dL) and 60-min PH (MAE = 24.25 ± 2.8 mg/dL and RMSE = 32.61 ± 3.45 mg/dL). The difference between inference on the edge and on a local server was reported to be 0.0029 mg/dL and 0.0025 mg/dL for RMSE and MAE respectively. In relation to power consumption, individual blood glucose level predictions were made within 22.2 ms for every new CGM reading (every 5 minutes), with the system being in sleep mode otherwise. This led to an ultra-low average power of 8 µW for running the algorithm on the MCU. The authors did not present a final device prototype. However, they demonstrated the feasibility of using edge inference in predicting blood glucose levels for future incorporation with CGM and insulin pumps in a diabetes care device.
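A back-of-envelope check shows why this duty-cycling scheme collapses average power toward the sleep floor: the MCU infers for only 22.2 ms out of every 5-minute CGM period. The active and sleep power figures below are assumptions for illustration; only the timing figures come from [59].

```python
# Duty-cycled average power sketch for the glucose-prediction system.

T_ACTIVE = 0.0222          # s per inference (reported in [59])
PERIOD = 5 * 60            # s between CGM readings (reported in [59])

P_ACTIVE = 50e-3           # W while inferring (assumed, illustrative)
P_SLEEP = 2e-6             # W in sleep mode  (assumed, illustrative)

duty = T_ACTIVE / PERIOD
p_avg = P_ACTIVE * duty + P_SLEEP * (1 - duty)

print(f"duty cycle = {duty:.2e}, average power = {p_avg * 1e6:.1f} uW")
assert duty < 1e-4         # the MCU is active less than 0.01% of the time
assert p_avg < 10e-6       # micro-watt regime, consistent with [59]
```

Under these assumptions the average lands in the single-digit microwatt range, consistent with the 8 µW figure reported above, and illustrates why infrequent inference plus aggressive sleep modes dominates MCU energy budgets.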

An epileptic seizure detection wearable system was proposed in [60]. Using only two EEG channels for signal acquisition (F7-T7, F8-T8), 54 features were extracted, and data fusion was then used to enhance the variability of the data. The publicly available dataset ''CHB-MIT'' [99] was used for the training and testing of a subject-based RF ML model. The implemented system used an ADS1299 EEG AFE for data conditioning, and the trained ML model was deployed on an STM32L476 MCU for on-board inference. An inference was made every 4 seconds (the length of an EEG epoch), and the algorithm for processing and classifying the readings required 27.9 ms per EEG epoch for both channels, consuming 7.34 mA to run the algorithm; the power consumption was not reported, nor were the energy or operating voltage. Operating…

The work in [47] presented another implementation of a seizure detection algorithm on an MCU for use in future… Table 5 …achieving an average power consumption of 12.1 mW per inference.

In addition to the above-mentioned works focusing on systems in the context of physical health, work has also been reported in the context of mental health. A wearable smart bracelet, named InfiniWolf, for stress detection was presented in [46]. InfiniWolf (the size of a coin) was assembled to contain two MCUs, the Nordic nRF52832 and Mr. Wolf, a pressure sensor, a 9-axis motion sensor, a microphone, an ECG/EMG and bioimpedance analog front-end (AFE) (Maxim MAX30001), a low power galvanic skin response (GSR) front-end, and a 120 mAh LiPo battery. The bracelet also provided two energy harvesting sources, based on thermal and solar energy, which eliminated the need for recharging of the LiPo battery. The system was tested in an application for stress detection based on collected readings from ECG and GSR, implementing a multi-layer perceptron (MLP) using the fast artificial neural network (FANN) library [76]. The model was given five features as input to classify into one of three output classes (stress, medium stress, no stress)…

…4, 8, 16, 32). The use of accelerometer readings alone showed better performance than the gyroscope alone, while the use of both provided a noticeable increase in accuracy. The number of LSTM layers (1, 2) had very little influence on the results. Therefore, a single layer was chosen, taking into consideration memory needs. As for the cell size, both 16 and 32 presented nearly similar results for best performance. Therefore, both inputs were used and an RNN with a single LSTM layer and 16 cells was trained in TensorFlow. To minimize the effect of imbalanced classes, the authors used a weighted cross-entropy loss function, where each window contributing to the gradient is weighted by the inverse of its class size in the training set.
For the embedded design, the SensorTile was also chosen, with an ARM Cortex-M4F MCU, and the RNN model was deployed using the CMSIS library without quantization, keeping its 32-bit floating-point format. Hence, the classification performance of the model was very close to that of the workstation, with a reported mean squared numerical error on the order of 10⁻⁷. The embedded model achieved an average accuracy of 93.52% across the three classes, occupying less than 18.5 KB of memory and requiring a processing time of 51 ms/window. It was also reported that a wearable device with a minimal architecture is estimated to run for 132 hours using a 100 mAh battery. Moreover, an expected decrease in memory usage of 2.5 KB and of 1 ms in inference time was reported to be possible if only accelerometer readings were used, translating into a 4 hour increase in battery life.
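The inverse-class-frequency weighting described above can be sketched directly: each sample's cross-entropy term is scaled by one over its class's count in the training set, so minority-class windows are not drowned out by the majority class. The label counts and predicted probabilities below are toy numbers, not taken from the reviewed work.

```python
import math

# Toy imbalanced training set: 6 samples of class 0, 2 of class 1, 1 of class 2.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2]
counts = {c: labels.count(c) for c in set(labels)}
weights = {c: 1 / n for c, n in counts.items()}   # inverse class size

def weighted_cross_entropy(probs, targets):
    """probs[i] is the model's predicted probability of the true class."""
    terms = [-weights[t] * math.log(p) for p, t in zip(probs, targets)]
    return sum(terms) / len(terms)

# With identical per-sample confidence, the lone class-2 sample
# contributes 6x the loss of each majority-class sample.
probs = [0.8] * len(labels)
loss = weighted_cross_entropy(probs, labels)

assert abs(weights[2] / weights[0] - 6) < 1e-9
print(loss)
```

The gradient scaling follows the loss scaling, so during training the rare class pulls on the parameters as strongly as the common one, which is the intended effect of the weighting.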

Another application of embedded ML in healthcare was demonstrated in fitness tracking. In [86], a wrist-worn fitness tracker was designed to monitor the type of exercise and the number of its repetitions. Three exercises were targeted (squat, curl, and push-up). The wearable was designed using the STMicroelectronics SensorTile module, which incorporates the STM32L476 MCU, Bluetooth low energy (BLE), the LSM6DSM (tri-axial accelerometer and tri-axial gyroscope), and other, unused, sensors. The device was used to collect real-time data for use in training and testing of an ML model. As summarized in Table 3, data was collected from 15 subjects at a sampling frequency of 20 Hz. The collected data was classified into four classes, the fourth being not an exercise (NAE), for resting or the time between exercises. The NAE class was used as an indication of the start/end of an exercise, which helped in counting the number of repetitions. A window size of 3.4 s (68 samples) was used as input, with 6 readings (3 accelerometer, 3 gyroscope). A total of 700 repetitions were collected and split into 80% and 20% for training and testing. The trained CNN was tested in Keras at first (50 cases), reporting 100% accuracy. Then, using X-CUBE-AI, the model was translated into an MCU-compatible format for another round of testing. Testing on the MCU was done using a total of 70 repetitions, which were all correctly classified. However, the testing set was imbalanced, having 6 curls, 15 push-ups, 13 squats, and 36 NAE. Having more NAE events was expected, as resting between each exercise was classified as NAE. The model on the MCU was also compressed…

VOLUME 10, 2022

…of embedding are reported in tables.
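The window arithmetic above (20 Hz sampling, a 3.4 s window, hence 68 samples per inference input) can be sketched as a simple segmentation routine. The stride, the overlap, and the synthetic signal below are assumptions for illustration; only the sampling rate and window length come from the reviewed work.

```python
# Sliding-window segmentation sketch for a single sensor channel.

FS = 20                 # Hz (reported)
WINDOW_S = 3.4          # s  (reported)
WIN = int(WINDOW_S * FS)

def windows(samples, win, stride):
    """Split a 1-D sample stream into fixed-length (possibly overlapping) windows."""
    return [samples[i:i + win] for i in range(0, len(samples) - win + 1, stride)]

signal = list(range(20 * FS))        # 20 s of dummy single-channel data

segs = windows(signal, WIN, stride=WIN // 2)   # 50% overlap (assumed)

print(WIN, len(segs))
assert WIN == 68
assert all(len(s) == WIN for s in segs)
```

Each window (times the 6 sensor channels) forms one model input, so the window length directly sets both the input-buffer RAM requirement and the latency between successive classifications.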
…the EEGNet model was not as significant as the improvement noticed in the CNN model. Therefore, the authors chose to proceed with the implementation using their global model. The model, trained with Keras and TensorFlow, was deployed in its 32-bit floating-point representation onto 2 STM32 MCU boards using the X-CUBE-AI package. These MCUs have an ARM Cortex-M processor; the first a Cortex-M4F (STM32L475VG) for lower power operation, while the other has a Cortex-M7 core (STM32F756 Nucleo-144) for higher processing capability. Due to memory constraints, the classification model was reduced in size by investigating the effect of temporal downsampling, a reduced input window, and reduced channel numbers on model accuracy. Downsampling factors (ds) of 2 and 3 reported a maximum decrease in accuracy of 0.32% and 1.25% respectively. Channel reduction from the original 64 channels to 32, 19, and 8 channels was examined, reporting a decrease in accuracy of 0.95%, 2.66% and 6.52% respectively. As for the window size, 2 s and 1 s windows resulted in accuracy reductions of 1.62% and 3.6% respectively. Two final models were deployed on the two MCUs. A downsampling factor of 3 and an input of 32 EEG channels were used in both models. However, the input window for the first model was 1 s (deployed on both MCUs), while the second model used a 2 s window and was deployed only on the Cortex-M7 MCU, due to the limited memory of the Cortex-M4F MCU. The trade-off between speed and power consumption was clearly demonstrated in this work, where the choice of MCU depends on the system's intended use, which would result in different design criteria prioritizing either speed or power.
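The arithmetic behind this shrinking study is that the input tensor scales with channels × window length × (sampling rate / ds), so each reduction multiplies the memory footprint down. The 160 Hz base rate below is an assumption (typical of 64-channel EEG datasets used in such work), so the numbers are illustrative rather than those of the reviewed system.

```python
# Input-size arithmetic for the EEG model reductions described above.

BASE_RATE = 160            # Hz (assumed, illustrative)

def input_elements(channels, window_s, ds):
    """Number of input samples after channel, window, and downsampling cuts."""
    return channels * int(window_s * BASE_RATE / ds)

full = input_elements(64, 2, 1)       # original configuration
model_a = input_elements(32, 1, 3)    # 1 s window: deployed on both MCUs
model_b = input_elements(32, 2, 3)    # 2 s window: Cortex-M7 only

print(full, model_a, model_b)
assert model_b == 2 * model_a         # doubling the window doubles the input
assert full > 12 * model_a            # combined cuts: >12x fewer input elements
```

This is why only the 1 s-window model fit the Cortex-M4F board: the 2 s window doubles the input buffer (and the activation RAM downstream of it), pushing the model past the smaller MCU's memory.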

Another proposed system focused on human-machine interaction for application in prosthetic control was presented in [89]. The wearable system was composed of an eight-electrode surface-EMG (s-EMG) AFE (ADS1298) placed around the forearm, and a GAP8 MCU for hand gesture recognition. A novel deep learning model based on a TCN topology, the Temporal Embedded Muscular Processing Online Network (TEMPONet), was trained and tested on two datasets. The first was a publicly available dataset, the Non-Invasive Adaptive hand Prosthetics Database 6 (NinaPro DB6) [106]. The second was their own 20-session dataset, collected using the wearable prototype. Relevant information on both datasets is summarized in Table 3. An incremental training protocol was used, splitting the first half of the datasets for training and the second for testing, with 2-fold stratified cross validation. The resulting average accuracies of the testing sessions are reported in Table 5. It was noticed that the performance of TEMPONet on the NinaPro DB6 was much lower than on the 20-session dataset. This was due to the fact that the NinaPro dataset differentiates between different grasps, while the 20-session dataset differentiates between different gestures. Nevertheless, it was reported that the performance of TEMPONet on NinaPro DB6 presented an increase in accuracy of 7.8% compared to state-of-the-art models. The trained TEMPONet was quantized to an 8-bit representation after training, using the PULP-NN library, for deployment on GAP8. An accuracy loss of 0.4% and 4.2% was reported due to quantization in comparison to the…

…so far. However, one commercially available system that has successfully implemented its edge computation using a Cortex-M4 MCU is the Amiko Respiro [113]. Although the device is not wearable, it is worth mentioning.
The Amiko Respiro uses sensors embedded with ML algorithms for smart monitoring of the use of inhalers. It is an add-on sensor which fits multiple commercially available inhalers. The smart sensor collects data related to vibrations from the inhaler during use, information related to the patient's inhalation frequency and time, and the breathing pattern, providing real-time feedback. Furthermore, information related to lung capacity, inhalation technique, and other important parameters is calculated on-device using the embedded ML algorithm. Moreover, the sensor comes with a smartphone app that patients can use to monitor their use of inhalers, and a professional dashboard for clinicians is also available, if required, for remote monitoring of the patient's use.

The TinyML systems covered in this review have been designed for a variety of health and care applications. As shown in Fig. 4, some targeted the detection and management of medical conditions, while others focused on elderly fall detection, as well as fitness tracking and prosthetics for rehabilitation. This demonstrates the capability of utilizing TinyML for a wide range of intended uses. However, due to this difference in target application, it is difficult and unfair to compare the systems against each other. Each application targeted a different problem, be it classification or regression, requiring the use of different algorithms, architectures, hyperparameters, and datasets, which resulted in application-dependent algorithms. This defined the performance of the base model, which then went through post-processing before deployment on the MCU. Hence, the model's performance was initially influenced by the original floating-point base model before deployment.

In the context of software tools and frameworks, the reviewed tools can be categorized based on their MCU compatibility, as shown in Fig. 5. …It was also noticed that the target of most of these tools/frameworks is the compression of neural network based architectures, while classical machine learning algorithms were not supported, resulting in them not being used when classical ML algorithms were deployed on MCUs.

In terms of targeted MCUs, their percentage usage in TinyML health and care systems, based on the core processor, is shown in Fig. 6. More than half of the reviewed systems used ARM based MCUs, followed by RISC-V MCUs, and only a tenth used AVR MCUs. 8-bit MCUs are more suitable for use in simpler operations, as opposed to 32-bit MCUs, which can…

…were optimized especially for FPGA use, while the M0+ is an optimized version of the M0, providing better performance and lower energy consumption, as well as an optional Memory Protection Unit (MPU) for task isolation. The M7, similar to the M4, has an optional floating-point unit (FPU) and a digital signal processing (DSP) unit, but its performance is superior, with the use of a 6-stage instruction pipeline and the optional addition of instruction and data cache/TCM [114]. The M3 has a very similar structure to the M4, but without the FPU and DSP extensions. In cases where neither the FPU nor the DSP units are needed, the M3 processor could be used instead of the M4, as it occupies less area and consumes less dynamic power.

The M23, with the Armv8-M Baseline ISA, and the M33, M35P, and M55, with the Armv8-M Mainline ISA, are advanced Cortex-M processors that provide an optional security extension for software isolation, the ARM TrustZone technology, which can be beneficial in TinyML-based healthcare applications where security is key. In addition to this software protection layer, the M35P further provides built-in physical protection against invasive and non-invasive physical attacks. Amongst the four processors, the M23 provides neither an optional FPU nor a DSP extension, making it the smallest and lowest-power option with TrustZone security. Similar to the M7, the M55 further provides optional instruction and data cache/TCM, making it ''Arm's most AI-capable Cortex-M processor'' [114], [115]. With this wide range of options, it is clear that, depending on the specific intended use as well as performance and safety constraints, different processors might be the most suitable design choice.

Apart from the MCU core processor, the memory limitations and clock speed of the MCU also played important roles in the overall system performance. One common goal pursued by all researchers was to maintain a balance between on-board performance, in terms of time/inference and power/inference, and memory constraints. The memory footprint was determined by the algorithm and its architecture, as well as the post-processing techniques applied through the software tools. The effect of doubling the input window size was reflected in an approximate doubling of the RAM occupancy and the time/inference, with a slight increase in Flash (0.51 KB) and an accuracy improvement of 2.25% [88]. The purpose of the algorithm also affected the memory footprint and the overall performance. This was observed in the implementation of the same algorithm (TEMPONet) for a 9-class classification problem versus a regression problem, where the classification problem was more complex, requiring ×6.5 the Flash memory and ×2.7 more time/inference despite running at a ×1.7 higher clock speed than the regression problem.

The following challenge would be the compression of the model whilst keeping a balance between accuracy loss and required memory footprint. This is conditioned by the post-processing techniques and software tools used, which are in turn constrained by the choice of MCU. Hence, the choice of MCU and software tools is a co-dependent process. Choosing the most suitable MCU for an application is, on its own, a non-trivial decision, considering the number of choices available. In order to find a balance between performance, memory, speed, and power, a list of prioritized specifications needs to be set first.
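The window-size observation from [88] can be reasoned about with a back-of-the-envelope model: for a 1-D convolutional layer, the weights stored in Flash do not depend on the window length, while the activation buffers held in RAM scale linearly with it. The sketch below is our own illustrative estimate (hypothetical layer sizes, int8 values, 'same' padding, input and output buffers held simultaneously), not an analysis of any reviewed system:

```python
# Hypothetical back-of-the-envelope footprint estimate for one 1-D
# convolutional layer deployed with int8 values on an MCU.

def conv1d_footprint(window, channels, kernel=3, filters=16,
                     bytes_per_value=1):
    """Estimate (flash_bytes, ram_bytes) for a single 1-D conv layer.

    Flash holds the weights and biases, whose count depends only on
    kernel size, input channels, and filter count -- not on the window.
    RAM holds the input and output activation buffers, whose sizes are
    proportional to the window length ('same' padding assumed)."""
    flash = (kernel * channels * filters + filters) * bytes_per_value
    ram = (window * channels + window * filters) * bytes_per_value
    return flash, ram
```

Doubling the window in this model leaves Flash unchanged and exactly doubles RAM, which mirrors the reported behavior (near-constant Flash, doubled RAM occupancy); in a full network the small Flash growth would come from window-dependent layers such as the final dense stage.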

As TinyML is still a growing field, constant developments and advances are surfacing to address these challenges and provide an easier path for the implementation of TinyML systems. One of these advances is the TinyML development platform, which provides a bridge between the development and training of the ML algorithm and its deployment on embedded systems. These platforms incorporate some of the libraries and tools previously reviewed, such as TFLite and CMSIS-NN, and provide compatibility with multiple commercially available boards, allowing designers to collect data, train, test, optimize, and deploy their model on different MCUs from a single place. Some of these newly developed platforms are Edge Impulse, Imagimob, Qeexo, NanoEdge AI, OctoML Apache TVM, and others. These platforms are worth exploring, as they provide a smooth starting point for beginners.

ESTHER RODRIGUEZ-VILLEGAS is currently a Professor (Chair) of low-power electronics at Imperial College London, originally known for her engineering techniques to drastically reduce power in integrated circuits. She subsequently focused her research on life-science applications, founding the Wearable Technologies Laboratory. This laboratory specializes in both creating innovative wearable medical technologies to improve the management and diagnosis of chronic diseases, and neural interfaces to facilitate brain research whilst improving animals' welfare. She is also a Founder and Co-CEO/CSO of two active life-sciences companies, Acurable and TainiTec. She was elected a Fellow of the U.K. Royal Academy of Engineering in 2020. She has received many international recognitions and awards, including an IET Innovation Award in