A Machine Learning-oriented Survey on Tiny Machine Learning

The emergence of Tiny Machine Learning (TinyML) has positively revolutionized the field of Artificial Intelligence by promoting the joint design of resource-constrained IoT hardware devices and their learning-based software architectures. TinyML carries an essential role within the fourth and fifth industrial revolutions in helping societies, economies, and individuals employ effective AI-infused computing technologies (e.g., smart cities, automotive, and medical robotics). Given its multidisciplinary nature, the field of TinyML has been approached from many different angles: this comprehensive survey wishes to provide an up-to-date overview focused on all the learning algorithms within TinyML-based solutions. The survey is based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodological flow, allowing for a systematic and complete literature survey. In particular, firstly we will examine the three different workflows for implementing a TinyML-based system, i.e., ML-oriented, HW-oriented, and co-design. Secondly, we propose a taxonomy that covers the learning panorama under the TinyML lens, examining in detail the different families of model optimization and design, as well as the state-of-the-art learning techniques. Thirdly, this survey will present the distinct features of hardware devices and software tools that represent the current state-of-the-art for TinyML intelligent edge applications. Finally, we discuss the challenges and future directions.


I. INTRODUCTION
A prodigious amount of research has been invested over the past decades in improving embedded technologies in order to enable the use of real-time solutions for many complex and safety-critical applications [1]. In this regard, hardware-specific (e.g., Edge TPUs) and Micro-Controller Unit (MCU)-based embedded systems have earned a lot of attention, primarily due to their low power requirements and high performance and, secondarily, for their maintainability, adaptability, and reliability [2]. Their integration with sensors enables the perception of the external world, their connection with actuators allows different kinds of interventions, and their interconnection unlocks distributed intelligence.
When we talk about the intelligence of onboard embedded technologies, we mean the learning algorithms that allow devices to make reasoned decisions based on acquired data. Unfortunately, Machine Learning (ML) on tiny devices is substantially hard, due to severe architectural, energetic, and latency constraints [17]: the available memory averages a few kilobytes, the accessible power is in the order of milliwatts, and often real-time responses must be guaranteed [18], as in safety-critical systems like health care devices, autonomous driving, or human-robot collaboration in industrial environments, where delayed decisions may have disastrous consequences, ranging from compromised patient well-being and increased road safety hazards, to operational disruptions. A glance at the latest hardware developed for TinyML reveals a notable trend: recent advances are focused on the goal of minimizing power consumption. This means that the main focus right now is making sure that ML can run on devices with limited resources.
From these premises, since 2018, the notion of Tiny Machine Learning (TinyML) has begun to take shape with the following acknowledged definition: TinyML is a paradigm that facilitates running machine learning at the edge devices with minimal processor and memory requirements; hence, the power consumption of such systems is expected to be within a few milliwatts or less [19]. The challenges for TinyML practitioners are formidable: e.g., in modern neural networks, among the best currently available technologies, the number of required parameters has skyrocketed to the order of billions [20], with larger networks having better results and wider applicability. Unfortunately, the energy required to run these networks is proportional to their size, making this trend of scaling up neural networks energetically unsustainable at large scales [21]: another reason why TinyML has to be considered a necessary, and not merely promising, research direction. Recent market trends (see Figure 1) confirm this rationale: priority has been given to deploying hardware that is less power-hungry, thereby constraining the complexity of the implemented learning algorithms: truly, TinyML has to be tinier.
When it comes to developing a TinyML solution, there are two main classical workflows, namely ML-oriented and HW-oriented, and a third more recent approach, co-design. The classical workflows are widely adopted and separate the ML framework design from its hardware incarnation [1], [22]. In the first approach, ML experts create, train, and test a suitable model for the problem domain, optimize its parameters, and then deploy this solution on a satisfactory device. In the second one, the hardware platform is not prearranged, and development aims to produce optimized hardware by employing specially reduced models and techniques.
The novel workflow is named co-design because ML experts and hardware engineers are involved together from the start in the design of the solution and actively exchange operational knowledge [23]. Hardware engineers approach the mathematical notions underlying the ML algorithms and propose suitable hardware components for efficient translations. ML researchers meanwhile examine the cutting-edge resources they can exploit, and potentially re-design their algorithms to provide a seamless integration of hardware and software, where the form and the content are malleable and shape each other. Here is where shape and content are mixed together, with a blended recipe that constitutes the state-of-the-art of contemporary TinyML. Specifically, in Section IV, we will detail each one of these workflows.

A. MOTIVATION AND CONTRIBUTIONS
We offer two contributions with this survey: we first provide an up-to-date overview of the rapidly evolving state-of-the-art in the field of TinyML. Since the number of research articles published on TinyML is increasing exponentially (see Figure 2), the number of surveys and papers on the subject is following suit. Specifically, we catalog the literature up to September 2023.
As a further and unique contribution, this survey emphasizes the ML point of view, not only reporting the very latest in TinyML frameworks but also suggesting recent variations and advancements in the ML technologies that a TinyML practitioner may want to explore to improve on the state-of-the-art: e.g., topics like meta-learning [24], Rational Activation Functions (RAFs) [25], and Versatile Learned Optimizers (VeLO) [26]. In this regard, we provide some insights into these ML methodologies and we additionally address the most recent developments as potential TinyML future breakthroughs.
FIGURE 2. The number of papers on TinyML published so far, and the surveys on the topic. As evident, there is an exponential growth in the number of research papers, and it is worth noting that our survey not only stands as the most recent but also uniquely concentrates on the ML perspective, distinguishing it from all other existing surveys in the field.

B. ARTICLE ORGANIZATION
The survey is organized as follows. Section II provides an extensive overview of the existing surveys on TinyML, indicating the differences with this work, while Section III describes the article selection criteria used in creating this systematic review. Section IV clarifies what it means to design a TinyML solution in terms of workflow, as we briefly explained in the previous subsection. Section V is the core of the survey and presents the collection of algorithms and techniques to enable efficient ML on tiny devices. Section VI reports several hardware specifications, libraries, and software platforms that are state-of-the-art for TinyML application development. Section VII summarizes the overview and proposes potential future directions. Finally, Section VIII concludes the paper.

A. HARDWARE PERSPECTIVE
From the point of view of hardware technologies, we can distinguish TinyML solutions based on Application-Specific Integrated Circuits (ASICs), MCUs, and Field Programmable Gate Arrays (FPGAs), in increasing order of power consumption (see Figure 1). Specific surveys exist for each one of these technologies.
ASIC is the focus of [42]. The RISC-V, i.e., the fifth generation of the Berkeley Reduced Instruction Set Computer (RISC) architecture, has been widely adopted by many researchers and commercial users, with several openly available implementations to choose from. Selecting the appropriate combination of RISC-V processor cores, architectures, configurations, and ML software frameworks is not trivial. In order to facilitate this process, the survey discusses the various RISC-V-based hardware implementations, in terms of available cores and System-On-Chip (SoC), in conjunction with the software frameworks and software stacks for the SoC generation. It includes a review of the latest released frameworks supporting open hardware integration for ML applications.
MCU and TinyML are the subjects of [28], [33], [34]. In [28], the authors analyze the TinyML frameworks for integrating ML algorithms within MCUs and present a real-world case study. They first give a small overview of the ecosystem of applications in which TinyML techniques can be applied and then highlight the opportunities in various sectors currently undergoing a digital transformation. Finally, they propose a Multi-Radio Access Network (Multi-RAT) architecture for smart frugal objects: i.e., sporadically messaging interconnected devices with constrained resources.
In [33], the authors focus on surveying, comparing, and evaluating seven different recent and popular MCUs on a face recognition task based on a Convolutional Neural Network (CNN) workload. Their evaluation considers four key metrics (i.e., power efficiency, energy per inference, inference efficiency, and inference time) that can be used to benchmark ML applications on MCU-based devices.
FPGA is considered in [41]. The authors present ''CFU Playground'', a full-stack open-source framework that enables the rapid and iterative design of ML accelerators for embedded ML systems through custom function units (CFU), i.e., hardware that augments the standard functions of a CPU. This toolchain integrates open-source software, register transfer level generators, and FPGA tools for synthesis, place, and route. To illustrate their approach, they apply their methodology to two common TinyML use cases: image classification and keyword spotting. In the first case, they show how to obtain iterative hardware-software improvements with ease, and, in the second, how to co-optimize the CPU and the CFU together in severely resource-constrained environments.

B. APPLICATION PERSPECTIVE
In relation to applied TinyML, we review surveys in the fields of IoT, environmental challenges, predictive maintenance (PdM), anomaly detection, and healthcare.
In [1], the authors provide background information on the benefits that TinyML can offer to the IoT panorama, such as low latency, effective bandwidth utilization, strengthened data safety, and enhanced privacy. Then, they show how to implement TinyML-as-a-service, i.e., an IoT device that concretely takes part in the execution of intelligent services. In [35], the authors explore the integration of TinyML with network technologies such as 5G and LPWAN. Ultimately, we anticipate that this analysis will serve as an informational pillar for the IoT/cloud research community and pave the way for future studies.
Of particular interest in recent years, TinyML has been applied to environmental challenges [38] such as global warming, climate change, natural resource scarcity, and pollution monitoring. With its ability to deploy intelligent analysis together with sensing devices, TinyML provides the natural evolution of data gathering in the environmental domain to protect our societies and the natural world. This survey elaborates on the role of TinyML devices and their limits in this context.
In [32], the authors investigate techniques used to optimize TinyML-based PdM systems. They describe PdM, and how TinyML can provide an alternative to cloud-based PdM, showing commonly used libraries, hardware, datasets, and models. Furthermore, they show known techniques for optimizing TinyML models.
Anomaly detection is the detection of unexpected patterns in the data. In [29], the authors highlight the current state-of-the-art works on TinyML for anomaly detection, providing suggestions on the research direction, and introducing potential future endeavors.
An essay on TinyML approaches for healthcare is presented in [30]. The authors collect references related to i) the selection of patients for investigation, monitoring, and protocol adherence, ii) the collection, processing, analysis, and management of data, and iii) drug validation trials, followed by the solutions they bring, especially using wearable devices.
Finally, in [43], the authors present an overview of many TinyML applications and related research efforts. Specifically, the survey builds a taxonomy of TinyML techniques that have been used so far to bring new solutions to various domains, such as healthcare, smart farming, environment, and anomaly detection.

C. FIELD VIEWPOINT
In [27], the authors discuss the challenges and directions toward developing a fair and useful TinyML benchmarking suite. The group has selected four use cases to target: audio wake words, visual wake words, image classification, and anomaly detection. For each of the use cases, reference datasets and baselines were also selected. The benchmarking suite provides results in terms of the accuracy of the model, inference latency, and energy consumption. Notably, this is one of the few literature reviews that presents datasets that can be particularly useful for the benchmarking and design of TinyML systems.
In [31] and [37], the authors present the background of TinyML, list the tools for supporting TinyML, and describe the key enablers (e.g., model compression and quantization) for the improvement of TinyML systems. However, neither of them focuses on advanced learning aspects for TinyML, nor on established HW-SW co-design support to enhance TinyML systems, making our survey clearly distinct from these two.
In [36], the authors aggregate the key challenges reported by TinyML developers and identify state-of-the-art Software Engineering (SE) approaches that can help address key challenges in TinyML-based IoT embedded vision. Examples of these challenges include the lack of curated datasets derived from IoT-embedded vision sensors, the application portability across different devices and different vendors, and the compiler choices, since embracing sophisticated compilers can help optimize for specific MCU targets. However, this affects portability, and hence, it challenges large-scale deployment under availability constraints.
In [39], the authors focus on edge training and edge inference. This paper provides a survey of existing architectures, technologies, frameworks, and implementations in these two areas, and discusses existing challenges, possible solutions, and future directions.
Finally, in [40], a review of deployment techniques for TinyML devices is provided, together with numerical insights showing which deployment workflow is more promising given the constraints of the models and input data (e.g., sparsity, compression, etc.). They inspect the engineering of reducing the computation and memory footprint for the inference of already existing models. Furthermore, they set up some case studies to present the deployment of several famous models with and without various techniques (such as compression and feature projection), with numerical results to highlight the benefits of each technique.
Most of the works previously listed assume that TinyML can only run inference on data. Despite this, growing interest in TinyML has led to work that makes TinyML systems reformable, i.e., work that permits them to learn from new data points once deployed. This originates from the need to combat model drift, i.e., the inevitable degradation of a model's performance due to the ever-changing nature of data. In [22], the authors provide a survey on reformable TinyML solutions.
It should be noted that the majority of these works are not conducted based on a well-defined and widely known Systematic Literature Review (SLR), e.g., using Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) [44], except [36] and [43].

III. SELECTION CRITERIA
This section describes the selection criteria of this systematic review, i.e., how the papers that were considered were selected. Only publications in the English language were considered, and all studies had to be published in peer-reviewed journals or conference proceedings (except for the TinyML software tools presented in Section VI-E). The search strategy and selection criteria were developed in consultation with all authors. Any disagreements between authors were resolved through discussion and consensus. To gather up-to-date knowledge from a broad spectrum of information sources, this comprehensive ML-oriented survey on TinyML was conducted following a widely known SLR methodology based on the PRISMA guidelines, the gold standard for improving transparency, accuracy, and completeness in documented systematic reviews and meta-analyses.
The included studies were extracted from the following five databases: Web of Science, Scopus, IEEE Xplore, ScienceDirect, and Google Scholar, from January 2018 to September 2023. All searches included the following terms: ''TinyML'', ''efficient machine deep learning'', ''neural network optimization'', ''iot machine deep learning'', ''embedded machine deep learning'', ''edge machine deep learning'' and ''mcu machine deep learning''. Therefore, all the cited papers in this work were found using the above keyword combination.
Our keywords produced a total of 4233 records. Figure 3 illustrates the PRISMA flowchart, which serves as a transparent and replicable means of reporting the systematic review's search and selection process. First, we removed all duplicate papers (3413 excluded). Next, we excluded all the papers marked as ineligible by the automation tool (271 excluded), and not accessible papers (e.g., requiring paid access) (549 excluded). After the title and abstract screening process, a total of 189 articles were selected (3341 excluded). The number of records not found is 17. As a result, 172 were eligible.
Finally, out of the 172 reviewed papers, none of them were found to be survey papers on ML-oriented techniques for TinyML. As a result, we claim that this is the first systematic review to address this topic.

IV. TINYML WORKFLOWS
In this section, we present an overview of how TinyML-based systems are built. The two intrinsic ingredients of such systems are the ML model and the hardware platform, therefore the natural approach for developers in the field is to start working from the most familiar component. A more efficient, but challenging, alternative is to develop both sides from the beginning and create an integrated solution. As anticipated in Section I, the two traditional workflows for TinyML solutions are ML-oriented and HW-oriented, while the holistic methodology is called co-design.
In the ML-oriented workflow (see Figure 4.a), the majority of the expertise is in the design, adaptation, training, and evaluation of ML models, while the choice of hardware platforms is fixed or limited, due to necessity or specific industrial requirements [45]-[47]. A typical example of this workflow is the porting of modern neural network models to embedded devices [48]. This requires extensive experimental investigations for the implementation to be efficient in terms of power consumption, latency, and memory usage, all resources in short supply on such devices compared to cloud solutions.
In particular, we identify the following stages in the ML-oriented workflow:
• Model design: ML practitioners formulate, train, and validate a comprehensive model suitable for the problem domain. This stage is highly dependent on the nature of this domain but disregards on purpose the specifics of the hardware platform to achieve maximum generalization and performance.
• Model optimization: This stage consists of different strategies to compromise performance for efficiency, discussed in more detail in Section V.
• On-host evaluation: The optimized model is evaluated against the performance parameters required in the specifications and, if found lacking, it is re-designed.
• Target deployment: Specialized optimizations are applied to the model in order to increase the inference efficiency by leveraging specific features of the hardware device.
• Target evaluation: The final evaluation of the system in production is performed.
On the other hand, in the HW-oriented approach (Figure 4.b), the developers mainly focus on designing enhanced hardware platforms that are optimized for embedded applications, in order to run current and future state-of-the-art ML algorithms. This often involves investigating the bottlenecks in an existing architecture with regard to computations within an ML framework, like neural networks, and the design of hardware accelerator modules to improve throughput and consumption: e.g., reducing computational complexity in convolution layers [49], [50], efficient, low-power and feature-rich perceptrons [51], and enhanced data caches [52]. In other cases, the developers design new hardware platforms optimized for embedded applications with extended digital signal processing capabilities already integrated [53]. These in turn require the development of optimized computing libraries [54], [55] to extract the best performance.
A HW-oriented workflow may have the following stages:
• Hardware design: Hardware practitioners create the design for an architecture, or accelerator module in an architecture, that improves performance for a given class of computing problems or signal processing algorithms.
• Target deployment: Assessment of the performance of the optimized hardware on benchmarks of the given computing problems, mostly in simulated environments. In case of unsatisfying results, return to the design stage.
• Target evaluation: Production and evaluation of the physical hardware devices.
Finally, in the co-design workflow (Figure 4.c), the approach is to integrate both sides of the development from the start in order to gain further improvements in performance and resource consumption. In particular, while model optimization and hardware design are separate steps in the previous workflows (Figure 4.a and Figure 4.b), here they are intertwined and co-optimized: in some cases to create bespoke architectures for specific ML algorithms on FPGAs [41], in other cases to allow neural network computations on customized accelerators using analog compute-in-memory (CiM) hardware through HW-informed training methodologies [56].
The co-design workflow may be described with the following steps:
• Problem design brief: Two separate working groups G1 (ML specialist) and G2 (HW specialist) define the capabilities and requirements of the target device.
• Problem analysis: The two groups specify their state-of-the-art architecture after exploring the possible alternatives.
• Co-design step: In a cooperative and concurrent design process, specific hardware and software components for selected sections of an application must be chosen with a global view of the system.
• Target evaluation: Final evaluation of the model-specific and target-specific optimizations for the device in production.

V. LEARNING PANORAMA UNDER THE TINYML LENS
To leverage the full potential of TinyML, exploring and understanding the complexities involved in designing and optimizing ML models specifically for resource-limited devices is essential. We propose the taxonomy shown in Figure 5, which covers the learning panorama under the TinyML lens. In the following sections, we will delve into the macro-areas of model optimization, model design, and learning algorithms.

A. MODEL OPTIMIZATION
Model optimization techniques tailored for TinyML generally produce smaller memory footprints, lower energy consumption, and reduced inference latency. In many cases, e.g., large neural network models, there are simply not enough resources available on embedded devices. The following paragraphs explore techniques such as pruning, quantization, knowledge distillation, and Hyper-Parameter Optimization (HPO), that enable efficient model deployment. Table 1 provides an overview of the popularity of these techniques within the TinyML field, considering the referenced research contributions.

1) Pruning
This process eliminates weight connections within a network to accomplish different goals like reduced model footprint and accelerated inference speed [64]. Despite a lack of standardized benchmarks and metrics, due to differences in goals favoring different design choices and evaluations, pruning is effective at compressing models while keeping (or sometimes increasing) accuracy [62]. Pruning techniques can be applied during the training process or after the model has been trained. During training, pruning regularizes the model, mitigating the risk of overfitting [59]. Post-training pruning is instead employed to eliminate redundant connections and parameters from the model, thereby enhancing its efficiency and accelerating its execution [61]. We identified in the current literature three main approaches: weight pruning, neuron pruning, and structured pruning.
Specifically, weight pruning is a technique that eliminates connections or weights in a model whose magnitude falls below a given threshold [60]. Approaches based on this technique are gaining interest due to their immediate applicability [65], [66]. Similarly, neuron pruning discards entire neurons based on a given threshold of importance [57], and structured pruning removes entire structures or sub-networks from a model [58].
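As a minimal sketch of the first two approaches, the snippet below applies magnitude-based weight pruning and a simple form of neuron pruning to a small weight matrix. The function names, the toy `layer` values, and the choice of the L1 norm of a row as the neuron importance score are assumptions made for illustration; they are not taken from the cited works.

```python
def prune_weights(weights, threshold):
    """Weight pruning: zero every connection whose magnitude is below threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def prune_neurons(weights, min_importance):
    """Neuron pruning: drop whole rows (one row = one neuron's incoming weights).

    Importance is assumed here to be the L1 norm of the row.
    """
    return [row for row in weights if sum(abs(w) for w in row) >= min_importance]


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)


# Toy 3x3 layer (hypothetical values): the middle neuron carries almost no weight.
layer = [
    [0.80, -0.02, 0.40],
    [0.01, 0.03, -0.02],
    [-0.60, 0.05, 0.90],
]

pruned = prune_weights(layer, threshold=0.1)    # same shape, more zeros
kept = prune_neurons(layer, min_importance=0.5)  # fewer rows
```

Weight pruning keeps the tensor shape and therefore needs sparse storage or sparse kernels to save memory, whereas neuron pruning shrinks the dense matrix itself, which is usually easier to exploit on an MCU.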

2) Quantization
This involves performing computations and storing tensors at lower bit widths compared to floating point precision [73]. By utilizing fewer bits to represent data, such as 16-bit floats or 8-bit integers instead of 32-bit floating-point numbers, quantization enables more compact model representations and the utilization of efficient vectorized operations on various hardware platforms [69]. This technique is particularly beneficial during inference, significantly reducing computation costs while maintaining inference accuracy [67]. Quantization can be achieved through two approaches: Quantization-Aware Training (QAT), which involves re-training the model, and Post-Training Quantization (PTQ), which applies quantization without re-training.
QAT involves quantizing a pre-trained model and subsequently performing a fine-tuning step to recover any accuracy loss caused by quantization-related errors, which may impact model performance [74]. The QAT process consists of two stages: pre-training and fine-tuning. In the pre-training stage, the network is trained using standard techniques in a full-precision floating-point format (32-bit) to learn data patterns and develop robust feature representations. In the fine-tuning stage, the network is converted to a quantized representation, combining fixed-point and floating-point arithmetic. This adjustment allows the network to adapt to the quantized representation while preserving accuracy. QAT encompasses different methods, including hybrid [70], layer-wise [94], and adaptive approaches [95].
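The fine-tuning stage can be illustrated with a deliberately tiny sketch: a single scalar weight is fitted so that its quantized value, not its floating-point value, reproduces the targets. The snippet assumes a fixed quantization step `scale`, a signed 8-bit grid, and a straight-through estimator (the gradient computed for the quantized weight is applied unchanged to the float weight). These are common choices but they are assumptions here, not details given by the cited works, and the function names are hypothetical.

```python
def fake_quant(w, scale):
    """Round w onto the int8 grid and map it back to float (quantize-dequantize)."""
    q = max(-127, min(127, round(w / scale)))
    return q * scale


def qat_finetune(w, data, scale, lr=0.05, epochs=200):
    """Fine-tune one weight so that the quantized weight fits y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            w_q = fake_quant(w, scale)      # forward pass uses the quantized weight
            grad = 2.0 * (w_q * x - y) * x  # gradient of the squared error w.r.t. w_q
            w -= lr * grad                  # straight-through: update the float weight
    return w, fake_quant(w, scale)


# "Pre-trained" float weight 0.5; the data follow y = 0.73 * x (toy values).
w_float, w_quant = qat_finetune(0.5, [(1.0, 0.73), (2.0, 1.46)], scale=0.1)
```

With a step of 0.1 the target 0.73 is not representable, so the float weight settles near the boundary between the two neighbouring grid points and the deployed weight is one of them.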
PTQ reduces memory usage and computational costs by converting model weights and activations from high-precision floating-point to low-precision numbers [68]. Initially, the model is trained using floating-point representation, followed by quantization of weights and activations using techniques like k-means clustering or vector quantization [71]. The adoption of low-precision numbers, such as 8-bit integers, significantly reduces memory requirements, enabling more efficient model execution and suitability for resource-constrained environments [72]. These techniques are also known as Dynamic Range Quantization (DRQ) or Full-Integer Quantization (FIQ), depending on whether only the weights are quantized to 8-bit integers or the inputs and activations are quantized as well.
Like in pruning, the application of quantization techniques allows for immediate deployment of already existing models to resource-constrained devices in various fields like computer vision [76], [77] and healthcare [78].
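To make the weight-only case concrete, the following sketch quantizes a list of trained float weights to signed 8-bit integers after training and measures the round-trip error. It assumes a symmetric, per-tensor scheme in which a single `scale` maps the largest magnitude to 127; this is one common convention and is an assumption of the example, as are the function names and the toy weights.

```python
def quantize_weights_int8(weights):
    """Symmetric per-tensor quantization of float weights to signed 8-bit integers."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # toy trained weights
q, scale = quantize_weights_int8(weights)
restored = dequantize(q, scale)

max_error = max(abs(a - b) for a, b in zip(weights, restored))
bytes_fp32 = 4 * len(weights)  # 32-bit floats
bytes_int8 = 1 * len(q)        # 8-bit integers, plus one shared scale
```

The rounding error per weight is bounded by half a quantization step, while storage for the weights drops by a factor of four.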

3) Knowledge Distillation
This technique transfers knowledge from a large, complex model (teacher) to a smaller, simpler model (student) [84]. This process is important for various reasons, such as reducing computational demands or enhancing model performance on specific tasks. Knowledge types, distillation strategies, and teacher-student architectures are vital factors in student learning during knowledge distillation. The subsequent paragraphs introduce the key categories of knowledge types and distillation strategies.
The extraction of knowledge from teachers and its utilization for training student networks can be classified into three categories: response-based, feature-based, and relation-based. Specifically, response-based knowledge distillation involves mimicking the final predictions of the teacher model by capturing the neural response in the last output layer [87]. Feature-based knowledge expands upon this approach by using both the outputs of the last layer and intermediate layers to train thinner networks [81]. Finally, relation-based knowledge takes a step further by exploring the relationships between different layers or data samples in addition to the outputs of specific layers in the teacher model [86].
The distillation schemes are also crucial for the student learning process. Depending on the training strategy, the following three different categories are presented: offline distillation, online distillation, and self-distillation. Offline distillation is a two-stage strategy, where the teacher model is first trained on a set of training samples, and then the trained teacher model is used to guide the student model by extracting intermediate features or logits [80]. On the other hand, online distillation is an end-to-end approach where both the teacher and student models are updated simultaneously, making it suitable when the teacher model is not significantly larger or higher performing [85]. Finally, self-distillation is a special case of online distillation where the teacher and student networks have the same architecture [79].
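A response-based, offline setting can be sketched as a loss that compares the student logits with the logits of an already trained teacher. The formulation below, a temperature-softened KL term combined with a hard-label cross-entropy term through a weight `alpha`, is the widely used one and is assumed here for illustration; the temperature, `alpha`, and the toy logits are hypothetical values, not settings reported by the cited works.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term (teacher) and a hard-label term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    # Cross-entropy against the ground-truth label at temperature 1
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * soft + (1.0 - alpha) * hard


teacher = [5.0, 1.0, -2.0]       # logits of the frozen teacher (toy values)
good_student = [4.8, 1.1, -1.9]  # close to the teacher
poor_student = [0.0, 3.0, 0.5]   # disagrees with teacher and label

loss_good = distillation_loss(good_student, teacher, label=0)
loss_poor = distillation_loss(poor_student, teacher, label=0)
```

In offline distillation only the student parameters would be updated to reduce this loss; the teacher logits are treated as constants.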
In general, knowledge distillation is used to achieve a good trade-off between small model size and an acceptable accuracy [88]. For this reason, it is widely adopted in several fields where existing models are well-performing but unable to be deployed ''as they are'' in resource-constrained hardware. This is the case with large scaling requirements [89], bandwidth-limited domains [82], and healthcare applications, where the trade-off between accuracy and model size needs to produce a high accuracy model that can fit the hardware requirements [83].

4) Hyper-Parameter Optimization (HPO)
This technique automates the search for the optimal hyper-parameter values of a model to enhance its performance on a specific task [90]. Hyper-parameters, such as learning rate, batch size, and network size, are predetermined parameters that influence model behavior [92]. HPO utilizes search algorithms, such as Grid Search, Random Search, and Bayesian Optimization, to explore the hyper-parameter space and identify the combination that yields the best performance [91]. By automating the tuning process, HPO reduces the effort and time required while improving the model's performance.
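The difference between Grid Search and Random Search can be shown on a toy search space. In the sketch below, `validation_loss` is a hypothetical stand-in for training a model and measuring its validation loss; the search space, its values, and the function names are assumptions made for illustration.

```python
import itertools
import random


def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for 'train the model and return its validation loss'."""
    return (learning_rate - 0.01) ** 2 * 1e4 + (batch_size - 32) ** 2 / 1024


def grid_search(grid):
    """Evaluate every combination in the grid and return the best one."""
    candidates = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]
    return min(candidates, key=lambda cfg: validation_loss(**cfg))


def random_search(space, trials, seed=0):
    """Sample a fixed number of configurations at random and return the best one."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(options) for name, options in space.items()}
        for _ in range(trials)
    ]
    return min(candidates, key=lambda cfg: validation_loss(**cfg))


space = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [16, 32, 64]}
best_grid = grid_search(space)                # 9 evaluations, exhaustive
best_random = random_search(space, trials=5)  # 5 evaluations, may miss the optimum
```

Grid Search cost grows multiplicatively with each added hyper-parameter, which is why Random Search or Bayesian Optimization is often preferred when every evaluation means a full training run.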

B. MODEL DESIGN
Unlike traditional ML models, TinyML models require careful design considerations to strike a balance between accuracy and efficiency. This section investigates techniques in model architecture exploration, model simplification, and architectural modifications that provide lightweight models that are nevertheless capable of delivering acceptable performance for their intended applications. In the following paragraphs we explore Neural Architecture Search (NAS), Rational Activation Functions (RAFs), depthwise separable convolution, and the attention mechanism. Table 2 provides an overview of the popularity of these techniques within the TinyML field, considering the referenced research contributions.

1) Neural Architecture Search (NAS)
Neural architecture design plays a crucial role in data representation and performance, but it heavily relies on researchers' knowledge and experience. NAS automates the process of discovering optimal architectures for specific needs, replacing manual tweaking with an automated exploration of more complex architectures.
NAS utilizes search algorithms, such as reinforcement learning, evolutionary algorithms, and gradient-based methods, to identify architectures that maximize performance on a given task. Moreover, in [102], the authors argue that it is beneficial to NAS approaches for resource-constrained systems to also search for appropriate data granularity. Specifically, data granularity refers to the concept that data can be fed into an ML model at various levels of detail (e.g., an audio sample can be presented to an ML model using different sample rates). By automating the search process, NAS reduces the time and effort required for network design and optimization, leading to improved task performance [99]- [101], [103].
For example, in [97], [98], [104], NAS algorithms targeted specifically to microcontrollers are investigated, demonstrating that NAS promises to help design accurate ML models that meet the tight MCU memory, latency, and energy constraints [96].
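The role of a hardware constraint inside the search loop can be sketched as follows. This is a deliberately simple exhaustive search over a tiny space of fully connected networks, standing in for the reinforcement learning, evolutionary, or gradient-based searches mentioned above; the parameter budget and the `proxy_score` function are hypothetical.

```python
import itertools

def mlp_param_count(n_inputs, hidden_widths, n_outputs):
    """Number of weights plus biases of a fully connected network."""
    sizes = [n_inputs, *hidden_widths, n_outputs]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

def proxy_score(hidden_widths):
    """Hypothetical accuracy proxy; a real NAS would train or estimate each candidate."""
    return sum(w ** 0.5 for w in hidden_widths)

def constrained_search(n_inputs, n_outputs, widths, max_depth, param_budget):
    best = None
    for depth in range(1, max_depth + 1):
        for candidate in itertools.product(widths, repeat=depth):
            params = mlp_param_count(n_inputs, candidate, n_outputs)
            if params > param_budget:
                continue  # candidate would not fit the assumed memory budget
            score = proxy_score(candidate)
            if best is None or score > best[0]:
                best = (score, candidate, params)
    return best

score, architecture, params = constrained_search(
    n_inputs=16, n_outputs=4, widths=[8, 16, 32], max_depth=2, param_budget=1000)
print(architecture, params)
```

Latency and energy constraints could be handled in the same way, by rejecting or penalizing candidates whose estimated cost exceeds the target.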

2) Rational Activation Functions (RAFs)
Activation functions play a central role in deep learning since they form an essential building block of neural networks; thus, identifying new activation functions that can potentially improve the results is still an open field of research. Recently, RAFs have attracted interest because they were shown to perform on par with state-of-the-art activations on image classification [105]. They are trainable in an end-to-end fashion using backpropagation and can be seamlessly integrated into any neural network in the same way as common activation functions (e.g., ReLU). In other words, the key idea is to involve the activation functions in the learning process together (or separately) with the other parameters of the network such as weights and biases.
RAFs have several advantages over standard activation functions [106]. For example, they can provide better approximation capabilities, which can improve the performance of the neural network [25]. Additionally, RAFs can have more flexible shapes, making them better suited for modeling a wider range of data distributions.
Thus, by exploring RAFs, which can potentially strike a balance between accuracy and computational cost, we could unlock new avenues for creating compact yet high-performing models ideal for resource-constrained contexts. Despite the scarcity of previous research on the subject, examining RAFs might lead to ground-breaking findings and innovative insights in refining TinyML models for real-world applications.
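To make the idea concrete, the sketch below evaluates a rational function R(x) = P(x) / Q(x), the ratio of two polynomials whose coefficients would be the trainable parameters. The specific form of the denominator, Q(x) = 1 + |b_1 x + ... + b_n x^n|, is one parameterisation that avoids division by zero and is assumed here for illustration; the coefficient values are arbitrary.

```python
def rational_activation(x, numerator, denominator):
    """R(x) = P(x) / Q(x), with Q(x) = 1 + |b_1 x + ... + b_n x^n| so that Q >= 1."""
    p = sum(a * x ** i for i, a in enumerate(numerator))
    q = 1.0 + abs(sum(b * x ** (j + 1) for j, b in enumerate(denominator)))
    return p / q

# Illustrative coefficients only; in a network they would be learned by backpropagation.
numerator = [0.0, 1.0, 0.5]   # a_0 + a_1 x + a_2 x^2
denominator = [0.0, 0.5]      # b_1 x + b_2 x^2

for x in (-2.0, 0.0, 2.0):
    print(x, rational_activation(x, numerator, denominator))
```

With these coefficients the function is zero for x = -2 and x = 0 and positive for x = 2, a ReLU-like shape obtained from only five parameters.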

3) Convolutional Layers
In the convolution operation, each filter convolves over the spatial dimensions and channel dimensions of the input. The filter size is typically denoted as s_x × s_y × in_ch. Standard convolutions have a high computational cost, depending on the size of the kernel and the size of the input. To optimize this process, depthwise separable convolution was introduced. Specifically, depthwise separable convolution involves two steps: 1) performing a point-wise convolution with 1 × 1 filters, resulting in a feature map with a depth of out_ch; 2) conducting a spatial convolution with s_x × s_y filters in the x and y dimensions, applied to each channel separately. By stacking these two operations without intermediate nonlinear activation, the output shape remains the same as that of a regular convolution, but with significantly fewer parameters.
This technique is utilized in models like MobileNet [107] and MobileNetV2 [109], designed for mobile and embedded devices. Using depthwise separable layers instead of regular convolutions, MobileNet reduces the number of parameters and multiply-add operations, enabling efficient deployment on mobile devices for computer vision tasks.
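The parameter saving can be checked with simple arithmetic. The sketch below counts weights (biases ignored) for a regular convolution and for the two-step decomposition described above; the layer sizes are arbitrary example values.

```python
def standard_conv_params(s_x, s_y, in_ch, out_ch):
    """Each of the out_ch filters has s_x * s_y * in_ch weights."""
    return s_x * s_y * in_ch * out_ch

def separable_conv_params(s_x, s_y, in_ch, out_ch):
    pointwise = in_ch * out_ch      # 1 x 1 filters that mix the input channels
    spatial = s_x * s_y * out_ch    # one s_x x s_y filter per resulting channel
    return pointwise + spatial

standard = standard_conv_params(3, 3, 64, 128)
separable = separable_conv_params(3, 3, 64, 128)
print(standard, separable, standard / separable)
```

For a 3 × 3 layer with 64 input and 128 output channels, the regular convolution needs 73,728 weights against 9,344 for the separable version, roughly an eightfold reduction.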
Furthermore, in [110] Tiny-Sepformer is presented, a tiny time-domain transformer network that uses a Convolution-Attention (CA) block in the masking network in order to split the layer into a convolution path and an attention path that run in parallel. In particular, to further reduce the computation, the convolution part of CA is a 1D depthwise separable convolution.
Finally, in [108] Xception is presented, an interpretation that considers Inception modules in convolutional neural networks as an intermediate step between regular convolution and the depthwise separable convolution operation (a depthwise convolution followed by a pointwise convolution).

4) Attention Mechanism
This technique, while commonly associated with notable contributions to machine translation tasks, has been adapted and adopted for a wide range of applications [115]. Its fundamental purpose remains unchanged: allow the model to focus on relevant parts of the input while generating outputs. This enables the model to selectively attend to different regions or features, thereby facilitating the extraction of salient information from complex and high-dimensional data. Beyond translation tasks, the attention mechanism has been successfully employed in natural language processing [113], image captioning [112], speech recognition [111], and more. By incorporating attention into these tasks, models can effectively handle long-range dependencies, capture context-specific information, and improve overall accuracy and robustness.
The beauty of the attention mechanism lies in its ability to assign importance to different parts of the input dynamically, based on their relevance to the current context. This adaptability allows models to prioritize relevant information and disregard noise or irrelevant details, resulting in more precise and context-aware predictions. Hence, attention can be particularly useful in TinyML applications, where resource-constrained devices require efficient and compact models. In [114], the authors introduce AttendNets, a deep self-attention architecture based on visual attention condensers, to deploy on-device visual perception tasks like image recognition.
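The dynamic weighting described above can be sketched with scaled dot-product attention, one widely used form of the mechanism. It is assumed here purely for illustration and is not the attention condenser design of [114].

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every key to this query, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the relevance-weighted average of the values.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = attention([[5.0, 0.0]], keys, values)
print(weights[0], outputs[0])
```

The query is aligned with the first key, so almost all of the weight, and therefore almost all of the output, comes from the first value.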

C. LEARNING ALGORITHMS
Among the many taxonomies covering the whole discipline of ML, we focus first on the standard paradigms of supervised, unsupervised, Self-Supervised (SSL), and deep reinforcement learning. Moreover, for the particular case of optimal usage of resources, we review weakly-supervised learning, meta-learning and continual learning techniques, detailing how these are useful for TinyML. Also in this case, Table 3 provides an overview of the popularity of these techniques within the TinyML field, considering the referenced research contributions.

1) Supervised Learning
In this paradigm, a model learns from labeled training data to make predictions. It involves training a model on input-output pairs, where the input data is fed into the model, and the corresponding desired output or label is provided. This training process allows for robust models but at the cost of requiring a large amount of annotated data. Supervised learning, in general, guarantees good performance when a lot of training data is available, which usually also requires big models with a high number of parameters that are capable of learning the patterns present in the data. In this sense, there are TinyML approaches that perform supervised learning on big models and then distill the learned knowledge into small models. This can be performed by compressing a previously trained model (teacher) to a smaller one (student) as in [118], in a process that is similar to the knowledge distillation process (as explained in Section V-A3), or by learning directly from a big model leveraging parallel computation, as proposed by [119].
As stated before, supervised learning requires a lot of annotated data to be effective. Despite that, some TinyML works try to compensate for the lack of annotation on the data. In this regard, in [120] the authors implement a method for generating missing samples during training in the context of human action recognition, where missing samples could lead to inaccurate classifications. In [121], the authors propose to take into account the uncertainty of future samples in a power consumption management implant. This is performed by trying to capture the underlying seasonal and daily changes from some annotated data, and then forecasting the uncertainty of future energy consumption.

2) Unsupervised Learning
In this paradigm, a model learns from unlabeled data, striving to discover patterns, structures, or relationships in the data.
Anomaly detection is one of the most common use cases for employing the unsupervised learning approach. This makes anomaly detection particularly interesting for TinyML, given the need to process raw data streams in their early stages, as close to their origins as possible. Specifically, the industrial environment is where a majority of research and development related to anomaly detection is concentrated [123]. This is due to the unique challenges encountered in such environments, including limited or unreliable communication with the cloud, the need for uninterrupted connectivity, and potential obstacles to accessing systems. In this sense, TinyML becomes essential.
Besides the industrial environment, other real-world scenarios are interested in unsupervised anomaly detection, such as physiological disorders [122] and climate conditions [124], in which classic machine and deep learning models are proposed.
Specifically regarding deep learning approaches, in [125] a model based on autoencoders has been engineered to be executed on a microcontroller for detecting anomalies of top-load industrial washing machines. The model has been ported to an Arduino Nano microcontroller, achieving high accuracy and recall, with remarkably low power usage.
Finally, related to the empirical data analysis approach, in [126], the authors propose an unsupervised TinyML approach to detect anomalies on roads, based on the concept of Typicality and Eccentricity of Data (TEDA). Similar work is presented in [127], where the focus lies on monitoring the release of greenhouse gases from urban vehicles. Specifically, a TinyML unsupervised methodology is employed to quantify CO2 emissions for the evaluation of air quality within urban environments.
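The appeal of TEDA for constrained devices is that it only needs a running mean and variance, updated one sample at a time. The sketch below follows the recursive formulation commonly given for TEDA on a scalar stream; it is not the implementation of [126] or [127], and the threshold parameter m = 3 and the sample values are assumptions.

```python
class TedaDetector:
    """Recursive eccentricity-based outlier test for a stream of scalar values."""

    def __init__(self, m=3.0):
        self.m = m        # sensitivity, comparable to an "m-sigma" rule
        self.k = 0        # number of samples seen so far
        self.mean = 0.0
        self.var = 0.0

    def update(self, x):
        """Consume one sample and return True if it is flagged as anomalous."""
        self.k += 1
        k = self.k
        if k == 1:
            self.mean = x
            return False
        self.mean = (k - 1) / k * self.mean + x / k
        dist2 = (x - self.mean) ** 2
        self.var = (k - 1) / k * self.var + dist2 / (k - 1)
        if self.var == 0.0:
            return False  # all samples identical so far
        eccentricity = 1.0 / k + dist2 / (k * self.var)
        # Compare the normalized eccentricity with the m-dependent threshold.
        return eccentricity / 2.0 > (self.m ** 2 + 1.0) / (2.0 * k)

stream = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.0, 25.0]
detector = TedaDetector()
flags = [detector.update(x) for x in stream]
print(flags)
```

Only three numbers are stored between samples, which is why this family of methods fits in the memory of a microcontroller.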

3) Self-Supervised Learning (SSL)
Supervised learning is currently facing a bottleneck due to its significant dependence on expensive manual labeling, leading to issues such as generalization errors and spurious correlations [128]. SSL has emerged as a highly promising technique to address the aforementioned challenges, offering a solution that eliminates the need for costly manual annotations [130].
In SSL, the model is trained to predict certain aspects of the input data without relying on external annotations. This can be achieved by creating surrogate tasks such as, in the case of image-based tasks, predicting missing parts of an image, reconstructing an image from a corrupted version, and predicting the relative position of image patches. In this manner, the model is encouraged to capture the underlying structure and semantics of the data.
The synergy between SSL and anomaly detection is evident: by reconstructing or predicting parts of the data without explicit labels, the model learns to capture the inherent structure of the majority class, making it more sensitive to deviations and anomalies. Furthermore, SSL is particularly advantageous when labeled anomaly data is scarce or expensive to obtain, as it enables the model to generalize better and identify anomalies effectively in diverse and complex datasets. Since anomaly detection faces a scarcity of labeled anomaly examples, SSL leverages unlabeled data to learn features that capture the underlying structure of the data, effectively utilizing the abundance of unlabeled data available. Specifically, in [129] an SSL method for anomaly detection in IoT devices is proposed. Their method is based on a multivariate Long Short-Term Memory (LSTM) autoencoder. The self-supervision is performed by detecting data that significantly deviates from the learned distribution, and using them as anomalous data to enhance the detection of such anomaly types.

4) Deep Reinforcement Learning
Reinforcement Learning (RL) is an ML technique that trains a model (usually an agent) to take actions according to a policy based on the input. Learning is usually driven by rewards that the agent must maximize, with the problem formalized as a Markov Decision Process (MDP) [131]. Deep reinforcement learning integrates RL with deep learning, using neural networks to approximate the optimal policies. This is useful because in real-world applications the state space is high-dimensional and the use of traditional RL algorithms is not effective [151]. Specifically, in the case of TinyML, the challenge is to embed the neural networks implemented with deep reinforcement learning approaches on small, constrained devices such as MCUs.
Transferring trained deep reinforcement learning models to constrained devices is possible using several general-purpose techniques usually designed to alleviate the system resource bottlenecks, as proposed by [132] in their framework suite, making deep reinforcement learning feasible for TinyML platforms. Moreover, in [133], the authors propose the framework TinyRL, to transfer the deep reinforcement learning knowledge into resource-limited devices.
Cheap off-the-shelf MCU devices are particularly interesting for deep reinforcement learning as they are widely adopted in robotics. Deploying a DRL model on an MCU-powered intelligent agent for autonomous driving, for instance, is what the authors of [134] propose to achieve. They present a deterministic policy gradient algorithm that takes into consideration the computation energy and caching costs jointly. This significantly reduces the energy cost of the final model. Moreover, in the effort to enable deep reinforcement learning on MCUs, the authors of [135] propose to train a tiny CNN that can be easily deployed on an MCU, with the aim of solving a physical, electrically actuated tilting maze with repositionable walls.

5) Weakly Supervised Learning
This paradigm lies between supervised and unsupervised learning. It involves training a model using partially labeled or noisy labeled data, where only limited or incomplete supervision is available. Since collecting large amounts of accurately labeled data can be expensive and time-consuming, weakly supervised learning allows for training models with fewer labeled samples, reducing the labeling costs associated with data collection. This cost-efficiency is clearly advantageous for TinyML applications, where resource constraints often limit the availability of labeled training data.
For example, in [136] the authors propose a weakly supervised learning solution for achieving improved anomaly detection performance. In particular, the training phase of the model is improved by some labels in the dataset: in fact, a part of the dataset is labeled, playing the role of ''domain expert,'' which allows weakly supervised learning. The ML model used is an Isolation Forest, and these labels are used to remove unnecessary trees and keep the most informative ones (i.e., those that give the best results).
Another TinyML-based system for anomaly detection in industrial environments is presented in [137]. In this case, an ensemble of ML classifiers detects if a sample is anomalous or not. This allows the system to be scalable w.r.t. the size of the ensemble, with a predictable impact on the memory footprint and delay in inference mode.

6) Meta Learning
Unlike traditional methods that solve tasks independently using a fixed learning algorithm, meta-learning enhances the learning algorithm itself based on experiences from multiple learning episodes [24]. While various perspectives on meta-learning exist, our focus here is on the optimization approach known as neural-network meta-learning, which is particularly relevant for TinyML applications. The neural-network meta-learning design should take into account three independent axes that represent the current meta-learning landscape: meta-representation, meta-optimizer, and meta-objective.
In particular, the concept of meta-representation [138] involves learning a high-level representation that captures the commonalities and patterns across different tasks. This meta-representation serves as a knowledge base from which the model can quickly adapt to new tasks with limited data. For example, in [143] MetaLDC is proposed, a system that meta-trains ultra-efficient low-dimensional computing classifiers to enable fast adaptation on tiny devices with minimal computational costs. Specifically, during the meta-training stage, MetaLDC meta-trains a representation offline by explicitly taking into account that the final (binary) class layer will be fine-tuned for fast adaptation for unseen tasks on tiny devices; during the meta-testing stage, MetaLDC uses closed-form gradients of the loss function to enable fast adaptation of the class layer.
On the other hand, the meta-optimizer [139] learns to optimize the model's parameters in a way that facilitates fast adaptation and generalization across tasks. Specifically, the meta-optimizer tunes the learning algorithm itself, enabling it to efficiently update the model based on task-specific information. TinyReptile, a simple but efficient meta-optimizer-based algorithm to collaboratively learn a solid initialization for a neural network across tiny devices, is presented in [47].
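The idea of learning an initialization that adapts quickly can be sketched with a Reptile-style update on a toy problem. This is a generic illustration, not the TinyReptile algorithm of [47]: each task is a one-parameter regression y = slope * x, and all step sizes and task values are assumptions.

```python
import random

def inner_sgd(w, slope, steps, lr, rng):
    """A few SGD steps on one task: fit y = slope * x with the model y = w * x."""
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        grad = 2.0 * (w * x - slope * x) * x  # derivative of the squared error
        w -= lr * grad
    return w

def reptile(task_slopes, meta_steps=200, inner_steps=5, inner_lr=0.1,
            meta_lr=0.1, seed=0):
    rng = random.Random(seed)
    w = 0.0  # shared initialization that is being meta-learned
    for _ in range(meta_steps):
        slope = rng.choice(task_slopes)  # sample one task
        adapted = inner_sgd(w, slope, inner_steps, inner_lr, rng)
        w += meta_lr * (adapted - w)     # move the initialization toward the adapted weight
    return w

w0 = reptile([1.0, 2.0, 3.0])
print(w0)
```

The learned initialization settles between the task-specific optima, so that a handful of gradient steps suffices to adapt to any single task. Only the current weights and one task's data are needed at a time, which keeps the memory footprint small.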
Lastly, the meta-objective [140] guides the meta-learning process by defining a criterion for evaluating the performance of the meta-learner. It provides a signal for learning to adapt and generalize, encouraging the acquisition of task-agnostic knowledge that can be applied to new tasks. To the best of our knowledge, the two closest works in this area are [141], in which the authors propose Adaptation-aware Network Pruning (ANP), a novel pruning scheme that works with existing meta-learning methods for a compact network capable of fast adaptation, and [142], in which it is shown that the application of the Lottery Ticket Hypothesis (LTH) to meta-learning enables the adaptation of meta-trained networks on various IoT devices.

7) Continual Learning
This research field aims at developing algorithms that enable models to continuously learn from new data while preserving previously acquired knowledge. This is essential because conventional ML models typically struggle to learn from new data while retaining previously acquired knowledge, often leading to catastrophic forgetting [144].
Recently, there has been a significant development in the field, with several promising algorithms and architectures being proposed and showing improved performance in various continual learning benchmarks. These approaches can be classified into the following three categories: regularization-based, replay-based, and dynamic architectures.
Specifically, regularization-based methods introduce regularization terms into the loss function to encourage the model to maintain its prior knowledge while learning new tasks. On the other hand, replay-based methods involve storing past data and replaying it during training to prevent forgetting. Finally, dynamic architectures adjust their capacity to accommodate new information.
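For the replay-based family, the central data structure is a small memory of past samples that is mixed with new data during training. The sketch below uses reservoir sampling to keep the memory at a fixed size; this choice, the capacity, and the sample format are assumptions for illustration and do not reproduce any specific cited method.

```python
import random

class ReplayBuffer:
    """Fixed-size memory of past samples, filled with reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            # Every sample seen so far stays in memory with equal probability.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = sample

    def mixed_batch(self, new_samples, n_replay):
        """New data followed by a few stored samples, to be trained on together."""
        n = min(n_replay, len(self.samples))
        return list(new_samples) + self.rng.sample(self.samples, n)

buffer = ReplayBuffer(capacity=4)
for i in range(100):
    buffer.add(("task_a", i))
batch = buffer.mixed_batch([("task_b", 0), ("task_b", 1)], n_replay=2)
print(batch)
```

Because the memory never grows beyond its capacity, the storage cost is known in advance, which matters on devices with a few hundred kilobytes of RAM.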
The significance of this research area has increased due to the surging demand for TinyML models in several applications, including healthcare, wearables, and IoT devices. Recent research in this domain has been focused on the development of efficient algorithms that can manage the restrictions of TinyML devices, such as limited memory and processing power. One of the promising methods in a TinyML scenario is to use regularization-based approaches that add penalties to the loss function to prevent overfitting and catastrophic forgetting. Another effective approach is to use dynamic architectures that can adapt their structure to accommodate new tasks. For instance, in [149] a regularization-based approach for an IoT scenario is presented, in which MCUs are exploited as edge devices for data processing considering two tasks: gesture recognition based on accelerometer data and image classification.
Replay-based methods, on the other hand, are well adopted in real-world scenarios as the general replay approach is very intuitive. In [147], the authors leverage the quantization of the frozen stage of the model, allowing for 8-bit execution, and replays in the latent space to reduce their memory cost with minimal impact on accuracy. The results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory, an amount compatible with embedding in TinyML devices. In [148], the authors propose Train++, an incremental replay-based training algorithm that trains ML models locally at the device level (e.g., on MCUs) using the full n-samples of high-dimensional data. Train++ enables resource-constrained MCU-based IoT edge devices to locally build their own knowledge base on the fly using the live data, thus creating smart self-learning and autonomous problem-solving devices. The authors of [150] propose TyBox, a toolbox for the automatic design of on-device TinyML classification models, with the idea of automatically generating the ''incremental'' version of an initial (static) pre-trained model using replays.
Lastly, regarding the dynamic architectures approaches for continual learning, in [145] a pioneering contribution in the form of Tiny-Transfer-Learning (TinyTL) is presented. In their work, the authors propose a novel approach that achieves memory efficiency by selectively freezing the weights of the network while solely focusing on learning the bias modules, thereby obviating the need to store intermediate activations.
To ensure the adaptability of the model, a new memory-efficient bias module, referred to as the lite residual module, is introduced. Through extensive experimentation, it is demonstrated that TinyTL yields substantial memory savings with minimal sacrifice in accuracy compared to the conventional fine-tuning approach applied to the entire network. Finally, in [146] the authors propose TinyOL (TinyML with Online Learning), which enables incremental on-device training with streaming data.

VI. TINYML DEVICES AND TOOLS
TinyML heavily depends on hardware devices to enable efficient training and inference for its applications. In particular, Central Processing Units (CPUs), Graphics Processing Units (GPUs), FPGAs, and Tensor Processing Units (TPUs) are essential components for the functioning of TinyML-based systems. Table 4 presents a summary of a comparison between these different hardware devices, outlining their respective advantages and disadvantages. The following sections describe each of these devices in turn.

A. CENTRAL PROCESSING UNIT (CPU)
The primary objective of TinyML is to optimize ML workloads in a way that allows them to be executed on microcontrollers with extremely low power consumption, often just a few milliwatts. Microcontrollers, particularly the Arm Cortex-M family, serve as an ideal platform for implementing ML due to their widespread usage [152]. Additionally, their minimal power requirements make them suitable for deployment in environments where replacing batteries is challenging or inconvenient [153].
However, despite their advantages, microcontrollers exhibit several drawbacks. Their wide-ranging applicability leads to the inclusion of unnecessary operations and logic checks, which might degrade computational performance. In addition, this general-purpose design fails to fully exploit the potential parallelism offered by deep learning algorithms.

B. GRAPHICS PROCESSING UNIT (GPU)
Originally designed for accelerating computer graphics, GPUs differ from CPUs in their composition. While CPUs consist of a few Arithmetic Logic Units (ALUs) optimized for sequential processing, GPUs are equipped with thousands of ALUs that enable parallel execution of numerous simple operations. This parallel architecture makes GPUs highly suitable for ML tasks since they can rapidly perform a large number of parallel computations. For example, ML algorithms often involve extensive matrix and vector operations, which can be efficiently parallelized and executed on GPUs, as we can see in [154].
In recent years, Nvidia has introduced multiple generations of GPU microarchitectures, also offered in embedded modules such as the Nvidia Jetson family, with a growing emphasis on enhancing deep learning performance. Additionally, Nvidia has introduced Tensor Cores [155], specialized execution units within their GPUs specifically designed for deep learning applications.
Furthermore, GPUs are specifically designed to efficiently handle large datasets and facilitate rapid data transfer between the main system memory and processing units, a significant attribute since ML typically operates on extensive real-time data [156]. Hence, the combination of parallel computing capabilities, a significant number of cores, and high-bandwidth memory access collectively establishes the GPU microarchitecture as a good choice for TinyML-based applications.

C. FIELD-PROGRAMMABLE GATE ARRAY (FPGA)
An FPGA offers a high-performance, efficient, and scalable solution for handling the intricate mathematical computations demanded by ML [157].
The fundamental building block of an FPGA's architecture is the fabric layer, comprising Configurable Logic Blocks (CLBs) and programmable interconnects. CLBs can be flexibly configured by users to perform various digital functions, including the complex mathematical operations essential for ML algorithms. Through programmable interconnects, CLBs can be interconnected in different configurations, enabling customization to suit diverse ML applications [158].
Aside from the fabric layer, an FPGA designed for ML often incorporates additional specialized hardware blocks, such as Digital Signal Processing (DSP) blocks and high-performance memory blocks. DSP blocks are utilized to enhance the execution speed of intricate mathematical operations like convolutions and dot products, which are commonly employed in ML models. High-performance memory blocks also facilitate rapid access to extensive datasets [159].
To summarize, an FPGA aims to deliver high performance, efficiency, and scalability, catering to the complex computational requirements of ML tasks. Consequently, FPGAs emerge as an ideal choice for a broad spectrum of applications in the TinyML domain.

D. TENSOR PROCESSING UNIT (TPU)
A specialized processor, known as the TPU, has been developed by Google explicitly for ML tasks, with a specific emphasis on tensor operations [160]. TPUs consist of several parts, including a high-bandwidth memory system, a systolic array of processing units, and an interconnect network that facilitates communication between these components.
The systolic array represents the core of the TPU, responsible for executing tensor operations. It consists of numerous processing elements arranged in a two-dimensional grid, with each element interconnected to its neighboring ones. This arrangement enables efficient communication between processing elements, facilitating the parallel execution of complex tensor operations. The high-bandwidth memory system ensures swift access to data necessary for tensor operations. Finally, the interconnect network links the TPU with other system components such as the host processor and other TPUs, promoting efficient communication and coordination among them. Google has released Edge TPUs using the Coral platform in various form factors, ranging from a Raspberry-Pi-like Dev Board to stand-alone solderable modules [161].
In summary, the architecture of an Edge TPU is specifically designed to provide high performance, efficiency, and scalability in handling tensor operations, making it an excellent choice for TinyML applications.

E. SOFTWARE TOOLS
As the demand for implementing ML on various hardware devices continues to grow, the software layer emerges as one of the essential components in the development of TinyML-based systems. To date, prevalent frameworks heavily rely on vendor-specific operator libraries, demonstrating the significant potential for driving advancements in TinyML research. Below, we provide an overview of the main frameworks utilized in this domain:
• TensorFlow Lite Micro [162]: is an open-source framework that empowers microcontrollers and similar devices with limited memory capacity to execute ML models. It operates efficiently without relying on an operating system, standard C or C++ libraries, or dynamic memory allocation. Developed in C++11, this framework necessitates a 32-bit platform and exhibits compatibility with most Arm Cortex-M Series processors.
• uTensor [163]: is a remarkably lightweight, open-source framework for ML inference. It is built upon TensorFlow and meticulously optimized for Arm targets. By converting ML models into readable and self-contained C++ source files, uTensor greatly simplifies integration with embedded projects.
• Edge Impulse [164]: is a service that facilitates the development of TinyML models specifically tailored for edge devices. The training process takes place on a cloud platform, and the resulting trained model can be easily exported to an edge device. Additionally, Edge Impulse simplifies the collection of actual sensor data, enables live signal processing from raw data to neural networks, and streamlines testing procedures.
• Embedded Learning Library [165]: the Microsoft Embedded Learning Library (ELL) empowers users to design and implement intelligent ML models on resource-constrained platforms. Conceptually, the ELL can be seen as a cross-compiler for intelligence embedding, where the compiler operates on the laptop and generates machine code that can be executed on the embedded device.
• X-CUBE-AI [166]: is an STM32Cube expansion package. It allows an automatic conversion of pre-trained artificial intelligence algorithms, including neural networks and classical ML models, for STM products. It also integrates an optimized library for STM32 Arm Cortex-M-based boards.
• uTVM [167]: is a compiler that offers graph-level and operator-level optimizations, enabling deep learning workloads to achieve performance portability across a wide range of hardware back-ends. It addresses optimization challenges specific to deep learning, including high-level operator fusion, mapping to various hardware primitives, and effectively mitigating memory latency.
• MinUn [168]: is a framework jointly developed by Microsoft Research in India, ETH Zurich, and UC Berkeley, specifically designed for TinyML applications. It presents a comprehensive solution to three critical subproblems. Firstly, it addresses the challenge of utilizing number representations that approximate 32-bit floating point numbers using fewer bits, without compromising accuracy. Secondly, it offers heuristic techniques to optimize bitwidth assignment, ensuring minimal memory usage while preserving accuracy. Lastly, it tackles the issue of memory management on devices with limited resources, mitigating potential problems related to memory fragmentation.

VII. DISCUSSION
Over the past five years, as shown by our search strategy, there has been a notable surge in studies investigating TinyML methods, optimizations, and applications. This trend reflects the growing recognition of the importance of real-time solutions for many complex and safety-critical real-world applications. In this paper, we present a comprehensive analysis of TinyML from the ML point of view. Our aim is to provide not just an updated guide on the current state-of-the-art, but also to pinpoint areas that have yet to be explored. By doing so, we hope to lay the foundation for future research and investigations in this field.
Within the proposed taxonomy in Figure 5, model optimization is, based on the referenced research contributions, the area that has received the most extensive exploration. Indeed, within TinyML, we come across several cutting-edge works that explore techniques such as pruning [63], quantization [75], and knowledge distillation [88]. In contrast, the scarcity of research focusing on HPO [93] can be attributed to three factors: its complexity, the lack of awareness about its potential to significantly enhance the performance of ML models, and its resource requirements, which can be a limiting factor for researchers with restricted access to high-performance computing infrastructure. Table 1 summarizes these findings.
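As a reminder of what the simplest form of pruning does, the sketch below implements unstructured magnitude pruning on a flat list of weights: the fraction of weights with the smallest absolute value is set to zero. This is a generic textbook variant written for illustration, not the method of any specific work cited above; the function name and the sample values are assumptions.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Return a copy of weights with the smallest-magnitude fraction zeroed.

    sparsity is the fraction of weights to remove, between 0.0 and 1.0.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    layer = [0.5, -0.01, 0.2, -0.8]  # illustrative weights
    print(magnitude_prune(layer, 0.5))  # prints [0.5, 0.0, 0.0, -0.8]
```

On an MCU, the zeros only save memory or computation if the runtime stores or skips them explicitly (for example through a sparse format), which is one reason structured variants are often preferred in practice.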
In terms of the model design area, most of the work is focused on NAS [97]. Another significant portion of the existing research in this area is related to attention mechanisms [116], [117] and depthwise separable convolutions, since these techniques enhance model efficiency, accuracy, and real-time inference capabilities, making them essential for resource-constrained edge devices. Therefore, we expect that research in these areas will continue to grow significantly in the coming years. However, despite the numerous advancements made in ML through the application of RAFs or VeLO, we have found no such method in any work related to TinyML. We believe that pursuing research in this direction could lead to further enhancements in the quality of the produced models. Table 2 summarizes this information.
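The efficiency argument for depthwise separable convolutions can be checked with a short parameter count. The sketch below compares a standard k x k convolution with its depthwise separable counterpart (a k x k depthwise convolution followed by a 1 x 1 pointwise convolution), ignoring biases. The function names and the layer sizes are assumptions chosen for this example.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise plus 1 x 1 pointwise convolution (biases ignored)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    k, c_in, c_out = 3, 32, 64  # illustrative layer shape
    std = standard_conv_params(k, c_in, c_out)
    sep = depthwise_separable_params(k, c_in, c_out)
    print(f"standard: {std} weights")             # 18432
    print(f"depthwise separable: {sep} weights")  # 2336
    print(f"ratio: {sep / std:.4f}")
```

The ratio equals 1/c_out + 1/k^2, so for 3 x 3 kernels the separable layer needs roughly one eighth to one ninth of the weights, which directly reduces the flash footprint on an MCU.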
Regarding the learning algorithms, we have encountered a significant body of research focused on unsupervised learning and continual learning. Additionally, considerable efforts have been made in the domains of supervised learning, meta-learning, and deep reinforcement learning. However, we note that the fields of self-supervised learning and weakly-supervised learning still require effort before they can be widely used in TinyML-based work. Hence, we believe that directing research efforts toward these two areas holds considerable potential for advancements in the years to come. Table 3 summarizes these findings.
In the field of TinyML applications, most efforts focus on addressing the challenges of anomaly detection. This is consistent with the previous statement emphasizing the use of unsupervised and continual learning strategies. It is worth noting that within the TinyML landscape, the majority of methodologies used to tackle various tasks lean toward deep learning paradigms. Non-deep learning algorithms are used sporadically in this context, with notable exceptions being approaches based on the TEDA framework, as demonstrated in [126].
Regarding hardware choices, there is no one-size-fits-all solution among CPUs, GPUs, FPGAs, and TPUs. Each of them comes with its own set of advantages and disadvantages, as summarized in Table 4. Therefore, selecting the most suitable hardware for a specific application is essential. However, we expect the co-design approach to drive further advancements in this research field, since the early work seems extremely promising, making it the primary direction of future developments.
Despite the promising applications and growing scientific literature in the field of TinyML, further research is needed to fully comprehend its advantages and limitations. In this context, we outline additional unresolved issues that require dedicated research to drive future advancements in the field.
• Benchmarking: The lack of a recognized benchmark, due to the challenges posed by low power, limited memory, hardware heterogeneity, and software heterogeneity, is an important impediment that may hamper TinyML services [27]. In this context, the IoT community has shown an increasing interest in benchmarking as a way to scientifically compare the performance of various TinyML solutions, with benchmarks for training [169], for inference [170], and specifically for TinyML systems [171].
• Memory Constraints: The insatiable demand for computation and high accuracy has continued to push innovation in ML algorithms, but the extremely small size of SRAM and flash memory still makes deep learning on edge devices very challenging.
• Data-driven engineering: It is critical to understand data quality thoroughly because relying solely on accuracy can be misleading when predicting model behavior. To accomplish this, a large amount of relevant real-world data is needed. This data helps to identify specific instances where the model fails to detect or behaves incorrectly. Furthermore, post-processing techniques will be required to improve the model's performance in these areas. In essence, we need tools and processes that prioritize "data excellence" in order to assess data quality comprehensively.
• Lack of accepted models: Currently, many deep learning models are widely accepted for conventional infrastructure. For example, MobileNet is used as the baseline for benchmarking deep neural networks in mobile edge computing devices. However, no such popular model has been adopted for the TinyML on MCUs ecosystem.
• Lack of public datasets: Although some datasets specifically designed for TinyML are available (such as for on-device online training [148]), to date, TinyML is mainly concerned with sensor processing in general, so the question that emerges is: "What is the ImageNet [172] of TinyML?"
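The memory constraint discussed in the list above can be made tangible with a back-of-the-envelope check. The sketch below assumes a common deployment pattern in which weights are stored in flash and activations live in SRAM; the device sizes, the parameter count, the activation peak, and the runtime overhead are hypothetical numbers chosen for illustration, and `McuBudget` and `model_fits` are names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class McuBudget:
    """Hypothetical memory budget of a target microcontroller."""
    flash_bytes: int
    sram_bytes: int


def model_fits(
    n_params: int,
    bits_per_weight: int,
    peak_activation_bytes: int,
    budget: McuBudget,
    runtime_flash_bytes: int = 0,
) -> bool:
    """Rough check: weights plus runtime code in flash, activations in SRAM."""
    weight_bytes = (n_params * bits_per_weight + 7) // 8  # round up to whole bytes
    fits_flash = weight_bytes + runtime_flash_bytes <= budget.flash_bytes
    fits_sram = peak_activation_bytes <= budget.sram_bytes
    return fits_flash and fits_sram


if __name__ == "__main__":
    # Assumed device: 1 MiB flash, 256 KiB SRAM.
    mcu = McuBudget(flash_bytes=1024 * 1024, sram_bytes=256 * 1024)
    n_params = 500_000             # assumed model size
    activations = 200 * 1024       # assumed peak activation memory
    runtime = 100 * 1024           # assumed inference runtime footprint

    print(model_fits(n_params, 32, activations, mcu, runtime))  # False: 2 MB of float32 weights
    print(model_fits(n_params, 8, activations, mcu, runtime))   # True: 0.5 MB of int8 weights
```

Even this coarse estimate shows why quantization is often a prerequisite rather than an optimization on MCU-class devices: the same model moves from not fitting at all to fitting with room to spare.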

VIII. CONCLUSION
The prodigious amount of research invested over the past decades in improving embedded technologies to enable the use of real-time solutions for many complex and safety-critical applications led to the birth of TinyML (Section I). As summarized in Figure 3, this paper presents a systematic review of TinyML from January 2018 to September 2023 (Section III). For the first time, we formalize the three different workflows to implement a TinyML-based system (Section IV). As an additional and distinct contribution, this survey places a strong emphasis on the ML perspective. It not only presents the most current TinyML frameworks but also recommends recent variations and advancements in ML technologies that TinyML practitioners may consider exploring to enhance the state-of-the-art capabilities (Section V).
In Section VI, we examine the advantages and disadvantages of different hardware devices that can be used to develop TinyML-based applications. Finally, in Section VII, we highlight the fields that hold the most promise for further research in the upcoming years. Additionally, we provide a list of unresolved problems that need to be addressed in order to propel the field forward.
FIGURE 1. A glance at the latest hardware developed for TinyML reveals a notable trend: recent advances are focused on the goal of minimizing power consumption. This means that the main focus right now is making sure that ML can run on devices with limited resources.

FIGURE 2. The number of papers on TinyML published so far, and the surveys on the topic. As evident, there is an exponential growth in the number of research papers, and it is worth noting that our survey not only stands as the most recent but also uniquely concentrates on the ML perspective, distinguishing it from all other existing surveys in the field.

FIGURE 4. The three workflows for implementing a TinyML-based system: (a) ML-oriented, (b) HW-oriented, and (c) co-design. In the latter, two separate working groups, G1 (ML specialists) and G2 (HW specialists), collaborate.

FIGURE 5. Our proposed taxonomy, which covers the learning panorama under the TinyML lens, includes three key domains: model optimization, model design, and learning algorithms.
LUIGI CAPOGROSSO (Student, IEEE) is a Ph.D. Student of the National Program in Artificial Intelligence at the University of Verona in collaboration with the Polytechnic of Turin in the IntelliGO labs under the supervision of Prof. Marco Cristani and Prof. Franco Fummi. He received his B.Sc. (2019) and M.Sc. (2021) at the University of Verona in Computer Science and Computer Engineering for Robotics and Smart Industry, respectively. His research interests can be grouped into the two areas of efficient deep learning and representation learning.

FEDERICO CUNICO is a Ph.D. Candidate in Computer Science at the University of Verona under the supervision of Prof. Marco Cristani. He received his Master's degree in Computer Science and Engineering at the University of Verona in 2019. His main research interests include deep learning and computer vision for industrial scene analysis, with emphasis on human-centered tasks such as attention estimation, human pose and motion forecasting, and human pose estimation.

TABLE 1. Popularity of model optimization techniques over the years, with their respective papers, based on the referenced research contributions.

TABLE 2. Popularity of model design techniques over the years, with their respective papers, based on the referenced research contributions.

TABLE 3. Popularity of learning algorithms over the years, with their respective papers, based on the referenced research contributions.

TABLE 4. Comparison between the different hardware for a TinyML-based system.