Optimized CNN Architectures Benchmarking in Hardware-Constrained Edge Devices in IoT Environments

Paul D. Rosero-Montalvo, Pınar Tözün, and Wilmar Hernandez, Senior Member, IEEE

Abstract—Internet of Things (IoT) and edge devices have grown in their application fields due to machine learning (ML) models and their capacity to classify images into previously known labels while working close to the end-user. However, the model might be trained with one of several convolutional neural network (CNN) architectures, which can affect its performance when deployed in hardware-constrained environments, such as edge devices. In addition, new training trends suggest using transfer learning techniques: an excellent feature extractor obtained in one domain is reused in a new domain that does not have enough images to train the whole model. In light of these trends, this work benchmarks the most representative CNN architectures on emerging edge devices, some of which have hardware accelerators. The ML models were trained and optimized using a small set of images obtained in IoT environments and using transfer learning. Our results show that unfreezing up to the last 20 layers of the model's architecture, depending on the CNN architecture, allows it to be fine-tuned correctly to the new set of IoT images. Additionally, quantization is a suitable optimization technique to shrink the model 2× or 3×, leading to a lighter memory footprint, lower execution time, and lower battery consumption. Finally, the Coral Dev Board can boost the inference process 100×, and the EfficientNet model architecture keeps the same classification accuracy even when the model is adapted to a hardware-constrained environment.

Index Terms—Convolutional neural network (CNN) architectures, edge computing, edge devices, model optimization, transfer learning.


I. INTRODUCTION
IN RECENT years, there has been a growth in the Internet of Things (IoT) and edge devices, which refer to computing tasks performed on or near the final user [1]. The emergence of these technologies has led to an increasing demand for machine learning (ML) models that can be deployed on resource-constrained devices. These devices can run the model's inference locally to save energy by keeping data on the device without constantly sending them to the Cloud [2], which reduces the risk of sensitive data being stolen and ensures data privacy. However, deploying traditional ML models on such devices is challenging due to limited computational resources, memory, hardware specifications, and power consumption constraints [3].
To save time in training ML models, transfer learning is a technique that enables the transfer of knowledge learned from one domain to another, allowing for the development of efficient and accurate models in a new domain [4], [5]. Transfer learning is mainly applied in convolutional neural network (CNN) architectures, which were trained with large sets of images to obtain an excellent feature extractor. Once the CNN architecture is trained (yielding an ML model), the neurons' weights are frozen and shared in a repository for further downloads. Then, on top of this architecture, new layers can be added to make an accurate classifier trained with a few samples of the new domain; as a result, the model is fine-tuned to the new domain [6]. Consequently, the transfer learning technique has shown great potential in edge devices, where it can be used to develop models with higher accuracy and efficiency while minimizing the required training data and computational resources [7], [8]. However, these models may need to be further optimized to fit into the memory requirements and match the hardware architectures of modern edge devices, since some edge devices have hardware accelerators to speed up the inference process [9], [10]. In this scenario, optimization techniques, such as quantization, model pruning, quantization-aware training, and knowledge distillation, can shrink ML models in terms of complexity and memory footprint while maintaining decent accuracy [11].
Optimization techniques focus on reducing the number of bits used to represent the weights and activations of a neural network or on removing redundant neurons or layers from the network [12]. However, ML models trained with the transfer learning technique have their own architecture and complexity, which affect their capacity to fit into different hardware architectures and might influence their performance, adaptability, and power consumption. Additionally, these ML models were trained with cleaned data sets, whereas in IoT applications, the incoming data to be inferred on the device might have drift or noise, affecting the model accuracy [7]. To shed some light on the performance tradeoffs of running ML models at the edge for image classification, this article aims to benchmark the most representative CNN architectures trained with the transfer learning technique for IoT environments. The ML models are fine-tuned with sensor data first and deployed onto edge devices with different hardware architectures and accelerators afterwards. Then, we quantify model accuracy and hardware utilization to select suitable ML models to deploy on the edge devices for inference. The main contributions of this article are as follows.
1) A novel methodology to benchmark edge devices and ML models using transfer learning and considering optimizations on hardware and software is presented.
2) An end-to-end view of ML models deployed at the edge for inference is examined, focusing on ML model training, fine-tuning, and optimizations for hardware-constrained environments.
3) Benchmark results for ML models running on edge devices, trained on sensor data with different classes and conditions, are demonstrated to show each model's benefits and limitations.
Our results highlight the following.
1) When ML models are fine-tuned with a small set of images from the new IoT domain using transfer learning, the classification accuracy increases. In addition, unfreezing the last 20 or fewer layers of the CNN architecture is a suitable configuration to get the highest classification scores with negligible impact on the training time before the models are exported to edge devices.
2) EfficientNet B0 can maintain its classification score even when the model is optimized using quantization and pruning methods. On the other hand, MobileNet V2 suffers when fine-tuned, and its classification performance is lower with multiclass data sets. However, it is the most lightweight model, with the lowest execution time and memory footprint among all the models we benchmark.
3) ResNet-50 and VGG-16 are the heaviest models in terms of hardware resource needs and do not exhibit significant improvements in their accuracy after fine-tuning.
4) The integer quantization (IQ) method shrinks the Inception V3 model only 3× rather than 4× as in the case of the rest of the ML models. Quantization techniques not only shrink the model, they also optimize the CPU utilization and reduce the inference time. Nevertheless, although both post-training float16 quantization and post-training dynamic range quantization (DRQ) can shrink the model 2× or 3×, when the model makes inferences, the execution time is far from the expected results since it takes longer than using the nonoptimized model.
5) Finally, many prior benchmarking efforts in this domain solely focus on inference time after the model is loaded into memory from disk. While in some scenarios this is a one-time cost, in others it is not. Therefore, depending on the IoT application, the time to load the model into memory must be taken into account.
The remainder of this article is structured as follows. Section II surveys the background and related works. The methodology is presented in Section III. The results are shown in Section IV, and their discussion is carried out in Section V. Finally, Section VI is devoted to the conclusions and future work.

II. BACKGROUND AND RELATED WORKS

A. Related Works
Transfer learning has gained significant attention by enabling the efficient training of models on limited data. With the rise of edge computing and the increasing popularity of edge devices, there has been a growing interest in benchmarking ML models with different CNN architectures specifically designed for deployment on edge devices.
Ahmed et al. [13] showed how heterogeneous edge devices can be part of the ML pipeline to train partial models or make inferences on the device. In addition, Li et al. [14] explained why the new small hardware accelerators in edge devices are suitable solutions to speed up the inference process. He et al. [15] mentioned the most representative CNN architectures to transfer knowledge between different domains. Amannejad et al. [16] evaluated federated models with CNN architectures on edge devices. Demosthenous and Vassiliades [17] expanded the TensorFlow Lite framework to shrink ML models to make inferences on the edge device, mainly on cell phones. Zheng et al. [18] showed a data-driven task allocation in edge devices for multitasking purposes and highlighted that transfer learning imposes a heavy computational burden, which makes proper task allocation necessary. Finally, Tanghatari et al. [19] applied knowledge distillation to obtain small models, reducing the ML models' complexity so they can be exported to edge devices.
Regarding benchmarking ML models on edge devices, Baller et al. [20] presented a transfer learning benchmark with edge devices, such as the Asus Tinker Edge R, Raspberry Pi 4, Google Coral Dev Board, and Nvidia Jetson Nano, and one microcontroller, the Arduino Nano 33 BLE, on different frameworks. They compared the performance of various ML models, including MobileNet and Inception-V3. Additionally, a method to measure the devices' power consumption, inference time, and accuracy is provided; such a method can be easily extended to other devices. On the other hand, Kang et al. [21] presented the state of the art in research for implementing high-performance computing (HPC) methods on edge devices equipped with parallel accelerators, focusing on software environments, including programming models and benchmark methods.
While benchmarking ML models on edge devices, optimizing the model's size and complexity becomes crucial. To address this, Li et al. [22] proposed an optimization framework for ML models targeting edge deployment. They introduced a pruning technique that selectively removes redundant parameters from the models, reducing their size without significant loss in accuracy, and demonstrated the effectiveness of their approach by benchmarking the optimized models on a range of edge devices.
Another aspect of benchmarking ML models on edge devices is the consideration of specialized hardware accelerators. Varghese et al. [8] investigated the performance of ML models on edge devices equipped with GPUs and neural processing units (NPUs). They compared the inference speed and energy efficiency of models, such as VGG-16 and DenseNet, across different hardware configurations, highlighting the impact of accelerators on edge computing.
In summary, benchmarking ML models on edge devices is an active research area. Researchers have focused on evaluating various models, optimizing their size and complexity, and considering the impact of specialized hardware accelerators. However, previous works did not consider training ML models and fine-tuning them with sensor data sets to show the resulting performance impact [17]. In addition, their performance evaluations used images taken with high-resolution cameras or in a controlled environment, which is far from the real conditions that edge devices might face [23].

B. Background on Image Classification With CNNs
CNNs are deep learning architectures popularized for image classification due to their robustness in detecting patterns by applying several filters, with Dense layers added at the top of the architecture to create a high-score classifier [20]. Over recent years, researchers have designed well-known CNN architectures trained on large data sets to obtain excellent feature extractors, while also reducing model complexity and size [19]. A CNN architecture mainly contains seven blocks: 1) input layer; 2) rescaling; 3) normalization; 4) zero padding; 5) Conv2D; 6) batch normalization; and 7) activation [24]. The last three blocks can then be repeated consecutively to build a robust CNN model [20].
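As a minimal Keras sketch of these seven blocks (the input size and filter count are illustrative assumptions, not values from the article):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 3))              # 1) input layer
x = tf.keras.layers.Rescaling(1.0 / 255)(inputs)          # 2) rescaling to [0, 1]
x = tf.keras.layers.Normalization(mean=0.0, variance=1.0)(x)  # 3) normalization (placeholder stats; adapt() to the data in practice)
x = tf.keras.layers.ZeroPadding2D(padding=1)(x)           # 4) zero padding
x = tf.keras.layers.Conv2D(32, kernel_size=3)(x)          # 5) Conv2D
x = tf.keras.layers.BatchNormalization()(x)               # 6) batch normalization
x = tf.keras.layers.Activation("relu")(x)                 # 7) activation
# Blocks 5)-7) are then repeated to deepen the feature extractor.
```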
We now briefly describe each CNN architecture used in this article as follows.

III. METHODOLOGY
To benchmark the aforementioned CNN models on edge devices, we define a novel ML pipeline, starting with the data collection stage from different IoT applications and ending with the model deployment on several edge devices. This is done by using specific frameworks to fit the model into the edge architecture and custom models to speed up the inference process. Fig. 1 shows each step of this ML pipeline.

Fig. 1. ML pipeline to benchmark ML models for deployment on edge devices.

A. Data Collection
Edge devices are widely used in IoT applications, such as smart farming, remote sensing, smart buildings/homes, wildlife detection, and industrial operations [2], [30], [31]. Based on these application domains, we found representative data sets on which to train ML models. The data sets we picked (their specifications are shown in Table I) are as follows.
TABLE I: IoT DATA SETS' SPECIFICATIONS

1) Leaf Illness Detection: This data set is used to diagnose plant diseases, and it was taken from experimental research stations associated with Land Grant Universities in the U.S. (Penn State, Florida, Cornell, and others). The database contains images of both infected and healthy leaves of 14 crops, such as tomatoes, apples, and others. In this article, only a set of tomato crop images was used [32], [33].
2) Waste Classification: This data set targets classification in buildings' trash bins, which is challenging; smart detectors can either recognize the trash belonging to the bin or have one hole to receive the trash. For the latter, smart detectors have an electronic device to deposit the garbage in its corresponding bag. This data set contains images of organic and inorganic trash [34], [35].
3) Birds Detection: Denmark has a very active bird spotting community that organizes bird spotting trips every month. One of the more active organizations is Dansk Ornitologisk Forening (DOF), the Danish ornithological association, which has more than 17 500 members spread across 13 local branches nationwide. DOF regularly updates the Danish Birds list [36], which reports all the bird species that have been spotted in Denmark (over 466 species so far) with information about how common a species is and which family it belongs to. In this data set, the most representative bird species are taken into consideration [37].
4) Solar Panel Cracks Detection: Solar modules are composed of many solar cells. The solar cells are subject to degradation, causing many different types of defects. Defects may occur during transportation, installation of the modules, or operation. The size of cracks may range from tiny to large cracks covering the whole cell. These cracks might create inactive regions that do not contribute to power production. This data set contains grayscale images taken with specialized cameras [38], [39].
5) Satellite Imagery: This global data set for land-cover classification is based on annual time-series data from three different satellites: a) Sentinel-1; b) Sentinel-2; and c) Landsat-8. The LandCoverNet data set consists of 300 ortho-images of size 100 × 100 km² (referred to as tiles) that are spread across six regions: a) Africa; b) Asia; c) Australia; d) Europe; e) North America; and f) South America. The 300 tiles are distributed among the regions based on their relative area. Each tile contains approximately 30 chips, each with a size of 256 × 256 pixels. The chips consist of a label mask that assigns all pixels to one of seven classes and over 100 time-series images of that area [40].

B. Data Preprocessing
Since the data come from different repositories, their size, format, and resolution might differ. Therefore, the data preprocessing step aims to standardize the data to fit the ML models' requirements. In addition, data might be rotated to provide better information to the classifier, or their resolution may have to be reduced to mirror real-trial images. Unbalanced data sets can be fixed by adding synthetic data, i.e., data augmentation.
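A minimal sketch of this step with Keras utilities, assuming an image folder layout; the path and resolution are illustrative, not the article's exact settings:

```python
import tensorflow as tf

IMG_SIZE = (224, 224)  # standardize resolution to the models' input size

# Load a labeled data set from disk and resize every image on the fly.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/waste_classification",   # hypothetical repository path
    image_size=IMG_SIZE,
    batch_size=32)

# Data augmentation to balance classes with synthetic samples.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.1),
])
train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```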

C. Software-Level Optimization
In this article, once the data sets were defined, the ML models were set up by adding a Flatten layer and a Dense layer to create the classifier on top of the CNN model. All models used this standard configuration to measure their capacity to classify images in new domains without adding a complex classifier, which would result in a heavier and more power-hungry model. Then, this model configuration can be fed with the original data set or with an improved version of it produced by a data preprocessing step, which rotates, flips, and readjusts images [18]. Finally, the model can be fine-tuned by unfreezing some of the last layers of the feature extractor to fit the neurons' weights to the new domain. As a result, we defined three different approaches to train ML models with IoT data sets and quantified whether these model configurations affect their performance or hardware requirements. The first approach trains the ML model with the original data set and the above-mentioned standard classifier. The second approach uses the original data set and unfreezes the last n layers of the feature extractor to fine-tune the model. The third approach uses the data preprocessing step with the data augmentation technique to create synthetic data and fine-tunes the model by unfreezing the last n layers [19]. The sketch below illustrates the second approach.
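This is a minimal sketch, assuming EfficientNet B0 as the feature extractor and illustrative values for the number of classes and unfrozen layers:

```python
import tensorflow as tf

NUM_CLASSES = 2   # set according to the data set's labels
N_UNFROZEN = 20   # the article unfreezes at most the last 20 layers

# Pretrained feature extractor with frozen ImageNet weights.
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False

# Standard classifier on top: a Flatten layer and a Dense layer.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Fine-tuning: unfreeze only the last N layers of the feature extractor.
for layer in base.layers[-N_UNFROZEN:]:
    layer.trainable = True

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```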
Once these models had been trained, the next step was to optimize them using quantization and pruning techniques, since the remaining optimization methods require retraining the whole model and assigning new neurons' weights, which is out of this article's scope. The models were tested with classification metrics, such as accuracy, recall, precision, and error, to select the suitable model and its corresponding software-level optimization technique.
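Such metrics can be computed with scikit-learn, for instance; a minimal sketch, where test_images and y_true are hypothetical arrays holding the held-out test images and labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Predicted class per test image, taken as the highest-probability label.
y_pred = np.argmax(model.predict(test_images), axis=1)

acc = accuracy_score(y_true, y_pred)
print("accuracy :", acc)
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("error    :", 1 - acc)
```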

D. Hardware-Level Optimization
Even if the ML models could be exported onto edge devices without any optimization, they are computationally expensive and power hungry. Therefore, the models were optimized to fit the hardware restrictions on processing, memory, power consumption, network usage, and model storage space [20]. Optimization techniques can also speed up the inference time. Quantization techniques reduce the precision of the variables that store the activations and neurons' weights: traditionally, those variables are stored as floating-point numbers, but they can also be stored in 16 bits or 8 bits [22]. On the other hand, magnitude-based weight pruning gradually reduces model weights during training to achieve model sparsity; pruned models can skip the zeroes during inference for latency improvements. These techniques can be applied after the model is trained (post-training quantization) or while training the model (quantization-aware training) [12].
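A minimal sketch of magnitude-based pruning with the TensorFlow Model Optimization toolkit, reusing the model and train_ds names from the earlier sketches; the 50% sparsity target and step counts are illustrative assumptions:

```python
import tensorflow_model_optimization as tfmot

# Gradually zero out the smallest-magnitude weights until 50% sparsity.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=schedule)

pruned.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])
# Short retraining pass that applies the pruning schedule.
pruned.fit(train_ds, epochs=2,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
# Remove the pruning wrappers before exporting the sparse model.
final = tfmot.sparsity.keras.strip_pruning(pruned)
```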
Our end-to-end pipeline from Fig. 1 starts with already trained models. Therefore, we target post-training quantization techniques in this article. More specifically, we optimized the ML models by using 1) float16 quantization, which converts weights to 16-bit floating-point values; 2) DRQ, which converts weights to 8-bit precision and mixes in floating-point kernels; and 3) IQ, an optimization strategy that converts 32-bit floating-point numbers to the nearest 8-bit fixed-point numbers [11]. These techniques can also be customized using specialized frameworks targeting particular accelerators, such as GPUs and TPUs.
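A minimal sketch of the three post-training techniques with the TensorFlow Lite converter, again reusing the model and train_ds names from the earlier sketches:

```python
import tensorflow as tf

# 1) Float16 quantization: weights stored as 16-bit floats.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
f16_model = converter.convert()

# 2) DRQ: 8-bit weights, floating-point kernels at run time.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
drq_model = converter.convert()

# 3) IQ: full 8-bit fixed point; sample inputs calibrate the value ranges.
def representative_data():
    for images, _ in train_ds.take(100):
        yield [tf.cast(images, tf.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
iq_model = converter.convert()
```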
Finally, Table II details the edge devices benchmarked in this article.

IV. RESULTS
This section first evaluates, on the software side, the ML models' classification accuracy to select the adequate model configuration for a high classification score. Then, on the hardware side, it shows the impact of the optimization techniques to fit the ML models onto edge devices, some of which have hardware accelerators. Finally, it evaluates the performance of the models after they are deployed on the edge devices.

A. Software-Level Optimizations
The ML models were trained on a high-performance computing (HPC) cluster with several computing nodes, each with two Intel Xeon Gold 6242 CPUs @ 2.80 GHz (64 cores), 384 GiB of RAM, and six Tesla V100 PCIe 32-GiB GPUs as hardware accelerators. Furthermore, the ML models were trained with the TensorFlow framework, version 2.11.0.
The first model configuration was set by using the original samples and adding a simple classifier on top of the neural architecture. Then, each model was trained on the several IoT data sets, changing the number of neurons of the last Dense layer according to each data set's labels. Fig. 2 shows the accuracy of the trained models.
The second configuration used the original samples, and the last n layers of the feature extractor were unfrozen to fine-tune the ML model. The neural architecture search (NAS) revealed that unfreezing the last 20 layers improved the accuracy score without overfitting the model. Therefore, in the second configuration, n is set to a value of at most 20, according to the CNN architecture and data set. Fig. 3 shows how every ML model except MobileNet V2 reached a better accuracy score compared to the configuration in Fig. 2. This model struggled to predict the proper labels when the classification task was not binary.
In the last configuration, we used a data augmentation technique to create synthetic samples and feed the ML models with different features to improve the classifier. This procedure took longer than expected, even though the data sets are considered small-to-medium size. In addition, the accuracy score decreased dramatically for Inception-V3, VGG-16, and ResNet-50 on the clean data sets. Conversely, on the bird detection data set, where images were taken in a wild scenario, the ML models performed better. Therefore, exploring the data set before applying the data augmentation step is highly recommended. Furthermore, the data augmentation stage was used to flip images and create balanced data sets, avoiding the use of rotation techniques. Fig. 4 shows the classification performance of this configuration.
Finally, we collected computational-efficiency metrics while the ML models were training, such as the execution time and RAM consumption. Fig. 5 shows the memory resources used when training each ML model for each application.

B. Hardware-Level Optimization
We decided to move forward with the second model configuration due to its high classification score and adequate execution time during training. Then, we used two approaches to optimize the ML models, as explained in Section III-D. First, the ML models were exported with the TensorFlow Lite framework without applying any optimization technique. Then, each model was optimized using post-training float16 quantization, DRQ, and IQ. However, float16 quantization did not significantly impact model size, accuracy, or execution time. Therefore, the following sections focus on DRQ and IQ. Finally, the ML models were pruned using the sparsity technique to skip the neurons' weights with zeros during inference.
Fig. 6 shows how the optimization techniques impact the accuracy when the neurons' weights are stored in different variable types. ResNet-50 is the heaviest model in terms of size; it needs at least 99 MB to be stored. Additionally, this model slightly varies in size according to each data set, with about 4 MB of variability, and its pruned version also weighs around 100 MB. On the other hand, with DRQ it weighs around 24 MB, and its IQ version is also 24 MB. Inception-V3 is the next heaviest model and, in its TensorFlow Lite (Tf-lite), DRQ, and IQ versions, weighs around 88 MB, 22.4 MB, and 22.6 MB, respectively, which shows that IQ shrinks this model only 3×.
On the other hand, VGG-16 weighs 59 MB, 14.5 MB, and 14.8 MB in its Tf-lite, DRQ, and IQ versions, respectively. This model does not vary when trained with different data sets. EfficientNet B0 is the second most lightweight model, needing 16 MB, 4.5 MB, and 5 MB, respectively. Finally, MobileNet V2 needs 8 MB, 2.5 MB, and 2.7 MB to be stored using the different optimization techniques. Overall, the DRQ versions of the models decrease their accuracy slightly in most cases. The IQ optimization method has a relevant impact on the models' accuracy, losing an average of 10 points.
The pruned versions weigh similarly to the Tf-lite versions and do not have an impact on the models' accuracy. It is important to mention that even though these tests used model configuration number two, fine-tuning did not impact the model size, only the accuracy (see Fig. 6).

C. Models at the Edge
The models were exported to each edge device to measure their execution time, CPU utilization, power, and RAM consumption [41].
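One way to collect the execution-time numbers on the devices is sketched below, assuming the lightweight tflite_runtime package and an illustrative model file name; it separates the load-plus-first-inference cost from steady-state inference, since the article argues that loading time matters for some IoT applications:

```python
import time
import numpy as np
import tflite_runtime.interpreter as tflite

# Load the model from disk and run one inference; this first cost can
# dominate in IoT applications that reload the model per request.
start = time.perf_counter()
interpreter = tflite.Interpreter(model_path="efficientnet_b0.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"],
                       np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
print("load + first inference: %.3f s" % (time.perf_counter() - start))

# Steady-state inference time, averaged over repeated invocations.
times = []
for _ in range(100):
    t0 = time.perf_counter()
    interpreter.invoke()
    times.append(time.perf_counter() - t0)
print("mean inference: %.4f s" % float(np.mean(times)))
```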
Without Hardware-Level Optimizations: First, the Tf-lite models were tested without any additional optimization, such as pruning or quantization. However, the Coral Dev Board did not support this model type since it only allows IQ models with specific frameworks. The Raspberry devices can use a lightweight Tf-lite library to make inferences. The Nvidia Jetson devices need the NVIDIA JetPack SDK, which provides a full development environment for hardware-accelerated AI-at-the-edge development; therefore, those devices can run the whole TensorFlow or PyTorch framework. In this test, we did not use custom instructions to speed up the inference process, only the standard framework. Fig. 7 shows how the Tf-lite versions of the models perform in terms of inference time. It can be seen that the VGG-16 models took longer than the rest, with the worst scenario being running this model on the Nvidia Jetson Nano. The RAM consumption was very similar on all devices, where each model needs, on average, 100 MB of RAM, which is affordable for each device. However, the models' power consumption varies widely depending on the device on which the model is deployed, as shown in Fig. 8. The Jetson devices have internal power consumption setups that establish a specific threshold above which the device cannot operate. Therefore, we used the standard configuration, where the Jetson TX2 cannot consume more than 5 W and the Jetson Nano consumes only 3 W. This power configuration on the Jetson Nano powers down two cores, which is why Tf-lite models perform poorly on this device. The Raspberry Pi devices do not have such power configurations, and they can use all cores if the model needs them. As a result, their power consumption is high; the Raspberry Pi Zero with the VGG-16 model can easily consume 6 W, even though it is the smallest device tested.
With Pruning: Regarding both execution time and power consumption, the pruning method had results similar to the Tf-lite models. Even though this technique proposes removing/skipping the zeroes during inference for latency improvements, the CPU still performs these computations in real-world conditions. Therefore, pruning does not affect the execution time or power consumption.
With DRQ: Compared to nonoptimized models, DRQ takes 3× longer to make inferences with MobileNet V2 and EfficientNet B0, and 5× to 8× longer with the rest of the ML models. This execution time affects CPU utilization, power, and RAM consumption. Therefore, this optimization technique is not suitable for edge devices.
With IQ: IQ is strongly recommended for small devices and microcontrollers. Furthermore, the Coral Dev device allows only this optimization technique to speed up the inference process. On the Nvidia Jetson devices with the JetPack installed, as mentioned earlier, the model can be optimized to boost the inference process by using TensorRT. Therefore, we tested the IQ optimization method on all devices with the standard configuration, and with customized instructions over the IQ model on the devices with hardware accelerators. Fig. 9 shows the execution time results using the standard IQ configuration, which reduces the inference time 2× for EfficientNet B0, Inception-V3, and ResNet-50. However, MobileNet V2 and VGG-16 took a similar time to their Tf-lite versions.
On the other hand, Fig. 10 shows how well the inference process is boosted by using custom instructions to allocate the model on the hardware accelerator. The inference took a few milliseconds for all models; MobileNet V2 is the fastest model on the Coral Dev device, needing only 2.9 ms to make an inference. Finally, Fig. 11 shows the power consumption with integer-quantized models. All devices decreased their power consumption considerably, at least 2×. The Raspberry Pi Zero still struggled with the VGG-16 model, but its power consumption was optimized with the rest of the models. Each model performed very well on the Raspberry Pi 4. The Coral Dev had the lowest power consumption since it does not run a whole Linux operating system, just a light version. The Nvidia Jetson devices kept a similar power consumption of 4 W, owing to their initial configuration.
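For reference, a minimal sketch of what such custom instructions look like on the Coral Dev Board, where an IQ model precompiled with Coral's edgetpu_compiler is bound to the Edge TPU through a delegate; the model file name is illustrative:

```python
import tflite_runtime.interpreter as tflite

# The delegate offloads all supported (8-bit) ops to the Edge TPU.
interpreter = tflite.Interpreter(
    model_path="mobilenet_v2_int8_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()
interpreter.invoke()  # supported ops now execute on the accelerator
```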

V. DISCUSSION
Given the results mentioned above, we dived into exploring how CPU utilization affects power consumption when the model uses hardware-level optimization techniques. In this scenario, we took the Raspberry Pi as the target device to explore how the Tf-lite and quantized models infer. Fig. 12 shows how the CPU cores were utilized when the Tf-lite versions of the ML models made inferences. Each model utilized all cores (the Raspberry Pi has four): EfficientNet B0, MobileNet V2, and Inception V3 used all cores below 80%, ResNet-50 used just one core at over 80%, and VGG-16 utilized all cores at 100%. Therefore, VGG-16 is clearly an architecture designed to be compiled on big servers or different hardware architectures. On the other hand, Fig. 13 shows that each ML model with the IQ technique applied needs one core at over 90% and two cores below 20%, and does not need the fourth. Therefore, since IQ models use 8 bits, they can be managed with one core while the OS runs on the others. Furthermore, this demonstrates that the power consumption varies only slightly between models, making execution time a key metric to consider for battery-powered devices. Finally, this analysis also opens the door to further research on managing cores efficiently in extreme cases where the battery cannot be charged often.
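As a starting point for such core management, the TF Lite interpreter already exposes a thread-count knob; a minimal sketch with an illustrative model name:

```python
import tflite_runtime.interpreter as tflite

# Bound the inference to a single core, leaving the other cores free for
# the OS (or free to be powered down on battery-constrained devices).
interpreter = tflite.Interpreter(
    model_path="efficientnet_b0_int8.tflite", num_threads=1)
interpreter.allocate_tensors()
```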

VI. CONCLUSION AND FUTURE WORKS
This article aimed at benchmarking ML models with pretrained weights, fine-tuning them afterward, and testing them on edge devices, some of which have hardware accelerators to speed up the inference process for deployment in IoT applications. The conclusions of this article are as follows.
1) We showed that, in IoT environments, fine-tuning the model is the suitable configuration since this technique increases the model performance significantly. Conversely, even when synthetic data is helpful, it increases the CPU utilization compared to fine-tuning the model, without a high impact on the model performance. We concluded that this happens because the public data sets, even when they claim to focus on IoT environments, are clean: the images were obtained in controlled environments, often far from the real scenarios that edge devices face.
2) None of the models studied in this work change their size significantly when they are trained with different IoT data sets or fine-tuned.
3) Quantization techniques not only shrink the model, they also optimize the CPU utilization and reduce the execution time. However, although both post-training float16 quantization and post-training DRQ can shrink the model 2× or 3×, when the model makes inferences, the execution time is far from the expected results since it takes longer than using the nonoptimized model.
4) On the other hand, the IQ method can reduce the memory footprint 4× and boost the inference process 2×. Furthermore, using customized instructions to run the model on the hardware accelerator can boost the performance 100× on the Coral Dev and Nvidia Jetson devices.
In future work, we will consider working with ML models and different optimization techniques to improve the results mentioned above in IoT applications.

Fig. 2. Model configuration 1: original samples and a simple classifier on top of the neural architecture.
Fig. 3. Model configuration 2: original samples, a simple classifier on top of the neural architecture, and fine-tuning of the last n layers.
Fig. 4. Model configuration 3: augmented data, a simple classifier on top of the neural architecture, and fine-tuning of the last n layers.
Fig. 7. Inference time of each model on edge devices when the models are in their Tf-lite version.
Fig. 8. Power consumption during inference of each model on edge devices when the models are in their Tf-lite version.
Fig. 9. Inference time of each model on edge devices with IQ optimization.
Fig. 10. Inference time of each model on edge devices with IQ optimization and hardware accelerators.
Fig. 11. Power consumption of each model on edge devices with IQ optimization.
Fig. 12. Raspberry Pi: utilization of each core on the CPU while running the Tf-lite versions of the models.
Fig. 13. Raspberry Pi: utilization of each core on the CPU while running the IQ-optimized versions of the models.