Efficient People Counting in Thermal Images: The Benchmark of Resource-Constrained Hardware

The monitoring of presence is a timely topic in intelligent building management systems. Nowadays, most rooms, halls, and auditoriums use a simple binary presence detector to control the operation of HVAC systems. This strategy is not optimal and leads to significant amounts of energy being wasted due to inadequate control of the system. Therefore, knowing the exact person count facilitates better adjustment to current needs and cost reduction. Vision-based people counting is a well-known area of computer vision research. In addition, with rapid development in the artificial intelligence and IoT sectors, power-limited and resource-constrained devices like single-board computers or microcontrollers are able to run even such sophisticated algorithms as neural networks. This capability not only ensures the tiny size and power effectiveness of the device but also, by definition, preserves privacy by limiting or completely eliminating the transfer of data to the cloud. In this paper, we describe a method for efficient occupancy estimation based on low-resolution thermal images. This approach uses a U-Net-like convolutional neural network that is capable of estimating the number of people in the sensor's field of view. Although the architecture was optimized and quantized to fit the limited memory of a microcontroller, the metrics obtained by the algorithm outperform other state-of-the-art solutions. Additionally, the algorithm was deployed on a range of embedded devices to perform a set of benchmarks. The tests carried out on embedded processors allowed the comparison of a wide range of chips and proved that people counting can be efficiently executed on resource-limited hardware while maintaining low power consumption.


I. INTRODUCTION
In recent years, solutions that use artificial intelligence (AI) algorithms have been developing rapidly. Both machine learning and deep learning applications are becoming a part of daily life, taking advantage of access to a vast amount of data [1] and becoming the foundation of smart cities [2], [3], [4]. Furthermore, with the era of Industry 4.0, the Internet of Things (IoT) sector is also quickly expanding [5], [6], with 14.6 billion connected devices in 2021 [7]. Low-cost, tiny microcontrollers (MCUs) and single-board computers (SBCs) are increasingly popular in both home and industrial applications [8], [9], and one of the crucial reasons is the increase in their computing power [10]. Related to the above themes is the paradigm of edge computing, whose principal concept is to store data and perform computations close to the source of data [11]. Following [12], this idea is an extension of cloud computing intended for small-scale, real-time intelligent analysis, which has to meet the strict time criteria of local services. This approach is facilitated by processing data onboard without uploading them to the cloud, thereby considerably reducing latency and improving the bandwidth efficiency of the network. Therefore, applications that use edge computing have plenty of potential benefits, which Sanchez-Iborra et al. [13] recognize as energy efficiency, low cost and latency, system reliability, and data security. Additionally, contemporary MCUs enable the onboard execution of more and more sophisticated algorithms, even ones such as neural networks.
Such use cases are associated with the TinyML term, which refers to the machine learning application on inexpensive and resource-constrained devices [14], [15]. Shafique et al. [16] enumerate the TinyML spectrum of applications as healthcare, surveillance and security, smart things (always-on wake modules), industrial monitoring and control, finance and administration. The listed leading use cases are: keyword spotting, image classification, visual wake words, object detection, anomaly detection, semantic segmentation, motor control, gesture recognition, forecasting, face recognition, and activity detection [17]. Amato et al. [18] presented an efficient method to monitor car parking occupancy based on visual detection with a convolutional neural network (CNN). The application was deployed on a Raspberry Pi platform with a camera that enables the supervision of up to fifty parking spaces. T'Jonck et al. [19] proposed an accelerometer-based non-invasive way of monitoring seniors' specific movements. Their strategy profits from the CNNs architecture attaining superior classification results. Furthermore, the neural network was deployed and successfully validated on the nRF52 development kit. However, MCUs have several constraints that require consideration when creating a solution based on them. Dutta et al. [20] specify challenges of TinyML applications such as inconsistent power usage, memory limitations, processor power, or cost reduction in multisensor applications. Besides these aspects, Banbury et al. [21] also indicate hardware and software heterogeneity as vital factors, which preclude straightforward and direct benchmarking of MCUs, due to different architectures (diverse performance, power, and capabilities) and model deployment (hand-coding, code generation, and ML interpreters).
In this paper, we propose and test an occupancy counting method based on low-resolution thermal images. The algorithm utilizes a lightweight encoder-decoder neural network architecture [22] with standard layers and operations. The approach used for people counting shares similarities with other state-of-the-art solutions. Raykov et al. [23] present the first noteworthy publication. In their work, motion patterns were extracted from raw sensor data utilizing an infinite hidden Markov model (iHMM). Those patterns were then used to infer the number of occupants with statistical regression methods. In [24], Leech et al. proposed a Bayesian machine learning algorithm to estimate room occupancy using a single analog passive infrared (PIR) sensor. The method was implemented and deployed on a microcontroller unit, and with a typically sized battery the aforementioned system could run continuously for more than a month. Abedi and Jazizadeh [25] introduced a deep learning method that returns binary information about presence in the measurement area without specifying the number of people. Another approach related to occupancy counting is described by Metwaly et al. [26]. In their application, the task is defined as a classification with classes corresponding to consecutive natural numbers. This solution has a significant drawback in the form of an upper range limit, which makes it impossible to predict a count higher than the number of classes provided during the training phase. Kraft et al. [27] introduced a method based on density estimation. This concept is borrowed from the crowd counting task [28], which is typically implemented using an RGB sensor.
The presented work extends research conducted in [27] with examinations related to algorithm optimization and quantization for edge and embedded applications using various methods, frameworks, and tools. In addition, validation, performance tests, and comparison of low-cost hardware with resource limitations are significant parts of this study, as in [26] and [29]. The principal contributions of this work are as follows:
• the investigation of the influence of the shallow and plain encoder-decoder network on training and test processes,
• the comparison to state-of-the-art algorithms that shows the improvement of metrics and the impact of the proposed solution,
• the implementation of the model on a range of low-cost, resource-constrained hardware devices enabling sufficiently fast analysis of thermal images,
• the comparison of performance measurements and power consumption among the tested hardware.
The paper is organized as follows: Section 2 describes the Thermo Presence dataset and the method for density map creation based on the corresponding thermal images. Additionally, the section defines the neural network architecture with training process specifications and includes the benchmark hardware characteristics. Section 3 describes the evaluation methodology and details of the hardware efficiency measurements. Section 4 presents the results and a comparison to other state-of-the-art methods in terms of accuracy, processing speed, and efficiency. Finally, Section 5 summarizes the work and characterizes the pathways for further development.

A. DATASET
This research uses the Thermo Presence dataset published as part of the [27] research paper. Thermal images were collected using a single-board computer, a Raspberry Pi 4. The measurement system setup contained an RGB camera and a thermal sensor. The camera had an IMX219 CMOS sensor and was used to control the experiment and annotate the data. Meanwhile, the core part was an MLX90640 far-infrared thermal sensor with a 32 × 24 pixel thermal IR array. The dataset's images were gathered at a 2 Hz frame rate. According to the datasheet [30], the measured temperature ranges from −40 °C to 300 °C with ±1 °C resolution for each pixel. The sensor communicates over the I²C interface, which is an additional advantage because the interface is widely available in microcontrollers.
The Thermo Presence dataset includes thermal images registered in an office space with corresponding annotations of persons' locations. The collection consists of 13,634 labeled examples in total, recorded at several distinct locations. The number of people in each frame was in the range of 0 to 5. The recorded images were grouped into sequences and then split according to the dataset distribution file without stratifying the number of people. The detailed data distribution, with a division into training, validation, and test sets, is presented in Table 1.
Every frame in the dataset incorporates information about each person's center X and Y coordinates. Like the authors of [27], the ground truth mask is constructed by setting a single pixel, corresponding to an individual's location, to a maximum value. Subsequently, the occupancy density map was created by convolving the above-mentioned image with a 2D Gaussian mask (σ = 3). Due to the relatively constant perspective across the data collection, the standard deviation was selected empirically, based on observations, to fit the blob representing a person in the thermal image as thoroughly as possible. The procedure creates a Gaussian mixture distribution with maxima at the central locations of people and a mask radius equal to 3σ, which covers all significant values of the mask. The visualization of a single Gaussian mixture distribution with annotated pixel values is presented in Figure 1.
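The mask-and-smooth procedure above can be sketched in a few lines. The snippet below is an illustrative sketch, not the authors' original code: the peak value written into the mask is assumed to be 1.0 here, and `scipy.ndimage.gaussian_filter` stands in for the explicit convolution with a 2D Gaussian mask (σ = 3).

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def make_density_map(centers, shape=(24, 32), sigma=3.0):
    """Build a ground-truth density map for one 32x24 thermal frame.

    A single pixel per person center is set to a peak value (assumed 1.0
    here) and the mask is then smoothed with a 2D Gaussian, sigma = 3.
    """
    mask = np.zeros(shape, dtype=np.float32)
    for row, col in centers:
        mask[row, col] = 1.0
    # Truncation at the image border makes people near the edge contribute
    # slightly less total mass than people in the center.
    return gaussian_filter(mask, sigma=sigma, mode="constant")
```

With this choice of peak value, a person fully inside the frame contributes a total mass close to 1; any other peak value only rescales the maps, which the α normalization described in the dataset section absorbs.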
The output density maps have the same dimensions as the thermal input images. The aggregated value of a single Gaussian output varies between 45 and 55, depending on whether the person is in the center or at the edge of the sensor's field of view. The average value, adopted as a factor in further computations, is calculated using the training data according to equation 1:

α = (∑_{h=1}^{n} ∑_{i=1}^{24} ∑_{j=1}^{32} y_hij) / (∑_{h=1}^{n} l_h)    (1)

where the numerator describes the sum of the values of the created ground truth masks, and the denominator is the sum of the corresponding reference people counts. In the above equation, y_hij denotes the i-th row and the j-th column of the h-th ground truth mask from the training set, whereas l_h is the ground truth people count for the h-th image. The n defines the number of samples in the training set. For the dataset distribution adopted in the study, this value is equal to 51.35. The above-defined coefficient is used in the evaluation phase: the overall person count in the output mask is calculated by adding all elements of the predicted density map and dividing by the aforementioned α value.
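The α coefficient and the count estimate derived from it can be sketched as follows; the helper names are hypothetical, but the sums mirror the numerator and denominator described above:

```python
import numpy as np


def compute_alpha(masks, counts):
    """Equation 1: total mass of all training ground-truth masks divided by
    the total ground-truth people count over the training set."""
    return sum(float(m.sum()) for m in masks) / sum(counts)


def estimate_count(pred_map, alpha):
    """Sum all elements of a predicted density map and divide by alpha."""
    return float(np.sum(pred_map)) / alpha
```

For example, a training set whose masks hold a total mass of 100 for 2 annotated people yields α = 50, and a predicted map summing to 150 is then read as 3 people.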
The example visualization of the density map creation steps for the ground truth example is depicted in Figure 2.

B. NEURAL NETWORK ARCHITECTURE
The density map reconstruction, with the locations of people, requires a neural network with an encoder-decoder architecture. Moreover, the principal idea of this study is to deploy an end-to-end people count estimator on edge AI and TinyML devices. This approach requires shifting the main focus from model metrics to hardware-related parameters such as size, number of parameters, and inference time. Therefore, the U-Net [22], a convolutional neural network, was taken into consideration as a suitable component capable of being customized. Initial experiments have shown that even smaller models achieve satisfactory outcomes while allowing a considerable reduction of model complexity and the number of trainable parameters. The selected architecture is detailed in Figure 3. Since the input is a low-resolution, monochromatic image, the encoder part is much shallower than the original one, with a single-channel input. The neural network structure consists of convolutional, maximum pooling, upsampling, and concatenation layers, all of which can be translated to a wide range of embedded devices. Hyperparameter tuning was applied using a grid search algorithm to optimize model results. Finally, all convolution layers have a kernel size of 3×3, a stride of 1×1, and 1 pixel of padding to keep the output feature map size the same as the input. The initial number of filters N in the first block of convolutions is equal to 16, and the remaining filters follow the rules depicted in Figure 3. The convolution layers are followed by Rectified Linear Unit (ReLU) activation functions.
The downsampled data representation is obtained by using a maximum pooling operation. The function uses a 2×2 pooling window and a 2 × 2 stride that changes the size of the output feature map by half on both axes. The resulting values are the maximum numbers selected from the appropriate pooling windows.
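The 2×2 max pooling step described above can be expressed compactly with array reshaping. The following is a minimal NumPy sketch, independent of any particular deep learning framework:

```python
import numpy as np


def max_pool_2x2(x):
    """2x2 max pooling with a 2x2 stride: halves both spatial dimensions,
    keeping the maximum of each non-overlapping 2x2 window."""
    h, w = x.shape
    # Crop odd trailing rows/columns, then group pixels into 2x2 windows
    # and reduce each window to its maximum value.
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```

Applied to a 24×32 feature map, this yields a 12×16 output, matching the halving of both axes described above.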
In the training phase, the Adam optimizer [31] was utilized: a computationally efficient, straightforward, and relatively easy-to-configure algorithm, with an initial learning rate of 1e-3. The mean absolute error (MAE) was chosen as the loss function. The aforementioned modules were implemented using the TensorFlow library. For training purposes, an NVIDIA TITAN Xp with 12 GB of memory was utilized, configured with CUDA Toolkit 11.1. The batch size was set to 128 due to the small size of the input data and low model complexity. The maximum number of epochs was equal to 100. However, an early stopping callback, with patience equal to 10, was used to monitor loss function changes on the validation dataset and interrupt training if no improvement was observed. Finally, the model was trained for 50 epochs, which took about 6 minutes for the above setup.

C. EVALUATION HARDWARE
With regard to potential applications in heating, ventilation, and air conditioning (HVAC) systems, inference hardware must have tiny dimensions, low weight, and an energy-efficient profile. Therefore, the following devices were chosen:
• Raspberry Pi 4B with a quad-core Cortex-A72 processor is one of the most popular single-board computers in use. In the utilized configuration, the device has 2 GB of RAM and runs models in the TensorFlow Lite format.
• Coral USB Accelerator as an Edge TPU coprocessor enables high-speed machine learning inference. It provides additional computing power to external systems and cooperates, for example, with the Raspberry Pi. The accelerator has an ARM Cortex-M0+ and a custom application-specific integrated circuit (ASIC), the Edge TPU. The coprocessor can perform 4 trillion operations per second (TOPS). The neural network was also implemented using the TFLite Micro framework with additional support of the ESP-NN library, which contains optimized implementations of kernel functions. The summarized information about the evaluation hardware is included in Table 2. Nevertheless, to run sophisticated and computationally intensive algorithms like neural networks on resource-constrained hardware, the model has to be optimized beforehand. For this purpose, conversion and deployment tools are utilized. The most commonly used include:
• TensorFlow Lite (TFLite) [32] provides a set of tools to infer machine learning on-device. Through hardware acceleration and model optimization, TFLite achieves high performance on mobile, embedded, and IoT devices. Efficiency on low-cost platforms is obtained in five key aspects: model latency, data privacy, no connectivity requirement, reduced size, and lowered power consumption. TensorFlow models can be easily converted to the TensorFlow Lite FlatBuffers format. This form of the neural network is efficient, portable, and supported by several programming languages.
• OpenVINO [33] is an open source toolkit maintained by Intel. It contains tools to optimize and deploy AI algorithms on a wide range of Intel devices, from the cloud to the edge. The toolkit supports hardware such as vision processing units (VPUs), central processing units (CPUs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs). The goal of the package is to boost algorithm performance in applications related to either computer vision or natural language processing and automatic speech recognition.
It officially supports conversion to the Intermediate Representation (IR) format from several popular deep learning frameworks, such as TensorFlow, PyTorch, or ONNX. The toolkit's principal advantage is the possibility to infer the same IR model on all supported platforms without significant code changes.
• TensorFlow Lite Micro (TFLite Micro) [34] is part of the TensorFlow project created to bring deep learning algorithms to microcontrollers. Its designers assume that neural networks should run on devices with only a few kilobytes of memory and require neither operating system support nor dynamic memory allocation. The package supports models as a C byte array that can be converted from TensorFlow Lite using standard system tools like the xxd utility in the Linux operating system.
• X-CUBE-AI [35] is an expansion pack that is part of the STM32Cube AI ecosystem, extending the STM32 CubeMX tool. The main feature of the toolkit is the automatic conversion of pre-trained deep learning and machine learning models and their integration into a project as an optimized library. The tool enables several ways of validating algorithms, from evaluation on a desktop PC using an emulated environment to performance measurement on STM32 devices with pre-configured code templates. X-CUBE-AI natively supports deep learning frameworks such as TensorFlow Keras, TensorFlow Lite, and all formats previously converted to ONNX. The package supports 8-bit model quantization and enables easy portability across various series of STM32 microcontrollers.

III. EVALUATION

A. NEURAL NETWORK EVALUATION METRICS
The crucial part of the comparison with other state-of-the-art methods was selecting the set of appropriate evaluation metrics. For this purpose, the mean absolute error (MAE, equation 3) and the mean square error (MSE, equation 4) were chosen:

MAE = 1/(24 · 32 · n) ∑_{h=1}^{n} ∑_{i=1}^{24} ∑_{j=1}^{32} |y_hij − ŷ_hij|    (3)

MSE = 1/(24 · 32 · n) ∑_{h=1}^{n} ∑_{i=1}^{24} ∑_{j=1}^{32} (y_hij − ŷ_hij)²    (4)

where n defines the number of samples in the test set, 24 indicates the arrays' height, and 32 their width. Meanwhile, y_hij and ŷ_hij denote the value related to the i-th row and the j-th column of the h-th ground truth mask and the predicted occupancy map, respectively. The approximate number of people in a frame can be calculated, according to Section II-A, as the sum of elements in the output array divided by the constant α, which represents the aggregated value of a single Gaussian output. The mathematical representation is given in equation 5:

count_pred = (1/α) ∑_{i=1}^{24} ∑_{j=1}^{32} ŷ_ij    (5)
where ŷ_ij corresponds to the element in the i-th row and the j-th column of the estimated density map. Furthermore, with the use of information on the predicted count, three additional metrics were adopted that deal with the person count directly:

Counting MAE = (1/n) ∑_{h=1}^{n} |count_gt_h − count_pred_h|    (6)

Counting MSE = (1/n) ∑_{h=1}^{n} (count_gt_h − count_pred_h)²    (7)

Counting MRAPE = (100%/n) ∑_{h=1}^{n} |count_gt_h − count_pred_h| / count_gt_h    (8)

where n specifies the number of samples in the test set, while count_gt_h and count_pred_h are, respectively, the ground truth and the estimated count (according to equation 5) for the h-th image. Furthermore, in relation to publications with state-of-the-art solutions, standard classification metrics, such as Accuracy and F1 Score, were also calculated during the evaluation phase and are presented in the comparison.
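The three counting metrics follow directly from these definitions; a minimal sketch is given below. Note that skipping frames with a zero ground-truth count in the MRAPE term is an assumption made here, since the relative error is undefined for empty frames:

```python
import numpy as np


def counting_metrics(gt_counts, pred_counts):
    """Counting MAE, MSE, and MRAPE over per-frame person counts.

    MRAPE is computed only on frames with a non-zero ground-truth count
    (an assumption; the relative error is undefined for empty frames).
    """
    gt = np.asarray(gt_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    err = pred - gt
    mae = float(np.mean(np.abs(err)))
    mse = float(np.mean(err ** 2))
    nonzero = gt > 0
    mrape = float(100.0 * np.mean(np.abs(err[nonzero]) / gt[nonzero]))
    return mae, mse, mrape
```

For instance, ground-truth counts [2, 4] against predictions [3, 4] give a Counting MAE and MSE of 0.5 each, and a Counting MRAPE of 25%.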

B. HARDWARE PERFORMANCE EVALUATION
IoT solutions require high energy efficiency due to limited power supply options and often difficult-to-reach work locations. With this in mind, the devices' current and power consumption were measured. For this purpose, a Mooshimeter multimeter with a 9 Hz measurement log rate was utilized.
Another vital aspect is the size of the model. Both microcontrollers and computing accelerators have limited memory, which precludes the use of complex neural network architectures. Therefore, memory usage optimization is a crucial aspect that can be improved during conversion principally by weights quantization. The detailed summary, including model size, depending on the used framework and data type, is given in Table 4.
In IoT applications, devices spend most of the time in a sleep mode waiting for a cyclic interrupt to trigger their action. Moreover, in most cases, there is no need for continuous prediction, and periodic operation is sufficient. In our particular case, the person count does not change abruptly, and the control processes using the count as an input variable have large inertia. Thus, instead of a continual approach, two work scenarios were chosen to analyze the efficiency of the benchmarked hardware. The total energy consumption was calculated for both 30-second and 5-minute measurement cycles. For this purpose, equation 9 was introduced:

W_cycle = T_infer · P_infer + (T_cycle − T_infer) · P_sleep    (9)

TABLE 3. The metrics obtained compared to other state-of-the-art solutions. Mean absolute error, mean square error, and their rounded versions (in the scope of people count) for direct comparison with [26] and [27].
where T_infer refers to the inference time of the device, P_infer specifies the average power consumption during the prediction phase, T_cycle defines the cycle time, and P_sleep is the device's power consumption in sleep mode. The Raspberry Pi, in general, does not have any selectable power modes; thus, its energy consumption was optimized according to the instructions described in [36]. However, to fairly compare the chosen work cycles, the daily energy consumption for both options was calculated according to equation 10:

W_daily = d_cycles · W_cycle    (10)
where d_cycles denotes the number of cycles per day and W_cycle defines the energy consumption during one operation cycle.
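Equations 9 and 10 translate directly into code. The sketch below uses illustrative values only (1 W during a 1-second inference, 0.1 W sleep power, a 30-second cycle); none of these numbers come from the measurements reported in the paper:

```python
def cycle_energy(t_infer, p_infer, t_cycle, p_sleep):
    """Equation 9: energy per work cycle [J].

    W_cycle = T_infer * P_infer + (T_cycle - T_infer) * P_sleep
    """
    return t_infer * p_infer + (t_cycle - t_infer) * p_sleep


def daily_energy(t_cycle, w_cycle):
    """Equation 10: cycles per day times the energy of one cycle [J]."""
    d_cycles = 24 * 3600 / t_cycle
    return d_cycles * w_cycle
```

With the illustrative values, a single 30-second cycle costs 1·1 + 29·0.1 = 3.9 J, and 2880 such cycles per day total roughly 11.2 kJ.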

IV. RESULTS
The introduced neural network achieves good results, even though the architecture is straightforward and adapted to memory-limited hardware. For comparison with the other state-of-the-art methods proposed in [26] and [27], respectively, fully connected and encoder-decoder structures were implemented and evaluated on the Thermo Presence dataset, using the same training, validation, and test splits. However, the architecture presented in [26] as the top performer, one dense layer with 512 units, achieves very poor results and turns out to be insufficient for comparison on a more diversified dataset such as the one used in this examination. Therefore, a more complex architecture, a fully connected model with two hidden layers of 512 nodes each, was selected. Meanwhile, the U-Net structure described in [27] was implemented according to the diagram presented in the publication. The complete comparison, including mean absolute error (MAE), mean square error (MSE), counting mean absolute error (Counting MAE), counting mean square error (Counting MSE), counting mean relative absolute percentage error (Counting MRAPE), classification metrics, and the number of neural network parameters, is shown in Table 3. The introduced method outperformed the other approaches in terms of all metrics while maintaining a notable decrease in size, number of parameters, and model complexity. Moreover, in regard to the person count metrics, the developed encoder-decoder model achieves two times better outcomes than the network proposed in [27] and exceeds the solution based on fully connected layers by over ten times, as measured by the proposed prediction errors.
To demonstrate the presented architecture's accuracy in count prediction, the confusion matrix with the percentage results of the estimated occupancy counts is depicted in Figure 4. The vertical axis describes the actual values, whereas the horizontal axis refers to the estimated number of people. The diagonal contains correctly classified values, while the remaining fields refer to misclassifications, with an indication of the classes between which they occurred. Because of the class imbalance in the dataset, numerical values were normalized row-wise relative to the number of ground-truth class samples and presented as a percentage classification score. The aforementioned chart confirms the model's ability to classify the number of people for each ground-truth class corresponding to the exact person count from the dataset.
The trained model was optimized and quantized using the tools and frameworks described in Section II-C. In most cases, the INT8 data type was the target format of the model's weights, inputs, and outputs. However, the Intel NCS 2 with the Movidius Myriad X does not support integer operations; thus, the floating-point types FP32 and FP16 were utilized. Moreover, for several devices, both integer and floating-point representations were evaluated to highlight the contrast between them in terms of inference time and metric values. In the next step, the optimized implementations were deployed on the hardware and evaluated on the dedicated test set. The above steps were repeated for each device under the same environmental conditions, including supply voltage. For time measurements only, built-in platform functions were used.
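The INT8 conversion mentioned above maps float tensors to 8-bit integers using a per-tensor scale and zero point. The sketch below illustrates the underlying asymmetric affine scheme in plain NumPy; it is a conceptual model of what converters of this kind do, not the actual TFLite or X-CUBE-AI implementation:

```python
import numpy as np


def quantize_int8(x):
    """Asymmetric affine INT8 quantization of a float tensor.

    q = clip(round(x / scale) + zero_point, -128, 127), where the real
    range [min, max] is extended to include zero so that 0.0 maps exactly.
    """
    lo, hi = float(x.min()), float(x.max())
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map INT8 values back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale
```

The round trip introduces an error of at most one quantization step (one scale unit) per element, which is the source of the small metric differences between the INT8 and floating-point variants reported in the evaluation.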
The results of the performance and metrics evaluation for each device are presented in Table 5. Meanwhile, Table 6 shows the power consumption measured for three states:
• inference - the device was receiving input data, predicting, and sending the output density map through the UART or serial port,
• idle - the hardware was active but did not perform any user-defined tasks or instructions,
• efficient mode - the microcontrollers were put into sleep mode, whereas the Raspberry Pi had its power consumption optimized.
During the hardware evaluation phase, specifically the inference step, the current consumption measurements were gathered for all microcontrollers included in the benchmarks at the same supply voltage. The characteristics are shown in Figure 5. The measurement time on the x-axis is limited to 180 seconds to highlight the peaks related to the prediction stage. Moreover, it is worth mentioning that the scales and ranges of the y-axis are individual for each subgraph.
According to the last part of Section III-B, the efficiency indicators of the operation cycle were calculated. Figure 6 presents bar plots representing the energy required to perform a single work cycle. The left subdiagram refers to a 30-second operation period, while the other depicts the energy consumption for a single 5-minute series. However, a better view of the above approaches, in terms of daily energy consumption, is presented in Figure 7. The subframes show the required energy for one full day of operation as follows: (a) 30-second and (b) 5-minute cycles. As can be seen by comparing the graphs, the energy consumption is slightly lower for the second duration for almost all devices.

V. CONCLUSION
Accurate occupancy estimation is one of the main aspects of efficient HVAC system management in Smart Buildings. The proper configuration of heating, ventilation, and fans not only ensures better conditions in offices and rooms but also reduces the cost of installation usage. Moreover, with the utilization of embedded devices and edge processing, the people count can be estimated onboard, reducing the use of the cloud and sending only the final number of persons. In addition, this approach protects people's privacy in two ways: firstly, by estimating the count using low-resolution thermal images, a characteristic that prevents the gathering of personal data; secondly, the collected frames are processed onboard and removed from the device memory after inference. Thus, no image is recorded or streamed for further analysis.
This article presents an optimized version of the U-Net convolutional neural network for the density estimation of people's presence. The main objective of this work was to benchmark the algorithm's implementation on a set of resource-limited devices. The conducted tests and benchmarks have shown that occupancy counting can be processed onboard on resource-constrained hardware with a limited power supply. The research carried out has proven that even microcontrollers based on low-energy chips, such as the nRF52840 or ESP32-WROOM-32, are able to run a relatively fast density estimation algorithm whose current and power consumption do not exceed 30 mA and 150 mW. On the other hand, the MCUs with an Arm Cortex-M7 core (the Arduino Portenta H7 and STM32 H745ZI Nucleo-144 boards) can achieve close to 20 independent inferences per second, maintaining a consumption of less than 270 mA and 1350 mW at the peak. Using the aforementioned 30-second or 5-minute work cycles, it is possible to substantially reduce energy consumption by using the devices' efficient mode, often known as sleep or deep sleep mode. The conducted tests show that the LOLIN32 microcontroller equipped with the ESP32-WROOM-32 chip performs best, obtaining respectively 0.92 and 8.62 J per cycle and 2.64 or 2.48 kJ daily. Moreover, this is nearly three times less than the second microcontroller on the list, the Arduino Nano 33 BLE Sense with the nRF52840 unit, and approximately 22 times less than the fastest and most powerful of the benchmarked MCUs, the STM32 H745ZI on a Nucleo-144 board. From another point of view, the tested Raspberry Pi with computation accelerators may be a suitable option for more complex measurement systems where the power supply is not as important as the execution time. The measurements show that inexpensive, consumer-grade microcontrollers are capable of executing deep learning workloads with optimized neural network models.
Since inference time is not critical in a wide range of IoT or edge applications, such devices are a viable option to consider, which was proven by evaluating a range of microcontrollers and microprocessors using real-life data sourced from a non-trivial application. Moreover, the microcontrollers offer good cost-effectiveness both in terms of hardware price and operating costs, outperforming faster but more complex computational architectures in the evaluation.
Although the proposed approach offers accurate results and satisfactory performance, it does not eliminate all threats to correct operation. Significant changes in camera placement may affect the measurements, worsening the results of the algorithm. What is more, the low resolution of the thermal camera, combined with a relatively large number of people in the sensor's field of view, may cause a saturation problem that could strongly affect the system's capabilities. Another threat is heat sources close to typical human temperature. Despite the fact that the authors of the dataset used additional heat sources to increase the diversity of the data as well as the robustness of the algorithm, this topic is not fully covered, and false-positive results remain possible. Hence, further development will include broadening the research on person detection and counting methods, e.g., using RGB sensors or testing algorithms on images containing a higher number of people. Another task concerns installing the system as an HVAC perception unit and studying its impact on the behavior and performance of the system. The aforementioned issues will be addressed in further research.

ACKNOWLEDGMENT
The code repository related to this publication is open source and available at https://github.com/PUTvision/thermohardware-benchmark.