Performance-Aware NILM Model Optimization for Edge Deployment

Abstract—Non-Intrusive Load Monitoring (NILM) describes the extraction of the individual consumption pattern of a domestic appliance from the aggregated household consumption. Nowadays, the NILM research focus has shifted towards practical NILM applications, such as edge deployment, to accelerate the transition towards a greener energy future. NILM applications at the edge eliminate privacy concerns and data transmission-related problems. However, edge resource restrictions pose additional challenges to NILM. NILM approaches are usually not designed to run on edge devices with limited computational capacity, and therefore model optimization is required for better resource management. Recent works have started investigating NILM model optimization, but they utilize compression approaches arbitrarily, without considering the trade-off between model performance and computational cost. In this work, we present a NILM model optimization framework for edge deployment. The proposed edge optimization engine optimizes a NILM model for edge deployment depending on the edge device's limitations and includes a novel performance-aware algorithm to reduce the model's computational complexity. We validate our methodology on three edge application scenarios for four domestic appliances and four model architectures. Experimental results demonstrate that the proposed optimization approach can lead to up to a 36.3% average reduction in model computational complexity and a 75% reduction in storage requirements.

I. INTRODUCTION
Non-Intrusive Load Monitoring (NILM) refers to the process of analyzing the aggregated energy consumption of a residential building to infer the individual consumption pattern of domestic appliances [1]. In recent years, NILM approaches have transitioned from statistical analysis methods to deep learning techniques due to the latter's superior performance. However, most deep learning NILM approaches are designed to be deployed on a central server instead of performing inference on the edge, due to their increased computational needs [2], [3], [4]. This design methodology assumes data transfer from the data source, i.e., the domestic house, to an external entity and impacts the wider deployment scalability of NILM frameworks. Central data storage increases costs for the service provider, since the accumulation of large amounts of data requires an expanded storage infrastructure. In addition, performing inference centrally usually requires more computational resources, thus increasing the energy required to run the service and the carbon footprint of the solution. Finally, apart from the heavy reliance on a stable Internet connection for data transmission, privacy concerns arise, since sensitive customer information can be inferred [5]. It can therefore be argued that a transition to deploying NILM algorithms on the edge (i.e., at each domestic house equipped with a smart meter and a device with restricted processing power) is a more attractive solution that alleviates the issues of central data processing.

A. Our Contribution
In this study, we propose a performance-aware NILM optimization framework for edge deployment that takes into account the edge device characteristics. Our approach considers multiple hardware limitations and, depending on the deployment scenario, employs a different model optimization technique to preserve the limited edge device resources, resulting in an efficient resource management scheme. The main contributions of our work are summarized below:
• We explore the impact of model optimization on various NILM architectures (CNN, LSTM, Transformer) for different appliances, and we experimentally show that, depending on the application scenario, a different level of model optimization for resource management is tolerable from a model performance perspective.
The rest of the paper is organized as follows. Section II presents an overview of the existing work on deploying NILM algorithms on edge devices. Section III mathematically formulates the problem of performance-aware NILM model optimization, whereas Section IV describes the proposed NILM edge optimization framework in detail. Finally, Section V presents the experimental setup and results, while Section VI summarizes the main outcomes of the paper and potential future steps.
II. RELATED WORK
Recently, progress has been made towards the deployment of NILM and other energy-related applications on edge devices, either as part of a Home Energy Management System (HEMS) [31] or as standalone applications [32]. NILM edge inference does not require the transmission of data to an external server and therefore alleviates the aforementioned issues of central data processing. Approaches to deploying NILM models on the edge have been proposed, both on embedded computers, such as the Raspberry Pi, and on more resource-constrained devices. Deployment on a Raspberry Pi has been proposed [6], [7], but the deployed models either require additional metadata, such as room occupancy, or utilize high-frequency features for energy disaggregation, which increases data acquisition costs. In addition, NILM models on more resource-constrained devices, such as microcontrollers [8], [9] and FPGAs [10], have also been proposed, but the respective models only consider appliance state classification instead of regression and require high-frequency data to operate.
Despite the great success of deep learning in diverse applications, neural networks often possess a vast number of parameters, leading to significant challenges in deploying deep learning systems on resource-limited devices [33], [34]. The deployment of sensing devices with higher computational power has been investigated [35], [36], but such devices have high cost and high power demands, making them impractical for commercialization [8]. Therefore, edge inference requires compression and optimization of NILM deep learning models to account for the limited computational resources. Quantization, parameter pruning, low-rank factorization, and knowledge distillation [37], as well as combinations of these techniques [38], are the main approaches employed in the literature. Even though NILM-related deep learning applications have utilized state-of-the-art architectures [30], [39], [40], research on the constraints and methodology for deploying NILM deep learning models on edge devices remains limited. In [11], quantization of a sequence-to-point (seq2point) convolutional neural network (CNN) [25] from 32-bit floating point weights to 8-bit integer weights is applied. The application of multiple pruning approaches to the same seq2point model [25] has also been investigated [12], with the methods tested on two appliances from the REFIT dataset [41]. Finally, [13] explores model compression of a multi-class seq2point CNN using pruning and tensor decomposition, with the evaluation performed for three appliances from the REDD dataset [42].
Even though the aforementioned works can be considered an initial entry point towards low-frequency (≤ 1 Hz) NILM inference on edge devices, they have several limitations. First, these papers [11], [12], [13] do not take into consideration the hardware characteristics of edge devices. This can be an issue in quantization approaches, where some quantization protocols are applicable only to specific chip architectures. Second, all papers employ compression approaches on a single model architecture (seq2point CNN). Seq2point models are less computationally efficient than sequence-to-sequence (seq2seq) models, since they produce only one time point prediction instead of a whole window in the testing phase. As a result, significantly more forward passes are required to produce the same number of outputs as seq2seq models, which leads to a noteworthy increase in energy consumption. In addition, the effects of model compression on different model architectures, such as recurrent neural networks or Transformers, have not been investigated. Furthermore, the works that investigate more than one model compression strategy do not explore the impact of combining them, even though such a combination can optimize different model aspects. Finally, pruning is applied on an arbitrary basis, and no framework has been proposed to interconnect performance loss after compression with model complexity. A summary of these limitations of the existing literature can be found in Table I.

III. PROBLEM STATEMENT
Under a NILM framework [14], we can assume that the aggregate consumption signal x of a domestic house with M operational appliances, at any time point t, equals the sum of the individual appliance consumption loads y_i, i = 1, ..., M, plus a noise term [43]:

$$x(t) = \sum_{i=1}^{M} y_i(t) + \epsilon(t) \qquad (1)$$

Fig. 1. Overview of the required infrastructure setup to perform inference centrally vs. on the edge. Central inference requires upload and download of consumption data to a central processing entity, as well as increased data storage capacity and computational power. On the contrary, performing inference on edge devices alleviates these limitations and only requires the compression of the models and their deployment on the edge device, while data exchange takes place only between the edge device and the domestic house.
To extract the consumption signal of a selected appliance a ∈ {1, . . . , M}, NILM approaches are designed to filter out all non-relevant appliance consumption signals y_i, ∀ i ≠ a. Depending on the chosen appliance a, the power signal y_a may showcase different statistical characteristics in terms of peaks, sparsity, or duration of appliance activations, an activation being defined as the consecutive time interval during which the appliance is turned on. As a result, not only may different model architectures have different sensitivity to model optimization approaches, but the same model, trained to disaggregate different appliances, may also behave differently under compression. It can therefore be argued that the optimization strategy must be bound to model performance, with the goal of finding an equilibrium between model complexity and performance loss.
Two different NILM infrastructure setups, for central and edge device deployment, are illustrated in Figure 1. Even though performing inference centrally theoretically allows the model to utilize larger amounts of computational power, the complexity and drawbacks of such an approach are significant. On the other hand, performing inference on edge devices removes the need to transmit data to and receive data from an external server, with the only limitation being that the models need to be compressed and optimized to run on resource-constrained devices.
To mathematically formulate the aforementioned approach, let a NILM model f(x; w): X → Y, w ∈ R^N, have performance P_f. Our goal is to obtain a lower-dimensionality model h(x; θ): X → Y, where θ is some transformation of w, i.e., θ = T(w), θ ∈ R^C, C < N, that performs the same task with performance P_h. In other words, we are trying to minimize the dimensionality of the transformed parameter set:

$$\min_{T} \; C, \quad \text{where } \theta = T(w) \in \mathbb{R}^{C} \qquad (2)$$

However, Equation (2) does not take into consideration the performance loss that occurs as a consequence of model optimization. The model should be optimized to match the deployment criteria only to the extent that the performance loss is acceptable. Therefore, Equation (2) needs to be constrained with the condition that the performance loss must not exceed a tolerance threshold δ. A performance-aware model compression framework can thus be written as:

$$\min_{T} \; C, \quad \text{subject to } P_f - P_h \leq \delta \qquad (3)$$

Fig. 2. High-level overview of the proposed NILM edge optimization framework.

IV. A GREEN EDGE RESOURCE MANAGEMENT FRAMEWORK FOR NILM
Our proposed green computing framework for NILM model edge optimization is illustrated in Figure 2. The backbone of our approach is the edge optimization engine, which is responsible for the optimization of a NILM model depending on the edge deployment requirements. Since resource limitations of the edge device may vary, the optimization engine first receives the edge device characteristics, as well as any additional restrictions imposed by the user. Then, the trained NILM model to be deployed is analyzed, and an optimization strategy is set. The optimization strategy can either be static to reduce the model's storage requirements through model quantization or performance-aware to apply complexity reduction through weight pruning. Performance-aware optimization is defined as the removal of insignificant model weights not arbitrarily but by taking into consideration the respective impact on model performance. In this case, complexity reduction is performed incrementally until the edge deployment requirements are met, under the condition that the trade-off between performance loss and complexity reduction is satisfactory.
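To make the engine's decision logic concrete, the following minimal Python sketch maps device limitations to the strategies described above. The profile fields and the returned labels are illustrative assumptions rather than the framework's actual interface; the combined case corresponds to the third deployment scenario evaluated later.

```python
from dataclasses import dataclass


@dataclass
class EdgeDeviceProfile:
    """Illustrative description of an edge device; the field names are assumptions."""
    storage_limited: bool
    compute_limited: bool


def select_strategy(profile: EdgeDeviceProfile) -> str:
    # Static optimization (weight quantization) targets storage requirements, while
    # performance-aware pruning targets computational complexity; when both resources
    # are constrained, pruning is followed by quantization.
    if profile.storage_limited and profile.compute_limited:
        return "performance-aware pruning followed by quantization"
    if profile.compute_limited:
        return "performance-aware pruning"
    if profile.storage_limited:
        return "weight quantization (static)"
    return "no optimization required"


print(select_strategy(EdgeDeviceProfile(storage_limited=True, compute_limited=False)))
```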
An overview of the optimization approaches employed for memory and complexity reduction is depicted in Figure 3. The following sections provide a detailed description of these techniques, as well as the proposed performance-aware iterative complexity reduction scheme.

A. Model Weights Quantization
Model quantization refers to the process where the model's weight type is changed to lower numerical precision to limit the storage and memory space required for the model. In essence, quantization can be formulated as an irreversible mapping of the model weights w, stored in a floating point format, to an integer representation w'. The value range of w is divided into bins, and each value w_i is mapped to the integer representing the corresponding bin.

Fig. 3. Overview of the model optimization methods adopted in this study. We explore model quantization by performing MinMax quantization of the model weights, as well as histogram quantization for the activation function outputs to minimize performance loss. We also integrate magnitude pruning in our approach to remove weights with small L1-norm that contribute minimally to the model's predictions.
Quantization is either executed post-training, meaning that an already trained model is compressed, or during training, in the sense that the quantized version of the model is taken into account while the model is trained (quantization-aware training). In this work, we focus on post-training model quantization and apply a calibration phase on an indicative dataset, during which the quantization parameters are fine-tuned, resulting in a more accurate representation of the initial model weights w. This additional calibration step also allows for the quantization of activation function outputs.
We quantize both model weights and activation outputs to further avoid floating point multiplication operations [44]. Since the activation outputs are fed to the next layer, a more sensitive quantization approach is required for them to minimize model performance degradation. Therefore, we have opted for min-max uniform quantization for the model weights and histogram quantization for the activation outputs, where the activation values are recorded and a different range per bin is assigned, depending on the corresponding probability distribution. An illustration of the different quantization approaches can be seen in Figure 3.
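As a conceptual illustration of the min-max weight mapping (not our exact quantization routine), the following NumPy sketch quantizes a floating point tensor to unsigned 8-bit integers and dequantizes it back:

```python
import numpy as np


def minmax_quantize(w: np.ndarray, num_bits: int = 8):
    """Uniform (affine) min-max quantization of a weight tensor."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    scale = scale if scale > 0 else 1e-8                  # guard against constant tensors
    zero_point = int(np.clip(round(qmin - w.min() / scale), qmin, qmax))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point


def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale


w = np.random.randn(64).astype(np.float32)
q, scale, zp = minmax_quantize(w)
print("max reconstruction error:", float(np.abs(dequantize(q, scale, zp) - w).max()))
```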

B. Model Complexity Reduction
An alternative methodology for optimizing a deep learning model is the removal of synaptic connections between model layers. This process, commonly referred to in the literature as model pruning, assumes that a deep learning network is over-parameterized and incorporates a subnetwork that contains most of the information [45]. In other words, model pruning transforms a model's weights w ∈ R^N into a lower-dimensionality representation w' ∈ R^M, M < N, by removing non-informative model connections.
Different techniques for optimally removing model connections with minimal information loss have been proposed in the literature. Similar to quantization, pruning can either be applied post-training [46] or in a compression-aware training scheme [47]. The removal of weights is performed either on the overall set of model weights or by eliminating predetermined architectural blocks, such as convolutional filters [48]. In addition, different pruning approaches remove weights by evaluating different metrics, such as weight magnitude, gradient magnitude, or intra-layer mutual information, or even by introducing a learnable pruning threshold [49], [50], [51].
In this work, we implement magnitude pruning and remove the model connections with the smallest contribution to the model output. Let w = {w_i, ∀ i = 1, . . . , N} ∈ R^N be a vector containing all model parameters. Then the magnitude-pruned vector w' is expressed in Equation (4):

$$w'_i = \begin{cases} w_i, & F(|w_i|) > p_{thres} \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

where F(|w_i|) signifies the cumulative distribution function of the weight magnitudes. In other words, after magnitude pruning, we only keep the fraction 1 − p_thres of weights with the highest magnitudes and discard the rest. Even though magnitude pruning is usually executed only once in post-training pruning, in the next section we present an iterative variation that calculates the optimal p_thres, bound to the resulting model performance.
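A minimal sketch of global magnitude pruning, using the empirical quantile of the weight magnitudes as the threshold (illustrative only, applied here to a flat weight vector):

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, p_thres: float) -> np.ndarray:
    """Zero out the p_thres fraction of weights with the smallest absolute value."""
    threshold = np.quantile(np.abs(weights), p_thres)   # empirical CDF of |w| evaluated at p_thres
    return np.where(np.abs(weights) > threshold, weights, 0.0)


w = np.random.randn(10_000).astype(np.float32)
w_pruned = magnitude_prune(w, p_thres=0.30)
print("sparsity:", float((w_pruned == 0).mean()))        # approximately 0.30
```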

C. Iterative Performance-Aware Green Resource Management Algorithm
Magnitude pruning removes a percentage of a model's lowest L1-norm connections, according to a specified threshold p_thres. However, finding the optimal pruning threshold p_opt that represents the best trade-off between model complexity and performance is often a tedious procedure that requires multiple experiments, whose evaluation is, in most cases, subjective. Therefore, we propose Performance-Aware Optimized Pruning (PAOP), an iterative algorithm to determine the optimal pruning threshold for NILM models. Optimality must be bound to performance, as stated in Equation (3). Consequently, finding the optimal pruning threshold p_opt requires an objective metric that incorporates both the performance degradation of the reduced model and the gain in terms of parameter reduction. Therefore, the metrics utilized for model performance evaluation need to be defined first. Since seq2seq disaggregation is primarily a regression task and secondarily a classification task, we record three widely used metrics for model evaluation, namely Mean Absolute Error (MAE) and Mean Relative Error (MRE) for regression evaluation and F1-score for classification performance, as shown in Equation (5):

$$\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}\left|y_t-\hat{y}_t\right|, \quad \mathrm{MRE} = \frac{1}{T}\sum_{t=1}^{T}\frac{\left|y_t-\hat{y}_t\right|}{\max\left(y_t,\hat{y}_t\right)}, \quad \mathrm{F1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN}} \qquad (5)$$
where y and ŷ are the original and the predicted appliance consumption load, respectively, and TP, FP, and FN stand for the True Positive, False Positive, and False Negative classified time instances in the predicted signature. For different pruning thresholds p_thres, the model performance on the test set will change. At the same time, each metric should not be evaluated independently; instead, all metrics should be combined into a single term. Taking all these considerations into account, we propose the Pruning Gain (PG) metric to quantify the trade-off between model complexity and performance, formulated in Equation (6):

$$PG = \frac{\mathrm{MAE}_{b}}{\mathrm{MAE}_{p}} \cdot \frac{\mathrm{MRE}_{b}}{\mathrm{MRE}_{p}} \cdot \frac{\mathrm{F1}_{p}}{\mathrm{F1}_{b}} \cdot \frac{N_{b}}{N_{p}} \qquad (6)$$

where the subscripts b and p denote the baseline model and the model after pruning, respectively, and N_b, N_p are the corresponding numbers of parameters.
Pruning Gain measures the pruning-related change in each metric. By default, the ratio of the pruned model's performance to the baseline performance is used; for metrics where a lower score is better (MAE, MRE), the terms of the ratio are inverted (baseline/pruned). These per-metric ratios are multiplied by the ratio of the number of parameters of the original model to that of the reduced version. The idea behind PG is to combine the increase or decrease of the metrics recorded to evaluate model performance with the reduction in model size in a multiplicative way. This approach was selected to emphasize the sensitivity to changes in model performance, as averaging the individual terms would allow a positive change in one metric to mask negative changes in the others. Even though the separate metrics are on different scales, the ratio of each metric captures the relative change between the baseline and the pruned version, which normalizes each ratio separately; no change results in a ratio of 1. A PG score greater than 1 means that the performance loss from removing model weights is outweighed by the compression benefit, whereas a score smaller than 1 signifies that the performance drop was more significant than the model compression achieved. Therefore, the proposed metric captures the relative changes between the metrics and can be used to decide whether the impact of pruning on the NILM model was negligible or not.
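For illustration, the Pruning Gain can be computed with a small helper such as the sketch below; the exact arrangement of the ratios follows our reading of Equation (6):

```python
def pruning_gain(mae_b: float, mae_p: float,
                 mre_b: float, mre_p: float,
                 f1_b: float, f1_p: float,
                 params_b: int, params_p: int) -> float:
    """Multiplicative Pruning Gain. MAE and MRE are lower-is-better, so their ratios are
    baseline/pruned; F1 is higher-is-better, so its ratio is pruned/baseline. The final
    factor rewards parameter reduction."""
    return (mae_b / mae_p) * (mre_b / mre_p) * (f1_p / f1_b) * (params_b / params_p)


# Example: a 30% smaller model with mildly worse MAE/MRE and unchanged F1 still yields PG > 1.
print(pruning_gain(10.0, 11.0, 0.20, 0.22, 0.80, 0.80, 100_000, 70_000))   # ~1.18
```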
Utilizing PG, we are now able to perform iterative magnitude pruning to optimally compress a NILM model. To take the hardware characteristics of the edge device into account, the expert needs to define computational cost goals depending on the deployment scenario. Then, iterative model optimization can begin. First, we define the range of pruning threshold percentages [0, p_max] that should be taken into consideration, as well as an iteration step p_step. Then, for each pruning threshold p, we calculate the performance metrics, as well as the Pruning Gain PG. If the Pruning Gain for the given pruning percentage is higher than 1, we assume that the reduction of the weight dimensionality was beneficial and that the model can be further compressed, in which case we increment the pruning threshold by p_step%. We continue this loop until the Pruning Gain falls below 1, at which point the iteration stops. If the computational cost goals were met, the previous pruning percentage p is selected as the optimal pruning threshold for the given model; otherwise, the model is not deployable on the edge device. The iterative procedure is summarized in Algorithm 1.

Algorithm 1 Performance-Aware Green Resource Management Algorithm
1: cost_goal: expert-defined computational cost goal (MFLOPs)
2: p: pruning percentage
3: Define p_max, p_step, p_opt
4: for p = 0 to p_max step p_step do
5:     Calculate performance metrics (MAE, MRE, F1)
6:     Calculate Pruning Gain PG
7:     if PG > 1 then
8:         p_opt = p
9:     else if PG < 1 then
10:        Calculate cost_new
11:        if cost_new ≤ cost_goal then
12:            break (optimal model found)
13:        else
14:            break (model is not deployable)
15:        end if
16:    end if
17: end for
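The search loop of Algorithm 1 can be sketched in Python as follows; the callables stand in for the real pruning, evaluation, and cost-estimation pipeline (hypothetical interfaces), and the Pruning Gain is computed as above:

```python
from typing import Callable, Tuple


def paop(prune: Callable[[float], object],
         evaluate: Callable[[object], Tuple[float, float, float]],
         params: Callable[[object], int],
         flops: Callable[[object], float],
         cost_goal_mflops: float,
         p_max: float = 0.70,
         p_step: float = 0.05) -> Tuple[float, bool]:
    """Performance-Aware Optimized Pruning (sketch). prune(p) returns the model pruned at
    threshold p, evaluate(model) returns (MAE, MRE, F1), params/flops count parameters and MFLOPs."""
    baseline = prune(0.0)
    mae_b, mre_b, f1_b = evaluate(baseline)
    n_b = params(baseline)

    p_opt, p = 0.0, p_step
    while p <= p_max + 1e-9:
        candidate = prune(p)
        mae_p, mre_p, f1_p = evaluate(candidate)
        pg = (mae_b / mae_p) * (mre_b / mre_p) * (f1_p / f1_b) * (n_b / params(candidate))
        if pg > 1.0:
            p_opt = p              # compression still pays off, try a more aggressive threshold
            p += p_step
        else:
            break                  # performance drop now outweighs the complexity reduction

    best = prune(p_opt)
    deployable = flops(best) <= cost_goal_mflops   # final check against the expert-defined cost goal
    return p_opt, deployable
```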

V. EXPERIMENTAL SETUP AND RESULTS
A. Experimental Setup
The methodology to optimize a model for edge inference should depend on the application scenario and the hardware limitations inherent to the edge device. The deployment of NILM models on the edge can be achieved by connecting a smart meter that records the aggregate consumption to a Raspberry Pi 3 Model B single-board computer. The Raspberry Pi is one of the most popular edge devices in IoT systems and is commonly used as a gateway to enable the deployment of AI applications in real-world settings [52]. Therefore, we have designed our methodology and experiments to use a Raspberry Pi 3 as the edge device. Raspberry Pis run on an ARM architecture and have limited storage space and computational power, but are easy to install and use. Their hardware characteristics can be found in Table II.
The architecture used to deploy the optimized models on the edge device is depicted in Figure 4. The edge solution consists of three services responsible for data collection, plus the NILM inference service, which processes the collected data and produces the disaggregation results. The components of the data collection process are described below:
• Z-Wave JS UI is an open-source dockerized service that communicates with the aggregate consumption smart meter through the Z-Wave protocol and forwards the collected data to the Z-Wave service through the MQTT protocol.
• Z-Wave-service is a custom service that receives the collected data from Z-Wave JS UI through the MQTT protocol and forwards them to the data broker service through an API.
• DataBroker-service is responsible for receiving the collected data from the Z-Wave service and communicating with the PostgreSQL database. The DataBroker service is also responsible for updating (saving and deleting) the collected data in the existing database.
The NILM inference service is included in a Docker container that runs continuously on the edge device. This service communicates directly with the database after a specified time interval and checks whether enough data have been collected to produce the disaggregation results.
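A heavily simplified sketch of this polling loop is shown below; the window length, polling interval, and data-access callables are assumptions and do not reflect the actual service implementation:

```python
import time
from typing import Callable, List

WINDOW = 480            # samples (at 1/6 Hz) required for one disaggregation window -- an assumption
POLL_INTERVAL_S = 60    # how often the service queries the database -- an assumption


def inference_loop(fetch_samples: Callable[[int], List[float]],
                   disaggregate: Callable[[List[float]], List[float]],
                   max_iterations: int = 1) -> None:
    """Periodically check whether enough aggregate readings are stored and, if so, run inference."""
    for _ in range(max_iterations):
        samples = fetch_samples(WINDOW)                 # latest readings from the database
        if len(samples) >= WINDOW:
            prediction = disaggregate(samples[-WINDOW:])
            print(f"disaggregated window, first predicted value: {prediction[0]:.1f} W")
        time.sleep(POLL_INTERVAL_S)
```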
Depending on the processor architecture, there are two main quantization backend libraries that can be used, namely FBGEMM [53] and QNNPACK [54]. The term backend refers to the reduced-precision tensor math libraries that are utilized during model compression. FBGEMM can be used to quantize a model to run on x86 architectures, while QNNPACK supports ARM processor architectures. Since the Raspberry Pi processing unit is based on an ARM architecture, we have chosen QNNPACK as the quantization backend.
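As an illustration, the snippet below sketches post-training static quantization in PyTorch with the QNNPACK backend on a toy network; the model, window length, and calibration data are placeholders, backend availability depends on the PyTorch build, and this is not our exact deployment code.

```python
import torch
import torch.nn as nn
import torch.quantization as tq


class TinyNILM(nn.Module):
    """Toy stand-in for a NILM model; the real architectures are described in this section."""
    def __init__(self, window: int = 99):
        super().__init__()
        self.quant = tq.QuantStub()        # marks where inputs enter the quantized region
        self.conv = nn.Conv1d(1, 4, kernel_size=5)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(4 * (window - 4), window)
        self.dequant = tq.DeQuantStub()    # converts outputs back to float

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return self.dequant(x)


model = TinyNILM().eval()

# Prefer the ARM-oriented QNNPACK engine when the build supports it (e.g., on a Raspberry Pi).
engine = "qnnpack" if "qnnpack" in torch.backends.quantized.supported_engines else "fbgemm"
torch.backends.quantized.engine = engine
model.qconfig = tq.get_default_qconfig(engine)   # MinMax weight observers, histogram activation observers

prepared = tq.prepare(model)                     # attach calibration observers
with torch.no_grad():
    prepared(torch.randn(16, 1, 99))             # calibration pass on (dummy) representative data
quantized = tq.convert(prepared)                 # int8 weights and quantized activations
print(quantized(torch.randn(1, 1, 99)).shape)
```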
To evaluate our approach, we conducted experiments on different appliances from the UK-Dale [57] and REDD [42] datasets. The two datasets consist of aggregate and appliance-level energy consumption measurements from five houses in the United Kingdom and six houses in the United States, respectively. UK-Dale was recorded at a sample rate of 1 Hz for the aggregate consumption and 1/6 Hz for individual appliances, while REDD was monitored at a sample rate of 1 Hz for both the aggregate consumption and the plug-level data. The data were resampled to a rate of 1/6 Hz and possess the appliance characteristics described in Table III. The models were tested on unseen data from houses not included in the training set, as shown in Figure 6. The reason for testing the models on a house not used in the training set stems from the core concept of NILM: if a house already has smart meter data recording individual appliance consumption, there is no point in deploying a NILM algorithm to infer it, since it is already available to the consumer. Therefore, the proposed approach is to perform NILM on smart meter aggregate readings from a house using pretrained models, for which ground truth in terms of sub-metering was available for training on a centralized server, e.g., using publicly available datasets. In UK-Dale, we focused on four appliances (washer, kettle, fridge, dishwasher), while in REDD we focused on three appliances (microwave, washer, dishwasher). The selected set of appliances represents single-state and multi-state appliances with variable load fluctuations.

Fig. 5. The upper left subfigure describes a convolutional neural network [25], whereas the next two subfigures correspond to recurrent architectures with different gating mechanisms (LSTM [55], GRU [56]). Finally, the lower right subfigure presents the Transformer-based architecture [30].

Fig. 6. Train-test split for the UK-Dale and REDD datasets. The models were tested on unseen houses not included in the training set. In UK-Dale, houses 1, 3, 4, and 5 were used for training and house 2 for testing, while in REDD houses 2, 3, 4, 5, and 6 were included in the training set and house 1 was kept for model evaluation.
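The 1/6 Hz resampling used for both datasets can be reproduced with a few lines of pandas; the synthetic series below merely stands in for real meter readings:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=600, freq="1s")              # 10 minutes of 1 Hz readings
power = pd.Series(np.random.rand(len(idx)) * 2000.0, index=idx)         # dummy aggregate power in watts
power_6s = power.resample("6s").mean().ffill()                          # downsample to 1/6 Hz
print(len(power), "->", len(power_6s), "samples")
```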
To diversify our experimental evaluation and test the generalization capabilities of our performance-aware edge inference optimization framework, the models are based on different architectural philosophies. In particular, one convolutional model [25], two recurrent architectures (LSTM [55], GRU [56]), and a Transformer-based model [30] were chosen for the evaluation of our approach, and their architectural representations are illustrated in Figure 5. Even though all models initially employ a 1-d convolutional filter for feature extraction, the intermediate part of the model structure varies significantly. The models were purposely trained and evaluated on unbalanced data. The reason for not balancing the dataset is that, to mitigate the negative aspects of central data storage, model training should take place in a federated manner on edge devices, where the possibility of data balancing is limited by hardware constraints. We envision our work as part of a wider NILM framework that enables the transition from central data processing to all computations occurring on the edge, to increase the privacy of customers.
In our analysis, we distinguish between three edge deployment scenarios based on different edge device limitations. In the first, the edge device has limited storage capacity, and the edge optimization engine employs model quantization to limit the required storage space of the model. In the second scenario, the limitation concerns the edge device's processing power, and we optimize the models with Performance-Aware Optimized Pruning (PAOP) to reduce their computational complexity. Finally, we investigate the optimization scenario where the edge device has limited storage space as well as limited computational power. We apply a combination of performance-aware optimized pruning to reduce the number of floating point operations during a forward pass, followed by weight quantization to reduce storage requirements. We call the combination of both techniques Performance-Aware Pruning and Quantization (PAOPQ). When applying the proposed performance-aware schemes, the model complexity reduction ranged between [0, 70]%, with an increment step of 5%. All optimization experiments were performed on an Apple MacBook M1 Pro to take advantage of the ARM CPU architecture and accurately simulate the deployment of the aforementioned models in a real-world setting with Raspberry Pi edge devices.

B. Results
1) Scenario 1 (Limited Storage Capacity):
The first step in our analysis is to examine how the aforementioned models are affected by weight quantization. As can be seen in Table IV, quantization of the model weights leads to a significant 75% reduction in the size required to store the model on disk. At the same time, the effect of quantization on the disaggregation performance, presented in Table V and Table VI, varies across model architectures and appliances. In the UK-Dale results, recurrent neural networks (LSTM, GRU) showcase minimal performance degradation when disaggregating the kettle and fridge, with a performance reduction of less than 0.5% for all metrics. However, they are sensitive to weight quantization for appliances with sparse and long activations, such as the washing machine and the dishwasher. The CNN model has a small performance loss that is consistent across appliances, while the Transformer-based model (ELECTRIcity) is robust to quantization, showcasing a minimal performance degradation averaged across all appliances (−0.01% MAE, −1.09% MRE, and −0.29% F1). The effects of quantization observed on UK-Dale are very similar when the quantized models are evaluated on the REDD dataset. Recurrent as well as Transformer architectures present minimal performance degradation across all tested appliances. In some cases, quantization can even lead to a slight improvement in disaggregation results, as happens for the LSTM, GRU, and ELECTRIcity models on the microwave, as well as for the ELECTRIcity model on the washer, with an average improvement of 10.80% MAE, 8.57% MRE, and 2.2% F1.
2) Scenario 2 (Limited Processing Power): Next, we evaluate how the models are affected by PAOP. Applying the proposed iterative algorithm to find the optimal pruning threshold, the average number of model parameters can be decreased by 40.93% in UK-Dale and by 40% in the REDD dataset. The optimal pruning threshold for each model and appliance, as well as the number of baseline parameters, are given in Table VII. By comparing the obtained optimal pruning thresholds with the performance metrics reported in Table V, we observe that, in cases where the baseline model does not perform well, indicated by a low F1 score and a high MRE, our algorithm tends to suggest the highest pruning percentage p_max, whereas in cases where the model performs well, the suggested optimal pruning threshold takes a plausible value close to the average. An indicative example of this finding is the pruning of the LSTM model for the disaggregation of the washing machine in the UK-Dale dataset. Since the baseline performance is suboptimal, the ratio of baseline performance to performance after pruning is very sensitive to change, and even though the MRE rises by 4.3% in absolute value, the relative change is −83.22%. At the same time, however, the MAE is 33.28% better than the baseline, which can be explained by the fact that artifacts in the predicted appliance signature are no longer produced; multiplied by the ratio of model parameter reduction, this leads to a positive Pruning Gain value. In the example of ELECTRIcity for the dishwasher, we observe that 35% of the model weights are removed without notably affecting the model's disaggregation performance. Overall, it can be concluded that our performance-aware model compression strategy can reduce the computational complexity of a NILM model without significantly affecting its performance. The complexity reduction is validated through the reduced number of floating point operations (FLOPs) required to perform a forward pass, as can be seen in Table VII. On average, PAOP reduces the FLOPs of a NILM model by 36.3% in UK-Dale and by 31.8% in REDD.

3) Scenario 3 (Limited Storage Capacity and Processing Power):
The last experiment combines both aforementioned model compression approaches (PAOPQ). To calculate the optimal pruning threshold in this case, the model performance was evaluated after both schemes were applied. Therefore, the optimal thresholds obtained differ from those in the pruning-only case (see Table VII). It can easily be noticed that the combination of both techniques tolerates significantly lower pruning percentages for most models. On average, the optimal pruning threshold is 37.4% lower in UK-Dale and 26.25% lower in REDD, compared to when weight quantization is not utilized. Therefore, it can be concluded that, without the proposed performance-aware optimization scheme, the performance degradation of the models would be significantly higher. The integration of both techniques in our scheme results in 75% smaller size on disk for both UK-Dale and REDD and, on average, 25.62% fewer model parameters and 22% fewer FLOPs for UK-Dale, and 21.51% fewer model parameters and 21.09% fewer FLOPs for REDD, thus reducing both storage requirements and model computational complexity. Another interesting finding is that the optimal thresholds for ELECTRIcity remain the same as in the case of applying only parameter pruning and, in the case of the dishwasher, the model even tolerates a higher pruning percentage, indicating that Transformer-based architectures are more robust to model compression than convolution-based and recurrent modeling approaches.
The Pruning Gain values recorded while searching for the optimal pruning threshold p_thres are illustrated in Figure 7. The upper part of the figure showcases the Pruning Gain distribution when only model pruning is applied, while the lower part demonstrates the distribution when both pruning and quantization are selected. The difference between the two plots validates the observation that model performance is more sensitive when both compression approaches are combined and that the proposed metric can accurately quantify the trade-off between model performance and computational complexity.
Finally, we recorded the CO2 emissions of each optimized model through the CodeCarbon Python library, and the results are presented in Table VIII. All optimization approaches reduce CO2 emissions by approximately 17%. The only exception concerns the recurrent architectures (LSTM, GRU), where the CO2 emissions are higher when quantization is involved. This can be explained by the increased complexity of recurrent layer computations, where multiple activation functions are utilized inside each memory cell. Since the outputs of the activation functions are dynamically quantized during inference, the increased energy needed to perform the quantization is justifiable.

Fig. 8. Comparison of optimal pruning threshold and performance between optimizing the models for each individual appliance (green) vs. jointly (yellow). Optimizing the models on the joint set of appliances (1 for all) leads to subpar optimization.
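Emissions tracking with CodeCarbon typically looks like the sketch below; the project name and the measured workload are placeholders, not our exact measurement script:

```python
from codecarbon import EmissionsTracker   # pip install codecarbon

tracker = EmissionsTracker(project_name="nilm-edge-optimization", log_level="error")
tracker.start()
# ... run the optimized model on the test set here ...
emissions_kg = tracker.stop()             # estimated kg of CO2-equivalent for the tracked block
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```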

C. Discussion
As already mentioned, our approach has certain limitations. First, we have observed that models that do not showcase good baseline performance tend to be over-pruned by our proposed iterative algorithm during the search for the optimal pruning threshold. Even though such models should not be deployed to perform inference, as the insights that the consumer would get regarding energy consumption would not be accurate, our approach should still take such cases into consideration. The second limitation concerns the fact that, in our approach, the models are optimized for each appliance separately. Therefore, to perform energy disaggregation for multiple appliances, the deployment of multiple NILM models is required. We have, however, experimented with optimizing a model for all appliances simultaneously, as shown in Figure 8, and have found that optimizing the model for all appliances at the same time leads to over- or under-pruning and impacts the achievable performance. Averaged across all appliances and models, this approach would lead to a performance loss of 8.92% MAE, 7.32% MRE, and 12.20% F1.

VI. CONCLUSION AND FUTURE WORK
In this work, we have proposed an efficient, performance-aware model optimization framework for edge deployment of NILM models that takes into account the edge device characteristics. We have explored three different deployment limitations, for which optimization of different model aspects is required. Additionally, we proposed an objective model optimization metric and a performance-aware model complexity reduction algorithm that constrains model optimization on performance loss. Experimental results validate that our proposed method of binding model compression to model performance, instead of performing it arbitrarily, allows for the combined utilization of more than one compression approach on the same model without significantly affecting model performance, thus enabling the efficient deployment of NILM models on edge devices.
In future work, we would like to incorporate further techniques, such as knowledge distillation or tensor decomposition, into our performance-aware compression scheme. Further methods for weight quantization, as well as different magnitude pruning approaches (gradient-based magnitude pruning, information-based pruning) and structured pruning, will also be evaluated. We would also like to test our approach on different model architectures, assess the differences between models trained with balanced versus unbalanced datasets, and adapt our proposed iterative scheme to account for optimal compression of models with subpar disaggregation performance. In addition, we plan to utilize recent advancements in sparse matrix computation on edge devices to maximize the optimization potential of our methods [58]. Finally, our methodology has been structured around the limitations of a real-world scenario, and we would like to deploy the compressed models on Raspberry Pi devices, connected to house smart meters, to evaluate whether the simulation experiments that we have conducted translate to real-world conditions in the same way.