OPT-NILM: An Iterative Prior-to-Full-Training Pruning Approach for Cost-Effective User Side Energy Disaggregation

Non-Intrusive Load Monitoring (NILM) describes the process of analyzing the aggregate household energy consumption to infer the individual energy consumption patterns of different appliances. Although NILM research has led to substantial progress in the performance of deep learning models, these models require exhaustive resources for the training phase and, due to their computational demand, are not well suited for deployment on edge devices with limited resources. NILM applications on low-resource devices enhance user adoption, opening up new energy market prospects. Although there has been some work toward edge-computed NILM, the proposed compression frameworks provide a solution only for the deployment phase, since they are applied to already trained models. This study presents OPT-NILM, a novel pruning strategy to discover sub-optimal NILM neural networks before full training, which reduces computing costs for both the testing and training phases and improves disaggregation performance compared to conventional after-training pruning. OPT-NILM proposes a metric to find the appropriate pruning threshold by evenly valuing model performance and computing cost, unlike other approaches that apply compression arbitrarily. Experimental results on the UK-DALE dataset show that the OPT-NILM approach may reduce model trainable parameters by up to 95% with minimal performance loss.


I. INTRODUCTION
Electricity load monitoring for appliances is a significant task in light of current economic and ecological trends. It complements home energy management systems (HEMSs) and ambient assisted living (AAL) technologies, contributing to efficient and cost-effective energy management [1], [2]. Additionally, electricity load monitoring serves as a tool for detecting malfunctioning appliances, such as identifying issues like frosting cycles in fridges with damaged seals, among other possibilities. Promoting sustainable living requires householders to adopt energy-related behavior changes. Energy monitoring plays a pivotal role in effective energy management by enabling the monitoring of power consumption of individual appliances, thus informing the planning of technical measures to minimize energy usage. Energy disaggregation techniques can be leveraged to enable granular monitoring of power consumption at the appliance level.
Non-Intrusive Load Monitoring (NILM) or energy disaggregation algorithms aim to infer the energy consumption patterns of domestic appliances by decomposing the aggregated household energy consumption signal into the individual power signals of its corresponding appliances [3]. Recently, a significant number of publications have addressed NILM using deep learning models ([4], [5], [6]). Due to many limitations, NILM approaches have not been widely used in households despite the interest from the industry. Specifically, the training process of such NILM models requires a lot of computational power and resources, so they cannot be deployed on the user side, i.e., on the edge. Instead, they require central servers or cloud computing infrastructures, which increase the cost and energy of running such a service. The current concept implies data transfer between the data source and a central server, which creates privacy problems and data storage costs [7]. Deploying deep learning algorithms on the edge, at consumers' homes equipped with smart meters and low-power devices, could be a viable solution. In order to make this transition from central data processing to user-side energy disaggregation, many different edge-NILM solutions have been proposed. The main goal of all these solutions is to compress and optimize the models' structures so that they can operate with limited computational resources. One of the most common techniques used for NILM model compression is pruning [8], [9].
Pruning is a technique in deep learning that aids in the development of smaller and more efficient neural networks by eliminating unnecessary values in the final trained models' weight tensors based on their contribution to the models' predictions. The contributions of weights and neurons can be determined by local measures such as their magnitude and L1-norm [10]. However, the existing compression frameworks share the basic limitation that they are applied to a fully trained model and cannot be executed before full training. Thus, the proposed edge-NILM solutions do not solve the core issues of central data processing, in the sense that the network initially has to be fully trained centrally, with all the demands this places on central servers and cloud computing infrastructures. As a result, current compression schemes only provide a solution to the testing phase of NILM algorithms on the edge, which is the least computationally heavy task of the whole process. However, in the machine learning community, there is an increasing interest in a new training trend that accelerates training and embraces the promising training-on-the-edge paradigm.
© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Here, we propose a prior-to-full-training NILM compression scheme, which allows for the identification of optimal sub-deep NILM networks without first requiring full training of the selected model. Following such a scheme would aid in dealing with central data processing issues and NILM real-world deployment, since we are able to identify efficient sub-NILM models at their initialization stage, reducing the training resources and creating efficient, lightweight models that are able to run on limited-resource devices. While the training phase of our framework necessitates data transmission to a central node in order to train the identified sub-network, it is essential to underscore that, since the deployment takes place in houses not included in the training set, the testing phase functions without any subsequent data transmission. All inference occurs directly on the edge side, bolstering data privacy and promoting user adoption, given that there is no need for users to send their consumption data to an external entity.
These pruned models will be trained using fewer computational resources than the corresponding uncompressed ones and, at the same time, tested on the edge, utilizing the user's limited-resource devices. The main goal of OPT-NILM is to provide such a framework to optimally identify sub-deep networks before training for a cost-effective user-side NILM. The main contributions of this work are summarized below:
• Proposing a computationally efficient before-full-training pruning scheme for edge-computed NILM. In contrast with the conventional pruning approaches, the proposed approach identifies optimal sub-deep NILM networks prior to full training. The proposed framework not only identifies sub-deep-neural-network structures that can be easily deployed on a limited-resource device, but it also reduces the computational resources needed for the training phase of the NILM models, promoting the real-world deployment and adoption of NILM applications.
• OPT-NILM identifies optimal sub-networks that achieve better disaggregation performance compared to the conventional after-training pruning schemes. Deep neural networks (DNN) are known to be over-parameterized. Thus, a trained DNN for NILM contains many ineffectual parameters that can be safely pruned or zeroed out with a small or no effect on its performance. In our scheme, where these parameters are pruned before the full training, our sub-deep neural network structures are less over-parameterized during the full training, reducing the computational resources needed and preserving a better trade-off between disaggregation performance and reduction in the number of trainable parameters.
• Proposing a model optimization metric to determine the ideal balance between the model's disaggregation performance and compression. In NILM applications, the trade-off between accuracy and efficiency is critical. Setting a high pruning percentage results in a significant accuracy drop, since the pruned model will not have enough representation power. OPT-NILM is both a resource-efficient and performance-effective technique and introduces an objective model optimization metric for NILM that describes the trade-off between the performance and the model complexity by equally weighting both these factors.
Although the proposed prior-to-full-training pruning scheme was inspired by [11], this work is a pioneering application within the NILM domain. Additionally, this study offers a comprehensive comparison to other compression methods and introduces a novel metric tailored to the unique needs of NILM. From a technical standpoint, the primary contribution of this paper is the introduction of a cost-effective and interoperable deployment strategy for the proposed OPT-NILM inference phase. Our solution is anchored on a Raspberry Pi device and leverages the Z-Wave communication protocol. Originally developed for use in connected home technology, this protocol ensures reliable and robust data transmission between monitored devices and their respective gateways [12].
The paper is structured as follows: Section II covers the background on low-frequency NILM with deep learning and compression methods. Section III delves into NILM problem formulation and its deep neural network modeling. The proposed solution is detailed in Section IV, while Sections V and VI present and discuss results. Section VII concludes and outlines future directions.

II. RELATED WORK
In this section, we provide a brief background on deep learning energy disaggregation approaches and a review of the compression approaches used in deep learning, as well as in deep-NILM models specifically.

A. Deep Learning Models for NILM
Deep learning has achieved enormous success in domains such as natural language processing, time-series analysis, and computer vision [13]. Over the last few years, numerous deep learning approaches have been proposed for NILM, as it has been shown that they achieve superior performance [14], including convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory (LSTM), bidirectional (bi)LSTM, gated recurrent unit (GRU), (bi)GRU, and Transformer models [3]. RNN approaches, such as LSTM and GRU, use feedback connections to capture temporal dependencies within the power signals [15]. Both LSTM and GRU architectures have been widely proposed in NILM [16], [17] since they converge fast and provide good disaggregation performance. CNN-based architectures capture long-range temporal dependencies in time-series data, making them a successful NILM technique [18]. This strategy requires large model depth and extensive filters, which increases computational complexity. NILM techniques like [19] propose hybrid recurrent-convolutional architectures, benefiting from the advantages of both types of layers. Transformer-based architectures have become another widespread approach for NILM [15], [20], [21] due to their self-attention mechanism and their ability to process data in an order-invariant way. However, all the aforementioned deep learning NILM approaches suffer from computational complexity issues, which increase the training cost and limit their applicability in real-world deployment on the edge.

B. DNN Compression Methods for NILM
Recent developments have driven the adoption of NILM and related energy applications on edge devices. The basic reason is that deploying such applications on the edge eliminates the need to transfer data between the users and a central data source, addressing the challenges tied to central data processing and privacy. The landscape of research in edge-computed NILM is broad and includes different approaches, from deep learning models on edge devices [8], [9], [22], [23], [24] to feature extraction [25], [26], [27], federated learning [28], and hardware-specific optimizations such as Field-Programmable Gate Arrays (FPGAs) [29] and the e-Sense device [30]. Since NILM research has largely shifted toward deep learning techniques, there is a growing interest in works that deal with NILM inference on edge devices to be deployed as part of Home Energy Management Systems [31]. This trend significantly influenced our decision in this paper to delve deeper into deploying deep learning architectures on resource-constrained devices and to explore existing and new compression methodologies. However, research on compression methodologies for edge-computed NILM models remains limited. In [9], multiple pruning techniques, including magnitude, relative threshold, and entropy-based pruning, are investigated and applied to the NILM CNN sequence-to-point (seq2point) model proposed in [23]. These methods are tested on the kettle and dishwasher appliances from the Refit dataset [32]. The application of a quantization approach has also been proposed in [24]. In that work, the same seq2point CNN architecture is modified from 32-bit float to 8-bit integer model weights. Reference [8] proposes a model compression scheme for a multi-class seq2point CNN using pruning and tensor decomposition. This approach is evaluated on three different appliances from the UK-DALE and REDD datasets [33], [34]. In [35], a performance-aware NILM compression technique is proposed, incorporating an after-pruning approach (PAOP) and an after-pruning approach combined with quantization (PAOPQ), tested across four different architectures. Lastly, in [28], the authors introduced a cloud model compression technique suitable for edge implementation of FedNILM. This was achieved by employing filter pruning within the convolutional layers of the chosen deep learning model.
Although the aforementioned works lead the way toward edge inference in NILM, they have some significant limitations. The basic drawback of these approaches is that the existing compression schemes are applied to already trained models. Thus, the proposed approaches do not overcome the issues raised by the high computational demand of the training phase and provide a solution only for the testing phase of the NILM models. Another limitation is that [8], [24], and [9] are applied to a seq2point CNN architecture, which is a computationally inefficient approach since it provides only a midpoint prediction for each window. Since seq2point models are trained to predict the output signal only at the midpoint of the window, they employ a sliding window approach to construct the entire consumption signal, which increases the number of forward passes and, consequently, the computational resources required for inference compared to seq2seq models that predict the entire sequence at once [23]. Finally, in both [9] and [8], compression is applied in an arbitrary way, and there is no framework that evaluates the trade-off between model complexity and performance degradation to define the optimal pruning level.
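To make the inference-cost argument concrete, the following minimal sketch counts forward passes for the two prediction styles. It assumes a seq2point model that slides its window one sample at a time and a seq2seq model that processes non-overlapping windows; the function names, the window length of 480 samples, and the signal length of 14,400 samples are hypothetical choices for illustration, not values taken from the cited works.

```python
import math


def seq2point_forward_passes(signal_len: int, window: int, stride: int = 1) -> int:
    """One forward pass per sliding-window position (one midpoint prediction each)."""
    if signal_len < window:
        return 0
    return (signal_len - window) // stride + 1


def seq2seq_forward_passes(signal_len: int, window: int) -> int:
    """One forward pass per non-overlapping window (whole-window prediction)."""
    return math.ceil(signal_len / window)


# Hypothetical example: 14,400 samples processed with a 480-sample window.
signal_len, window = 14_400, 480
print(seq2point_forward_passes(signal_len, window))  # 13921
print(seq2seq_forward_passes(signal_len, window))    # 30
```

Under these assumptions the seq2point model needs several hundred times more forward passes for the same signal; overlapping seq2seq windows would narrow the gap but not close it.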

III. PROBLEM FORMULATION
This section presents the NILM problem formulation as well as its modeling using deep neural networks. It also discusses some deep learning-related issues that hinder the real-world edge deployment of such an application.

A. NILM Problem Formulation
The concept of non-intrusive load monitoring was first introduced by Hart in 1992 [36]. According to the proposed problem formulation, the aggregate active power $x(t)$ of a number of measured appliances $m = 1, \ldots, M$ at time $t = 1, \ldots, T$ can be formally defined as:

$$x(t) = \sum_{m=1}^{M} y_m(t) + \epsilon_{noise}(t) \qquad (1)$$

where $y_m(t)$ expresses the power consumption of the $m$-th appliance and $\epsilon_{noise}(t)$ describes the noise originating from the measurement equipment and the appliances that are not submetered during the measurement campaign [14]. The goal of energy disaggregation is to solve the inverse problem in (1) and determine the individual consumption $y_m(t)$ of a selected appliance $m$ at time $t$ based exclusively on the measurement of the aggregate signal $x(t)$.
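As an illustration of the forward direction of this formulation (the direction that disaggregation has to invert), the sketch below builds an aggregate signal as the sum of appliance signals plus a noise term. The appliance names and all power values are made-up toy numbers, and `aggregate` is a hypothetical helper name.

```python
def aggregate(appliance_signals, noise):
    """Return x(t) = sum over appliances of y_m(t), plus noise(t), for every time step."""
    n_steps = len(noise)
    return [
        sum(y[t] for y in appliance_signals) + noise[t]
        for t in range(n_steps)
    ]


# Toy power readings in watts (hypothetical values).
kettle = [0.0, 2000.0, 2000.0, 0.0]
fridge = [90.0, 90.0, 0.0, 90.0]
noise = [5.0, -3.0, 2.0, 0.0]  # unmetered loads and measurement error

x = aggregate([kettle, fridge], noise)
print(x)  # [95.0, 2087.0, 2002.0, 90.0]
```

A NILM model only observes `x` and has to recover, for example, `kettle` from it.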
NILM is considered as a very challenging problem, as power signals do not present any linearity, and the use of each appliance depends on the contextual characteristics of each household.The diverse energy consumption patterns make the implementation of robust NILM algorithms with good generalization behavior even more challenging.Finally, another challenge that NILM models should deal with is the dataset imbalance since every appliance is used with different frequencies and duration.

B. Deep Learning Modelling of NILM
Deep learning for NILM was first introduced in 2015 by Jack Kelly, with major progress on disaggregation performance and generalization capability compared to conventional approaches such as [4], [5]. Solving energy disaggregation using deep neural networks translates into a non-convex optimization problem. Specifically, learning in deep neural networks describes the process of calculating the weights of the parameters associated with the various regressions throughout the network. In order to find the parameters that give the best approximation, an objective is needed. Assuming a training set of $v = 1, \ldots, V$ values, the objective function $J(\cdot)$ quantifies the distance between the ground truth consumption values, $y_v$, and the predicted ones, $\hat{y}_v$, as:

$$J(\theta) = \frac{1}{V} \sum_{v=1}^{V} L(y_v, \hat{y}_v) \qquad (2)$$

where $\theta$ are the model parameters (or weights) and $L(\cdot)$ is the cost function. Note that in (2) we omit the subscript $m$, as we describe the optimization function of a single device. The minimization of $J(\cdot)$ takes place through the backpropagation step [37], where gradient descent is applied to update the parameters of the model. Deep neural networks are universal function approximators that are capable of approximating very complicated functions. However, the trade-off of this capability is the number of neurons needed. Specifically, approximating a non-convex function, as required in NILM, calls for high-complexity deep learning models with many parameters [38]. Although these models are considered state-of-the-art approaches to NILM, they increase the computational complexity and resources required to tackle this problem.
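A minimal numeric sketch of such an objective follows. It assumes that $J$ averages a per-sample cost over the $V$ training values and uses the squared error as the cost function; both choices are assumptions made for illustration, since the text does not fix a particular cost function. The function names and the sample values are hypothetical.

```python
def squared_error(y_true: float, y_pred: float) -> float:
    """Per-sample cost L; squared error is an assumed choice."""
    return (y_true - y_pred) ** 2


def objective(y_true, y_pred, cost=squared_error) -> float:
    """Average the per-sample cost over all V training values."""
    n_values = len(y_true)
    return sum(cost(y, y_hat) for y, y_hat in zip(y_true, y_pred)) / n_values


# Hypothetical appliance power values in watts.
ground_truth = [0.0, 100.0, 200.0]
prediction = [10.0, 90.0, 200.0]
print(objective(ground_truth, prediction))  # (100 + 100 + 0) / 3
```

Gradient descent then adjusts the model weights in the direction that lowers this value.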

IV. METHODOLOGY
In this part, we describe the suggested OPT-NILM compression strategy as well as the standard after-training magnitude pruning, which has already been employed in [8], [9] as a way to reduce the complexity of NILM deep learning models towards edge inference. In addition, we discuss the methods and benefits of the suggested scheme, highlighting its key contributions to the acceptance and implementation of an edge NILM application in the real world. Lastly, we define a trade-off metric for approximating the optimal pruning threshold in relation to the model's performance.

A. Magnitude Pruning
One of the most common methodologies for optimizing DNN structures is magnitude pruning. The idea of pruning in artificial neural networks derives from synaptic pruning in the human brain, where axons and dendrites decay and die off, resulting in synapse elimination that occurs between early childhood and the onset of puberty [39]. By analogy, deep learning pruning removes redundant parameters or neurons that do not significantly contribute to the model's predictions. Model pruning is thus a technique that reduces the model's weights, $\theta \in \mathbb{R}^{K}$, to a lower-dimensional representation, $\hat{\theta} \in \mathbb{R}^{\hat{K}}$ with $\hat{K} < K$, by removing non-informative model connections.
Many deep learning pruning variations have been proposed. Specifically, pruning can either be applied after training or iteratively during the training process [40], [41]. The removal of connections is performed either in an unstructured way, by eliminating specific weights from each layer, or in a structured one, by removing larger structures such as neurons or convolutional filters [42], [43], [44], [45]. Finally, pruning approaches remove weights based on different metrics such as weight magnitude, gradient magnitude, layer-wise mutual information, or a threshold learned via gradient descent [44], [46], [47].
In this work, we implement post-training pruning based on the $L_1$-norm metric as a baseline approach, since it has also been used for edge-computed NILM in [8], [9]. This approach removes the model's connections with the smallest contribution to its output according to a specified threshold $p_{thrs}$. Given a dataset $D = \{(x(t), y_m(t))\}_{t=1}^{T}$ corresponding to a time window $t = 1, \ldots, T$ of measured signal powers and a desired sparsity level $p_{thrs}$ (i.e., the percentage of removed parameters), neural network structural pruning can be formulated as the constrained optimization problem (3), where $L(\cdot)$ is the defined loss function, $\theta_0$ are the initial weight values, and $\|\cdot\|_1$ is the standard $L_1$-norm. Thus, after magnitude pruning, the pruned model keeps only the fraction $(1 - p_{thrs})$ of the weights with the highest $L_1$-norm, while the rest are discarded.
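The effect of magnitude pruning on a single weight vector can be sketched as follows. This is an illustrative, dependency-free version rather than the implementation used in the experiments; `magnitude_prune` and the weight values are hypothetical, and the sketch assumes unstructured pruning in which the fraction `p_thrs` of weights with the smallest absolute value is zeroed.

```python
def magnitude_prune(weights, p_thrs):
    """Zero the fraction p_thrs of weights with the smallest absolute value.

    Returns the pruned weights and the binary mask (1 = kept, 0 = removed).
    """
    n_prune = int(round(p_thrs * len(weights)))
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    removed = set(order[:n_prune])
    mask = [0 if i in removed else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


# Hypothetical trained weights of one layer.
weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.4, -0.02, 0.6, 0.1, -0.9]
pruned, mask = magnitude_prune(weights, p_thrs=0.5)
print(mask)    # [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]
print(pruned)  # [0.8, 0.0, 0.0, -1.2, 0.0, 0.4, 0.0, 0.6, 0.0, -0.9]
```

In PyTorch, `torch.nn.utils.prune.l1_unstructured` applies the same kind of magnitude criterion to a module parameter.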

B. OPT-NILM Approach
Magnitude pruning removes a percentage of a model's lowest $L_1$-norm connections according to a specified $p_{thrs}$. However, the whole pruning procedure is applied to an already trained model, meaning that excessive computational resources and data transmission to a central server are required for the training process. The proposed OPT-NILM pruning approach, which deals with the aforementioned limitations, is mainly inspired by the Lottery Ticket Hypothesis paper [11]. According to this work, a randomly initialized neural network contains a sub-network that is initialized such that, when trained in isolation, it can match the test set accuracy of the original network after training for at most the same number of iterations. The key characteristic of this approach is that pruning is performed before full training rather than after training, as is proposed in the existing edge NILM frameworks. Based on this idea, our proposed prior-to-full-training pruning technique prunes the NILM networks at the initialization stage. The first step of the proposed approach is to initialize the NILM neural network and train it for a couple of iterations while also keeping track of its initial weight parameters $\theta_0$. In contrast with full training, where the model should become as accurate as possible, in this stage we are trying to determine which of the initialized parameters lend themselves to the task. In order to achieve this, the model should only be trained for a couple of iterations, significantly fewer than in full training. Subsequently, this slightly trained model is pruned using the same techniques that are used to prune a fully trained model. In this work, the $L_1$-norm pruning technique is used to remove the parameters that are not helpful to the task. Since the model is not trained for a long time, this technique gives an indication not only of the current parameters but also of their initialization. Thus, if a parameter is currently ineffective, its initialization is probably not part of the optimal sub-network. The final step is to reset the parameters that were not pruned back to their initialization $\theta_0$.
The process of training, pruning, and resetting is repeated for $N \ll N_{full}$ epochs, where $N$ stands for the epochs of the pretraining cycle and $N_{full}$ stands for the epochs of the full training, until the desired pruning level has been achieved. Once the optimal sub-network has been found, this network can be trained fully. Figure 2 provides a visual illustration of the process described above.
From a more mathematical perspective, let $f(\{x(t)\}_{t=1}^{T}; \theta_0)$ be a deep neural network with initial parameters $\theta_0$. The procedure of the pretraining process is as follows: initially, the network is trained for $n = 1, \ldots, N$ iterations until the first desired $\theta^{Tr}_{1}$ is obtained, where the superscript $Tr$ denotes the training state. This can be described as:

$$\theta^{Tr}_{n} = F_{train}\big(\theta^{(Rst)}_{n-1}\big), \qquad \theta^{(Rst)}_{0} = \theta_0 \qquad (4)$$

where $F_{train}(\cdot)$ is a function describing the training procedure of the network and $\theta^{(Rst)}_{n-1}$ are the weights obtained from the network after the reset state. Afterwards, the trained weights are pruned by applying a binary mask $\mu_n \in \{0, 1\}^{K}$, where $\odot$ denotes the Hadamard (point-wise) multiplication. This is described as:

$$\theta^{Pr}_{n} = \mu_n \odot \theta^{Tr}_{n} \qquad (5)$$

where $\theta^{Pr}_{n}$ are the weights obtained from the network after pruning. Then, the remaining weights are reset back to $\theta_0$ as

$$\theta^{(Rst)}_{n} = F_{rst}\big(\theta^{Pr}_{n}, \theta_0\big) \qquad (6)$$

where $F_{rst}(\cdot)$ is a function that replaces the non-zero index values of the pruned network with those of $\theta_0$. Note that the process described above is repeated for all $N$ epochs.
The identified optimal sub-network $f(\{x(t)\}_{t=1}^{T}; \hat{\theta})$ could then be fully trained, employing much fewer computational resources compared to the original uncompressed model. The proposed OPT-NILM pruning scheme is compactly described in Algorithm 1.
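The train, prune, and reset cycle can be sketched on a toy problem as shown below. This is not Algorithm 1 itself: a masked linear model trained with plain stochastic gradient descent stands in for the NILM network, and the names `train`, `prune`, `reset`, `find_subnetwork`, the number of rounds, the epochs per round, and the per-round pruning fraction are all hypothetical choices made for illustration.

```python
import random


def train(theta, mask, data, epochs, lr=0.05):
    """Toy stand-in for the training function: SGD on a masked linear model."""
    w = list(theta)
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            # Pruned weights (mask 0) receive no update and stay at zero.
            w = [wi - lr * err * xi * mi for wi, xi, mi in zip(w, x, mask)]
    return w


def prune(theta, mask, fraction):
    """Drop the given fraction of surviving weights with the smallest magnitude."""
    alive = [i for i, m in enumerate(mask) if m]
    n_drop = int(fraction * len(alive))
    drop = set(sorted(alive, key=lambda i: abs(theta[i]))[:n_drop])
    return [0 if i in drop else m for i, m in enumerate(mask)]


def reset(theta0, mask):
    """Surviving weights return to their initial values; pruned ones stay zero."""
    return [t if m else 0.0 for t, m in zip(theta0, mask)]


def find_subnetwork(theta0, data, n_rounds, epochs_per_round, fraction):
    mask = [1] * len(theta0)
    theta = list(theta0)
    for _ in range(n_rounds):
        trained = train(theta, mask, data, epochs_per_round)  # short pretraining
        mask = prune(trained, mask, fraction)                 # magnitude pruning
        theta = reset(theta0, mask)                           # back to initialization
    return theta, mask


# Toy data: only the first two of eight input features matter.
random.seed(0)
true_w = [2.0, -3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
data = []
for _ in range(200):
    x = [random.uniform(-1, 1) for _ in range(8)]
    data.append((x, sum(w * xi for w, xi in zip(true_w, x))))

theta0 = [random.uniform(-0.5, 0.5) for _ in range(8)]
theta, mask = find_subnetwork(theta0, data, n_rounds=2, epochs_per_round=5, fraction=0.5)
print(mask)  # surviving connections after two rounds of 50% pruning
```

The returned `theta` holds the initial values of the surviving weights, so full training would start from this sparse initialization rather than from the briefly trained weights.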
The proposed pre-training process is able to find optimal, computationally light sub-networks that could be deployed on a limited-resource device and trained using much fewer computational resources, providing a cost-effective embedded NILM solution for the consumers. Furthermore, experimental results show that the proposed OPT-NILM scheme manages to achieve better performance by identifying even smaller sub-deep NILM networks than the conventional pruning scheme. Last but not least, this approach could increase the efficiency and enhance the design of the network by providing information about what an optimal sub-network architecture would look like in terms of the importance of each layer and the number of initial parameters.

C. Optimal Pruning Threshold Estimation
A basic limitation of the aforementioned works on NILM compression is that $p^{opt}_{thrs}$ is selected in an arbitrary way without taking into account the performance of the models. This paper proposes a metric that fills this gap and identifies the optimal pruning threshold $p^{opt}_{thrs}$ for NILM models by equally weighting the trade-off between model complexity and disaggregation performance. This metric incorporates both the performance degradation of the pruned model and the gain in terms of parameter reduction. The performance measure used to find $p^{opt}_{thrs}$ is the F1-score, as presented in (7):

$$F1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \qquad (7)$$

where TP, FP and FN stand for the True Positive, False Positive and False Negative classified time instances in the predicted signature. The F1-score was selected as the disaggregation performance measure of the pruning-performance trade-off metric because it assesses whether the model can properly identify the appliances' activations and accounts for the class imbalance problem of NILM.
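The following sketch shows how an F1-score of this kind can be computed from power signals by first converting them to on/off states. The 1000 W on-threshold and all power values are hypothetical illustration values, and the helper names are assumptions.

```python
def to_on_off(power, threshold):
    """Mark each time instance as on (True) when power reaches the threshold."""
    return [p >= threshold for p in power]


def f1_score(true_on, pred_on):
    """F1 from true positive, false positive and false negative time instances."""
    tp = sum(1 for t, p in zip(true_on, pred_on) if t and p)
    fp = sum(1 for t, p in zip(true_on, pred_on) if not t and p)
    fn = sum(1 for t, p in zip(true_on, pred_on) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical kettle readings in watts.
true_power = [0, 2100, 2050, 0, 0, 1900]
pred_power = [10, 1800, 300, 1500, 0, 2000]

f1 = f1_score(to_on_off(true_power, 1000), to_on_off(pred_power, 1000))
print(round(f1, 3))  # 0.667  (TP = 2, FP = 1, FN = 1)
```

Because true negatives do not enter the formula, long off periods do not inflate the score, which is why the measure copes with rarely used appliances.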
Pruning results are presented as the achieved performance against the pruning percentage, with values $p_{thrs} \in (0, 0.95)$. The optimal point of such a curve is computed as the point that has the minimal Euclidean distance from the 'ideal' point, whose coordinates are an F1-score equal to 1 and a pruning percentage equal to 1. This metric is computed using:

$$p^{opt}_{thrs} = \arg\min_{p_i} \; dist(F1, p_i) \qquad (8)$$

and

$$dist(F1, p_i) = \sqrt{\big(1 - F1(p_i)\big)^2 + (1 - p_i)^2} \qquad (9)$$

where $p_i \in (0, 0.95)$ and $F1(p_i)$ is the F1-score achieved at pruning percentage $p_i$. A visual representation of the proposed trade-off metric is depicted in Figure 3.
Utilizing the performance-sparsity trade-off metric, we are now able to identify the optimal pruning threshold of each pruning technique and use it as a baseline to compare the conventional NILM magnitude pruning with the proposed before-full-training NILM pruning scheme. Although different trade-off metrics, such as the performance-sparsity rate of change, could have been used to select the optimal pruning level, the major advantage of the proposed metric is that, since the performance and sparsity axes are on the same scale, it weights the performance and model complexity factors equally, resulting in a fair trade-off metric.
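The selection rule can be sketched in a few lines: for each evaluated pruning percentage, compute the Euclidean distance between the point (pruning percentage, F1-score) and the ideal point (1, 1), then keep the closest one. The curve values below are invented for illustration and are not results from the paper; the function names are hypothetical.

```python
import math


def distance_to_ideal(f1: float, p: float) -> float:
    """Euclidean distance from (p, F1) to the ideal point (1, 1)."""
    return math.sqrt((1.0 - f1) ** 2 + (1.0 - p) ** 2)


def optimal_threshold(curve):
    """curve: list of (pruning percentage, F1-score); returns the closest point to (1, 1)."""
    return min(curve, key=lambda point: distance_to_ideal(point[1], point[0]))


# Hypothetical performance-sparsity curve.
curve = [
    (0.10, 0.90),
    (0.30, 0.89),
    (0.50, 0.88),
    (0.70, 0.85),
    (0.90, 0.80),
    (0.95, 0.55),
]
p_opt, f1_at_opt = optimal_threshold(curve)
print(p_opt, f1_at_opt)  # 0.9 0.8
```

In this toy curve the last step from 90% to 95% sparsity gains little compression but costs much F1, so the rule settles on 90%; both axes lie in [0, 1], which is what makes the equal weighting meaningful.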

D. Deployment of OPT-NILM to Consumer's Side
The objective of this paper is to introduce a cutting-edge and cost-effective framework for NILM compression. However, to ensure practical usability and consumer benefits, a deployment scenario is essential. In this regard, we propose a decentralized solution that eliminates data transmission requirements for the inference phase and addresses the privacy concerns of the consumers. The developed solution is based on the Z-Wave communication protocol, which is well suited to smart home solutions due to its ability to create a mesh network topology. This topology allows devices to communicate with each other, ensuring the reliability and stability of the network as well as better coverage and communication range [7], [48]. To implement our solution, several integral components are employed. We utilize a Z-Wave energy meter, specifically the Aeotec Home Energy Meter Gen 5 [49], which is capable of recording up to 200 amps with 99% accuracy, in order to monitor and transmit the aggregated consumption data to the OPT-NILM inference service. As a gateway to collect this data and execute the OPT-NILM inference service, the Raspberry Pi Model 4 [50] was used, due to its cost-efficiency, compact design for easy installation, and ability to support Z-Wave communication using the Z-Wave daughter card [51]. To ensure users can conveniently access the appliance-level consumption predictions while safeguarding data security, we have set up a localhost Web service, removing the need to transmit data externally. A comprehensive visual layout of the proposed OPT-NILM inference deployment strategy is depicted in Figure 4. The proposed solution comprises four distinct services, all developed and deployed on the edge side. These services are tasked with gathering the aggregate consumption data and producing the disaggregated results.
• OPT-NILM inference service: This service is deployed in a Docker container that runs continuously on the edge device. It communicates directly with the PostgreSQL database at specified intervals to generate the disaggregation results, which are then visualized through the developed localhost Web service.
The demonstrated deployment scenario underscores the practical applicability of our OPT-NILM approach, illustrating its real-world operation. This strategy addresses privacy concerns by keeping all data transmissions confined to the user's side, eliminating the need for external exchanges during the whole inference phase.

V. EXPERIMENTAL SETUP
In this section, we give details related to the experimental setup. Specifically, we give a brief description of the dataset, the selected evaluation metrics, and the seq2seq model architecture that was used to run our experiments and assess the performance of the proposed pruning scheme.

A. Dataset
A publicly available electrical load measurement dataset, UK-DALE [33], was used to showcase the proposed pruning methodology. UK-DALE consists of aggregate consumption and appliance-level energy consumption measurements from five different houses in the United Kingdom. The dataset was built at a sample rate of 1 Hz (one measurement per second) for the whole-house signal and 1/6 Hz (one measurement every six seconds) for individual appliance consumption. UK-DALE has been widely used for benchmarking NILM algorithms, as it is one of the first open-access datasets at this temporal resolution. In this paper, the appliances used to evaluate and test our algorithms include the kettle, the dishwasher, the washing machine, and the fridge, due to their high frequency of use, high consumption, and presence in most houses. Another reason for selecting these devices is their different consumption patterns: the kettle provides an on-off consumption signal, the dishwasher and washing machine have different operational states, leading to a more complicated consumption pattern, and the fridge operates continuously. The aggregate signal was resampled to match the frequency of the appliance-level signals at 1/6 Hz. The models were trained using the data from houses 1, 3, 4 and 5, and they were tested on unseen data from house 2.
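The resampling step can be illustrated as follows. The text does not state how the 1 Hz aggregate signal was reduced to 1/6 Hz, so the sketch assumes the mean of each block of six consecutive samples; `downsample_mean` and the readings are hypothetical.

```python
def downsample_mean(signal, factor=6):
    """Average consecutive blocks of `factor` samples; a trailing partial block is dropped."""
    n_blocks = len(signal) // factor
    return [
        sum(signal[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]


# Hypothetical 1 Hz mains readings in watts (13 samples, so the last one is dropped).
mains_1hz = [100, 100, 100, 100, 100, 100, 2100, 2100, 2100, 2100, 2100, 2100, 90]
print(downsample_mean(mains_1hz))  # [100.0, 2100.0]
```

Taking every sixth sample instead of the block mean would also reach 1/6 Hz, at the cost of discarding short spikes between the retained samples.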

B. Model Architecture
To evaluate and test the proposed prior-to-full-training pruning scheme, we conducted experiments using a seq2seq CNN model. The model's architecture was inspired by the seq2point CNN proposed in [23], which was also used by the aforementioned NILM compression approaches. The main reason we modified this architecture into a seq2seq model is that seq2point models are less computationally efficient: they produce a single time-point prediction per input window instead of a whole output window, and therefore require many more forward passes. The proposed model architecture employs five 1-D convolutional layers with rectified linear unit (ReLU) activations, followed by two linear layers with ReLU and Sigmoid activations, respectively. The CNN architecture is shown in Figure 5. The baseline model has 22,146,000 trainable parameters and takes up 84 MB of memory. Although a separate model was trained for each appliance, this memory footprint posed no issues, since the models were deployed on a Raspberry Pi 4 with 4 GB of RAM and 16 GB of storage. The parameters that were pruned to reduce the training cost are the weights of the convolutional and linear layers described above. Although the proposed pruning technique is designed to be agnostic to the model architecture, its practical implementation might require some modifications depending on the specific architecture. Our choice of a CNN structure for this work was motivated by the robust compatibility of PyTorch's pruning module with the layers present in our proposed model.
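As a quick sanity check, the reported footprint is consistent with storing every parameter as a 32-bit float. The precision is not stated in the text, so the 4 bytes per parameter below is an assumption, as is reading "MB" as binary megabytes.

```python
TRAINABLE_PARAMS = 22_146_000  # baseline model size reported in the text
BYTES_PER_PARAM = 4            # assumption: float32 weights

size_bytes = TRAINABLE_PARAMS * BYTES_PER_PARAM
size_mib = size_bytes / 2**20  # binary megabytes (MiB)

print(f"{size_bytes:,} bytes = {size_mib:.1f} MiB")
```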

C. Evaluation Metrics
We record three widely used metrics to evaluate model performance. The Mean Absolute Error (MAE) and the Symmetric Mean Absolute Percentage Error (SMAPE), equations (10) and (11), were calculated using the ground truth, $y_t$, and the estimated appliance signature, $\hat{y}_t$, providing an evaluation of the NILM model's regression performance over a specific time window $t = 1, \ldots, T$. The MAE is given by

$$\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left| \hat{y}_t - y_t \right| \qquad (10)$$

Moreover, the $F_1$ score (7) was also used to assess the model's classification performance. The on-off activations of the appliances were computed by comparing the appliance consumption pattern with the requirements of Table I. In this study, the $F_1$ score is considered the most important metric, as it captures the model's ability to address the class imbalance, identify the appliances' activations and minimize the false positives. This was also the reason that the $F_1$ score was selected for the $p^{opt}_{thrs}$ calculation.
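The three metrics can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the evaluation code used in the study: SMAPE has several variants in the literature and the one below (absolute error divided by the mean of the absolute true and predicted values, with 0/0 counted as 0) is an assumed form, and the `on_threshold` used to derive on-off states is a hypothetical value, not one taken from the paper.

```python
def mae(y_true, y_pred):
    """Mean absolute error over the window."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true, y_pred):
    """Symmetric MAPE (assumed variant; result lies in [0, 2])."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        total += abs(p - t) / denom if denom else 0.0
    return total / len(y_true)


def f1_score(y_true, y_pred, on_threshold):
    """F1 on on-off states obtained by thresholding the power signals."""
    on_true = [t > on_threshold for t in y_true]
    on_pred = [p > on_threshold for p in y_pred]
    tp = sum(a and b for a, b in zip(on_true, on_pred))
    fp = sum((not a) and b for a, b in zip(on_true, on_pred))
    fn = sum(a and (not b) for a, b in zip(on_true, on_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical kettle-like window (watts)
y_true = [0, 0, 2000, 2000, 0]
y_pred = [0, 100, 1800, 0, 0]
print(mae(y_true, y_pred), smape(y_true, y_pred), f1_score(y_true, y_pred, 50))
```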

VI. RESULTS
The experiments presented in this section compare after-training pruning, which has been used in the previous NILM compression frameworks [8], [9], [15], with the proposed OPT-NILM scheme. The results focus on the performance of each technique as well as on the reduction of the model's trainable parameters. It is worth noting that the OPT-NILM approach requires multiple iterations to identify the optimal sub-network, which may seem to extend the cumulative training duration. Specifically, in the conducted experiments, identifying the sub-networks took 10 cycles of a single epoch each, amounting to 10% of the full training duration of 100 epochs. Although this might seem a significant time commitment, the results are compelling. Because the proposed compression scheme prunes the model's parameters before full training, it establishes itself as an efficient NILM compression framework. Its benefit is twofold: it produces optimized models tailored for deployment on edge devices with limited resources, and it reduces the computational burden of the training phase, since full training is executed on the identified sub-network.
As can be seen in Figure 6, the performance versus pruning level curves indicate that the proposed prior-to-full-training pruning achieves significantly better disaggregation performance with far fewer trainable parameters than the conventional approach. Specifically, for the kettle, which only has an 'on' and an 'off' state, the performance degradation under the proposed pruning scheme is indiscernible even when the model retains only 5% of its initial weights.
Fig. 7. Predicted consumption diagrams comparing OPT-NILM with the after-training pruning scheme and OPT-NILM with the baseline model. The pruning thresholds of both the OPT-NILM and after-training approaches were set equal to the $p^{opt}_{thrs}$ of the OPT-NILM approach: 95% for the kettle, 80% for the dishwasher, 85% for the fridge and 60% for the washing machine.
On the other hand, the impact of parameter pruning is more severe on the dishwasher and the washing machine, which have a more complicated consumption signal with more operational states. Finally, OPT-NILM also shows its superiority for the fridge, where it sustains a better performance-compression trade-off for all the selected evaluation metrics. To sum up, in all of the tested cases the proposed pruning technique performs significantly better than conventional after-training pruning, since it achieves significantly higher performance at the same pruning levels. This observation is also confirmed by the consumption prediction diagrams in Figure 7, which present the inferred consumption pattern of each appliance for a pruning threshold set to the $p^{opt}_{thrs}$ of the OPT-NILM approach and compare it with the baseline and after-training pruning approaches.
For the kettle, our proposed pruning scheme shows a superior disaggregation capability even with the pruning level set to 95%, as it manages to infer the corresponding consumption pattern. Conventional magnitude pruning, on the other hand, does not detect the kettle's activations at all, yielding very poor disaggregation performance for the same pruning threshold. Comparing the results of the proposed pruning scheme and the baseline model, we observe that the two prediction curves are very similar even though the pruned model uses only 5% of the baseline's parameters. In fact, the OPT-NILM method surpasses the baseline model, yielding an MAE of 140 compared to the baseline's MAE of 153. For the dishwasher, both techniques manage to infer the appliance's consumption pattern. However, conventional after-training pruning produces many false positive activations, in contrast to the proposed technique, which successfully predicts both 'on' and 'off' states.
Comparing the sub-network identified for the dishwasher with the baseline model, we observe a pattern similar to that of the kettle: the prediction curves are very similar even though the pruned network uses only 20% of the baseline's parameters. Notably, the OPT-NILM model achieves an MAE of 41.5, whereas post-training pruning yields an MAE of 42.4, further demonstrating the former's superior performance. Similar behavior is also observed for the fridge and the washing machine, where the OPT-NILM approach performs significantly better than the after-training approach and infers a consumption pattern very similar to that of the baseline model for pruning thresholds of 85% and 60%, respectively. Based on the predicted consumption diagrams for the fridge, OPT-NILM achieved an MAE of 21.7, markedly better than the after-training approach's 542.1 and marginally better than the baseline's 22.2. A similar trend was observed for the washing machine, where OPT-NILM registered an MAE of 214, surpassing the baseline's 232 and the after-training approach's 262.
The hypothesis that the suggested pruning approach yields better disaggregation performance and identifies more computationally efficient NILM models than traditional after-training pruning is also confirmed by Table I, which presents the disaggregation performance with respect to the model's compression. According to this table, the proposed technique achieves a better performance-compression trade-off (i.e., a high pruning threshold and low performance degradation) for all the tested appliances. Overall, the proposed OPT-NILM methodology consistently outperforms traditional after-training pruning and frequently produces disaggregation results comparable to, or even better than, those of the baseline model.
This enhanced performance is attributed to the fact that the proposed pruning approach identifies an optimal sub-structure within the initial network before the full training stage, which exhibits improved generalization capability on unseen data. This stands in contrast to the baseline model, which, due to potential overparameterization, may incorporate extraneous noise that undermines its performance. Conventional after-training pruning, on the other hand, operates under the assumption that low-magnitude weights are inconsequential and therefore dispensable. This assumption is not always correct: some of these low-magnitude weights remain pivotal to the model's core functionality, so their removal can significantly impair performance, rendering OPT-NILM a more effective alternative.
However, beyond the improvement in disaggregation performance, the main contribution of the proposed pruning scheme is that the model's parameters are removed before the full training of the model. This results in a more efficient model initialization, since the identified sub-network needs far fewer computational resources to be fully trained. Thus, the model's training cost and computational requirements are substantially reduced, promoting the real-world deployment and adoption of such a system. The reduction in the model's complexity is evaluated using the number of trainable parameters as well as the number of floating point operations (FLOPs) required to perform a forward pass. To highlight the contribution of the proposed technique, we evaluate the complexity of the pruned model both before full training and in the testing phase.
Table II indicates that the proposed pruning method leads to a noteworthy improvement in computational efficiency. Specifically, the optimal sub-network for the kettle retains just 5% of the initial number of trainable parameters, while the one for the dishwasher retains 20%. For the fridge, the optimal sub-network retains 15% of the initial parameters, in line with its 85% pruning threshold, and for the washing machine it retains 40% of the original model parameters. A similarly significant drop is observed in the number of FLOPs. In contrast, the conventional magnitude pruning approach does not improve the computational efficiency of the model during the training phase, in terms of either FLOPs or trainable parameters.
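To put these retention rates in absolute terms, the short sketch below converts pruning thresholds into retained parameter counts for the 22,146,000-parameter baseline model. The thresholds are the per-appliance values quoted in the Fig. 7 caption; the dictionary and function names are illustrative only, and the counts are simple arithmetic, not figures reported in Table II.

```python
BASE_PARAMS = 22_146_000  # trainable parameters of the baseline model

# Pruning thresholds (% of weights removed) as quoted in the Fig. 7 caption
PRUNING_THRESHOLD_PCT = {
    "kettle": 95,
    "dishwasher": 80,
    "fridge": 85,
    "washing machine": 60,
}


def retained_parameters(base_params, pruned_pct):
    """Number of weights left after removing `pruned_pct` percent of them."""
    return base_params * (100 - pruned_pct) // 100


retained = {
    name: retained_parameters(BASE_PARAMS, pct)
    for name, pct in PRUNING_THRESHOLD_PCT.items()
}
for name, count in retained.items():
    print(f"{name}: {count:,} parameters retained")
```

Note that with unstructured pruning these counts translate into memory or FLOP savings only if the zeroed weights are actually skipped or stored sparsely, which depends on the runtime.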
Evaluating the computational complexity of the pruned NILM models in the testing phase, we observe that the proposed pruning technique is also superior to standard after-training pruning. In terms of both the number of trainable parameters and FLOPs, the proposed pruning scheme identifies more computationally efficient networks that can be deployed on a resource-limited device while producing better disaggregation performance.
The superiority of our approach in terms of both computational complexity and disaggregation performance is also demonstrated by comparing OPT-NILM's overall performance across all tested appliances against two other works, [8] and [35], which employ after-training compression techniques on the same model architecture. The comparative results presented in Table III indicate that our approach achieves a better trade-off between compression and disaggregation performance, surpassing the capabilities of current edge NILM solutions and offering a more dependable and computationally efficient framework for potential consumers.

VII. CONCLUSION AND FUTURE WORK
In this work, we have proposed an efficient prior-to-full-training pruning scheme for edge deployment of NILM that produces significantly better results than the conventional after-training pruning approach and reduces the computational resources required for both the training and testing phases. The proposed pruning scheme not only identifies sub-networks with better disaggregation performance but also enables a cost-effective NILM deployment, since the sub-network structures are identified before the full training phase. Finally, we also introduced a trade-off metric to identify the optimal pruning threshold of a NILM model and used it to define a common ground for comparing the proposed pruning scheme with those used in past edge NILM research. The experimental findings confirm that the proposed methodology outperforms conventional after-training pruning techniques, not only in terms of disaggregation performance but also in reducing the computational costs of both the training and testing phases, providing a framework for a cost-effective, secure and reliable embedded solution with high potential on the consumer's side. Additionally, OPT-NILM demonstrates an overall superior trade-off between disaggregation performance and compression when compared to other works, further underscoring the effectiveness of the approach. Therefore, the proposed solution presents a cutting-edge approach to edge-based NILM that holds significant promise for real-world deployment and provides numerous advantages for consumers.
In our future research, we plan to explore additional pruning techniques, such as gradient-based magnitude pruning and information-based pruning, and to evaluate the efficacy of structured pruning. We also plan to exploit the versatility of the developed pruning scheme and extend it to other architectures prominent in the NILM domain, such as Transformers, LSTMs and GRUs. Finally, we aim to deploy our solution in real-world settings at a larger scale to assess the replicability of our simulation experiments under real-world conditions.

Fig. 1. The comparison of the conventional pruning process (upper) and the proposed OPT-NILM (lower).

Fig. 2. Overview of the OPT-NILM pruning scheme. Steps 2, 3 and 4 constitute the pre-training process of finding the optimal sub-networks, and they are repeated until the desired pruning level has been achieved.

$p_{thrs}^{1/n}$% of the smallest-magnitude weights are pruned by applying a binary mask $\mu \in \{0, 1\}^K$ such that its initialization is $\theta^{Pr}_1 = \mu_1 \odot \theta^{Tr}_1$,

Algorithm 1 OPT-NILM Compression Scheme
Initialize a neural network $f(x(t)_{t=1}^{T}; \theta_0)$
while $n \le N$ do
  • Train the network for 1 epoch to obtain $\theta^{Tr}_n$
  • Prune $p_{thrs}^{1/n}$% of the $\theta^{Tr}_n$ by creating a binary mask $\mu_n$
  • Reset the remaining weights back to $\theta_0$, $F_{rst}(\theta_0, \theta^{Pr}_n)$
end while
Fully train the obtained sub-network $f(x(t)_{t=1}^{T}; \theta)$
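The loop of Algorithm 1 can be sketched in plain Python on a flat list of weights. This is a toy illustration and not the authors' implementation: `train_one_epoch` is a hypothetical stand-in that merely perturbs the surviving weights, all function names are illustrative, and the per-round pruning fraction is chosen, as an assumption, so that the overall sparsity after `n_rounds` rounds is approximately `p_thrs`.

```python
import random


def train_one_epoch(weights, mask, rng):
    """Hypothetical stand-in for one training epoch: perturb surviving weights."""
    return [w + rng.gauss(0.0, 0.1) if m else 0.0 for w, m in zip(weights, mask)]


def prune_smallest(weights, mask, fraction):
    """Mask out the given fraction of surviving weights with smallest magnitude."""
    alive = [i for i, m in enumerate(mask) if m]
    n_prune = int(round(fraction * len(alive)))
    new_mask = list(mask)
    for i in sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]:
        new_mask[i] = 0
    return new_mask


def find_subnetwork(theta0, p_thrs, n_rounds, seed=0):
    """Iteratively train for one epoch, prune, and reset survivors to theta0."""
    rng = random.Random(seed)
    mask = [1] * len(theta0)
    # Assumed schedule: equal per-round fraction reaching p_thrs overall
    per_round = 1.0 - (1.0 - p_thrs) ** (1.0 / n_rounds)
    for _ in range(n_rounds):
        start = [w * m for w, m in zip(theta0, mask)]  # reset to initial values
        trained = train_one_epoch(start, mask, rng)    # one-epoch training
        mask = prune_smallest(trained, mask, per_round)
    # Sub-network at its original initialization, ready for full training
    return mask, [w * m for w, m in zip(theta0, mask)]


init_rng = random.Random(42)
theta0 = [init_rng.uniform(-1.0, 1.0) for _ in range(1000)]
mask, theta_sub = find_subnetwork(theta0, p_thrs=0.95, n_rounds=10)
print(f"{sum(mask)} of {len(theta0)} weights survive")
```

The key design point mirrored here is that pruning decisions are made on briefly trained weights, while the surviving weights are always returned to their initial values before the next round and before full training.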

Fig. 3. Example of the proposed trade-off metric. The blue dot denotes the ideal point (sparsity = 1.0, $F_1$ = 1), while the orange dot denotes the optimal point of the performance-sparsity curve.
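One way to read the figure is that the optimal point is the point on the performance-sparsity curve that lies closest to the ideal point. The sketch below makes that reading concrete using Euclidean distance, which weights sparsity and $F_1$ equally; this distance measure is an assumption, and the curve values are hypothetical, not results from the paper.

```python
import math


def optimal_pruning_point(sparsities, f1_scores, ideal=(1.0, 1.0)):
    """Return the (sparsity, F1) pair closest to the ideal point.

    Assumes Euclidean distance, i.e. equal weight on both axes.
    """
    return min(
        zip(sparsities, f1_scores),
        key=lambda pt: math.hypot(ideal[0] - pt[0], ideal[1] - pt[1]),
    )


# Hypothetical performance-sparsity curve
sparsities = [0.0, 0.5, 0.8, 0.95, 0.99]
f1_scores = [0.90, 0.90, 0.89, 0.88, 0.40]
best_sparsity, best_f1 = optimal_pruning_point(sparsities, f1_scores)
print(best_sparsity, best_f1)
```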
• Z-Wave JS: This is an open-source dockerized service that interfaces with the aggregate consumption smart meter via the Z-Wave protocol. It then transmits the collected data to the Z-Wave service through the MQTT (Message Queuing Telemetry Transport) protocol.
• Z-Wave service: This custom service receives the collected data from the Z-Wave JS UI via the MQTT protocol, and subsequently forwards it to the Data-broker service through an API (Application Programming Interface).
• Data-broker service: This service is responsible for receiving the data collected by the Z-Wave service and communicates with a local PostgreSQL database. Additionally, the Data-broker service is tasked with updating (saving and deleting) the collected data in the existing database.

Fig. 6. Pruning threshold vs. performance degradation diagrams. The blue dot indicates the ideal point, while the green and orange dots represent the optimal points based on the proposed trade-off metric.

TABLE I. COMPARATIVE EVALUATION RESULTS: DISAGGREGATION PERFORMANCE WITH RESPECT TO COMPRESSION THRESHOLD

TABLE II. PERCENTAGE IMPROVEMENT IN COMPRESSION METRICS DURING TRAINING AND INFERENCE PHASE OF THE FINAL PRUNED MODEL