Energy Efficient Deep Multi-Label ON/OFF Classification of Low Frequency Metered Home Appliances

Non-intrusive load monitoring (NILM) is the process of obtaining appliance-level data from a single metering point that measures the total electricity consumption of a household or a business. Appliance-level data can be used directly for demand response applications and energy management systems, as well as for raising awareness and motivating improvements in energy efficiency. Recently, classical machine learning and deep learning (DL) techniques have become very popular and have proved highly effective for NILM classification, but with growing complexity these methods face significant computational and energy demands during both training and operation. In this paper, we introduce a novel DL model aimed at enhanced multi-label classification of NILM with improved computation and energy efficiency. We also propose an evaluation methodology for comparing different models using data synthesized from the measurement datasets so as to better represent real-world scenarios. Compared to the state of the art, the proposed model reduces energy consumption by more than 23 % while providing on average approximately 8 percentage points of performance improvement when evaluated on data derived from the REFIT and UK-DALE datasets. We also show a 12 percentage point performance advantage of the proposed DL-based model over a random forest model and observe that performance degrades as the number of devices in the household increases: with each additional 5 devices, the average performance degrades by approximately 7 percentage points.


I. Introduction
Climate change represents a formidable challenge, and mitigating its impacts requires a concerted effort to keep the increase in global average temperature below 1.5 °C relative to pre-industrial levels. Electrical energy production is estimated to contribute more than 40 % of the total CO₂ equivalent produced by humankind¹ ². Some of the necessary steps in mitigating climate change are therefore to reduce energy consumption, and subsequently its production, and to increase the share of renewable energy sources³, which produce far less CO₂ equivalent than traditional power plants that burn fossil fuels. However, renewable energy sources mostly depend on external conditions such as wind and sun; they are thus less predictable and pose a challenge to the stability and reliability of the electrical power system [1]. To solve this problem, we have to work with the concept of demand response, i.e., changing the electrical power consumption to better match demand with supply [2]. Because of demand response, efforts are being made to monitor and manage energy consumption more effectively in residential buildings, which makes monitoring the activity of devices (ON/OFF events) relevant [3].
¹ https://tinyurl.com/electricity-production-CO2-1 (accessed 4.3.2024)
² https://tinyurl.com/electricity-production-CO2-2 (accessed 4.3.2024)
³ https://tinyurl.com/renewable-energy-doubled (accessed 4.3.2024)
Monitoring each device separately is costly and invasive, since it requires installing an electricity meter on each appliance. As an alternative, non-intrusive load monitoring (NILM) supported by disaggregation methods can reach the same result with just one electricity meter per household and is thus much more economical [4]. NILM is the process of obtaining appliance-level data from a single metering point that measures the total electricity consumption of a household or a business. By subsequent processing, it is possible to decompose NILM data into individual components, and by classification we can determine the state (ON/OFF) of devices and thus monitor their activity for demand response applications. In Europe, households consume 27.4 % of all electricity produced⁴. Cutting down on their consumption would thus play an important role in reducing our carbon footprint. As several research studies have shown, residents who are given real-time feedback on their electricity savings achieve a more comprehensive understanding of their electrical consumption and develop more energy-aware behaviour. Consequently, they consume 12 % less electricity than they would normally [5]. ON/OFF classification on NILM data can provide such feedback on device activity.
In the described application areas for classification on NILM, more than one device can be active at a time. Thus, the best approach to determining the activity states of the appliances is multi-label classification, where the state of the appliance is used as the class label and the recorded readings from a single meter in the household are used as input samples. Multi-label classification has been attempted on NILM with numerous methods that can be divided into two categories. The first category includes single-channel source separation techniques such as matrix factorization [6], sparse coding [7], dictionary learning [8], and non-negative tensor factorization [9], while the second category comprises machine learning approaches such as support vector machines (SVM) [10], random forests (RF) [11], decision trees (DT) [12] and deep neural networks [13]-[16].
[Figure 1. Recoverable labels: existing methodology [16], [19]-[24]; maximum and average maximum number of active devices obtained through dataset analysis.]

A. Contributions and Paper Organization
In this paper we are concerned with energy efficient ON/OFF classification of NILM data aimed at decreasing the overall energy consumption, which includes investigating whether more complex and more accurate DL approaches outweigh simpler and less energy-consuming classical ML approaches. The improvement in energy consumption, encouraged for the aforementioned ecological reasons, results from a reduction in computational cost, which is highly encouraged by the cloud computing community [17] and seen as a necessity in the field [18].
We propose a new DL architecture, inspired by the VGG family of architectures and by RNNs, for multi-label device activity classification on NILM data. We show that it performs better than state-of-the-art CNNs and CRNNs and state-of-the-art classical ML algorithms, while being more energy efficient than the state-of-the-art DL models.
Classical ML and DL models such as those used in [16], [19]-[24] are trained and evaluated with only five particular devices, which occur commonly across different datasets and the houses within them. This method is well suited for comparison with other models, but it cannot reflect realistic performance because it fully disregards all other devices. Those 5 devices are chosen because each draws a large proportion of energy and because together they represent a variety of power signatures. As stated in [25], this can be especially problematic because all devices with smaller energy consumption are thus disregarded, while modern homes tend to have many of them. Moreover, other problem types on NILM data, such as disaggregation and appliance classification, already employ many more than 5 devices [13], [26]-[28]. The methodology should thus be extended to more devices, and this should be done according to the specifics of the dataset to make it as close to real-world examples as possible.
To this end, this work examines a novel methodology, depicted in Figure 1, that more accurately represents the performance of models in practical use cases. The extended methodology complements existing methodologies [16], [19]-[24] by creating a so-called Realistic evaluation dataset and a Sensitivity evaluation dataset. The existing methodologies typically assume that there are 5 devices per household and that any of these can be active in the observation window. In the Realistic evaluation dataset we allow households to have a varying number of devices, say 5, 10, 15, ... N devices, of which any random combination can be active. In the Sensitivity evaluation dataset we further extend the possible combinations by allowing a household to have a varying number of devices (i.e., 5, 10, 15, ... N), but we study separately what happens when only 1, only 2, etc. are active at a time, thus enabling the assessment of the sensitivity to various combinations of devices.
We evaluate our model with both the established evaluation methodology and the proposed evaluation methodology to verify its performance.
Our main contributions are as follows:
• We propose a novel Convolutional transpose Recurrent Neural Network (CtRNN) architecture focusing on reduced computational complexity, which offers superior performance compared to the existing state-of-the-art architectures, with an average improvement of approximately 8 percentage points on mixed datasets derived from REFIT and UK-DALE and more than 23 % lower energy consumption, making it a more sustainable solution.
• We propose a novel evaluation methodology that begins with a dataset analysis and involves generating two groups of mixed datasets that are utilized for both training and testing. By taking into account the unique properties of the original dataset when generating mixed datasets, our approach results in a more realistic evaluation of model performance, more closely reflecting real-world scenarios.
• We perform a comprehensive analysis taking into account the performance and energy efficiency of the compared approaches for NILM ON/OFF classification. We observe an average drop of approximately 7 percentage points in F1 score for each 5 newly added devices in the household.
This paper is organized as follows. Section II analyzes related work, while Section III provides the problem statement and elaborates on methodological aspects. Section IV presents the proposed model and Section V provides a comprehensive evaluation, with Section VI concluding the paper.

II. Related Work
In this section, we present related work focusing on multi-label classification on NILM with the use of classical machine learning (ML) algorithms and deep learning (DL) techniques. To provide a comprehensive overview of the state of the art in this area, Table I summarizes selected key references to prior work, including the type of problem addressed, the approach used, the number and names of the datasets utilized, and the number of devices involved in each study. In the following subsections, when discussing these specific aspects, we also refer to further relevant works.

A. NILM Problem Type
The second column in Table I demonstrates that state-of-the-art approaches for NILM can be categorized into three distinct types: disaggregation, ON/OFF classification, and appliance classification. The disaggregation problem is focused on decomposing the NILM signal into individual components that correspond to distinct power signatures of active appliances [26], [31]. The ON/OFF classification of appliances aims to determine which devices are active and which are inactive in an aggregated power signal [11], [16], [19], [29], [30]. The appliance classification problem assumes that the disaggregated signals are accessible and intends to classify the devices that generated each unique power signature extracted from the NILM signal [27], [28], [32], [33].
The focus of this paper is on the ON/OFF classification problem type, which pertains to the identification of the activity state of individual appliances from an aggregated power signal without requiring prior disaggregation.

B. Methods for Solving NILM Problems
As the analysis of the related work shows, approaches to solving NILM involve either a two-stage process, in which disaggregation is followed by classification that performs automatic appliance identification, or a one-stage process, in which ON/OFF classification is done directly on aggregated data. In the last few years, several classical ML and DL methods have been proposed in this area.
Utilizing the fully supervised learning method, Wu et al. [11] [...] Rehmani et al. [35] demonstrated that computationally intensive deep learning (DL) algorithms, such as CNN and RNN, were not required for their particular datasets, since classical machine learning algorithms, such as kNN and RF, already yielded an accuracy of 99 %. However, on the openly available and well documented REFIT and UK-DALE datasets, recently used in many reference works as well as in this study, classical machine learning does not achieve suitable performance, so these datasets are used with DL models. To address the high computational complexity and energy consumption associated with DL models, we designed a novel DL architecture based on the principles established in our previous work [36]. Additionally, to ensure consistency with the latest research, we compared our approach to the works by Langevin et al. [19] and Tanoni et al. [16], who also utilized the REFIT and UK-DALE datasets. Furthermore, we considered the findings of Ahajjam et al. [37], who discovered that the optimal signal length varies across datasets, and hence adopted the same signal length as Tanoni et al. [16].

C. NILM Data for ML Model Training
As per the 5th column in Table I, the related works employed low-frequency (LF) and high-frequency (HF) datasets. The European Union and UK technical specifications suggest the use of LF smart meters with a sampling period of around 10 seconds⁵ for units installed in typical households. To circumvent the need to purchase and install new HF smart meters, and instead utilize the existing LF meters whose readings are already available via the COSEM interface classes and OBIS Object Identification System⁶, this paper proposes an ON/OFF classification model for LF meters.
Typically, the number of devices employed in different works is fixed, with the exception of Raiker et al. [26], Chen et al. [28] and Yin et al. [33], who utilized up to 11, 15 and 5 devices, respectively.In this work, however, we utilize a flexible range of up to 54 devices.

D. Energy Consumption
The energy consumption of hardware used for running DL models has only recently become a growing concern in the community. In [38], Hsueh conducted an analysis of the energy consumption of ML algorithms and found that convolutional layers, operating in three dimensions, consume significantly more power than fully connected layers, which operate in two dimensions. Verhelst et al. [39] delved into the complexity of CNNs and explored hardware optimization techniques, particularly for the Internet of Things (IoT) and embedded devices. Another study by Garcia et al. [40] surveyed the energy consumption of various models and proposed a taxonomy of power estimation models at both software and hardware levels. They also discussed existing approaches for estimating energy consumption, noting that using the number of weights alone is not accurate enough. They suggested that a more precise calculation of energy consumption requires the calculation of either FLOPs or multiply-accumulate operations.
⁵ https://tinyurl.com/SMIP-E2E-SMETS2 (accessed 4.3.2024)
⁶ https://tinyurl.com/COSEM-interface-classes (accessed 4.3.2024)

A. Problem Statement
The objective of this study is to identify which devices are currently active. The total electrical power $p(t)$ consumed by a household at any given moment $t$ is calculated as the sum of the power used by each electrical device, denoted as $p_i(t)$, where there are $N_D$ devices in total, as defined in Eq. 1. Additionally, measurement noise (including any unidentified residual devices) $e(t)$ is also taken into account. The status indicator $s_i(t)$ determines the activity of each device, where $s_i(t) = 0$ indicates that the device is inactive and $s_i(t) = 1$ indicates that the device is active at the given moment $t$.
$$p(t) = \sum_{i=1}^{N_D} s_i(t)\, p_i(t) + e(t) \tag{1}$$
To solve the problem and thus estimate the status indicator $s_i(t)$ for each device, we can employ classical ML or DL for multi-label classification of devices. Devices are classified as active if the corresponding status indicator $s_i(t)$ predicted by the model exceeds 0.5, as illustrated in Figure 2. The cardinality of the set $\mathcal{D}$ representing all the possible active devices, denoted by $|\mathcal{D}|$, indicates the number of labels that need to be recognized. In the context of this paper, the value of $|\mathcal{D}|$ varies between experiments, as explained in Sections III-B1 and III-B2.
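As a minimal sketch of the thresholding step described above (not the authors' implementation), the snippet below converts per-device model outputs into ON/OFF labels. The function name `to_on_off` and the example scores are hypothetical; only the 0.5 threshold comes from the text.

```python
from typing import List, Sequence

THRESHOLD = 0.5  # decision threshold stated in the text


def to_on_off(scores: Sequence[float], threshold: float = THRESHOLD) -> List[int]:
    """Map per-device scores in [0, 1] to status indicators (1 = active)."""
    # A device counts as active only if its score strictly exceeds the threshold.
    return [1 if s > threshold else 0 for s in scores]


# Hypothetical model output for a household with 5 devices.
example_scores = [0.91, 0.12, 0.50, 0.73, 0.04]
example_status = to_on_off(example_scores)
```

Because each device is thresholded independently, any number of devices can be reported as active for the same time window, which is what makes the task multi-label rather than multi-class.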

B. Methodology
The methodology depicted in Figure 1 is designed to represent the performance of models in practical use cases more accurately. We approached the problem by first analysing the dataset to identify the maximum number of active devices $N_{\text{max.AD}}$ in the time windows that the model is trained on and the average maximum number $\overline{N}_{\text{max.AD}}$. We then generated two distinct groups of mixed datasets, each comprising different sets of active and inactive devices. The first group contained mixed datasets with a fixed number of $N_{\text{max.AD}}$ active devices, whereas the second group included mixed datasets with a varying number of active devices between 1 and $\overline{N}_{\text{max.AD}}$. To generate the groups of mixed datasets we used the REFIT [41] and UK-DALE [42] low-frequency datasets, also used by Tanoni et al. [16] and Langevin et al. [19].
We propose a methodology⁷ that aims to assess classical ML and DL models in a realistic scenario, which differs from the approaches commonly taken in recent works [16], [20]-[24] that only use 5 distinct devices for evaluation. This limited number does not represent the typical diversity of devices encountered in real-world settings. For instance, the analysis of the REFIT and UK-DALE datasets in Figure 3 reveals the presence of up to 9 and 26 active devices, respectively. Therefore, our methodology considers a wider range of devices for a more accurate evaluation of DL models in realistic conditions.
1) Sensitivity Evaluation Group: The Sensitivity Evaluation (SE) Group is a group of multiple mixed datasets that cover cases where there are 5, 10, 15 or 20 devices in total (DiT) in the household and 1, 2, ..., $N_{\text{max.AD}}$ of them are active devices (AD). The number of DiT is used universally across all datasets, and the maximum number of AD ($N_{\text{max.AD}}$) is a parameter that depends on the maximum number of active devices in the time window on which we train our model.
Mixed datasets in SE Group provide an insight into how the model in testing performs depending on the number of AD in four general cases of DiT, i.e. 5, 10, 15 and 20.In case of the UK-DALE dataset we also give an insight into a case of 54 DiT, which significantly exceeds the maximum number of DiT in the REFIT dataset.
In our case, we used windows of 2550 samples for training, which at the sample rate of REFIT and UK-DALE results in an approximately 6-hour-long time window. In the given time window, the number of AD ranges from 1 to 9 for REFIT and from 2 to 26 for UK-DALE, as shown in Figure 3; thus $N_{\text{max.AD}} = 9$ for REFIT and $N_{\text{max.AD}} = 26$ for UK-DALE.
2) Realistic Evaluation Group: The Realistic Evaluation (RE) Group is an extension of the methodology employed in many recent works [16], [19]-[24], which account for only 5 distinct devices. We propose a group of multiple mixed datasets that cover cases where there are 5 and also 10, 15 and 20 devices in total (DiT) in the household, chosen at random. We generate mixed datasets with an equal mix of all possible numbers of ADs. Thus, we generated 4 mixed datasets, each containing samples with 1, 2, ..., $\overline{N}_{\text{max.AD}}$ ADs. Such an RE Group presents a more practical evaluation of the model by simulating a real-world scenario in which households utilize a variety of active devices rather than a fixed number.
In our case, the average maximum number ($\overline{N}_{\text{max.AD}}$) of active devices was 8 for REFIT and 14 for UK-DALE, as supported by the results in Figure 3. Therefore, the training data comprised time windows with ADs ranging from 1 to 8 and from 1 to 14 for REFIT and UK-DALE, respectively. However, in cases where there were fewer devices in the household than active devices, the range of ADs was set to 1 to DiT-1. For instance, when there were 5 DiT in the household, the range of ADs was from 1 to 4, as was the case for both the REFIT and UK-DALE datasets. Similarly, when there were 10 devices in the household, the range was from 1 to 9 ADs. Lastly, it should be noted that we used an 80:20 split for the training and evaluation parts of all datasets in this research.
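The range rule above can be sketched as follows. This is an illustrative reading rather than the authors' generator: the helper names `ad_range` and `sample_active_devices` are hypothetical, the behaviour when DiT equals the average maximum is an assumption, and drawing the count uniformly only approximates the "equal mix" of AD counts described in the text.

```python
import random
from typing import List


def ad_range(dit: int, avg_max_ad: int) -> range:
    """Allowed numbers of active devices (AD) for a household with `dit` devices.

    The upper bound is the average maximum number of active devices; when the
    household does not have more devices than that, it is capped at dit - 1.
    (Treating dit == avg_max_ad like the "fewer devices" case is an assumption.)
    """
    upper = avg_max_ad if dit > avg_max_ad else dit - 1
    return range(1, upper + 1)


def sample_active_devices(devices: List[str], avg_max_ad: int,
                          rng: random.Random) -> List[str]:
    """Pick a random AD count from the allowed range, then a random device subset."""
    n_active = rng.choice(ad_range(len(devices), avg_max_ad))
    return rng.sample(devices, n_active)
```

For example, `list(ad_range(5, 8))` gives the 1 to 4 range quoted for 5 DiT, and `list(ad_range(10, 14))` gives the 1 to 9 range quoted for 10 DiT.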

IV. Proposed Neural Network Architecture
To solve the problem defined in Section III-A, we introduce the novel CtRNN architecture based on the VGG family. Architectures from this family, adapted for time series data, have previously proved successful in NILM disaggregation tasks [43]. Additionally, the hyper-parameters of the architecture were determined empirically following the principles derived from our prior work [36], which address the trade-off between prediction performance and computational complexity. The computational complexity of a network, measured in floating-point operations (FLOPs), is determined by the number of layers $L$ and the computational complexity of each separate layer $F_l$, as described in Eq. 2. Throughout our empirical design phase, we explored networks with the number of layers $L \in \{16, ..., 22\}$.
$$F_{\text{total}} = \sum_{l=1}^{L} F_l \tag{2}$$
In addition to the VGG adaptation, our architecture also includes a transposed convolutional (TCNN) layer and a gated recurrent unit (GRU) layer. The transposed convolutional layer increases the temporal resolution of features while reducing the number of features from the previous layer [13], while the GRU is utilised to better capture the temporal correlation in the time series and has shown great potential in solving NILM-related tasks [16], [19]. The combination of CNN, TCNN and GRU layers enables the architecture to better capture the spatial and temporal correlation within the time series data.
The resulting architecture is illustrated in Figure 4, where each layer is depicted with its type and hyper-parameters. The architecture comprises four blocks, each consisting of two convolutional layers and one average pooling layer. The number of filters doubles with each block, starting from 64. Following the convolutional blocks, there is a TCNN layer and a GRU layer. Prior to the output layer, there are two fully-connected layers with 4096 nodes. The number of nodes in the output layer is adjusted to meet the specific requirements and ranges from 5 to 54, depending on the dataset used. All layers utilize the ReLU activation function, except for the output layer, which employs the sigmoid activation function.
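The layer ordering described above can be written down as a plain layer plan. This is only a structural sketch: kernel sizes, strides and the widths of the TCNN and GRU layers are not given in this section, so they are left as `None` rather than guessed, and the tuple format and the name `with_output` are hypothetical.

```python
from typing import List, Optional, Tuple

Layer = Tuple[str, Optional[int]]  # (layer kind, filters or nodes if stated)

N_BLOCKS = 4
FIRST_BLOCK_FILTERS = 64

layers: List[Layer] = []
for block in range(N_BLOCKS):
    filters = FIRST_BLOCK_FILTERS * 2 ** block  # doubles per block: 64, 128, 256, 512
    layers += [("conv_relu", filters), ("conv_relu", filters), ("avg_pool", None)]

# Layers that follow the convolutional blocks, in the order given in the text.
layers += [("transposed_conv_relu", None), ("gru", None)]
layers += [("dense_relu", 4096), ("dense_relu", 4096)]


def with_output(n_devices: int) -> List[Layer]:
    """Append the sigmoid output layer, one node per device label (5 to 54)."""
    return layers + [("dense_sigmoid", n_devices)]


conv_filters = [n for kind, n in layers if kind == "conv_relu"]
```

A plan of this kind is convenient as input to a per-layer complexity estimate, since each entry can be mapped to its own FLOP count and summed.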

A. Computation Efficiency Considerations
As the purpose of using NILM is to reduce energy consumption, it is logical to ensure that the process itself is as energy-efficient as possible.Thus, our goal was to design a deep learning architecture that surpasses the state of the art not only in performance but also in terms of energy efficiency.
In order to assess the energy consumption of the architecture, it is necessary to calculate its complexity.This typically involves adding up the total number of FLOPs required for each layer.
We estimate the complexity of the most energy-consuming layers, namely the convolutional, pooling, and fully-connected layers, using the equations presented by Pirnat et al. in [36].In addition, we calculate the complexity of the GRU layer using the equation proposed in [44].
We use equations from [36] to estimate the energy consumption of the proposed architecture and, for comparison, of one popular reference architecture from the VGG family, i.e., VGG11 [45], and of the two architectures used as references in the performance evaluation, i.e., TanoniCRNN [16] and VAE-NILM [19]. This is done under the assumption that the architecture is trained and used on an Nvidia A100 graphics card and that each kWh of electricity produced results in 250 g of CO₂ equivalent emissions (as is the case for Slovenia).
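To make the emission assumption concrete, the sketch below converts an energy figure into CO₂ equivalent using the 250 g per kWh intensity stated above. It does not reproduce the equations of [36]; the function name `co2_grams` is hypothetical, and the only physical constant used is the exact conversion 1 kWh = 3.6 MJ.

```python
JOULES_PER_KWH = 3.6e6   # 1 kWh = 3.6 MJ (exact unit conversion)
CO2_G_PER_KWH = 250.0    # carbon intensity assumed in the text


def co2_grams(energy_joules: float, intensity_g_per_kwh: float = CO2_G_PER_KWH) -> float:
    """Convert consumed energy in joules to grams of CO2 equivalent."""
    energy_kwh = energy_joules / JOULES_PER_KWH
    return energy_kwh * intensity_g_per_kwh


# One kWh of consumed energy corresponds to 250 g under the stated assumption.
one_kwh_emissions = co2_grams(JOULES_PER_KWH)
```

Keeping the carbon intensity as a parameter makes it easy to repeat the estimate for a grid with a different energy mix.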

B. Evaluation Datasets and Training Parameters
We first compared the performance of our model to that of VGG11 [45], to that of the model created by Tanoni et al. [16] adapted to fully supervised DL, and to the results achieved by Langevin et al. [19]. This comparison was done using the standard evaluation methodology defined in [16], which comprises a total of 5 distinct devices: fridge, washing machine, dish washer, microwave and kettle. Those 5 devices were also used in many recent works [16], [19]-[24], which did not specify the exact number of active devices; we therefore chose to reproduce the setup in [16]. Samples with varying numbers of active devices, from 1 to 4, are randomly interspersed throughout the mixed dataset. To create the training and evaluation parts of the dataset we used an 80:20 split.
For this comparison we used a learning rate of 0.0003 and 20 epochs for our model while for the TanoniCRNN model we adopted the parameters specified as optimal in [16], which include the same number of epochs and a different learning rate of 0.002.Moreover, we used the same batch size of 128 for both models.
Subsequently, we compared the performance of our model with that of VGG11 and RF on the two groups of mixed datasets described in Section III-B. We chose VGG11 as a benchmark because VGG architectures have been adopted in recent works [46], [47] for classification in NILM due to their effectiveness, and VGG11 is the closest match in terms of complexity. RF was chosen as a benchmark because it was reported to be the best classical ML algorithm for ON/OFF classification on NILM data in a previous study [11].
The training parameters used for the SE Group are presented in Table II for CtRNN and VGG11. Each sub-table lists the parameters for one combination of model and dataset: the first addresses CtRNN with REFIT, the second CtRNN with UK-DALE, and the third and fourth VGG11 with REFIT and UK-DALE, respectively. For each model-dataset combination, parameters are provided for training data with various DiT, namely 5, 10, ..., 54, for the SE Group described in Section III-B1. The grouped columns give the parameter values of the architectures, namely BS, LR and E. BS denotes the batch size, i.e., the number of training samples that are passed to the architecture at a time. LR stands for learning rate, a parameter that determines the step size at each training iteration. Finally, E signifies the number of epochs, one epoch representing one pass of the training data through the architecture. As can be seen in the table, the three parameters differ across the architecture-dataset, DiT and AD combinations. For instance, for CtRNN with REFIT, 5 DiT and 1 AD, the BS is 512, the LR is $10^{-4}$ and E is 40.
In summary, the batch size used for training both models was predominantly set to 128, with some variations of 256 or 512. The epoch count was set to 20 for both models in most cases, but it was set to 50 for VGG11 and to 40 for CtRNN in certain scenarios in which the models benefited from more training passes. The learning rate ranged between $5\times10^{-5}$ and $10^{-4}$ for VGG11 and between $5\times10^{-5}$ and $5\times10^{-4}$ for CtRNN, since both required a larger or smaller step size in certain scenarios. The batch size and the number of epochs were similar to those in [16] and were fine-tuned through an empirical process. The learning rate was also tuned empirically.
The training parameters used for the RE Group had less variation, always using a batch size of 128 and 20 epochs; the study can therefore be replicated with the numbers provided in this paragraph. The learning rate for CtRNN was 0.0003 on both the REFIT and UK-DALE datasets; for VGG11 it was 0.0001 on both datasets. In all tests, the batch size was selected through trial and error as either the largest we could run or the one that gave the best results. It is the same for both models in a given test, as performance varies only slightly with its size.

C. Metrics
We evaluate the performance with a combined metric, the average weighted F1 score ($F1_w$), since a performance evaluation based on a simple arithmetic mean of the F1 scores would fail to accurately reflect the overall performance: our mixed datasets are generated in a way that does not give each device equal representation. The use of weights ensures that each device affects the average score in proportion to how often it appears in the particular mixed dataset.
Average weighted F1 score is based on three metrics: true positive (TP), false positive (FP), and false negative (FN).TP represents the cases where the device is correctly classified as active, FP represents the cases where the device is incorrectly classified as active, and FN represents the cases where the device is incorrectly classified as inactive.
Using these metrics, we calculate the precision $P = \frac{TP}{TP + FP}$ and the recall $R = \frac{TP}{TP + FN}$, and from these we derive the F1 score $F1 = 2 \times \frac{P \times R}{P + R}$. To obtain the average weighted F1 score defined in Eq. 3, we calculate the F1 score for each device and then take the average based on their weight $weight = \frac{SSD}{SAD}$, which is determined by the support for the specified device (SSD) and the support of all devices (SAD).
$$F1_w = \sum_{i} \frac{SSD_i}{SAD}\, F1_i \tag{3}$$
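The metric can be illustrated with a short sketch. It assumes that the support of a device is the number of samples in which the device is truly active, i.e. TP + FN; this is the common convention but is an assumption here, as are the function names and the example counts.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (TP, FP, FN) for one device


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # avoids division by zero; precision and recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(counts: Dict[str, Counts]) -> float:
    """Average of per-device F1 scores, weighted by each device's support."""
    # Assumption: support of a device = number of truly active samples = TP + FN.
    supports = {dev: tp + fn for dev, (tp, fp, fn) in counts.items()}
    total_support = sum(supports.values())
    if total_support == 0:
        return 0.0
    return sum(supports[dev] / total_support * f1_score(*c)
               for dev, c in counts.items())


# Hypothetical confusion counts for a frequently and a rarely active device.
example = {"fridge": (90, 0, 10), "kettle": (8, 2, 0)}
example_score = weighted_f1(example)
```

In this example the fridge, with a support of 100, dominates the score, while the kettle, with a support of 8, contributes only a small share, which is the proportional behaviour the weighting is meant to provide.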

V. Performance Evaluation
Using the proposed methodology with two groups of mixed datasets, we carried out a comprehensive performance evaluation of the CtRNN DL architecture and benchmarked it against selected state-of-the-art architectures in terms of energy efficiency and accuracy of determining the status of devices, as described in the following.

A. Energy Consumption
The results of our energy consumption evaluation, performed according to the considerations in Section II-D and the methodology described in Section IV-B, for training different architectures for each mixed dataset from the RE Group or SE Group using batch size 128 are displayed in Table III. The rows of the table list the neural network architectures considered in this work, namely the proposed CtRNN and the VGG11, TanoniCRNN and VAE-NILM baselines selected in Section IV. The first column of the table displays the number of internal parameters of each NN, i.e., the total number of weights and biases. The second column lists the number of FLOPs, which is the number of floating-point operations needed for one pass through the NN. The third column shows the energy consumed during the training of the models. The values in the second and third columns are calculated as explained in Section IV-A. Although TanoniCRNN has the lowest number of parameters, this does not result in the lowest number of FLOPs or the lowest energy consumption. In addition to the energy used during training, the energy consumed for making predictions can also be significant when the number of prediction requests is high, as depicted in Figure 5. The x-axis of the figure plots the number of predictions from 0 to 10 million, while the y-axis plots the consumed energy in megajoules. The results show that in making 10M predictions our model consumes 41.8 MJ, while TanoniCRNN, VGG11 and VAE-NILM consume 54.4 MJ, 59.5 MJ and 11.5 MJ, respectively. VAE-NILM consumes notably less energy than the other models, which can be attributed to Langevin et al. [19] using a window of 1024 samples for training, while we used 2550 for all others, and to its lower number of FLOPs. The figure also shows that for more than a million predictions, the energy consumed exceeds the energy used for training the models, with the exception of VAE-NILM, for which only a range can be computed based on the number of epochs provided in [19].
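The inference figures quoted above can be turned into per-prediction energy and relative savings with simple arithmetic. The sketch below uses only the four values stated in the text; the dictionary and function names are hypothetical.

```python
# Energy for 10 million predictions, in megajoules, as quoted in the text.
ENERGY_10M_MJ = {
    "CtRNN": 41.8,
    "TanoniCRNN": 54.4,
    "VGG11": 59.5,
    "VAE-NILM": 11.5,
}
N_PREDICTIONS = 10_000_000

# Average energy per single prediction, in joules.
joules_per_prediction = {
    model: mj * 1e6 / N_PREDICTIONS for model, mj in ENERGY_10M_MJ.items()
}


def relative_saving(reference: str, model: str = "CtRNN") -> float:
    """Fraction of energy saved by `model` relative to `reference`."""
    return 1.0 - ENERGY_10M_MJ[model] / ENERGY_10M_MJ[reference]


saving_vs_tanoni = relative_saving("TanoniCRNN")
saving_vs_vgg11 = relative_saving("VGG11")
```

With these numbers, CtRNN uses about 4.18 J per prediction, which is roughly 23 % less than TanoniCRNN and roughly 30 % less than VGG11 for inference.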

B. Results on Mixed Dataset with 5 Commonly Used Devices
Comparison with Tanoni [16] and Langevin [19] on datasets of 5 devices derived from REFIT and UK-DALE demonstrates the superior performance of our proposed model, with a significant gap in F1 score, as can be seen in Table IV. The first column of the table lists the devices, columns 2-4 list the F1 scores for the models trained on the REFIT dataset, and columns 5-7 list those for the models trained on the UK-DALE dataset. For each of the datasets, the subcolumns list the evaluated architectures, namely CtRNN, TanoniCRNN and VAE-NILM. Rows 2-6 present per-device F1 scores, while row 7 provides the weighted average of the F1 score. It can be seen from the table that the proposed model displays superior performance on all devices on both datasets, with the exception of the fridge in UK-DALE, where it is even with TanoniCRNN. That is also evident from the average weighted F1 score in the final row. Specifically, on the REFIT-derived dataset, our model achieves an average weighted F1 score of 91 % compared to 83 % and 78 % obtained by TanoniCRNN and VAE-NILM, respectively, an improvement of 8 and 13 percentage points. On the UK-DALE-derived dataset, our model outperforms TanoniCRNN and VAE-NILM by 7 and 26 percentage points, respectively, achieving an average weighted F1 score of 94 % compared to their 87 % and 68 %.
A closer analysis of the two best models from Table IV shows, according to the second row of the table, that our approach slightly outperforms the approach of [16] in recognizing the fridge class, with an F1 of 0.93 compared to 0.92 on the REFIT dataset, while both approaches work perfectly on the UK-DALE dataset. However, the fridge class of devices is easier to identify, as its consumption pattern is periodic and is the most pronounced of all appliances. The real difference in performance between the two models is seen in the detection of appliances with short consumption intervals. In row 3 of Table IV, our method outperforms TanoniCRNN [16] in detecting washing machines by 5 percentage points on the REFIT dataset and by 11 percentage points on the UK-DALE dataset. When detecting dishwashers in row 4, our method is 8 percentage points better than [16] on the REFIT dataset and 1 percentage point better on the UK-DALE dataset. The largest difference in performance can be observed in row 5 for the microwave class. Here, our method, with an F1 of 0.89, is significantly better than [16], with an F1 of 0.71, on the REFIT dataset. A similar result can be observed on the UK-DALE dataset, where our method achieves an F1 score of 0.96 compared to 0.80 for [16]. For the kettle class in row 6 of Table IV, our method again outperforms [16], by 8 and 7 percentage points on the REFIT and UK-DALE datasets, respectively.
These results show that our proposed architecture is significantly better than [16] at detecting appliances with shorter consumption durations. The reason is that our architecture design is better at detecting both spatial and temporal correlations within the signal. Spatial correlations are detected by the convolutional layers, while temporal correlations are detected by the GRU layers of the architecture described in Section IV and depicted in Figure 4.

C. Results with SE Group Mixed Datasets
The results of comparing the performance of the CtRNN, VGG11 and RF models on the Sensitivity Evaluation (SE) Group mixed datasets are displayed in Figure 6. Figures 6a-c show heatmaps of the evaluation results of the models on the REFIT dataset, whereas Figures 6e-g show heatmaps of the evaluation results on the UK-DALE dataset. Figure 6d and Figure 6h, on the other hand, illustrate the probabilities of obtaining correct results by random guess, calculated with Eq. 4. As can be seen from Figures 6a-d, for the REFIT dataset, the accuracy of the model is affected both by the increase in the number of active devices (AD) and by the increase in the total number of devices (DiT) that can be active at the same time. The average performance degradation of our model is 9.2 percentage points per each 5 DiT added on the REFIT dataset and 5.4 percentage points on the UK-DALE dataset. However, as can be seen for all three utilised classifiers and for the theoretical calculation, the classification performance increases when the number of AD approaches the number of DiT. All three classifier models significantly outperform the random classifier, with our CtRNN approach being the best of the three. Looking at the third row of heatmaps, depicting results for 15 DiT in Figures 6a-c, it can be seen that our approach achieves scores above 71 %, while the models based on VGG11 and RF achieve scores above 57 % and 56 %, respectively. All models significantly outperform random chance, which reaches values as low as 0.02 %.
Similar observations can be made in Figures 6e-h for the UK-DALE dataset. All three models significantly outperformed the random model, where again our approach achieved the highest score across all tested scenarios. Looking at the third row of heatmaps, depicting results for 15 DiT in Figures 6e-g, it can be seen that our approach achieves slightly lower accuracy scores than the VGG11 and RF algorithms, by up to 0.02, when the number of AD approaches the number of DiT. The reason for this is that our approach is less prone to overfitting than the other two approaches, which is supported by the fact that for up to 11 AD our approach significantly outperforms the other two approaches by up to 18 percentage points.
To summarize the difference between the results of the different models, we calculated the average improvement (I) across all mixed datasets in the SE Group using Eq. 5. Our model outperforms the VGG11 model by 11.03 and 9.4 percentage points on the REFIT and UK-DALE derived datasets, respectively. Compared to the RF model, our model achieves an even greater improvement, with 14.15 and 13.88 percentage points on the REFIT and UK-DALE derived datasets, respectively.

$$ I = \frac{1}{N_{datasets}} \sum_{n=1}^{N_{datasets}} \left( F1_{\mathrm{CtRNN},n} - F1_{\mathrm{other},n} \right) \tag{5} $$

From the analysis of the heatmaps in Figure 6 we notice that once the number of AD surpasses 50 % of DiT in the mixed dataset, the chance of correct classification increases. This trend is clearly visible for both REFIT and UK-DALE in the rows with 5, 10 and 15 DiT, and less so in the row with 20 DiT and, for UK-DALE, in the row with 54 DiT. To better understand the trend, we calculate the probability of correctly classifying devices by random guess in the SE Group mixed datasets with Eq. 4 and depict the results in Figure 7. The x-axis represents the number of active devices (AD) out of the devices in total (DiT), while the y-axis represents the probability of guessing the result correctly. We display a curve for each number of DiT in the SE Group, thus showing the probability for all employed combinations of AD and DiT. Consider the curve for 5 DiT in Figure 7a. When randomly picking 1 AD out of 5 DiT, the likelihood of it being correct is 20 %. Next, the likelihood of correctly guessing 2 AD out of 5 DiT is 10 %, 3 out of 5 is 10 %, while 4 out of 5 is 20 %. The results in the figure show that, for a small number of AD or a number of AD that is comparable with the number of DiT, the random guess works best, while when the number of AD is about 50 % of the number of DiT it yields the worst results. That is because the probability is expressed with combinations as shown in Eq. 4, since the order of the predicted active devices does not matter. Looking at Figure 7, we notice a similar decrease and increase in performance as previously seen in the rows of the heatmaps in Figure 6.
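The U-shaped behaviour of the random-guess probability can be reproduced with a short sketch. It assumes that Eq. 4 amounts to one over the number of ways of choosing the AD active devices out of DiT, which is consistent with the 20 %, 10 %, 10 % and 20 % values quoted for 5 DiT; the function name is illustrative.

```python
from math import comb


def random_guess_probability(active: int, total: int) -> float:
    """Probability of guessing the exact set of active devices at random.

    Assumes every subset of `active` devices out of `total` is equally
    likely, so the order of the guessed devices does not matter.
    """
    if not 0 <= active <= total:
        raise ValueError("active must be between 0 and total")
    return 1 / comb(total, active)


# Curve for 5 DiT: prints 20%, 10%, 10%, 20% for 1 to 4 AD.
for ad in range(1, 5):
    print(f"{ad} AD of 5 DiT: {random_guess_probability(ad, 5):.0%}")

# The minimum of each curve lies near AD = DiT / 2.
worst_15 = min(random_guess_probability(ad, 15) for ad in range(1, 15))
print(f"worst case for 15 DiT: {worst_15:.4%}")
```

Because the binomial coefficient peaks when AD is about half of DiT, the probability is lowest there and rises symmetrically towards both ends of the curve.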

D. Results with RE Group Mixed Datasets
The results of comparing the performance of the CtRNN, VGG11 and RF models on the RE Group mixed datasets are displayed in Figure 8. Figures 8a-c show heatmaps of the results from the evaluation on REFIT, while Figures 8e-g show heatmaps of the results from the evaluation on UK-DALE. Figures 8d and 8h contain heatmaps filled with the probabilities of obtaining correct results randomly, calculated with Eq. 6.
We observe that the accuracy of classification decreases with the increase in DiT in all cases, which is consistent with our prior observations that performance is connected to the proportion between AD and DiT. The average performance degradation of our model is 9 percentage points per each 5 DiT added on the REFIT dataset and 4.3 percentage points on the UK-DALE dataset. The random probability of correct classification, calculated with Eq. 6, is much lower than the accuracy of the models. For example, on the REFIT dataset in the row with 15 DiT, our model achieves a score of 71 %, VGG11 achieves 58 % and RF achieves 63 %, whereas the random probability is rounded to 0 %.
We calculate the average improvement over the entire RE Group of mixed datasets using Eq. 5. On the REFIT derived dataset, our model reaches results that are 11.32 percentage points better than VGG11 and 9.22 percentage points better than RF. On the UK-DALE derived dataset, it is 8.07 percentage points better than VGG11 and 9.46 percentage points better than RF.
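The average improvement can be sketched as follows, reading Eq. 5 as the mean of the per-dataset score differences between the proposed model and a baseline. That reading and the scores below are assumptions made for illustration; the values are hypothetical and are not taken from Figure 6 or Figure 8.

```python
from typing import Sequence


def average_improvement(ours: Sequence[float], other: Sequence[float]) -> float:
    """Mean per-dataset difference between two models, in percentage points.

    Both sequences hold one score (in percent) per mixed dataset, in the
    same dataset order.
    """
    if len(ours) != len(other) or not ours:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(a - b for a, b in zip(ours, other)) / len(ours)


# HYPOTHETICAL scores (percent) on three mixed datasets, for illustration only.
ctrnn_scores = [91.0, 80.0, 71.0]
baseline_scores = [85.0, 70.0, 58.0]
improvement = average_improvement(ctrnn_scores, baseline_scores)
print(f"average improvement: {improvement:.2f} percentage points")
```

Averaging differences rather than comparing averaged scores keeps the pairing between models on the same mixed dataset explicit, although both give the same number when every dataset has equal weight.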

E. End Performance Comparison
Assuming that a quick technology selection for an application requiring ON/OFF classification needs to be made based only on the end performance, the type of ML and the number of devices, we summarize the required information in Table V. The first column of Table V lists the considered ON/OFF classification works, the second lists the type of ML approach (classical or deep), the third provides the specific method, the fourth gives the dataset and the result obtained when training with it, the fifth lists the type of sampling of the energy data, while the last lists the number of considered devices. From the results it can be seen that Wu et al. achieved the highest F1 score of 98 %; however, this was achieved on the HF dataset BLUED. When a signal is sampled at a higher frequency, higher-definition data is available, so its shape is easier to recognize than that of a signal sampled with less granularity. Compared to other similar models developed on LF data, our model reached scores of around 92.5 %, surpassing the others. TanoniCRNN ranked second on LF datasets, with scores of around 85 %, while Langevin et

F. Limitations and Future Work
The limitations of the study presented in this paper are twofold. First, we show that the approach is not robust to the increasing number of devices that characterizes modern households. Unlike prior works, we quantify the drop in performance, which we mostly attribute to the imbalanced nature of the available datasets, in which some devices occur more frequently than others. Second, the empirical design of the proposed architecture could be automated to find a superior architecture with respect to both performance and energy efficiency.
While there are already a number of works on ON/OFF classification that improve on prior ones, including this study, we see three main lines of future work as follows.
Benchmark dataset: Machine learning communities typically rely on benchmark datasets that are used in all model evaluations. A good direction for future work is to take the existing datasets that are suitable for ON/OFF classification and generate a harmonized set suitable for training. Besides the harmonized set, balanced versions could also be created, using statistical over-sampling or under-sampling methods such as ADASYN and AllKNN. A more general simulator than the one in [48] could also be developed to complement measured data.
Generic model development: Current ON/OFF classification models cannot simply be downloaded and used in a real application. While there is recent work on transferring models across households, such approaches have limitations, most notably a drop in performance [15]. For text, the GPT breakthrough has shown that an architecture can be trained on large amounts of unlabelled data to capture sufficient knowledge and then further trained on labelled data to better structure that knowledge. The same could be done for ON/OFF classification, where a model trained on large amounts of general time series data could be adapted to the domain at hand. The development of such a model could also significantly lower the adoption barrier of the technology.
Role in smart energy management: According to [5], households consume 12 % less energy if they receive specific feedback on the consumption of individual devices. Furthermore, increasingly automated energy management systems may rely on machine learning models to detect appliances and forecast usage. Quantifying the relationship between the performance of an ML model (e.g., F1 score, MSE), its energy consumption and the energy that its decisions help save is a worthy line of research. For instance, if on average the household consumption drops by 12 % [5] and the model on average recognizes only 94 % of the devices correctly, as in this study, what would be the impact on the trust and behaviour of the user? Also, misclassifying an air conditioning device that consumes significant energy may have a higher cost than misclassifying a water heater.

VI. Conclusions
In this paper, we propose a new DL architecture, CtRNN, used in our models, paying special attention to improving its energy efficiency during training and operation as well as its performance compared to the state of the art and other similar models. In developing the architecture, we used a typical VGG family architecture as a starting point, adapted it to time series data and reduced its computational complexity by reducing the number of convolutional layers in some blocks, replacing one block with a single transposed convolution layer and adding the GRU layer. We benchmarked the proposed new model against other similar models, showing that it is possible to develop new DL models for NILM ON/OFF classification that provide a major improvement in both performance and energy efficiency, which results in lower energy consumption.
We also proposed a new methodology with two tests that more realistically assess the performance of NILM ON/OFF classification algorithms. They use groups of multiple mixed datasets, derived from measurement datasets with the specificities of real-world use cases in mind. One group covers the numbers of active devices commonly used in the time window of the learning samples in separate datasets. The other group covers a mixed number of devices, from 1 to the average maximum number of devices used in the time window of the learning sample. Our findings demonstrate that the proposed methodology is necessary to obtain results that reflect more realistic situations. The obtained results indicate that the commonly used testing methodology can lead to overly optimistic conclusions, underscoring the importance of employing a more rigorous evaluation framework.
Moreover, as part of the performance evaluation we also compared DL approaches with the best classical ML approach according to related work, and we concluded that DL approaches have a much higher performance potential than classical ML. In our experiments, we observed on average an approximately 12 percentage point advantage of our model over the best classical ML approach.

Figure 1: The proposed evaluation methodology of groups SE and RE of mixed datasets compared to the commonly used evaluation methodology.

Figure 2: Classification of devices as active or inactive based on the household NILM data using classical ML or DL model to obtain   for each device; devices with   > 0.5 are classified as active, the others are classified as inactive.

Figure 3: Probability distribution of the number of active devices in a 6-hour time window in the REFIT and UK-DALE datasets.

Figure 4: The proposed CtRNN architecture inspired by the VGG family of architectures is explained within the figure, where "k" signifies the kernel, "f" represents the number of filters, and "s" denotes the stride value.

Figure 5: Energy used for making predictions with the proposed model in comparison to VGG11, TanoniCRNN and VAE-NILM.

Table I: Comparison of results from other related works.

Table II: Training parameters for CtRNN and VGG11 for SE Group mixed datasets derived from REFIT and UK-DALE (BS: batch size; LR: learning rate; E: number of epochs).

Table III: Energy used in training the proposed CtRNN model in comparison to VGG11, TanoniCRNN and VAE-NILM.
the-art TanoniCRNN in terms of energy consumption, with 23.3 % less energy consumed on a mixed dataset tested on five commonly used devices. In addition, compared to VGG11 on groups A and B, the proposed model demonstrates 29.7 % less energy consumed.

Table V: Summary of related work for multi-label classification on NILM.
To compile Table V, we take the related ON/OFF classification works summarized in Table I and analyzed in Section II, together with the final results from the respective papers, except for Tanoni in the fourth row of the table, where we present the results reported in this work. The exception for Tanoni is due to the fact that the original work [16] employs weak supervision, so its results are slightly worse than those achieved with supervised learning in all other works. For fairness to Tanoni, we re-ran their experiments in a fully supervised manner and report those results.
al. and Singh et al. had similar scores of approximately 70 %. The result reported by Tabatabei et al. ranked the lowest at 53 %.