Device and Time Invariant Features for Transferable Non-Intrusive Load Monitoring

Non-Intrusive Load Monitoring aims to extract the energy consumption of individual electrical appliances by disaggregating the total power consumption measured by a single smart meter in a household. When data from the same household are used to train a disaggregation model, the device disaggregation accuracy is quite high (80%-95%, depending on the number of devices); however, the use of pre-trained disaggregation models in new households in most cases results in a significant reduction of disaggregation accuracy. In this article we propose a transferability approach for Non-Intrusive Load Monitoring that uses fractional calculus and normalized Karhunen Loeve Expansion (KLE) based spectrograms followed by a Convolutional Neural Network in order to generate device characteristic features that do not change significantly across different households. The performance of the proposed methodology was evaluated using two publicly available datasets, namely REDD and REFIT. The proposed transferability approach improves the Mean Absolute Error by 13.1% when compared to other transfer learning approaches for energy disaggregation.


I. INTRODUCTION
Non-Intrusive Load Monitoring (NILM) aims to extract the power consumption of each appliance of a building or a household using only the aggregated consumption signal as input [1]. While NILM is intrinsically a source separation problem, three different approaches have been used to solve it. First, pattern matching (elastic matching) techniques, which detect device signatures in the aggregated power consumption signal, have been proposed [2]-[5]. Second, source separation methods, such as matrix and tensor factorization as well as sparse coding, have been utilized to separate base components and activations [6]-[9]. Third, machine learning and deep learning based models have been used to generate data-driven models that estimate the power consumption of devices from the aggregated signal [10]-[13].
The latest advances in machine learning and the development of big datasets have led to successful deep learning based NILM methodologies. NILM architectures using Convolutional Neural Networks (CNNs) [14], Long Short-Term Memory (LSTM) networks [15] and Recurrent Neural Networks (RNNs) [16] have been proposed in the literature. In detail, a causal CNN with an optimization based on gate dilation was presented in [17] and a concatenated CNN approach for high sampling frequencies was proposed in [18]. As regards LSTM, a bidirectional approach with optimization on the forward and backward path, as well as Bayesian hyper-parameter optimization, was presented in [12]. RNNs have been used in combination with convolutional layers in [19] and as deep RNNs in [16]. Additionally, the latest research has focused on Generative Adversarial Networks (GANs) [20], [21] and bidirectional Transformers to incorporate self-attention mechanisms and to further improve the performance of the disaggregation algorithm [22]. However, the above approaches have only been evaluated on a single dataset, i.e. with training and testing data split from the same dataset (non-transferability setups). As the ground-truth appliance signals are expensive to obtain [23], the transferability capability of a NILM architecture is crucial for it to be implemented efficiently and cost-effectively in an actual smart meter. Therefore, more recent evaluations have considered the transferability capability of the NILM architecture. However, only a few papers provide distinct discussions on the specific case of transfer learning in the context of NILM [23]-[27] based on the usage of real data.
VOLUME 9, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In detail, in [27] an approach based on voltage/current-trajectories with dedicated feature colouring and usage of image-processing deep learning models is presented. However, the approach is only evaluated on appliance signals and not on the aggregated signal. In [25] a comparison of Gated Recurrent Unit (GRU) and CNN is performed utilizing a two-branch model layout, reporting similar performances for GRU and CNN with an advantage in terms of complexity for GRUs. However, [25] reports performances only for the transferability setup, thus the performance decrease compared to a non-transferability setup cannot be evaluated. In [23], [26] (with [23] being an extension of [26]) explicit discussions on feature invariance for NILM are provided and a sequence-to-point (s2p) architecture is proposed in order to efficiently train a CNN model for transfer learning in NILM, reporting performances and comparisons for transferability and non-transferability setups.
In this work the idea of having invariant features for the same appliances from different data, as initially discussed in [23], is extended by utilizing a time-frequency representation based on fractional KLE features with additional post-processing. The contribution of this paper is threefold. First, a definition of feature invariance in the context of NILM is presented, with consideration of the physical nature of the appliances. Second, a methodology for transfer learning in NILM based on invariant features is presented. Third, a discussion of the most relevant topics for transfer learning in NILM is provided, namely amount of training data, impact of normalization and algorithm convergence.
The remainder of this paper is organized as follows: In Section II an introduction to transferability for NILM is provided. In Section III the proposed method is presented. In Section IV the experimental setup is described and in Section V the results are presented. Finally, a discussion is provided in Section VI and the article is concluded in Section VII.

II. TRANSFERABILITY FOR NILM
As transferability approaches are a relatively recent direction within the area of NILM, only a few approaches have been discussed in the literature [23]-[27]. Specifically, most of the proposed approaches investigate previously published architectures in terms of their transferability capability and evaluate their performance on cross domain learning [23]. However, in order to achieve high accuracies for transfer NILM systems, the architecture and input feature vectors must be specifically optimized for the NILM problem, in order to enable accurate cross domain learning. The qualitative description of such an architecture is presented below.
Let's consider two different devices, namely a fridge (FIGURE 1a) and a washing machine (FIGURE 1b), each from two different manufacturers (e.g. Bosch, Siemens, Samsung, etc.). First, considering the time domain signal for both fridges (FIGURE 1a (i)/(ii)), it can be seen that their power consumption values are different even though they operate in the same state. In detail, fridge one consumes 75 W (FIGURE 1a (i)) in steady-state while fridge two consumes 100 W (FIGURE 1a (ii)), thus a difference in scaling along the y-direction is observed. Similar observations can be made for the washing machine (FIGURE 1b (i)/(ii)). Second, there are possible shifts along the time axis, e.g. on/off transitions of the fridge or the washing machine might not be time aligned (FIGURE 1a (i)/(ii)). Third, the state probabilities are very different for the same device for each brand, e.g. fridge one has by far longer off durations than fridge two (FIGURE 1a (v)/(vi)). Based on the above, the following three aspects must be considered for an accurate modelling of transfer learning in NILM:
1) Different scaling in y-direction, caused by different power consumption values of the same device operating in the same state but coming from a different manufacturer.
2) Time shifts along x-direction, caused by different temporal patterns in different households.
3) Different state probabilities, caused by different utilization of the same device in different households.
To account for these three aspects, the following three approaches are proposed in order to efficiently model the differences between appliances of the same type from different manufacturers.

A. SCALING POWER CONSUMPTION
First, let's assume that similar devices from different manufacturers are based on very similar electrical circuits. This assumption is reasonable as most devices, e.g. fridges or washing machines, have the same electrical components, e.g. a single-phase electrical motor in case of a fridge, and these components only vary in size, e.g. according to the volume of the fridge or washing machine. From power electronics theory it is known that the output waveforms in the frequency domain only depend on the electrical architecture and scale with the fundamental component of the current [29]. Therefore, in order to accurately capture different scaling along the y-direction, the appliances' power consumption should be transferred into the frequency domain and then be normalized to its fundamental component. The effect of normalization in the frequency domain can be seen in FIGURE 1a (iii)/(iv) and FIGURE 1b (iii)/(iv) for the fridge and washing machine respectively, illustrating that the harmonics of two different brands of the same device are much closer after normalization. In this paper a normalized KLE representation will be used to calculate frequency transforms, since it works well especially for low-frequency signals [11], [30].
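The scaling argument can be illustrated with a minimal sketch. Note the assumptions: it uses a plain discrete Fourier transform instead of the KLE used in this paper, the two synthetic waveforms (amplitudes 75 and 100, one fundamental plus a weaker third harmonic) are invented for illustration, and the helper names are hypothetical. The fundamental is simply taken to be the largest non-DC bin.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitudes of the first len(x)//2 DFT bins (bin 0 is the DC term)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2)]

def normalize_to_fundamental(mags):
    """Divide every bin by the largest non-DC bin, taken here as the fundamental."""
    fundamental = max(mags[1:])
    return [m / fundamental for m in mags]

def toy_device(scale, n=64):
    """Hypothetical waveform: fundamental at bin 2 plus a weaker third harmonic at bin 6."""
    return [scale * (math.sin(2 * math.pi * 2 * t / n)
                     + 0.3 * math.sin(2 * math.pi * 6 * t / n)) for t in range(n)]

# Same circuit topology, different size: only the overall scale differs.
spec_a = normalize_to_fundamental(dft_magnitudes(toy_device(75.0)))
spec_b = normalize_to_fundamental(dft_magnitudes(toy_device(100.0)))

max_gap = max(abs(a - b) for a, b in zip(spec_a, spec_b))
print(f"largest difference after normalization: {max_gap:.2e}")
```

Under these assumptions the two normalized spectra coincide up to rounding, because a pure amplitude scaling cancels when dividing by the fundamental. Real appliances differ in more than scale, so the spectra would only be expected to move closer together, not to match.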

B. TIME-SHIFTS
Second, time-shifts along the x-direction should be accounted for by incorporating temporal information in the architecture. Several different approaches have been proposed in the literature to incorporate temporal information, including LSTM architectures [12], temporal concatenation [31], gate dilated CNNs [17], as well as fractional calculus [32]. In this implementation, fractional calculus as proposed in [32] will be utilized to account for temporal shifts, as it has been shown to work well with CNN architectures [32].

C. STATE PROBABILITIES
Third, as discussed before, a similar device from a different manufacturer might show different state probabilities. This is illustrated in FIGURE 1a (v) and FIGURE 1b (v) for the fridges and washing machines respectively. However, these differences are mostly caused by the user, who defines the ratio of on/off states, e.g. how often a washing machine is used per week. Conversely, once a device is started, the internal active states only depend on the device itself, e.g. a washing machine runs through a cycle of rinse, wash and spin [10]. Therefore, state probabilities should only consider active states, as they are device dependent and not user dependent. An example of the effect of not considering inactive states is illustrated in FIGURE 1a (v) and FIGURE 1b (v), showing that the state probabilities are much closer when inactive states are not considered. In order to account for this behavior, the post-processing of the proposed architecture utilizes active device states only.
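The effect of discarding inactive states can be sketched as follows. The two state sequences are invented for illustration (they are not taken from REDD or REFIT), and the function name is hypothetical; the point is only that usage frequency changes the share of the 'off' state while the distribution within the active cycle stays the same.

```python
from collections import Counter

def state_probabilities(states, ignore=()):
    """Relative frequency of each state label, optionally skipping some labels."""
    kept = [s for s in states if s not in ignore]
    counts = Counter(kept)
    return {state: count / len(kept) for state, count in counts.items()}

# Hypothetical washing machine logs: household A rarely uses it, household B often.
house_a = ["off"] * 90 + ["wash"] * 6 + ["rinse"] * 2 + ["spin"] * 2
house_b = ["off"] * 50 + ["wash"] * 30 + ["rinse"] * 10 + ["spin"] * 10

all_a = state_probabilities(house_a)
all_b = state_probabilities(house_b)
active_a = state_probabilities(house_a, ignore=("off",))
active_b = state_probabilities(house_b, ignore=("off",))

print("with 'off':   ", all_a, all_b)
print("active only:  ", active_a, active_b)
```

In this toy case the full distributions differ strongly between the two households, whereas the active-only distributions are identical, which is the user-independent behaviour the post-processing relies on.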

III. PROPOSED ARCHITECTURE
Let's consider a set of $M-1$ known devices, each consuming power $p_m$ with $1 \le m \le M$. The aggregated power $p_{agg}$ measured by the sensor will be:
$$p_{agg} = f(p_1, \dots, p_{M-1}, e) = \sum_{m=1}^{M-1} p_m + e = \sum_{m=1}^{M} p_m \tag{1}$$
where $e = p_M$ is noise generated by one or more unknown devices and $f(\cdot)$ is the aggregation function. In NILM the goal is to find precise estimations $\hat{p}_m$ of the power consumption of each device $m$ using an estimation method $f^{-1}(\cdot)$, i.e.,
$$\hat{P} = \{\hat{p}_1, \dots, \hat{p}_{M-1}, \hat{e}\} = f^{-1}(p_{agg}) \tag{2}$$
Based on the discussion in Section II, the proposed architecture includes fractional calculus features to account for time shifts, normalized KLE features to account for scaling in y-direction and state correction considering only active states in the post-processing. Therefore, Eq. 2 can be reformulated in terms of feature vectors describing the frequency content in terms of magnitudes $A$ and phase angles $\Phi$:
$$\hat{P} = f^{-1}(A, \Phi) \tag{3}$$
In detail, the architecture illustrated in FIGURE 2 consists of framing, calculation of fractional power values, frequency transformation using KLE with corresponding normalization, CNN regression for each target device $m$ to estimate the corresponding power consumption $\hat{p}_m$, and post-processing using state correction. A detailed mathematical description of each stage is given below.

A. FRACTIONAL CALCULUS
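Considering the aggregated signal $p_{agg}(t)\ \forall t : t \in \{1, \dots, T\}$, the extension of the derivative to a non-integer order $\alpha$ is given by the fundamental definition
$$D^{\alpha} p_{agg}(t) = \lim_{h \to 0} \frac{1}{h^{\alpha}} \sum_{j=0}^{\lfloor k \rfloor} (-1)^{j} \binom{\alpha}{j}\, p_{agg}(t - jh) \tag{4}$$
where $k = \frac{t - t_0}{h}$ and $\lfloor k \rfloor$ is the integer part of $k$, $h$ is the step width and $\binom{\alpha}{j}$ are the binomial coefficients defined by the factorial expansion of the Gamma function $\Gamma(x)$, i.e.
$$\binom{\alpha}{j} = \frac{\Gamma(\alpha + 1)}{\Gamma(j + 1)\,\Gamma(\alpha - j + 1)} \tag{5}$$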
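A finite-step approximation of such a fractional derivative can be sketched as below. This is an illustration under assumptions, not the implementation used in the paper: it assumes the Grünwald-Letnikov form (a sum of past samples weighted by signed binomial coefficients), keeps a fixed step width `h` instead of taking the limit, uses the whole signal history as memory, and computes the weights with the standard recursion $w_j = w_{j-1}\,(j - 1 - \alpha)/j$ rather than evaluating Gamma functions. All function names are hypothetical.

```python
def gl_weights(alpha, n_terms):
    """Signed binomial weights (-1)^j * C(alpha, j), built with the standard recursion."""
    weights = [1.0]
    for j in range(1, n_terms):
        weights.append(weights[-1] * (j - 1 - alpha) / j)
    return weights

def gl_fractional_derivative(signal, alpha, h=1.0):
    """Finite-step approximation of D^alpha using all samples up to the current one."""
    weights = gl_weights(alpha, len(signal))
    out = []
    for t in range(len(signal)):
        acc = sum(weights[j] * signal[t - j] for j in range(t + 1))
        out.append(acc / h ** alpha)
    return out

# Toy aggregated power values (assumed, not from a dataset).
p_agg = [0.0, 1.0, 4.0, 9.0, 16.0]
identity = gl_fractional_derivative(p_agg, 0.0)    # order 0 returns the signal itself
first_diff = gl_fractional_derivative(p_agg, 1.0)  # order 1 reduces to a backward difference
half = gl_fractional_derivative(p_agg, 0.5)        # non-integer order mixes all past samples
print(first_diff, half)
```

For integer orders the weights vanish after a few terms, so the operator reduces to the familiar finite differences; for non-integer orders every past sample contributes with a slowly decaying weight, which is what carries temporal context into each output value.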

B. NORMALIZED KLE
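Considering $K$ fractional components $\alpha_k$ with $k \in \{1, \dots, K\}$, the fractional power signal can be written as $D^{\alpha_k} p_{agg}$. To transform each fractional signal $D^{\alpha_k} p_{agg}$ to the frequency domain, the KLE transform was used, similar to [35]. Therefore, let $P_\alpha$ denote one frame $\tau$ of the fractional signal $D^{\alpha_k} p_{agg}$ with frame length $L$. Specifically, let $\tilde{N}$ (with $\tilde{N} < L$) be the order of the auto-correlation matrix (ACM) used to separate each frame of the fractional signal $P_\alpha$ into its Subspace Components (SCs). The ACM $\mathbf{R}_{PP}$ of the signal $P_\alpha$ can be written as in [35]:
$$\mathbf{R}_{PP} = \begin{bmatrix} R_{PP}(0) & R_{PP}(1) & \cdots & R_{PP}(\tilde{N}-1) \\ R_{PP}(1) & R_{PP}(0) & \cdots & R_{PP}(\tilde{N}-2) \\ \vdots & \vdots & \ddots & \vdots \\ R_{PP}(\tilde{N}-1) & R_{PP}(\tilde{N}-2) & \cdots & R_{PP}(0) \end{bmatrix} \tag{6}$$
where $R_{PP}(n)$ with $0 \le n \le \tilde{N}-1$ is the auto-correlation function of the signal $P_\alpha$ and $n$ is an integer indicating the sample. Through eigenvector decomposition, $\mathbf{R}_{PP}$ can then be decomposed into $\tilde{N}$ mutually orthonormal eigenvectors $q_i$, which form the columns of the matrix $Q = [q_1, \dots, q_{\tilde{N}}]$. The KLE transform and its inverse can be written as in Eq. 7 and Eq. 8 for each fractional component $\alpha$.
$$\tilde{P}_\alpha = Q^{T} P_\alpha \tag{7}$$
$$P_\alpha = Q\, \tilde{P}_\alpha \tag{8}$$
where $\tilde{P}_\alpha \in \mathbb{R}^{\tilde{N}}$ is the KLE-transformed signal of $P_\alpha$ and the uncorrelated SCs of $\tilde{P}_\alpha$ are defined as $p_i = q_i^{T} P_\alpha\, q_i$, where $p_i$ can be approximated by the coefficients of an FIR filter [36]. A sinusoidal shape is assumed for each SC [35], thus $\tilde{P}_\alpha$ can be written in terms of magnitudes $A_\alpha \in \mathbb{R}^{\tilde{N}}$ and phase angles $\Phi_\alpha \in \mathbb{R}^{\tilde{N}}$. Furthermore, for each fractional component $\alpha$ a KLE transform was calculated, resulting in a time-frequency representation of $K$ time-slices and $\tilde{N}$ frequency components, i.e. $A \in \mathbb{R}^{\tilde{N} \times K}$ and $\Phi \in \mathbb{R}^{\tilde{N} \times K}$. Batch normalization was applied to the KLE spectrum magnitude and phase as discussed in Section II.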
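The eigen-decomposition step behind the KLE can be sketched with NumPy as follows. Several choices here are assumptions made for illustration only: the auto-correlation is estimated with the biased sample estimator, the transform is applied to a segment whose length equals the ACM order, the test frame is a synthetic noisy sinusoid, and the function names are hypothetical. The extraction of magnitudes and phase angles from the subspace components is not shown.

```python
import numpy as np

def autocorrelation_matrix(frame, order):
    """Toeplitz auto-correlation matrix built from biased auto-correlation estimates."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.array([np.dot(frame[: n - lag], frame[lag:]) / n for lag in range(order)])
    # Entry (i, j) of the matrix is the auto-correlation at lag |i - j|.
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return r[idx]

def kle_basis(frame, order):
    """Orthonormal eigenvectors of the ACM, sorted by decreasing eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(autocorrelation_matrix(frame, order))
    return eigvecs[:, np.argsort(eigvals)[::-1]]

rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * np.arange(256) / 32) + 0.1 * rng.standard_normal(256)
order = 16
Q = kle_basis(frame, order)

segment = frame[:order]   # one length-`order` vector taken from the frame (assumption)
coeffs = Q.T @ segment    # forward transform onto the eigenvector basis
restored = Q @ coeffs     # inverse transform recovers the segment
print(np.allclose(restored, segment))
```

Because the ACM is symmetric, `numpy.linalg.eigh` returns an orthonormal basis, so the forward and inverse transforms are exact inverses of each other; sorting by eigenvalue places the components that carry most of the signal energy first.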

C. POST-PROCESSING
To consider that the same appliance type can have different state probabilities, which might depend on external parameters, e.g. user behavior, only the 'on' states of the appliance estimations are post-processed. An appliance is considered to be 'on' if the estimation of its active power consumption $\hat{p}_m$ is above a threshold $\theta$. To determine the active device states, fuzzy c-means clustering was used, similar to [37].
If the initial prediction of the regression model is too far from any cluster center of the c-means algorithm, i.e.:
$$\min_{1 \le n \le N} \left| \hat{p} - s_m^{n} \right| > \epsilon \tag{9}$$
where $\hat{p}$ is the initial estimation of the regression model, $\epsilon$ is an appliance specific error margin and $s_m^{n}$ is the cluster center of the $n$-th state of the $m$-th appliance calculated by the fuzzy c-means, the estimation was updated as follows:
$$\hat{p}_m = s_m^{n_{min}} \tag{10}$$
where $s_m^{n_{min}}$ is the state of the $m$-th appliance fulfilling the minimum condition in Eq. 9. In Eq. 10 only active device states are post-processed according to the discussions in Section II.
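One possible reading of this state correction is sketched below. It is an interpretation under assumptions, not the authors' code: an estimate is left untouched if the appliance is considered 'off' (estimate not above the threshold) or if it lies within the error margin of its nearest cluster center, and is otherwise replaced by that nearest center. The cluster centers are hard-coded instead of being obtained by fuzzy c-means, and the threshold, margin and function name are hypothetical.

```python
def correct_state(p_hat, centers, theta, eps):
    """Snap an active-state estimate to the nearest cluster center if it is off by more than eps."""
    if p_hat <= theta:
        # Appliance considered 'off': inactive states are not post-processed.
        return p_hat
    nearest = min(centers, key=lambda s: abs(p_hat - s))
    return nearest if abs(p_hat - nearest) > eps else p_hat

# Assumed active-state centers of one appliance in watts (illustrative values only).
centers = [80.0, 400.0, 2000.0]
theta, eps = 10.0, 30.0

estimates = [5.0, 90.0, 150.0, 1500.0]
corrected = [correct_state(p, centers, theta, eps) for p in estimates]
print(corrected)
```

With these assumed values the 'off' estimate and the estimate close to a center stay unchanged, while the two estimates that lie far from every center are moved to their nearest center.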

IV. EXPERIMENTAL SETUP
The NILM architecture based on fractional calculus and KLE as described in Section III was evaluated using the datasets and regression algorithm presented below.

A. DATASETS
The proposed architecture was evaluated using two different datasets, namely REDD and REFIT [38], [39]. These datasets were chosen for two reasons. First, REDD was chosen as it is the most commonly used dataset in the energy disaggregation task; the proposed architecture is therefore evaluated on REDD to show its performance with a classical (non-transferability) approach and to compare it with the existing literature. Second, the REFIT dataset was chosen as it is ideal for training a transferability approach, since it contains 20 different houses with similar appliances but from different suppliers (e.g. fridges from Samsung, Bosch, Beko, etc.). A short description of the datasets can be found in Table 1.
Since mostly five appliances are considered for disaggregation in the previously published literature [10], namely kettle (KT), microwave (MW), dish washer (DW), fridge (FR) and washing machine (WM), the study was limited to these five appliances. Furthermore, the data split of the transferability setup was based on these five appliances, so that all of them appear in the training, validation and testing data. The splits are tabulated in Table 2.
It must be noted that the data was not further modified, e.g. larger gaps in the data were not removed, in order to provide a realistic scenario for the transferability setup. Furthermore, the data was normalized using mean-std normalization with the same values as in [23].

B. CNN-STRUCTURE AND MODEL PARAMETRIZATION
For the regression stage a two-dimensional CNN was used, similar to [23]. As in [23], one model per device was trained using ReLU activations for all intermediate layers and a linear activation in the last layer. Moreover, the one-dimensional kernels were replaced by two-dimensional ones to account for the two-dimensional inputs (time-frequency representations) of the proposed method. Therefore, the notation of the kernel-size 'x' refers to the two-dimensional kernel (x,x). The free parameters are shown in Table 3.
The number of fractional signals $K$ and the number of subspace components $\tilde{N}$, which define the input size, were optimized using grid search on a bootstrap dataset for both the conventional and the transferability setup. The results for the conventional setup are tabulated in Table 4.
As can be seen in Table 4, the optimized parameters were found to be $K = 8$ and $\tilde{N} = 64$. Conversely, for the transferability setup the optimal frame length, and thus the number of SCs, was found to be $\tilde{N} = 256$. This is in line with the work in [23], indicating that transferability approaches need a large frame length of input samples to account for the local differences in appliance signals. TensorFlow was used to train the models. In detail, Keras was used with the Adam optimizer for the training of each model. The hyper-parameters of the model, as well as the parametrization of the Adam optimizer, are tabulated in Table 5.

C. EXPERIMENTAL PROTOCOLS
A total of six experimental protocols were designed: one using raw active power samples, similar to [23], serving as a baseline protocol, and five additional ones following the discussions in Section II and Section III. In detail, the second and third protocol use the magnitudes and the phases after the fractional KLE transformation respectively. The fourth protocol uses both magnitudes and phases of the fractional KLE, while protocol five additionally uses the raw samples after the fractional calculation. Protocol six applies additional post-processing as discussed in Section III-C. The six protocols, including their dimensionality, are tabulated in Table 6.

V. EXPERIMENTAL RESULTS
The architecture presented in Section III was evaluated according to the experimental setup described in Section IV. In order to provide an accurate comparison with non-transferability setups, the performance was evaluated in terms of estimation accuracy ($E_{ACC}$), as proposed in [38].
$$E_{ACC} = 1 - \frac{\sum_{t=1}^{T} \sum_{m=1}^{M} \left| \hat{p}_m(t) - p_m(t) \right|}{2 \sum_{t=1}^{T} \sum_{m=1}^{M} \left| p_m(t) \right|} \tag{11}$$
where $\hat{p}_m$ is the estimated power, $T$ is the number of disaggregated frames and $M$ is the number of disaggregated devices. Furthermore, to compare with transferability approaches previously published in the literature, additional accuracy metrics, namely Mean Absolute Error (MAE) and normalized Signal Aggregated Error (SAE), are introduced.
$$MAE = \frac{1}{T} \sum_{t=1}^{T} \left| \hat{p}_m(t) - p_m(t) \right| \tag{12}$$
$$SAE = \frac{\left| \hat{E}_m - E_m \right|}{E_m} \tag{13}$$
where $E_m$ denotes the total energy consumption of the $m$-th appliance, i.e. $E_m = \sum_{t}^{T} p_m(t) \cdot T_s$ with $T_s$ being the sampling period, and $\hat{E}_m$ denotes its estimated value. The results are presented both for the conventional disaggregation setup, without consideration of transferability, in Section V-A and for the transferability setup in Section V-B.
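The three metrics can be sketched in a few lines. The definitions assumed here are the commonly used ones: estimation accuracy as one minus the summed absolute error divided by twice the summed true power, MAE as the mean absolute per-frame error of one appliance, and SAE as the absolute energy difference relative to the true energy. The function names and the toy power values are hypothetical.

```python
def estimation_accuracy(truth, estimate):
    """E_ACC over all devices and frames; inputs are lists (devices) of lists (frames)."""
    abs_err = sum(abs(e - p)
                  for dev_p, dev_e in zip(truth, estimate)
                  for p, e in zip(dev_p, dev_e))
    total = sum(abs(p) for dev_p in truth for p in dev_p)
    return 1.0 - abs_err / (2.0 * total)

def mean_absolute_error(truth, estimate):
    """Average absolute per-frame error of one appliance, in the unit of the input."""
    return sum(abs(e - p) for p, e in zip(truth, estimate)) / len(truth)

def signal_aggregate_error(truth, estimate, ts=1.0):
    """Relative error of the total energy of one appliance (ts is the sampling period)."""
    e_true = sum(truth) * ts
    e_est = sum(estimate) * ts
    return abs(e_est - e_true) / e_true

# Toy example with two appliances and four frames (values in watts, invented).
truth = [[100.0, 100.0, 0.0, 0.0], [0.0, 50.0, 50.0, 0.0]]
estimate = [[90.0, 110.0, 0.0, 10.0], [0.0, 40.0, 60.0, 0.0]]

e_acc = estimation_accuracy(truth, estimate)
mae_0 = mean_absolute_error(truth[0], estimate[0])
sae_0 = signal_aggregate_error(truth[0], estimate[0])
sae_1 = signal_aggregate_error(truth[1], estimate[1])
print(e_acc, mae_0, sae_0, sae_1)
```

The second appliance shows why MAE and SAE are reported together: its per-frame errors cancel in the energy total, so its SAE is zero even though its MAE is not.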

A. CONVENTIONAL SETUP
For the conventional setup, houses 1-4 and 6 of the REDD dataset have been used. Specifically, house 5 has been removed due to its significantly shorter monitoring duration [40]. For each house the data has been split in half, with the first half being used for training and the second half for testing. The results are tabulated in Table 7.
As can be seen in Table 7, the performance increases along the different protocols, starting from 86.2% for the baseline protocol, which uses only a frame of active power samples, and reaching 89.8% when utilizing all features. An exception is protocol #3, where only phase angles are used; this protocol shows significantly lower performance, which is in line with the work in [41] reporting low accuracies for NILM setups using phase angles. Additional post-processing further increases performance by 0.3%, reaching an average performance of 90.1%. In order to compare the proposed architecture with the literature on non-transferability approaches, the work is compared in Table 8 with the three best performing approaches using house 2 of the REDD dataset and the estimation accuracy as performance measure.
As can be seen in Table 8, the proposed approach outperforms all other approaches except for the Sparse HMM proposed in [10].

B. TRANSFERABILITY SETUP
In a further step, the proposed transferability setup was evaluated according to the data splits tabulated in Table 2. The results on device level are presented for REDD and REFIT in Table 9 and Table 10 respectively. For better comparability with other transferability approaches, the results are presented in terms of MAE instead of $E_{ACC}$. As can be seen in Table 9 and Table 10, the MAE is reduced along the experimental protocols, with the exception of protocol #3, similar to the conventional disaggregation setup. In detail, the average MAE is reduced from 16.8 to 12.8 for the REDD database, while a reduction from 40.7 to 26.0 is observed for REFIT. The most significant reductions of MAE are observed for the DW and the FR in the REDD dataset (55.4% and 30.2%) and for the DW and the WM in the REFIT dataset (48.6% and 39.2%). Moreover, to assure exact comparison with the previously published literature, the following results are recalculated using the data splits from [26] for REDD and [23] for REFIT. To assure fair comparison, protocol #5 is used and post-processing (state correction) is omitted, as neither [26] nor [23] uses knowledge-based post-processing after the regression stage. The results are tabulated in Table 11.
As can be seen in Table 11, the proposed approach outperforms the approaches from [23] on average, reducing the MAE and SAE values by 3.4 and 0.11 for REDD and by 1.46 and 0.39 for REFIT respectively. These reductions are equal to 13.1% and 40.7% for REDD and 10.6% and 55.7% for REFIT. Again, the most significant performance improvements can be found for the FR, WM and DW.
It must be noted that there are three instances where the proposed approach only reaches roughly equal performance for one of the performance measures, namely for the MW and DW in the REDD database and for the MW in the REFIT database. In detail, for the MW in the REDD database the MAE is improved (+2.73), while the SAE is slightly worse (−0.03). This indicates that the proposed approach assigns less energy in total (worse SAE), but with a higher accuracy (better MAE), thus having a better false positive rate compared to [23]. A similar observation can be made for the MW setup of the REFIT database, showing an improvement of MAE (+1.92) and a slightly worse SAE (−0.02). Conversely, for the DW in the REDD database the MAE values are almost equal with a significantly better SAE value for the proposed approach, indicating that the approach in [26] has a higher false-negative rate.

VI. DISCUSSION
Further to the experimental results discussed in Section V, three topics that are crucial for transferability approaches are discussed. In detail, the impact of training data is discussed in Section VI-A, the impact of normalization is discussed in Section VI-B, and the convergence and real-time capability are discussed in Section VI-C.

A. IMPACT OF TRAINING DATA
With transferability approaches the impact of training data on the performance of the model is crucial, since a transfer model must capture the appliance signatures from a different data domain [23]. In order to investigate the effect of different amounts of training data, the MAE for different amounts of data is displayed for the proposed protocol #5 as well as for the approach in [23], for a conventional and a transferability setup respectively. In detail, the data from the REDD dataset was used, utilizing REDD-1 for testing and REDD-2,3,4,5,6 for training in case of the transferability approach, while REDD-1 was split into a training and a testing set for the conventional approach. The results are illustrated in FIGURE 3.
As illustrated in FIGURE 3 the proposed approach outperforms the reference for both the conventional and the transfer approach, showing a total reduction of MAE loss equal to 0.02 (13.3%) and 0.03 (9.9%) respectively at the end of the training. In detail, the proposed approach shows an almost constant improvement for the non-transferability approach, while the advantage of the transfer approach is most significant above 700 k data-samples. Moreover, it can be seen that the MAE loss decreases smoothly for the conventional approach, while for the transfer approach both protocol #5 and [23] show an increase of MAE between 700-900 k data samples. This is probably due to using house five of the REDD dataset for training, which is normally excluded due to large gaps and especially short monitoring duration [40], leading to an increase of the MAE scores. However, it can also be seen that the proposed protocol #5 has a significantly smaller increase in MAE than [23], thus being more robust against the negative impact of using REDD-5 during the training process.
Furthermore, not only the amount of data samples has an influence on the accuracy and the convergence of NILM models, but also the sampling rate [44]. Specifically, as shown in [44], the optimal sampling rate varies for each device that is being disaggregated. The proposed model utilizes data with sampling periods of 3 sec and 6 sec using the KLE low-frequency features. Good performance (85%-95%) has also been shown in [14], [17] using two-dimensional signatures for sampling rates of one sample per minute. Energy disaggregation has also been evaluated on an hourly level [10]. While it was shown that excessive down-sampling does lower the performance of NILM approaches [44], good performances can be obtained for sampling rates as low as one sample per minute. Even lower sampling rates (below one sample per minute) have been evaluated in [45], but with focus on retrieving the average energy consumption over time rather than the instantaneous power consumption.

B. IMPACT OF NORMALIZATION
As indicated in Section II, as well as in previous works on transferability in NILM [23], data normalization is crucial, as appliance signatures must be invariant even though different data readings might have a different scaling (e.g. two different fridges might have a different power consumption for their on state as discussed in Section II). For the normalization to be effective, the normalization values must be constant parameters for all samples, but the normalization method might vary. Three different normalization methods will be discussed and compared to the baseline system without normalization applied. First, Min/Max normalization ensures that all training samples are within the range [0, 1], with the same scaling values being used for validation and test data. Second, mean-variance scaling will be applied as in Eq. 14.
$$x' = \frac{x - \bar{x}}{\sigma} \tag{14}$$
where $\bar{x}$ is the mean value of $x$ and $\sigma$ is the standard deviation (std) of $x$. Third, batch normalization will be used in addition to mean-std scaling for each layer of the CNN model, as a normalization for each batch might be useful especially when using frequency domain features as discussed in Section II. The results are tabulated in Table 12.
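The requirement that the normalization values are constant parameters can be sketched as follows: the statistics are computed once on the training data and then reused unchanged for validation and test data. The helper names and the sample values are hypothetical, and the population standard deviation is an assumed choice; batch normalization inside the CNN is not shown.

```python
import statistics

def fit_min_max(train):
    """Return a scaler that maps the training range to [0, 1] using fixed train statistics."""
    lo, hi = min(train), max(train)
    return lambda x: [(v - lo) / (hi - lo) for v in x]

def fit_mean_std(train):
    """Return a scaler that applies (x - mean) / std with mean and std taken from train."""
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)
    return lambda x: [(v - mean) / std for v in x]

# Invented power readings in watts.
train = [0.0, 50.0, 100.0, 150.0, 200.0]
test = [250.0]

min_max = fit_min_max(train)
mean_std = fit_mean_std(train)

train_mm = min_max(train)   # exactly spans [0, 1]
test_mm = min_max(test)     # may leave [0, 1], since train statistics are reused
train_ms = mean_std(train)  # zero mean and unit std on the training data
print(train_mm, test_mm, train_ms)
```

The test value falls outside the [0, 1] range in this example, which is expected behaviour: rescaling each dataset with its own statistics would hide exactly the scale differences between households that the model has to cope with.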
As tabulated in Table 12, the setup without normalization reports the worst performance across all appliances. This is in line with the theoretical discussion in Section II, which states the need for a constant scaling factor to level differences in power consumption between different appliances, and with the observations in [23]. Conversely, Min/Max scaling as well as Mean/Std scaling improve the performance, reporting average MAE values of 28.2 and 21.5 respectively. Most significantly, additional batch normalization further improves the performance, which is probably due to the equalization of the frequency patterns of the KLE as discussed in Section II.

C. CONVERGENCE AND REALTIME CAPABILITY
For sequence-to-point approaches with high dimensionality of the input data, algorithm convergence and real-time capability are crucial. Therefore, the convergence behaviour for a fixed data size, as well as the execution times per sample, have been calculated. The convergence of protocol #5 of the proposed method is compared with the convergence of [23], and the results are illustrated in FIGURE 4. As illustrated in FIGURE 4, both approaches converge within the first 10 epochs, but with a significantly faster decay for the proposed approach. This is probably due to the larger input size and the two-dimensional time-frequency representation providing more distinct features to the CNN model. A similar observation has been made in [17], where the usage of higher feature dimensions leads to faster convergence of the CNN model.
The execution time per sample for the 5 protocols (protocol #6 has been omitted as post-processing does not show measurable differences in execution time) has been calculated on an Intel i7 7700k CPU with 64 GB RAM using two Nvidia GTX 1080Ti in SLI mode. The Average Execution Time (AET) per sample, when using GPU calculations, is compared to the approach of [23], which was recalculated using the same hardware. The results are shown in Table 13.
As tabulated in Table 13, the AET increases with higher feature dimensionality, but all protocols report AETs well below real-time (3 ms to 60 ms for disaggregating 1 sec of the aggregated data). Specifically, the proposed protocol #5 is roughly ten times slower compared to the approach of [23], which is due to the higher feature dimensionality and especially the two-dimensional kernel of the CNN. However, as illustrated in FIGURE 4, the convergence is faster by approximately a factor of five, thus the effective difference of AET is in the order of a factor of two.

VII. CONCLUSION
In this article a low-frequency approach for transfer learning in NILM has been proposed. Specifically, the solution is based on a low-frequency feature description utilizing fractional calculus and the Karhunen Loeve Expansion in order to capture device and time invariant signatures and to accurately disaggregate NILM signals from different datasets. The proposed methodology was evaluated on the REDD and REFIT datasets, showing a maximum performance improvement of 55.4% when compared to the baseline architecture and of 13.1% when compared to the best performing approach from the literature. Detailed analyses of the device performances, as well as of the influences of data size, convergence and normalization, have been presented. Based on the results, the following three points should be investigated: First, due to NILM being a highly ill-posed problem, transfer learning greatly benefits from approaches that are physically related to the problem statement, i.e. accurate descriptions of the current harmonics as in the proposed approach. Second, due to different datasets having different sizes and numbers of appliances, the influence of transfer learning on completely unknown appliances has to be investigated. Third, as sampling frequencies vary across different datasets and smart meter architectures, the effect of sampling frequencies on the performance should be evaluated.