Deep Transfer Learning-Based Feature Extraction: An Approach to Improve Nonintrusive Load Monitoring

The development of techniques that allow the efficient identification of residential loads (nonintrusive load monitoring) is a key factor for the practical implementation of demand response programs. Recently, the use of deep learning for nonintrusive load monitoring has gained attention, mainly through models based on convolutional neural networks. However, the efficient training of these models depends strongly on the quantity and balance of the data, characteristics that are not normally found in nonintrusive load monitoring datasets. To deal with these challenges, this paper proposes an approach based on three stages: (i) time series transformation into 2D images; (ii) feature extraction using deep transfer learning; and (iii) classification/labelling of loads. Moreover, the best window size per load was analyzed and defined in relation to the f1-score reached by the classifiers. Five loads present in the Reference Energy Disaggregation Dataset were considered, for which the proposed approach obtained an average f1-score of 83.2%. The analysis of the results demonstrated the greater capacity of the proposed approach to infer and generalize its responses.


I. INTRODUCTION
Although technological evolution has allowed the development and improvement of electrical devices, electricity consumption has increased. In this sense, the identification of consumers' electricity usage profiles becomes a valuable task, mainly for demand response programs [1]. To this end, nonintrusive load monitoring (NILM) emerged as a research area that effectively contributes to this identification, mainly for the residential sector [2]-[4]. NILM approaches make it possible to identify the operating status of each device and, consequently, to disaggregate the electricity consumption.
Research on NILM began in the 1990s, when the first proposed approaches basically analyzed real and reactive power to identify residential loads [5]-[10]. However, due to advances in artificial intelligence, the use of machine learning-based approaches has increased [11]-[13]. Following these advances, as shown in the review in [14], NILM research was divided into: (i) micro level, which considers the loads' current signatures, i.e., uses meters with high sampling rates that are, consequently, more expensive; and (ii) macro level, which uses RMS (root mean square) current or apparent power and, in this case, the meters may have a low sampling rate, being more suitable for real-life applications.
Within micro level approaches, the feature engineering stage (extraction and selection) is quite common [11], [12], [15]-[18]. This stage has helped to advance the state-of-the-art, reaching high precision in the identification task. On the other hand, macro level approaches, mainly due to the use of RMS values, have focused on proposing new supervised [19]-[22] and unsupervised [23]-[31] algorithms to identify residential loads.
Given this context, feature engineering for the macro level has become a research gap, since it is not trivial to transform RMS values into other features. However, due to advances in deep learning algorithms in different research areas, the possibility of transforming time series data into 2D images has gained notoriety.
As expected, the use of time series transformation as a feature extraction stage, combined with deep learning algorithms, was first applied at the micro level. In [32] and [33], the authors transformed voltage and current signals into V-I trajectories. From these 2D images, they used a convolutional neural network (CNN) to identify the residential loads. To the best of our knowledge, at the macro level, only the framework proposed in [34] transforms the RMS current time series into recurrence plots (2D images), using them as inputs to a CNN.
Given the intrinsic feature extraction process performed by CNNs and the potential of recurrence plots (RP) to generate 2D images that highlight recurring actions in the state space, this paper seeks to advance the state-of-the-art by proposing a deep transfer learning feature extraction (DTLFE) stage. The proposed stage fills the research gap previously mentioned and contributes to improving the identification of residential loads. The effectiveness of DTLFE was evidenced by using classical machine learning classifiers, namely multilayer perceptron (MLP), support vector machine (SVM) and extreme gradient boosting (XGBOOST). As a secondary contribution, the best window size per load was investigated, which can improve the identification performance. In addition, since the windowing process is normally performed by the meter, the definition of a maximum window size is also important to determine the hardware buffer size, i.e., this information could assist in the specification of a meter to be properly used for NILM purposes.
The remainder of the paper is organized as follows. Section II addresses the most closely related approaches found in the literature. Section III details the proposed methodology, highlighting the DTLFE. The results are discussed in Section IV. Finally, the conclusions are presented in Section V.

II. RELATED WORKS
This section reviews works that use data from the Reference Energy Disaggregation Dataset (REDD) to evaluate their performance, more specifically those that consider house 3. House 3 was chosen because it is commonly used to validate novel approaches. The characteristics of this dataset are presented in more detail in subsection III-A.
Kolter and Johnson [35] presented the first results using REDD as a benchmark dataset in 2011. They used a Factorial Hidden Markov Model (FHMM) to identify the loads, which achieved an accuracy of only 0.333 for house 3. This work motivated others to explore classifiers based on the Hidden Markov Model (HMM). In [36], the authors proposed a Bayesian HMM, obtaining an accuracy of 0.815, i.e., a great advance when compared to [35]. The HMM-based approaches of [37] and [38] are also noteworthy, reaching accuracies of 0.906 and 0.800, respectively. However, REDD can be considered an unbalanced dataset, since loads like the fridge are constantly on, while loads like the microwave are used sporadically. For this reason, the most recent work using FHMM [39] employs the f1-score instead of accuracy to assess its performance, reaching 0.809.
A multi-label classification framework was proposed by [40]. The computational experiments were conducted in order to evaluate the performance of k-Nearest Neighbors (kNN) and RAndom k-labELsets (RAkEL) classifiers. In addition, the authors analyzed the use of the power time series and its decomposition by a Wavelet transform. However, the direct use of the power time series, when classified by the kNN, presented the best average f1-score (0.530).
In [41], the use of a Long Short-Term Memory (LSTM) network without any feature engineering stage was proposed. The accuracy obtained for house 3 was about 0.920, i.e., greater than that reached by [37].
Sparse coding-based approaches have been recently proposed [22], [42], [43], presenting the advantage of using fewer training samples. Due to this characteristic, sparse coding algorithms are less prone to the effects of unbalanced datasets. Despite this advantage, these approaches reached accuracies of only 0.465 to 0.650. Also using fewer training samples, the authors of [44] investigated the behaviour of semi-supervised learning algorithms. It was observed that Manifold Regularization presented the best overall results. For house 3, it obtained an accuracy of 0.892.
Kong et al. [45] proposed an approach that uses HMM to model the home appliances and the Segmented Integer Quadratic Constraint Programming (SIQCP) to disaggregate the consumption. This approach shows good accuracy, obtaining 0.835.
Recently, a framework based on RP and CNN was evaluated by [34]. It was possible to demonstrate the robustness of the proposed framework, which reaches average f1-score and accuracy of 0.727 and 0.956, respectively. However, as pointed out by the authors, these results clearly demonstrate the unbalance inherent to the REDD.
Based on the works discussed above, the lack of feature engineering techniques that are effective at the macro level is evident. The approach proposed in this paper aims to fill this research gap, so that even classical machine learning algorithms can achieve high performance.

III. PROPOSED METHODOLOGY
As previously mentioned, the proposed methodology seeks to demonstrate the potential of the DTLFE stage to improve the identification of residential loads. A general overview of this methodology is shown in Fig. 1, and each stage is detailed in the following subsections.
The entire methodology was implemented in the Python programming language, using the following packages: (i) PyTS to extract the RPs; (ii) Tensorflow/Keras to implement the CNN; and (iii) Scikit-learn/XGBOOST to run the other machine learning classifiers. The source code was published on GitHub (see footnote 1). For the computations, an Intel Core i7-7700 CPU (3.60 GHz) with 16 GB DDR4 RAM and an NVIDIA Quadro M5000 was used.

A. DATASET AND PREPROCESSING
First of all, it is important to mention that REDD [35] is divided into low and high frequency measurements acquired from six houses. However, only the low frequency measurements obtained for house 3 were considered. These data were acquired from the main distribution panel at 1 Hz. The training set was composed of data from 16th April 2011 to 16th May 2011 (70% of the available data), while the test set was composed of data from 17th May 2011 to 30th May 2011 (the remaining 30%).
Next, the RMS current time series were processed using a sliding window. The definition of the window size is an important part of the NILM problem, since a small window can lose information that characterizes the load operation cycle and a large window can include noise and/or redundant information. For this reason, the training and test sets were further subdivided according to the window size (in seconds): 30, 60, 90, 180, 360, 540, 720, 900, 1080 and 2040.
Each window had its class labelled using the timestamps of the main distribution panel and the timestamps of the measurements acquired from the individual loads. A binary array with 5 dimensions was generated, representing single-state (microwave) and multiple-state loads (fridge, dishwasher, washer dryer 1, washer dryer 2). These loads were chosen according to their energy consumption in house 3, i.e., they are the loads that contribute most to the aggregated energy consumption.
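The windowing step can be sketched as follows. This is a minimal illustration and not the published implementation: the function name `sliding_windows`, the `step` parameter and the synthetic series are assumptions, and the stride in particular is not stated in the text, so non-overlapping windows are used by default.

```python
import numpy as np

def sliding_windows(series, window_size, step=None):
    """Split a 1-D series into windows of `window_size` samples.

    `step` is an assumption: the stride is not given in the text, so
    non-overlapping windows (step == window_size) are used by default.
    """
    series = np.asarray(series, dtype=float)
    if step is None:
        step = window_size
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    n_windows = (len(series) - window_size) // step + 1
    if n_windows <= 0:
        return np.empty((0, window_size))
    starts = np.arange(n_windows) * step
    return np.stack([series[s:s + window_size] for s in starts])

# At 1 Hz, a window of N seconds holds N samples.
WINDOW_SIZES_S = [30, 60, 90, 180, 360, 540, 720, 900, 1080, 2040]

# Synthetic stand-in for the RMS current series (not REDD data).
rms_current = np.random.default_rng(0).random(5000)
windows_by_size = {w: sliding_windows(rms_current, w) for w in WINDOW_SIZES_S}
```

Each entry of `windows_by_size` is a 2-D array with one window per row, which is the shape assumed by the later stages.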
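A possible form of the labelling step is sketched below. The rule used here (a load is "on" in a window if its measured power exceeds a threshold at any sample) and the threshold value are assumptions made only for illustration, since the text does not specify them. The helper `label_window` and the load names are hypothetical, and the per-load measurements are assumed to be already aligned by timestamp with the mains window.

```python
import numpy as np

# Order of the 5 dimensions of the label vector (naming is illustrative).
LOADS = ["microwave", "fridge", "dishwasher", "washer_dryer_1", "washer_dryer_2"]

def label_window(load_power, on_threshold_w=10.0):
    """Return a 5-element 0/1 vector for one window.

    load_power: dict mapping load name -> 1-D array with that load's power
    inside the window. `on_threshold_w` is a hypothetical threshold that is
    not taken from the paper.
    """
    return np.array(
        [int(np.any(np.asarray(load_power[name]) > on_threshold_w)) for name in LOADS],
        dtype=np.int8,
    )
```

With this representation, each column of the stacked label vectors can be treated as a separate binary ("on"/"off") target.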

B. RMS CURRENT TIME SERIES TRANSFORMATION
The windowed RMS current data were then transformed into RPs, which are commonly used to identify patterns in nonlinear and dynamic systems (mainly when dealing with time series). Considering that these patterns are recurrent [46], the RP allows them to be visualized [47] as 2D images. An RP is defined as an $M \times M$ matrix generated from a time series with $M$ samples, being expressed by (1):

$$R_{i,j} = \Theta\left(\varepsilon - \lVert x_i - x_j \rVert\right), \qquad (1)$$

where $x \in \mathbb{R}^n$; $i, j = 1, 2, \ldots, M$; $M$ is the number of states ($x_i$ or $x_j$) considered; $n$ is the immersion dimension; $\varepsilon$ is the radius of the neighborhood (threshold) at the sample ($x_i$ or $x_j$); and $\Theta(\cdot)$ is the Heaviside function. Therefore, if $R_{i,j} = 1$, the state is recurrent and a black pixel is marked on the graph; and if $R_{i,j} = 0$, the state is non-recurrent and a white pixel is marked. In this way, each RP has differences in terms of texture (isolated pixels, diagonal, vertical and horizontal lines) and typology [48].
In this paper, the RPs were specifically parameterized for NILM applications, with $\varepsilon = 10\%$ and $n = 1$. These parameters were obtained after exhaustive tests.

1 https://github.com/diegocavalca/phd-thesis
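For n = 1, the recurrence plot reduces to thresholding the absolute differences between pairs of samples. The sketch below uses plain NumPy rather than PyTS and should be read as an illustration only. It assumes that the 10% threshold is taken relative to the largest pairwise distance within the window, which is one possible reading of the parameter, and the function name `recurrence_plot` is hypothetical.

```python
import numpy as np

def recurrence_plot(window, eps_fraction=0.10):
    """Binary recurrence plot of a 1-D window (immersion dimension n = 1).

    Assumption: the threshold is `eps_fraction` times the largest pairwise
    distance in the window. Theta(0) is taken as 1, so a distance equal to
    the threshold counts as recurrent.
    """
    x = np.asarray(window, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])   # |x_i - x_j| for all pairs
    eps = eps_fraction * dist.max()
    return (dist <= eps).astype(np.uint8)    # 1 = recurrent, 0 = non-recurrent

rp = recurrence_plot([0.0, 0.5, 10.0])
```

The result is symmetric with ones on the main diagonal, since every state is recurrent with itself.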

C. DEEP TRANSFER LEARNING FEATURE EXTRACTION
Considering the RPs (2D images) as inputs, this stage was proposed to extract features from them. For this purpose, a CNN was used, as this model is able to capture the singularities present in the images, usually arranged in high-dimensional spaces [49]. Despite its advantages in recognizing patterns in images, the computational resources and the volume of data necessary for training are limiting factors for real-life applications, as is the case for NILM. For this reason, a transfer learning strategy was adopted, in which a pre-trained model (CNN VGG16) [50] was reused.
In this paper, however, the CNN VGG16 was used only to extract features. Its original architecture is divided into five convolutional blocks and three subsequent fully-connected layers, in addition to the softmax output layer. The last fully-connected and softmax output layers were removed, resulting in the architecture shown in Fig. 2. The first fully-connected layer was maintained, since its output represents a feature vector (embeddings) used as input to train and validate the ML models presented next.
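A feature extractor of this kind could be assembled with Keras roughly as follows. This is a sketch under several assumptions and not the published code: ImageNet weights are assumed for the pre-trained model, the embedding is read from the Keras layer named `fc1` (the first fully-connected layer), and the recurrence plot is resized by nearest-neighbour sampling and replicated over three channels. The helpers `rp_to_vgg_input`, `build_fc1_extractor` and `extract_features` are hypothetical names.

```python
import numpy as np

def rp_to_vgg_input(rp, size=224):
    """Turn a binary M x M recurrence plot into a (size, size, 3) float image.

    Recurrent points (1) are drawn black and non-recurrent points (0) white.
    Nearest-neighbour resizing and channel replication are assumptions.
    """
    rp = np.asarray(rp)
    idx = np.arange(size) * rp.shape[0] // size      # nearest-neighbour indices
    resized = rp[np.ix_(idx, idx)]
    gray = (1 - resized).astype(np.float32) * 255.0
    return np.repeat(gray[:, :, None], 3, axis=2)

def build_fc1_extractor():
    """Pre-trained VGG16 truncated at its first fully-connected layer."""
    import tensorflow as tf  # lazy import: needs TensorFlow and the ImageNet weights
    base = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
    return tf.keras.Model(inputs=base.input, outputs=base.get_layer("fc1").output)

def extract_features(extractor, rps):
    """Return one 4096-dimensional embedding per recurrence plot."""
    import tensorflow as tf
    batch = np.stack([rp_to_vgg_input(rp) for rp in rps])
    batch = tf.keras.applications.vgg16.preprocess_input(batch)
    return extractor.predict(batch, verbose=0)
```

Because the convolutional and fully-connected weights stay frozen, no gradient computation is needed at this stage; the embeddings are computed once per window and then passed to the classifiers.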

D. CLASSIFICATION USING MACHINE LEARNING
In the last stage of the proposed methodology, the feature vector extracted by the DTLFE was presented to the following classifiers: MLP, SVM and XGBOOST. In order to assess the performance of the DTLFE in conjunction with the classifiers, two other comparative approaches were also considered: (1) feature extraction by means of recurrence quantification analysis (RQA), using the indicators called determinism and recurrence rate as inputs to the same three classifiers; and (2) the use of RPs as inputs to a CNN classifier, as proposed in [34]. A block diagram representing the comparison test is presented in Fig. 3. Since this paper is focused on improvements in terms of feature extraction, the classifiers were parameterized with default values, namely: MLP (α equal to 1e-3, hidden layer size equal to 10); SVM (radial basis function kernel); XGBOOST (number of estimators equal to 100).
The load identification results of the machine learning classifiers were evaluated according to the most used metrics for NILM, accuracy (acc) and f1-score (f1), which can be described as:

$$acc = \frac{TP + TN}{TP + TN + FP + FN},$$

$$f1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall},$$

where $Precision = \frac{TP}{TP + FP}$, $Recall = \frac{TP}{TP + FN}$, $TP$ is the number of true positives, $TN$ is the number of true negatives, $FP$ is the number of false positives and $FN$ is the number of false negatives.
Next, the results were analyzed and discussed to demonstrate the robustness and effectiveness of the DTLFE against the other two approaches considered in this paper and the state-of-the-art works.
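The classifier configuration described above might be expressed with Scikit-learn and XGBoost as in the sketch below. Training one binary classifier per load is an assumption, since the text does not state how the 5-dimensional label array is consumed. The fixed `random_state` is added only for reproducibility, and the helpers `make_classifiers` and `fit_per_load` are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def make_classifiers():
    """Classifiers with the parameter values quoted in the text."""
    clfs = {
        "MLP": MLPClassifier(alpha=1e-3, hidden_layer_sizes=(10,), random_state=0),
        "SVM": SVC(kernel="rbf"),
    }
    try:
        from xgboost import XGBClassifier  # optional dependency
        clfs["XGBOOST"] = XGBClassifier(n_estimators=100)
    except ImportError:
        pass  # XGBoost not installed; only MLP and SVM are returned
    return clfs

def fit_per_load(features, labels, make_clf):
    """Train one binary classifier per column of `labels` (assumed scheme)."""
    return [make_clf().fit(features, labels[:, k]) for k in range(labels.shape[1])]
```

Here `features` would hold the extracted embeddings (one row per window) and `labels` the binary array with one column per load.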
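The two metrics can be computed directly from the confusion counts, as in the following sketch (function names are illustrative). The small example uses made-up labels, not results from the paper, to show why accuracy alone can be misleading on unbalanced data: a classifier that always predicts "off" reaches a high accuracy while its f1-score is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 = 'on' and 0 = 'off'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    # 2 * P * R / (P + R) simplifies to 2 * TP / (2 * TP + FP + FN).
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up unbalanced example: 95 'off' windows, 5 'on' windows,
# and a classifier that always answers 'off'.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc_value = accuracy(tp, tn, fp, fn)   # 0.95
f1_value = f1_score(tp, fp, fn)        # 0.0
```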

IV. RESULTS AND DISCUSSIONS
The results were obtained considering the previously mentioned test dataset for different window sizes. Thus, they were analyzed in terms of: (A) overall performance using a fixed window size to compare the DTLFE with the two other approaches highlighted in Fig. 3; (B) DTLFE generalization capacity; (C) impact of window sizes on the performance; and (D) comparison with the related works.

A. OVERALL PERFORMANCE ANALYSIS
As can be seen in Fig. 4, the DTLFE provided the best average accuracy and f1-score for a fixed window size of 2,040 seconds (the maximum window size considered). Since the dataset is unbalanced (i.e., the numbers of samples per class are different), the use of accuracy to measure the performance of classifiers can be biased, which makes the f1-score a more adequate metric. In terms of f1-score, the performance gain of the DTLFE when compared to the other two approaches, which use RQA and RP as feature extraction methods, ranged between 6.4% and 22.4%. Analyzing the combinations of feature extraction approaches and classifiers, the dominance of the DTLFE was also noticed, as shown in Table 1 (ordered by f1-score). The combination of DTLFE with the MLP classifier obtained the best f1-score, and the difference in relation to the other approaches ranged from 10.6% to 32.3%. This result shows that the proposed approach was efficient even in view of the different loads considered and the evident unbalance between the classes (labelled as "on" or "off") present in the REDD. The results of the MLP for each load and the number of "on" and "off" samples are shown in Table 2. Considering the best performance presented by DTLFE in conjunction with an MLP-based classifier, a generalization analysis was performed, as shown in the next subsection.

B. GENERALIZATION ANALYSIS
In order to assess the generalization of MLP learning, it was submitted to a test dataset with randomly chosen samples. For this evaluation, a 10-fold cross-validation strategy was adopted. Each of the feature extraction approaches was analyzed, where only in the case of using RP a classifier based on CNN was considered instead of the MLP. The probability distributions obtained for accuracy and f1-score are presented in Figs. 5 and 6, respectively. In general, it was possible to observe that the DTLFE presents a good probability distribution, indicating that it has the best learning and generalization capacity when compared to the other two approaches discussed in this paper. It is also worth noting that the approach using RQA features provides the most unstable results in terms of f1-score. This suggests that such features do not provide enough information for MLP training and, consequently, impair its generalization.
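The cross-validation procedure can be illustrated with a small NumPy sketch. Shuffling before splitting and the absence of stratification are assumptions, since the text only mentions randomly chosen samples and 10 folds. The helpers `k_fold_indices` and `cross_validate` are hypothetical, and any classifier can be plugged in through the `fit_predict` callable.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def cross_validate(X, y, fit_predict, score, k=10, seed=0):
    """Return one score per fold.

    fit_predict(X_train, y_train, X_test) -> predicted labels for X_test
    score(y_true, y_pred) -> float, e.g. accuracy or f1-score
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k, seed):
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(score(y[test_idx], y_pred))
    return np.array(scores)
```

The array of per-fold scores is what a distribution plot of accuracy or f1-score would be built from; a narrow spread indicates more stable generalization.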

C. IMPACT OF WINDOW SIZE
Although the use of a fixed window size is common in NILM research, as previously stated, this size directly affects the performance of some approaches. The window size must be adjusted so that it is neither so small that it loses load operating cycles nor so wide that it overlaps heavily with other loads and aggravates the data unbalance, which is a nontrivial task. For this reason, the impact of window size per load was analyzed in this paper, considering the DTLFE in conjunction with an MLP classifier. The results are shown in the graphs of Fig. 7.
It is possible to observe that the window size has an impact on the f1-score, since the results vary, from fixed to variable sliding windows, in a range between 3.8% and 7% (Table 3). Furthermore, on average, there was an increase of 3% (in terms of f1-score) when using different windows per load.
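Selecting a window per load amounts to taking, for each load, the window size with the highest f1-score. The sketch below also derives the meter buffer size implied by the largest selected window. The scores in the example are placeholder numbers chosen only to exercise the code and are not values from the paper, and the helper names are hypothetical.

```python
def best_window_per_load(f1_by_load):
    """Map each load to the window size (in seconds) with the highest f1-score.

    f1_by_load: {load_name: {window_size_s: f1_score}}
    """
    return {load: max(scores, key=scores.get) for load, scores in f1_by_load.items()}

def buffer_samples(best_windows, sampling_rate_hz=1):
    """Number of samples the meter must buffer for the largest chosen window."""
    return max(best_windows.values()) * sampling_rate_hz

# Placeholder scores for illustration only (not results from the paper).
example_scores = {
    "fridge": {30: 0.50, 60: 0.70},
    "microwave": {30: 0.60, 60: 0.40},
}
best = best_window_per_load(example_scores)
required_buffer = buffer_samples(best)
```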

D. COMPARATIVE ANALYSIS WITH RELATED WORKS
Moreover, Table 4 presents a high-level comparison with the state-of-the-art papers that considered house 3 of REDD, regardless of their methodological aspects. The proposed approach achieved the best f1-score and accuracy, considering either a fixed or a variable sliding window per load. Therefore, the proposed DTLFE proved to be a robust feature extraction technique.

V. CONCLUSIONS
The use of a methodology based on multiple stages (time series transformation into images, automatic extraction of features using deep transfer learning and classification) has shown effective results regarding the task of identifying loads to disaggregate their electricity consumption. The proposed DTLFE approach was able to extract features that contribute to better classify the loads and to improve the generalization of an MLP classifier. In addition, the impact of window size on the classifier's performance was mitigated by using different sizes of window for each load.
Given the results obtained by the proposed methodology, future research can analyze the capacity of the pre-trained model to perform transfer learning in relation to the other houses contained in REDD, or consider a fine-tuning strategy for the same domain (i.e., using other public NILM datasets). Moreover, an optimized parameterization of RPs and other techniques to overcome the obstacles generated by unbalanced data can be analyzed.