Real-Time Energy Disaggregation Algorithm Based on Multi-Channels DCNN and Autoregressive Model

Energy disaggregation is the process of estimating the energy consumption of individual appliances in a house from the aggregate power consumption measured by a single electrical meter. Deep learning methods are now widely applied in this field, and real-time energy disaggregation is an important branch of it. Building on the Short Sequence-to-Point (Short Seq2point) network structure of Odysseas et al., this paper proposes a real-time energy disaggregation algorithm based on multi-channels deep convolutional neural networks (MC-DCNN) and an autoregressive (AR) model. The algorithm estimates the energy consumption of an appliance at the current time point from the historical aggregate power consumption only, so that disaggregation results can be delivered in real time. The proposed method takes the original aggregate power sequence and its differential power signal as the network inputs and uses an MC-DCNN with a modified concatenation layer to extract information over different time lengths in the sequence, so that the network can adapt to appliances with different operating modes. In addition, a traditional autoregressive model is added as a linear component to address the problem that, in a neural network model, the scale of the output is insensitive to the scale of the input. The proposed method was tested on the UK-DALE and REDD datasets. The experimental results show that it achieves good disaggregation performance on both datasets, has a small number of parameters, and achieves fast inference.


I. INTRODUCTION
Energy disaggregation, also referred to as non-intrusive load monitoring (NILM), was originally proposed by George Hart [1], [2] and refers to the process of extracting the energy consumption of individual appliances from the total energy consumption of all appliances in a residence. Compared with intrusive load monitoring, NILM does not require a sensor on each appliance to monitor its operation. It only requires a single sensing device at the entrance of the home to measure the aggregate power, which can then be disaggregated algorithmically to obtain the electricity consumption of each appliance, with the advantages of convenience and low cost [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
There have been many studies on NILM, and pattern recognition-based methods, which include supervised and unsupervised learning algorithms, are one of the major research directions. Supervised learning algorithms require each device to be labeled so that the energy disaggregation system can identify it; they include artificial neural networks (ANN) [4], decision trees [5], support vector machines (SVM) [6], [7], K-nearest neighbors (KNN) algorithms [8], [9], and so on. Unsupervised learning algorithms do not require labeling for system modeling, but their accuracy is somewhat lower than that of supervised learning; the main methods are k-means clustering [10], hidden Markov models (HMM) [11], expectation maximization (EM) [12], etc. Both unsupervised and supervised methods rely on feature mapping and transformation techniques to extract device-independent features, so as to obtain robust features for effectively modeling NILM systems. Deep learning algorithms have recently been widely adopted in computer vision [13], [14], speech recognition [15], [16], and natural language processing [17], [18] with excellent results. Because they can extract the intrinsic features of the original data without specialized knowledge, many researchers have applied deep learning to energy disaggregation. Kelly et al. pioneered the sequence-to-sequence idea in this field and designed three deep neural network architectures based on convolutional and recurrent neural networks: a long short-term memory (LSTM) network, a denoising autoencoder, and a Rectangle network that regresses the start time, end time, and average power consumption of each device activation.
All these networks outperform the combinatorial optimisation (CO) and factorial hidden Markov model (FHMM) algorithms on the UK-DALE dataset [19]. However, adjacent input windows of the sequence-to-sequence model overlap, so every element of the output sequence is predicted many times; in addition, the model cannot use all nearby elements of the input sequence when predicting elements at the edges of the window. To address these problems, Zhang et al. proposed the sequence-to-point (seq2point) model, whose input is an aggregate power window and whose output is the power of the target appliance at the midpoint of that window; its disaggregation performance is better than that of the sequence-to-sequence (seq2seq) model [20]. Yang et al. [21] proposed a sequence-to-point model based on temporal convolutional networks, using dilated convolution to obtain a larger receptive field and introducing residual blocks to avoid the degradation problem, which significantly improved network performance and reduced the number of model parameters. Zhou et al. [22] proposed a multi-scale residual network whose basic structural unit is a dilated convolutional residual block: residual blocks are connected sequentially into a residual block body, and multiple residual block bodies of different depths are connected in parallel to form a multi-branch structure for learning mixed-data features. The results show that the model improves both disaggregation performance and model complexity across different devices. Antoine et al. proposed an energy disaggregation method based on the variational autoencoder framework, in which the encoder extracts the target device information from the input signal and the decoder reconstructs the power signal of the target device. The method achieved excellent performance on the UK-DALE and REFIT datasets [23].
Considering the difficulty of obtaining large amounts of labeled training data, Cui et al. proposed a method for estimating power consumption via background filtering, which trains the neural network using only synthetic aggregate data, reducing the difficulty of obtaining training data while achieving better performance [24].
To select more effective features from the numerous appliance features, several researchers have introduced attention mechanisms into NILM. Chen et al. proposed a novel neural network architecture called the scale- and context-aware network (SCANet), which uses a multi-branch architecture to extract multi-scale features, a self-attention module to integrate context information, and adversarial loss and state augmentation to improve accuracy. The experimental results showed a significant improvement in model performance compared with the state-of-the-art models [25]. Based on bidirectional encoder representations from transformers (BERT), Yue et al. proposed a structure called BERT4NILM, which uses multi-head attention for energy disaggregation. With the proposed loss function and masking training procedure, the method outperforms the state-of-the-art models in various metrics on the UK-DALE and REDD datasets [26].
To estimate the power of an appliance at moment t, the above studies used future data as part of the input. Future data contains the future operating state of the appliance (e.g., whether the appliance is turned on or off, or is operating in another mode), and this information helps the network disaggregate more accurately. However, it also means that disaggregation cannot be performed until the meter has collected the future power data. Because the data are sampled at low frequency, this introduces a significant delay, so these schemes are not suitable for real-time scenarios in which users need to receive disaggregation results with the shortest possible delay. For example, in a dynamically priced grid, a user may turn on a high energy-consuming appliance at a time when electricity is expensive. Given a real-time notification, the user could postpone the use of the appliance to save money and reduce the network load during peak hours [27]. Christos et al. proposed a novel multi-class real-time identification system that takes high-frequency data sampled at 100 Hz as input, updates the power data every 6 seconds, and identifies devices. Moreover, by using KNN classifiers, the system can add new devices without retraining [28]. However, high-frequency sampling requires complex hardware, which can lead to additional costs during monitoring [29]. In contrast, Odysseas et al. still used low-frequency data, but took the total power over a period of time before moment t as input for predicting the power consumption of the appliance at moment t.
They proposed three network architectures that use sliding windows for real-time energy disaggregation, namely long short-term memory (LSTM) networks, gated recurrent unit (GRU) networks, and Short Sequence-to-Point (Short Seq2point) networks, which were more effective on multi-state appliances than on two-state appliances [30]. Similarly, Virtsionis et al. [31] took only past data as input. They proposed a lightweight deep neural network called Self-Attentive-Energy-Disaggregation (SAED), which uses an attention mechanism to focus on the most important features. Additive and dot-product attention mechanisms are compared, and the results show that the performance of the two is comparable. The network is capable of fast training and inference.
All the above studies took the aggregate power series as the network input. However, some researchers found that the differential signal obtained by differencing the original sequence contains information about the state changes of the electrical equipment, and that using it as input can enhance disaggregation performance. The literature [32] proposed a composite deep LSTM-based method for load disaggregation. It takes the aggregate power and differential power as input, and then encodes, separates, and decodes them to achieve regression from one sequence to several sequences. Compared with the single-sequence-to-single-sequence approach, this method simplifies the disaggregation procedure and improves disaggregation efficiency. The literature [33] proposed a NILM-based EMS and a convolutional neural network model that uses the differential signal as input. It points out that neural network-based models that use raw data as input perform the differential operation implicitly, but that this is inaccurate and computationally expensive. Experimental results show that using differential sequences as input improves the disaggregation performance of the neural network while greatly reducing the number of network parameters.
In this paper, based on the Short Sequence-to-Point network, we propose a real-time energy disaggregation algorithm based on multi-channels deep convolutional neural networks (MC-DCNN) [34] and an autoregressive (AR) model. First, the original total power sequence is differenced to obtain the differential signal, and the differential signal and the original sequence are fed into different channels of the network. In this way the network can directly learn the on/off information of the equipment contained in the differential signal without having to perform the differencing itself, while still learning the remaining useful information contained in the original sequence. Then, MC-DCNN performs feature extraction on the time series, learning the amplitude and state change information of the appliances from the sequences of the two channels separately. Furthermore, to compensate for the information loss caused by the max pooling layer and to adapt the network to different appliances, features of different time lengths extracted at different stages in the channels are concatenated as the input to the multilayer perceptron (MLP). Finally, a conventional autoregressive model is added to address the scale insensitivity problem of the neural network model. The proposed method is validated on the UK-DALE and REDD datasets, and the results show that it performs well. The main contributions of this paper are as follows:
• MC-DCNN is adopted to solve this multivariate time series regression problem: the aggregate power series and the differential series are fed into different channels, so that the amplitude and state change information of the appliances are learned from the sequences of the two channels respectively, which improves the energy disaggregation performance.
• The features of different time lengths extracted from two channels are fused to compensate for the information loss caused by the max pooling layer in the feature extraction process and enable the network to adapt to different appliances.
• The autoregressive linear model is used as the linear component for addressing the scale insensitivity problem in the neural network model.

A. PROBLEM FORMULATION OF ENERGY DISAGGREGATION
Let the total power consumption over a time period T be $X = (x_1, x_2, \ldots, x_T)$, and let the power consumption of the i-th appliance be $Y_i = (y_1^i, y_2^i, \ldots, y_T^i)$. At each time step, the aggregate consumption can be expressed as the sum of the power consumption of all devices, as follows:
$$x_t = \sum_{i=1}^{m} y_t^i + \varepsilon_t \tag{1}$$
where $\varepsilon_t$ denotes Gaussian noise with zero mean and variance $\sigma_t^2$, and m denotes the total number of appliances in the house. Assuming that we are only interested in the n household appliances that are widely used in most households, the power consumption of the other appliances can be expressed as $S = (s_1, s_2, \ldots, s_T)$, and (1) can be rewritten as:
$$x_t = \sum_{i=1}^{n} y_t^i + s_t + \varepsilon_t \tag{2}$$
Energy disaggregation is the task of recovering the power consumption sequences of the appliances $Y_1, Y_2, \ldots, Y_n$ from the aggregate power.
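As a rough illustration of this formulation, the following Python sketch builds an aggregate sequence from per-appliance sequences, an unmetered remainder, and optional Gaussian noise. The function name `aggregate_power`, the toy appliance traces, and the fixed noise level are illustrative assumptions and are not taken from the paper.

```python
import random


def aggregate_power(appliances, other, noise_std=0.0, seed=0):
    """Aggregate power per time step: metered appliances + other loads + noise.

    appliances: list of power sequences (one list per appliance of interest)
    other:      power sequence of all remaining appliances
    noise_std:  standard deviation of the zero-mean Gaussian noise (assumed constant)
    """
    rng = random.Random(seed)
    return [
        sum(y[t] for y in appliances) + other[t] + rng.gauss(0.0, noise_std)
        for t in range(len(other))
    ]


# Toy traces in watts (hypothetical values).
kettle = [0.0, 2000.0, 2000.0, 0.0]
fridge = [90.0, 90.0, 0.0, 0.0]
other = [50.0, 50.0, 50.0, 50.0]

x = aggregate_power([kettle, fridge], other)  # noise disabled by default
```

Disaggregation is the inverse task: given only `x`, estimate `kettle` and `fridge`.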

B. SHORT SEQUENCE TO POINT LEARNING
Seq2point takes a window of the total power sequence $X_{t-w/2:t+w/2}$ as input to estimate the energy consumption of the target device at the midpoint of the window, $y_t$. Because data after moment t are used, it is not suitable for online disaggregation scenarios. To address this, Short Sequence-to-Point takes only the aggregate power segment before the target moment, $X_{t-w/2:t}$, as input and defines a neural network F that maps the window sequence $X_{t-\tau:t}$ to the device power consumption at the target moment $y_t$:
$$y_t = F(X_{t-\tau:t}) + \varepsilon \tag{3}$$
where $\varepsilon$ denotes Gaussian random noise. In this paper, we take the aggregate power series and the differential series as input, i.e., the input is a multivariate time series $M = \{m_1, m_2, \cdots, m_T\}$, where $m_t \in \mathbb{R}^N$, N is the number of variables, and here N = 2. (3) can be reformulated as follows:
$$y_t = F(M_{t-\tau:t}) + \varepsilon \tag{4}$$
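The difference between the two windowing schemes can be sketched in a few lines of Python. The helper names and the inclusive handling of the window endpoints are assumptions made for illustration; the point is only that the centered window needs samples after t, whereas the causal window does not.

```python
def seq2point_window(x, t, w):
    """Centered window around t: needs w // 2 samples from the future."""
    half = w // 2
    return x[t - half : t + half + 1]


def short_seq2point_window(x, t, tau):
    """Causal window ending at t: uses only samples up to and including t."""
    return x[t - tau : t + 1]


x = list(range(10))                           # stand-in for an aggregate power sequence
centered = seq2point_window(x, t=5, w=4)      # covers indices 3..7
causal = short_seq2point_window(x, t=5, tau=4)  # covers indices 1..5
```

With the causal window, an estimate for moment t can be produced as soon as the sample at t arrives.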

III. PROPOSED METHOD
In this section, we give a complete explanation of the proposed method. The overall network structure is shown in Fig. 1. It consists of a nonlinear part and a linear part: the nonlinear part is the MC-DCNN with integrated feature fusion, and the linear part is the autoregressive model.

A. MC-DCNN
MC-DCNN was designed to solve multivariate time series classification problems and has achieved excellent results on several multivariate time series datasets. Since the network input is a bivariate time series consisting of the aggregate power series and the differential power series, we use the MC-DCNN as the backbone of the nonlinear part. The structure of the MC-DCNN network is shown in Fig. 2. First, the aggregate power series and the differential power series are fed into two channels: one channel focuses on learning the amplitude information of the appliance contained in the series, and the other focuses on learning the state change information of the appliance. In each channel, a feature extractor consisting of multiple stages learns hierarchical features from the univariate time series. Each stage consists of a one-dimensional convolutional layer with ReLU as the activation function, followed by a max pooling layer.
Convolutional layers are used to obtain local temporal information of the sequence. The input of every convolutional layer is a time series $x_i^l \in \mathbb{R}^{len_i^l \times m_i^l}$ $(i = 1, 2, \ldots, n)$, where l denotes the layer from which the input comes, i denotes the channel to which it belongs, n represents the number of channels, i.e., the number of univariate time series, and $len_i^l$ and $m_i^l$ denote the length and dimensionality of the input series, respectively. The convolution layer contains $k_i^l$ filters; the width of each kernel equals the dimensionality of the input $m_i^l$, and its height is $h_i^l$. The j-th filter scans across the input matrix and generates:
$$x_{ij}^{l+1} = \mathrm{ReLU}(W_{ij}^{l} * x_i^l + b_{ij}^{l})$$
where * indicates the convolution operator, $W_{ij}^{l}$ and $b_{ij}^{l}$ are the weights and bias of the j-th filter, and $x_{ij}^{l+1}$ denotes the output. $\mathrm{ReLU}(x) = \max(0, x)$ is used as the activation function. Without padding, each convolution shrinks the output matrix, so information at the edge positions is progressively lost and subsequent convolution operations are adversely affected. We therefore zero-pad the input matrix $x_i^l$ so that the output matrix has the same length as the input matrix. The size of the output matrix of the convolution layer $x_i^{l+1}$ is $len_i^l \times k_i^l$. A max pooling layer follows each convolutional layer and subsamples the output matrix of the convolutional layer $x_i^{l+1}$:
$$s_i^l = \mathrm{MaxPooling}(x_i^{l+1})$$
where MaxPooling denotes the 1-D max pooling layer, $s_i^l$ denotes the output matrix of size $slen_i^l \times k_i^l$, $slen_i^l = len_i^l / stride$, and stride denotes the stride of the max pooling layer. Then, the learned features of each channel are concatenated. In particular, the method proposed in this paper replaces the original simple feature concatenation layer with a feature fusion module that fuses information of different time lengths in the channel.
Lastly, the obtained features are input to the fully connected layer to obtain the output of the nonlinear part, $y_t^N$.
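One stage of a channel (zero-padded 1-D convolution with ReLU, followed by max pooling) can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the filter weights are hand-picked rather than learned, the convolution is computed as a cross-correlation as is usual in deep learning frameworks, and the pooling stride is assumed to equal the pooling size.

```python
def conv1d_same(x, kernels, biases):
    """Zero-padded 1-D convolution with ReLU.

    x:       input matrix, `length` rows (time) by m columns (features)
    kernels: k filters, each h rows by m columns
    biases:  k bias values
    Returns a `length` by k matrix (same length as the input).
    """
    length, m, h = len(x), len(x[0]), len(kernels[0])
    left = (h - 1) // 2
    zeros = lambda rows: [[0.0] * m for _ in range(rows)]
    padded = zeros(left) + x + zeros(h - 1 - left)
    out = []
    for t in range(length):
        row = []
        for w, b in zip(kernels, biases):
            acc = b
            for dt in range(h):
                for c in range(m):
                    acc += w[dt][c] * padded[t + dt][c]
            row.append(max(0.0, acc))  # ReLU
        out.append(row)
    return out


def max_pool1d(x, size=2, stride=2):
    """Max pooling along the time axis, applied to each feature column."""
    k = len(x[0])
    return [
        [max(x[t + d][c] for d in range(size)) for c in range(k)]
        for t in range(0, len(x) - size + 1, stride)
    ]


series = [[1.0], [2.0], [3.0], [4.0]]       # univariate input, length 4
edge_filter = [[[-1.0], [0.0], [1.0]]]      # one hand-picked filter of height 3
features = conv1d_same(series, edge_filter, [0.0])
pooled = max_pool1d(features)               # length halves from 4 to 2
```

The padded convolution keeps the sequence length, and only the pooling step shortens it, which matches the length bookkeeping described above.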

B. FEATURE FUSION
This module is integrated into the MC-DCNN to fuse the input series with the features extracted at different stages in the two channels. Time series data can usually be regarded as a one-dimensional or flattened image, and convolutional neural networks (CNN) are used to extract signals or characteristics from it. The literature [35] points out that time series are usually much less complex than images and contain far fewer effective variables, and that the pooling layer shrinks the parameter dimensionality when down-sampling the data, which may discard too much useful information. The experimental results indicate that model performance always decreases after pooling layers are introduced, which suggests that pooling layers have a negative effect. However, eliminating the pooling layer would leave too many feature-map elements to process, and more convolutional layers would have to be stacked for the output features of the last convolutional layer to cover the whole input, making the model very large. In addition, the operating states and running times of household appliances vary: kettles and microwaves are short-duration appliances, the fridge is a medium-duration appliance, and washing machines and dishwashers are long-duration appliances. If only the features obtained in the last stage are taken as the input of the fully connected layer for regression, the network cannot learn information over different time lengths and therefore cannot account for the operating characteristics of different household appliances.
To solve these problems, we keep the pooling layers and concatenate the output matrices $s_i^l$ of every stage (convolutional and pooling layers) in each channel with the original input sequences of the network. The deeper the stage, the larger the observation window and the longer the time range over which information can be extracted. The output matrices $s_i^l$ of different stages therefore contain information from receptive fields of different sizes, representing patterns of different time lengths in the time series. By learning features of different time lengths, the model can adapt to appliances with different operating modes; at the same time, it can recover from the outputs of earlier stages some of the useful information that is lost through down-sampling.
The length of the output matrix varies from stage to stage. To concatenate the features obtained at different stages with the original input sequences, each output matrix is padded with zeros at the end so that its length $slen_i^l$ equals the length of the input sequence len. The sequence padding and concatenation process is as follows:
$$pad_i^l = \mathrm{Pad}(s_i^l)$$
$$con = \mathrm{Concat}(x_1, \ldots, x_i, pad_1^1, \ldots, pad_i^l)$$
where $x_a$ denotes the original input sequence of channel a, $pad_i^l \in \mathbb{R}^{len \times k_i^l}$, $con \in \mathbb{R}^{len \times (i + \sum_{a=1}^{i} \sum_{b=1}^{l} k_a^b)}$, and len denotes the length of the input sequence.
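The padding and concatenation step can be illustrated with a short Python sketch. The function names `pad_to_length` and `fuse` and the toy feature values are hypothetical; the sketch only shows how shorter stage outputs are zero-padded at the end and stacked column-wise next to the raw input series.

```python
def pad_to_length(features, length):
    """Append all-zero rows so that the feature matrix has `length` rows."""
    k = len(features[0])
    return features + [[0.0] * k for _ in range(length - len(features))]


def fuse(inputs, stage_outputs):
    """Concatenate raw input series and zero-padded stage outputs column-wise.

    inputs:        list of univariate series, each of the same length
    stage_outputs: list of feature matrices (rows = time, columns = filters)
    Returns a matrix with one row per input time step.
    """
    length = len(inputs[0])
    padded = [pad_to_length(s, length) for s in stage_outputs]
    con = []
    for t in range(length):
        row = [series[t] for series in inputs]
        for p in padded:
            row.extend(p[t])
        con.append(row)
    return con


aggregate = [1.0, 2.0, 3.0, 4.0]
diff = [0.0, 1.0, 1.0, 1.0]
stage1 = [[5.0, 6.0], [7.0, 8.0]]   # length 2, two filters (toy values)
stage2 = [[9.0]]                    # length 1, one filter (toy values)

con = fuse([aggregate, diff], [stage1, stage2])
```

Here `con` has 4 rows and 2 + 2 + 1 = 5 columns, i.e. the number of input series plus the total number of filters over all stages.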
These feature sequences have different effects on the disaggregation results, so they should have different weights, which need to be learned through training.
A CNN kernel is used to scan the concatenated features to capture the dependency patterns between different time series. The width of the kernel w is equal to the dimension of the fused feature sequence con, and the height of the kernel is h. Specifically, the k-th convolution filter sweeps over the input matrix con and produces:
$$R_k = \mathrm{ReLU}(W_k * con + b_k) \tag{9}$$
where $R_k$ is the output vector of the filter. We pad the input matrix with zeros, so the output matrix of the convolutional layer is $R \in \mathbb{R}^{len \times q}$, where q denotes the number of filters. A max pooling layer follows the convolutional layer; it compresses the sequence and extracts the very long patterns:
$$u = \mathrm{MaxPooling}(R)$$
where $u \in \mathbb{R}^{(len/stride) \times q}$.

C. AUTOREGRESSIVE
Owing to the nonlinear nature of convolutional neural networks, the scale of the model output is insensitive to the scale of the input, which substantially reduces prediction accuracy on datasets where the scale of the inputs changes in a non-periodic manner [36]. To address this issue, the researchers in [36], [37] added a conventional autoregressive model to the nonlinear neural network and demonstrated that it makes the model more robust to time series whose scale changes.

The aggregate power series and the differential series do not have significant periodicity, so a similar AR model is introduced in this paper as well. The autoregressive model is formulated as follows:
$$y_t^L = \sum_{k=0}^{window-1} W_k^L \, x_{t-k} + b^L$$
where $y_t^L \in \mathbb{R}$ is the output of the autoregressive model, $W^L \in \mathbb{R}^{window}$ is the vector of autoregressive coefficients, $b^L \in \mathbb{R}$ is the bias, $x_{t-k}$ is the input value at moment t − k, and window denotes the size of the time window of the AR model (also referred to as the order of the model).
The final output is obtained by combining the output of the neural network with that of the AR model:
$$\hat{Y}_t = y_t^N + y_t^L$$
where $\hat{Y}_t$ indicates the final output of the model at moment t.
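The role of the linear component can be sketched as follows. The example assumes that the AR term is a weighted sum over a single input window and that the two parts are combined by addition; the weights are toy values, not trained coefficients, and the function names are illustrative.

```python
def ar_output(window_values, weights, bias):
    """Linear AR term: weighted sum of the window values plus a bias."""
    if len(window_values) != len(weights):
        raise ValueError("window and weight vector must have the same length")
    return sum(w * x for w, x in zip(weights, window_values)) + bias


def combined_output(nonlinear_out, window_values, weights, bias):
    """Final estimate: nonlinear network output plus linear AR output (assumed sum)."""
    return nonlinear_out + ar_output(window_values, weights, bias)


weights = [0.5, 0.5, 0.5]                              # toy coefficients, order 3
small = ar_output([1.0, 2.0, 3.0], weights, 0.0)
large = ar_output([10.0, 20.0, 30.0], weights, 0.0)    # same shape, 10x the scale
estimate = combined_output(0.25, [1.0, 2.0, 3.0], weights, 0.0)
```

Because the AR term is linear, scaling the input window by a factor of 10 scales its output by the same factor, which is the scale sensitivity that the nonlinear part lacks.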

IV. EXPERIMENTS
The hardware environment for this study is a 64-bit computer with an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30 GHz, 16 GB of RAM, and an NVIDIA GeForce RTX 3050 Laptop GPU. The software platform is Windows 10 Professional, Python 3.8.12 (64-bit), and the TensorFlow-gpu 2.4.0 deep learning framework. In the proposed model structure, the convolution kernel size is 3 and the stride size is 1; the pooling size is 2. When training the model, the batch size is set to 128, the mean square error is adopted as the loss function, and the Adam optimizer is used with a learning rate of 0.001.

A. DATA SET
We evaluated our proposed method on two datasets, UK-DALE [38] and REDD [39]. The UK-DALE dataset was created by Kelly and Knottenbelt in 2015 and contains electricity consumption data from five UK houses, while the REDD dataset contains data from six US houses. The UK-DALE data were recorded at 6-second intervals from November 2012 to January 2015 and contain the total power consumption and measurements for 4-54 devices. Five appliances (kettle, microwave, fridge, dishwasher, and washing machine) were selected for disaggregation in the experiment.
The REDD dataset was created by Kolter and Johnson in 2011. The data for different households span 23-48 days, with appliance and mains readings recorded every 3 seconds and 1 second, respectively. Three appliances (microwave, fridge, and dishwasher) were selected for disaggregation in the experiment.
We divided the datasets in the same way as in [19], [20], [30]; the houses used for training and testing are listed in Table 1.

B. DIFFERENTIAL PROCESSING
In energy disaggregation, the instantaneous aggregate load power is taken as the observed sequence. The differential value is obtained by subtracting the aggregate power at the previous time point from the aggregate power at the current time point. The aggregate power and differential power are shown in Fig. 3. Each non-zero value in the differential signal represents a state change of the appliances. The literature [33] points out that existing neural network models that take raw data as input perform the differencing implicitly and automatically; however, there is an error between the value computed this way and the actual differential value, so relying on the neural network to perform the differential operation is inaccurate. Therefore, in this paper, we take both the raw data and its differential signal as input, so that the power variation of the target device can be extracted more easily. The differential signal is calculated as follows:
$$\Delta X_t = X_t - X_{t-1}$$
where $X_t$ indicates the aggregate power consumption at moment t, $X_{t-1}$ indicates the aggregate power consumption at moment t − 1, and $\Delta X_t$ denotes the result of the differential operation.
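A minimal Python sketch of the differencing step is given below. Setting the first differential value to zero, so that the differential series has the same length as the aggregate series, is an assumption of this sketch and is not stated in the paper.

```python
def difference(x):
    """First-order difference: dx[t] = x[t] - x[t-1].

    The first element has no predecessor and is set to 0.0 (assumed convention)
    so that the output has the same length as the input.
    """
    return [0.0] + [x[t] - x[t - 1] for t in range(1, len(x))]


# A kettle-like event: +2000 W when it switches on, -2000 W when it switches off.
aggregate = [100.0, 100.0, 2100.0, 2100.0, 100.0]
dx = difference(aggregate)
```

The switch-on and switch-off events appear as isolated positive and negative spikes in `dx`, while steady operation yields zeros.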

C. SLIDING WINDOW PROCESSING
The aggregate power and differential signals are processed using sliding windows, taking the sequence segments in the range [t − w, t] as input and the appliance energy consumption at moment t as output. The window sizes for every appliance and network are shown in Table 2.
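The construction of training samples from the two signals can be sketched as follows. The function name `make_windows`, the inclusive window [t − w, t] with w + 1 rows, and the toy data are assumptions for illustration.

```python
def make_windows(aggregate, diff, target, w):
    """Build (input, label) pairs from causal sliding windows.

    Each input is a list of w + 1 rows [aggregate[i], diff[i]] for i in [t - w, t];
    the label is the appliance power at the last time step t of the window.
    """
    samples = []
    for t in range(w, len(aggregate)):
        window = [[aggregate[i], diff[i]] for i in range(t - w, t + 1)]
        samples.append((window, target[t]))
    return samples


aggregate = [100.0, 100.0, 2100.0, 2100.0, 100.0]
diff = [0.0, 0.0, 2000.0, 0.0, -2000.0]
kettle = [0.0, 0.0, 2000.0, 2000.0, 0.0]

samples = make_windows(aggregate, diff, kettle, w=2)
first_window, first_label = samples[0]
```

Each window ends at the time step whose appliance power is being predicted, so no future samples are needed at inference time.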

D. DATA NORMALIZATION
Normalizing the data eliminates the effect of differing magnitudes between indicators. In this experiment, the original data are processed using min-max normalization to constrain the values to [0, 1]:
$$x^* = \frac{x - x_{min}}{x_{max} - x_{min}} \tag{15}$$
where $x^*$ denotes the normalized value, $x_{max}$ the maximum value, and $x_{min}$ the minimum value. Appliance activations are extracted using NILMTK, and the resulting appliance parameters are presented in Table 3.
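Min-max normalization as in (15) can be sketched in Python as below. Allowing the minimum and maximum to be passed in explicitly (for example, statistics computed on the training set) and returning zeros for a constant series are choices made for this sketch only.

```python
def min_max_normalize(values, x_min=None, x_max=None):
    """Scale values to [0, 1] using (x - x_min) / (x_max - x_min)."""
    x_min = min(values) if x_min is None else x_min
    x_max = max(values) if x_max is None else x_max
    span = x_max - x_min
    if span == 0:
        # Constant series: avoid division by zero (assumed convention).
        return [0.0 for _ in values]
    return [(v - x_min) / span for v in values]


normalized = min_max_normalize([0.0, 50.0, 200.0])
```

The inverse transform, `x = x_star * (x_max - x_min) + x_min`, converts network outputs back to watts before metrics are computed.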

E. METRICS
To compare the performance of these methods, appropriate evaluation metrics are needed. We adopted the mean absolute error (MAE), the relative error in total energy, the Energy F-score, and the total energy correctly assigned (TECA) as the energy disaggregation evaluation indices. The MAE over T time points is
$$MAE = \frac{1}{T} \sum_{t=1}^{T} \left| \hat{Y}_t - y_t \right| \tag{16}$$
where $\hat{Y}_t$ and $y_t$ denote the estimated and actual power of the appliance at moment t. The ability to accurately identify the on/off state of an appliance is another aspect of the performance of an energy disaggregation network. Once the power has been disaggregated, the on/off status can be determined by comparing it with the threshold value of the device. Four event detection evaluation indices were chosen, namely recall, precision, F1-score, and accuracy:
$$Recall = \frac{TP}{TP + FN}, \quad Precision = \frac{TP}{TP + FP}$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}, \quad Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Among these metrics, MAE, relative error in total energy, recall, precision, F1-score, and accuracy are general metrics used for classification and regression problems, while the Energy F-score and TECA are metrics proposed specifically for energy disaggregation.
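The general-purpose metrics can be computed with a few lines of Python, as sketched below. The strict `>` comparison against the on/off threshold and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch; the energy-specific metrics (relative error in total energy, Energy F-score, TECA) are not included.

```python
def mae(y_true, y_pred):
    """Mean absolute error between actual and estimated power."""
    return sum(abs(p - y) for y, p in zip(y_true, y_pred)) / len(y_true)


def event_metrics(y_true, y_pred, threshold):
    """Recall, precision, F1 and accuracy of thresholded on/off states."""
    true_on = [y > threshold for y in y_true]
    pred_on = [p > threshold for p in y_pred]
    tp = sum(t and p for t, p in zip(true_on, pred_on))
    fp = sum((not t) and p for t, p in zip(true_on, pred_on))
    fn = sum(t and (not p) for t, p in zip(true_on, pred_on))
    tn = sum((not t) and (not p) for t, p in zip(true_on, pred_on))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}


actual = [0.0, 100.0, 100.0, 0.0]
estimated = [0.0, 80.0, 0.0, 60.0]

error = mae(actual, estimated)
scores = event_metrics(actual, estimated, threshold=50.0)
```

In this toy case there is one true positive, one false positive, one false negative, and one true negative, so all four event metrics equal 0.5.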

V. RESULTS
The models that do not use future data (LSTM [30], GRU [30], Short Seq2point [30], SAED-dot [31], SAED-add [31], MC-DCNN [34]) and a model that uses future data (BERT [26]) are adopted for comparison. We report the results of the evaluation metrics for the UK-DALE and REDD datasets in Table 4, Table 5, Table 6 and Table 7. On the UK-DALE dataset, the washing machine was not tested in the same house as the other appliances, so the TECA value cannot be calculated and is not reported in Table 5. The best result is highlighted in bold in the tables. We tried to replicate the experiments in [26], [30], and [31] but could not reproduce the results they reported. To respect their work, we directly use the results given in those references and leave blank the metrics that they do not report.
In UK-DALE, for event detection performance, BERT outperforms other models on dishwasher, fridge and kettle, SAED-add performs better on microwave than other models, and the model proposed in this paper has the best performance on washing machine. For energy disaggregation performance, BERT has the smallest MAE value on microwave, washing machine and fridge, Short Seq2point has the smallest prediction error on kettle, and the method proposed in this paper has the smallest MAE value and relative error in total energy on dishwasher, and it has a higher Energy F-score than MC-DCNN.
In REDD, for event detection performance, the model proposed in this paper has the best performance on all three appliances. For energy disaggregation performance, BERT has the smallest MAE value on dishwasher and fridge, the method proposed in this paper has the smallest MAE value on dishwasher and the largest Energy F-score and TECA, and SAED-add has the smallest relative error in total energy on microwave and dishwasher.
Overall, the proposed method performs slightly worse than BERT on UK-DALE but better than the other models; it performs well on the REDD dataset, especially in event detection. As mentioned earlier, BERT performs better because it uses future data: the operating state of the equipment in the future period helps the model disaggregate more accurately. However, the proposed method has far fewer parameters than BERT and is therefore more suitable for deployment in smart meters.
We measured the inference time of each model on the REDD dataset using the time.time() function; the inference times for the whole test set and for each sample are shown in Table 9. The window size of the model input affects the total number of samples in the test set, and hence the inference time of each model on the test set. Therefore, we mainly compare the inference time per sample. Our method is much faster than LSTM, GRU and BERT, and its inference time differs only slightly from that of the remaining methods. Moreover, BERT must wait until the future data has been collected before disaggregating, so its delay is significant. In contrast, the other methods can disaggregate a sample in much less than the sampling period as soon as the data at the target moment is collected, which fully meets the real-time requirement of delivering disaggregation results with a short delay.
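A per-sample timing measurement of this kind can be sketched as follows. The paper reports using `time.time()`; this sketch swaps in `time.perf_counter()`, which is generally better suited to measuring short intervals, and takes the best of several repeats. The function name `time_per_sample` and the stand-in model are hypothetical.

```python
import time


def time_per_sample(predict, samples, repeats=3):
    """Best-of-`repeats` average wall-clock time (seconds) per sample."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for s in samples:
            predict(s)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(samples))
    return best


# Stand-in for a trained model: any callable that maps one window to an estimate.
dummy_model = sum
windows = [[1.0] * 100 for _ in range(50)]

seconds = time_per_sample(dummy_model, windows)
```

For real-time use, the relevant check is that the per-sample time stays well below the meter's sampling period.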
We further compare the three convolutional neural network-based models, Short Seq2point, MC-DCNN and the proposed method. The power disaggregation comparison results of these five target appliances for these three methods on UK-DALE is illustrated in Fig. 4. As can be seen in Fig. 4, for the dishwasher, the proposed method and MC-DCNN more accurately identify the entire operating cycle, where the proposed method more accurately estimates the time of the appliance state change; however, Short Seq2point only identifies the previous activation of the appliance. For the fridge, the estimated power of MC-DCNN and Short Seq2point fluctuate above and below the threshold value (50 watt), and the estimated power of the proposed method are much closer to the actual power values and fit much better. For the kettle, all three methods are able to identify the device activation relatively accurately, and the Short Seq2point fits slightly better. For the microwave, the predicted power value of the device operating by the method proposed in this paper is closer to the actual value than the other two methods; however, after the appliance stops working, all the methods mistakenly assume that the appliance is working again and classify the negative samples as positive, which leads to a large FP value and correspondingly Precision is very low. For the washing machine, Short Seq2point is able to predict that the device is working, but the predicted value of power is only slightly above the threshold (20 watt), which is much lower than the actual value. The predicted values of the other two methods are closer to the actual values, but the power values of MC-DCNN fluctuate drastically. It can be found that the comprehensive performance of the two methods (MC-DCNN and the proposed method) that incorporate differential power as input is superior, and we believe that the on/off state information contained in the differential signal plays a role, as we will further demonstrate in section VI. 
In addition, the number of parameters in Short Seq2point is tens and hundreds of times higher than that of the first two methods, respectively, indicating that doubling the number of filters in the single-channel model does not achieve the same results as the multi-channels model, and is also unnecessary.
In summary, the proposed approach achieves good performance on both datasets, which illustrates the generalization capability of the model structure. In terms of model capacity, it is a lightweight model that can be easily deployed on smart meters. In terms of speed, the inference speed of the model can fully meet the demand of real-time disaggregation.

VI. ABLATION STUDY
To validate the effectiveness of our proposed method, we performed a careful ablation study on the UK-DALE dataset. First, to verify that adding differential power as input enables the network to directly learn the information about device state changes and thereby improves the disaggregation performance, we took the Short Seq2point model as an example and compared its performance with (Wi/Dp) and without (Wo/Dp) the differential signal. The comparison results are shown in Fig. 5. We do not compare the Energy F-score, because it was not reported in [30].
As seen in the figure, the performance on most of the appliances is slightly improved after the addition of the differential signal, and the ability to identify switching states is improved on all appliances. This indicates that the appliance on/off information contained in the differential sequence does help to improve the disaggregation performance of the network, and that the difference operation performed by the neural network itself is inaccurate, which is consistent with the view of Yuanmeng Zhang et al. [33].
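To make the role of the differential input concrete, the following minimal sketch builds a two-channel window from an aggregate power series. It assumes a first-order difference with the first sample set to zero and a window that ends at the current time point; the helper names `differential_power` and `two_channel_window` are hypothetical, and any normalization applied in the experiments is omitted.

```python
from typing import List, Sequence, Tuple


def differential_power(aggregate: Sequence[float]) -> List[float]:
    """First-order difference d[t] = x[t] - x[t-1], with d[0] = 0 (assumed)."""
    if not aggregate:
        return []
    return [0.0] + [float(b - a) for a, b in zip(aggregate, aggregate[1:])]


def two_channel_window(
    aggregate: Sequence[float], end: int, length: int
) -> Tuple[List[float], List[float]]:
    """Return (power, difference) windows of `length` samples ending at `end`."""
    start = end - length + 1
    if length <= 0 or start < 0 or end >= len(aggregate):
        raise ValueError("window does not fit inside the series")
    diff = differential_power(aggregate)
    power = [float(x) for x in aggregate[start:end + 1]]
    return power, diff[start:end + 1]


# Hypothetical aggregate trace in watts: a 2 kW appliance switches on and off.
aggregate_w = [100, 100, 2100, 2100, 100]
power_ch, diff_ch = two_channel_window(aggregate_w, end=4, length=4)
print(power_ch)  # [100.0, 2100.0, 2100.0, 100.0]
print(diff_ch)   # [0.0, 2000.0, 0.0, -2000.0]
```

The switching events appear as isolated spikes of opposite sign in the difference channel, which is the on/off information that the network would otherwise have to derive from the raw power channel by itself.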
Moreover, comparing the performance of the Short Seq2point model with added differential signals, MC-DCNN and the proposed method on UK-DALE, it can be found that the latter two models, both of which have a multi-channels structure, perform better, which indicates that it is necessary and effective to process the two signals separately.
Then, we removed one part at a time from the model structure of the proposed method. We refer to the models with different parts removed by the following names:
• Wo/Fusion: The model without feature fusion, which fuses the information of different lengths of time.
• Wo/AR: The model without the autoregressive model (AR).
The comparison results are shown in Fig. 6. From these results it is clear that:
• The disaggregation performance of the proposed method is better than that of the other two models on most appliances, and its ability to identify on/off states is the best on all appliances.
• Removing the AR component from the complete model (Wo/AR) results in a degradation of disaggregation performance on most appliances. This shows that the AR component plays a key role overall.
• Not fusing information of different time lengths (Wo/Fusion) leads to worse results on most appliances, which demonstrates the importance of learning patterns of different time lengths and shows that the feature fusion part does allow the model to recover some of the information lost due to the pooling layer.
Overall, this ablation study clearly demonstrates the necessity of adding differential information as an input of the network, as well as the effectiveness of our model design, with all components contributing to the performance of the model.
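The effect of removing the linear component (Wo/AR) can be illustrated with a toy sketch of a model whose output is the sum of a nonlinear part and an autoregressive part. This is not the network used in the experiments: `nonlinear_part` is a hypothetical saturating stand-in for the convolutional branch, and the AR weights are arbitrary illustrative values.

```python
import math
from typing import Sequence


def nonlinear_part(window: Sequence[float]) -> float:
    """Stand-in for the CNN branch: a saturating function of the window mean."""
    return math.tanh(sum(window) / len(window))


def ar_part(window: Sequence[float], weights: Sequence[float],
            bias: float = 0.0) -> float:
    """Linear AR term over the most recent len(weights) samples."""
    p = len(weights)
    if len(window) < p:
        raise ValueError("window is shorter than the AR order")
    recent = window[-p:]
    return sum(w * x for w, x in zip(weights, recent)) + bias


def predict(window: Sequence[float], weights: Sequence[float],
            use_ar: bool = True) -> float:
    """Sum of the nonlinear and (optionally) the linear component."""
    y = nonlinear_part(window)
    if use_ar:
        y += ar_part(window, weights)
    return y


window = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in window]   # same shape, ten times the scale
weights = [0.1, 0.2, 0.3]             # hypothetical AR coefficients

# Without the AR term the saturated output barely reacts to the input scale.
print(predict(window, weights, use_ar=False), predict(scaled, weights, use_ar=False))
# With the AR term the output follows the input scale linearly.
print(predict(window, weights), predict(scaled, weights))
```

In this sketch the nonlinear stand-in changes by less than 0.05 when the input is scaled tenfold, whereas the AR term grows from 1.4 to 14.0, which is the kind of scale sensitivity that the linear component is intended to restore.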

VII. CONCLUSION
In this paper, a novel lightweight model is proposed, which takes aggregate power sequences and differential power sequences as the inputs to the network, and combines multi-channels deep convolutional neural networks with integrated feature fusion and an autoregressive model as the nonlinear and linear components, respectively. Experiments on the UK-DALE and REDD datasets show that the proposed method performs well on both datasets and is able to deliver results in real time. In addition, we demonstrated the effectiveness of the proposed model architecture through an in-depth analysis.
As a next step, we intend to implement our system in a real scenario. As mentioned earlier, this is a lightweight model, so we plan to embed the model into smart meters. In addition, considering that a large amount of data needs to be collected and a new model needs to be trained when new appliances are added, we plan to store the collected data and retrain the model with the help of cloud servers.

JUN ZHANG is currently a Senior Engineer with Nari Technology Nanjing Control Systems Company Ltd., Nanjing, China. His research interests include grid dispatch and power Internet of Things sensing technology.
CHIZHI HUANG is currently an Engineer with Nari Technology Nanjing Control Systems Company Ltd., Nanjing, China. His research interests include power equipment online monitoring and power Internet of Things sensing technology.