Residential Power Forecasting Based on Affinity Aggregation Spectral Clustering



I. INTRODUCTION
Forecasting is widely used by power utility companies to prepare for future consumption needs and optimize scheduling [1], [2]. Understanding future consumption needs allows the utility to plan power production and, in cases where demand will exceed supply, purchase additional power from other utility companies. Conversely, if an excess in supply is predicted, a utility may want to sell the excess power to other utilities in need. Without the ability to forecast, it is impossible to institute schemes such as peak shaving or load shifting to prevent grid brownouts [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Filbert Juwono.
Additionally, accurate forecasting helps plan energy conservation programs. If households can reduce their energy consumption by an average of 14% then we can meet our COP21 Paris Climate Agreement goals for the household and commercial economic sector [4].
Most of the existing forecasting methods in the literature make predictions on the aggregate power signal directly, relying on its temporal dependence. Such methods include estimating trends [5]-[7], fuzzy logic [8]-[10], Kalman filtering [8], [11], [12], support vector regression (SVR) [13], artificial neural networks (ANN) [8], [14]-[17], autoregressive integrated moving average (ARIMA) models [14], [18]-[20], similar profiles load forecast (SPLF) [21], [22], and load fluctuation and feature importance profile clustering [1]. According to the results presented in those papers, the aggregate power profiles of a very large number of houses (i.e., a power grid) typically have regular patterns that are more amenable to forecasting, so the accuracy of the above methods grows with the number of power profiles in the aggregate signal, as stated in [23]. However, forecasting accuracy at the level of microgrids or single houses remains a challenge. The above forecasting methods are all based on predicting the aggregate power usage. An alternative strategy is to disaggregate the power signal and apply prediction to the individual components. In the econometric literature, such issues have been studied in the context of macroeconomic forecasts, and the theory shows that under certain conditions, forecasting disaggregated components is more accurate than forecasting the aggregate [24]. The results we report in the present paper confirm this observation from econometrics for the case of power forecasting at the residential or microgrid level. The reason is that a stronger temporal dependence may exist in the power signals of certain individual appliances than in the aggregate signal.
This is easily seen in the case of cyclical appliances such as refrigerators, which turn ON and OFF roughly periodically. When the power signals of different appliances are aggregated, such temporal dependence may be disrupted, hence forecasting the aggregate power profile of a whole house may be less accurate than forecasting the power usage of individual appliances. For larger populations, at the level of the grid, aggregate forecasting, such as the methods listed above, is likely the only practical solution anyway, and it seems to work well in that case.
There is an added benefit to having any kind of individual appliance-level forecasting. Utility company initiatives such as real-time demand side management rely on appliance-level information to evaluate how much energy can be saved. Then, specific opt-in requests such as unbalancing requests, load shedding, or energy pricing [25], [26] could be sent to specific customers able to meet those requests. Additionally, direct load control becomes a feasible strategy for real-time demand side management [27]. Deferrable loads (such as HVAC systems, clothes dryers, and EV charging stations) in opt-in houses can be identified through appliance-level forecasting and scheduled for automatic shutdown via utility remote control in real time.
Considering all of the above, a few residential power forecasting methods [2], [27]-[29] have been proposed recently, in which the aggregate power signal is first decomposed into individual appliance signals via non-intrusive load monitoring (NILM), then each appliance's power is forecasted separately, and finally the total power forecast is formed by aggregating the forecasted power levels of the individual appliances. These existing NILM-based forecasting methods tend to be more accurate for residential-level power forecasting than the traditional forecasting methods that use the aggregate signal [28]. However, all existing NILM-based forecasting methods [2], [27]-[29] assume individual appliance behaviour is uncorrelated with the behaviours of other appliances, which is not the case in practice [31]. As an alternative, we have recently proposed a new NILM-based forecasting method [29] based on graph spectral clustering (GSC), which incorporates appliance-level correlations using ON/OFF state duration patterns. Hereafter, we refer to the method from [29] as GSC-NILM. In the present paper, we extend our previous work in [29] as described below.
Different from [29], in the present paper we mathematically capture appliance usage patterns as time-of-day and state duration probabilities, and then use affinity aggregation spectral clustering to perform the forecasting based on those probabilities. Human behaviours that affect residential power consumption are influenced by calendar context (working days, Saturdays, holidays) [22], [32], [33] and seasonal context (winter, spring, summer, fall) [29], [34]. In addition to the seasonal context used in [29], in this paper we take both seasonal and calendar contexts into account in order to enhance the forecasting performance. Through extensive experiments we show that our proposed forecasting method offers higher accuracy than state-of-the-art forecasting methods based on GSC-NILM [29], NILM [27], ANN [14], ARIMA [14], and SPLF [22]. The main contributions of this paper are as follows:
1) We use a graph to model statistical relationships between appliances. Specifically, the ON/OFF time-of-day and state duration probabilities are used to compute graph edge weights and establish statistical relationships among appliances. This allows powerful tools from graph theory and graph signal processing to be used in appliance behaviour analysis and prediction.
2) We develop a new power forecasting method that combines NILM and affinity aggregation spectral clustering to improve residential power forecasting performance.
3) We incorporate human behaviour and environmental influence, in terms of calendar and seasonal contexts, into the proposed method in order to enhance the forecasting performance.
4) Experimentally, we examine how well the proposed forecasting method can generalize appliance behaviours from one house to another. For this purpose, we select common appliances from multiple houses, and perform training and testing on disjoint subsets of those houses.
The outline of the paper is as follows.
In Section II, we briefly review affinity aggregation spectral clustering and the NILM method used in this paper. Next, a discussion of time-of-day and state duration probabilities is presented in Section III, followed by our proposed method in Section IV. Finally, experimental results and conclusions are presented in Sections V and VI, respectively.

II. PRELIMINARIES
A. AFFINITY AGGREGATION SPECTRAL CLUSTERING
Spectral clustering is a graph-based unsupervised learning technique in which the data is embedded in a vector space of a different dimension prior to being clustered. Affinity aggregation spectral clustering (AASC) [35], [36] is an extension of spectral clustering that uses more than one similarity (or distance) metric among data points. This form fits our purposes better, because it allows us to use both time-of-day and state duration probabilities in the forecasting process. AASC consists of three steps: graph construction, spectral embedding, and clustering.

1) GRAPH CONSTRUCTION
For a given set of n data points to be clustered, a fully-connected undirected graph G = (V, E) is formed, where V is the set of nodes and E is the set of edges. The weight of an edge reflects the similarity between its two nodes as a combination of different distance metrics between them. If L distance metrics are used, there will be L affinity matrices A_1, A_2, . . . , A_L ∈ R^{n×n}, each defined as:

A_k[i, j] = exp( −D_k[i, j]^2 / (2σ_k^2) ),   (1)

where D_k[i, j] is the distance between nodes i and j in terms of the k-th distance metric. Typically, these distance metrics are chosen in an application-specific manner. A_k[i, j] = A_k[j, i] represents the ''similarity'' (or ''affinity'') between nodes i and j in terms of the k-th distance metric. Similarly to [37], in this paper we select σ_k in (1) as the standard deviation of the values in D_k. Construction of the D_k's will be presented in Section IV. Finally, the aggregate affinity matrix A is defined as:

A = Σ_{k=1}^{L} α_k A_k,   (2)

where α_k is a weight associated with affinity matrix A_k. These weights reflect the importance of the various metrics for a particular application.
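As a minimal sketch of the graph-construction step in (1)-(2), assuming the distance matrices D_k are already available as NumPy arrays (function names here are illustrative, not from the paper):

```python
import numpy as np

def affinity_matrix(D):
    """Gaussian affinity (1): A_k[i,j] = exp(-D_k[i,j]^2 / (2*sigma_k^2)),
    with sigma_k taken as the standard deviation of the entries of D_k."""
    sigma = D.std()
    return np.exp(-D**2 / (2 * sigma**2))

def aggregate_affinity(Ds, alphas=None):
    """Aggregate affinity (2): A = sum_k alpha_k * A_k."""
    if alphas is None:
        alphas = [1.0] * len(Ds)  # equal weights, as used later in the paper
    return sum(a * affinity_matrix(D) for a, D in zip(alphas, Ds))
```

With two distance metrics, as in the proposed method, `aggregate_affinity([D1, D2])` yields the matrix A used for spectral embedding.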

2) SPECTRAL EMBEDDING
The objective of this step is to map graph nodes into a vector space where high affinities translate into small Euclidean distances, so that one can run conventional clustering [35]. For this purpose, a new vector representation for each node is defined using the spectrum of the normalized Laplacian matrix L of the graph G. First, L ∈ R^{n×n} is determined as [35]:

L = W^{−1/2} A W^{−1/2},   (3)

where W is a diagonal matrix whose diagonal entries are the summations of the corresponding columns of A: W[i, i] = Σ_{j=1}^{n} A[j, i]. Then, the eigenvectors v_1, v_2, . . . , v_m corresponding to the m largest eigenvalues of L are used to construct a matrix X ∈ R^{n×m}, as X[:, i] = v_i, and each row of X is normalized to unit length to form Y:

Y[i, j] = X[i, j] / ||X[i, :]||_2,   (4)

where ||·||_2 is the ℓ2 norm. Now, the i-th row of Y, Y[i, :], represents node i in the R^m space [35], [37].
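A compact sketch of the embedding in (3)-(4), assuming a symmetric affinity matrix A with positive column sums (the function name is illustrative):

```python
import numpy as np

def spectral_embedding(A, m):
    """Map graph nodes to R^m via the normalized Laplacian (3)-(4):
    L = W^{-1/2} A W^{-1/2}, take eigenvectors of the m largest
    eigenvalues, and normalize each row to unit length."""
    w = A.sum(axis=0)                       # column sums of A (diagonal of W)
    W_inv_sqrt = np.diag(1.0 / np.sqrt(w))
    L = W_inv_sqrt @ A @ W_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)    # ascending eigenvalue order
    X = eigvecs[:, ::-1][:, :m]             # top-m eigenvectors as columns
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Y
```

Rows of the returned Y are the points in R^m that are subsequently clustered.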

3) CLUSTERING
The rows of Y are treated as points in R^m and clustered using K-means clustering. Since Y[i, :] represents node i, a cluster of nearby (in Euclidean distance) rows of Y represents a cluster of nodes with similar affinities in the original node space. The number of clusters (K) is determined based on the eigenvalue difference distribution of the matrix L as follows [38]:

K = argmax_i (λ_i − λ_{i+1}),   (5)

where λ_i denotes the i-th largest eigenvalue of L. We treat individual appliances as nodes of a fully-connected undirected graph, and the aggregate affinity matrix A represents the correlation of appliance behaviours in terms of their time-of-day and state duration probabilities. Details are given in Sections III and IV.
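The eigen-gap rule (5) and the final K-means step can be sketched as follows (a plain Lloyd's-iteration K-means is used here for self-containedness; any standard K-means implementation would do):

```python
import numpy as np

def eigengap_k(eigvals_desc):
    """Choose K via the eigen-gap heuristic (5):
    K = argmax_i (lambda_i - lambda_{i+1}), eigenvalues sorted descending."""
    vals = np.asarray(eigvals_desc)
    gaps = vals[:-1] - vals[1:]
    return int(np.argmax(gaps)) + 1

def kmeans(Y, K, iters=100, seed=0):
    """Plain Lloyd's K-means on the rows of Y; returns a label per row."""
    rng = np.random.default_rng(seed)
    centroids = Y[rng.choice(len(Y), K, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([Y[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels
```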

B. NILM METHOD
NILM is a technique used to determine how much power each appliance is using from an aggregate power signal [30]. A literature review shows a number of recent NILM methods for residential appliance identification [34], [39]-[50]. The method in [39] is based on a super-state hidden Markov model to disaggregate multi-state loads. Load event matching-based methods are proposed in [44], [45], [47] to perform load disaggregation. Moreover, several NILM algorithms based on deep neural networks were presented in [48], [49]. External weather and environmental conditions were used to improve NILM performance in [48]. Furthermore, NILM algorithms that generalize across houses were proposed in [48], [50]. In [40], a NILM algorithm has been proposed based on aided linear integer programming. A temporal multi-label classification approach is used for load disaggregation in [41]. A graph signal processing (GSP)-based approach for NILM has been proposed in [43]. Recently, we have proposed a NILM method [34], an extension of [42], that takes appliance usage patterns such as time-of-day probabilities (Section III-A) into account. Incorporating time-of-day probabilities significantly improves NILM accuracy [42]. Therefore, we utilize [34] as the NILM method in the proposed forecasting approach. A brief review of [34] is given below.
The aggregate power signal is partitioned into non-overlapping observation windows (OWs), and in each OW, five principal components are extracted using the Karhunen-Loeve Expansion (KLE). Algorithm 1 presents the main steps for a given OW. As seen in the algorithm, up to five iterations are executed, one for each principal component. In each iteration, two elimination steps are conducted, where each step removes some of the candidate sets of appliances. This is followed by the Maximum a Posteriori (MAP) estimation of the most likely set of appliances (S_MAP) that is turned ON in that OW.

Algorithm 1 Appliance Identification Algorithm From [42] for a Given OW
1: Set i = 1
2: Set execution = 1
3: Apply preprocessing;
4: while execution do
5: Take the i-th principal component;
6: Apply the two elimination steps;
7: Conduct MAP estimation and find S_MAP;
8: if P[I_S(t) = 1 | Z_i] > 0.99 OR i == 5 then
9: Set execution = 0;
10: end if
11: Set i = i + 1;
12: end while

Let Z_i be the event that the first i principal components are those observed in the first i iterations in a given OW. MAP estimation finds the most likely set of appliances S that could have led to such an event. From Bayes' rule,

P[I_S(t) = 1 | Z_i] = P[Z_i | I_S(t) = 1] P[I_S(t) = 1] / P[Z_i],   (6)

where t refers to the mid-point of the given OW. In the MAP estimation step, we consider only appliance sets S that have not been eliminated prior to the MAP step. Let us denote this family of sets as F = {S_1, S_2, . . . , S_n}. The term P[Z_i] in the denominator of (6) is common to all S ∈ F, so removing it does not affect the maximization. Finally, the MAP problem becomes finding the set S that maximizes

P[Z_i | I_S(t) = 1] P[I_S(t) = 1].   (7)

The term P[Z_i | I_S(t) = 1] is computed in the last elimination step before the MAP estimation, as in [42], while proxies for the joint probabilities P[I_S(t) = 1] are estimated following the procedure in [34] using time-of-day probabilities. Finally, the right-hand side of (7) can be computed for all sets S ∈ F, and the set that maximizes it is selected as S_MAP. Appliances in this set are predicted to be ON at time t at the power levels associated with their states. Note that multi-state appliances are decomposed into multiple ON/OFF appliances with different ON-state power levels, as described in the next section.

As discussed in [51]-[53], personal data can be inferred from disaggregated load profiles. While privacy issues are somewhat orthogonal to the main objectives of the paper, in the case of the proposed NILM-based forecasting method, we feel that privacy is not as much of a concern. For example, when forecasting the power requirements of a group of houses, the proposed method disaggregates all appliances, but does not need to know which appliance belongs to which house.
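Once the two probability terms of (7) are available for each surviving candidate set, the MAP step reduces to a simple argmax. A schematic sketch, with the candidate family F and the probability terms supplied as dictionaries (names here are illustrative, not from [34]):

```python
def map_estimate(candidates, p_z_given_on, p_on):
    """MAP step (7): pick the appliance set S in F maximizing
    P[Z_i | I_S(t) = 1] * P[I_S(t) = 1].

    candidates   : iterable of frozensets of appliance names (the family F)
    p_z_given_on : dict mapping each candidate set to the likelihood term
    p_on         : dict mapping each candidate set to the prior term
    """
    return max(candidates, key=lambda S: p_z_given_on[S] * p_on[S])
```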

III. TIME-OF-DAY AND STATE DURATION PROBABILITIES
There are three types of appliances that we encounter when disaggregating: binary state, multi-state, and continuous varying. A binary state appliance exhibits only one ON state and an OFF state (e.g., refrigerator). Multi-state appliances have an OFF state and several ON states, where each ON state represents a different power level (e.g., clothes dryer with two ON states -rotating drum and heating element). When modelling multi-state appliances, we create a separate binary state appliance for each ON state allowing us to simplify our model. Continuous varying appliances (e.g., variable speed motor) have no clear power-level state boundaries, so we first quantize them into N states following [39], and then create N −1 virtual binary state appliances, as we did for multi-state appliances. Henceforth, we use ''appliance'' to mean binary state (ON/OFF) appliance, whether it is an actual binary state appliance or a virtual binary state appliance.
A. TIME-OF-DAY PROBABILITIES
Certain appliances have a higher chance of being used at certain times of day. For example, a toaster often gets used in the morning and likely never overnight. In this paper, we utilize such time-of-day ON patterns to forecast future power demand. Specifically, we compute a time-of-day probability for a set of appliances S as follows:

P[I_S(t) = 1] = m_S(t) / N,   (8)

which is the probability of all appliances in S being ON at time t, where t is a certain time of day, N is the number of days in the training set, and m_S(t) is the number of days in the training set in which all appliances in S were turned ON at time t. Here, S could be a single appliance (S = {a_i}), a pair of appliances (S = {a_i, a_j}), or a larger group of appliances (S = {a_{i_1}, a_{i_2}, . . . , a_{i_n}}). For an illustrative example, N = 4 days are considered for training, and the typical ON/OFF status of the appliance set S on these four days is shown in Fig. 1.
According to the example, the appliance set S is ON on three out of the four training days at time t, so m_S(t) = 3, and hence P[I_S(t) = 1] = 3/4 = 0.75. Figure 2(a) shows the time-of-day probability for a desktop computer, computed from the fifty-day profiles in tracebase [54]. The figure shows that the desktop computer clearly has a distinct time-of-day probability profile.
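The estimate in (8) can be sketched as a per-time-slot average over training days, assuming the ON/OFF status of the set S is available as a days-by-time boolean matrix:

```python
import numpy as np

def time_of_day_prob(on_matrix):
    """Time-of-day probability (8): P[I_S(t) = 1] = m_S(t) / N.

    on_matrix : N x T array; on_matrix[d, t] is True when all appliances
    in S are ON at time-of-day t on training day d. Returns a length-T
    array of probabilities.
    """
    on_matrix = np.asarray(on_matrix, dtype=bool)
    N = on_matrix.shape[0]          # number of training days
    return on_matrix.sum(axis=0) / N  # m_S(t) / N for every slot t
```

For the N = 4 example of Fig. 1, a slot in which S is ON on three of the four days yields 3/4 = 0.75.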

B. STATE DURATION PROBABILITIES
When appliances run (or turn ON), they tend to run for some time. For example, a washing machine often runs for 30-60 minutes to wash a load of laundry, and an electric kettle takes a few minutes to boil a pot of water. We take advantage of such expected ON duration patterns to forecast future power demand. For a given appliance a_i, the ON duration T_{a_i} is the appliance's elapsed ON time. A set of appliances S is said to be ON if and only if all the appliances in the set are ON. The ON duration T_S of the set S is the set's elapsed ON time, during which all of its appliances are ON.
We treat state durations as random quantities whose probability distribution can be computed as:

P[T_S ≥ d] = n_S^d / n_S,   (9)

where n_S is the number of times within the training set when all appliances in S are ON, and n_S^d is the number of times that all of them are ON for at least d time units. According to the four training days shown in Fig. 1, the appliance set S is ON once per day on days 1, 3, and 4, and twice on day 2; altogether, the appliance set S is ON five times within the four training days, so n_S = 5 in this example. Further, as seen from Fig. 1, the appliance set S is ON for at least d time units three times, so n_S^d = 3, and hence P[T_S ≥ d] = 3/5 = 0.6 for this example.
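As a sketch, (9) is an empirical tail probability over the observed ON-run lengths of the set S (the function name is illustrative):

```python
def duration_prob(on_durations, d):
    """State duration probability (9): P[T_S >= d] = n_S^d / n_S,
    given the list of observed ON-run lengths of set S in training.
    """
    n_S = len(on_durations)                            # total ON runs
    n_S_d = sum(1 for run in on_durations if run >= d)  # runs lasting >= d
    return n_S_d / n_S
```

For example, five observed runs of which three last at least d time units give 3/5 = 0.6, matching the Fig. 1 example.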
The set of appliances S is said to be OFF if at least one appliance in the set is OFF. OFF durations are also treated as random quantities, with probability distribution computed as:

P[T_S ≥ d] = n_S^d / n_S,   (10)

where, in (10), T_S is the set's elapsed OFF time, n_S is the number of times within the training set when S was OFF, and n_S^d is the number of times that S was OFF for at least d time units. The elapsed OFF time T_S is measured from the time instant when the first appliance in the set S turned OFF.
The ON duration probability distribution for a desktop computer is shown in Fig. 2(b) and its OFF duration probability distribution in Fig. 2(c). Fifty-day appliance traces in tracebase [54] were used to compute these probability distributions.

IV. PROPOSED METHOD
In this section we present the three main stages of our forecasting method: appliance state identification, ON-set prediction, and aggregation. To aid with the presentation, we define the following terms: t is the current time, t p is previous time (t p < t), t f is future time (t f > t), d c = t − t p is the time elapsed between t p and t, and d f = t f − t p is the time elapsed between t p and t f (see Fig. 3).

A. APPLIANCE STATE IDENTIFICATION
Using the aggregate power signal of a house, the ON/OFF states of appliances are identified at the current time t. Next, we identify the most recent start time instant of the current state of each appliance. We use one of our NILM methods [34], briefly discussed in Section II-B, to identify the state of each appliance at any given time up to the current time t. The disaggregated appliance state information is used with the time-of-day probabilities (Section III-A) and state duration probabilities (Section III-B) to predict the ON-set at a future time t_f.

B. ON-SET PREDICTION
We define the ON-set as the set of all appliances that are ON at a particular time. We predict the ON-set at future time t f with the help of AASC, which was reviewed in Section II-A.

1) GRAPH CONSTRUCTION
First, a fully connected undirected graph G = (V, E), called an appliance graph, is formed, where V is the set of all appliances (nodes of the graph) and E is the set of edges. An example of an appliance graph is shown in Fig. 4. Edge weights are aggregate affinities between the corresponding appliances. The aggregate affinities are derived from two distance metrics that are functions of the appliances' joint time-of-day probabilities and state duration probabilities. The first distance metric is a function of the appliances' joint time-of-day probabilities at a future time t_f. Specifically, for appliances a_i and a_j, we set the distance matrix D_1 as:

D_1[i, j] = 1 − P[I_{{a_i, a_j}}(t_f) = 1],   (11)

so that a large joint time-of-day probability means a small distance.
We choose the second distance metric as a function of the appliances' joint state duration probabilities at a future time t_f. Suppose a pair of appliances S = {a_i, a_j} in the graph is currently ON. Based on the history of appliance state identifications, we can determine the most recent time instant at which these two appliances turned ON. Let t_p be that time instant. Now consider the probability of these two appliances remaining in the ON state until the future time t_f or beyond. The conditional probability of this event, given that these two appliances are currently ON, is P[T_S ≥ d_f | T_S ≥ d_c].
From the definitions of d_c and d_f, it is clear that

d_f − d_c = t_f − t > 0,   (12)

so the event {T_S ≥ d_f} implies {T_S ≥ d_c}, and the required conditional probability becomes

P[T_S ≥ d_f | T_S ≥ d_c] = P[T_S ≥ d_f] / P[T_S ≥ d_c].   (13)

The two terms on the right-hand side of (13) are found from the state duration probabilities (Section III-B). Finally, the distance D_2[i, j] between appliances a_i and a_j at time t_f is defined as

D_2[i, j] = 1 − P[T_S ≥ d_f | T_S ≥ d_c].   (14)

Having a pair of appliances S = {a_i, a_j} that is currently OFF (as defined in Section III-B) means that either a_i or a_j or both are currently OFF. If the pair turned OFF at some past time t_p, we can use the same reasoning as above to compute the conditional probability of the pair staying OFF until a future time t_f as:

P[T_S ≥ d_f | T_S ≥ d_c] = P[T_S ≥ d_f] / P[T_S ≥ d_c],   (15)

where the right-hand side probabilities are found from the OFF-state duration probabilities (Section III-B), with T_S denoting the set's elapsed OFF time as in (10). The distance D_2[i, j] between appliances a_i and a_j at future time t_f is then defined as:

D_2[i, j] = 1 − P[T_S ≥ d_f | T_S ≥ d_c].   (16)

Having obtained D_1 and D_2, the affinity matrices A_1 and A_2 are constructed from (1). Next, the aggregate affinity matrix A is obtained from (2), where we set α_1 = α_2 = 1 in this work.
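The conditional probability (13) and the resulting distance entry (14) can be sketched from the empirical run lengths of the pair, assuming these are available as a list (function names are illustrative):

```python
def cond_duration_prob(on_durations, d_c, d_f):
    """Conditional probability (13):
    P[T_S >= d_f | T_S >= d_c] = P[T_S >= d_f] / P[T_S >= d_c]."""
    def p(d):  # empirical tail probability, as in (9)
        return sum(1 for run in on_durations if run >= d) / len(on_durations)
    return p(d_f) / p(d_c)

def d2_entry(on_durations, d_c, d_f):
    """Distance entry (14): D_2[i,j] = 1 - P[T_S >= d_f | T_S >= d_c]
    for a currently-ON pair S = {a_i, a_j}."""
    return 1.0 - cond_duration_prob(on_durations, d_c, d_f)
```

The OFF-pair case of (15)-(16) is identical in form, with the OFF-run lengths substituted for the ON-run lengths.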

2) SPECTRAL EMBEDDING AND CLUSTERING
After constructing the appliance graph at future time t f , spectral embedding and clustering discussed in Section II-A are performed. The spectral representation of the graph (matrix Y in Section II-A) is obtained as in (3)-(4), where m is found by eigen-gap analysis similar to (5).
Matrix Y is a spectral representation of the appliance graph, where appliance a_i is represented as a point in R^m by the i-th row of Y, i.e. Y[i, :]. Distances (in R^m) between different rows of Y reflect ''affinities'' between appliances from matrix A, which are based on the probabilities of the corresponding appliances being ON at the same time. The smaller the Euclidean distance between two rows, the higher the affinity between the corresponding appliances, and hence the higher the chance they are ON at the same time. In essence, matrix Y allows us to use Euclidean distance-based methods such as K-means clustering [56], where we would otherwise have to use computations that involve joint probabilities. Following (5), the eigenvalue differences of the matrix L determine the number of clusters (K), where the j-th cluster represents an appliance set S_j. For each cluster, the average Euclidean distance AED_j of the vectors associated with S_j is:

AED_j = (1/N_j) Σ_{a_i ∈ S_j} ||Y[i, :] − c_j||_2,   (17)

where N_j is the number of appliances in S_j, and c_j is the centroid of the vectors representing those appliances:

c_j = (1/N_j) Σ_{a_i ∈ S_j} Y[i, :].   (18)

As discussed in [29], [34], the value of AED_j is inversely related to the joint probability of the appliances in S_j being ON at future time t_f. However, for clusters that contain a single appliance, AED_j = 0. After clustering, if any singleton clusters (e.g., S_j = {a_l}) exist, we add a dummy appliance a_l' to S_j such that S_j = {a_l, a_l'}. The distances D_1[l, l'] and D_2[l, l'] at future time t_f are then computed from the single-appliance probabilities, as in (11) and (14)-(16) with S = {a_l}:

D_1[l, l'] = 1 − P[I_{{a_l}}(t_f) = 1],   D_2[l, l'] = 1 − P[T_{{a_l}} ≥ d_f | T_{{a_l}} ≥ d_c].   (19)

Once dummy appliances have been added to singleton clusters, a new graph is constructed, a new spectral embedding is created, and new AED_j's are computed. Finally, the appliance set with the smallest AED_j is predicted to be ON at future time t_f. Our ON-set prediction method is summarized in Fig. 5.
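The cluster score (17) can be sketched directly on the rows of Y (the function name is illustrative):

```python
import numpy as np

def average_euclidean_distance(Y, members):
    """AED_j of (17): mean Euclidean distance of the rows of Y belonging
    to cluster S_j (given by the index list `members`) to their centroid c_j."""
    pts = Y[members]          # rows of Y for the appliances in S_j
    c = pts.mean(axis=0)      # centroid c_j
    return np.linalg.norm(pts - c, axis=1).mean()
```

Note that a singleton cluster yields an AED of exactly zero, which is why dummy appliances are introduced above before the final comparison.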

C. AGGREGATION
The mean power level of each appliance is obtained as in [42]. The total forecasted power level at future time t_f is the summation of the mean power levels of the appliances in the predicted ON-set. We can then scale up our method to forecast the power demand of a set of houses, or a microgrid, by summing the forecasted powers of the individual houses. A summary of the whole forecasting algorithm for a microgrid with M houses (denoted h_1, . . . , h_M), predicting the total power demand at a future time t_f, is given in Algorithm 2.

D. CALENDAR CONTEXT AND SEASONAL CONTEXT
Human behaviours that affect residential power consumption are influenced by calendar context, namely working days (WD; Monday to Friday, excluding special holidays), Saturdays (SD), and holidays (HD) [22], [32], [33], and by seasonal context (winter, spring, summer, fall) [29], [34]. For example, during a holiday, people are likely to watch TV for longer than on a working day; in summer, people are likely to use a fan for longer than in winter. In order to take such human behaviours into account, we extend the forecasting model (including the NILM method discussed in Section II-B) to include seasonal and calendar context. A regular context-free model is trained over the whole dataset, independent of season and calendar, while seasonal and calendar context-based models are trained separately for each season (winter, spring, summer, fall) and each calendar context (WD, SD, HD). When both contexts are used, we obtain fairly refined information about the likely behaviour (e.g., working days in winter, holidays in summer, etc.). As shown in the next section (Section V-B), this type of context information improves the forecasting accuracy.

V. EXPERIMENTAL RESULTS
To measure how well our forecasting method works, we compare it against the following state-of-the-art forecasting methods: GSC-NILM [29], NILM [27], ANN [14], ARIMA [14], and SPLF [22]. Experiments were run using MATLAB R2015b (without code optimization) on a 2.2 GHz MacBook Pro with an Intel Core i7 processor and 16 GB of memory. We use the Mean Absolute Percentage Error (MAPE) to measure prediction accuracy:

MAPE = (100/n) Σ_{i=1}^{n} |(y(t_i) − ŷ(t_i)) / y(t_i)|,   (20)

where n is the number of time samples, ŷ(t_i) is the forecasted power level at time t_i, and y(t_i) is the ground-truth power level at time t_i. MAPE is a widely used performance metric; however, it becomes unstable when the y(t_i) values approach zero. To supplement it, we also use the Root Mean Square Error (RMSE):

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y(t_i) − ŷ(t_i))^2 ).   (21)

Data from publicly available datasets (REDD [57], the Rainforest Automation Energy Dataset (RAE) [58], the Almanac of Minutely Power dataset version 2 (AMPds2) [55], and tracebase [54]) was used to test the performance in the four case studies described below. In order to have a consistent sampling interval across datasets, the REDD, RAE, and tracebase data was down-sampled to 1-minute intervals to match the sampling interval of AMPds2. Twenty appliances out of 122 were selected from tracebase, as shown in Table 1. For each case study, we use inputs for training and testing from one of the above datasets. In terms of training, to ''train'' a forecaster means computing the ON/OFF time-of-day and state duration probabilities in the proposed method, computing the ON/OFF duration probabilities in [29], determining the ON/OFF duration patterns of individual appliances for the method in [27], and obtaining model parameters for the two methods in [14]. Moreover, aggregate power profiles for the previous n days are used as training inputs for the ANN [14], ARIMA [14], and SPLF [22] methods, whereas individual appliance power profiles for the previous n days are used as training inputs for the NILM [27] and GSC-NILM [29] methods, and for the method proposed in this paper.
In addition, for the context-based versions of the six methods, calendar and seasonal contexts of those n days are used as training inputs for all six methods.
In terms of testing, the current aggregate power profile (i.e. power signal from the current time to one day before the current time) is used as the test input for all six forecasting methods. For context-based versions of the six methods, calendar and seasonal context corresponding to the time t f are also given as inputs for all methods.
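The two error metrics used throughout the case studies can be sketched as (standard definitions of MAPE and RMSE; function names are illustrative):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent; unstable when
    ground-truth values approach zero, as noted above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```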

A. CASE STUDY 1
A total of seven houses were used in this case study -six from REDD and one from RAE. For each REDD house, the first 26 days of active power profiles were used for the training, allowing us to use the next 30 days for testing. For the RAE house, the first 25 days were used for training and the next 38 days for testing. During training we computed the time-of-day probabilities in (8) and ON/OFF-state duration probabilities in (9)-(10) for our method. We also determined the ON/OFF duration patterns of individual appliances for the method in [27], and obtained model parameters for the two methods in [14]. For [22], the training set represents the search space of similar profiles.
We performed a 180-minute-ahead forecast in 1-minute increments with our method and the existing methods [14], [22], [27], [29]. We measured the performance of each method using the MAPE and RMSE metrics. The results are shown in Table 2, with the most accurate predictions indicated in bold. The results show that our proposed method greatly outperforms the existing methods [14], [22], [27] on all houses, with a 44-72% reduction in MAPE and a 47-74% reduction in RMSE. Moreover, GSC-NILM [29] is the next most accurate forecasting method, still better than the other existing methods. Comparing the full and simplified versions of our method, we see that using both state duration and time-of-day probabilities offers better accuracy, with MAPE reduced by 9-21% and RMSE reduced by 10-20% compared to using state duration probabilities only. MAPE is an average absolute percentage error (APE) per sample, while RMSE is the square root of the average squared error (SE) per sample; hence, MAPE is a sample average while RMSE is derived from a sample average. We ask whether the observed differences in these sample averages between the different methods are statistically significant. We use a two-sample t-test with unknown variance [59] to answer this question. Specifically, we compare the set of APE (SE) samples of the proposed method with the APE (SE) samples of the next-best method (the one with the next lowest MAPE (RMSE)) using the aforementioned t-test, with the null hypothesis that both sets of samples come from distributions with the same mean and unknown variance. All the p-values related to this case study are reported in Table 4. Typically, p < 0.05 is taken as an indication of a statistically significant difference. As seen in the table, all p-values are less than 10^-3, indicating that the errors produced by the proposed method are indeed significantly lower than those of the next-best method (and thereby of the other methods as well).
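Such a two-sample comparison can be sketched with SciPy's `ttest_ind` (with `equal_var=False`, i.e. Welch's test for unknown, possibly unequal variances); the per-sample error arrays and the function name here are illustrative:

```python
import numpy as np
from scipy import stats

def errors_differ(err_a, err_b, alpha=0.05):
    """Two-sample t-test with unknown variance on per-sample errors
    (e.g. APE or SE samples) of two forecasters.
    Returns (p_value, significant_at_alpha)."""
    _, p = stats.ttest_ind(err_a, err_b, equal_var=False)  # Welch's t-test
    return p, p < alpha
```

A small p-value (e.g. below 0.05) rejects the null hypothesis that both error samples share the same mean.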
A sample of the forecasting results for REDD House 1 is shown in Fig. 6a, where it is clear that our method tends to be more accurate than the alternative methods. The existing forecasting methods are generally able to predict upward and downward swings in power consumption; however, their prediction of power levels is less accurate, often exhibiting prediction lag or lead. Moreover, Figs. 6a and 6b indicate that relatively large errors are made (by our method, as well as the other methods) near transients. We focus on the errors made by the proposed method, as the most accurate one among the tested methods. The distribution of large errors (> 1 kW) relative to the nearest transient, for all test signals used in this case study, is shown in Fig. 7. As the figure shows, over 70% of large errors occur within a minute of a transient, and over 90% occur within 5 minutes of a transient. This shows that transients indeed represent a challenge for the forecasting methods, including ours. In terms of the average execution times shown in Table 3, the ANN forecasting method [14] runs fastest, followed by the NILM forecasting method [27], and then closely by the SPLF forecasting method [22] and our method. The ARIMA forecasting method [14] is about three times slower, on average.

B. CASE STUDY 2
Our goal in this case study is to evaluate the effectiveness of using seasonal and calendar context in power forecasting on the AMPds2 house. AMPds2 contains two years of data sampled at 1-minute intervals. We train on the first year and test on the second year. We train each method in four ways (Section IV-D): (1) context-free (CF); (2) using seasonal context (SC); (3) using calendar context (CC); and (4) using both SC and CC.
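One way the four training variants could be realized is by conditioning each model on a context key derived from the timestamp. The sketch below is our own illustration: the meteorological season boundaries and the weekday/weekend calendar grouping are hypothetical choices, not necessarily the paper's exact definitions.

```python
from datetime import date

def season_of(d):
    # Meteorological seasons (a hypothetical grouping for illustration).
    return {12: "winter", 1: "winter", 2: "winter",
            3: "spring", 4: "spring", 5: "spring",
            6: "summer", 7: "summer", 8: "summer",
            9: "autumn", 10: "autumn", 11: "autumn"}[d.month]

def calendar_of(d):
    # A simple weekday/weekend calendar context.
    return "weekend" if d.weekday() >= 5 else "weekday"

def context_key(d, use_sc, use_cc):
    """Return the training-bucket key for one of the four model variants:
    CF (neither flag), SC, CC, or SC+CC (both flags)."""
    parts = []
    if use_sc:
        parts.append(season_of(d))
    if use_cc:
        parts.append(calendar_of(d))
    return tuple(parts)  # the empty tuple means context-free
```

Training then amounts to partitioning the first year's samples by `context_key` and fitting one model per bucket; at test time, the same key selects which model produces the forecast.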
We performed a similar 180-minute-ahead forecast in 1-minute increments, as in Case Study 1. We measured the accuracy of each forecast for each CC per season from April 2013 to March 2014 in AMPds2. Results are presented in Table 5 using the MAPE and RMSE metrics. Compared against the existing methods [14], [22], [27], our method performs significantly better, with MAPE reduced by 51-73% and RMSE reduced by 49-67% using a CF model. As the results show, both SC and CC improve upon the corresponding CF model, for all forecasting methods. Specifically, CC is slightly more accurate than CF, while SC is considerably more accurate than CF. However, the best accuracy (indicated in bold) is achieved when both SC and CC are used (SC+CC in the table). Under the SC+CC model, the proposed method still outperforms all others, with MAPE reduced by 52-76% and RMSE reduced by 57-75%.

C. CASE STUDY 3
Next, we evaluate the performance of aggregated power forecasting for a larger set of houses, such as a microgrid. The datasets used previously contain data for only a small number of houses, so we need to create data for a large number of houses. Using appliance traces in tracebase, we created a large set of ''virtual'' houses. The appliances we selected from tracebase (see Table 1) have power traces longer than 120 days, allowing us to use the first 50 days for training. Each virtual house contained 12 out of 20 randomly chosen appliances. For each chosen appliance, a ten-consecutive-day power trace was randomly selected to represent that appliance's ten days of usage in the house. This allowed us to generate ten-day power traces for all 400 virtual houses.
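The virtual-house construction described above can be sketched as follows. This is a minimal sketch under stated assumptions: `traces` stands in for the tracebase appliance traces as 1-minute power readings, and the function and parameter names are our own.

```python
import random

MIN_PER_DAY = 1440  # 1-minute samples per day

def make_virtual_house(traces, rng, n_appliances=12, days=10):
    """Build one virtual house's aggregate trace: pick `n_appliances` of the
    available appliance traces and, for each, a random window of `days`
    consecutive days, then sum the windows sample by sample.
    `traces` maps appliance name -> list of 1-minute power readings."""
    chosen = rng.sample(list(traces), n_appliances)
    window = days * MIN_PER_DAY
    house = [0.0] * window
    for name in chosen:
        trace = traces[name]
        start = rng.randrange(len(trace) - window + 1)
        for i in range(window):
            house[i] += trace[start + i]  # aggregate over chosen appliances
    return house
```

Generating the full set would then be, e.g., `houses = [make_virtual_house(traces, random.Random(seed)) for seed in range(400)]`, with a distinct seed per house for reproducibility.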
Again, we performed a similar 180-minute-ahead forecast in 1-minute increments for all 400 houses. Results are reported in Table 6 using the MAPE and RMSE metrics. Two sets of results are shown for the methods in [14], [22]: (1) forecasting each house separately and summing to obtain the aggregate forecast (called ''micro-forecast''), and (2) forecasting the aggregate directly (called ''macro-forecast''). The results show that our method significantly outperforms the other methods in all cases. MAPE is reduced by 63% and RMSE by 58% compared to the best alternative, the NILM forecasting method [27].
In terms of the average execution time per sample to predict 180 minutes ahead (Table 6), there are some things to note. Given that there are 400 houses, we need to forecast each house separately and sum to get the total forecasted power. The forecasting of each house is done independently, which means that the houses can be forecast in parallel, similar to a map-reduce problem. Therefore, we record the maximum execution time for a given prediction across all 400 houses and calculate the average of these maxima over all predictions. The resulting average is reported as the average execution time⁸ in the table. Again, we found the ANN forecasting method [14] to be the fastest, but our method still has a practical and viable average execution time, and is much more accurate. ⁸There are 400 houses in Case Study 3. For the proposed method, [27], and the micro-forecast version of [14], [22], we calculate the forecasted power level for each house separately and sum them to obtain the total forecasted power. In practice, the forecast for each house can be performed in parallel. Therefore, similar to our recent work [29], we measure the maximum execution time for a given prediction (180 minutes ahead) across all 400 houses and calculate the average of these maxima over all predictions. This average is reported in the table.
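The timing protocol above can be sketched as follows. This is a minimal sequential simulation of the parallel bound: `forecast_house` is a stand-in for any per-house forecaster, and the function name is our own. Taking the maximum per-house time for each prediction gives the wall-clock time the prediction would take if all houses ran in parallel.

```python
import time

def average_parallel_time(forecast_house, houses, n_predictions):
    """Time the forecast of every house for each prediction, keep the
    per-prediction maximum (the parallel wall-clock bound), and return
    the average of these maxima over all predictions."""
    maxima = []
    for k in range(n_predictions):
        per_house = []
        for h in houses:
            t0 = time.perf_counter()
            forecast_house(h, k)
            per_house.append(time.perf_counter() - t0)
        maxima.append(max(per_house))
    return sum(maxima) / len(maxima)
```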

D. CASE STUDY 4
In this case study, we examine how well the forecasting methods generalize. We select four houses (Houses 1, 2, 3, and 5) in REDD, which have six common appliances: microwave oven, refrigerator, dishwasher, washer-dryer, lamp, and water kettle. The first 26 days of active power profiles were used for training. For the common appliances, data from one of the houses was used for training; this training house changes across test cases (Tables 7-10). The non-common appliances were trained separately in each house. The next 30 days of the remaining three houses were used for testing.
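The train/test rotation over the four REDD houses can be sketched as follows (our own illustration of the protocol; the house identifiers come from the text):

```python
HOUSES = [1, 2, 3, 5]  # REDD houses with the six common appliances

def rotation_splits(houses=HOUSES):
    """Yield (train_house, test_houses) pairs: common-appliance models are
    trained on one house and evaluated on the remaining three."""
    for train in houses:
        yield train, [h for h in houses if h != train]
```

Each pair corresponds to one of Tables 7-10: the common-appliance models come from `train_house`, while non-common appliances are trained per test house.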
Power consumption was predicted 180 minutes ahead in 1-minute steps using our method as well as the methods in [14], [22], [27]. The MAPE and RMSE results are shown in Tables 7-10 as the training house rotates among REDD Houses 1, 2, 3, and 5. The best results are indicated in bold. As seen in the tables, our method outperforms the others by a considerable margin, with MAPE reduced by 21-44% and RMSE reduced by 18-41%.
Comparing the results in Tables 7-10 with the corresponding results in Table 2, we see that the accuracy of every method has dropped. This is no surprise: the models are now asked to generalize appliance behaviour from one house to other houses. Nonetheless, the forecast accuracy of our method in this scenario is still better than that of the other methods even when they are trained and tested on data from the same house (Table 2).
Similar to Case Study 1, we performed two-sample t-tests with unknown variance for all other case studies in this section. All p-values were less than 10⁻², indicating that the forecasting errors produced by the proposed method in the above tables are significantly lower than those of the next-best method, and thereby the other methods as well.

VI. CONCLUSION
We proposed a novel forecasting method that exploits correlations among appliance usage patterns through joint state duration and time-of-day probabilities. Current and previous appliance states are identified and used to forecast the power of each appliance. The forecasts for each appliance are then summed to produce the aggregate power forecast. The proposed method is applicable to power forecasting for a single house or a group of houses, such as a microgrid. Our method was tested on four publicly available datasets and compared against five state-of-the-art forecasting methods from the literature. Superior accuracy was achieved in each case. The proposed framework allows for seasonal and calendar context-based forecasting, and the results show that both contexts clearly help improve forecasting accuracy. Further, we demonstrated that our method has better generalization ability than other methods, by training on appliance data from one house and testing on data from other houses. However, relatively large errors are made by all forecasting methods, including our own, near transients of the power signal. Therefore, in future work, we aim to extend the proposed method to improve forecasting performance near transients.

He has been a Software Engineer for over 24 years, working for various local and international industry clients. He is a registered Professional Engineer (P.Eng.) with Engineers and Geoscientists BC. His research interests include computational sustainability and the understanding of socioeconomic issues that pertain to technological advancement. He is an expert in data engineering and software engineering, and a world-renowned researcher in non-intrusive load monitoring (NILM) and disaggregation. He is currently the Vice-Chair of the IEEE Signal Processing Society Vancouver Chapter. He also serves as an Editorial Board Member of Nature's Scientific Data journal.
IVAN V. BAJIĆ (Senior Member, IEEE) is currently a Professor of Engineering Science and the Co-Director of the Multimedia Lab, Simon Fraser University, Burnaby, BC, Canada. His research interests include signal processing and machine learning with applications to multimedia processing, ergonomics, compression, and communications. He has authored about a dozen and coauthored another ten dozen publications in these fields. His articles have won awards at the IEEE ICME 2012 and the IEEE ICIP 2019, and other (e.g., top n %) recognitions at ICME, ICIP and CVPR. He