Toward Transparent Load Disaggregation—A Framework for Quantitative Evaluation of Explainability Using Explainable AI

Load Disaggregation, or Non-intrusive Load Monitoring (NILM), refers to the process of estimating the energy consumption of individual domestic appliances from aggregated household consumption. Recently, Deep Learning (DL) approaches have seen increased adoption in the NILM community. However, DL NILM models are often treated as black-box algorithms, which introduces algorithmic transparency and explainability concerns, hindering wider adoption. Recent works have investigated the explainability of DL NILM, but they are limited to computationally expensive methods or simple classification problems. In this work, we present a methodology for explainability of regression-based DL NILM with visual explanations, using explainable AI (XAI). Two explainability levels are provided. Sequence-level explanations highlight important features of a predicted time-series sequence of interest, while point-level explanations enable visualising explanations at a point in time. To facilitate wider adoption of XAI, we define desirable properties of NILM explanations: faithfulness, robustness and effective complexity. Addressing the limitation of existing XAI NILM approaches that do not assess the quality of explanations, these desirable properties are used for quantitative evaluation of explainability. We show that the proposed framework enables better understanding of NILM outputs and helps improve design by providing a visualization strategy and rigorous evaluation of the quality of XAI methods, leading to transparency of outcomes.

I. INTRODUCTION
LOAD disaggregation, or Non-intrusive Load Monitoring (NILM), is the process of algorithmically inferring the energy consumption of individual electrical appliances from the aggregate metered power consumption of a residential building [1]. There is a growing interest in NILM deployment due to rising energy costs, energy efficiency initiatives and national smart metering roll-outs. Deep learning based implementations for NILM have grown sharply over the past few years, with very good performance demonstrated via domain-agnostic accuracy metrics, such as the popular Mean Absolute Error, across a wide range of real-world datasets [2]. However, using accuracy metrics as a standalone determinant for selection of an AI technology is inadequate for wider consumer adoption, as put forth in [3] and [4]. The latter recommends that, in order to ensure Trustworthy AI, robustness, fairness, transparency, and privacy need to be addressed. Indeed, the European Commission has recently published seven principles of Trustworthy AI [5], which include transparency as one of the key elements of trustworthy AI systems. Transparency is closely linked to traceability of the datasets, as well as explainability of the technical processes of the AI system and the related AI decisions, and finally communication of the AI system's level of accuracy and limitations to the end-users and system developers.
For AI-based NILM, the majority of work has focused on addressing technical robustness in the form of accuracy, reliability and reproducibility across different datasets [2], [6], [7] and data transparency through the use of public, peer-reviewed and well-documented datasets [8], [9], with limited research in the area of privacy protection [10], [11], [12] and technical explainability [13], [14], [15]. The majority of deep learning-based NILM approaches are designed as "black-box" systems due to their inherent algorithmic complexity and absence of explainability. Since the underlying mechanics resulting in NILM predictions are not interpretable or explainable, deep learning (DL) based NILM cannot be fully trusted, which hinders wider deployment of NILM systems [3]. As the adoption of smart home devices and energy management systems continues to grow, so does the need to ensure these technologies are both transparent and understandable to consumers. By developing and evaluating XAI methods for NILM, the research community can contribute to the design of AI solutions that adhere to consumer standards such as the EU's vision of ethical and responsible AI [5] and foster consumer trust in these emerging technologies, empowering users to make informed decisions about their energy consumption. Furthermore, understanding the produced outputs can help improve the design, provide a better overview of the model accuracy, and facilitate better understanding of failure scenarios. Thus, the role of explainability is to ensure a transparent inference process of the AI system by providing decisions that are understood and traceable. As a result, algorithmic transparency facilitated by explainability has been identified as a paramount challenge in the present landscape of NILM research [3].
The wider problem of explainability of DL models has recently gained traction, leading to the emergence of the field of Explainable AI (XAI). Recent literature [16], [17], [18], [19], [20], [21], [22] suggests that XAI can facilitate trust by providing algorithmic transparency, support assessment of levels of bias, and improve the overall understanding of the inner workings of deep learning models. The majority of XAI work, predominantly tackling computer vision tasks, centers around the integration and development of techniques that analyse the outputs of the model and visualise the importance of the input features. Such work frequently illustrates that explainability can enhance the understanding of the model and foster trust in AI systems [21]. However, many existing XAI techniques can lead to unstable explanations in real-world scenarios due to limited, qualitative evaluation [23], [24], [25], [26]. Addressing such issues is particularly important for systems that can reveal personal information, such as the temporal appliance patterns of use generated by NILM. XAI approaches for NILM are still in their infancy, with limited literature available [13], [14], [15], [27]. As XAI-based solutions for NILM continue to grow, it is of vital importance to properly evaluate their explainability components. This assessment can serve as a way to assert that the explainability techniques used can truly be deployed in real-world scenarios and help with understanding of model outputs. Therefore, XAI system design that incorporates robust qualitative and quantitative evaluation procedures for explainability techniques used in the real-world environment is of crucial importance for successful adoption in NILM.
The main contributions of this work are summarized as follows:
• A new multi-temporal XAI visualisation technique for regression-based DL NILM, taking into account the need for different levels of visualisation granularity.
• Definition of three core properties for evaluation of an explainable NILM system: faithfulness, robustness, and complexity, that quantify the quality of XAI NILM visualisations with respect to the ability to identify important features of the signal, deal with noisy inputs, and be human understandable, respectively.
• Demonstration that the proposed approach can provide visualisations and quantify well the quality of XAI NILM systems using two public, well documented datasets and five XAI approaches.
The rest of the paper is organised as follows. A detailed literature review is presented in Section II to position our contributions with respect to the state-of-the-art. The proposed explainability framework is described in Section III, followed by the experimental results and key findings in Section IV, before we conclude in Section V.

A. NILM Problem Formulation
Let $y = (y_1, y_2, \ldots, y_T)$ be a sequence of aggregated power consumption from M appliances, captured at time $t = \{1, 2, \ldots, T\}$. Given a measurement of aggregate power $y(t)$, the goal of a NILM algorithm is to determine the individual power contribution $x_i(t)$ of appliance $i \in \{1, 2, \ldots, M\}$, such that the aggregate can be represented as:
$$y(t) = \sum_{i=1}^{M} x_i(t) + n(t), \tag{1}$$
where $n(t)$ denotes noise caused by the measuring equipment and unknown appliances contributing to the aggregate. NILM can be treated as a regression problem if the task is to directly infer $x_i(t)$ based on the aggregate signal $y(t)$. On the other hand, it can be regarded as a binary classification problem if the task is to determine the on/off state of appliance i at time t, based on the aggregate signal $y(t)$. Formulated in this manner, NILM can be solved with a range of supervised and unsupervised approaches and eliminates the need for appliance submetering, leading to a reduction in costs [28], while still enabling a diverse set of applications such as energy usage feedback [29], anomaly detection [30], and load shifting [31].
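As an illustration of the formulation above, the following minimal sketch builds a synthetic aggregate as the sum of appliance contributions plus noise, and derives the regression and classification targets for one appliance. The appliance names, power levels, on-intervals, noise level and on/off threshold are all hypothetical values chosen for the example, not taken from any dataset used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 600, 3  # number of time steps and appliances (illustrative)

# Hypothetical rated powers in watts: kettle, fridge, lights.
rated_power = np.array([2000.0, 150.0, 60.0])
# Hypothetical on-intervals [start, stop) for each appliance.
on_intervals = [(100, 130), (0, 600), (300, 500)]

# Appliance-level signals x_i(t).
x = np.zeros((M, T))
for i, (start, stop) in enumerate(on_intervals):
    x[i, start:stop] = rated_power[i]

# Noise term n(t): measurement noise and unknown loads (assumed Gaussian here).
n = rng.normal(loc=0.0, scale=5.0, size=T)

# Aggregate y(t) = sum_i x_i(t) + n(t).
y = x.sum(axis=0) + n

# Regression view: infer the appliance power itself.
regression_target = x[0]
# Classification view: infer the on/off state (threshold is an assumption).
on_threshold = 10.0
classification_target = (x[0] > on_threshold).astype(int)
```

The same aggregate `y` serves as input in both views; only the target differs.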
In terms of algorithmic approaches, CNNs are the most widely used architectures in the latest NILM literature according to the recent review of [32]. The authors of [33] use an event-driven CNN for load disaggregation of residential appliances, while [34] employ a CNN to perform unsupervised domain adaptation. However, of all CNN-based works, sequence-to-point (seq2point) learning represents one of the most cited approaches [35]. Given an input sequence of the aggregate signal, the seq2point algorithm predicts the midpoint of the output (i.e., appliance) signal instead of the whole sequence. This approach has been shown to be a better approximation of the target distribution compared to previous approaches and consequently provides advantageous predictive performance [36].
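The seq2point input/target pairing can be sketched as follows. This is a minimal illustration rather than the implementation of [35]: the helper name `seq2point_pairs` and the zero-padding at the signal edges are assumptions made for the example.

```python
import numpy as np

def seq2point_pairs(aggregate, appliance, window):
    """Pair each aggregate window with the appliance power at its midpoint.

    Zero-padding at both ends (an assumption of this sketch) makes every
    time step the midpoint of exactly one window.
    """
    if window % 2 == 0:
        raise ValueError("use an odd window so the midpoint is unambiguous")
    half = window // 2
    padded = np.pad(aggregate, (half, half), mode="constant")
    # Row k holds padded[k : k + window], whose midpoint is aggregate[k].
    inputs = np.lib.stride_tricks.sliding_window_view(padded, window)
    targets = appliance
    return inputs, targets

aggregate = np.arange(10.0)
appliance = 0.5 * aggregate
inputs, targets = seq2point_pairs(aggregate, appliance, window=5)
```

Each row of `inputs` is one model input, and the matching entry of `targets` is the single point the model is trained to predict.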

B. XAI for NILM
Algorithmic transparency in AI systems is often characterized by the clarity of decision making processes implemented by AI algorithms [37]. From an engineering perspective, works focusing on algorithmic transparency fall in the category of XAI [17], [18], [19], [20]. Despite the increased need for algorithmic transparency and extensive research in XAI, the majority of current AI systems lack the ability to provide clear explanations of how the AI model generated an output.
XAI for decision-tree based NILM was demonstrated in [27], whereby Partial Dependence Plots and Individual Conditional Expectation were used to explain the predictions of the NILM multi-class classifier by highlighting feature importance for individual appliances. However, the remaining XAI approaches for NILM focus on explainability of DL-based NILM. The first XAI approach for NILM, proposed in [13], focuses on occlusion sensitivity, and provides visual insight into important features of the prediction of a regression-based NILM AI algorithm. Explanations are generated by first occluding parts of the signal (i.e., setting them to zero) with a sliding window across the time series. Then, for each window position, the model output at a single point is calculated. The information about the resulting outputs is used to determine the importance of individual time steps and create the explanation heatmap. The sliding window is slid over the whole sequence that largely contains power levels under 500 W. However, the proposed approach suffers from issues of computational complexity due to the nature of the sliding window approach. Furthermore, occlusions that are set as zero values are rarely observed in practice due to the baseload presence, making the presented methodology exposed to potential out-of-distribution inference scenarios, which can result in unstable predictions. A comparison between a gradient-based technique, GradCAM [19], and an occlusion-sensitivity approach for visualizing the important features of a NILM classifier is examined in [14]. However, [14] uses a less challenging NILM approach based on a multi-class CNN to only determine the existence of an appliance in the input time-series, without detecting on/off states, using a single dataset. The authors of [15] propose a learning mechanism that utilizes XAI techniques for training of DL NILM models using the paradigm of knowledge distillation [38]. They explored the transfer of knowledge in the Teacher-Student scenario, identified the main inconsistency in the transfer of explainable knowledge, and proposed a modification to the knowledge distillation loss function to improve the model performance by minimizing the inconsistencies between the Teacher and Student explanations.
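To make the cost and the out-of-distribution concern of occlusion sensitivity concrete, a generic sketch is given below. It is not the implementation of [13]: the function name, the way per-step importance is averaged over covering occlusions, and the toy point predictor are assumptions made for illustration.

```python
import numpy as np

def occlusion_sensitivity(model, window, occlusion_width, fill_value=0.0):
    """Importance of a time step = mean absolute change in the point
    prediction over all occlusions that cover that step."""
    baseline = model(window)
    importance = np.zeros(len(window))
    counts = np.zeros(len(window))
    # One forward pass per occlusion position: this loop is the source of
    # the computational cost, and it must be repeated for every output point.
    for start in range(len(window) - occlusion_width + 1):
        occluded = window.copy()
        occluded[start:start + occlusion_width] = fill_value
        change = abs(baseline - model(occluded))
        importance[start:start + occlusion_width] += change
        counts[start:start + occlusion_width] += 1
    return importance / counts

# Toy stand-in for a point predictor: it only reads the window midpoint.
toy_model = lambda w: float(w[len(w) // 2])
window = np.full(11, 300.0)
importance = occlusion_sensitivity(toy_model, window, occlusion_width=3)
```

With `fill_value=0.0` the occluded input drops below any realistic baseload, which is the out-of-distribution situation noted above; passing a baseload-like value instead is one possible mitigation.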
Despite the recent advancements, there are several gaps in the literature with respect to XAI for NILM that warrant further exploration. One notable gap is the scalability and computational complexity of current XAI visualization methods for regression-based NILM. For instance, existing techniques for regression-based NILM, such as occlusion sensitivity [13], are computationally heavy, limiting their feasibility for large-scale datasets or real-time applications. Another critical gap is the lack of standardized evaluation metrics for assessing the quality and usefulness of XAI techniques in the context of NILM. All existing work in NILM relies solely on qualitative evaluation of XAI methods. However, developing a comprehensive set of benchmarks for domain-relevant aspects of explainability would enable better comparisons between different XAI approaches and facilitate the identification of best practices. Even though the aforementioned works can be considered as an entry-point towards explainability in NILM systems, to the best of our knowledge, there is still no work in NILM literature that evaluates XAI methods in a quantitative manner. This suggests a lack of rigorous evaluation of the quality of generated explanations, which is a requirement to ensure trust in the explanation outputs of XAI-based AI systems [22], [24], [25], [26], [39], [40], [41], [42], [43].

C. Explainability Methods
In this study, our focus lies on post-hoc XAI methods that aim to explain outputs of a trained DL model by assigning attribution or relevance values to each input feature. Given an input to a DL model and a target concept, attribution-based XAI aims to map the importance of each input feature to the target concept. The target concept is either a class of interest in classification tasks or an output value in regression-based problems. We refrain from using feature-based approaches such as LIME [44] and SHAP [45], due to their instability and computational complexity [24], [46]. Instead, we examine five popular methods that best exemplify the variety of algorithmic approaches contained in the field of XAI, namely GradCAM [19], GradCAM++ [47], Integrated Gradients [17], SmoothGrad [18], and LRP [20].
1) Gradient-Weighted Class Activation Mapping (GradCAM): GradCAM is an XAI technique used to create an explanation for a prediction of a target concept (e.g., a class or a signal sequence) by computing its gradient w.r.t. the final convolutional layer of a CNN [19]. In order to generate an explanation map $h^c \in \mathbb{R}^{W \times H}$ of width W and height H for a target concept c, the gradient of the output for the target concept, $y^c$, w.r.t. the k-th feature map activations $A^k$ of the last convolutional layer is computed, i.e., $\partial y^c / \partial A^k$. Next, a global average pooling operation is applied over the height and width (indexed by i and j, respectively) on the computed gradients, to obtain neuron importance weights [19]:
$$w_k^c = \frac{1}{W \times H} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^k}. \tag{2}$$
The generated weights represent the importance of feature map k for the target concept c. In order to compute the explanation map $h^c$, a weighted combination of feature map activations, followed by a ReLU function, is performed [19]:
$$h^c = \mathrm{ReLU}\left(\sum_{k} w_k^c A^k\right). \tag{3}$$
2) Improved Gradient-Weighted Class Activation Mapping (GradCAM++): GradCAM++ is an extension of the original GradCAM method that has been shown to provide better visual explanations for CNN models [47]. The main improvement lies in the calculation of the neuron importance weights, which now considers not only the first-order partial derivatives but also the second-order partial derivatives to capture higher-order interactions among feature maps. The updated neuron importance weights for the target concept c in GradCAM++ are computed as follows [47]:
$$w_k^c = \sum_{i} \sum_{j} \alpha_{ij}^{kc} \, \mathrm{ReLU}\left(\frac{\partial y^c}{\partial A_{ij}^k}\right),$$
such that the coefficients $\alpha_{ij}^{kc}$, built from partial derivatives w.r.t. $A_{ij}^k$, are as follows:
$$\alpha_{ij}^{kc} = \frac{\dfrac{\partial^2 y^c}{(\partial A_{ij}^k)^2}}{2 \dfrac{\partial^2 y^c}{(\partial A_{ij}^k)^2} + \sum_{a} \sum_{b} A_{ab}^k \dfrac{\partial^3 y^c}{(\partial A_{ij}^k)^3}},$$
where the final explanation map $h^c$ is computed as in Eq. (3). Comparing with Eq. (2) and (3), GradCAM++ reduces to GradCAM if $\forall i, j$, $\alpha_{ij}^{kc} = \frac{1}{W \times H}$. GradCAM++ has been shown to produce higher quality and more precise visual explanations compared to the GradCAM method, allowing for better interpretation of CNN models [47].
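For one-dimensional NILM signals the feature maps have a single spatial axis, so the GradCAM computation (global average pooling of gradients, then a ReLU-rectified weighted sum of activations) reduces to a few array operations. The sketch below assumes that the activations and gradients of the last convolutional layer have already been extracted by an automatic differentiation framework; the function name and the (K, W) array layout are assumptions of this example.

```python
import numpy as np

def gradcam_1d(activations, gradients):
    """GradCAM for 1-D feature maps.

    activations: array (K, W), feature maps A^k of the last conv layer.
    gradients:   array (K, W), d y^c / d A^k for the target output y^c.
    Returns a heatmap of length W.
    """
    weights = gradients.mean(axis=1)                  # pooled weights w_k^c
    cam = np.tensordot(weights, activations, axes=1)  # sum_k w_k^c A^k
    return np.maximum(cam, 0.0)                       # ReLU

activations = np.array([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]])
gradients = np.array([[2.0, 2.0, 2.0], [-1.0, -1.0, -1.0]])
heatmap = gradcam_1d(activations, gradients)
```

The heatmap has the temporal resolution of the last convolutional layer, so in practice it is usually interpolated back to the input length.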
3) Integrated Gradients (IG): IG [17] aims to generate an explanation for a prediction of a target concept via counterfactual reasoning. Absence of a cause for a certain prediction informs the generation of the importance features by creating a single baseline value used to compare the outcomes. Generally, the baseline is modeled as a space where predictions are neutral. In computer vision, this would typically be a black image, while in the case of time-series data this can be represented as absence of the signal. Formally, the explanation map $h^c \in \mathbb{R}^{W \times H}$ of width W and height H for a target concept c, considering input $x \in \mathbb{R}^{W \times H}$ and baseline value $\bar{x} \in \mathbb{R}^{W \times H}$, is created by constructing a set of interpolations along the i-th dimension between $\bar{x}$ and $x$ [17]:
$$h_i^c(x) = (x_i - \bar{x}_i) \int_{0}^{1} \frac{\partial y^c\big(\bar{x} + \alpha (x - \bar{x})\big)}{\partial x_i} \, d\alpha,$$
where $\alpha \in [0, 1]$ is the interpolation coefficient. This integral can be approximated by a finite summation of gradients at small intervals along the path from $\bar{x}$ to input $x$ [17]:
$$h_i^c(x) \approx (x_i - \bar{x}_i) \, \frac{1}{N} \sum_{n=1}^{N} \frac{\partial y^c\big(\bar{x} + \tfrac{n}{N} (x - \bar{x})\big)}{\partial x_i},$$
where N is the number of steps in the Riemann approximation of the integral.
4) SmoothGrad: Driven by the premise that instability of gradient-based explanation maps can be corrected by smoothing of a gradient with a Gaussian kernel over a large number of local perturbations, SmoothGrad calculates an average of gradients w.r.t. N alterations of the input, obtained by adding a small amount of random noise [18]. Given that the method computes gradients with respect to input x, i.e., $m^c(x) = \partial y^c / \partial x$, the explanation map $h^c$ is calculated as [18]:
$$h^c(x) = \frac{1}{N} \sum_{n=1}^{N} m^c\big(x + \mathcal{N}(0, \sigma^2)\big),$$
where $\mathcal{N}(0, \sigma^2)$ denotes zero-mean Gaussian noise with standard deviation $\sigma$. This technique aims to reduce the visual noise, and can be combined with other methods to create smoother heatmaps.
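Both IG and SmoothGrad only require a function that returns the gradient of the output w.r.t. the input. The following sketch uses a toy model with an analytic gradient so that the Riemann approximation can be checked against the property that IG attributions sum to the difference between the output at the input and at the baseline. The helper names, the toy model and the zero baseline are assumptions of this example; in practice `grad_fn` would be supplied by an automatic differentiation framework.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, n_steps=200):
    """Riemann-sum approximation of IG along the straight path baseline -> x."""
    alphas = np.arange(1, n_steps + 1) / n_steps
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

def smoothgrad(grad_fn, x, sigma, n_samples=50, seed=0):
    """Average of gradients over n_samples noisy copies of a 1-D input x."""
    rng = np.random.default_rng(seed)
    noisy = x + rng.normal(0.0, sigma, size=(n_samples, x.size))
    return np.stack([grad_fn(z) for z in noisy]).mean(axis=0)

# Toy model f(z) = sum_i w_i z_i^2 with analytic gradient 2 w z.
w = np.array([0.5, 1.0, 2.0])
f = lambda z: float(np.sum(w * z ** 2))
grad_fn = lambda z: 2.0 * w * z

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros_like(x)
ig = integrated_gradients(grad_fn, x, baseline, n_steps=1000)
sg = smoothgrad(grad_fn, x, sigma=0.1)
```

Increasing `n_steps` tightens the approximation at the cost of one gradient evaluation per step, which is the main computational trade-off of IG.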

5) Layer-Wise Relevance Propagation (LRP): LRP [20] computes the explanation heatmaps by using the layered structure of the neural network to produce relevance scores in an iterative manner. Given two consecutive layers, j and k, propagation of a relevance score R from a higher to a lower neuron is achieved by means of purposely designed local propagation rules. For example, given an input activation $a_j$ and weight $w_{jk}$ connecting neuron j to neuron k, the LRP-$\epsilon$ rule is defined as [20]:
$$R_j = \sum_{k} \frac{a_j w_{jk}}{\epsilon + \sum_{j'} a_{j'} w_{j'k}} R_k,$$
where $\epsilon$ is a regularization term: high values help stabilize the relevance scores when the contribution to the activation of neuron k is weak or unclear, leading to less noisy explanation maps.
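A single propagation step of the epsilon-stabilised rule for one dense layer can be sketched as below. This is an illustrative simplification: biases are omitted, pre-activations are assumed positive so that the stabiliser can simply be added, and the function name is hypothetical.

```python
import numpy as np

def lrp_epsilon_dense(a, W, R_out, eps=1e-6):
    """Propagate relevance through one dense layer with the epsilon rule.

    a:     (J,) input activations a_j.
    W:     (J, K) weights w_jk.
    R_out: (K,) relevance of the output neurons.
    Assumes positive pre-activations and no bias term.
    """
    z = a @ W                # z_k = sum_j a_j w_jk
    s = R_out / (z + eps)    # stabilised ratio R_k / (eps + z_k)
    return a * (W @ s)       # R_j = a_j * sum_k w_jk s_k

a = np.array([1.0, 2.0])
W = np.array([[1.0, 0.0], [1.0, 1.0]])
R_out = np.array([3.0, 2.0])
R_in = lrp_epsilon_dense(a, W, R_out)
R_in_damped = lrp_epsilon_dense(a, W, R_out, eps=1.0)
```

With a small `eps` the total relevance is approximately conserved between layers, while a larger `eps` absorbs part of it, which is the damping effect described above.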

D. Evaluation of Explainability
Traditionally, the quality of attribution-based explainability has been evaluated by qualitative, subjective assessment. This constitutes determining subjective levels of satisfaction with the usefulness of an explanation, which is evaluated by a developer or end-user of an AI system. However, driven by the need for more rigorous and objective evaluation strategies, recent advancements in the field have focused on the development of quantitative metrics for assessing the degree of the quality and trust of XAI methods.
A key challenge in evaluating XAI methods is the lack of ground truth. Given that the information about how a model generates a prediction can rarely be known a priori, efforts in evaluating the quality of explanations tend to approach the problem indirectly. Concretely, with the end goal of measuring if explanations correspond to the predictive performance of the model, [25], [39], [40] propose various methods for measuring faithfulness, based on the notion that removing or obscuring important input features should have a significant negative effect on performance, or confidence of the prediction. The degree of faithfulness is quantified by measuring the difference between the probability scores of a classifier predicting on perturbed and original input, where more faithful methods lead to larger differences in scores. Faithfulness has also been referred to as sensitivity-n [22], selectivity [41], and fidelity [42].
Unreliability of backpropagation-based XAI methods has long been an issue: [22] raises concerns that XAI methods can lead to unstable and unintelligible explanations. To mitigate the issue, sanity checks are proposed [43], comprising a set of techniques geared towards evaluating the trustworthiness of explainability methods by comparing the results of applying them to trained and randomly initialized models. Furthermore, with the aim of addressing the aforementioned issues of unreliability, the notion of the robustness of explainability methods has been suggested [39]. The findings of [39] suggest that slight changes in the input, simulating adversarial noise, could lead to dramatic differences in generated explanations, while retaining the same predictions. Driven by the need to formulate the relationship between input data and reliability of XAI methods, [39] evaluate robustness of explanation functions under slight perturbations of the input, and derive measures for determining their ability to deal with small modifications of the input. The notion of robustness has been explored in other works and referred to as sensitivity [42], continuity [39] and stability [41].
The end goal of explanations is to be understandable to humans who are interpreting them. As a result, explanations that deem all of the features as important, even if faithful, have limited utility as their interpretation might be too difficult for a human to understand. As a way of measuring the conciseness of explanations, the authors in [25] proposed a measure of complexity. The low complexity of generated explanations suggests that they highlight only the most relevant features and that understanding them does not present a difficult task. Complexity has also been presented as sparseness in [26].

III. NILM EXPLAINABILITY FRAMEWORK
The backbone of our proposed XAI framework for NILM is the visualization procedure, illustrated in Fig. 1, that facilitates the generation of human-interpretable explanations of NILM model outputs. Since the desired granularity of explanations can vary, the visualization procedure offers the ability to generate explanations for both sequence-level and point-level predictions. The sequence-level explanations highlight the areas of the signal most responsible for the prediction, while the point-level explanations display the reasoning behind a prediction of a particular point in time. These two layers of explainability can be used interchangeably as they offer varying degrees of specificity. In the visualization procedure, we utilize five distinct XAI techniques to formulate explanations. Subsequently, the created explanations are subjected to a quantitative evaluation of quality. Taking into consideration a diverse set of needs and possible deployment scenarios, the quality of an explanation is defined as alignment with three desirable properties of explanations, specifically: faithfulness, robustness, and low complexity.

A. Visualization via Heatmaps
We demonstrate how to integrate XAI in the popular seq2point DL-NILM implementation of [35] trained for load disaggregation of various appliances, via regression, on two popular datasets: UK-DALE [9] and REFIT [8]. The full procedure is illustrated in Fig. 1. First, to account for the nature of the seq2point algorithm, sliding windows are used to split the input signal into small, overlapping segments, and generate the point output predictions. Then, for a seq2point model with input size $\delta$, for each generated point along the sliding window, a point explanation heatmap of size $\delta$ is created via GradCAM, GradCAM++, IG, SmoothGrad, or LRP, as per Section II-C. If a step size of 1 is used, and the length of the activation window of interest is $\omega$, the total number of generated heatmaps is:
$$N = \omega - \delta + 1.$$
Following this procedure, we observe that a single time step along the activation window $\omega$ can receive up to $\delta$ importance scores. However, this does not hold for all points in $\omega$, in particular the ones at the edges of the window. For example, the two points at the far edges (left and right) of the activation window receive only one computed importance score. To ensure that each point along $\omega$ captures $\delta$ importance scores, we expand the activation window by $\delta - 1$ on both sides. Thus, we create a window of size:
$$\omega + 2(\delta - 1).$$
Given that the size of the activation window of interest, $\omega$, is larger than the model input size, $\delta$, to map the N resulting heatmaps to a single, sequence-level heatmap of size $\omega$, which corresponds to the activation of interest, we need to transform the results into a new representation. To create a heatmap of size $\omega$, we first generate a zero matrix of size $\omega \times (N + 2(\delta - 1))$. Each generated heatmap is added to the matrix based on its position relative to the activation of interest. For example, the first row of the matrix contains the first heatmap that is followed by zero values, acting as padding, until reaching $\omega$ samples. The first element in the second row is set to zero, followed by the second heatmap, and finally zero values afterward until reaching $\omega$ samples. This procedure is repeated until the last row.
Before populating the matrix, we apply a weight function to mitigate the presence of noise and promote smoothness of heatmaps. Given that the temporal dimension of the middle point of the input corresponds to the output point of prediction, and is highly influential to the prediction, we apply a triangular weight function to the heatmap defined as:
$$w(s) = p_{\max} - (p_{\max} - p_{\min}) \frac{|s - m|}{m}, \quad s \in \{0, 1, \ldots, 2m\},$$
where s is the position within the heatmap, m represents the middle point value, and $p_{\min}$ and $p_{\max}$ are the lowest and highest weight values, respectively. The maximum value $p_{\max}$ is placed at the middle point, while the values drop linearly in both directions when moving away from the middle point, with the lowest value $p_{\min}$ at points 0 and 2m. For the purpose of this work, the weight function holds the maximum value of 1 at the middle point, with the two furthest points holding a weight of 0.8.
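The triangular weighting described above can be sketched as follows, assuming an odd heatmap length so that the middle point is a sample index. The function name and the indexing convention (positions 0 to 2m) are choices made for this example.

```python
import numpy as np

def triangular_weights(delta, p_min=0.8, p_max=1.0):
    """Weights that peak at the middle of a length-delta heatmap and fall
    linearly to p_min at both ends (delta assumed odd and >= 3)."""
    m = (delta - 1) / 2          # middle point index
    s = np.arange(delta)         # positions 0 .. 2m
    return p_max - (p_max - p_min) * np.abs(s - m) / m

weights = triangular_weights(5)
weighted_heatmap = weights * np.ones(5)  # applied element-wise to a heatmap
```

For a heatmap of length 5 this yields weights of 0.8, 0.9, 1.0, 0.9 and 0.8, matching the values of 1 at the middle point and 0.8 at the two furthest points.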
To further reduce the noise, we aggregate the results by first sorting the matrix column-wise in descending order, corresponding to the time step in the window of interest, and then creating a vector of size $\omega$ by computing the non-zero mean value of the top 40% of values per each column of the matrix. In the last step, we transform the window to size $\omega$ by clipping the generated vector by $\delta - 1$ on both sides. Following this procedure, the importance heatmap of the target window of interest is obtained, containing the cumulative importance for each of the predicted points of the signal.
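One possible reading of the sequence-level aggregation is sketched below. Several details are assumptions of this example rather than statements about the original implementation: the matrix is laid out with one row per point heatmap over the expanded window, the top 40% is taken over all rows of a column before averaging its non-zero entries, and `explain` is a hypothetical callable that returns a point-level heatmap (optionally already weighted) for one model input.

```python
import numpy as np

def sequence_heatmap(signal, explain, delta, omega, top_frac=0.4):
    """Aggregate point-level heatmaps into one heatmap of length omega.

    signal:  expanded window of length omega + 2 * (delta - 1).
    explain: callable mapping a length-delta input to a length-delta heatmap.
    """
    width = omega + 2 * (delta - 1)
    n_maps = width - delta + 1            # windows at step size 1
    if len(signal) != width:
        raise ValueError("signal must cover the expanded window")
    matrix = np.zeros((n_maps, width))
    for k in range(n_maps):
        # Row k holds heatmap k at its temporal offset, zero-padded elsewhere.
        matrix[k, k:k + delta] = explain(signal[k:k + delta])
    ordered = -np.sort(-matrix, axis=0)   # each column in descending order
    top = ordered[:int(np.ceil(top_frac * n_maps))]
    nonzero = top != 0
    counts = np.maximum(nonzero.sum(axis=0), 1)   # avoid division by zero
    column_scores = top.sum(axis=0) / counts      # non-zero mean per column
    return column_scores[delta - 1: width - (delta - 1)]  # clip both sides

delta, omega = 5, 10
signal = np.arange(1.0, 19.0)             # length omega + 2 * (delta - 1) = 18
uniform = sequence_heatmap(signal, lambda w: np.ones(len(w)), delta, omega)
```

With a constant explanation every retained column averages to the same value, which is a quick sanity check that all points in the window of interest receive a full set of importance scores.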

B. Property of Faithfulness
The proposed faithfulness evaluation strategy quantifies the extent to which explanations attest to the predictive performance of a model. In other words, faithfulness aims to determine if the feature importance scores, generated by the visualization procedure, are indicative of importance w.r.t. prediction. Given that a ground truth explanation can rarely be known, faithfulness is measured indirectly, by observing the impact of a feature removal on the generated prediction. To measure the faithfulness of an XAI-enabled NILM approach, the following steps are taken:
1) Generate a sequence-level feature importance map of an input signal of interest, as in Section III-A.
2) Partition the sequence-level maps into sorted, non-overlapping segments based on the sum of importance scores over a certain period, to determine the most important areas of the input signal.
3) Evaluate the faithfulness of the derived explanations by performing an iterative perturbation of features by changing the input signal values in the segments of interest, starting with the segments of highest relevance. The perturbation of an input segment is performed by replacing the power level of the initial signal by the signature of low consuming appliances (e.g., a combination of TV, Lights and Fridge, equaling around 250 W). This perturbation ensures that the activation signal is attenuated, while keeping the input data distribution within the space that the model has learned on, as opposed to setting the power level to zero, which would constitute an unfavorable case of an out-of-distribution scenario.
4) To establish whether there is a significant impact on the predictive performance, after each perturbation of features we measure the difference between the performance metrics calculated on predictions of non-perturbed and perturbed signals.
5) To convey the degradation of performance, we consider both classification and regression-based performance metrics. As a way of capturing the classification performance, we convert the regression output to a step function and calculate the $F_1$ score as:
$$F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)},$$
where TP stands for True Positives, FP for False Positives, and FN for False Negatives. To quantify the disaggregation performance, we utilize the mean absolute error (MAE) between the true ($E_i$) and predicted ($\hat{E}_i$) consumed energy of the appliance of interest, where MAE is calculated as follows:
$$\mathrm{MAE} = \frac{1}{T} \sum_{i=1}^{T} \left| \hat{E}_i - E_i \right|.$$
6) After each perturbation step, compute the difference between performance metrics of altered and original input. The faithfulness score is the resulting area under curve (AUC) after a set number of iterations, where more faithful XAI methods correspond to a higher AUC score. The classification faithfulness showcases the difference in $F_1$ score values, while regression faithfulness depicts the difference in MAE values. Iterative perturbation of features that leads to a sharper increase in the difference between the $F_1$ and/or MAE scores (and thus a higher faithfulness score) suggests that the feature importance scores generated by the XAI method successfully assign scores to the highly relevant input features and are indeed indicative of the predictive performance of the model.
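The steps above can be sketched for the regression (MAE) variant as follows. This is a simplified illustration under stated assumptions: `predict` is a hypothetical callable mapping an aggregate signal to per-sample appliance estimates, segments have a fixed length, the perturbation is a constant 250 W level rather than a recorded appliance signature, and the AUC is computed with the trapezoidal rule at unit spacing.

```python
import numpy as np

def faithfulness_auc(predict, signal, target, importance,
                     segment_len, n_steps, fill_power=250.0):
    """Perturb the most important segments first and integrate the
    resulting change in MAE over the perturbation steps."""
    n_segments = len(signal) // segment_len
    usable = n_segments * segment_len
    seg_scores = importance[:usable].reshape(n_segments, segment_len).sum(axis=1)
    order = np.argsort(seg_scores)[::-1]          # highest relevance first
    base_mae = np.mean(np.abs(predict(signal) - target))
    perturbed = signal.copy()
    diffs = [0.0]
    for seg in order[:n_steps]:
        # Replace the segment with a low-consumption level (kept in-distribution).
        perturbed[seg * segment_len:(seg + 1) * segment_len] = fill_power
        mae = np.mean(np.abs(predict(perturbed) - target))
        diffs.append(abs(mae - base_mae))
    diffs = np.asarray(diffs)
    auc = float(np.sum((diffs[1:] + diffs[:-1]) / 2.0))  # trapezoidal rule
    return auc, diffs

# Toy setup: 250 W baseload with one 2 kW activation, and a toy predictor
# that simply subtracts the baseload.
predict = lambda s: np.clip(s - 250.0, 0.0, None)
signal = np.full(100, 250.0)
signal[40:60] = 2250.0
target = predict(signal)

good = np.zeros(100); good[40:60] = 1.0   # importance on the activation
bad = np.zeros(100); bad[0:20] = 1.0      # importance on the baseload
auc_good, _ = faithfulness_auc(predict, signal, target, good, 10, 2)
auc_bad, _ = faithfulness_auc(predict, signal, target, bad, 10, 2)
```

In this toy case the explanation that marks the activation yields a larger AUC than the one that marks the baseload, which is the behaviour the faithfulness score is meant to reward.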

C. Property of Robustness
The growing body of literature in deep learning theory [48] suggests that the robustness of a neural network is closely related to the value of its local Lipschitz constant. Intuitively, a Lipschitz constant bounds how much the network's output is allowed to change relative to a change in its input. It has been used as a hard constraint to enable adversarial robustness, better generalization and training of generative adversarial networks. Moreover, it has been suggested as a technique for evaluating the robustness of explanations [39]. Given a slight modification of the input, and consequently a negligibly small effect on the prediction, a robust explanation should not differ drastically from the one created from the unmodified input. We aim to investigate the (in)stability of existing XAI methods with respect to slight modifications of the household aggregate consumption signal. Given an explanation function $h(\cdot)$ and input aggregate signal $x$, we expose the signal to zero-mean Gaussian noise with controlled standard deviation $\sigma$ to create the modified input aggregate signal $\tilde{x}$. We define the local Lipschitz constant estimate as [39]:

$$\hat{L}(x) = \frac{\lVert h(x) - h(\tilde{x}) \rVert}{\lVert x - \tilde{x} \rVert + \mu} \tag{15}$$

where $\mu$ represents a small value added for numerical stability ($\mu = 10^{-6}$). For validity, we repeat this procedure $n$ times and report the averaged robustness score (RS). Methods with low Lipschitz estimates are stable in the presence of noise and should be favoured. In the context of NILM-like data, it is important to note that we assume a bounded input space, i.e., that the maximum change in the function value is finite, which can be assumed for NILM signals as the magnitude of the aggregate power signal is bounded.
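The robustness score described above can be sketched in a few lines. This is an illustration only: the `explain` callable stands in for any XAI method $h(\cdot)$, the Euclidean norm is an assumption (the norm is not fixed by the text here), and the function name and `seed` argument are hypothetical. The defaults $\sigma = 0.1$, $n = 35$ and $\mu = 10^{-6}$ follow the values reported in the paper.

```python
import numpy as np


def robustness_score(explain, x, sigma=0.1, n=35, mu=1e-6, seed=0):
    """Mean and standard deviation of n local Lipschitz estimates.

    explain: callable mapping an input window to an attribution map
             (hypothetical stand-in for an XAI method).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    h_x = explain(x)
    estimates = []
    for _ in range(n):
        # Zero-mean Gaussian noise with controlled standard deviation.
        x_tilde = x + rng.normal(0.0, sigma, size=x.shape)
        change_in_explanation = np.linalg.norm(h_x - explain(x_tilde))
        change_in_input = np.linalg.norm(x - x_tilde) + mu
        estimates.append(change_in_explanation / change_in_input)
    return float(np.mean(estimates)), float(np.std(estimates))
```

For an explanation function that simply scales its input by a constant, the estimate recovers that constant (up to the small effect of `mu`), which is a convenient sanity check before applying the routine to real attribution methods.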

D. Property of Complexity
One of the core principles of XAI is to provide human-understandable explanations. Previous studies applying XAI in the energy sector have reported mixed results when applying XAI tools to real-world energy data [49]. Yet, none of these studies have delved into the evaluation of explainability methods, particularly the complexity of explanations. We argue that this property is one of the most desirable ones, as it quantifies the entropy of the XAI output. If most of the input features are deemed important, the explanation does not provide an adequate level of clarity, which lowers its human interpretability. To measure the conciseness of the explanation output, we measure the statistical dispersion of the output map. The output map is first sorted in ascending order, and the indices of the sorted values are determined. Finally, the conciseness of explanation is formulated as a Gini index computation [26] over the sorted map, where $h_a$ is the $a$-th point in the sorted XAI output of length $\omega$, $i$ is the rank of values in ascending order, and $\kappa = 10^{-8}$ is a small value added for numerical stability. A Gini coefficient takes values in the range $[0, 1]$, with a coefficient of 0 expressing equal contribution of all features, and 1 expressing that only one feature contributes to the resulting heatmap. Evaluation of explainability is in general a two-step process: first, an explanation result is generated using an XAI method, considering the input of the model and the model itself; then, the desirable property of the explanation result is measured. In this sense, explanation sparseness points to the dispersion of the distribution of the output of the XAI method (i.e., the complexity of explanation). However, it disregards information about the complexity of the input variable. We argue that this is highly important for systems that include time-varying data, as the presence of noise is a common phenomenon, and the system's ability to deal with it is of particular interest. Consequently, explanation sparseness in the context of NILM does not reflect one of the most common challenges of working with time-series. One existing measure that captures the percentage of noise in a data sample is the noise-aggregate measure (NAR) [50]. We adapt its formula to measure the noisiness of one particular window and appliance $i$ of interest. We observe that the explanation complexity is often similar for inputs with varying degrees of noise. To establish the relationship between the complexity of an input variable and the complexity of explanation, we introduce an additional term to the explanation complexity that reflects the "noisiness" of the input. Thus, to quantify the complexity of explanation in the context of NILM, we define the "effective complexity" measure given in Eq. (19).

IV. EXPERIMENTAL RESULTS: QUALITATIVE AND QUANTITATIVE EXPLAINABILITY

A. Experimental Setup: Datasets and Model Training
For transparency, we used the most widely used [2] and well-documented REFIT [8] and UK-DALE [9] public datasets. These datasets contain real-world active power measurements obtained from residential buildings, exhibiting a realistic spectrum of appliance ownership and usage patterns. To evaluate explainability across appliance activations with different levels of power and activation periods, we focus our attention on popular multi-state and single-state appliances, namely: Washing Machine, Dishwasher, Microwave, and Kettle. The aggregate data were pre-processed using normalization with mean and standard deviation values computed from the training set. All models were trained and evaluated by reproducing the procedure outlined in [35]. Houses were chosen based on the condition that they must contain measurements of all four aforementioned appliances. For UK-DALE, we use houses 1, 3, 4, and 5 for training, while house 2 is used for testing. In the case of REFIT, houses 2, 3, 6, 11, 13, and 15 were used for training, while the test set contains data from house 5.
The explainability dataset is created by randomly sampling 30 days when appliances of interest are running and, from each chosen day, selecting a window of ω samples centered around the appliance activation. Given a dataset with a granularity of 8 seconds, ω is determined from the typical operation time of the appliance of interest. For appliances with lengthy duration, i.e., Washing Machine (WM) and Dishwasher (DW), a window length of ω = 1024 is chosen, which represents roughly 2 hours and 15 minutes of measurements, in line with the average length of a duty cycle of most WM and DW devices. For the Microwave (MW), ω = 80 was chosen, which corresponds to around 10 minutes. Finally, for the Kettle (KT), ω is set to 40, corresponding to around 5 minutes. If the activation of interest is longer than ω samples, the first ω data samples are selected.
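The window selection can be illustrated with a short sketch. The window lengths and the 8-second sampling period come from the text. The function names, the use of sample indices for the activation boundaries, and the clipping at the signal edges are assumptions added for illustration.

```python
import numpy as np

SAMPLE_PERIOD_S = 8  # dataset granularity stated in the text
WINDOW_LENGTH = {"WM": 1024, "DW": 1024, "MW": 80, "KT": 40}


def window_duration_minutes(omega, sample_period_s=SAMPLE_PERIOD_S):
    """Duration in minutes covered by a window of omega samples."""
    return omega * sample_period_s / 60.0


def extract_window(aggregate, act_start, act_end, omega):
    """Select omega samples around an activation spanning [act_start, act_end).

    Activations longer than omega keep their first omega samples; shorter
    ones are centred in the window (edge clipping is an assumption).
    """
    aggregate = np.asarray(aggregate, dtype=float)
    if len(aggregate) < omega:
        raise ValueError("signal is shorter than the requested window")
    if act_end - act_start > omega:
        start = act_start
    else:
        centre = (act_start + act_end) // 2
        start = centre - omega // 2
    start = int(np.clip(start, 0, len(aggregate) - omega))
    return aggregate[start:start + omega]
```

For example, `window_duration_minutes(1024)` gives about 136.5 minutes, consistent with the "roughly 2 hours and 15 minutes" quoted for WM and DW, while 80 and 40 samples give about 10.7 and 5.3 minutes.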

B. Interpretation of Faithfulness, Robustness and Complexity Scores
Faithfulness is of particular importance to an algorithm designer, as it facilitates understanding of how feature importance scores influence the prediction. Conversely, robustness provides an indication of the change in prediction if the input to the DL model changed slightly (e.g., due to appliance model fluctuations, appliance settings and influence of unknown appliances), which is a crucial indicator of scalability. Finally, complexity reflects the human comprehensibility of the visualization. The relative significance of each score is determined by the use-case, i.e., which property is most desirable to an algorithm designer, system developer, consumer or technology enthusiast. Explainability scores (see Sections III-B to III-D) obtained for four different appliances are presented in Tables I and II, for the UK-DALE and REFIT datasets, respectively.

Fig. 2. Explanations generated for positive activation of dishwasher in UK-DALE dataset. We can observe unreliable results from GradCAM, while other methods offer more accurate and concise explanations.

TABLE I COMPARISON OF EXPLAINABILITY AND PREDICTIVE PERFORMANCE OF SEQ2POINT MODEL FOR UK-DALE DATASET

TABLE II COMPARISON OF EXPLAINABILITY AND PREDICTIVE PERFORMANCE OF SEQ2POINT MODEL FOR REFIT DATASET

Regression (R) and Classification (C) scores are calculated as the AUC for MAE and $F_1$ scores, respectively, as described in Section III-B. For long-duration appliances (WM and DW), we perform 75 perturbation steps, while for MW and KT we perform 10 and 5 steps, respectively. To calculate the sorted, non-overlapping segments of importance (as per Section III-B), we choose segments containing 40 s of measurements for appliances with a long activation period, and segments containing 24 s of measurements for the other appliances. A high faithfulness score indicates that the explainability method is able to correctly identify the important features of the input signal, thus leading to a large drop in prediction accuracy after perturbation. The robustness score is calculated as the mean and standard deviation of n = 35 computations of the Lipschitz constant estimate, defined in Eq. (15), where the mean and standard deviation σ of the Gaussian noise are 0 and 0.1, respectively. A low robustness score indicates the ability of the explainability method to deal with noise. The effective complexity is calculated as per Eq. (19). A high effective complexity suggests that the explainability method is able to generate explanations that are concise and human understandable.
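To make the conciseness term of Section III-D concrete, the following sketch computes a Gini index over a sorted attribution map. It is a hedged illustration: it uses the standard Gini formula, takes absolute attribution values, and places κ in the denominator, all of which are assumptions rather than details confirmed by the text. It covers only the Gini term and not the noise-dependent term of the effective complexity in Eq. (19).

```python
import numpy as np


def gini_conciseness(attribution, kappa=1e-8):
    """Gini index of an attribution map (0: uniform, near 1: one feature).

    Assumptions: absolute attribution values are used, and kappa guards
    the denominator against an all-zero map.
    """
    h = np.sort(np.abs(np.asarray(attribution, dtype=float)).ravel())
    omega = h.size
    ranks = np.arange(1, omega + 1)  # rank of each value in ascending order
    numerator = np.sum((2 * ranks - omega - 1) * h)
    return float(numerator / (omega * np.sum(h) + kappa))
```

A uniform map gives 0, while a map with a single non-zero entry gives (ω − 1)/ω, which approaches 1 for long windows such as ω = 1024.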
Tables I and II suggest that LRP-ε achieved the most success across the proposed properties that explainable NILM systems based on sequence-to-point learning should satisfy. This can largely be attributed to its ability to deal with gradient noise as the relevance is propagated through the layers of the network. We report a strong relationship between the choice of the ε parameter and the resulting performance metrics, where the value of ε should be guided by the noisiness of the dataset. As the REFIT dataset is known to be significantly noisier than UK-DALE, we set ε to a larger value for REFIT (ε = 1) than for UK-DALE (ε = 0.1). Contrary to previous studies in the energy sector that recommended GradCAM as the best XAI method [49], our analysis indicates that GradCAM is not the ideal XAI approach for time-series NILM applications employing sequence-to-point architectures. Notably, GradCAM's faithfulness scores for dishwashers were significantly lower compared to other methods, implying an inability to identify crucial signal features. This observation is further supported by Fig. 2 and the results for the noisier REFIT dataset in Table II, where faithfulness scores for both washing machines (WM) and dishwashers (DW) were unsatisfactory. In an attempt to improve the score, we explored the guided gradient technique used for GradCAM, but our findings point to further degradation of performance. On the other hand, our findings reveal that the GradCAM++ method does outperform the original GradCAM, achieving better faithfulness and robustness. However, while the results demonstrate significant enhancements of GradCAM++ over GradCAM in these two aspects, the complexity of explanations generated by GradCAM++ is observed to be less than ideal. This finding suggests that the enhancements in faithfulness and robustness of GradCAM++ may come at the cost of increased complexity. Intriguingly, IG exhibited excellent performance for the complex signals (i.e., WM and DW) within the REFIT dataset. This implies that a zero signal is an appropriate choice for the baseline value of the IG algorithm for NILM-like data. Meanwhile, SmoothGrad (SG) produced robust results across most scenarios due to the nature of the algorithm, which averages gradients over noise-perturbed copies of the input.
We acknowledge certain limitations in our work that necessitate further exploration. A primary constraint of the proposed evaluation framework is its inability to present specific steps for enhancing the effectiveness of explainability techniques. Nonetheless, our approach facilitates the comparison of various XAI methods, which remains valuable for identifying their strengths and weaknesses and guiding future research and development efforts. Furthermore, a crucial aspect involves examining the relationship and trade-offs between faithfulness, robustness, and complexity in XAI for NILM systems. Striking a balance among these metrics is vital for ensuring the utility, transparency, and, ultimately, trust in XAI NILM systems. Additionally, a key assumption of the XAI methods used in this work is feature independence, which is a well-known issue in the field of XAI. To mitigate this, a new field of causal discovery has emerged; however, the field is in its infancy and its practical usefulness is still limited. Another assumption is related to the robustness measure, where we assume continuity, i.e., that small changes in the input (through the introduction of Gaussian noise) will lead to small changes in the output explanation. Furthermore, to calculate the robustness score, we assume a bounded input space, i.e., that the maximum change in the explanation function is finite, which can be assumed for NILM signals as the aggregate function is bounded.

C. Visualisation via Heatmaps
The proposed approach enables two levels of explainability. On one hand, point-level explainability provides visual understanding of how a prediction of a single time step was made. It is specific to a point of reference. On the other hand, the visualization algorithm generates another, sequence-level explanation, showcasing the aggregate importance of the input signal for the prediction of the output, and acting as a more general representation of the importance. Point-level explanation is preferred to illuminate the features that have contributed to an individual point of the prediction, especially if that point prediction is an outlier. Sequence-level explanations are more appropriate when trying to comprehend the decision on inference of a complete appliance duty cycle, such as why a time-series sequence was predicted as a Washing Machine.
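One way to picture the relationship between the two levels is to accumulate point-level attribution maps from a sliding sequence-to-point window onto a common time axis. The sketch below does this by averaging overlapping contributions. The averaging rule, the function name and the `stride` parameter are assumptions made for illustration and are not claimed to match the aggregation used in the paper.

```python
import numpy as np


def sequence_level_map(point_maps, stride=1):
    """Combine point-level attribution maps into one sequence-level map.

    point_maps: array of shape (n_points, window_len); row k explains the
                prediction made from the input window starting at k * stride.
    Overlapping contributions are averaged (an illustrative assumption).
    """
    point_maps = np.asarray(point_maps, dtype=float)
    n_points, window_len = point_maps.shape
    total_len = (n_points - 1) * stride + window_len
    accumulated = np.zeros(total_len)
    counts = np.zeros(total_len)
    for k in range(n_points):
        start = k * stride
        accumulated[start:start + window_len] += point_maps[k]
        counts[start:start + window_len] += 1
    # Positions not covered by any window keep a value of zero.
    return accumulated / np.maximum(counts, 1)
```

A single row of `point_maps` corresponds to a point-level heatmap, while the returned vector plays the role of the more general sequence-level view.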
Our visualization approach offers several advantages over the previously proposed methods. We tackle the more challenging regression scenario for the NILM problem compared to earlier work, which utilized a multi-class CNN for the simpler task of detecting appliance presence without recognizing on/off states [14]. Moreover, our method has been validated on two real-world datasets (REFIT and UK-DALE), demonstrating its adaptability and generalizability across diverse contexts. Unlike previous work that relied on a single dataset, our approach handles varied energy consumption patterns and appliance configurations, supporting its practicality and resilience. In comparison to the regression-based XAI visualization method in NILM [13], our approach is more computationally efficient, as gradient-based methods require fewer iterations and calculations than occlusion sensitivity, making them well-suited for real-time applications and large-scale datasets. Additionally, our approach avoids the introduction of out-of-distribution scenarios caused by setting parts of the input signal to zero, so that the generated explanations are more faithful to the model's behavior. A key strength of our method lies in its ability to provide multi-temporal explanations, offering insights into both local and global patterns at various levels of granularity, such as point-level and sequence-level explanations. This enhanced interpretability facilitates a better understanding of the NILM model's decision-making process and allows users to make more informed decisions based on the model's output. Furthermore, the gradient-based XAI methods can be applied to a wider range of DL-based NILM algorithms.
Fig. 2 provides an example of point explanations for a Dishwasher signal prediction from the UK-DALE dataset. This is a true positive prediction where the primary features contributing to the prediction of the middle point (marked with a blue pentagon) are displayed in the form of a heatmap. We observe that most XAI methods highlight the true positive part of the input signal. However, different XAI methods produce varying heatmap visualizations, underscoring the necessity for their quantitative quality evaluation. Comparing the results in Fig. 2 with the results displayed in Table I, LRP and SmoothGrad indeed showcase the best performance. We observe that both heatmaps highlight the truly important parts of the signal, suggesting high faithfulness, and that the explanations are concise, pointing to low complexity. On the other hand, GradCAM shows the lowest faithfulness score, which can be observed from Fig. 2, as the GradCAM explanation highlights an area that is not related to high activity of the dishwasher signal, suggesting a case of instability. To a smaller extent, this phenomenon is also observed in the case of IG. While the localization of feature importance scores in GradCAM++ improved compared to GradCAM, we observe a higher complexity of the generated explanation. Compared to LRP and SmoothGrad, the explanation heatmaps of GradCAM, GradCAM++, and IG cover a larger area of the input signal and are of noticeably higher complexity, a finding that is reinforced by the complexity evaluation scores. Another scenario, showcasing the mechanism behind a false positive prediction of a NILM DL model, is presented in Fig. 3. In this example, a DW signature is incorrectly predicted as WM. We observe that the general, sequence-level explanation (top) enables us to assign importance scores to the areas of the signal that the network deemed indicative of a WM duty cycle. Looking further, the point-level explanations (a and b) enable us to understand that the DL model recognizes that there may be multiple cycles in a typical WM signature, which is supported by the high importance scores assigned to past signal spikes that look similar to a WM duty cycle. This can help the algorithm designer to improve the training and tuning process or adopt a multi-classification approach to better distinguish these multi-state appliances with similar power level, duty cycle and duration.

V. CONCLUSION AND FUTURE WORK
This paper proposes a methodology for determining the explainability of a time-series deep neural network regression non-intrusive load monitoring (NILM) problem. Specifically, we propose a heatmap-based visualization approach that integrates XAI methods into DL NILM, and we quantify explainability via faithfulness, robustness and complexity scores. As a way of overcoming the problem of transparency inherent to DL algorithms, the proposed approach provides a dual mode of explainability, one at a general, sequence level, and the other at a specific, point level. Both levels of explainability can be used interchangeably based on the use case, as they provide varying degrees of specificity, i.e., they can deal with different scenarios when the decisions of NILM systems are unclear or difficult to explain. We show that this can be achieved without changing the architecture of the model. Furthermore, we define the core properties that should be considered when designing explainable NILM systems, and provide a strategy for quantitative evaluation of their explainability. We show that XAI methods, such as LRP, that have an inherent ability to deal with noise can lead to explanations that are faithful to the performance of the model, robust to slight changes of input, and that offer unambiguous interpretation of the resulting heatmaps. The choice of the most appropriate methods should be guided by the target user of the explanation, be it a domain expert, researcher, or customer, considering the trade-off between the aforementioned properties. By using the proposed method, the diverse set of needs of various users of the system can be satisfied, while maintaining the predictive performance and facilitating trust in the NILM system deployed in a real-world scenario.
In future work, it is important to extensively explore the relationship and trade-offs between the properties of faithfulness, robustness and complexity in XAI NILM approaches. For example, a highly faithful explanation that closely reflects the model's behavior may be more complex and harder to understand. Conversely, a simpler explanation may be more accessible but less faithful to the model's true decision-making process. Similarly, there may be cases where faithful explanations are sensitive to small changes in input data, resulting in a trade-off between faithfulness and robustness. Thus, striking the right balance between the metrics of explanation quality is crucial to ensure the usefulness of the XAI system. Our research focused on applying XAI to a CNN NILM algorithm. Future studies can extend this work to other NILM algorithms, including other deep learning-based approaches, to better understand the impact on the explainability performance and the generalisability of our findings. Another possible area of research could be combining different XAI techniques to create hybrid explanations, which may offer more comprehensive insights into NILM model behavior. Additionally, as one of the challenges in deploying NILM systems is the need for real-time processing and interpretation of energy consumption data, investigating the feasibility of real-time XAI methods for NILM applications would be a valuable contribution to the field, enabling more practical and actionable insights for users. Further work might also explore the relationship between visualizations and explainability performance for multi-appliance classification and regression. Finally, this framework can be extended to other applications in the energy sector to further promote reliable and safe integration of XAI in the smart grid.

Fig. 1. Visual outline of the proposed approach showcasing the mechanism for visualization of importance at two levels of specificity, leading to point-level and sequence-level explanations for an input sequence of interest.

Fig. 3. Visual outline of the proposed approach showcasing an example of false positive prediction of washing machine for UK-DALE dataset, and the derived explanations using LRP.Two levels of explainability provide general, sequence-level (top image), and specific, point-level explanations (a and b), under a test scenario of signal incorrectly predicted as a washing machine.