By Topic

• Abstract

SECTION I

## INTRODUCTION

By combining physical electrical systems with digital information and communication technologies, smart grids aim to provide highly efficient, intelligent, and environmentally-friendly electricity services. This ambition makes smart grids rather different from legacy power systems in many aspects such as widely used renewable energy sources, two-way electricity flow, distributed power sources, and dynamic pricing and load control. To achieve these challenging goals, a smart grid system usually contains many different functional sub-systems to form a highly complex system. As shown in Fig. 1, a smart grid system generally consists of seven major domains: bulk generation, transmission, distribution, customers, service providers, operations, and markets. Systems in bulk generation, transmission, and distribution domains focus on energy generation and delivery from power plants to customers. According to the power demand level, customers are usually classified to three categories: commercial, industrial, and residential customers. Unlike legacy power systems, customers in a smart grid are both electricity consumers and providers. They can freely sell their locally generated electricity to the system at a dynamic price determined by the market domain in real time. A service provider provides a variety of services (e.g., billing and account management) to both customers and utility companies. All these domains are connected with the operation domain for central control and management.

Fig. 1. Illustration of a typical smart grid system [1].

To achieve high efficiency and intelligence for such a large complex system, real-time power consumption statistics of the entire system are essential to all domains in estimating the current system state and predicting future conditions. Decisions are cautiously made according to the analysis of customers' power consumption data. For example, the information of fine-grained customer's power consumption is required by the market domain to determine the electricity price in real time, and the bulk generation domain also needs the information for power demand prediction.

In a legacy power system, the total power consumption of a household is collected by a utility company every two weeks or monthly. There is no data privacy concern in this situation since power consumption details are hidden in the aggregated readings. In a smart grid system, however, the high efficiency and intelligence features demand more granular power consumption data for real-time system monitoring and management as discussed before. This transformation of information granularity leads to severe privacy issues. Fig. 2 shows a customer's daily power consumption trace (PCT) that was obtained by a utility company. With simple observation, an attacker can easily find out what time and how much electricity is used in a customer's home. With advanced power signature analysis tools such as the non-intrusive load monitoring (NILM), the attacker can easily find out what types of appliances are used at any time and learn more detailed information about customer's daily activities. Examples are shown in Table I. This behavior mapping enables the derivation of a detailed profile of a customer's daily activities from his/her power consumption trace.

Fig. 2. Mapping from the power usage to personal activities [2].
TABLE I BEHAVIOR MAPPING FROM THE POWER USAGE TO DAILY ACTIVITIES

Various entities are interested in this type of sensitive information. First, thieves may exploit the information to study the behavior of a customer living in a target house. They can easily break into the house when they derive the information like “nobody is in the house tonight.” Commercial companies are also interested in the power consumption behavior of customers. For example, advertising household appliances will become more effective when sales people figure out which appliances are widely used by customers in their daily lives. Besides, a smart grid system may allow many third party companies to access customers' power consumption data to help them manage energy usage. As a result, third party applications may become another potential source of privacy abuse. Without proper protection, customer privacy and even security will be severely threatened, especially when the public network (e.g., the Internet) is adopted for data transmission in a smart grid system.

To solve this problem, previous work has focused on cryptography-based solutions, where customers' power consumption data are protected under certain cryptographic schemes. One of the common problems of this type of solutions is how to efficiently share real-time power consumption data with different functional domains. Key-sharing among different parties is a potential problem with security concern. In addition, frequent encryption and decryption operations harm system's efficiency. Another problem is related to key management. In a smart grid, there are typically millions of customers located in different regions for one utility company. The conventional public key infrastructure cannot meet the key management requirements of such a large scale network [3]. Privacy-protection schemes from other perspectives have been studied recently, yet they have issues towards practical deployment. That is, there is no efficient way to extract the statistical information of power consumption data for critical features needed for power consumption analysis, forecasting and dynamic pricing. More related work will be reviewed in Section II.

In face of the privacy issue in smart grids and drawbacks of existing solutions, we propose a distortion-based privacy-preserving metering scheme. To be more specific, we introduce random noise on purpose to distort customers' power consumption data at the smart meter in a way that data recovery becomes infeasible. Under the assumption that all other parties except smart meters are semi-trusted, the proposed scheme protects customers' data privacy to a high degree. With distorted power consumption data and prior knowledge about the added random noise, we develop an efficient algorithm for power consumption distribution reconstruction needed for power demand analysis and prediction. With this proposed algorithm, any party in a smart grid can derive complete power consumption statistics of certain geographical area without privacy invasion. To make our scheme more complete, the billing issue is also addressed. We evaluate the proposed scheme with both real world and synthesized power consumption data, and demonstrate that it is robust and efficient in practical implementations.

The rest of this paper is organized as follow. Section II discusses related work on privacy issues in smart grids. The proposed metering scheme is presented in Section III. Privacy protection and efficiency analysis of the proposed scheme is analyzed in Sections IV and V, respectively. Section VI presents experimental results and, finally, Section VII concludes this paper.

SECTION II

## RELATED WORK

The NIST Cyber Security Working Group published three-volume guidelines [4], [5], [6] on security and privacy issues of smart grids in 2010. The guidelines focus on high level requirements and potential technologies in the design of secure smart grid systems. In the same year, the importance of data privacy in smart grids was once again emphasized in the context of smart grid interoperability [7]. In 2011, IEEE published a trial-use standard [8] on a cryptographic protocol for cyber security of substation serial links. The protocol provides integrity and optional confidentiality for communications between substations. However, there is no standard in protecting customer's data privacy in smart grids up to now.

Cryptography-based privacy-preserving data aggregation schemes have been proposed for smart grids, e.g., [9], [10], [11], [12]. Garcia and Jacobs [9] introduced an advanced partially homomorphic encryption scheme to prevent the data aggregator from accessing individual readings. To address the privacy concern of billing service, Molina-Markham et al. [13] proposed a zero-knowledge-based protocol to privately derive and prove the correctness of bills. Along the similar technique, Rial and Danezis [11] proposed a cryptographic protocol for general calculations on meter readings. These studies mainly address privacy concerns from the data aggregation process or the billing service. However, none of them can solve both of them at once. Recently, Lin et al. [14] proposed a lightweight cryptographic scheme with the support of a trusted platform module for both privacy-preserving billing and load monitoring purposes. However, the load monitoring task is limited to collecting the sum of metering data over a certain period of time. Generally speaking, privacy-preserved data sharing and real-time power consumption analysis and forecasting are not well addressed in the above-mentioned work. Furthermore, cryptography-based schemes generally demand a complex key management process, including key revoking, distribution and updating, particularly for smart grids where scalability is a major issue [3].

In addition to cryptography-based schemes, several other ideas have been proposed recently to protect customer data privacy in a smart grid. Efthymiou and Kalogridis [15] introduced an anonymization scheme by assigning each smart meter two different identifiers, the real identity and the pseudonym. The real identity is used for readings associated with billing while the pseudonym for anonymous readings. The escrow, which is the only trusted third party, knows the relationship between the two identifiers. Although the scheme provides a way to hide customer's identity, the trust relies on the third party. Besides, Jawurek et al. [16] showed that basic data mining and pattern recognition techniques can break this anonymization scheme by building connections from customers' pseudonyms to their real identities. Another interesting idea is to mask fine-grained power consumption information with the help of re-chargeable batteries. Schemes proposed in [17] and [18] fall into this category. Both schemes use re-chargeable batteries to hide, smoothen or obscure the fine-grained power consumption trace by charging and discharging operations. However, battery's lifetime is short while its cost is high, which is approximately$1000 for a household [18]. Energy loss in battery storage results in higher energy consumption. Furthermore, real-time power consumption analysis and dynamic pricing are difficult due to the lack of the real original power consumption data. SECTION III ## PRIVACY-PRESERVING METERING SCHEME In this section, we propose a lightweight privacy-preserving metering scheme. The main motivation is to enable effective interoperation of power consumption data by different parties while protecting customer's privacy. Without requiring extensive cryptographic operations on metering data and extra hardware support such as batteries, our scheme is more efficient and practical as compared to previous schemes. The fundamental idea is to distort customer's metering data with random noise in such a way that data recovery is impossible while the extraction of statistic information for regional power usage is easy. In this way, distorted readings can be shared among different parties for different purposes, e.g., power consumption analysis, forecasting, dynamic pricing, and so on. ### A. System Overview and Trust Assumptions Fig. 3(a) shows the system model based on the widely accepted wireless-wired multi-layer architecture [3], [19], [20], [21]. A smart meter in each household aggregates the power consumption data from various appliances through the home area network (HAN). Before sending to a data collector of the neighborhood through a wireless mesh network, the smart meter first distorts the aggregated fine-grained power consumption data. After receiving the distorted data, the data collector located in the neighborhood transmits the data to different parties through the public network directly or to a central data center which can be accessed by those parties. Fig. 3. The system model of the proposed privacy-preserving metering scheme. The following three assumptions are made in the proposed scheme. 1. The smart meter has a security mechanism to maintain the integrity of its computing code and prevent illegal reading and manipulating of energy readings and its secrets (e.g., private keys). Furthermore, it has sufficient computing and storage capability to guarantee that the aggregation of original readings is accurate and reliable. 2. Both data collectors and other entities in the smart grid (e.g., service providers, markets, and 3rd-party companies) are semi-trusted. They follow a protocol to provide services and meet users' service requirements. On the other hand, they also attempt to retrieve the information of customer's behavior as much as possible. 3. Authentication mechanisms are adopted for data transmission among smart meters, data collectors, and other entities in the system, e.g., each smart meter is embedded with a public/private key pair. The private key is protected by the device while the public key can be trusted by the data collector or other entities. The proposed scheme consists of three parts: (1) power consumption data distortion conducted by smart meters, (2) aggregated billing conducted by the utility company, and (3) power consumption distribution reconstruction by interested parties. They are detailed below. ### B. Power Consumption Data Distortion Power consumption data distortion is implemented at the smart meter, where random Gaussian noise is added to the aggregated power consumption data of various appliances. For simplicity, we assume each electricity consumer is bound to one smart meter. For an$M$-consumer power grid, we have a group of smart meters distributed in different places. Each smart meter reports its distorted power consumption data every$T$seconds, where$T$is known as the reporting period. The sequence,$r_{i}[{0}],r_{i}[{1}],r_{i}[{2}],\ldots r_{i}[j],\ldots$represents the sample sequence of the aggregated power consumption of the$i$th customer from time$t=0$to$t=nT$. The distorted data can be expressed in form of TeX Source $$d_{i}[j]=r_{i}[j]+n_{i}[j],\quad i=1,2,\ldots, M,\quad j=0,1,2,\ldots,n,\eqno{\hbox{(1)}}$$ where$n_{i}[j]$is a sample of random Gaussian noise with zero mean and variance$\sigma_{N}^{2}$. The noise power significantly affects the performance of the proposed metering scheme, which will be discussed in detail in Section VI. Fig. 4 gives an example of a distorted power consumption trace in a single day. The red curve is the original power consumption trace while the blue one is the distorted one. It is obvious that high frequency distortions have masked a large amount of valuable information about customer's power usage. Besides, many fake events are fabricated. Without knowing the exact noise samples used for distortion, the original power consumption trace is impossible to be recovered. Since we use relatively large noise as compared to the original data, conventional de-noising techniques and long-term averaging schemes are not able to retrieve privacy-sensitive power usage patterns. We will provide mathematical analysis and experimental results in Sections IV and VI, respectively. Fig. 4. Example of a distorted power consumption trace. ### C. Aggregated Billing In order to facilitate accurate and efficient billing without privacy invasion, an aggregated billing mechanism is adopted as part of the proposed metering scheme. Specifically, we use a storage unit in the smart meter device to accumulate the electricity bill. The initial value of the accumulator is set to 0. Given dynamic electricity price,$P[j]$, at time$t=j$, the aggregated bill$AB$of the$i$-th customer can be updated after each reporting time as TeX Source $$AB_{i}[j]=AB_{i}[\,j-1\,]+r_{i}[j]\times P[j],\eqno{\hbox{(2)}}$$ where$r_{i}[j]$denotes power consumption during the time period from$t=j-1$to$t=j$. When the smart meter receives a billing query from the billing service provider, it sends back the updated$AB$and resets the storage unit. Since we assume the smart meter is securely protected against energy theft and it can authenticate itself to the billing service provider, any unauthorized party cannot access and manipulate the power consumption data in the smart side and during data transmission. Optionally, the service provider can exploit the zero-knowledge proof technique as detailed in [11], [13] to verify the correctness of the bill. ### D. Power Consumption Distribution Reconstruction Billing is not the only purpose of collecting customer's power consumption data in a smart grid. Another essential usage of the collected data is power consumption analysis and prediction for power generation and distribution. Many third party companies are interested in the statistical information of customers' power consumption for business planning and development. Being distinct from previous schemes that provide limited statistical information (e.g., only the total power consumption and the average power consumption of a neighborhood), our scheme can provide a complete distribution curve of the original power consumption for a multitude of customers. On the other hand, by only dealing with distorted power consumption data, the distribution reconstruction process does not leak any private information of a specific customer. The reconstruction process is conducted on the distorted data as given in (1). Given the density function of noise$N$and a sample sequence of distorted power consumption data at time$t=jT$, we have TeX Source $$d_{i}=r_{i}+n_{i},\quad i=1,\ldots,M,$$ where$d_{i}=d_{i}[j]$,$r_{i}=r_{i}[j]$,$n_{i}=n_{i}[j]$and$M$is the number of customers. We use$D$,$R$and$N$to denote random variables for$d_{i}[j]$,$r_{i}[j]$and$n_{i}[j]$. By adopting the data mining approach as given in [22], the conditional probability density function$f_{R\vert D}(r\vert d_{i})$can be written as TeX Source $$f_{R\vert D}(r\vert d_{i})={{f_{N}(d_{i}-r)f_{R}(r)}\over{\int^{\infty}_{\infty}f_{N}(d_{i}-z)f_{R}(z)dz}},\eqno{\hbox{(3)}}$$ where$f_{X}(x)$is the probability density function of random variable$X$. Then, one can recover the power consumption distribution,$f_{R}(r)$, from$f_{R\vert D}(r\vert d_{i})$via an iterative algorithm as summarized in Algorithm 1. ### Algorithm 1 Power Consumption Distribution Reconstruction As described in Algorithm 1, the averaged conditional probability density function,$f_{R\vert D}(r\vert d_{i})$, in (3) is used to update$f_{R}(r)$in the next round of iteration. In fact, the averaging step is equivalent to multiplying each$f_{R\vert D}(r\vert d_{i})$with weight$f_{D}(d_{i}). Thus, we have TeX Source \eqalignno{{\mathtilde{f}}_{R}(r)=&\,{{1}\over{M}}\sum_{i=1}^{M}{{f_{N}(d_{i}-r)f_{R}(r)}\over{\int^{\infty}_{\infty}f_{N}(d_{i}-z)f_{R}(z)dz}}\cr=&\,\sum_{i=1}^{M}f_{R\vert D}(r\vert d_{i})f_{D}(d_{i})\cr=&\,\sum_{i=1}^{M}f_{R,D}(r,d_{i})\cr=&\,\sum_{i=1}^{M}f_{R,N}(r,d_{i}-r).&{\hbox{(4)}}} Since the original power consumption data and additive noise are independent, we can rewrite{\mathtilde{f}}_{R}(r)$as TeX Source $${\mathtilde{f}}_{R}(r)=\sum_{i=1}^{M}f_{R}(r)f_{N}(d_{i}-r).\eqno{\hbox{(5)}}$$ If there is a sufficient number of customers, we have${\mathtilde{f}}_{R}(r)\approx f_{R}(r)$. Thus, the averaging is actually the estimation of the probability density function$f_{R}(r)$from the conditional probability density function,$f_{R\vert D}(r\vert d_{i})$, and samples$d_{i}$,$i=1,2,\ldots,K$. The above sample-by-sample distribution reconstruction algorithm demands the availability of power consumption data from all customers. In a smart grid, millions of customers' data are gathered and processed in real time, and a centralized processing approach not only entails the high data processing capability of the computational units, but also exerts much pressure on data communications because of a large amount of network traffic. Towards a scalable implementation of the proposed system, it is advantageous to adopt a hierarchical processing architecture as depicted in Fig. 5, where a two-layer architecture is given as an example for distributed processing. The first layer is composed of customers and corresponding data collectors located in a single neighborhood. The second layer consists of data collectors and related parties (e.g., utility company). In the first-layer, each data collector runs the iteration algorithm on power consumption data collected from the neighborhood. For the second layer, the data collector sends the reconstructed local power consumption distribution to related parties for integration. The overall power consumption distribution is then calculated by averaging all received local distribution results. Fig. 5. A hierarchical architecture to reconstruct the power consumption distribution. SECTION IV ## PRIVACY PROTECTION ANALYSIS In the proposed hierarchical processing architecture, the local power consumption distribution (e.g., the reconstructed PDF) is transmitted between a data collector and a service provider. Since the distribution of the neighborhood power consumption reveals nothing about the individual power usage pattern, the privacy threat is confined to a local neighborhood. In this section, we investigate such a privacy threat. It is assumed that eavesdroppers can intercept all reported data from a smart meter to a data collector in their distorted form. De-noising schemes might be perceived as an effective method to recover the original signal from its distorted version. Here, we will analyze two popular de-noising schemes to demonstrate the effectiveness of the proposed privacy-preserving scheme. ### A. Linear Mean Filter Attack The linear mean (LM) filter is a widely-adopted technique to recover the original signal from its noisy version. The fundamental idea is to attenuate the noise influence by averaging cumulative values. Given a sequence of samples from a customer's distorted power consumption trace$d[j]$,$j=1,2,\ldots$, with the linear mean filter de-noising technique, an attacker can derive the de-noised signal$\{r^{\prime}(j),j=1,2,\ldots\}of the following form TeX Source \eqalignno{r^{\prime}[j]=&\,\sum_{{j-L}\leq{i}\leq{j+L}}w[i]d[i]\cr=&\,\sum_{{j-L}\leq{i}\leq{j+L}}w[i](r[i]+n[i]).&{\hbox{(6)}}} The expectation and the variance ofr^{\prime}(j)can be derived, respectively, as TeX Source \eqalignno{E[r^{\prime}[j]]=&\, E\left[\sum_{{j-L}\leq{i}\leq{j+L}}w[i](r[i]+n[i])\right]\cr=&\,\sum_{{j-L}\leq{i}\leq{j+L}}w[i]E[r[i]]&{\hbox{(7)}}} and TeX Source \eqalignno{Var[r^{\prime}[j]]=&\, Var\left[\sum_{{j-L}\leq{i}\leq{j+L}}w[i](r[i]+n[i])\right]\cr=&\,\sum_{{j-L}\leq{i}\leq{j+L}}w^{2}[i](Var[r[i]]+Var[n[i]])\cr=&\,\sum_{{j-L}\leq{i}\leq{j+L}}w^{2}[i](Var[r[i]]+\sigma_{N}^{2})&{\hbox{(8)}}} where2L+1$is the sliding window size of the linear mean filter. By choosing$\sigma_{N}^{2}$greater than$Var[r(i)]$, the noise term will dominate in the de-noised signal. The smaller the window size, the poorer the de-noising result. For a large window size, both noise and the variance of the signal are smoothed out and only a smoothed signal remains. In either case, no detailed privacy information is revealed. ### B. Non-Local Mean Filter Attack The non-local mean (NLM) filtering method is a state-of-the-art de-noising technique widely used for image de-noising. For a given pixel, it adopts a weighted average of other pixels that have a similar local structure to remove its noise. The higher the similarity, the larger the weight. Here, we present the 1-D version of the non-local mean filter and use it as an attack (i.e., recovering the original power consumption data of an individual user). With the application of NLM filtering to a sequence of samples from a customer's distorted power consumption trace$d[j]$,$i=1,2,\ldots$, the recovered power consumption at time$k$can be written as TeX Source $$r^{\prime}[k]=\sum_{j\in\Omega_{d}}w[k,j]d[j],\eqno{\hbox{(9)}}$$ where$\Omega_{d}$is a sliding window of size$2s+1$centered around the$k$th sample and$w[k,j]$is a weight indicating the contribution from sample$d[j]$. The weight is in form of TeX Source $$w[k,j]={{1}\over{C_{k}}}\exp\left\{{-{{\Vert D(NH_{k})-D(NH_{j})\Vert_{2,a}^{2}}\over{h}}}\right\},\eqno{\hbox{(10)}}$$ where$C_{k}$is a normalizing factor (i.e.,$\sum w[k,j]=1$),$D(NH_{k})$denotes a group of distorted power consumption samples in the neighborhood of the$k$th sample,$\Vert\cdot\Vert_{2,a}^{2}$is the Euclidean distance weighted by a Gaussian kernel of standard deviation$a$, and the decay of the weight is adjusted by$h$. The mean of$\Vert D(NH_{k})-D(NH_{j})\Vert_{2,a}^{2}in (10) can be written as TeX Source \eqalignno{E[D(k,j)]=&\, E\Vert D(NH_{k})-D(NH_{j})\Vert_{2,a}^{2}\cr=&\,\Vert R(NH_{k})-R(NH_{j})\Vert_{2,a}^{2}+2(2s+1)\sigma_{N}^{2},} whereR(NH_{k})$is the group of original power consumption samples in the neighborhood of the$kth sample. Thus, the mean of weights is TeX Source \eqalignno{& E[w[k,j]]={{1}\over{C_{k}}}\exp\left\{-{{2(2s+1)\sigma_{N}^{2}}\over{h}}\right\}\times\cr&\qquad\qquad\quad\exp\left\{{-{{\Vert R(NH_{k})-R(NH_{j})\Vert_{2,a}^{2}}\over{h}}}\right\}.&{\hbox{(11)}}} Clearly, if we choose\sigma_{N}^{2}$larger than$Var[r[j]]$, weight$w[k,j]$is dominated by the variance of additive noise rather than the signal similarity measure. In other words, a more similar sample may not be assigned a larger weight due to the noise effect and, therefore, NLM is not able to recover the original signal from the distorted one. SECTION V ## EFFICIENCY ANALYSIS As compared with cryptography-based solutions [9], [10], [11], [12], the proposed scheme is more efficient since it does not demand frequent encryption and decryption operations for the transmission of each power consumption data. Instead, a smart meter requires one extra additive operation in every report period (every minute or second). In addition, there is no stringent timeliness requirement for billing request and response. Therefore, reading distortion and aggregated billing have no efficiency concern. The complexity of power consumption distribution reconstruction can be analyzed as follows. Assume that we have$M$(the number of customers) power consumption data and the value range is split into$Q$bins for processing. According to Algorithm 1,$Q+2$multiplications and$Q-1$additions are needed to calculate$f_{R\vert D}^{k-1}(r\vert d_{i})$, and$M$additions and 1 multiplication from$f_{R\vert D}^{k-1}(r\vert d_{i})$to${\mathtilde{f}}_{R}^{k}(r)$. As a result, at each iteration, we need$M+Q-1$additions and$Q+3$multiplications. The adoption of the hierarchical architecture discussed in Section III further reduces the network traffic and operations of related parties (e.g., utility company). The low-complexity of required operations and highly reduced network traffic make the proposed scheme a practical privacy-preserving solution in a smart grid. SECTION VI ## EXPERIMENTAL EVALUATION We evaluate the proposed scheme in the following two aspects. • Privacy protection We adopt a widely used energy signature analysis technology known as non-intrusive load monitoring to examine the privacy-preserving effect of the proposed scheme. Two de-noising attacks (i.e., linear mean and NLM filters) are investigated in both the short-term and the long-term cases. A real world single household power consumption dataset is used in this evaluation. • Power consumption distribution reconstruction As to power consumption distribution reconstruction, both the accuracy of the proposed algorithm and its relationship with other factors (e.g., the distortion level, the number of households, and the number of iterations) are studied. Synthesized datasets from a smart grid data generator for a large amount of households are used in this evaluation. ### A. Data Sets #### 1) Power Curve Dataset We use a power curve dataset for privacy-preserving evaluation, which was collected by the Business Intelligence Lab of Telecom ParisTech [23]. It contains 349 days of electric power consumption data recorded every 10 minutes in a household in 2007. It has 144 readings starting from 00:00 to 23:50 for each day. #### 2) Smart Grid Simulator For the distribution reconstruction, we simulate a large number of households with a smart grid Simulator, which is a software tool developed by AIFB and Wechselfunchs [24]. The simulator generates data from electronic appliance statistics (mean and variance) learned from three weeks of real world power consumption data. Table II shows the parameters for an oven. The generated data are in N3 format and they include detailed hourly power consumption of each electronic device in each household. TABLE II STATISTICS DERIVATION OF AN OVEN ### B. Non-Intrusive Load Monitoring Non-intrusive load monitoring (NILM) is frequently used by utility companies to understand the load status of a power system. With this technology, customer's power consumption trace is decomposed into profiles of different appliances in different time periods. The method used in NILM is to measure changes in voltages and currents. The measured changes are then mapped to events of “ON/OFF” to indicate when appliances start to work and stop. For example, a 3-second load profile$\{(t_{0},0W),(t_{1},100W),(t_{2},200W),(t_{3},100W)\}$could generate features as:$\{(t_{1},+100W),(t_{2},+100W),(t_{3},-100W)\}$. According to extracted features and properties of various appliances, types and working hours of different appliances can be accurately detected. In general, both the reactive power and the real power are measured to distinguish appliances with the same total power draw. As power consumption data traces become more fine-grained, NILM has brought up a severe privacy concern. That is, customers' behavioral patterns can be easily derived from their power consumption traces with NILM. A rechargeable battery was introduced in [18] to smoothen the load profile to reduce the potential privacy leakage of a household, which can reduce up to 95% of potential features. Although this is effective, we demonstrate by experimental results that our proposed scheme can achieve better results with less cost. ### C. Privacy Protection Two levels of distortion are added to the original power consumption trace that was the recorded data for a single household on 2007 Jan. 1st in the power curve dataset, and they are shown in red curves in Fig. 6(a) and (d). The blue curves in Fig. 6(a)(c) are, respectively, the distorted trace with noise variance$\sigma_{n}^{2}=Var(PCT)$, the de-noised trace after LM filtering, and the de-noised trace after NLM filtering. Fig. 6(d)(f) show the corresponding results under noise variance$\sigma_{n}^{2}=1.5Var(PCT)$, respectively. We see that additive noise severely distorts the original trace, and it is difficult to derive the detailed power usage information from the distorted sequence. As a result of the large noise, both linear mean filtering and NLM filtering fail to recover the accurate power consumption traces of customers. The loss of accuracy makes the mapping to customer's daily activities infeasible. Fig. 6. Reading distortions and de-noising results. We adopt NILM to further verify this conclusion. We extract the ON/OFF feature from the distorted traces under three different distortion levels with a pre-defined threshold set to 0.3 KW and plot the result in Fig. 7. We see that many detected ON/OFF features from the distorted trace are fabricated and real ON/OFF features are severely distorted. Figs. 8 and 9 show the extracted features from the de-noised trace with different filters. Again, the features extracted from the de-noised trace are still noise-like, which could not provide any valuable information about customer's profile of daily activities. Fig. 7. Features extracted from the distorted power consumption trace. Fig. 8. Features extracted from the de-noised trace using the linear mean filter. Fig. 9. Features extracted from the de-noised trace using the non-local mean (NLM) filter. As shown in Table III, after distortion, the number of features extracted with NILM increases and it deviates from the original substantially. With additive noise of a larger variance, more features can be extracted. After the linear mean filtering and the NLM filtering, the numbers of features decrease, but they are still higher than the original one except for NILM processed results under noise variance$\sigma_{n}^{2}=Var(PCT)$. Although the number of features is closer to the original one, the minimal ratio of effective features to the total features (e.g., EF/TF) makes the feature recovery almost useless. TABLE III FEATURES EXTRACTED WITH NILM. Another possible attack is long-term averaging, where the power consumption traces from different periods are averaged to obtain the daily power usage pattern of the customer. Under this scenario, the random distortion is gradually diminished as more power consumption traces contribute to the long-term averaging. However, the long-term averaging also largely reduces the variance of the original power consumption trace. This is particularly true in the proposed scheme where a relatively large distortion is adopted. Fig. 10 shows the result of an averaging attack with a long-term period of 349 days. Although the averaged distorted trace (green curve) converges to the averaged original trace, the averaged original trace loses almost all high-frequency variations in daily power consumption traces (dashed curves). Hence, with long-term averaging, the unauthorized party may derive the yearly average peak and pit period of daily power consumptions, but cannot obtain detailed privacy-related information on a daily basis. Moreover, it is convenient in the proposed scheme to change the disturbance level from time to time which hides the original information furthermore. Fig. 10. Long-term averaging results. ### D. Power Consumption Distribution Reconstruction It is shown in Fig. 6 that the larger the additive noise, the worse the recovery results of the original power consumption trace. However, this does not mean that we can choose arbitrarily large noise since we have to make sure that the power consumption distribution can be accurately reconstructed. For a fixed number of users, the larger the random noise, more samples are required to get the accurate estimate of the power consumption distribution. This will be discussed in this section. Fig. 11 shows the distribution reconstruction result of the proposed scheme without distributed processing. The dataset is generated by the smart grid Simulator. It consists of the power consumption data of 10,000 households in the same time period. The variance of the original power consumption data is 7892.2 W. The disturbance added to the original trace is additive Gaussian noise with mean 100 W and its variance being the same as the original power consumption data. The number of bins for data splitting is 53. The red and blue curves show the aggregate original and distorted power consumption data of 10,000 households, respectively. The green curve is the reconstructed power consumption distribution using the proposed scheme presented in Section III-D. The averaged mean square error (MSE) of the distorted power consumption distribution is 0.1608, while that of the recovered distribution is enhanced to 0.0232. Fig. 11. Power consumption distribution reconstruction. Fig. 12 shows the distributed processing results under the same setting. The blue curve shows the reconstructed distribution with 10-group distributed processing. The MSE of a distributed processing system is 0.0251, which is almost the same as the centralized one. Fig. 12. Comparison of centralized and distributed processing of power consumption distribution reconstruction. To further compare the accuracy of the centralized and the distributed processing schemes, we test the performance of distributed processing with a different grouping number, denoted by$K$in Fig. 5, and show the averaged performance as a function of$K$in Fig. 13. We see that the distributed processing scheme achieves about the same level of accuracy for a wide range of$K\$ values.

Fig. 13. MSE under a different number of groups.

Fig. 14 shows the distribution reconstruction accuracy with a different number of households in the local aggregation process. We see that the averaged MSE decreases as the household number grows. For a power utility company with millions of customers, the averaged MSE could reach 0.001. Fig. 15 shows the MSE curve as a function of the iteration number of the power consumption distribution recovery algorithm parameterized by different distortion levels. We see that the distortion level does not affect the accuracy of distribution recovery. With a fixed number of households, the maximum reconstruction accuracy keeps the same. However, different distortion levels demand a different number of iterations to achieve the optimal reconstructed distribution. For the real world system design, a proper trade-off between privacy protection and accuracy of power consumption distribution reconstruction should be considered.

Fig. 14. The MSE of the reconstructed power consumption distribution as a function of the number of households.
Fig. 15. The MSE of the reconstructed power consumption distribution as a function of the iteration number parameterized by different SNR values.
SECTION VII

## CONCLUSION

A lightweight, efficient and robust privacy-preserving metering scheme for smart grids was proposed in this work. In the proposed scheme, a customer's power consumption data are first distorted by additive noise before being sent to other entities in a smart grid system. Any party receiving the distorted data cannot restore the original fine-grained power consumption trace so that customer privacy is protected. At the same time, related parties can still derive the complete distribution of the original power consumption data from a multitude of customers in a certain geographic area, which enables several important functions of smart grids such as production prediction, dynamic pricing, and billing service. It was shown analytically that several well known denoising attacks fail to recover the original power consumption trace. Experimental evaluations with real world and simulated power consumption datasets confirm the feasibility and practicality of the proposed scheme.

## Footnotes

X. He and C.-C. J. Kuo are with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90007, USA Corresponding author: X. He (xingzehe@usc.edu)

X. Zhang is with the Huawei Research Center, Santa Clara, CA 95050, USA

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

## References

No Data Available

## Cited By

No Data Available

None

## Multimedia

No Data Available
This paper appears in:
No Data Available
Issue Date:
No Data Available
On page(s):
No Data Available
ISSN:
None
INSPEC Accession Number:
None
Digital Object Identifier:
None
Date of Current Version:
No Data Available
Date of Original Publication:
No Data Available

Comment Policy