Adversarial Inference Control in Cyber-Physical Systems: A Bayesian Approach With Application to Smart Meters

With the emergence of cyber-physical systems (CPSs) in utility systems like electricity, water, and gas networks, data collection has become more prevalent. While data collection in these systems has numerous advantages, it also raises concerns about privacy as it can potentially reveal sensitive information about users. To address this issue, we propose a Bayesian approach to control the adversarial inference and mitigate the physical-layer privacy problem in CPSs. Specifically, we develop a control strategy for the worst-case scenario where an adversary has perfect knowledge of the user’s control strategy. For finite state-space problems, we derive the fixed-point Bellman’s equation for an optimal stationary strategy and discuss a few practical approaches to solve it using optimization-based control design. Addressing the computational complexity, we propose a reinforcement learning approach based on the Actor-Critic architecture. To also support smart meter privacy research, we present a publicly accessible “Co-LivEn” dataset with comprehensive electrical measurements of appliances in a co-living household. Using this dataset, we benchmark the proposed reinforcement learning approach. The results demonstrate its effectiveness in reducing privacy leakage. Our work provides valuable insights and practical solutions for managing adversarial inference in cyber-physical systems, with a particular focus on enhancing privacy in smart meter applications.


I. INTRODUCTION
A cyber-physical system (CPS) integrates physical components with computational and communication elements to enable real-time monitoring and control of physical systems. CPSs provide substantial advantages in utility systems such as electricity grids, water and gas supply networks, and transportation systems, including enhanced efficiency, stability, and automated network control. For instance, smart electric grids can use CPSs to monitor power usage and adjust supply and demand in real time, reducing energy waste and costs. However, the integration of CPSs in utility systems can also pose potential privacy risks, as the usage patterns of resources can reveal sensitive private information about users to anyone with access to the data. For example, energy consumption data from smart meters (SMs) can be used to infer the types of household appliances [1] and their usage patterns, thereby disclosing sensitive private information, including occupants' presence or absence, the number of occupants, daily routines, entertainment habits, and medical equipment usage [2]. This information is susceptible to exploitation by malicious actors or unauthorized third parties for various purposes, such as targeted advertising or surveillance. The General Data Protection Regulation (GDPR) in Europe establishes stringent guidelines for handling data containing sensitive personal information. Specifically, the GDPR forbids processing data that could disclose such information without obtaining users' informed consent. For instance, this means that, when using SM data, one should not be able to infer appliance usage patterns that may reveal the religious beliefs of consumers without their explicit consent. Hence, it is crucial to develop privacy-enhancing methods for CPSs that safeguard users' privacy while still enabling their benefits. (The associate editor coordinating the review of this manuscript and approving it for publication was Junggab Son.)
Consider a hypothetical scenario where a third-party energy service provider is not only aware of the user's energy consumption patterns but also has perfect knowledge of the privacy-enhancing control strategy employed by the user. This situation presents a significant challenge for the user, as the adversary is well-equipped to exploit any weaknesses in the control strategy, potentially leading to the exposure of private information about the user's habits, preferences, or lifestyle. In our previous work [3], we studied the problem of optimally controlling the sequential Bayesian hypothesis testing (SBHT) of an adversary who is unaware of the presence of a control system. In this work, we address an even stronger privacy question: How can a user protect their privacy against an adversary who has perfect knowledge about the control strategy employed by the user? By addressing this question, we aim to design conservative privacy control strategies against a worst-case adversary performing SBHT, which can serve as a benchmark.

A. RELATED WORKS
Addressing the privacy risks associated with smart meter data, several privacy-enhancing techniques have been proposed in the literature to protect sensitive user information without compromising the overall utility and benefits of smart meters. These techniques can be broadly classified into two approaches: Data Manipulation and Demand Shaping.

1) DATA MANIPULATION
Data manipulation techniques aim to protect user privacy by altering measured smart meter data before transmission. The work in [4] presents a privacy-preserving smart metering approach using homomorphic encryption that allows computation of the aggregated energy consumption of a given set of users without accessing individual user data directly. In [5], the authors present a more efficient and scalable approach for data aggregation using the Secure Multi-Party Computation (SMPC) cryptographic technique. The authors in [6] propose a privacy-preserving protocol using zero-knowledge proofs that enables billing with time-of-use tariffs without disclosing the actual consumption profile to the supplier. In [7], a simple and efficient method to preserve differential privacy is proposed by adding noise to the SM data in such a way that it becomes difficult to learn anything about an individual, while accurate aggregate statistics can still be computed. More recently, [8] proposes a data obfuscation method utilizing both differential-privacy-preserving data perturbation and a cryptographic noise distribution. Data manipulation techniques, while useful, have certain drawbacks. First, altering the reported values may reduce their usefulness for grid management and load prediction, ultimately undermining the benefits of smart meters. Second, since these techniques do not address the issue at the physical layer, adversaries with access to power lines could install separate sensors, thereby circumventing the privacy protection offered by such methods.

2) DEMAND SHAPING
Demand shaping techniques physically alter the actual user energy demand from the grid in real time to obfuscate sensitive information that can be inferred from SM data. This is achieved using energy storage systems (ESSs), flexible loads such as heating systems, and renewable energy sources. These physical-layer techniques are highly effective in enhancing privacy since they limit information leakage even before data generation.
Several analytical techniques have been proposed in the literature that quantify privacy using differential privacy measures [9], information-theoretic measures such as mutual information [10], [11], [12], [13], conditional entropy [14], and others, providing axiomatic guarantees on the maximum possible information leakage. Detection-theoretic privacy-enhancing techniques, on the other hand, offer operational privacy guarantees, such as protection against hypothesis tests [3], [15], [16], [17]. Related to controller-aware adversarial hypothesis testing, a few attempts have been made in the literature to develop control policies that worsen adversarial detection performance. Li et al. [15] presented an optimal control strategy against a greedy and informed adversary conducting independent single-shot hypothesis tests. In [17], an informed adversary performing hypothesis tests on a static binary state is studied, and fundamental limits on achievable privacy are presented. In [16], the authors formulate a partially observable Markov decision process (POMDP) control problem, where the belief state of an informed adversary is optimally controlled over a given horizon, and the adversary is assumed to perform instantaneous hypothesis tests using only the current observation. In another system setting, Liao et al. [13] proposed a privacy-enhancing mechanism to aid hypothesis testing while constrained by a mutual information privacy measure, which differs from our work, where hypothesis testing is used to model the adversary. In another related work, Salehkalaibar et al. [18] define a binary hypothesis state at the control level, where the controller is either in an idle or a privacy-enhancing state, and analyze hypothesis testing at the utility provider with access to some side information.
Heuristic-based, computationally efficient techniques have also been proposed in the literature. Notable approaches are the Best Effort Moderation approach [19], which aims to maintain a constant metered load by using a battery, and the Lazy Stepping approach [20], which provides privacy by increasing the quantization error of the smart meter data, converting the grid load into a step function with an arbitrary number of quantization levels. While these heuristic approaches are easier to implement in practice, their ability to provide formal privacy guarantees and comply with legal standards could be limited due to their reliance on simplistic and pre-defined rules, especially when faced with adversaries with knowledge of these rules.
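As a concrete illustration of this heuristic class, the Best Effort Moderation idea can be sketched in a few lines. The following is a minimal sketch with hypothetical demand values and battery parameters, not the actual implementation of [19]:

```python
import numpy as np

def best_effort_moderation(demand, target, capacity, charge0=0.0):
    """Best-effort moderation sketch: a battery absorbs or supplies the
    difference between user demand and a constant target, so the metered
    load equals `target` whenever the battery state allows it.  All
    quantities share one energy unit per slot."""
    charge, metered = charge0, []
    for x in demand:
        d = target - x                                     # desired battery contribution
        d = float(np.clip(d, -charge, capacity - charge))  # physical limits
        charge += d
        metered.append(x + d)
    return metered, charge

# Usage: a fluctuating demand is flattened toward the 2.0 target, except in
# the third slot, where the battery is empty and the demand passes through
metered, _ = best_effort_moderation([1.0, 3.0, 2.5, 0.5], target=2.0, capacity=5.0)
```

Against an informed adversary, such a fixed rule leaks information precisely in the slots where the battery constraint binds, which motivates the optimized strategies developed later in this paper.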

B. CONTRIBUTIONS
In this paper, we address a strong physical layer privacy case by considering an adversary with complete knowledge of the user's control strategy and modeling the adversary's inferences using the SBHT inference model. Using the Markov decision process (MDP) framework, we measure privacy leakage in the physical layer using the Bayesian risk (adversarial reward) in the SBHT. For a finite state-space system, we derive a fixed-point equation for an optimal stationary strategy that minimizes the discounted aggregate value of the infinite-horizon Bayesian risk. The fixed-point equation is similar to Bellman's fixed-point equation of a continuous MDP with infinite state and action spaces and with a non-linear objective function that is impractical to solve exactly without making simplifying approximations. In this paper, we present a few practical approaches to solve it using optimization-based control design, highlighting their computational complexities.
Although exact optimal policies are theoretically computable using optimization-based approaches, they become intractable for high-dimensional state-space problems. To tackle the computational complexity, we introduce an actor-critic reinforcement learning (RL) algorithm named Adversarial Model-based Deterministic Policy Gradient (AMDPG). In an actor-critic RL algorithm, the actor is parameterized using a neural network, which can generate continuous actions easily without the need for optimization procedures. The critic provides a low-variance estimate of the actor's performance by parameterizing the expected return of its actions [21]. This approach allows us to handle the complex and dynamic nature of cyber-physical systems effectively, where traditional MDP dynamic programming approaches are intractable. The proposed AMDPG algorithm is inspired by the Deep Deterministic Policy Gradient (DDPG) method [22]. A key difference between the proposed AMDPG and DDPG algorithms lies in the noise generation process. In AMDPG, we not only integrate randomly generated noise into the actor output but also add noise obtained from the solution of an optimization problem, which computes an instantaneously optimal policy. This modification is intuitively expected to encourage more effective policy exploration and to take long-term rewards into account by functioning near the instantaneously optimal control during the learning process.
Furthermore, to facilitate smart meter privacy research, we introduce the publicly accessible ''Co-LivEn'' dataset, which contains detailed electrical measurements of appliances in a co-living household. This dataset provides a valuable resource for studying the privacy implications of non-intrusive load monitoring (NILM) and the effectiveness of various privacy-enhancing techniques. Finally, we benchmark the proposed reinforcement learning approach using the presented household energy consumption data. The results show the effectiveness of the proposed control strategy in reducing privacy leakage even in the worst-case scenario, when the adversary is aware of the exact control strategy employed by the user.
The important contributions of the paper can be summarized as follows:
1) A novel and implementable RL approach to address a worst-case adversary with complete knowledge of the user's control strategy.
2) Derivation of a fixed-point equation for an optimal stationary strategy that minimizes the discounted aggregate value of the infinite-horizon Bayesian risk, and practical approaches to solve it.
3) A publicly accessible energy consumption dataset that includes comprehensive electrical measurements of appliances in a co-living household.
4) Benchmarking of the proposed strategies using both synthetic and real data.

C. ORGANIZATION OF THE PAPER
The rest of the paper is organized as follows. In Section II, we present an overview of the system along with its modeling. In Section III, we present the preliminaries on adversarial inference using the SBHT framework. In Section IV, we formulate the optimal inference control problem. Subsequently, in Section V, we present several optimization-based approaches to achieve adversarial inference control. Using reinforcement learning, we present a novel control approach in Section VI. Furthermore, numerical studies using synthetic and real data are presented in Section VII and Section VIII. Lastly, we conclude the paper in Section IX.

In this paper, unless otherwise stated, we use capital letters to denote random variables, lowercase letters for their realizations, and calligraphic letters for their alphabets. We use A_{k:k+i} to denote the vector [A_k, A_{k+1}, . . . , A_{k+i}]^⊺ and 𝒜^{k:k+i} to denote the Cartesian product of the corresponding alphabets; (·)^⊺ denotes the matrix transpose operator. P_A(a) denotes a probability distribution function, and I denotes an indicator function with I_a = 1 if a is true, and 0 otherwise. 0_n and 1_n are n-dimensional vectors with all entries zeros and ones, respectively. ∆_n denotes the (n − 1)-dimensional simplex. In summations, if not otherwise specified, the domain of a variable is its complete alphabet. Throughout the paper, we use the term policy to refer to a map from some information set to some action at a certain time instance, and the term strategy to refer to a sequence of policies.

FIGURE 1. The proposed metering system that enables physical layer user privacy by altering the actual consumption using a storage. Here, the solid lines denote the physical resource flow and the dotted lines denote the information flow.

II. SYSTEM MODEL
We consider a privacy-concerned user consuming a resource in a cyber-physical system, as shown in Fig. 1. At the beginning of each slot k ∈ N in a discrete-time infinite horizon N = {1, 2, . . .}, the user's consumption is altered by a control strategy using a storage system. We restrict our analysis to discrete time and discrete resource levels, designing a control strategy for systems with digital signal processing capabilities. We define random variables X_k and D_k on finite alphabets X and D, respectively, where X_k represents the user's resource consumption, D_k represents the additional demand or usage of the stored resource specified by the control strategy, and Y_k := X_k + D_k denotes the consumption measured by the sensor on a finite alphabet Y = {x + d : x ∈ X, d ∈ D}.
We model the storage system's state transitions using a first-order Markov model characterized by the conditional distribution P_{Z_{k+1}|Z_k,D_k}, where Z_k represents the quantized value of the storage system state on a finite discrete alphabet Z. We estimate the conditional distribution P_{Z_{k+1}|Z_k,D_k} using Monte Carlo simulations and a sample-based density estimation approach [3]. In this work, we further simplify the storage system model by parametrizing this conditional distribution. Moreover, to capture the sensitivity of user behavior, we introduce a privacy-sensitive state denoted by H_k, defined on a finite alphabet H. This state, referred to as the hypothesis state, can represent various user activities such as cooking, taking a shower, and more, which are potentially of interest to an adversary seeking to infer personal information.
We model the dependency between the sequence of user demands and hypothesis states [H_{1:N}, X_{1:N}] corresponding to the horizon N using a first-order time-homogeneous hidden Markov model (HMM) characterized by a set of parameters denoted as θ, with joint distribution P_{H_{1:N}, X_{1:N}} = P_{H_0} ∏_{k=1}^{N} P_{H_k|H_{k−1}} P_{X_k|H_k}, where P_{X_k|H_k} represents the observation probability of user demands, P_{H_k|H_{k−1}} represents the transition probability of hypothesis states, and P_{H_0} represents the prior probability of hypothesis states.
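To make the HMM model concrete, the following sketch samples a demand/hypothesis trajectory from a toy parameterization; the prior, transition, and observation probabilities below are illustrative placeholders, not values estimated from data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative HMM parameters (assumed): two hypothesis states, three demand levels
P_H0    = np.array([0.6, 0.4])          # prior P(H_0)
P_H_Hm1 = np.array([[0.9, 0.1],         # transition P(H_k | H_{k-1})
                    [0.2, 0.8]])
P_X_H   = np.array([[0.7, 0.2, 0.1],    # observation P(X_k | H_k)
                    [0.1, 0.3, 0.6]])

def sample_hmm(n):
    """Draw a length-n trajectory (h_1:n, x_1:n) from the first-order HMM."""
    h = rng.choice(2, p=P_H0)
    hs, xs = [], []
    for _ in range(n):
        h = rng.choice(2, p=P_H_Hm1[h])     # hypothesis state transition
        xs.append(rng.choice(3, p=P_X_H[h]))  # demand emitted by current state
        hs.append(h)
    return hs, xs

hs, xs = sample_hmm(100)
```

In the smart meter setting, the two hypothesis states could stand for activities such as "cooking" versus "idle", with the demand levels as quantized power draws.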
As we will discuss in the subsequent analysis, the statistics of the ESS state Z_k are also relevant to the adversary when attempting to guess the hypothesis state. For analytical simplicity, we replace the state pair (Z_k, H_k) with a 1-dimensional random variable A_k. Let A := {1, . . . , |Z| × |H|} denote the alphabet of A_k. We define I_k as the causal information available to the controller at the start of slot k, which is a discrete set, and we let μ_k ∈ U_k denote a randomized control policy, where U_k denotes the set of all randomized control policies.

III. ADVERSARIAL INFERENCE
Here we assume a strong adversary who knows the HMM parameter set θ, the storage system state transition probability P_{Z_k|Z_{k−1},D_k}, and the exact control strategy μ_{1:∞} employed by the user. We model the adversary's inferences about the hypothesis state H_k in the infinite-horizon case using the SBHT framework. Let Ĥ_k denote the hypothesis state estimate of the adversary, which is defined on H. We also define a randomized detection policy for the adversary, denoted by ζ_k ∈ C_k, which represents the conditional distribution P_{Ĥ_k|Y_{1:k}}, where C_k denotes the set of all randomized detection policies for the adversary.
In the SBHT framework [23], a reward is assigned to each possible test outcome, denoted by c(h, ĥ) with h, ĥ ∈ H. An optimal detection strategy is designed by maximizing the expected reward. The expected reward at time slot k, known as the Bayesian reward, is denoted by r_k and is given by r_k = Σ_{h,ĥ∈H} c(h, ĥ) P_{H_k,Ĥ_k}(h, ĥ), where P_{H_k,Ĥ_k} is the joint distribution of the hypothesis state estimate Ĥ_k and the true hypothesis state H_k. Note that r_k also represents the Bayesian risk, i.e., the average privacy cost for the user, given a fixed function c(h, ĥ). In this work, we assume that c(h, ĥ) ≥ 0 and that the reward for a correct guess is greater than that for an incorrect guess. As a result, the adversary seeks to maximize the average Bayesian reward, while the user aims to minimize it.
For any finite horizon K_N = {1, 2, . . . , N} of arbitrary length N, an optimal detection strategy of the adversary for any given control strategy μ_{1:N}, denoted by ζ*_{1:N}, that maximizes the average Bayesian reward can be expressed as

ζ*_{1:N} = arg max_{ζ_{1:N} ∈ C_{1:N}} (1/N) Σ_{k=1}^{N} E[r_{k|k}],   (2)
where r_{k|k} denotes the conditional Bayesian reward of the adversary given the causal data y_{1:k} ∈ Y^k, expressed as in (3).

Lemma 1: Let π_k denote the belief state of the adversary at time slot k, which represents the conditional probability vector P_{A_k|Y_{1:k}}. For any given data y_{1:k} and control strategy μ_{1:k}, the belief state of the adversary evolves according to the linear-fractional recursion in (4)-(5), where M_π and M_μ̄ are |A| × |A| and |A| × |W| dimensional matrices, respectively, and μ̄_k denotes the control sub-policy, which represents the conditional probability distribution induced by the control policy μ_k.

The proof of Lemma 1 can be found in Appendix A. Note that the conditional Bayesian reward r_{k|k} depends on the causal data y_{1:k−1} only through the sub-policy μ̄_k and the belief state π_{k−1}. Furthermore, for a fixed control strategy μ_{1:N}, the sub-policy μ̄_k and the belief state π_{k−1} do not depend on the detection strategy ζ_{1:N}. Therefore, the optimization problem for the optimal SBHT adversarial detection strategy ζ*_{1:N} in (2) can be decomposed into N linear programs of the form (6), where c(ĥ_k) is an |A|-dimensional vector with elements [c(ĥ_k)]_a = c(f_H(a), ĥ_k), and f_H : A → H is any deterministic function that maps the state A_k to its corresponding hypothesis state H_k. The adversarial detection policy given by (6) can be represented using a time-invariant and deterministic decision rule ζ*.

Remark 1: In the computation of the belief state using (4) and (5), the current belief state π_k depends on the past observations y_{1:k−1} only through the sub-policy μ̄_k and the previous belief state π_{k−1}. However, the computation of the sub-policy μ̄_k using (47) requires the complete data y_{1:k−1} at each time slot k if the control policy μ_k is explicitly designed in the form P_{Y_k|X_k,H_k,I_k}. As we will show in the coming section, using a control policy in this form with complete historical data does not help the user in minimizing the average Bayesian reward.
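A minimal numerical sketch of the belief recursion in Lemma 1: after normalization, the update is the standard hidden-Markov forward filter, written here with an assumed state transition matrix T and an assumed observation likelihood induced by the control sub-policy (toy values, not from the paper):

```python
import numpy as np

def belief_update(pi_prev, T, lik_y):
    """One step of the adversary's belief recursion: predict through the
    state transition T[a', a] = P(A_k = a | A_{k-1} = a'), then apply the
    Bayes correction with lik_y[a] = P(Y_k = y | A_k = a); normalization
    yields the linear-fractional form of Lemma 1."""
    predicted = T.T @ pi_prev        # prior over A_k
    unnorm = lik_y * predicted       # pointwise Bayes correction
    return unnorm / unnorm.sum()

# Toy numbers (assumed, not from the paper)
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])
lik = np.array([0.5, 0.2])           # likelihood of the observed y under each a
pi = belief_update(np.array([0.5, 0.5]), T, lik)
```

The denominator of the normalization is linear in π_{k−1}, which is exactly why the map from π_{k−1} to π_k is linear-fractional.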
The information flow in the adversarial SBHT detection process is illustrated in Fig. 2. The optimization problem in (7) defining the adversarial decision rule ζ* can also be represented using polyhedral decision regions in the simplex ∆_{|A|}. For each hypothesis state h ∈ H, the optimal adversarial decision region R_h is defined by R_h = {π ∈ ∆_{|A|} : c(h)^⊺ π ≥ c(h′)^⊺ π, ∀ h′ ∈ H}. The set of all decision regions is denoted by R := {R_h : h ∈ H}, and it satisfies ∪_{h∈H} R_h = ∆_{|A|}. Since the adversarial detection policy ζ* given in (7) is time-invariant, the stationary strategy ζ*_{1:∞} = [ζ*, ζ*, . . .] also maximizes both the average and discounted Bayesian rewards, denoted by w and w_ρ, respectively. Specifically, for a given control strategy μ_{1:∞}, the average and discounted Bayesian rewards of the informed adversary can be expressed as

w = lim_{N→∞} (1/N) Σ_{k=1}^{N} E[r_{k|k}],   w_ρ = Σ_{k=1}^{∞} ρ^{k−1} E[r_{k|k}],

where ρ ∈ [0, 1) is a discount factor.
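The resulting time-invariant decision rule is easy to sketch: for a belief π, the adversary picks the guess maximizing c(ĥ)^⊺ π, and the decision regions R_h are the subsets of the simplex where each guess wins. A minimal sketch with an assumed 0/1 reward matrix:

```python
import numpy as np

def adversary_decision(pi, C):
    """Time-invariant SBHT decision rule: given a belief pi over A and a
    reward matrix C[a, h] = c(f_H(a), h), return the guess maximizing the
    expected Bayesian reward c(h)^T pi.  The set of beliefs for which a
    given h wins is exactly the polyhedral decision region R_h."""
    return int(np.argmax(C.T @ pi))

# Usage with an assumed 0/1 reward (correct guesses rewarded) and A = H
C = np.eye(2)
guess = adversary_decision(np.array([0.7, 0.3]), C)   # → 0
```

Because each region is the intersection of half-spaces {π : c(h)^⊺ π ≥ c(h′)^⊺ π}, the regions are polyhedra, which the optimization-based control approaches in Section V exploit.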

IV. INFERENCE CONTROL PROBLEM FORMULATION
The SBHT adversarial inference poses a significant threat to the privacy of users in CPSs. To mitigate this threat, we aim to find a stationary control strategy for the infinite horizon N that minimizes the discounted Bayesian risk w_ρ. While the infinite-horizon average Bayesian risk w is also a suitable optimization objective, it may not perform well in practice due to the dynamic nature of user demands. Therefore, we focus on minimizing w_ρ to derive an optimal control policy that ensures the privacy of the CPS.

FIGURE 3. Information flow in the control system with the policy designed using the sufficient statistic π_k.

Given any control strategy μ_{1:∞} and any ϵ > 0, for a bounded reward function c(h, ĥ) and a sufficiently large N, the tail of the discounted sum beyond N is smaller than ϵ, since ρ^{k−1} → 0 as k → ∞. Hence, we approximate the minimum discounted Bayesian risk, denoted by w*_ρ,
as follows: w*_ρ ≈ min_{μ_{1:N}} Σ_{k=1}^{N} ρ^{k−1} E[r_{k|k^-}]. Here, r_{k|k^-} denotes the conditional Bayesian risk given the information available before observing y_k, where ĥ*_k = ζ*(π_k) is the optimal hypothesis guess of the adversary given the belief state π_k, which is updated according to (4) or (5) given (y_k, μ̄_k, π_{k−1}). As mentioned in Remark 1, computing μ̄_k when the control policy μ_k is explicitly designed in the form P_{Y_k|X_k,H_k,I_k} requires the complete data y_{1:k−1} at each k. In Theorem 1, we show that the belief state π_{k−1} is a sufficient statistic of the information vector i_k for finding an optimal control strategy that achieves the minimum achievable discounted Bayesian risk. Thus, using policies that rely on the entire data history y_{1:k−1} does not provide any performance improvement against the adversary.
Theorem 1: Let Ũ = {μ̃ : ∆_{|A|} → Ū} denote the set of all mappings from ∆_{|A|} to Ū. For any discount factor ρ ∈ (0, 1], at each k ∈ K_N, there exists an optimal policy μ̃*_k ∈ Ũ that achieves the minimum discounted Bayesian risk achievable by any policy μ_k ∈ U_k and is given by the backward recursion (15), where n = N − k + 1 is the backward iteration index starting from k = N for some arbitrarily large N. Here, v_n, known as the value function, is the aggregate value of discounted conditional Bayesian risk due to the optimal strategy μ̃*_{k:N}.

The proof of Theorem 1 can be found in Appendix B. Fig. 3 illustrates the information flow in the control system with the policy μ̃_k ∈ Ũ designed using the sufficient statistic π_k.

Corollary 1: For any discount factor ρ ∈ (0, 1), there exists a unique fixed-point value function v* : ∆_{|A|} → R_+ to which the Bellman's recursion in (15) converges. Consequently, the optimal stationary policy μ̃* ∈ Ũ that achieves the minimum discounted Bayesian risk w*_ρ is the solution to the fixed-point equation (16), where [v*]_y = v*(π_k(y, μ̄_k, π_{k−1})).

Proof: Due to the monotonicity and contraction properties of the Bellman's recursion [24], Banach's fixed-point theorem [25] implies that there exists a unique fixed point v* ∈ V to which the Bellman's recursion in (15) converges. □

Remark 2: The optimal SBHT control problem of minimizing the infinite-horizon discounted Bayesian risk can be formulated as a POMDP problem with continuous state π_{k−1}, continuous action μ̄_k, and a piecewise-linear convex cost function.
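The contraction argument behind Corollary 1 can be illustrated on any finite MDP: successive Bellman iterates shrink in sup norm by at least the factor ρ. The following toy example (illustrative rewards and transitions, unrelated to the SBHT model) checks this numerically:

```python
import numpy as np

# Toy finite MDP (assumed values): r[s, u] is the instantaneous risk,
# P[s, u, s'] the transition kernel, rho the discount factor.
rho = 0.9
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])

def bellman(v):
    """One Bellman iteration: minimize instantaneous plus discounted
    continuation risk over actions, mirroring the recursion (15)."""
    return (r + rho * P @ v).min(axis=1)

v, gaps = np.zeros(2), []
for _ in range(200):
    v_next = bellman(v)
    gaps.append(abs(v_next - v).max())
    v = v_next
# each sup-norm gap shrinks by at least the factor rho (Banach contraction)
```

The SBHT problem differs in that the state is a continuous belief vector and the cost is piecewise-linear, but the same contraction property drives the convergence of (15) to the fixed point v*.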

V. OPTIMIZATION-BASED CONTROL
In this section, we discuss several practical optimization-based approaches for solving the inference control problem (16) introduced in Section IV. These approaches take advantage of the problem's structure and apply various constraints on the discount factor, control policy space, and belief state space to simplify the optimization problem. These methods are particularly useful for scenarios where the state and action spaces of the system model are small. We also provide a discussion on the computational complexity of each approach.

A. INSTANTANEOUSLY-OPTIMAL CONTROL
An empirical upper bound on the minimum discounted Bayesian risk w*_ρ can be obtained by using the instantaneously-optimal control policy with ρ = 0. The objective function in (16) becomes piecewise-linear with respect to (π_{k−1}, μ̄_k) when ρ = 0, enabling efficient computation of an exact instantaneously-optimal policy. The minimum instantaneous risk, denoted by r*_{k|k^-}, is given by (17), where Ū_r(ζ, π_{k−1}) denotes the set of all policies μ̄_k ∈ Ū that satisfy the set of constraints imposed by the decision regions R corresponding to the decision vector ζ and the belief state π_{k−1}. These constraints are linear, since the belief state π_{k−1} evolves to π_k following the linear-fractional transformation given in (4) and the adversarial decision regions R are also polyhedra in the belief space ∆_{|A|}. As a result, the exact instantaneously-optimal policy can be obtained in real time by solving a linear program for the observed belief state π_{k−1}.
Remark 3: Computing the exact instantaneously-optimal policy requires solving a piecewise minimum over the set H^{|Y|} as given in (17). Therefore, the worst-case time complexity of the instantaneously-optimal control policy is O(|H|^{|Y|}), which may grow exponentially with the size of the observation space |Y|.
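A simplified sketch of the myopic (ρ = 0) idea: rather than solving the full family of linear programs over H^{|Y|}, the code below evaluates the instantaneous Bayesian risk of each sub-policy in a small finite candidate set and picks the best, under the assumption that each candidate is described directly by its observation likelihoods P(Y_k | A_k):

```python
import numpy as np

def myopic_risk(pi_prev, T, L, C):
    """Instantaneous Bayesian risk of a sub-policy with observation
    likelihoods L[a, y] = P(Y_k = y | A_k = a): for every y the informed
    adversary updates its belief and takes the best guess, so the risk is
    sum_y max_h c(h)^T (unnormalized posterior after seeing y)."""
    predicted = T.T @ pi_prev              # prior over A_k
    unnorm = L * predicted[:, None]        # unnorm[a, y] = P(y | a) P(a)
    return sum((C.T @ unnorm[:, y]).max() for y in range(L.shape[1]))

def myopic_control(pi_prev, T, candidates, C):
    """Greedy (rho = 0) control: pick the candidate sub-policy with the
    smallest instantaneous Bayesian risk."""
    risks = [myopic_risk(pi_prev, T, L, C) for L in candidates]
    return int(np.argmin(risks)), float(min(risks))

# Usage: with T = I and a 0/1 reward, a revealing sub-policy (y = a) has
# risk 1, while the uniform obfuscating one halves it, so it is chosen.
T, C = np.eye(2), np.eye(2)
candidates = [np.eye(2), np.full((2, 2), 0.5)]
best, risk = myopic_control(np.array([0.5, 0.5]), T, candidates, C)
```

This enumeration over a fixed candidate set is only an approximation of the exact LP-based policy in (17), but it conveys the structure: each observation branch contributes the adversary's best achievable reward on the updated belief.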

B. OPTIMAL CONTROL WITH FINITE SUB-POLICY SPACE
Here, we present an approach to solve the SBHT control problem by restricting the feasible space of the control policies in (15) to a finite set of control sub-policies, denoted by Ū_F. For example, Ū_F can be chosen to be the finite set of all degenerate sub-policies. With a finite control sub-policy space, for each π_{k−1} ∈ ∆_{|A|}, the Bellman's equation in (15) can be rewritten as (19), where γ_k(μ̄_k, π_{k−1}, v_{n−1}) is a vector in R_+^{|A|}. The set of all hyperplanes in R_+^{|A|} that define the boundaries of the decision regions R is denoted by B. Let B̄_0 = B and, for n ≥ 1, let B̄_n denote the set of hyperplanes in R_+^{|A|} obtained by propagating B̄_{n−1} through the belief recursion. Since the control sub-policy space Ū_F is finite, by initializing with v_{N+1}(π_N) = 0, we can solve the optimization problem in (19) at each k ≤ N by recursively partitioning ∆_{|A|} into a finite set of polyhedral partitions using all hyperplanes in B̄_n ∪ B. The resulting polyhedral regions are called Markov partitions, similar to those in a POMDP control problem [26]. Within each Markov partition, the adversarial inference ĥ*_k = ζ*(π_k) and the vector γ_k are constant with respect to the belief state π_{k−1}. These Markov partitions, along with the corresponding γ_k vectors, completely characterize the value function v_k in the Bellman's recursion (19).
In a POMDP control problem with a linear cost function, the value function v_k can be characterized without the need to compute Markov partitions at each iteration. Instead, it can be completely characterized by computing all possible γ_k vectors in the belief space ∆_{|A|} [24, §7.5.1]. Remarkably, this result also holds true for an SBHT control problem, even when the cost function is piecewise-linear.
Proposition 1: Let F denote the set of all Markov partitions obtained by partitioning the unit simplex ∆_{|A|} using all the hyperplanes in B̄_1 ∪ B. Within each partition F_i ∈ F, the value function v_k is piecewise-linear and concave with respect to π_{k−1} ∈ F_i; that is, it satisfies the recursion (22)-(23), where n = N − k + 1 denotes the backward iteration index starting from k = N for some arbitrarily large N, initialized with v_{N+1}(π_N) = 0. Here, [ζ*]_y = ζ*(π_k(y, μ̄, π_{k−1})), T(F_i, M_π(y, μ̄)) denotes the affine transformation of the polyhedron F_i w.r.t. the belief transformation matrix M_π(y, μ̄) defined in (4), and ⊕ denotes the cross-sum operation, i.e., the pairwise addition of vectors from two sets. The finite set Γ_i(μ̄, n) in (23) is constructed backward recursively by taking a cross-sum of |Y| sets, where each set corresponding to a control action y ∈ Y contains all possible vectors M_c(ζ*) μ̄_{|Y|} + M_π^⊺(y, μ̄) γ, as a result of Bellman's dynamic programming. Here, γ belongs to the set Γ_j(μ̄′, n − 1) corresponding to each μ̄′ ∈ Ū_F and each partition F_j ∈ F that can be reached from F_i ∈ F using an affine transformation with M_π(y, μ̄). Prop. 1 can be shown using an induction technique similar to the proof of [24, Theorem 7.4.1] for a POMDP control problem. The construction of the Markov partition set F using all the hyperplanes in B̄_1 ∪ B ensures that the adversarial decision vector ζ* and, consequently, the instantaneous reward r*_{k|k^-} in (17) are constant w.r.t. π_{k−1} ∈ F_i at each iteration n. Furthermore, as the belief state π_k evolves according to the linear-fractional transformation in (4), the value function in (19) becomes piecewise-linear with respect to π_{k−1} within each Markov partition F_i.
Remark 4: Although Prop. 1 shows that it is theoretically possible to compute the exact optimal stationary policy over a finite control sub-policy space Ū_F using (22) and (23), the number of γ vectors grows double-exponentially over the iterations. Due to the high computational complexity of the approach, computing the optimal policy may not be feasible even for low-dimensional problems.

C. SUB-OPTIMAL CONTROL WITH FINITE SUB-POLICY SPACE
As mentioned in Section V-B, computing the exact optimal stationary policy over a finite control sub-policy space is intractable, even for small state-space problems. To overcome this issue, we propose a sub-optimal approach based on the method proposed by Lovejoy [27] for POMDP control problems with a linear cost function with respect to the belief state. Since the cost function of the SBHT control problem is piecewise-linear, we use a similar approach to find an upper bound on the minimum discounted Bayesian risk w*_ρ empirically.
The key idea in this approach is to retain only a subset of the γ vectors in the set Γ_i(μ̄, n) at each iteration, denoted by Γ̄_i(μ̄, n), thereby avoiding the double-exponential growth of the γ vectors. Given a set Γ_i(μ̄, n) computed using (23), we first choose a finite set of arbitrary belief states within the corresponding partition F_i, denoted by F̄_i. We then construct Γ̄_i(μ̄, n) for each μ̄ ∈ Ū_F by retaining only the γ vectors that are optimal at some belief state in F̄_i. We then iterate (23) using Γ̄_i(μ̄, n) instead of Γ_i(μ̄, n) until the vectors in each set Γ̄_i(μ̄, n) converge up to some finite precision. This approach gives a sub-optimal stationary policy that yields an upper bound to w*_ρ, with a fixed space complexity of O((|Ū_F| × |F̄|)^{|Y|}) at each iteration, where |F̄| is the number of belief states we choose within the simplex ∆_{|A|}.
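The retention step can be sketched as follows: keep only those γ vectors that attain the minimum of γ^⊺ π at some sampled belief state. This is a minimal sketch of the pruning idea, not the full Lovejoy iteration:

```python
import numpy as np

def prune_gammas(gammas, sample_beliefs):
    """Lovejoy-style pruning sketch: from a set of gamma vectors, keep only
    those attaining the minimum value gamma^T pi at some sampled belief
    state, discarding the rest to curb the growth of the vector set."""
    G = np.asarray(gammas)                        # shape (num_vectors, |A|)
    keep = {int(np.argmin(G @ pi)) for pi in sample_beliefs}
    return [gammas[i] for i in sorted(keep)]

# Usage: the middle vector is dominated at every sampled belief and is pruned
gammas = [np.array([1.0, 0.0]), np.array([2.0, 2.0]), np.array([0.0, 1.0])]
beliefs = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
kept = prune_gammas(gammas, beliefs)
```

The quality of the resulting upper bound depends on how densely the sampled belief states cover each partition: too few samples may prune vectors that would have been optimal elsewhere in the partition.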

D. SUB-OPTIMAL CONTROL WITH DISCRETE BELIEF SPACE
Here, we consider the SBHT control problem with a restricted belief state space that is discretized with some precision ϵ > 0 for the probability measure. Let ∆̄|A| ⊂ ∆|A| denote the resulting finite discrete space, and let ξi ∈ ∆̄|A| for 1 ≤ i ≤ m represent the discrete belief states. We use the nearest-neighbour (NN) classification boundaries in ∆̄|A| to define the Voronoi region Ni of each ξi, which is a polyhedron in ∆|A| given in (27). Fig. 4 illustrates the approximation of the belief space in R^3_+ by a discrete belief space obtained using a precision of 0.25 for the probability measure.
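The discretization and nearest-neighbour assignment can be sketched as follows; a minimal illustration with an ϵ = 0.25 grid on a 3-dimensional belief simplex, matching the setting of Fig. 4 (function names are ours):

```python
import numpy as np
from itertools import product

def discretize_simplex(dim, eps):
    """Enumerate the discrete belief states of an eps-precision grid on the
    probability simplex (all nonnegative grid points summing to 1)."""
    steps = int(round(1 / eps))
    pts = [np.array(c) * eps
           for c in product(range(steps + 1), repeat=dim)
           if sum(c) == steps]
    return np.array(pts)

def nearest_belief(pi, grid):
    """Map a belief state to its nearest-neighbour grid point, i.e., the
    representative of the Voronoi region containing pi."""
    return grid[np.argmin(np.linalg.norm(grid - pi, axis=1))]

grid = discretize_simplex(dim=3, eps=0.25)
print(len(grid))  # 15 grid points for eps = 0.25 in 3 dimensions
print(nearest_belief(np.array([0.6, 0.3, 0.1]), grid))
```

The grid size grows combinatorially in the simplex dimension and 1/ϵ, consistent with the cardinality estimate in Remark 6.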
Let ḡk be a |Y|-dimensional vector representing the indices of the belief state πk in ∆̄|A|. We denote by Ūb(ḡk, ξi) the set of all control policies μk ∈ Ū that satisfy the set of linear constraints in (28). Then, Bellman's equation in (15) for each πk−1 = ξi ∈ ∆̄|A| can be rewritten as a linear programming problem in (29). Remark 5: Due to the linear fractional constraint in (28), the optimal solution μ*k(πk−1) in this case may not necessarily be a non-randomized control sub-policy.
Remark 6: The cardinality of the belief space ∆̄|A| with precision ϵ is O(|H × Z|^(1/ϵ)). Since (29) requires computing the piecewise minimum over ∆̄|A|, finding the optimal stationary policy has a worst-case time complexity that may grow double-exponentially with respect to the cardinality of the observation space |Y|.
Note that approximating πk by the nearest discrete belief state ξj ∈ ∆̄|A| introduces an approximation error in the value function vn at each iteration n. This error can propagate through the recursive iterations and may lead to a policy that is sub-optimal against an adversary that uses a more precise belief state πk. Therefore, the trade-off between the precision of the belief state space and the computational cost needs to be considered carefully when using this approach.

VI. REINFORCEMENT LEARNING-BASED CONTROL
Although the SBHT inference control problem (16) can be solved approximately using the approaches presented in Section V, their complexity makes them computationally tractable only for low-dimensional problems. To address this challenge, we present a reinforcement learning-based control approach built on the Actor-Critic architecture [28]. Our approach, called Adversarial Model-based Deterministic Policy Gradient (AMDPG), is inspired by the Deep Deterministic Policy Gradient (DDPG) algorithm [22] and enables tractable policy computation even in high-dimensional problems. The AMDPG algorithm is presented in Alg. 1.
The goal of the AMDPG algorithm is to learn a deterministic policy that maps the current belief state to an optimal control sub-policy, based on a critic's evaluation of the quality of the action. As in the DDPG algorithm, the critic evaluates the actor by estimating the state-action value function, denoted Qμ, corresponding to any stationary policy μ. The optimal state-action value function Qμ* and the state value function v* in (16) for the optimal stationary policy μ* are related through the stationary distribution of the belief states, denoted PΠ∞(π). In the AMDPG algorithm, the actor network selects a control sub-policy at each time step by observing the belief state πk−1. The resulting belief state transition and the adversarial Bayesian reward rk|k = c⊺(ĥk)πk are stored in a replay buffer. Using random sample batches, the critic network estimates the expected reward, and the actor network updates its parameters by minimizing the expected reward via gradient descent. To map the actor network outputs to the control sub-policy action space W, the AMDPG algorithm employs an additive-logistic transformation denoted by L.
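As a sketch of the updates described above, in generic DDPG-style form (our notation, not the paper's numbered equations; note that here the actor performs gradient descent on Q, since the adversarial Bayesian reward is to be minimized):

```latex
\mathcal{L}_{\mathrm{critic}}
= \mathbb{E}\Big[\big(r_{k|k} + \rho\, Q\big(\pi_k, \bar{\mu}_{\theta}(\pi_k)\big)
  - Q\big(\pi_{k-1}, \bar{\mu}_k\big)\big)^{2}\Big],
\qquad
\nabla_{\theta} J \approx
\mathbb{E}\Big[\nabla_{\bar{\mu}}\, Q\big(\pi_{k-1}, \bar{\mu}\big)\big|_{\bar{\mu}=\bar{\mu}_{\theta}(\pi_{k-1})}\,
\nabla_{\theta}\, \bar{\mu}_{\theta}(\pi_{k-1})\Big].
```

The expectations are approximated by random mini-batches drawn from the replay buffer.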

The transformation L and its unique inverse L−1 for κ ∈ ∆n are given by the additive-logistic map. Because of the piecewise-linear structure of the reward shown in (17), the AMDPG algorithm enhances exploration by occasionally selecting an action from a pool of solutions to optimization problems that minimize the instantaneous rewards. Here, φa belongs to a finite subset of A^|Y_exp|, where Y_exp ⊆ Y contains only the top e elements of Y, chosen according to the likelihood probability of observation yk given the actor network output μ#k, subject to the constraint set Ũa defined similarly to (18); Na is the Voronoi region of a simplex vertex a ∈ A given in (27). The exploratory action μ†k is then obtained from these solutions. Further, the set of candidate vectors can be condensed by randomly selecting a pre-defined finite number of vectors from the A^Y_exp space, allowing a more manageable computation when generating adversarial model-based noise. To balance exploration and exploitation, the agent selects the exploratory action μ†k with probability ϵ1, L(μ#k + N) with probability ϵ2, and the network output L(μ#k) with probability 1 − ϵ1 − ϵ2, where N is randomly generated noise.
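The additive-logistic transformation and its inverse can be sketched as follows; this is the standard (Aitchison) additive-logistic map from R^(n−1) to the interior of the n-simplex, consistent with the description here, though the paper's exact parameterization may differ:

```python
import numpy as np

def additive_logistic(x):
    """Additive-logistic map L: R^(n-1) -> interior of the n-simplex."""
    e = np.exp(x)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)

def additive_logistic_inv(kappa):
    """Inverse map L^{-1}: interior of the n-simplex -> R^(n-1)."""
    return np.log(kappa[:-1] / kappa[-1])

x = np.array([0.3, -1.2])
kappa = additive_logistic(x)
print(np.isclose(kappa.sum(), 1.0))                   # True
print(np.allclose(additive_logistic_inv(kappa), x))   # True
```

Mapping unconstrained network outputs through L guarantees that every action is a valid point of the probability simplex, so no explicit projection step is needed.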
The presented model-free RL approach allows us to effectively handle the dynamic nature of complex CPSs, where traditional MDP dynamic programming approaches are insufficient. Note that the computational steps consist of solving a few linear programs and computing gradients of the actor and critic. Owing to the replay buffer, and depending on the computational power of the agent, these steps can be spread across a few or several time steps of the controller. Moreover, a noteworthy feature of this approach is its runtime adaptability: the policy can be learned and adjusted dynamically during system operation. This enables real-world deployability, even when system dynamics evolve or are not entirely known a priori. Overall, the AMDPG algorithm provides a computationally tractable approach for solving the SBHT inference control problem even in high-dimensional cases; our simulation results in the following sections demonstrate that it achieves competitive performance compared to other approaches.

VII. NUMERICAL STUDY WITH SYNTHETIC DATA
In this section, we present a numerical study using synthetic data to evaluate the effectiveness of the proposed approaches for the SBHT inference control problem. We consider a simple system with binary state spaces H, X, Y, and Z, three control actions D, and a horizon length of N = 96. We generate synthetic data using different HMMs with the same prior and observation probabilities but varying transition probabilities with parameters λ0, λ1 ∈ {0.2, 0.4, 0.6, 0.8}; the HMM model parameters are fixed for all k ∈ N. To investigate the attainable privacy levels in relation to various system dynamics, we develop and assess the performance of the proposed control methods for different HMM transition probabilities. We model the state transitions of the storage system using the conditional distribution P(Zk | Zk−1, Dk). We select a cost function that protects the less frequently occurring hypothesis states, since they are more informative: specifically, we set c(hi, hi) = 1/PH∞(hi), where PH∞ denotes the stationary probability of the HMM, and c(hi, hj) = 0 for all hi ≠ hj. Before evaluating the privacy control approaches presented in the previous sections, we first implement an approach based on our previous work [3], where we design a policy that minimizes the discounted Bayesian reward of an adversary who is unaware of the control system's existence. We discretize the belief states with a precision of 0.2 for the probability measure, as discussed in Section V-D, and use a discount factor of ρ = 0.9. We then evaluate this approach against an adversary with complete knowledge of the control system. Fig. 5 shows the average Bayesian reward corresponding to both the aware adversary (AA) and the unaware adversary (UA). This demonstrates that when the control system is designed for a weaker adversarial case, a stronger adversary can improve its detection performance using knowledge of the implemented control strategy. This result highlights the potential necessity of employing privacy control measures against the strongest adversaries, ensuring that the system remains secure even in worst-case scenarios.
As noted in Remark 4 in Section V-B, computing an exact optimal policy even for a simple binary system can be highly computationally complex. To illustrate this, we computed the γ vectors for an HMM with (λ0, λ1) = (0.2, 0.2) using a finite control sub-policy space ŪF. Fig. 6 shows the double-exponential growth of the total number of γ vectors in all the sets Γi(μ, n) over up to three iterations. Due to the impracticality of this approach, stemming from its excessive computational complexity, we exclude it from the following numerical study.
To design the sub-optimal policy with a finite control sub-policy space, as presented in Section V-C, we select the control sub-policy space ŪF to be the set of degenerate policies μ with transformation matrices Mπ(y, μ) such that det(Mπ(y, μ)) > 10^−5. To reduce computational complexity, we choose only the 12 degenerate policies μ with the highest min_y det(Mπ(y, μ)). To design the sub-optimal policy with a discretized belief space, as discussed in Section V-D, we use a precision of ϵ = 0.2 for the probability measure, resulting in |∆̄|A|| = 56. Here, we design two such sub-optimal policies by discretizing the belief states of an aware adversary (AA) and an unaware adversary (UA).
In addition, we simulate the Best Effort Moderation (BEM) approach [19], where the controller aims to maintain a constant metered load by charging or discharging the battery based on the previous load yk−1, the current battery state zk, and the current consumption xk. We also simulate a differential privacy (DP) mechanism with Laplacian noise of scale b = xmax/ϵ, where ϵ > 0 is a parameter denoting the level of privacy guarantee the user desires; the lower the ϵ, the higher the privacy, due to more added noise. Furthermore, to design the AMDPG control policy, we use an exploration probability of ϵ1 = 0.03, where the adversarial model-based noise is generated by solving an instantaneously optimal policy with relaxed constraints. In addition, uniformly distributed random noise in [0, 0.05] is used to generate a noisy action with an exploration probability of ϵ2 = 0.03. The actor and critic neural networks have 170 and 279 learnable parameters, respectively. We train the actor and critic for 2000 episodes, each containing 96 time slots. We set the discount factor ρ to 0.9 for all other policies to expedite convergence, but for AMDPG, we set it to 0.99.
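The simulated DP baseline can be sketched as follows; a minimal illustration of Laplacian perturbation with scale b = xmax/ϵ (the load values and xmax below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_perturb(x, x_max, eps):
    """Perturb load readings with Laplace noise of scale b = x_max / eps;
    smaller eps means heavier noise and hence stronger privacy."""
    b = x_max / eps
    return x + rng.laplace(loc=0.0, scale=b, size=np.shape(x))

loads = np.array([400.0, 800.0, 1200.0])  # hypothetical metered loads (W)
noisy_strong = laplace_perturb(loads, x_max=2000.0, eps=0.1)   # heavy noise
noisy_weak = laplace_perturb(loads, x_max=2000.0, eps=10.0)    # light noise
```

Unlike the belief-state control policies, this mechanism is memoryless and applied independently in each time slot.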
The performance of the designed control policies is evaluated against an aware adversary in Monte Carlo simulations comprising 2000 episodes. Fig. 7 shows the average Bayesian reward of the aware adversary under different control policies. The results indicate that the sub-optimal policies obtained by restricting either the control sub-policy space or the belief state space perform poorly against an informed adversary. Additionally, in this binary-state system, both the instantaneously-optimal and the AMDPG control policies yield the lowest Bayesian reward among the evaluated control policies. Furthermore, we evaluate the control policies using the adversarial precision metric. Precision is useful when one state is significantly more frequent than the other; in such cases, accuracy can be misleading, and precision gives a better picture of how well the model performs on the less frequent state. As illustrated in Fig. 8, the precision of the aware adversary follows a similar pattern as the average Bayesian reward across various HMM parameter settings, represented by λ0 and λ1. This result highlights the effectiveness of our proposed Bayesian approach in reducing the precision of adversarial inference, thereby enhancing user privacy within the system.
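The adversarial precision metric follows the standard definition, precision = TP/(TP + FP); a minimal sketch with illustrative data:

```python
import numpy as np

def adversarial_precision(h_true, h_guess, positive=1):
    """Precision of the adversary's guesses for the (less frequent) positive
    state: the fraction of 'positive' guesses that are actually correct."""
    guessed_pos = (h_guess == positive)
    if guessed_pos.sum() == 0:
        return 0.0
    return float((h_true[guessed_pos] == positive).mean())

h_true = np.array([0, 0, 1, 1, 0, 1])
h_guess = np.array([0, 1, 1, 0, 0, 1])
print(adversarial_precision(h_true, h_guess))  # 2 of 3 positive guesses correct
```

A low precision means the adversary's positive detections are mostly false alarms, which is exactly the regime the privacy controller aims to induce.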

VIII. NUMERICAL STUDY WITH REAL DATA
In this section, we present an experimental study with real data to evaluate the effectiveness of the proposed AMDPG control policy for the SBHT inference control problem.
We first describe the Co-LivEn dataset, which was collected from a multi-occupancy household and contains energy consumption data for a variety of appliances. Next, we evaluate the proposed AMDPG control policy on this dataset and compare its performance with that of a control policy designed against an unaware adversary.

A. CO-LIVEN DATASET
The Co-LivEn dataset used in this study is available in a public repository at https://zenodo.org/record/6480220. The dataset contains detailed electricity measurements of various appliances in a collective living (co-living) student household at KTH Live-in-Lab in Stockholm, Sweden. The household comprises four single rooms with attached bathrooms, a shared kitchen, and a common living room. The measurements include root-mean-square (RMS) voltage, RMS current, real power, and power factor for each appliance in the household. The data was collected at a sampling rate of one sample per second over a period of 277 days, between August 28, 2020, and May 31, 2021. This energy dataset is unique in coming from a Nordic country, providing insights into the energy consumption patterns of students living in a shared household across different seasons.
The data was collected using off-the-shelf smart plugs connected to the socket of each appliance. The smart plugs were equipped with Wi-Fi modules that transmitted the data wirelessly to a local server. The server stored the data in raw format, which was then pre-processed to eliminate missing data, outliers, and noise. The dataset contains detailed electrical measurements of 32 unique appliances, as shown in Table 1. A visualization of the appliance usage data over a single day, derived from the Co-LivEn dataset, is shown in Fig. 9; note that the figure does not depict all appliances, as those with low power consumption are difficult to visualize and have been excluded for clarity. To facilitate access, the data is available in two formats: a compressed file called "appliance_csv.zip", containing the data in plain CSV format, and a compressed file called "appliance_mat.zip", containing the data in MATLAB file format. Both files are publicly accessible and can be downloaded from the repository.
The dataset is organized into folders according to the location of the appliances, such as the common living room, the kitchen, and each individual room. Each location folder contains a folder for each appliance, and within each appliance folder there is a separate file for each day of data collection. This structure allows easy navigation and selection of specific appliances and time periods of interest. The high resolution and wide range of appliances in the co-living household energy dataset make it a valuable resource for evaluating the effectiveness of proposed control policies in a real-world setting.

B. EVALUATION OF THE AMDPG CONTROL POLICY
In this section, we present an experimental study with real data to evaluate the effectiveness of the proposed AMDPG control policy for the SBHT inference control problem. We performed numerical simulations using the Co-LivEn dataset. Specifically, we consider a scenario where users aim to conceal their cooking activities during the daytime to prevent potential disclosure of their presence at home. To model the system, we combine the consumption of high-power kitchen appliances, such as the stove and oven, and define a hypothesis state with two possible outcomes, representing whether at least one of them is on or all of them are off. The consumption from all other appliances is treated as independent noise. We used the first 60% of the dataset to train the HMM parameters (using the FHMM approach [29]) and the remaining 40% to evaluate the designed control policy. In the simulations, we used 5-minute time slots between 10:00 and 14:00 each day, resulting in a horizon length of 48. We discretized the mean power consumption data in each time slot using a 400 W quantization and set |X| = |Y| = 5. Additionally, we consider a 48 V, 30 Ah battery with |Z| = 35. As a result, the belief state dimension is |A| = 70 and the control sub-policy dimension is |W| = 3500. To reduce the computational complexity, we generated parameters of a three-circuit energy storage model [3] at a higher discretization value than the storage state Zk using Monte Carlo simulations, and used them to estimate state transitions at each discrete state Zk falling within the high-level state. The estimated storage state transition probability P(Zk | Zk−1, Dk) was used to simulate a battery in both the reinforcement learning and evaluation phases. Further, to design the AMDPG control policy, we use a discount factor of ρ = 0.99, an exploration probability of ϵ1 = 0.03, a random noise probability of ϵ2 = 0.03, and actor and critic neural networks with 8 × 10^6 and 19.5 × 10^6 learnable parameters, respectively. We train the actor and critic for 15000 episodes, each containing 48 time slots. In addition, we implement the approach based on our previous work [3], where we design a policy that minimizes the discounted Bayesian reward of an adversary who is unaware of the control system's existence, using discretized belief states with a precision of 0.2. The performance of the designed control policies is evaluated against an aware adversary in Monte Carlo simulations comprising 2000 episodes, each picked at random from the 111 episodes (40% of the dataset) reserved for evaluation. We also simulate the BEM approach [19] and a differential privacy mechanism with Laplacian noise for ϵ ∈ {0.1, 1, 10}.
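The 400 W quantization of the mean slot power can be sketched as follows; this sketch assumes floor quantization with saturation at the top level, a detail the text does not specify exactly (values are hypothetical):

```python
import numpy as np

def quantize_power(p_watts, step=400.0, n_levels=5):
    """Map mean power in a time slot to one of n_levels discrete states
    using a fixed quantization step (400 W and |X| = |Y| = 5 in the study)."""
    levels = np.floor(np.asarray(p_watts) / step)
    return np.minimum(levels, n_levels - 1).astype(int)

slots = [120.0, 450.0, 950.0, 1700.0, 2600.0]  # hypothetical 5-min mean loads (W)
print(quantize_power(slots).tolist())          # [0, 1, 2, 4, 4]
```

Any load above the top bin saturates at the highest state, keeping the observation alphabet finite as required by the SBHT formulation.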
Table 2 shows the average Bayesian reward and precision of the aware adversary under the designed control policies. With the original data, the adversarial precision in identifying the cooking (stove and oven) state of the household is 0.6; that is, when the adversary guesses that someone is using the stove or oven, it is correct around 60% of the time. With the proposed AMDPG control policy, the precision drops to 0.29, a 52% reduction compared to the original data, demonstrating its effectiveness in reducing privacy risk. The BEM and differential privacy (ϵ = 0.1) approaches also perform reasonably well in this case, achieving precisions of 0.4 and 0.33, respectively. Although these heuristic approaches perform relatively well compared to the original data, they are not guaranteed to work as well in other cases, as they operate on pre-defined rules. In addition, we observe that when using a control policy designed against an unaware adversary, the aware adversarial precision actually increases to 0.95. This result further emphasizes the importance of employing a control policy against a worst-case adversary. In this study, we evaluated the effectiveness of the proposed control strategies in a privacy scenario concerned with hiding cooking patterns; other interesting scenarios could involve hiding occupancy patterns, electric vehicle ownership, or usage patterns of entertainment devices such as TVs and stereos. The MATLAB code used for computing the control policies is publicly available at https://github.com/r2avula/AdversarialInferenceControl. In this work, we use YALMIP [30], MPT3 [31], and Gurobi [32] for mathematical modeling and optimization.

IX. CONCLUSION
In this paper, we presented a Bayesian approach to control adversarial inference and address the physical-layer privacy problem in CPSs. We considered a worst-case privacy scenario, assuming an adversary with complete knowledge of the user's control strategy and modeling the adversary's inferences using SBHT. We employed the MDP framework to quantify privacy leakage in the physical layer by calculating the Bayesian risk (adversarial reward) in the SBHT.
For finite state-space problems, we derived the fixed-point Bellman's equation for an optimal stationary strategy and proposed practical optimization-based control design approaches to solve it. While these optimization-based methods can produce finite- or infinite-horizon optimal policies by discretizing either the belief state or the sub-policy space, they are not computationally tractable for high-dimensional problems; however, they can serve as useful benchmarks for smaller-scale, toy problems. To tackle the computational complexity of exact optimal policies for high-dimensional state-space problems, we introduced the Adversarial Model-based Deterministic Policy Gradient (AMDPG) RL algorithm, providing a more practical solution for protecting privacy against adversaries with perfect knowledge of the user's control strategy in complex systems.
The numerical simulations with a toy problem demonstrate that when the control system is designed to counter weaker adversaries, a stronger adversary can enhance its detection performance by acquiring knowledge of the implemented control strategy. We also found that the achievable privacy depends on the HMM transition probabilities, implying that some HMM systems inherently carry higher risks than others. In a binary state-space system, both the instantaneously optimal and the proposed AMDPG strategies achieve the minimum Bayesian risk among the evaluated strategies.
Additionally, we presented the Co-LivEn dataset, a publicly available energy consumption dataset containing comprehensive electrical measurements of appliances in a co-living household. Using this dataset, we benchmarked the proposed AMDPG strategy and compared it with a control strategy designed for a controller-unaware adversary. Notably, the AMDPG control policy significantly reduced the aware adversary's precision compared to the original data, indicating its effectiveness in mitigating privacy risks. The results reveal that a control policy designed against an unaware adversary not only fails to achieve the primary objective of minimizing adversarial performance, but inadvertently helps the aware adversary improve its performance relative to the original data. This further emphasizes the importance of implementing a control policy against a worst-case adversary.
In conclusion, the proposed Bayesian privacy control approach and the RL-based policy design can help mitigate privacy risks and limit information leakage in CPSs. The Co-LivEn dataset supports smart meter privacy research by offering real-world data for benchmarking and comparing privacy-enhancing techniques. Overall, this work contributes to the advancement of privacy-enhancing techniques for CPSs, enabling the full realization of the benefits these systems provide while safeguarding user privacy.

FIGURE 2. Information flow in the adversarial SBHT detection process with a control policy of the form P(Yk | Xk, Hk, Ik).

FIGURE 5. Comparison of the average Bayesian rewards of the aware adversary (AA) and the unaware adversary (UA) when the control system is designed to minimize the Bayesian reward of the UA.

FIGURE 6. Double-exponential growth of γ vectors for a binary-state problem using the exact optimal approach in Section V-B.

FIGURE 7. Comparison of the average Bayesian rewards of the aware adversary when using different inference control approaches.

FIGURE 8. Comparison of the precision of the aware adversary when using different inference control approaches.

FIGURE 9. Visualization of appliance usage data over a single day, obtained from the Co-LivEn dataset.

TABLE 1. Appliance types by location.

TABLE 2. Comparison of the aware adversarial performance using the AMDPG control policy and the policy designed by discretizing the unaware adversarial belief state.