Online convex optimization for data-driven control of dynamical systems

We propose an algorithm based on online convex optimization for controlling discrete-time linear dynamical systems. The algorithm is data-driven, i.e., it does not require a model of the system, and is able to handle a priori unknown and time-varying cost functions. To this end, we make use of a single persistently exciting input-output sequence of the system and results from behavioral systems theory, which enable the algorithm to handle unknown linear time-invariant systems. Moreover, we consider noisy output feedback instead of full state measurements and allow for general economic cost functions. Our analysis of the closed loop reveals that the algorithm achieves sublinear regret, where the measurement noise only adds a constant term to the regret upper bound. In order to do so, we derive a data-driven characterization of the steady-state manifold of an unknown system. Moreover, our algorithm asymptotically estimates the measurement noise exactly. The effectiveness and practical aspects of the proposed method are illustrated by means of a detailed simulation example on thermal control.


I. INTRODUCTION
This paper considers the problem of controlling an unknown linear time-invariant (LTI) system subject to time-varying and a priori unknown convex cost functions. In particular, we aim to minimize the accumulated cost obtained by our proposed algorithm in closed loop with the unknown system. The main difficulty arises from the fact that the cost functions are time-varying and a priori unknown, i.e., the cost function L_t at time t is only revealed to us at time step t + 1. These kinds of problems commonly arise in practice, e.g., in power grids due to a priori unknown renewable energy generation and unknown energy consumption [1], in data center cooling [2], or in robotics [3]. Our approach is inspired by online convex optimization (OCO) [4], [5], an online variant of classical numerical optimization. Whereas the classical OCO literature does not consider underlying dynamical systems, OCO has recently gained significant interest for solving optimal control tasks. Its main advantages include its ability to handle a priori unknown and time-varying cost functions, its low computational complexity, and its ability to take constraints on the state and the input of the system into account. OCO-based algorithms have been proposed to control linear dynamical systems [6], [7] subject to process noise [8], [9] or constraints [10], [11], and with output feedback [12].
Most of the existing OCO-based algorithms in the literature discussed above depend crucially on model knowledge of the system. However, obtaining such a model can be difficult or expensive in certain applications. Hence, in recent years, direct data-based control approaches have received a considerable amount of attention, compare, e.g., [13]. In this work, we employ a result from behavioral systems theory. The so-called fundamental lemma shows that a Hankel matrix consisting of a single persistently exciting input-output trajectory spans the whole vector space of all possible input-output trajectories of an LTI system [14]. This result has recently drawn significant attention and has been applied to solve a variety of control problems, e.g., model predictive control (MPC) [15], [16], state- and output-feedback design [17]-[21], and output matching [22]. We combine the fundamental lemma with OCO in order to control dynamical systems subject to time-varying cost functions, where neither the system nor the cost functions are known to the algorithm.
Another closely related line of research is so-called optimal steady-state (OSS) control. Therein, a system is controlled to the solution of a (possibly time-varying) optimization problem by applying gradient-based feedback and, typically, asymptotic guarantees in the form of stability of the overall system are derived [23], [24]. Again, the main focus in the literature is on model-based control with process noise and output feedback [3], [25], [26]. In [27], a data-driven method is proposed for regulating the output of a general nonlinear system, subject to a constant disturbance, to the optimal steady state of a constant cost function. In particular, the authors leverage a result from zeroth-order optimization in order to avoid requiring model knowledge of the controlled system. However, performance is only analyzed in terms of the second moments of the gradients of a smooth approximation of the cost function. Most relevant to this work is [28], where output feedback and unknown systems subject to disturbances are treated by application of the fundamental lemma. To this end, a steady-state map between the input and the output of the unknown system is estimated using only measured data. However, the cost functions are assumed to be constant and time-variability of the optimization problem is only introduced via time-varying process noise. Moreover, analysis of the closed loop's transient behavior is limited to contraction with respect to the optimal steady state and does not consider the transient cost in terms of regret.
The contribution of this work is fivefold. First, we consider an unknown system by leveraging results from data-driven control. Compared to alternative approaches in the literature, we thereby remove the need for a (set-based) model description and for an online estimation process. Second, we extend our previous results on OCO-based control [6], [10] to the case of output feedback instead of full state measurements, which requires considerable adjustments in algorithm design and analysis techniques. Third, we consider noise in the measurement process. In the relevant literature, e.g., [9], [11], [28], the main research focus is on systems subject to process noise, which is typically handled by estimating the process noise using exact measurements and model knowledge. We instead consider only noisy measurements in our theoretical work and leave the combination of both process and measurement noise as an interesting topic for future research. We do, however, consider both types of noise in our simulation example. Fourth, we generalize previous work [6], [10] by considering the practically relevant case of economic cost functions, i.e., the minimum of the cost functions at each time step need not be a steady state of the system. Finally, we derive a new data-driven characterization of the steady-state manifold of an LTI system by leveraging the fundamental lemma. As a main result, our analysis reveals that our proposed algorithm enjoys sublinear regret without access to a system model or exact measurements. This paper is organized as follows. In Section II, we present the basic notions necessary for our work and discuss the problem of interest. Section III introduces and illustrates our proposed algorithm. In Section IV, we discuss our theoretical findings, in particular a regret analysis of the closed loop and asymptotic convergence of the measurement error estimates. A numerical simulation example, namely a thermal control problem, illustrates the closed-loop performance and practical aspects of our algorithm in Section V. Section VI concludes the paper.
We close this section by noting that a preliminary version of parts of this paper was presented at the 2021 60th IEEE Conference on Decision and Control (CDC) [29]. This work extends the previously presented results in three directions. First, we consider measurement noise in this work and study its effect on the derived regret bound, which requires adaptations in both algorithm design and theoretical analysis. We show that measurement noise only leads to an additional constant term in the regret bound compared to our previous work. Second, we generalize our work to consider economic cost functions, as discussed above. Third, we remove restrictive assumptions on the steady-state manifold, compare [29, Assumption 1], in order to be able to control a wider class of systems. Moreover, we include a detailed simulation example to illustrate the applicability of our proposed algorithm.
Notation: We denote the set of integers in the interval [a, b] and the set of integers greater than or equal to zero by I_{[a,b]} and I_{≥0}, respectively. For a vector x ∈ R^n, ‖x‖ is the Euclidean norm, and for a matrix A ∈ R^{n×m}, ‖A‖ is the corresponding induced matrix 2-norm, whereas its Moore-Penrose pseudoinverse is denoted by A^†. The identity matrix of size n × n is given by I_n, 1_n ∈ R^n denotes the vector of all ones, and 0_n ∈ R^n is the vector of all zeros. A sequence {z_k}_{k=0}^{N-1}, z_k ∈ R^n, induces the Hankel matrix of depth L

H_L(z) = [ z_0 z_1 ... z_{N-L} ; z_1 z_2 ... z_{N-L+1} ; ... ; z_{L-1} z_L ... z_{N-1} ].

For a matrix M consisting of L block rows, we denote the submatrix containing block rows a through b by M_{a:b}, and we write M_a for the a-th block row. With a slight abuse of notation, we write z for the sequence itself as well as for the stacked vector of all its components. We denote by z_{[a:b]} = [z_a^⊤ ... z_b^⊤]^⊤ the stacked vector of a subset of its components. The shift operator σ is defined by σz = {z_k}_{k=1}^{N-1}. For matrices A and B, A ⊗ B denotes the Kronecker product.
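For illustration, a minimal numpy sketch of the Hankel matrix construction used throughout the paper is given below; the function name and array layout are our own choices.

```python
import numpy as np

def hankel_matrix(z: np.ndarray, L: int) -> np.ndarray:
    """Build the depth-L (block) Hankel matrix of a sequence.

    z is an (N, n) array whose rows are z_0, ..., z_{N-1}; the result has
    L block rows (each of height n) and N - L + 1 columns.
    """
    N, n = z.shape
    cols = N - L + 1
    H = np.empty((L * n, cols))
    for j in range(cols):
        # column j stacks z_j, z_{j+1}, ..., z_{j+L-1}
        H[:, j] = z[j:j + L, :].reshape(-1)
    return H

# small sanity check with the scalar sequence 0, 1, ..., 5 and depth L = 3
z = np.arange(6, dtype=float).reshape(-1, 1)
print(hankel_matrix(z, 3))
# [[0. 1. 2. 3.]
#  [1. 2. 3. 4.]
#  [2. 3. 4. 5.]]
```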

II. SETTING
We consider linear time-invariant (LTI) systems of the form

x_{t+1} = A x_t + B u_t,
y_t = C x_t + D u_t,        (1)
ỹ_t = y_t + e_t,

where x_t ∈ R^n is the system state, u_t ∈ R^m is the system input, y_t ∈ R^p is the true system output, ỹ_t ∈ R^p is the measured system output, and e_t ∈ R^p denotes measurement noise at time instance t. We denote by z_t = [u_t^⊤ y_t^⊤]^⊤ the stacked input-output pair at time t. The system matrices (A, B, C, D) as well as the noise e_t are unknown, and only measurements of u_t and ỹ_t are available to us. We do not impose any assumptions on the measurement noise e. We make the following assumptions on system (1).

Assumption 1. The matrix A is Schur stable, the pair (A, B) is controllable, and the pair (A, C) is observable.
Controllability and observability are standard assumptions in the literature [28]. Compared to [29], we only consider stable systems because of the additional measurement noise. In this setting, we can estimate the measurement error asymptotically exactly if the system is stable (compare Lemma 3 below). If the system is not stable, data-based techniques from, e.g., [19], [20] can be used to stabilize the (unknown) system. Our algorithm can then be applied to the prestabilized system. However, some of our theoretical guarantees deteriorate for this approach; compare Remark 2 for more details.
Our goal is to solve the optimal control problem

min_{u_0, ..., u_T} Σ_{t=0}^{T} L_t(u_t, y_t)  subject to the dynamics (1),        (2)

where the main difficulty arises from the fact that the time-varying cost functions L_t : R^m × R^p → R are a priori unknown. Specifically, we want to find a controller that computes an input u_t at every time instance t which is applied to system (1) and yields performance close to the solution of (2). Only after u_t is applied to system (1) is the cost function L_t revealed, i.e., u_t is computed by the algorithm without knowledge of the current cost function. Then, we measure the noisy output ỹ_t and move to the next time step. As standard in OCO, we do not attempt to solve (2) directly at each time step [4], [5]. Since the cost functions are a priori unknown, optimization would have to be carried out based on the last known cost function L_{t-1}. Then, open-loop optimization will in general not improve the closed-loop performance, due to the time-varying nature of the cost functions. Therefore, we aim to design a computationally efficient algorithm instead of solving a (potentially large-scale) optimization problem at each step. We denote the solution to (2) in hindsight, i.e., the solution when knowing all cost functions, by u^* = {u^*_t}_{t=0}^{T} and the corresponding system output by y^* = {y^*_t}_{t=0}^{T}. As common in OCO, we consider smooth convex cost functions as specified in Assumption 2.
Assumption 2. The cost functions L_t(z) are
• α_z-strongly convex, i.e., there exists α_z > 0 such that
L_t(z_1) ≥ L_t(z_2) + ∇L_t(z_2)^⊤ (z_1 − z_2) + (α_z/2) ‖z_1 − z_2‖²,
• l_z-smooth, i.e., there exists l_z > 0 such that
‖∇L_t(z_1) − ∇L_t(z_2)‖ ≤ l_z ‖z_1 − z_2‖,
• and Lipschitz continuous with Lipschitz constant L_z, i.e., there exists L_z > 0 such that
|L_t(z_1) − L_t(z_2)| ≤ L_z ‖z_1 − z_2‖,
for all t ∈ I_{≥0} and any two points z_1, z_2 ∈ R^{m+p}.
Remark 1. We assume Lipschitz continuity for clarity of exposition of our results, even though l_z-smoothness and Lipschitz continuity cannot be satisfied globally simultaneously. However, if u_t and y_t remain within bounded sets for all time, Assumption 2 is satisfied on this bounded set. Moreover, techniques from [6] can be used to avoid assuming Lipschitz continuity. In this case, all triangle inequalities in the proof of Theorem 2 are replaced by Jensen's inequality, which entails additional assumptions on the step size and the condition number l_z/α_z of the cost functions. Moreover, changing the regret definition below to R = Σ_{t=0}^{T} ‖(u_t, y_t) − (η_t, θ_t)‖ also removes the necessity to assume Lipschitz continuity of the cost functions.
Characterizing the solution to (2), i.e., u^* and y^*, for general time-varying cost functions L_t requires optimization or verifying certain dissipativity conditions [30]-[32] and is thus computationally expensive. For a priori unknown cost functions as considered in this work, computing u^* and y^* online is impossible altogether. Instead, we adopt a strategy of tracking the a priori unknown, time-varying optimal steady states given by

ζ_t = [η_t^⊤ θ_t^⊤]^⊤ ∈ arg min_{(u^s, y^s) equilibrium of (1)} L_t(u^s, y^s),

where η_t ∈ R^m is the optimal steady-state input and θ_t ∈ R^p is the optimal steady-state output of system (1) at time t. In the case of constant convex cost functions L, steady-state operation is optimal [33]; hence, we expect that the proposed strategy yields good performance in many practical applications, in particular if the cost functions L_t do not change too frequently. Note that the setting considered here includes as a special case our previous works [6], [10], [29], where only strongly convex, smooth cost functions were considered that are each positive definite with respect to some (time-varying) steady state (η_t, θ_t) of the system. Here, we consider more general convex cost functions that do not need to satisfy this requirement. Such cost functions often occur in practice related to some economic considerations, such as minimization of energy cost (compare the example in Section V), which is why such cost functions have been termed economic in the context of model predictive control (see, e.g., [34]-[36]). As common in OCO, we analyze our controller's closed-loop performance in terms of regret. In light of our strategy of tracking a priori unknown and time-varying optimal steady states of system (1), we define the regret R as

R = Σ_{t=0}^{T} ( L_t(u_t, y_t) − L_t(η_t, θ_t) ),

i.e., the accumulated difference between the closed-loop cost of our controller and the optimal steady-state cost in hindsight. The regret R is a measure of the performance lost due to not knowing the cost functions L_t a priori. In the literature, the goal is commonly to achieve sublinear regret, i.e.,

lim sup_{T→∞} R/T ≤ 0.

Hence, if the proposed algorithm achieves sublinear regret, then the closed-loop cost is asymptotically on average no worse than the optimal steady-state cost. Such a performance result is typically also considered in the context of economic model predictive control (MPC), compare, e.g., [33], [35].
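For illustration only (this is not part of the proposed algorithm), the regret defined above can be evaluated from recorded closed-loop data as in the following sketch, where the cost functions and trajectories are placeholders.

```python
import numpy as np

def dynamic_regret(costs, z_closed_loop, zeta_opt):
    """R = sum_{t=0}^{T} ( L_t(u_t, y_t) - L_t(eta_t, theta_t) ).

    costs:         list of callables L_t acting on a stacked vector z = (u, y)
    z_closed_loop: list of stacked closed-loop input-output pairs z_t
    zeta_opt:      list of optimal steady states zeta_t = (eta_t, theta_t)
    """
    return sum(L(z) - L(zeta)
               for L, z, zeta in zip(costs, z_closed_loop, zeta_opt))

# sublinear regret means dynamic_regret(...) / (T + 1) -> 0 as T grows
```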
Since we do not assume knowledge of the system matrices (A, B, C, D), we assume that we have access to measurement data in the form of a prerecorded input-output trajectory {u^d_k, y^d_k}_{k=0}^{N-1} of system (1) and an upper bound on the system order n. Note that we require the true system output as data instead of the (noisy) measured system output. Such data can be obtained in practice when, e.g., the prior data is recorded in a laboratory setting using more accurate measuring instruments than during online operation.

Assumption 3. The prerecorded input-output trajectory {u^d_k, y^d_k}_{k=0}^{N-1} of system (1) is noise free.

Moreover, we assume that the data sequence is persistently exciting as defined in Definition 1.

Definition 1. A sequence {u_k}_{k=0}^{N-1}, u_k ∈ R^m, is persistently exciting of order L if the Hankel matrix H_L(u) has full row rank, i.e., rank(H_L(u)) = mL.
This definition allows us to characterize all possible system trajectories of (1) using only Hankel matrices of the data sequence. This result was first published in the context of behavioral systems theory [14] and can be formulated in the classical state-space setting as follows.
Theorem 1 ([14]). Suppose {u^d_k, y^d_k}_{k=0}^{N-1} is a trajectory of system (1), where u^d is persistently exciting of order L + n, and let Assumption 3 be satisfied. Then, {ū_k, ȳ_k}_{k=0}^{L-1} is a trajectory of system (1) if and only if there exists α ∈ R^{N-L+1} such that

[ H_L(u^d) ; H_L(y^d) ] α = [ ū ; ȳ ].

As discussed above, we aim to track a series of a priori unknown steady states without access to a model of the system. Therefore, a data-driven definition of steady states is given in Definition 2.
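As a brief aside, a minimal numerical sketch of how Theorem 1 can be used to check whether a candidate sequence is consistent with the prerecorded data is given below; the function and argument names are our own choices.

```python
import numpy as np

def is_system_trajectory(u_bar, y_bar, Hu, Hy, tol=1e-8):
    """Fundamental lemma check: (u_bar, y_bar) is a length-L trajectory of the
    unknown system iff it lies in the image of the stacked data Hankel matrices.

    Hu, Hy:       depth-L Hankel matrices H_L(u^d), H_L(y^d) of the prior data
    u_bar, y_bar: stacked candidate input and output sequences of length L
    """
    H = np.vstack([Hu, Hy])
    w = np.concatenate([u_bar, y_bar])
    alpha, *_ = np.linalg.lstsq(H, w, rcond=None)   # best-fit coefficients alpha
    return np.linalg.norm(H @ alpha - w) <= tol
```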
Definition 2. An input-output pair (u^s, y^s) is an equilibrium of (1) if the sequence {u_k, y_k}_{k=0}^{n} with (u_k, y_k) = (u^s, y^s) for all k ∈ I_{[0,n]} is a trajectory of (1).

Definition 2 states that an input-output pair (u^s, y^s) is an equilibrium of system (1) if and only if a sequence consisting of (u^s, y^s) for at least n + 1 consecutive time steps is a trajectory of the system. We make use of Definition 2 and the prerecorded data sequence to characterize the steady-state manifold of system (1) in Lemma 1.
Lemma 1. Let Assumption 3 be satisfied. Assume that the sequence u^d is persistently exciting of order 2n + 1. Then, the input-output pair (u^s, y^s) is an equilibrium of (1) if and only if

( I − H_{n+1}(z^d) H_{n+1}(z^d)^† ) ( 1_{n+1} ⊗ [u^s; y^s] ) = 0,

where z^d_k = [u^{d⊤}_k y^{d⊤}_k]^⊤ denotes the stacked data sequence.

Proof: By Definition 2 and Theorem 1, (u^s, y^s) is a steady state of (1) if and only if there exists ν ∈ R^{N-n} such that

H_{n+1}(z^d) ν = 1_{n+1} ⊗ [u^s; y^s].      (4)

The general solution to this equation is given by

ν = H_{n+1}(z^d)^† ( 1_{n+1} ⊗ [u^s; y^s] ) + ( I − H_{n+1}(z^d)^† H_{n+1}(z^d) ) w,

where w ∈ R^{N-n} can be chosen arbitrarily. Since the second term on the right-hand side of the above expression is in the null space of H_{n+1}(z^d), inserting ν back into (4) yields

H_{n+1}(z^d) H_{n+1}(z^d)^† ( 1_{n+1} ⊗ [u^s; y^s] ) = 1_{n+1} ⊗ [u^s; y^s],

which proves the result. ∎

Lemma 1 explicitly characterizes the steady-state manifold of system (1) only in terms of Hankel matrices of the prerecorded data sequence.
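A minimal numerical sketch of the equilibrium condition in Lemma 1 is given below; it evaluates the projector condition derived in the proof, and the function name and tolerance are our own choices.

```python
import numpy as np

def is_equilibrium(u_s, y_s, Hz, tol=1e-8):
    """Check the condition of Lemma 1 for a candidate equilibrium (u_s, y_s).

    Hz is the depth-(n+1) Hankel matrix H_{n+1}(z^d) of the stacked
    input-output data z^d_k = (u^d_k, y^d_k).
    """
    n_plus_1 = Hz.shape[0] // (u_s.size + y_s.size)
    z_s = np.concatenate([u_s, y_s])
    target = np.tile(z_s, n_plus_1)                 # 1_{n+1} (x) z_s
    residual = target - Hz @ (np.linalg.pinv(Hz) @ target)
    return np.linalg.norm(residual) <= tol
```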

III. ALGORITHM
In this section, we introduce our algorithm. For notational convenience, we define U = H_{2n+µ+1}(u^d) and Y = H_{2n+µ+1}(y^d), i.e., the Hankel matrices associated with the system input and output, respectively. In addition, we denote by S a matrix whose null space is the steady-state manifold characterized in Lemma 1, i.e., S = (I − H_{n+1}(z^d) H_{n+1}(z^d)^†)(1_{n+1} ⊗ I_{m+p}), such that (u^s, y^s) is an equilibrium of (1) if and only if S [u^s; y^s] = 0. The proposed data-driven OCO scheme is given in Algorithm 1.

Algorithm 1: Data-Driven Output Feedback. Given a step size γ, a prediction horizon µ, an initialization, and data (u^d, y^d), at each time t: estimate the measurement noise via (5), choose α_t such that (6) holds, collect the variables for the gradient step in (7), perform the projected online gradient step (8), compute the steering sequence β_t via (9), update the predicted input sequence via (10), apply the input u_t via (11), measure ỹ_t and receive L_t, set t = t + 1, and go to (5).

In the framework described above, at every time instance t, Algorithm 1
1) computes an input u_t via (5)-(11) and applies it to system (1),
2) measures the output ỹ_t and receives the cost function L_t,
3) moves to time step t + 1.
Roughly speaking, Algorithm 1 estimates the measurement noise by relying on its own predictions, applies online gradient descent (OGD) to estimate the optimal equilibrium of system (1), and calculates an input sequence that reaches the estimated optimal steady state. The whole procedure is illustrated in Fig. 1.
In more detail, an estimate of the measurement noise is computed in (5) by comparing the measured output ỹ_{t-1} to the output predicted at the previous time step, Y_{n+1}(α_{t-1} + β_{t-1}). The estimated measurement noise is then used in combination with the last n inputs u_{[t-n:t-1]} and outputs y_{[t-n:t-1]} in (6) to initialize a prediction step. As an input for the prediction, we take the shifted previously predicted input sequence σû_{t-1} and append the previously estimated optimal steady-state input u^s_{t-1}. Thus, α_t in (6) encodes the prediction at time t. In (7), the previously estimated steady-state input u^s_{t-1} and the µ-step-ahead prediction Y_{n+µ+1} α_t are collected in preparation for the projected OGD step in (8), where the parameter µ can be interpreted as the prediction horizon of Algorithm 1. As common in OGD, we perform one gradient descent step in (8) based on the previous cost function L_{t-1}, since we do not have access to the current cost function L_t yet. Note that multiplication by I_{m+p} − S^†S is equivalent to orthogonal projection onto the null space of S, which corresponds to the steady-state manifold by Lemma 1. Thus, in (8), we perform one online gradient descent step and project it onto the steady-state manifold of system (1). The resulting input-output pair z^s_t can be regarded as an estimate for the optimal steady state. In (9), we compute an input sequence which, if applied in addition to the input sequence used for prediction (i.e., U α_t), reaches the estimated optimal steady state z^s_t in µ steps and remains at z^s_t for another n + 1 steps in order to ensure that the unknown internal state x_t of system (1) reaches the desired steady state. In order to be able to reach the estimated optimal steady state z^s_t in µ time steps, we require the prediction horizon µ to be sufficiently long, as formalized in the following assumption.

Assumption 4. The prediction horizon satisfies µ ≥ µ^*, where µ^* denotes the controllability index of system (1).

Note that n ≥ µ^* always holds. Since we require an upper bound on the system order to be available, it is therefore possible to satisfy Assumption 4 without knowing µ^*. However, simulations suggest that a shorter prediction horizon can sometimes be beneficial for the algorithm's performance, since decreasing the prediction horizon forces the algorithm to reach the desired steady state z^s_t in fewer steps in (9), resulting in a more aggressive controller. Finally, we update the predicted input sequence û_t in (10): the predicted input sequence is updated by shifting it, appending u^s_{t-1}, and adding the input sequence encoded in β_t, which steers the system to the new estimate of the optimal steady state z^s_t. Last, the first part of û_t is applied to system (1) in (11).
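The projected gradient step (8) described above can be sketched in a few lines. The following is a minimal illustration, not the paper's exact implementation; the function name, argument layout, and the placeholder matrix S are assumptions based on the description above.

```python
import numpy as np

def projected_ogd_step(z_pred, grad_L_prev, S, gamma):
    """One online gradient descent step projected onto the steady-state manifold.

    z_pred:      stacked vector of the previous steady-state input estimate and
                 the mu-step-ahead output prediction (cf. (7))
    grad_L_prev: gradient of the previously revealed cost L_{t-1} at z_pred
    S:           matrix whose null space is the steady-state manifold (Lemma 1)
    gamma:       step size, gamma <= 2 / (alpha_z + l_z)
    """
    m_plus_p = z_pred.size
    P = np.eye(m_plus_p) - np.linalg.pinv(S) @ S   # orthogonal projector onto null(S)
    return P @ (z_pred - gamma * grad_L_prev)
```

The returned vector plays the role of the new steady-state estimate z^s_t, which is then reached by the steering sequence computed in (9).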
The matrix Q in the cost function of (9) can be tuned to achieve satisfactory performance, e.g., Q = I minimizes the norm of β_t and can be beneficial if there is process noise affecting system (1), Q = U minimizes the input difference needed to steer the system to z^s_t (instead of z^s_{t-1}), and Q = Y similarly minimizes the deviation of the system's output from the predicted output. Moreover, weighted combinations are possible by stacking the matrices I, U, and Y in Q (compare Section V).
Since we compute input-output sequences of length 2n + µ + 1 in Algorithm 1 (n steps for initialization, µ steps for prediction, and n + 1 steps as a terminal constraint ensuring steady-state operation), by Theorem 1 we need persistency of excitation of order 3n + µ + 1.
Finally, we derive explicit formulas to solve (6) and (9) in Algorithm 1. In order to do so, we need to ensure that (6) and (9) always have a feasible solution, which is guaranteed if their respective right-hand sides describe (parts of) valid input-output sequences of system (1) by Theorem 1.

Lemma 2. Let Assumptions 3 and 4 be satisfied and suppose that u_{[-n:-1]} and ỹ_{[-n:-1]} − ê_{[-n:-1]} form a valid n-step trajectory of system (1). Then, (6) and (9) have a feasible solution at all times t.

Proof: Assume that at time step t, u_{[t-n:t-1]} and ỹ_{[t-n:t-1]} − ê_{[t-n:t-1]} are a valid n-step trajectory of system (1). Then, α_t can be chosen according to (6), compare [22]. Moreover, by Assumption 4, there exists a β'_t steering the system to the estimated optimal steady state, since system (1) can be steered from 0 to any steady state in µ steps by controllability, and there exists a β''_t reproducing the trajectory encoded by α_t, by Assumption 4 and because α_t encodes a valid input-output sequence. Since sums of input-output sequences of a linear time-invariant system are input-output sequences of the same system, there exists a solution to (9) at time step t given by β_t = β'_t − β''_t. Then, at time step t + 1, the right-hand side of (6) is a valid input-output sequence because of Theorem 1 and because U_{2:n+1}(α_t + β_t) = u_{[t-n+1:t]} by (6) and (11). Thus, (6) and (9) have a feasible solution at all times t by induction. ∎

In practice, at time step t = 0, the algorithm can be initialized with some û_{-1} and u^s_{-1} (using a weighting factor λ ∈ R_{≥0}) instead of solving (5)-(6). Note that one solution to (6) is given by the pseudo-inverse

α_t = H_α^† h_t,      (12)

where H_α denotes the matrix on the left-hand side of (6) and h_t its right-hand side. Moreover, if Q^⊤Q ≻ 0, i.e., Q^⊤Q is positive definite, then the unique solution to (9) is given by the weighted pseudo-inverse [38], which we refer to as (13), where g_t is the right-hand side of (9). If Q^⊤Q is only positive semidefinite, then the solution to (9) is not unique and (13) is only one possible solution.
In the following, we assume that (9) is solved using (13) in both cases. Thus, the necessary online computations in Algorithm 1 reduce to one gradient evaluation and multiple matrix-vector multiplications.
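As a minimal sketch, assuming (9) amounts to minimizing ‖Qβ_t‖ subject to a linear consistency constraint (our reading of the discussion of Q above), the plain and weighted pseudo-inverse solutions corresponding to (12) and (13) could be evaluated as follows; the helper names are ours.

```python
import numpy as np

def min_norm_solution(H, g):
    """Minimum 2-norm solution of a consistent linear system H @ x = g."""
    return np.linalg.pinv(H) @ g

def weighted_min_norm_solution(H, g, Q):
    """Solution of H @ beta = g minimizing ||Q @ beta||, assuming Q.T @ Q > 0.

    From the KKT conditions: beta = W^{-1} H^T (H W^{-1} H^T)^+ g with W = Q.T @ Q,
    which is unique whenever W is positive definite and the constraint is consistent.
    """
    W_inv = np.linalg.inv(Q.T @ Q)
    return W_inv @ H.T @ np.linalg.pinv(H @ W_inv @ H.T) @ g
```

Both helpers reduce to ordinary matrix-vector products once the pseudo-inverses are precomputed offline, which is consistent with the low online complexity discussed above.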

IV. THEORETICAL RESULTS
In this section, we discuss theoretical guarantees for Algorithm 1, in particular a bound on the regret R. In order to derive such a bound, we first analyze the error estimates ê. Lemma 3 states that the measurement error estimates ê converge to the true measurement error e. Thus, Algorithm 1 is able to (asymptotically) exactly recover the measurement error e and control the true system output y_t, even though only noisy measurements ỹ_t are available at each time step.

Lemma 3. Suppose that the initialization of Algorithm 1, i.e., u_{[-n:-1]} and ỹ_{[-n:-1]} − ê_{[-n:-1]}, is an input-output sequence of system (1). Then, the error of the measurement noise estimates ê − e follows the unforced system dynamics, i.e., ê − e is an output of (1) with u ≡ 0, and lim_{t→∞} ‖ê_t − e_t‖ = 0.

Proof:
For every t ≥ 0, let ᾱ_t be defined as in (6) but with the real outputs y_{[t-n:t-1]}, and let ε_t = α_t − ᾱ_t. Note that ᾱ_t is well-defined at all times due to Lemma 2. Then, we have U ε_t = 0 by definition of ε_t. Moreover, ᾱ_{t-1} + β_{t-1} and α_{t-1} + β_{t-1} encode the same input u_{t-1} and, therefore, ᾱ_{t-1} + β_{t-1} predicts the output y_{t-1} correctly. Combining this fact with (16) yields (17). Combining the above results, U ε_t = 0 and (17), we conclude that the error sequence e_t − ê_t follows the unforced system dynamics. In more detail, at each time step t the sequence generated by Y ε_t is, by (17), initialized by the end piece of the initialization of Y ε_{t-1} (i.e., Y_{2:n} ε_{t-1}) appended with the one-step-ahead prediction at time t − 1 (i.e., Y_{n+1} ε_{t-1}). Therefore, we have by Theorem 1 and U ε_t = 0 for all t that e_t − ê_t is the output of a trajectory of the unforced system for all t. Since the unforced system dynamics are stable by Assumption 1, we obtain the result lim_{t→∞} ‖e_t − ê_t‖ = 0. ∎
Next, we are able to derive an upper bound on the regret R as stated in Theorem 2.
Theorem 2. Let Assumptions 1-4 be satisfied, suppose the initialization of Algorithm 1 satisfies the conditions of Lemma 3, and let the step size satisfy γ ≤ 2/(α_z + l_z). Then, the regret of Algorithm 1 in closed loop with system (1) is upper bounded by

R ≤ C_µ + C_ζ Σ_{t=0}^{T} ‖ζ_t − ζ_{t-1}‖ + C_e ‖E_0‖,

where E_0 = e_{[-n:-1]} − ê_{[-n:-1]}, C_µ, C_ζ, C_e < ∞ are constants independent of T, and ζ_{-1} = z^s_{-1}.

The proof is given in the appendix. The upper bound on the regret depends on constants, which in turn depend on system and problem parameters, on Σ_{t=0}^{T} ‖ζ_t − ζ_{t-1}‖, and on E_0 = e_{[-n:-1]} − ê_{[-n:-1]}, i.e., the initialization error of the measurement error estimates. The quantity Σ_{t=0}^{T} ‖ζ_t − ζ_{t-1}‖, commonly termed path length in the literature [39], can be regarded as a measure of the variation of the cost functions. A bound which depends on the variation of the cost functions is to be expected since, in our framework, the cost function L_t is only available to the algorithm at time step t + 1, i.e., there is a one-step delay between the cost function becoming active and being used to control the system. Thus, it is impossible to achieve low regret if the cost functions vary too frequently. A bound which depends linearly on Σ_{t=0}^{T} ‖ζ_t − ζ_{t-1}‖ is well aligned with other results on dynamic regret in the literature, compare, e.g., [7], [10]. Sublinear regret can therefore be achieved if the path length is sublinear in T. Moreover, introducing measurement noise into the control problem only adds the constant term C_e ‖E_0‖ to the regret upper bound compared to [29]. This is due to the convergence of ê to e as shown in Lemma 3.
Finally, consider the case where the optimal steady state is constant, i.e., ζ_t = ζ_{t'} for some t' ∈ I_{≥0} and all t ≥ t'. Following the proof of Theorem 2, it can be shown that the accumulated tracking error satisfies

Σ_{t=t'}^{T} ‖(u_t, y_t) − ζ_{t'}‖ ≤ C'_µ + C'_ζ,

where C'_µ, C'_ζ < ∞ are again constants independent of T. Thus,

lim_{t→∞} ‖(u_t, y_t) − ζ_{t'}‖ = 0,

which implies that, in the case that the optimal steady state is constant, the closed loop with Algorithm 1 asymptotically converges to the optimal steady state.

Remark 2. (Unstable systems) Suppose that Assumption 1 is not satisfied because the system is not Schur stable. In this case, it is possible to design a linear stabilizing feedback [19], [20] and apply Algorithm 1 to the stabilized system, as discussed above. In particular, at each time step t, u_t + v_t is applied to the system, where u_t is the input computed in (11) and v_t is generated by the stabilizing output-feedback controller K. In order to do so, every Hankel matrix of the open-loop system in Algorithm 1 has to be replaced by Hankel matrices of the stabilized system. Moreover, a mapping V(y^s) can be computed that maps a steady-state output y^s of the system to a steady-state input of the stabilizing controller. Finally, the cost functions have to be reformulated to L_t(u + V(y), y) to account for the stabilizing input when determining the optimal steady state. Then, it is still possible to derive a regret upper bound for the prestabilized system, but the theoretical guarantees deteriorate in two ways: 1) The estimates of the measurement error do not converge asymptotically exactly to the true measurement error as stated in Lemma 3, because the stabilized system is only practically (and not asymptotically) stable due to the measurement noise. Instead, the estimates ê inherit their stability properties from the stabilized system. For example, assume that the stabilizing feedback stabilizes some robust positive invariant (RPI) set. In this case, the estimates ê converge to the same RPI set around the true measurement error e.

2) The regret bound is increased by additional terms that scale with v_max and grow linearly in T, where the corresponding constants C_sl and C_sc only depend on the system parameters A, B, C, D, n, and the prediction horizon µ, and v_max is an upper bound on the error feedback, i.e., ‖v^e_t‖ ≤ v_max for all t. Thus, the regret upper bound for the prestabilized system becomes linear in T, which is to be expected since the stabilizing controller feeds back the measurement error e_t at every time step, thereby preventing us from staying exactly at the optimal steady state. In this regard, the feedback of the measurement error v^e_t can also be interpreted as process noise acting on a stable system.

V. APPLICATION EXAMPLE - THERMAL CONTROL

A. Setting
In this section, we test our OCO-based control scheme on a thermal control problem. Specifically, we consider a Heating, Ventilation and Air Conditioning (HVAC) system which controls the temperature of five nonuniform zones. The HVAC system is equipped with a sensor in zones 1, 4, and 5, and actuators adjusting the supply air rate in every zone. The zones are depicted in Fig. 2. We consider the linear thermal dynamics model proposed in [11], [40], given by

C_i dT_i/dt = (T_o − T_i)/R_i + Σ_{j ∈ N(i)} (T_j − T_i)/R_ij + u_{i,t} + q_{i,t},

where C_i is the thermal capacitance of zone i, T_i is the zone temperature of zone i, T_o is the outdoor temperature, R_i is the thermal resistance between the i-th zone and the outside, R_ij is the thermal resistance between zones i and j, N(i) denotes the set of zones neighboring zone i, u_{i,t} is the control input at time t associated with zone i, and q_{i,t} denotes (unknown) process noise caused, e.g., by additional heat sources in zone i. For zone 3, we set R_3 = ∞ in our simulation since it is surrounded by other zones and, therefore, not directly influenced by the outdoor temperature. Note that we do not consider process noise in our theoretical work, but do consider it in the simulation as an additional difficulty. We define the system states as x_{i,t} = T_{i,t} − T_o and discretize the thermal dynamics with sample time t_s = 60 s. The cost function consists of a term penalizing the deviation from a desired temperature T^set_t, weighted by Λ_t, and a term penalizing the control cost, weighted by λ_t p_t, where ∆T^set_t = T^set_t − T_o, Λ_t ∈ R^{3×3} and λ_t ∈ R are a priori unknown time-varying parameters, and p_t denotes the a priori unknown energy cost. In particular, λ_t and Λ_t are weighting factors, trading off user comfort and control cost. We set T_o = 15 °C, λ_t = 10, and Λ_t = I_p. However, we change Λ_t to Λ_t = 0.1 I_p between 0 am and 6 am in order to save energy during the night. The normalized energy cost p_t is shown in Figure 3. We choose T^set_t = 18 °C · 1_p but switch it, a priori unknown to the algorithm, at 9 am to T^set_t = 21 °C · 1_p. In Algorithm 1, we choose Q = [I_{N−2n−µ}^⊤ U^⊤]^⊤, µ ∈ {10, 30}, and γ = 0.15, which satisfies γ ≤ 2/(α_z + l_z).
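For illustration, a minimal sketch of how such a five-zone RC model can be assembled and discretized with t_s = 60 s is given below; the numerical values, the zone adjacency, and all names are placeholders rather than the parameters used in the paper.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical parameters for a five-zone RC model; zone 3 (index 2) has no
# direct connection to the outside (R_3 = inf).
n_zones = 5
C = np.array([2.0e5, 2.0e5, 1.5e5, 2.5e5, 2.0e5])     # thermal capacitances [J/K]
R_out = np.array([50.0, 50.0, np.inf, 60.0, 55.0])     # resistances to outside [K/W]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
R_between = 30.0                                       # resistance between adjacent zones [K/W]

# continuous-time dynamics for the states x_i = T_i - T_o:
#   C_i dx_i/dt = -x_i / R_i + sum_{j in N(i)} (x_j - x_i) / R_ij + u_i
A_c = np.zeros((n_zones, n_zones))
for i in range(n_zones):
    A_c[i, i] -= 1.0 / (R_out[i] * C[i])
    for j in neighbors[i]:
        A_c[i, i] -= 1.0 / (R_between * C[i])
        A_c[i, j] += 1.0 / (R_between * C[i])
B_c = np.diag(1.0 / C)                                 # one actuator per zone

# exact zero-order-hold discretization with sample time t_s = 60 s
ts = 60.0
M = np.zeros((2 * n_zones, 2 * n_zones))
M[:n_zones, :n_zones], M[:n_zones, n_zones:] = A_c, B_c
Md = expm(M * ts)
A_d, B_d = Md[:n_zones, :n_zones], Md[:n_zones, n_zones:]

# only zones 1, 4, and 5 are measured
C_d = np.zeros((3, n_zones))
C_d[0, 0], C_d[1, 3], C_d[2, 4] = 1.0, 1.0, 1.0
```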

B. Prediction Horizon and Robustness to Measurement Noise
First, we simulate the proposed Algorithm 1 with different prediction horizons and assess its robustness with respect to measurement noise. To this end, we sample the measurement error e_t uniformly from the interval [−1, 1]. Moreover, we increase the measurement error of the sensor in the fifth zone, e_{3,t}, between 10 am and 2 pm as shown in Figure 4 to simulate a failing sensor.
The results are illustrated in Figures 4-5. Figure 4 shows the measurement error in the fifth zone, e_{3,t}, and the corresponding estimate ê_{3,t} produced by Algorithm 1 for the first 15 hours. Initially, the estimate is off by 2 °C because of the wrong initialization, but it then converges to the true measurement error in accordance with Lemma 3. Note that a slight mismatch persists due to process noise. Figure 5 shows the true closed-loop temperatures and inputs of zones 2 and 5, only one of which can be measured, together with the optimal steady state (η, θ) for both zones. Even though the temperature in zone 2 cannot be measured and the algorithm has to cope with process and measurement noise, the closed loop closely tracks the optimal steady state. This holds for sudden changes, i.e., the change of Λ_t at 6 am and the change of T^set_t at 9 am, as well as for gradual changes due to the fluctuation of the energy prices p_t. Note that the increase in measurement noise around noon has no influence on the control performance. The noise in the true temperatures is due to process noise.
Comparing the different values for the prediction horizon µ ∈ {10, 30}, Figure 5 indicates that a shorter prediction horizon yields a more aggressive controller. The accumulated cost over the whole day is approximately 7165 for µ = 10 and 7247 for µ = 30 for the same noise realization. Thus, a shorter prediction horizon yields (slightly) superior performance in this example.

C. Comparison to Related Work
In a second experiment, we compare Algorithm 1 to the method proposed in [28] for a similar setting (compare the discussion in the Introduction). In order to achieve satisfactory performance for both algorithms, we have to reduce the measurement error and sample it uniformly from the interval [−0.1, 0.1]. For Algorithm 1, we choose the same parameters and initialization as before and choose µ = 10.
For the algorithm proposed in [28], we set the step size η to 0.005. The results are illustrated in Figure 6. It can be seen that both algorithms are able to track the time-varying optimal steady state closely. Moreover, for these parameters, both algorithms achieve almost the same closed-loop cost. However, the algorithm proposed in [28] does so with a higher overshoot, more oscillations, and larger control inputs.

VI. CONCLUSION
In this paper, we proposed a data-driven OCO-based scheme for controlling linear dynamical systems subject to measurement noise. We use only a single persistently exciting data trajectory instead of a model of the system and rely on noisy output feedback instead of full state measurements to derive the control algorithm. The control scheme achieves a similar sublinear regret bound as comparable algorithms from the literature, despite only having access to noisy measurements. In particular, we show that adding measurement noise to the control problem only leads to an additional constant term in the regret bound. Compared to previous work, the proposed algorithm is able to handle the more general and practically important case of economic cost functions and additionally allows us to relax previous assumptions on the steady-state manifold of the system. Future work includes obtaining theoretical guarantees for the case of both process and measurement noise, as well as considering noisy a priori data. Furthermore, enabling the proposed algorithm to handle state and input constraints, which has already been achieved in a model-based setting, is an interesting direction of future research.

APPENDIX

A. Proof of Theorem 2
Before we prove the regret bound, we first derive some auxiliary results. Note that α_t and β_t in (6) and (9) always have a solution by Lemma 2. Then, by (6), we obtain the initialization (18) of the prediction at time t, and the input sequence generated by α_t is given by (19). Hence, α_t and α_{t-1} + β_{t-1} give rise to the same initialization (18) and the same input sequence (19) and, therefore, must produce the same output trajectory [22]. Note that Y_{n+µ+1:2n+µ}(α_t + β_t) = 1_n ⊗ y^s_t and U_{n+µ+1:2n+µ+1}(α_t + β_t) = 1_{n+1} ⊗ u^s_t by (9), which implies that the predicted system is at the equilibrium (u^s_t, y^s_t) for n time steps and that, at the (n + 1)-th time step, u^s_t is applied again. Hence, the system remains at the same equilibrium, i.e., the predicted output remains at y^s_t. Moreover, we need the following key result on the convergence rate of projected gradient descent from [41, Theorem 2.2.14]. Let L(z) be an α_z-strongly convex and l_z-smooth function to be minimized. Then, one projected gradient descent step z_1 = Π_Z(z_0 − γ∇L(z_0)), where Π_Z(·) denotes projection onto the set Z and γ ≤ 2/(α_z + l_z) is the step size, satisfies

‖z_1 − z^*‖ ≤ κ ‖z_0 − z^*‖,

where z^* = arg min_{z∈Z} L(z) and κ = 1 − α_z γ. Now, we are ready to bound the regret R of Algorithm 1. By definition of the regret and Lipschitz continuity of the cost functions, we have R ≤ L_z Σ_{t=0}^{T} ‖z_t − ζ_t‖ + C_0, where C_0 is a constant which is independent of T. Applying the triangle inequality yields a bound in terms of the output prediction errors and the distances of the estimated optimal steady states z^s_t to the optimal steady states ζ_t. Again applying the triangle inequality, we obtain a sum up to T − µ involving the output prediction error, where σ_max(S_O), σ_min(S_O) denote the largest and smallest singular value of S_O, respectively. Moreover, since A is Schur stable by Assumption 1, there exist constants c > 0 and λ ∈ (0, 1) such that ‖A^t‖ ≤ c λ^t. Thus, the effect of the initialization error E_0 decays geometrically. Next, let ᾱ^*_t be defined analogously to α_t but based on the true system outputs and the actually applied inputs. Then, we have y_{t+µ} = Y_{n+µ+1} ᾱ^*_t. Moreover, we have that

u_{t+j} = U_{n+1}(α_{t+j} + β_{t+j}) = U_{n+2}(α_{t+j-1} + β_{t+j-1}) + U_{n+1} β_{t+j},

where the first equality follows from (11) and the second from (6) and (10).
Applying (6) and (10) repeatedly, we obtain corresponding expressions for u_{t+j} for 0 ≤ j ≤ µ − 1. Define C_1 = ‖Y_{n+µ+1} H_α^†‖. Combining the above results, we are now ready to bound the output prediction error Σ_{t=0}^{T−µ} ‖Y_{n+µ+1} α_t − y_{t+µ}‖. By Theorem 1, any α_t satisfying (6) results in the same output Y_{n+µ+1} α_t, since the vector on the right-hand side of (6) uniquely specifies the input sequence and initial condition (compare [22]). Hence, in the following we assume without loss of generality that α_t is chosen according to (12).
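For intuition, the contraction property of one projected gradient step used in the proof above can be checked numerically on a toy strongly convex quadratic; the numbers below are illustrative only.

```python
import numpy as np

# Numerical check of the projected-gradient contraction for a simple strongly
# convex quadratic L(z) = 0.5 * z' P z projected onto a one-dimensional subspace.
rng = np.random.default_rng(0)
P = np.diag([1.0, 4.0])                     # alpha_z = 1, l_z = 4
gamma = 2.0 / (1.0 + 4.0)                   # step size gamma <= 2 / (alpha_z + l_z)
kappa = 1.0 - 1.0 * gamma                   # contraction factor 1 - alpha_z * gamma

S = np.array([[1.0, -2.0]])                 # feasible set Z = null(S) = {z1 = 2 z2}
Pi = np.eye(2) - np.linalg.pinv(S) @ S      # orthogonal projector onto Z
z_star = np.zeros(2)                        # minimizer of L over Z

z0 = Pi @ rng.standard_normal(2)            # arbitrary starting point in Z
z1 = Pi @ (z0 - gamma * (P @ z0))           # one projected gradient step
assert np.linalg.norm(z1 - z_star) <= kappa * np.linalg.norm(z0 - z_star) + 1e-12
```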