SECTION I

The engineering research field of cyber–physical systems (CPSs) has drawn a great deal of attention from academia, industry, and the government due to its potential benefits to society, economy, and the environment [1]. As a whole, CPSs refer to the next generation of engineered systems that require tight integration of computing, communication, and control technologies to achieve stability, performance, reliability, robustness, and efficiency in dealing with physical systems of many application domains [2].

Even though the specific context of problems and challenges of today's CPSs is different from those in the past, the basic goal of developing control systems through integration of technologies from computing and communication has roots that go back nearly a century. For example, at the time of World War II, the development of automatic antiaircraft guns was one of the most important and challenging problems that required tight integration of technologies from the mechanical, electrical, electronics, and communication fields [3], [4]. In a much broader sense, we may also interpret CPSs as physical systems controlled or manipulated in a principled manner through engineering technologies. With such an interpretation, the history of CPSs can easily be traced back to the Industrial Revolution sparked by the development of the steam engine governor in the 18th century. Hence, we can view and understand the emergence of today's CPSs as a continuation of technological evolution that started from the early uses of feedback control technologies.

Over the last several decades, the advancements in computing and communication technologies have been so significant that we now refer to them as having collectively given rise to an information technology (IT) revolution. In fact, every aspect of today's individual, social, industrial, and economic activities is highly dependent on such cyber–system technologies. In particular, the Internet has changed the way we interact and communicate with each other and also how we create, distribute, and consume information. Continuing this trend, ubiquitous embedded computing, sensing, and wireless networking are becoming the key enabling technologies for how we interact with, control, and build physical engineered systems such as automobiles, aircraft, power grids, manufacturing plants, medical systems, and building systems, on which our modern society and economy are becoming highly dependent.

The potential benefits of the convergence of computing, communication, and control technologies for developing next-generation engineered systems that can be called CPSs are transformative and wide ranging. Through real-time embedded systems for distributed sensing, computation, and control over wired or wireless communication networks, multiobjective optimization, high-level decision-making algorithms, and formal verification technologies, engineered systems in many societally critical application domains such as transportation, energy, and medical systems can be designed and developed to be much smarter and more reliable, secure, efficient, and robust. Needless to say, many challenges remain to be addressed, and these efforts will have to span all the constituent fields.

The spectrum of research fields relevant to CPSs is very broad. This overview paper is not an exhaustive survey that covers every aspect of CPS research, and is necessarily limited by the knowledge of the authors. In Section II, we review the history of control, communication, and computing technologies leading to CPSs. Then, we review recent achievements in many research disciplines. In Section III, we review research advances in selected areas, networked control systems and hybrid systems, which constitute some of the theoretical foundations for design and analysis of the dynamical behavior of CPSs. In Section IV, we discuss theories and technologies vis-a-vis real-time computing and networking. In Section V, we review fundamental theoretical results and implementation platforms for wireless sensor networks. In Section VII, we discuss the design and development of CPSs from the software engineering point of view. In Section VIII, we conclude by envisioning opportunities and challenges in some domains.

SECTION II

Computers were originally invented to perform computation. The first computer ENIAC [5] was constructed in 1946 to perform ballistic calculations. However, computers subsequently began to be used to close control loops around physical systems. This motivated the development in 1973 of real-time computation [6], [7], which involved the problem of how to schedule computational tasks so that every job in every task was completed before its deadline. This constituted a significant shift in the usage of computers. If performing calculations correctly is the only purpose, then all one needs is to ensure the *order* of computations; there is no need to deal with physical time. However, if one is interfacing a computer with a physical plant, then the *time* by which computations are performed is important. So, already by this time, there was interest in CPSs, though the name itself was to be invented much later.

In the 1990s, there began to appear much greater interest in the interaction between computational and physical systems [8]. Specifically, with the physical plant modeled by differential equations, and the computational systems modeled by finite state machines or other discrete models of computation, the interest centered on how the interaction of the two evolved. This field was called hybrid systems, reflecting the composite nature of the overall system.

Around 2006, researchers, predominantly in real-time systems, hybrid systems, and control systems, coined the name “cyber–physical systems” to describe this increasingly important area at the interface of the cyber and physical worlds.

There are several other paths also leading to this area of interest. From its origins as ARPANET [9] in 1969, the Internet developed into a worldwide network connecting computers. The cellular telephony revolution began around 1973. Also around 1971 the ALOHA network was developed to interconnect users across the Hawaiian islands with a mainframe computer in Oahu [10]. Its pioneering ideas, concerning how to resolve contention of the shared wireless medium, were used in Ethernet as well as packet radio networks. In 1977, the U.S. Defense Advanced Research Projects Agency (DARPA) tested the PRNET packet radio network [11]. In 1978, the U.S. Army deployed the Single Channel Ground and Airborne Radio System (SINCGARS) packet radio system [12]. Subsequently, in 1997, the IEEE 802.11 WiFi standard was developed and proliferated across offices and homes after the introduction of IEEE 802.11b [13]. All this, including the landline telephone network, has led to a communication revolution. The goal of interconnecting computers to form a communication network has played a central role in ALOHA, the Internet, and WiFi. Thus, we see here the convergence of communication and computation.

Around 1998, a new element was added to the mix—sensing, with the development by the Smart Dust project [14] of a mote, a tiny device capable of sensing, communication, and computation. These motes allowed the attachment of sensors to the nodes, bringing information about the physical environment into the interconnected wireless communication network of computational nodes.

When nodes in a communication network are connected to both sensors and actuators, one obtains a networked control system. Thus, again, we see an evolutionary path from communication and computing to, in this case, *networked* CPSs.

There is yet another path that one can trace to the present interest in CPSs. In the modern electronic era, the first generation of control systems, analog control, was based on the operational amplifier [15]. To use this technology, a theoretical framework was needed. The appropriate framework was the frequency domain approach, developed by Nyquist [16], Bode [17], Evans [18], and others. This also led to CPSs, though based on analog computation. One can regard Ziegler–Nichols tuning rules [19], for example, as methods to adjust the overall CPS to achieve desired behavior. Already, by 1954, there was beginning to emerge the second generation of control, digital control [20]. This was spawned by the development of the digital computer. Now simple calculations or algorithms could be performed on the measured signals before closing the loops. This too required a theoretical framework; the appropriate one in this case was the state–space approach. This was developed by Bellman [21], Pontryagin [22], Kalman [23], [24], [25], [26], [27], and others under the leadership of Solomon Lefschetz at the Martin Company's Research Institute for Advanced Study in Baltimore, which was founded in 1955. This led to a very strong foundation of systems theory, with a thorough investigation of optimal control [28], stability [29], linear systems [30], nonlinear control [31], stochastic systems [32], adaptive control [33], robust control [34], infinite-dimensional systems [35], decentralized control of complex systems [36], discrete event systems [37], and even attempts at integrating automata theory and control [38].

Digital control is more than 50 years old, and in the intervening years there have been dramatic advancements in the power of computers as well as the proliferation of embedded computers. There has also been enormous growth in the complexity of software and in the programming abstractions that have been developed for building it. Finally, wireline and wireless data networks were nonexistent 50 years ago. Thus, the emergence of networked CPSs is leading to a third generation of control systems. There has been evolution in the technology of control system implementations on distributed systems. In process control, the controller area network (CANBus) [39] has been used to provide the underlying communication network for distributed control systems. The Field Bus system [40] has also been developed for interconnection. There is also interest in the “Internet of Things,” where physical objects are assigned addresses and interconnected with each other, with interest therefore focused on the communication–physical system interface. All this, together, constitutes yet another platform revolution. At such a time of platform revolution, it is necessary to examine both mechanisms and policies. By “mechanisms,” we mean how to implement a system, while by “policies,” we mean what to implement, for example, which control law.

There is also a great impetus from the viewpoint of applications of societal interest to develop more complex control systems featuring sensing, actuation, and computation capabilities connected by a communication network. There is an increasing demand for more and better transportation systems, energy systems, healthcare systems, and water systems, across all segments of the planet. Due to these demands, as well as the increasing awareness of the resource limitations of the planet, the 21st century could well be the age of building large systems. Many if not most or all of these systems will be composed of complex CPSs.

All these trends—the convergence of several disciplines, the evolution of technology in various fields, and the increasing need to build large scale systems to meet the burgeoning societal needs in an environment of resource frugality—have led to great research interest in the issues sought to be captured by the phrase of CPSs [1].

SECTION III

The dynamics of CPSs is complex, involving the stochastic nature of communication systems, discrete dynamics of computing systems, and continuous dynamics of control systems. In this section, we review recent theoretical results on modeling and analysis of dynamical behavior of CPSs from different points of view.

One of the fundamental characteristics of today's CPSs is the existence of a communication network mediating between and among computing and physical entities as shown in Fig. 1. The interactions between the controller and the physical system can therefore experience network-induced delay. Packets can even occasionally be lost. The network's links can be regarded as communication channels that are subject to data rate constraints. Hence, some of the fundamental questions that are of importance for networked control systems (NCSs) are as follows. 1) How do the network-induced delay, packet loss, and communication channel affect the stability of the system? 2) Under what conditions is an NCS stabilizable, and how does one stabilize it?

The first issue is when to sample a physical system. The traditional approach is to sample it periodically or at predetermined instants. An alternative is to sample it when specific events occur, e.g., when a signal crosses a level. These have been called Riemann and Lebesgue sampling [41]. The latter approach requires continuous monitoring of the system to detect when to sample it. An alternative is to decide a safe interval for which the system can be left unsampled and an appropriate time to sample it next. This is called self-triggering and can lead to more efficient monitoring as well as usage of resources, and even be used to guarantee stability based on some knowledge of the plant [42], [43], [44].
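As a rough illustration of the difference between periodic and event-based sampling, the following sketch samples the same signal both ways. The test signal, the threshold `delta`, and the function names are illustrative assumptions and are not taken from [41]; the fine grid `dt` merely stands in for the continuous monitoring that event-based sampling requires.

```python
import math

def periodic_samples(signal, t_end, period):
    """Riemann-style sampling: one sample every `period` seconds."""
    n = int(round(t_end / period))
    return [(k * period, signal(k * period)) for k in range(n + 1)]

def event_samples(signal, t_end, delta, dt=1e-3):
    """Lebesgue-style sampling: take a sample whenever the signal has moved
    by at least `delta` since the last sample was taken."""
    samples = [(0.0, signal(0.0))]
    steps = int(round(t_end / dt))
    for k in range(1, steps + 1):
        t = k * dt
        y = signal(t)
        if abs(y - samples[-1][1]) >= delta:
            samples.append((t, y))
    return samples

# Hypothetical test signal: a decaying exponential that settles quickly.
decay = lambda t: math.exp(-t)

periodic = periodic_samples(decay, t_end=10.0, period=0.1)
events = event_samples(decay, t_end=10.0, delta=0.05)
```

For a signal that settles, the event-based scheme stops producing samples once the signal stops moving, while the periodic scheme keeps sampling at the same rate.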

To study the effect of network-induced delay, consider an NCS modeled as consisting of a linear continuous time plant and a controller exchanging data packets over a lossless communication network that is shared with other unrelated nodes [45], [46]. Define the network-induced error $e(t):=[\hat{y}(t)\,\hat{u}(t)]^{T}-[y(t)\,u(t)]^{T}$ where $y(t)$ is the output of a plant, $u(t)$ is the output of a controller, and $\hat{y}(t)$ and $\hat{u}(t)$ are the most recently received versions of $y(t)$ and $u(t)$, respectively. If there is no network-induced delay between plant and controller, then $\hat{y}(t)=y(t)$ and $\hat{u}(t)=u(t)$ and so $e(t)=0$ for all $t$. A network scheduling strategy, called maximum-error-first with try-once-discard (MEF-TOD), which dynamically assigns the packet transmission order among nodes to share the network is proposed in [45] and [46]. The notion of maximum allowable transfer interval (MATI) is introduced to bound the amount of time between transmission events and derive a sufficient condition in terms of MATI for stability of the NCS.
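The scheduling idea can be sketched for a single transmission slot as follows. This is only a minimal sketch under our own assumptions: scalar node values, optional per-node weights, and ties broken by the lowest index. The arbitration details of the protocol in [45] and [46] are not modeled, and the function name is hypothetical.

```python
def mef_tod_slot(reported, current, weights=None):
    """One transmission slot of a maximum-error-first, try-once-discard
    style scheduler.

    reported[i]: value of node i most recently received over the network.
    current[i]:  value node i would like to send now.
    Returns (winner, new_reported). Only the winner's value is delivered;
    the other nodes discard this attempt and compete again with fresh
    data in the next slot.
    """
    n = len(current)
    weights = weights or [1.0] * n
    # Weighted network-induced error of each node.
    errors = [weights[i] * abs(reported[i] - current[i]) for i in range(n)]
    winner = max(range(n), key=lambda i: errors[i])
    new_reported = list(reported)
    new_reported[winner] = current[winner]  # the winner's error drops to zero
    return winner, new_reported

winner, reported = mef_tod_slot(reported=[1.0, 2.0, 3.0],
                                current=[1.1, 2.5, 2.7])
```

Here node 1 has the largest error and transmits, so only its entry of `reported` is refreshed.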

Another approach to the stability analysis of an NCS [47] is by using hybrid systems analysis techniques [48]. As a model of an NCS, consider a plant $\dot{x}(t)=Ax(t)+Bu(t)$ for $t\in[kh+\tau,(k+1)h+\tau)$, and a state feedback controller
$$u(t^{+})=-Kx(t-\tau),\quad t\in\{kh+\tau:k=0,1,\ldots\}\tag{1}$$
where $h$ is the sampling period, $\tau$ is the fixed network-induced delay that is the sum of the delays from sensor to controller and controller to actuator, and $u(t^{+})$ is piecewise continuous, changing values only at $kh+\tau$. Stability is guaranteed if the following matrix has all its eigenvalues inside the unit disk:
$$H=\begin{bmatrix}e^{Ah}&-E(h)BK\\ e^{A(h-\tau)}&-e^{A\tau}\left(E(h)-E(\tau)BK\right)\end{bmatrix}\tag{2}$$
where $E(a):=\int_{0}^{a}e^{A(a-s)}ds$. Instead of (1), one can consider a state feedback controller that uses an estimated plant state $\hat{x}(kh+\tau)$ to compute a control input at $kh+\tau$ [49].
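The eigenvalue test can be tried numerically on a scalar plant. The sketch below does not reproduce the matrix in (2); instead, as an assumption made for illustration, it uses the standard sampled-data augmentation $z_{k}=[x(kh),u_{k-1}]^{T}$ for a plant $\dot{x}=ax+bu$ with $a\neq 0$ and $0\leq\tau<h$, and checks whether the resulting $2\times 2$ map has spectral radius below one. All function names are hypothetical.

```python
import cmath
import math

def delayed_loop_matrix(a, b, K, h, tau):
    """Closed-loop map for z_k = [x(kh), u_{k-1}] of the scalar plant
    dx/dt = a*x + b*u, where u_k = -K*x(kh) takes effect at kh + tau.
    Assumes a != 0 and 0 <= tau < h."""
    E = lambda s: (math.exp(a * s) - 1.0) / a   # integral of e^{a r} over [0, s]
    m11 = math.exp(a * h) - E(h - tau) * b * K  # new input acts on [kh+tau, (k+1)h)
    m12 = math.exp(a * (h - tau)) * E(tau) * b  # old input acts on [kh, kh+tau)
    return [[m11, m12], [-K, 0.0]]

def spectral_radius_2x2(M):
    """Largest eigenvalue magnitude of a real 2x2 matrix."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return max(abs((tr + root) / 2.0), abs((tr - root) / 2.0))

def is_stable(a, b, K, h, tau):
    """True if all eigenvalues lie strictly inside the unit disk."""
    return spectral_radius_2x2(delayed_loop_matrix(a, b, K, h, tau)) < 1.0
```

For example, with the unstable plant $a=b=1$ and gain $K=2$, `is_stable(1.0, 1.0, 2.0, 0.1, 0.02)` holds, whereas slowing the loop to $h=1$ with $\tau=0.9$ makes the same gain fail the test.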

A more general framework for stability analysis of the NCS is to consider a nonlinear NCS with disturbance and also a general class of network scheduling protocols, called Lyapunov uniformly globally exponentially stable (UGES) protocols [50]. Both the round-robin (RR) static scheduling protocol and the MEF-TOD dynamic scheduling protocol considered in [45] turn out to be Lyapunov UGES protocols. Moreover, the input–output ${\cal L}_{p}$ stability of the NCS for Lyapunov UGES protocols is shown in [50] based on the small gain theorem.

A data packet that is transmitted, especially over wireless, can be dropped. One way to model packet loss [51] is as an asynchronous sample and hold switch which closes with a certain rate $r$. The NCS with packet loss can then be modeled as an asynchronous dynamical system (ADS) incorporating both discrete and continuous dynamics, and its stability analyzed through Lyapunov-based analysis. Lower bounds on the transmission rate $r$ needed for stability can be obtained [52].

The stabilization of an NCS over a channel that is prone to packet drops can be addressed through robust control analysis and synthesis techniques [53]. An NCS can be viewed as a feedback interconnection of a deterministic nominal system, denoted as $G$, and a zero-mean stochastic structured model uncertainty, denoted as $\Delta$. The stability problem can then be formulated as a linear matrix inequality (LMI) feasibility problem. Using the notion of mean square structured norm of $G$, denoted by $\mu_{\rm MS}(G,\Delta)$, the controller design problem for stabilizing an NCS can be posed as the optimization problem
$$\begin{aligned}\mu_{\rm MS}^{\ast}(G,\Delta)&=\inf_{K-{\rm stab,LTI}}\mu_{\rm MS}(G,\Delta)\\&=\inf_{\theta\,>\,0,{\rm Diag.}}\ \inf_{K-{\rm stab,LTI}}{\Vert\theta^{-1}G\theta\Vert}_{\rm MS}^{2}\end{aligned}\tag{3}$$
where the infimum is taken over all stabilizing LTI controllers $K$ for the given feedback interconnection of $G$ and $\Delta$. It turns out that the search for the controller $K^{\ast}$ with the largest stability margin $\mu_{\rm MS}^{\ast}(G,\Delta)$ is nonconvex with respect to the parameter $\theta$, so the optimization problem (3) is intractable in general. However, it is shown in [53] that, for any fixed $\theta>0$, the optimization problem (3) can be converted into an equivalent LMI optimization problem and the optimal controller $K^{\ast}$ can be determined through it.

The problem of state estimation over a lossy communication link corresponds to a filtering problem with intermittent observations [54]. More explicitly, the plant can be modeled by a discrete time linear Gaussian system $x_{t+1}=Ax_{t}+w_{t}$, where packets $y_{t}=Cx_{t}+v_{t}$ arrive with probability $(1-\alpha)$ as a Bernoulli process, and $w_{t}$ and $v_{t}$ are independent identically distributed (i.i.d.) Gaussian random vectors. If the matrix $C$ is invertible, then for a stable Kalman filter, it is necessary that the packet drop probability satisfies $\alpha<1/(\max_{i}\vert\lambda_{i}(A)\vert)^{2}$, where $\lambda_{i}(A)$ are the eigenvalues of $A$.
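To see where such a threshold comes from, consider a deliberately simplified scalar case, which is our own assumption and not the general setting of [54]: $x_{t+1}=ax_{t}+w_{t}$ with process noise variance $q$, $C=1$, and noise-free packets ($v_{t}=0$). A received packet then reveals $x_{t}$ exactly, so the expected prediction error variance obeys a linear recursion that converges only when $\alpha a^{2}<1$. The function names below are hypothetical.

```python
def critical_drop_probability(eigenvalues):
    """Bound 1 / (max_i |lambda_i(A)|)^2 on the packet drop probability."""
    rho = max(abs(lam) for lam in eigenvalues)
    return 1.0 / rho ** 2

def expected_prediction_variance(a, q, alpha, steps, p0=1.0):
    """Scalar case with C = 1 and noise-free packets (v_t = 0, an assumption).
    A received packet reveals x_t exactly, so the next prediction variance
    is q; a dropped packet gives a^2 * P + q. Taking expectations:
        E[P_{t+1}] = alpha * a^2 * E[P_t] + q
    """
    p = p0
    for _ in range(steps):
        p = alpha * a * a * p + q
    return p

threshold = critical_drop_probability([2.0, 0.5])                 # unstable mode a = 2
settled = expected_prediction_variance(2.0, 1.0, alpha=0.2, steps=200)
diverged = expected_prediction_variance(2.0, 1.0, alpha=0.3, steps=200)
```

With $a=2$ the threshold is $0.25$: a drop probability of $0.2$ lets the expected variance settle at $q/(1-\alpha a^{2})$, while $0.3$ makes it grow without bound.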

One can formulate the control of NCSs as an optimal control problem for LTI systems over a lossy communication link, with an uplink from sensor to controller, and a downlink from controller to an actuator that is collocated with the plant [55], [56]. A fundamental problem that arises when there are packet drops between the controller that computes a potential value of control to be applied, and an actuator that actually applies control inputs, is the resulting nonclassical information pattern, which renders the computation of the optimal control law under a linear quadratic control framework very difficult [57]. This difficulty disappears when the network protocol is a TCP-like protocol, i.e., a notification of successful reception is available. Then, there is indeed separation of estimation and control [58]. A sufficient condition for the stabilizability of an NCS is $\max\{\alpha,\beta\}<1/(\max_{i}\vert\lambda_{i}(A)\vert)^{2}$, where $\alpha$ and $\beta$ are critical values of drop probabilities for uplink and downlink.

In the NCS, it is also important to determine where in the network to perform calculations required by the control law. Under some conditions, the optimal placement of a controller in the NCS is to collocate it with the actuator [59]. Also, the above condition is then a necessary and sufficient condition for the existence of a stabilizing controller in the presence of packet drops, even when the matrix $C$ is not invertible.

Another important issue is how the presence of the data-rate limited communication channels affects the stabilizability of the system. An early precursor [60] considers the problem of optimal control with respect to a long-term average quadratic cost criterion of a linear Gaussian system. The channel is modeled as one that appropriately delays finite length codewords. It is shown that when the encoder codes the innovations process of the state estimate rather than the state itself, then there is a separation theorem and the optimal control is linear in the state estimate. More recently, there has been increasing attention paid to the problem of stabilizing a linear system when some of the feedback loop has to be closed over a communication channel of limited data rate. In an early work [61], the plant is modeled as a linear deterministic continuous system. The communication channel's limited data rate is modeled by assigning a long time delay, proportional to the number of bits that are sought to be communicated. An instantaneous output measurement taken at a certain time is simply quantized by a symbol from a finite alphabet. The decoder can however choose an appropriate control, from a finite set, based on the past history of all encoded measurements received. An unstable system is not asymptotically stabilizable, and an appropriate notion of containability related to the ability to keep the system in an open sphere around the origin when it is started close enough to the origin is introduced. It is shown that an inequality relating the rate of change of the system and the data rate is sufficient for containability. Another early paper [62] considers a scalar plant with a channel capable of transmitting $R$ bits per second without noise, and shows that in order to keep the trajectory bounded when sampling uniformly it is necessary for the rate $R$ to exceed a multiple of the logarithm of the absolute value of the unstable eigenvalue. 
In [63], the problem of quantization is studied where the sensitivity (i.e., fineness) of the quantizer is varied within a bounded neighborhood of the origin. In [64], it is shown that for quadratic stabilizability, the optimal sampling time depends on the sum of the logarithms of the magnitude of the unstable eigenvalues, and that the optimal quantization levels are logarithmic. In [65], a discrete linear system is considered, and the channel is modeled as being able to transmit $R$ bits perfectly in each second, i.e., as a bit pipe. It is shown that for asymptotic stabilizability it is necessary that the data rate exceeds the sum of the logarithms of the magnitudes of the unstable eigenvalues of the system matrix. When the encoder has access to past control inputs that were applied, some of the complications caused by information patterns, as in [57], do not arise, and it is shown that such a rate is also sufficient for asymptotic stabilizability. A companion paper [66] considers the case of noisy channels, and a similar necessary condition is shown on the rate, defined in a Shannon-theoretic sense, for almost sure asymptotic stabilizability. For certain channels with erasures, and when past control inputs are available to the encoder, this rate is also shown to be sufficient. In [67], a deterministic scalar autoregressive-moving-average (ARMA) system with a random initial condition is considered. The channel is modeled as a bit pipe, and a necessary and sufficient condition is obtained on the rate to ensure that the $m$th moment of the state is driven to zero. The minimum data rate needed for mean-square stabilizability of a system with both state and observation noises is treated in [68]. A dynamic quantizer is used to account for possible unbounded values of the state. There has also been attention to the case of bit pipes where the data rate varies randomly. In fact, the case of dropped packets can be regarded as a special case where the rate can be zero. 
The case of i.i.d. rate variation is considered in [69] and [70]. The case of a channel which changes between 0 and a certain rate as a two-state Markov chain is considered in [71]. The case of a more general Markovian channel rate evolution is examined in [72]. The concept of anytime capacity is introduced in [73] to capture a noisy communication channel when it is used as part of a feedback loop to stabilize an unstable linear system. Again, for scalar systems, the required rate is larger than the logarithm of the unstable system's gain. The issue of coding for noisy communication channels when they are used to close control loops is examined in [74], [75], [76], [77]. The case when the channel noise is Gaussian is simpler because uncoded transmission can be used [78]; this problem is connected to the problem of communication in the presence of feedback.
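The data-rate conditions discussed above can be made concrete with a small calculation. The sketch below assumes a discrete-time system, base-2 logarithms, and a rate counted in bits per sampling interval; the interval-halving argument in `uncertainty_after` is a textbook-style illustration for a noiseless scalar plant and is not the construction of any particular cited paper. Function names are hypothetical.

```python
import math

def min_bits_per_sample(eigenvalues):
    """Sum of log2|lambda_i| over the unstable eigenvalues (|lambda_i| > 1)."""
    return sum(math.log2(abs(lam)) for lam in eigenvalues if abs(lam) > 1.0)

def uncertainty_after(a, bits, steps, width0=1.0):
    """Scalar plant x_{k+1} = a*x_k + u_k whose state is known to lie in an
    interval of the given width. Each step the encoder names one of 2**bits
    equal subintervals, the controller recenters, and the plant stretches the
    remaining uncertainty by |a|."""
    width = width0
    for _ in range(steps):
        width = abs(a) * width / (2 ** bits)
    return width

rate_bound = min_bits_per_sample([2.0, 0.5, 4.0])   # only 2 and 4 are unstable
shrinks = uncertainty_after(3.0, bits=2, steps=50)  # 2 bits > log2(3)
grows = uncertainty_after(3.0, bits=1, steps=50)    # 1 bit  < log2(3)
```

The uncertainty interval contracts exactly when $2^{R}>\vert a\vert$, i.e., when the rate exceeds $\log_{2}\vert a\vert$, which is the scalar form of the condition.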

Further results on the NCS can be found in [79], [80], [81] on optimal control over a communication channel, in [82] and [83] on NCS with sampling and delay, in [84] and [85] on stability and control analysis of the NCS through a delay differential equation framework, in [86] on wireless control networks where the entire network itself acts as the controller, in [87] and [88] on decentralized control problems, and the references in [72] and in [89] and [90] for a survey of this field.

CPSs are typically required to adapt to various changes in internal and external factors. One way of adaptation is through “switching” between different operation “modes” which results in a switched system. The class of systems with switching can be described by $\dot{x}=f_{\sigma}(x)$, where $\sigma:[0,\infty)\rightarrow{\cal P}$ is a piecewise constant function of time, called a switching signal, and ${\cal P}$ is some index set [91]. The stability of such systems has been studied, and recent results can be found in [91] and [92] and the references therein.
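A switched system of this form is straightforward to simulate once the modes and the switching signal are fixed. In the sketch below, the two scalar modes, the dwell time of 0.5 s, and the forward-Euler integration are all assumptions chosen for illustration.

```python
def simulate_switched(modes, sigma, x0, t_end, dt=1e-3):
    """Forward-Euler integration of dx/dt = f_sigma(t)(x) for scalar x.
    `modes` maps an index in P to a vector field; `sigma` maps time to an index."""
    x = x0
    steps = int(round(t_end / dt))
    for k in range(steps):
        f = modes[sigma(k * dt)]   # vector field active at this instant
        x += dt * f(x)
    return x

# Hypothetical index set P = {0, 1} with two stable linear modes.
modes = {0: lambda x: -x, 1: lambda x: -2.0 * x}
# Piecewise-constant switching signal with a dwell time of 0.5 s per mode.
sigma = lambda t: int(t / 0.5) % 2

x_final = simulate_switched(modes, sigma, x0=1.0, t_end=2.0)
```

Over two seconds each mode is active for one second in total, so the exact solution is $e^{-3}$ and the Euler result should be close to it.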

A more general modeling framework for CPSs is the hybrid automaton (HA), which can be used to model complex dynamics of CPSs through various mathematical formalisms [93], [94], [95], [96], [97], [98] that can capture both the transition between discrete states and also the evolution of continuous states over time. One useful HA model developed for algorithmic verification of CPSs has the following components [94].

- A finite directed graph $\langle V,E\rangle$ where each $v\in V$ is called a control mode or a location, and each edge $e\in E$ is called a control switch.
- A finite set of continuous real-valued variables $X=\{x_{1},\ldots,x_{n}\}$. The first derivative of $X$ is written as $\dot{X}=\{\dot{x}_{1},\ldots,\dot{x}_{n}\}$, and $X^{\prime}=\{x_{1}^{\prime},\ldots,x_{n}^{\prime}\}$ is used for the value of $X$ at the conclusion of a discrete change.
- Two edge labeling functions *guard* and *reset* that assign to each $e\in E$ predicates of variables from $X\cup X^{\prime}$ to indicate a discrete transition condition and a reinitialization of the continuous variables.
- Three vertex labeling functions *init, invariant, flow* to indicate an initial, invariant, and flow condition for each $v\in V$. Both *init(v)* and *invariant(v)* are predicates with variables from $X$, while *flow(v)* is a predicate with variables from $X\cup\dot{X}$ to describe the dynamics of $X$ within a mode.

A simple example of an HA is shown in Fig. 2, in which $V=\{{\rm on,off}\}$ and $X=\{x\}$. For the mode *on*, the three vertex labeling functions are $x=2$ for *init(on)*, $x\in[1,3]$ for *invariant(on)*, and a differential equation $\dot{x}=-x+5$ for *flow(on)*. The discrete transition, or control switch, from *on* to *off* occurs based on an edge labeling function *guard* $x>3$ for the mode *on*, and the variable $x$ is reset to a value by a *reset* map $x^{\prime}:=3$ during the transition.
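The behavior of such an automaton can be explored with a short simulation. In the sketch below, the mode *on* follows the description above, whereas everything about the mode *off* (flow $\dot{x}=-x$, guard $x<1$, reset $x^{\prime}:=1$) is our own assumption made only to close the loop, since the text does not specify it. Forward-Euler integration with a fixed step is likewise an illustrative choice.

```python
def simulate_on_off(t_end, dt=1e-3):
    """Euler simulation of a two-mode hybrid automaton.
    Mode 'on' follows the text: init x = 2, flow dx/dt = -x + 5,
    guard x > 3, reset x' := 3. Mode 'off' is an assumption made only
    to close the loop: flow dx/dt = -x, guard x < 1, reset x' := 1."""
    flow = {"on": lambda x: -x + 5.0, "off": lambda x: -x}
    guard = {"on": lambda x: x > 3.0, "off": lambda x: x < 1.0}
    reset = {"on": 3.0, "off": 1.0}
    target = {"on": "off", "off": "on"}

    mode, x = "on", 2.0
    trace, switches = [(0.0, mode, x)], []
    for k in range(1, int(round(t_end / dt)) + 1):
        x += dt * flow[mode](x)          # continuous flow within the mode
        if guard[mode](x):               # discrete transition (control switch)
            x = reset[mode]
            mode = target[mode]
            switches.append((k * dt, mode))
        trace.append((k * dt, mode, x))
    return trace, switches

trace, switches = simulate_on_off(t_end=5.0)
```

Under these assumptions the recorded state stays within $[1,3]$ and the automaton alternates between the two modes, with the first switch to *off* occurring near $t=\ln(3/2)$.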

Safety verification of a given hybrid automaton ${\cal A}$ can be addressed by determining whether the set $\overline{\rm Reach}({\cal A},{\cal I})\cap{\cal U}$ is nonempty, where ${\cal I}$ denotes a given initial set of states, ${\cal U}$ denotes a set of unsafe states, ${\rm Reach}({\cal A},{\cal I})$ represents the set of states reached by executions of ${\cal A}$ starting from ${\cal I}$, and $\overline{\rm Reach}({\cal A},{\cal I})$ is an overapproximation of ${\rm Reach}({\cal A},{\cal I})$.

${\rm Reach}({\cal A},{\cal I})$ can be computed through the iteration $\varphi_{k+1}={\rm Post}(\varphi_{k})$, where $\varphi_{k}$ is the set of reached states at the $k$th step, and ${\rm Post}(\varphi_{k})$ is the set of states that is the union of $\varphi_{k}$ and the set of states reached from $\varphi_{k}$ through a discrete transition and continuous flow. If $\varphi_{k}$ and $\varphi_{k+1}$ coincide for some finite number $k$, then the algorithm terminates, returning ${\rm Reach}({\cal A},{\cal I})$. However, it is well known that the exact computation of ${\rm Reach}({\cal A},{\cal I})$ is undecidable in general [100], [101]. Hence, in such cases, computing $\overline{\rm Reach}({\cal A},{\cal I})$ is also an important research issue as we will discuss later.
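The fixed-point iteration is easiest to see on a finite-state abstraction, where sets of states can be enumerated explicitly. The sketch below makes that simplifying assumption; for a genuine hybrid automaton the sets $\varphi_{k}$ are infinite, need a symbolic representation, and the loop need not terminate. The state names and the transition table are hypothetical.

```python
def post(states, transitions):
    """Post(phi): phi together with every state reachable from phi in one step."""
    return states | {t for s in states for t in transitions.get(s, ())}

def reach(initial, transitions):
    """Iterate phi_{k+1} = Post(phi_k) until phi_{k+1} == phi_k."""
    phi = frozenset(initial)
    while True:
        nxt = frozenset(post(phi, transitions))
        if nxt == phi:
            return phi
        phi = nxt

def is_safe(initial, transitions, unsafe):
    """Safe if no reachable state belongs to the unsafe set."""
    return reach(initial, transitions).isdisjoint(unsafe)

# Hypothetical finite transition table: q3 and q4 are not reachable from q0.
transitions = {"q0": ["q1"], "q1": ["q2"], "q2": ["q0"], "q3": ["q4"]}
reached = reach({"q0"}, transitions)
```

If an overapproximation of the reach set is disjoint from the unsafe set, safety follows; a nonempty intersection with an overapproximation is inconclusive.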

The first subclass of hybrid automata for which reachability was shown to be decidable is the class of timed automata [102]. Roughly, timed automata are those where 1) the vertex labeling function *flow(v)* is of the form of $\dot{x}_{i}=1$; 2) the edge labeling function *reset(e)* either does not change the value of $x_{i}$ or resets $x_{i}$ to zero during a discrete transition; and 3) the sets associated with *init, inv, guard* are all in rectangular form, i.e., a finite boolean combination of the form $x_{i}\oplus c$ for some $c\in\mathbb{Q}$ and $\oplus\in\{<,\leq,=,\geq,>\}$. It is important to note that even though the continuous dynamics of timed automata is very simple from a control perspective, introducing time in a model of computation was a significant conceptual advance in the area of algorithm verification, and a precursor to a lot of work on hybrid systems.

The notions of simulation and bisimulation relations were established in the area of formal methods and used successfully for complexity reduction in discrete systems [103], [104], [105]. It turns out that they are also very useful for complexity reduction of hybrid systems to address reachability. In [102], the reachability problem for timed automata is shown to be decidable since there exists a finite quotient transition system ${\cal R}({\cal A})$ which is bisimilar to the original timed automaton. A quotient transition system is one that is constructed by the partition of the continuous state space. Transition systems ${\cal T}_{1}$ and ${\cal T}_{2}$ are said to be bisimilar if there exists a bisimulation relation ${\cal B}$ between ${\cal T}_{1}$ and ${\cal T}_{2}$. Definitions of transition system, simulation, and bisimulation relation are as follows [106].

A (labeled) transition system with observations is a tuple ${\cal T}=(Q,\Sigma,\rightarrow,Q^{0},\Pi,\langle\langle\cdot\rangle\rangle)$ where 1) $Q$ is a set of states; 2) $Q^{0}\subseteq Q$ is a set of initial states; 3) $\Sigma$ is a set of labels; 4) $\Pi$ is a set of observations; 5) $\langle\langle\cdot\rangle\rangle$ is an observation map from $Q$ to $\Pi$; and 6) $\rightarrow$ is a transition relation such that $\rightarrow\subseteq Q\times\Sigma\times Q$ and a transition from $q$ to $q^{\prime}$ with a label $\sigma$ is denoted by $q\buildrel{\sigma}\over{\rightarrow}q^{\prime}$.

A relation ${\cal S}\subseteq Q_{1}\times Q_{2}$ is called a simulation of ${\cal T}_{1}$ by ${\cal T}_{2}$ if for all $(q_{1},q_{2})\in{\cal S}$: 1) ${\langle\langle q_{1}\rangle\rangle}_{1}={\langle\langle q_{2} \rangle\rangle}_{2}$; and 2) $\forall q_{1} \buildrel{\sigma}\over{\rightarrow}_{1}q_{1}^{\prime}$, $\exists q_{2}\buildrel{\sigma}\over{\rightarrow}_{2}q_{2}^{\prime}$ such that $(q_{1}^{\prime},q_{2}^{\prime})\in{\cal S}$.

A relation ${\cal B}\subseteq Q_{1}\times Q_{2}$ is called a bisimulation between ${\cal T}_{1}$ and ${\cal T}_{2}$ if ${\cal B}$ is a simulation relation from ${\cal T}_{1}$ to ${\cal T}_{2}$ and ${\cal B}^{-1}$ is a simulation relation from ${\cal T}_{2}$ to ${\cal T}_{1}$.
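
For finite transition systems, the two definitions above can be checked mechanically. The following is a minimal sketch, not a construction taken from [106]: the dictionary encoding (`states`, `obs`, `trans`) and the example systems `T1`, `T2`, `T3` are illustrative assumptions. It computes the largest simulation of one system by another by repeatedly discarding pairs that violate condition 2), and it checks whether a given relation is a bisimulation.

```python
from itertools import product


def is_simulation(rel, t1, t2):
    """True if rel is a simulation of t1 by t2 (conditions 1 and 2)."""
    for (a, b) in rel:
        # Condition 1: related states carry the same observation.
        if t1["obs"][a] != t2["obs"][b]:
            return False
        # Condition 2: every labeled move of a is matched by a move of b.
        for (src, label, a2) in t1["trans"]:
            if src == a and not any(
                    (a2, b2) in rel
                    for (s2, l2, b2) in t2["trans"]
                    if s2 == b and l2 == label):
                return False
    return True


def greatest_simulation(t1, t2):
    """Largest simulation of t1 by t2, computed as a greatest fixpoint."""
    rel = {(a, b) for a, b in product(t1["states"], t2["states"])
           if t1["obs"][a] == t2["obs"][b]}
    changed = True
    while changed:
        changed = False
        for (a, b) in list(rel):
            for (src, label, a2) in t1["trans"]:
                if src != a:
                    continue
                matched = any((a2, b2) in rel
                              for (s2, l2, b2) in t2["trans"]
                              if s2 == b and l2 == label)
                if not matched:
                    rel.discard((a, b))
                    changed = True
                    break
    return rel


def is_bisimulation(rel, t1, t2):
    """rel simulates t1 by t2, and its inverse simulates t2 by t1."""
    inverse = {(b, a) for (a, b) in rel}
    return is_simulation(rel, t1, t2) and is_simulation(inverse, t2, t1)


# Hypothetical example systems (not from the paper).
T1 = {"states": {0, 1}, "obs": {0: "a", 1: "b"},
      "trans": {(0, "s", 1), (1, "s", 0)}}
T2 = {"states": {"x", "y", "z"}, "obs": {"x": "a", "y": "b", "z": "a"},
      "trans": {("x", "s", "y"), ("y", "s", "z"), ("z", "s", "y")}}
T3 = {"states": {"u"}, "obs": {"u": "a"}, "trans": set()}

R = greatest_simulation(T1, T2)
print(sorted(R, key=str), is_bisimulation(R, T1, T2))
```

In this example `T2` unfolds one state of `T1` into two copies, so the two systems are bisimilar, whereas the deadlocked system `T3` is simulated by `T1` but does not simulate it.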

A result on the decidability of the class of initialized rectangular hybrid automata (IRHA) is shown in [101]. Two important factors for decidability are: 1) rectangularity, that is, if we denote the set of all rectangular regions in $\BBR^{n}$ by ${\cal R}^{n}$, then the three vertex labeling functions *init, inv, flow* are all mapping functions from $V$ to ${\cal R}^{n}$, and the two edge labeling functions *guard, reset* are mapping functions from $E$ to ${\cal R}^{n}$; and 2) initialization, that is, a continuous variable has to be reinitialized whenever its flow changes during a discrete transition. In [101], it is shown that slight generalizations from IRHA lead to undecidability.

The o-minimal hybrid systems are defined in [107] as initialized hybrid systems whose relevant sets such as *guard, reset*, etc., and *flow* are definable in an o-minimal (or order-minimal) theory [108], [109]. This class captures hybrid systems with relatively complex continuous dynamics including linear, polynomial, and exponential *flow* dynamics. In [107], it is shown that every o-minimal hybrid system admits a finite bisimulation, and furthermore, the computation of such finite bisimulation terminates. Hence, o-minimal hybrid systems comprise a decidable class of hybrid systems.

An interesting class of hybrid systems, called linear hybrid automata (LHA) [99], [110], consists of those for which, for each $v\in V$ and $e\in E$: 1) the vertex labeling functions *flow(v), inv(v), init(v)*, and edge labeling functions *guard(e), reset(e)* are finite conjunctions of linear inequalities; and 2) more importantly, the flow function *flow(v)* is a finite conjunction of linear inequalities over the variables in $\mathdot{X}$ only. An important result is that if a given HA ${\cal A}$ is an LHA, then $Reach({\cal A},{\cal I})$ can be computed exactly [110]. However, it is not guaranteed that the iterative reach set computation terminates.

One of the most common classes of hybrid systems of interest has vertex labeling functions *flow(v)* in the form of a differential equation $\mathdot{x}=f(x)$. An example of an HA with linear differential equations is shown in Fig. 2. For such HAs and other classes of HAs more general than LHA, there is no known algorithm that can compute ${\rm Reach}({\cal A},{\cal I})$ exactly, even without a termination guarantee. Hence, the safety verification problem for this class of HAs can only be addressed through an overapproximation of ${\rm Reach}({\cal A},{\cal I})$.

In [99], an approximation technique, called linear phase portrait approximation, is proposed. The basic idea of this technique is to replace the dynamics of $f(x)$ for each $v\in V$ by a corresponding rectangular region that bounds the function $f(x)$ from above and below over the invariant set for the mode $v$. As an example, the dynamics $\mathdot{x}=-x+5$ for the mode *on* in Fig. 2 can be overapproximated by $\mathdot{x}\in[2,4]$ over the range of $x\in[1,3]$. Then, it is easy to see that the HA in Fig. 2 can be overapproximated by an LHA through this technique.
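
The bounding step in this example can be reproduced in a few lines. This is a minimal sketch restricted to scalar affine flows $f(x)=ax+b$, for which the extreme values over an interval are attained at its endpoints; the function names are illustrative and are not taken from [99].

```python
def flow_bounds_affine(a, b, x_lo, x_hi):
    """Lower and upper bounds of f(x) = a*x + b over [x_lo, x_hi].

    An affine function attains its extremes at the interval endpoints,
    so evaluating both endpoints is enough.
    """
    values = (a * x_lo + b, a * x_hi + b)
    return min(values), max(values)


def reach_interval(x0_lo, x0_hi, rate_lo, rate_hi, t):
    """States reachable after time t >= 0 when xdot lies in
    [rate_lo, rate_hi], ignoring the mode invariant (a simplifying
    assumption of this sketch)."""
    return x0_lo + rate_lo * t, x0_hi + rate_hi * t


# Mode "on" of the example: xdot = -x + 5 with invariant x in [1, 3].
print(flow_bounds_affine(-1.0, 5.0, 1.0, 3.0))  # (2.0, 4.0)
```

For a general nonlinear $f$, the endpoint evaluation would have to be replaced by a sound bounding method such as interval arithmetic.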

Another useful technique is to overapproximate the evolution of continuous variables using polyhedral representation [111], [112]. Given a dynamical system $\mathdot{x}=f(x)$, let ${\cal R}_{[t_{k-1},t_{k}]}({\cal I})$, called a flow-pipe segment, be the set of states over the time interval $[t_{k-1},t_{k}]$ reachable from the initial set ${\cal I}$ at time $t_{0}$, and let $(C,d)$ be a matrix–vector pair that defines a polyhedron to approximate the flow-pipe segment such that $x\in\{x\vert Cx\leq d\}$ for any $x\in{\cal R}_{[t_{k-1},t_{k}]}({\cal I})$. Then, for a given $C$, the optimal $d^{\ast}$ that minimizes the overapproximation error can be determined by solving, for each row $i$ of $C$, the optimization problem $$\eqalignno{&\max_{x_{0},t}\quad c_{i}^{T}x(t,x_{0})\cr&{\hbox {s.t.}}\qquad x_{0}\in{\cal I}\quad{\hbox {and}}\quad t\in[t_{k-1},t_{k}]&\hbox{(4)}}$$ where $c_{i}^{T}$ is the $i$th row vector of $C$ that is the unit normal vector to the $i$th face of the polytope $C x\leq d$, and $x(t,x_{0})$ is the solution of $\mathdot{x}=f(x)$ at time $t$ from the initial state $x_{0}$. Then, from the optimal solution $(t^{\ast},x_{0}^{\ast})$ of (4), the optimal value $d_{i}^{\ast}$ for the given $c_{i}$ is determined as $d_{i}^{\ast}=c_{i}^{T}x(t^{\ast},x_{0}^{\ast})$. Now, the question is how to determine $C$. A heuristic approach is also proposed in [111] based on a convex hull computation from a set of vertices. Assuming that ${\cal I}$ is a polyhedron, let ${\cal V}({\cal I})$ be the set of vertices of ${\cal I}$, and let ${\cal V}_{t}({\cal I})=\{x(t,v)\vert v\in{\cal V}({\cal I})\}$. Then, the matrix $C$ can be determined by the set of outward pointing normal vectors of the convex hull that is obtained from the set of vertices ${\cal V}_{t_{k-1}}({\cal I})\cup{\cal V}_{t_{k}}({\cal I})$.

As noted earlier, the notion of a bisimulation relation between transition systems is crucial for the decidability result of several classes of HAs that have fairly simple continuous dynamics, such as timed automata. This notion can be extended to explicitly include the observation error in its definition so that a larger class of continuous dynamics $\mathdot{x}=f(x)$ can be abstracted as a finite state transition system that is *approximately* bisimilar to the original continuous dynamics. If we let ${\cal T}_{M}(\Sigma,\Pi)$, called a metric transition system, be the set of transition systems associated with a set of labels $\Sigma$ and a set of observations $\Pi$ where $(Q,d_{Q})$ and $(\Pi,d_{\Pi})$ are metric spaces, then, for ${\cal T}_{1},{\cal T}_{2}\in{\cal T}_{M}(\Sigma,\Pi)$, an approximate bisimulation relation is defined as follows [106].

A relation ${\cal B}_{\delta}\subseteq Q_{1}\times Q_{2}$ is a $\delta$-approximate bisimulation relation between ${\cal T}_{1}$ and ${\cal T}_{2}$ if for all $(q_{1},q_{2})\in{\cal B}_{\delta}$: 1) $d_{\Pi}({\langle\langle q_{1}\rangle\rangle}_{1},{\langle\langle q_{2}\rangle\rangle}_{2})\leq \delta$; 2) $\forall q_{1}\buildrel{\sigma}\over{\rightarrow}_{1}q_{1}^{\prime}$, $\exists q_{2}\buildrel{\sigma}\over{\rightarrow}_{2}q_{2}^{\prime}$ such that $(q_{1}^{\prime},q_{2}^{\prime})\in{\cal B}_{\delta}$; and 3) $\forall q_{2}\buildrel{\sigma}\over{\rightarrow}_{2}q_{2}^{\prime},\exists q_{1}\buildrel{\sigma}\over{\rightarrow}_{1}q_{1}^{\prime}$ such that $(q_{1}^{\prime},q_{2}^{\prime})\in{\cal B}_{\delta}$.

Concerning an approximate bisimulation relation, if a nonlinear control system is incrementally asymptotically stable [113], then it is $\delta$-approximately bisimilar [114] to a symbolic model of the original continuous system that can be constructed by aggregating states and control inputs using several parameters such as $\tau\in \BBR^{+}$ for time domain quantization, $\eta\in \BBR^{+}$ for state space quantization, and $\mu \in\BBR^{+}$ for input space quantization satisfying the following inequality: $$\beta(\delta,\tau)+\mu+\eta/2\leq\delta\eqno{\hbox{(5)}}$$ where $\beta$ is a ${\cal KL}$ function [31]. Once we have such a symbolic model of a continuous control system, a controller satisfying a given specification can be synthesized automatically using techniques developed in supervisory control of discrete event systems or algorithmic game theory [114], [115]. Other results relevant to automatic controller synthesis for hybrid systems can be found in [116] on algorithmic controller synthesis through finite bisimulation to satisfy LTL specifications of discrete-time linear systems and in [117] on the synthesis of control laws for piecewise-affine hybrid systems on simplices.
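
To see how inequality (5) couples the quantization parameters, the sketch below evaluates it for one assumed ${\cal KL}$ function, $\beta(r,t)=re^{-\lambda t}$. This exponential form and the rate $\lambda$ (`lam`) are illustrative assumptions and are not specified in [114]; for a given system, $\beta$ has to come from its own stability analysis.

```python
import math


def beta(r, t, lam=1.0):
    """Assumed KL function: exponential decay r * exp(-lam * t)."""
    return r * math.exp(-lam * t)


def quantization_ok(delta, tau, eta, mu, lam=1.0):
    """Check inequality (5): beta(delta, tau) + mu + eta/2 <= delta."""
    return beta(delta, tau, lam) + mu + eta / 2 <= delta


def max_state_quantization(delta, tau, mu, lam=1.0):
    """Largest eta allowed by (5) for the given delta, tau, and mu
    (0.0 if no positive eta satisfies it)."""
    return max(0.0, 2 * (delta - beta(delta, tau, lam) - mu))


print(quantization_ok(0.1, 1.0, 0.05, 0.01))   # longer sampling time
print(quantization_ok(0.1, 0.1, 0.05, 0.01))   # shorter sampling time
```

Under this assumed $\beta$, a longer time quantization $\tau$ leaves more room for coarse state and input quantization, which is why the first call succeeds and the second does not.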

It is important to note that, based on these theoretical results, many software tools have been developed for formal verification and automatic controller synthesis of hybrid systems. Some examples are UPPAAL [118], a verification tool for real-time systems based on timed automata, HyTech [99] and PHAVer [119] for LHA, SpaceEx [120], which is based on the LeGuernic–Girard (LGG) algorithm [121] that can efficiently handle HAs with linear differential equations with a larger number of system states compared to other approximation techniques, and PESSOA [122], which is a tool for controller synthesis based on [114]. More details and results can be found in [123], [124], [125], [126], [127] for various approaches for reachability, in [128] for other classes of systems for which some verification/synthesis problems are decidable, and in [129], [130], [131], [132] for abstractions of hybrid systems.

A major goal in the design of CPSs is to have formal proofs of correctness of the overall system design. This overall system can however be quite complex, involving not only differential-equation-based dynamics of the physical system, but also discrete models of the physical system, as well as interaction with real-time computation and communication. The system is the composition of several systems. Thus, proofs of correctness will have to be holistic and transcend domains. An example is a proof of correctness of an automobile traffic control system in [133] and [134]. It involves not only differential equation models of automobiles, but also a balls-and-bins model of the positions of all cars, which is necessary to prove properties such as deadlock avoidance [135]. Also involved is real-time scheduling of the computational tasks. Similarly, the design of CPSs can involve several choices such as the extent of centralization, and the extent of robustness, both of which may have to be made, keeping in mind the provable correctness of the design. An example is a design accompanied with a proof of correctness for an automated traffic intersection [136]. So far, verification for such systems has mostly involved pencil–paper proofs and interactive theorem prover-based verification [137], [138], [139], [140]. For future systems, it would be valuable to have compositional frameworks, and more systematic or automated methods for proving correctness.

SECTION IV

Computing and networking are key driving forces and key components of new highly connected, distributed, and reliable CPSs. We review classical and recent results in these areas.

In real-time systems, the correctness of a system depends not only on the logical results of the computation but also on the time at which the results are produced [141]. One of the primary design objectives of a real-time computing system is to support *temporally predictable* execution of a set of computing tasks so that it is guaranteed that there will be timely interaction between computing tasks and the physical environment. More precisely, for a given set of computing tasks $\Gamma=\{\tau_{1},\tau_{2},\ldots,\tau_{n}\}$ with timing constraints, a set of processors $P= \{P_{1},P_{2},\ldots,P_{m}\}$, and a set of resources $R=\{R_{1},R_{2},\ldots,R_{r}\}$, a real-time computing system has to make an appropriate scheduling decision on $\Gamma$ to meet all the timing constraints. If some tasks cannot meet their timing constraints, then the system should be able to determine this in advance. However, it is known to be computationally intractable to solve such scheduling problems in general [142].

One of the most influential results in real-time scheduling theory is based on a task set model, which is simple enough to be computationally tractable and also practical enough to be useful in many applications [6]. Consider the scheduling problem for a set of preemptible and periodic tasks under both fixed (or static) and dynamic priority assignment based on the following assumptions: 1) all tasks $\tau_{i}\in\Gamma$ are independent, i.e., there is no shared resource or precedence relationship between tasks; 2) all instances of a periodic task $\tau_{i}$ have the same relative deadline $D_{i}$ and it is equal to its period $T_{i}$; 3) all instances of a periodic task $\tau_{i}$ have the same worst case execution time $C_{i}$; and 4) there is only one processor.

The rate monotonic (RM) policy is a static priority scheduling policy that assigns priorities to tasks based on the rate of arrivals of jobs in the task. The shorter the period of a task, the higher is the priority assigned to the task. It is optimal among all static priority scheduling policies in the sense that if a periodic task set can be scheduled by some static priority policy, then it can be scheduled by RM [6]. Moreover, there is a simple sufficient condition for schedulability: $\sum_{i=1}^{n}U_{i}\leq n(2^{1/n}-1)$, where $U_{i}=C_{i}/T_{i}$ is the processor utilization of $\tau_{i}$. There is also a less conservative schedulability condition for RM, called a hyperbolic bound [143]. For a set of periodic tasks whose relative deadlines are less than their periods, the deadline monotonic (DM) scheduling policy is an extension of RM. For the exact schedulability analysis for a given periodic task set, there is an iterative algorithm, called response-time analysis [144].
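
The three RM tests mentioned above can be compared on a small task set. This is a minimal sketch under the assumptions listed earlier (independent, preemptible, periodic tasks with $D_{i}=T_{i}$ on one processor). The hyperbolic condition is written in its commonly cited form $\prod_{i}(U_{i}+1)\leq 2$ and the response-time recurrence in its standard fixed-point form; neither formula is quoted from this paper, and the task set is hypothetical.

```python
def rm_liu_layland(tasks):
    """Sufficient test: sum of U_i <= n * (2**(1/n) - 1).

    tasks is a list of (C_i, T_i) pairs.
    """
    n = len(tasks)
    return sum(c / t for c, t in tasks) <= n * (2 ** (1 / n) - 1)


def rm_hyperbolic(tasks):
    """Sufficient test (hyperbolic form): product of (U_i + 1) <= 2."""
    product = 1.0
    for c, t in tasks:
        product *= c / t + 1
    return product <= 2.0


def rm_response_times(tasks):
    """Response-time analysis with integer C_i, T_i.

    Returns the worst-case response times in RM priority order,
    or None if some task misses its deadline D_i = T_i.
    """
    ordered = sorted(tasks, key=lambda ct: ct[1])  # shorter period first
    result = []
    for i, (c_i, t_i) in enumerate(ordered):
        r = c_i
        while True:
            # Own execution plus preemption by all higher priority tasks;
            # -(-r // t_j) is the integer ceiling of r / t_j.
            nxt = c_i + sum(-(-r // t_j) * c_j for c_j, t_j in ordered[:i])
            if nxt > t_i:
                return None
            if nxt == r:
                break
            r = nxt
        result.append(r)
    return result


tasks = [(1, 4), (2, 6), (3, 10)]  # hypothetical (C_i, T_i) pairs
print(rm_liu_layland(tasks), rm_hyperbolic(tasks), rm_response_times(tasks))
```

For this task set both sufficient tests are inconclusive (total utilization is about 0.88), while the exact analysis shows that every task meets its deadline, which illustrates why the utilization bounds are called conservative.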

The earliest deadline first (EDF) policy assigns priorities to tasks according to the absolute deadline of their instances. Hence, EDF is a dynamic scheduling policy. It is optimal in that if there is any schedule that can meet all deadlines, then EDF will too. For task sets with deadlines less than periods, a necessary and sufficient condition for schedulability under EDF is derived in [145].
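
One widely used exact test for EDF with deadlines no larger than periods is the processor demand criterion: the total execution demand of all jobs with both release and deadline inside $[0,L]$ must not exceed $L$, for every $L$. The sketch below assumes synchronous release at time 0 and integer parameters, and checks all absolute deadlines up to the hyperperiod. Whether this matches the exact form of the condition in [145] is an assumption; the function name and task sets are illustrative.

```python
from functools import reduce
from math import gcd


def edf_demand_test(tasks):
    """Processor demand test for EDF.

    tasks is a list of integer triples (C_i, D_i, T_i) with D_i <= T_i,
    all released together at time 0.
    """
    hyper = reduce(lambda a, b: a * b // gcd(a, b), (t for _, _, t in tasks))
    # Total utilization must not exceed 1 (checked in exact integer form).
    if sum(c * (hyper // t) for c, _, t in tasks) > hyper:
        return False
    # Demand only changes at absolute deadlines, so test those instants.
    deadlines = sorted({d + k * t
                        for _, d, t in tasks
                        for k in range(hyper // t)})
    for length in deadlines:
        demand = sum(((length - d) // t + 1) * c
                     for c, d, t in tasks if length >= d)
        if demand > length:
            return False
    return True


print(edf_demand_test([(2, 3, 4), (2, 5, 6)]))  # True
print(edf_demand_test([(2, 2, 4), (2, 3, 6)]))  # False: demand 4 by time 3
```

Both example task sets have the same utilization, so the difference in outcome comes entirely from the tighter deadlines in the second set.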

Many scheduling algorithms have been proposed to simultaneously handle both *hard* periodic tasks and *soft* aperiodic tasks. The primary objective is to minimize the response time for aperiodic tasks without compromising the schedulability of the periodic tasks. In fixed-priority scheduling, a basic idea is to create a periodic task $\tau_{s}=(T_{s},C_{s})$ for serving aperiodic tasks where $T_{s}$ is the period and $C_{s}$ is the computation time for $\tau_{s}$, called server capacity, in addition to the hard periodic task set. Some examples of scheduling algorithms based on this idea are polling server [146], deferrable server [147], and priority exchange server [148].

For dynamic scheduling, especially under EDF, one useful idea, called the total bandwidth server (TBS) [149], to handle aperiodic requests along with a set of periodic tasks, is to assign each aperiodic request a deadline such that the overall aperiodic load never exceeds a specified value of processor utilization $U_{s}$ that is called the *bandwidth* of an aperiodic server. When the $k$th aperiodic request which requires $C_{k}$ amount of execution time arrives at time $r_{k}$, then the deadline assigned to this aperiodic task is $d_{k}=\max\{r_{k},d_{k-1}\}+(C_{k}/U_{s})$, where $d_{k-1}$ is the deadline assigned previously for the $(k-1)$th aperiodic request. However, TBS cannot be used when $C_{k}$ is unknown. For such cases, a bandwidth reservation mechanism, called the constant bandwidth server (CBS), is proposed in [150]. The basic idea of CBS is that when a new aperiodic request arrives, it is assigned a suitable deadline that is determined by the currently available bandwidth resource for the request. If the request cannot be completed before its deadline, then its deadline is postponed. Notice that, under EDF this implies that its priority is decreased and thus the interference to the other tasks is reduced.
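
The TBS deadline assignment rule is a one-line recurrence. The following minimal sketch applies it to a short, hypothetical request sequence; taking $d_{0}=0$ as the initial value is an assumption of the sketch.

```python
def tbs_deadlines(requests, u_s):
    """Assign TBS deadlines d_k = max(r_k, d_{k-1}) + C_k / U_s.

    requests is a list of (r_k, C_k) pairs in arrival order and u_s is
    the server bandwidth; d_0 = 0 is assumed.
    """
    deadlines = []
    d_prev = 0.0
    for r_k, c_k in requests:
        d_k = max(r_k, d_prev) + c_k / u_s
        deadlines.append(d_k)
        d_prev = d_k
    return deadlines


# Server bandwidth 0.25: one unit of execution "costs" four time units.
print(tbs_deadlines([(3, 1), (9, 2), (14, 1)], 0.25))  # [7.0, 17.0, 21.0]
```

The third request arrives at time 14 but is anchored at the previous deadline 17, which is how the rule keeps the aperiodic load within the bandwidth $U_{s}$.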

For a task set that consists of aperiodic tasks with arbitrary arrival times, execution times, and deadlines, a utilization bound for schedulability is derived in [151]. In particular, the notion of *synthetic utilization*, denoted as $U(t)$, is introduced, which is roughly defined as the sum of utilization values of all arrived aperiodic requests whose deadlines have not yet expired at time $t$. A set of $n$ aperiodic tasks is schedulable under the deadline monotonic scheduling policy if, for all $t$, $U(t)\leq UB(n)$ where $UB(1)=1$, $UB(2)=0.75$, and $UB(n)={1\over 1+\sqrt{{1\over 2}\left(1-{1\over n-1}\right)}}$ for $n\geq 3$. Deadline monotonic scheduling policy is optimal among all time-independent scheduling policies, a generalized notion of fixed-priority scheduling that applies to aperiodic tasks, since no other time-independent scheduling policy can have a higher upper bound for $U(t)$. For the case of dynamic aperiodic task scheduling, EDF is optimal and its utilization bound is 1.
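
The bound $UB(n)$ and an admission check based on it can be sketched as follows. Reading the bound as $1/\bigl(1+\sqrt{(1/2)(1-1/(n-1))}\bigr)$ for $n\geq 3$, and taking the utilization of a single request to be its execution time divided by its relative deadline, are assumptions of this sketch; the request list is hypothetical.

```python
import math


def synthetic_utilization_bound(n):
    """UB(n) as stated in the text: UB(1) = 1, UB(2) = 0.75, and
    1 / (1 + sqrt((1/2) * (1 - 1/(n - 1)))) for n >= 3."""
    if n == 1:
        return 1.0
    if n == 2:
        return 0.75
    return 1.0 / (1.0 + math.sqrt(0.5 * (1.0 - 1.0 / (n - 1))))


def synthetic_utilization(requests, t):
    """Sum of C/D over requests that have arrived and whose deadlines
    have not yet expired at time t.

    requests is a list of (arrival, C, D) triples; using C/D as the
    per-request utilization is an assumption of this sketch.
    """
    return sum(c / d for (a, c, d) in requests if a <= t < a + d)


requests = [(0, 1, 4), (1, 1, 5), (2, 2, 10)]  # hypothetical requests
print(synthetic_utilization(requests, 2) <= synthetic_utilization_bound(3))
```

As $n$ grows, the bound decreases toward $1/(1+\sqrt{1/2})\approx 0.586$.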

In reality, tasks are not generally independent since they typically share resources such as memory, files, communication network, etc., for their execution in a mutually exclusive manner. In such cases, a higher priority task can be blocked by a lower priority task due to resource sharing. This is called *priority inversion* and its duration can be arbitrarily long. In [152], a simple solution is proposed that is called the priority inheritance protocol (PIP). The basic idea of PIP is to let a lower priority task that currently holds the shared resource temporarily inherit the highest priority among the blocked tasks until it releases the resource. After releasing the resource, it recovers its original priority. However, it is known that priority inheritance does not prevent deadlocks. The priority ceiling protocol (PCP) is also proposed in [152] as an extension of the PIP to resolve this issue. Under EDF, the PIP and the PCP are not applicable since they are based on the fixed-priority scheduling system. For such cases, the stack resource policy (SRP) [153], extended from the PCP to support dynamic priority scheduling and to allow the sharing of runtime stack-based resources, is a useful mechanism that is applicable to both fixed-priority and dynamic scheduling.

It is of interest to determine guarantees for jobs that have to be processed by a sequence of processors, i.e., as they move through a network. The work on stability of reentrant lines provides bounds on end-to-end delays that are of potential interest because it establishes a pipeline property where the delay is related to the bottleneck node [154]. Other important results can be found in [155] and [156] on real-time queueing theory for stochastic analysis of soft real-time systems, in [157] and [158] on real-time scheduling analysis in a resource partitioned computing environment, in [159] on resource kernels as an approach for operating system resource management based on resource reservation for real-time applications, in [160], [161], [162] on a control theoretical approach to performance and throughput management of computing systems, in [163] on real-time scheduling algorithms for power management in embedded real-time systems, and in [7] for more on real-time system theories.

There are several important characteristics which make today's CPSs different from earlier generation control systems: 1) the scale of a CPS is much larger; 2) entities comprising a CPS typically run over heterogeneous environments; and 3) entities interact with each other in a very complex manner. It is also expected that CPSs should be highly extensible for new functionalities, and flexible for runtime adaptation. Due to such structural and behavioral complexities, it is more challenging to design and implement a CPS. To overcome such complex issues, it is becoming increasingly important to develop a software platform, called middleware, based on an appropriate abstraction of such complex systems, and a well-designed architecture for rapid implementation of reliable and evolvable CPS applications [134].

The Common Object Request Broker Architecture (CORBA) [164] is a well-known industry standard specification for middleware developed by the Object Management Group (OMG). It is designed primarily for interoperability between software objects running on different machines in a heterogeneous distributed environment. Thus, it is not designed for control system applications in which temporal predictability is essential. Later, Real-Time CORBA was developed as an extension of CORBA to support temporally predictable end-to-end interactions between client and server objects in a system. It defines a set of mechanisms and interfaces, such as synchronization mechanisms, a thread pool model, a scheduling service, and explicit binding, which enable applications to explicitly manage system resources.

The ACE ORB (TAO) [165] is an implementation of Real-Time CORBA. It has been used in application areas such as telecommunication, aerospace, medical, and financial services. It is used as a middleware framework for an application development platform, called open control platform (OCP) [166], developed for complex and reconfigurable control system applications under the U.S. DARPA Software Enabled Control research program.

Etherware [133] is a middleware developed for large-scale networked control systems. It is based on the concept of microkernel architecture and supports component-based application development. It also supports runtime reconfiguration, such as component upgrade, and even migration at runtime from one computing node to another. This is possible through an Etherware component model that is based on several software design patterns [167]. Etherware has been enhanced to support time-critical systems by incorporating quality of service (QoS) in component interaction and a real-time scheduling mechanism for interactions [168]. As an illustrative example of an Etherware-based CPS, Fig. 3 shows how a distributed traffic system can be developed over Etherware.

A component-based middleware framework for networked embedded systems has been developed under the European Reconfigurable, Ubiquitous, Networked Embedded Systems (RUNES) project. As in Etherware, it is designed to support runtime reconfiguration provided by a middleware service, called the Logical Mobility Service. A number of components for network reconfiguration, localization, and collision avoidance have been developed based on the RUNES middleware component framework [169]. OSA+ [170] is another middleware based on the microkernel architecture for distributed real-time embedded systems. Other real-time middleware frameworks are RTZen [171], an implementation of Real-Time CORBA developed on a real-time Java platform, and ARMADA [172], a set of communication and middleware services for real-time embedded distributed applications.

Another approach to implementation of real-time CPSs is the development of programming languages. Giotto [173] provides platform-independent high-level abstractions that can be used for specifying time-triggered sensor readings, task invocations, actuator updates, and mode switches of control systems. Platform-specific issues such as schedulability analysis of a program on a specific platform are handled by the Giotto compiler. Giotto thereby decouples high-level real-time programming of real-time embedded systems from low-level real-time scheduling of computation and communication. Other programming languages for real-time systems that have been used successfully, especially in industry, are the synchronous languages such as Esterel [174] and Signal [175].

A discussion on the importance of time in computing abstractions of every layer of the computing system and possible approaches for the development of repeatable and predictable CPS can be found in [176].

CPSs rely on an underlying communication network to transport data packets between sensors, computational units, and actuators. For actions to be taken on time, these packets need to be delivered within a time deadline. Nodes may require a certain minimum throughput of such packets. Thus, CPSs need a *real-time* communication network that can provide guarantees on both the throughputs and delays of flows. The current Internet does not provide such QoS guarantees, which poses a significant challenge.

As motivation, current automobiles have about 75 sensors and 100 switches connected by a wired network. The wiring harness is heavy, complex, hard to assemble, expensive, and subject to failures. There is significant motivation for replacing the wiring harness by a wireless access point. This can potentially lead to savings in fuel economy and reduced manufacturing cost, as well as making it possible to perform software upgrades, and add or remove devices. Packets will then have to be delivered within timing constraints.

Such a system can be modeled as an access point serving $n$ clients [177]. Similar to Section IV-A [6], suppose that packets arrive, one for each client, at the beginning of each common period of $\tau$ slots. Suppose that each packet takes one slot to transmit, and in each slot the access point can attempt a transmission for one of the clients. The deadline for each packet is the end of the period. The throughput of each client is the long-term average of the number of packets delivered per period. Each client $c$ has a throughput requirement of $q_{c}$ packets/period, called the *timely* throughput since it only considers packets delivered by their deadline. The distinction from the deterministic model for real-time computation is that the wireless channels are unreliable. When the access point transmits a packet for client $c$, it only succeeds with probability $p_{c}$. (This model of channel reliability can be generalized [178].)

There are two fundamental questions concerning the QoS requirements $\{(q_{c},p_{c},\tau):c\in\{1,2,\ldots,n\}\}$: 1) are they feasible; and, if so, 2) what is an appropriate scheduling policy? The first question is the problem of *admission control*.

Let $\gamma_{c}$ be a geometrically distributed random variable with mean $1/p_{c}$ representing the number of transmissions needed to successfully deliver client $c$'s packet, and let $I(S):=E[{[\tau-\sum_{c\in S}\gamma_{c}]}^{+}]$ be the expected *unavoidable* idle time when the access point has to serve only the subset $S\subseteq\{1,2,\ldots,n\}$ of clients. With $z^{+}:=\max\{z,0\}$, the necessary and sufficient condition for feasibility [177] is
$$\sum_{c\in S}{q_{c}\over p_{c}}+I(S)\leq\tau,\qquad{\hbox {for every}}\ S\subseteq\{1,2, \ldots,n\}.\eqno{\hbox{(6)}}$$ The following weighted delivery debt policy fulfills the requirements of any feasible set of clients: give priority to clients according to the expected number of packets that ought to have been delivered minus the number of packets that have been actually delivered, weighted by some positive constant.
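
Condition (6) can be evaluated directly for small client sets. The sketch below is a brute-force illustration rather than the admission control procedure of [177]: it computes the expected idle time $I(S)$ by convolving truncated geometric distributions and then checks every nonempty subset $S$. The function names and example numbers are assumptions.

```python
from itertools import combinations


def expected_idle(ps, tau):
    """E[(tau - sum of gamma_c)^+] with gamma_c ~ Geometric(p_c) on {1, 2, ...}.

    dist[k] is the probability that exactly k slots are used; mass that
    would fall beyond tau slots is dropped because it contributes no idle time.
    """
    dist = [0.0] * (tau + 1)
    dist[0] = 1.0
    for p in ps:
        new = [0.0] * (tau + 1)
        for k, mass in enumerate(dist):
            if mass == 0.0:
                continue
            for g in range(1, tau - k + 1):
                new[k + g] += mass * (1 - p) ** (g - 1) * p
        dist = new
    return sum((tau - k) * mass for k, mass in enumerate(dist))


def feasible(q, p, tau, tol=1e-12):
    """Check condition (6) for every nonempty subset S of clients.

    q[c] is the required number of timely packets per period and p[c]
    the success probability of client c. The loop is exponential in the
    number of clients, so this is only practical for small examples.
    """
    n = len(q)
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            load = sum(q[c] / p[c] for c in subset)
            idle = expected_idle([p[c] for c in subset], tau)
            if load + idle > tau + tol:
                return False
    return True


print(feasible([0.75], [0.5], 2))  # True: 1 - 0.5**2 = 0.75 is attainable
print(feasible([0.8], [0.5], 2))   # False
```

For a single client with $p_{c}=0.5$ and $\tau=2$, the largest feasible $q_{c}$ is $0.75$, which equals the probability that at least one of the two transmission attempts succeeds.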

In some situations, task frequencies can be optimally tuned to support control systems [179]. Suppose that the throughputs $\{q_{c}\}$ are not prespecified, but there is a strictly concave and increasing utility function $U_{c}(q_{c})$ for each client, and the goal is to maximize the sum of the utilities: $\max_{(q_{1},q_{2},\ldots,q_{n})}\sum_{c=1}^{n}U_{c}(q_{c})$. This problem is difficult because the number of constraints (6) is exponential in $n$. One can decompose the problem into two subproblems, as in [180], by decoupling clients from the access point by using a price per unit throughput $\psi_{c}$ for each client, and the amount $\rho_{c}$ paid by client $c$. Then, client $c$'s problem is $\max_{\rho_{c}}[U_{c}(\rho_{c}/\psi_{c})-\rho_{c}]$, subject to $0\leq\rho_{c}\leq\psi_{c}$. The access point's optimization problem is to determine how much timely throughput to provide to each client, given that client $c$ is willing to pay $\rho_{c}$: $\max_{(q_{1},q_{2},\ldots,q_{n})}\sum_{c=1}^{n}\rho_{c}\log q_{c}$, subject to the constraints (6). Surprisingly, the access point's problem is solved [181] by simply giving higher priority to clients with a lower value of the ratio (number of slots provided to $c$ so far)/$\rho_{c}$. Also interesting is the consequence that neither the access point nor the clients need to know the channel reliabilities $(p_{1},p_{2},\ldots,p_{n})$.
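
The priority rule can be illustrated with a deliberately simplified sketch. It ignores the period structure, packet availability, and channel failures of the actual policy in [181], and simply hands out slots one at a time to the client with the smallest ratio of slots received so far to its payment $\rho_{c}$. The function names and payments are hypothetical.

```python
def next_client(slots_so_far, rho):
    """Index of the client with the smallest (slots so far) / rho_c.

    Ties are broken in favor of the lowest index.
    """
    return min(range(len(rho)), key=lambda c: slots_so_far[c] / rho[c])


def allocate(rho, num_slots):
    """Hand out num_slots slots one by one using next_client."""
    slots = [0] * len(rho)
    for _ in range(num_slots):
        slots[next_client(slots, rho)] += 1
    return slots


print(allocate([1.0, 3.0], 8))  # [2, 6]
```

In this simplified setting the slots are split in proportion to the payments, and the rule never consults the channel reliabilities.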

This formulation of the problem of real-time wireless communication can be extended to handle random arrivals [182], to model fading and rate adaptation [178], and to provide a minimum specified throughput to each client while maximizing the total utility even when the clients are strategic and noncooperative in revealing their true utilities [183]. There has also been work on simultaneous existence of flows with delay constraints as well as flows without delay constraints [184]. A major open problem is that of delay constraints in multihop networks.

For sensor networks, protocols have been developed to support real-time applications [185]. A protocol to support timeliness is presented in [186] that exploits cellular structure for the network architecture, the periodic nature of communication, and uses EDF to support real-time messages. The SPEED protocol [187] attempts to ensure that end-to-end delay is proportional to the distance traveled by the flow. RAP [188] is a communication architecture for supporting high-level query and event services. Nano Resource Kernel (Nano-RK) is a real-time operating system for sensor networks [189].

SECTION V

Wireless networks allow nodes to communicate with each other over the wireless medium, possibly by using other nodes as relays or cooperating in other more information theoretic ways. By attaching sensors to nodes and providing them with computational capability, one obtains wireless *sensor* networks. They can be deployed to monitor their environment, e.g., monitoring facilities for anomalies or monitoring wildlife [190], [191], [192], or to conduct physics-in-the-large by offering scientists the means to deploy large numbers of sensors in the field and wirelessly gather information from them, as at the Center for Embedded Networked Sensing [193]. By using active sensors they can estimate distances between nodes and thus their relative positions [194].

Two nodes not in range of each other may need to communicate over several hops. Therefore, a multihop wireless network must ensure that such a path exists between any two nodes, i.e., that the network is connected. The range of a wireless transmission will depend on its power, for the same data rate. If nodes do not employ adequate power, there may not be enough links in the network to produce a connected graph, while using too much power is wasteful. It is of interest to determine what is an appropriate range that ensures that a wireless network is connected.

A simple model is when $n$ nodes are randomly scattered, say uniformly in a square, and employ a common range $r_{n}$ that depends on $n$. When the nodes are few and therefore sparse, they need to choose a large range to form a connected graph, while if there are many nodes, and the network is hence dense, then they can each choose a small range. The network is connected with a probability approaching one as the number of nodes $n\rightarrow\infty$, if and only if $r_{n}=\sqrt{(\log n+\gamma_{n})/(\pi n)}$, where $\lim_{n\rightarrow+\infty}\gamma_{n}=+\infty$ [195], [196]. When nodes are more regularly spread, one can reduce the range while still being connected.
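
To get a feel for how quickly the required range shrinks with density, the sketch below evaluates $r_{n}=\sqrt{(\log n+\gamma_{n})/(\pi n)}$ for a few values of $n$. It assumes a deployment region of unit area and uses $\gamma_{n}=\log\log n$ purely as one example of a sequence that diverges to $+\infty$; both choices are assumptions, not prescriptions from [195], [196].

```python
import math


def critical_range(n, gamma_n):
    """Common range r_n = sqrt((log n + gamma_n) / (pi * n)) for n nodes
    in a region of unit area."""
    return math.sqrt((math.log(n) + gamma_n) / (math.pi * n))


for n in (100, 1000, 10000):
    gamma_n = math.log(math.log(n))  # one slowly diverging choice
    print(n, round(critical_range(n, gamma_n), 4))
```

The range decreases roughly like $\sqrt{\log n/n}$, so a hundredfold increase in the number of nodes reduces the required range by less than a factor of ten.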

A related problem is that of coverage by sensors. If each node has a sensor that can detect events within a distance $r_{n}$, one is interested in how large $r_{n}$ has to be so that every point in the entire domain is covered by some sensor [197].

The overall sensor network may often be deployed untended over a long duration, with the nodes drawing energy from their batteries or from renewable energy sources, such as solar cells, and one is interested in ensuring that the networks can survive a long duration in the field before requiring attention, e.g., replacing batteries or other maintenance [198]. All protocols used will therefore have to be energy efficient.

Clearly, any collision of packets leads to packet loss and is wasteful. Nodes will need to coordinate their wireless transmissions to avoid interfering with each other. This needs a medium access protocol. It must efficiently use the transmission medium and avoid wasting the communication spectrum that is a common resource of the network, and also be energy efficient. In contrast to wireless local area networks, ensuring fairness to all nodes is not important since sensor networks are often deployed for a specific purpose. Thus, one can design a medium access control protocol specifically for sensor networks. Also, a node wastes energy if its radio is “on” listening to packets that are not intended for it, or just “on” when there is no nearby ongoing transmission. One of the most significant ways to save power is to turn off a node's radio and put it to sleep. The protocol Sensor-MAC (S-MAC) [199] takes advantage of such sleep to save power. Sleeping can be initiated on the basis of time, i.e., by scheduled sleep, or by implicit signaling that occurs due to a neighbor's transmission. The former requires clock synchronization. Collisions can be avoided by using control packets, e.g., “request-to-send” (RTS) and “clear-to-send” (CTS) as in IEEE 802.11 [13]. Long packets can be fragmented into smaller packets, so that not all is lost when a long packet is corrupted. However, transmission of the short packets can be done in a single burst, after only a single RTS and CTS, thus amortizing their overhead. Through such strategies, MAC protocols can be made energy efficient specifically for sensor network deployment [200]. The protocol B-MAC is motivated by the goal of simple implementation, and aims at only providing link-layer functionality, relegating other functionalities like task synchronization and organization to higher layers, which can then employ the mechanisms exposed by B-MAC so as to adapt to changing network or channel conditions. It employs carrier sensing and adaptive preamble sampling to design an efficient MAC for sensor network monitoring applications.

A routing protocol is needed for two nodes that are not neighbors to communicate. Peer-to-peer routing, multicast and all-cast may be needed by sensor network applications. Very commonly, the data gathered, or the information that is extracted, may need to be communicated to a designated “sink” or “collector” or “fusion” node, which may also possibly serve as a gateway for exfiltrating the desired information out of the sensor network. This is called “ConvergeCast” [201]. It may need to be done in an energy-efficient manner, or with low delay, depending on the application. In some applications, the identity of a node may not be important; only its data may be relevant. This can simplify ID or address management schemes. Nodes may be limited in their processing or storage capabilities. The data collected from nearby nodes may have considerable redundancy, which can also be exploited in designing an efficient protocol. The protocol may be query based, responding to particular information that is sought, or the dissemination may be content based. The protocol itself may be flat, hierarchical, or even location based [202].

TinyOS [203], [204], an open source operating system developed for sensor networks, has triggered much experimental and deployment activity in sensor networks. The challenges in the networking, operating system, and middleware layers are surveyed in [205]. The IEEE 802.15.4 standard [206] specifies the physical layer and medium access control for wireless personal area networks. The ZigBee Alliance [207] builds upon IEEE 802.15.4 to specify high-level protocols for low data rate and low energy consumption applications. WirelessHART [208], [209] is an open communication standard for process control. So is ISA100.11a, which has been developed by the International Society of Automation (ISA) [210]. An Internet Engineering Task Force Working Group has developed 6LoWPAN [211], [212], [213] to use Internet Protocol version 6 over IEEE 802.15.4. It allows interoperability with Internet Protocol (IP) links, while still being energy efficient, reliable, adaptable to applications, and allowing management of a large number of nodes. Interoperability with IP also allows use of established security mechanisms, network management tools, transport protocols, and services for naming, addressing, discovery, etc.

In many applications, it is important that sensor measurements be time stamped. In fact, this is an important aspect of CPSs because the physical world's evolution does depend on time. Different nodes in the network may however have different clocks, and so it is necessary to synchronize them. Another reason is that in order to save energy, it is important that nodes go to “sleep” most of the time, and “wake up” only when necessary to hear or send a transmission, or take a sensor measurement. When a node wakes up and transmits, it is necessary that the receiving node also be in an awake state. The more accurately their sleep–wake times are coordinated, the less energy is wasted in an awake but idle state.

When clocks are linear, they can be described by their skew (rate) and offset. Two neighboring nodes can exchange time-stamped packets. If there is a constant but unknown time delay in such packet exchanges that is symmetric, i.e., the same for transmissions in both directions, then the nodes can determine all three quantities—offset, skew, and time delay [214], [215].
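The two-node estimation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of [214] or [215]: the model $t_B = \text{skew} \cdot t_A + \text{offset}$, a delay measured on node A's clock, noise-free time stamps, and all function and variable names are hypothetical choices made for the example. At least two forward packets with distinct send times are needed to determine the skew.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit ys ~ slope * xs + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def estimate_clock_parameters(forward, reverse):
    """Estimate (skew, offset, delay) of clock B relative to clock A.

    Assumed model (hypothetical notation): t_B = skew * t_A + offset, with a
    one-way delay, measured on A's clock, that is the same in both directions.

    forward: list of (send time on A's clock, receive time on B's clock)
    reverse: list of (send time on B's clock, receive time on A's clock)
    """
    # A -> B: r = skew * (s + delay) + offset = skew * s + c1
    skew, c1 = fit_line([s for s, _ in forward], [r for _, r in forward])
    # B -> A: v = (u - offset) / skew + delay, so skew * v - u = c2
    # with c2 = skew * delay - offset
    c2 = mean(skew * v - u for u, v in reverse)
    offset = (c1 - c2) / 2
    delay = (c1 + c2) / (2 * skew)
    return skew, offset, delay


# Synthetic, noise-free exchange generated from known parameters.
true_skew, true_offset, true_delay = 1.0001, 5.0, 0.002
forward = [(s, true_skew * (s + true_delay) + true_offset) for s in (0.0, 10.0, 20.0)]
reverse = [(u, (u - true_offset) / true_skew + true_delay) for u in (30.0, 40.0)]
print(estimate_clock_parameters(forward, reverse))
```

The symmetry assumption is what makes the problem solvable: the forward packets yield the sum skew·delay + offset and the reverse packets yield the difference skew·delay − offset, so the two can be separated.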

In a network, one can multiply the skews over successive links in a path to determine the skew between two remote nodes, and likewise one can also estimate offsets. The Flooding Time Synchronization Protocol [216] time stamps packets at the MAC layer and uses linear regression to smooth noisy time stamps and delays. When the synchronization error at each link is independent with a certain standard deviation, then summed over the links along the path, the error grows as $O(\sqrt{d})$, where $d$ is the diameter of the network. In a grid topology where $n$ nodes are located at, say, points with integer $x$ and $y$ coordinates in a square, the synchronization error grows like $O(n^{1/4})$. If nodes are uniformly and randomly located in a square of side 1, then the critical range at which the network gets connected is $O(\sqrt{(\log n)/n})$, as noted earlier. Then, the synchronization error grows like $O((n/\log n)^{1/4})$ [217]. All these errors grow polynomially with the number of nodes in the network. However, one can do much better by combining estimates over different paths [218], [219]. The error is then related to the resistance distance of the graph, i.e., the resistance between two nodes when each link is replaced by a 1-$\Omega$ resistance [218]. The resulting error in a critically connected random wireless network is then only $O(1)$, showing that error can indeed be kept bounded even in random wireless networks with a large number of nodes [217].
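The resistance analogy can be made concrete with a small sketch. It assumes independent, zero-mean link errors with a common standard deviation, and it treats only link-disjoint paths, for which series and parallel rules suffice; a general graph requires a full effective-resistance computation. The function names are hypothetical.

```python
import math


def path_error_std(num_links, link_std):
    """Std. dev. of an estimate formed by summing independent per-link
    errors along a path: variances add, like resistors in series."""
    return math.sqrt(num_links) * link_std


def combine_independent_estimates(variances):
    """Variance of the inverse-variance-weighted average of independent,
    unbiased estimates: reciprocals add, like resistors in parallel."""
    return 1.0 / sum(1.0 / v for v in variances)


# One 16-link path: the standard deviation grows as the square root of length.
print(path_error_std(16, 1.0))                         # 4.0
# Four link-disjoint 16-link paths combined: variance drops from 16 to 4.
print(combine_independent_estimates([16.0] * 4))       # 4.0
```

As more independent paths become available between two nodes, the combined variance keeps falling, which is the intuition behind the bounded error in well-connected networks.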

The *raison d'etre* for sensor networks is that they can provide *information* about the environment, which may be exfiltrated through a designated gateway to an external entity. To do this, the *data* gathered by the sensor nodes has to be processed to determine relevant information. One strategy is to send all data from all nodes to the sink or gateway node, where it is centrally processed. However, this may be very wasteful of energy and communication bandwidth due to the large amount of data. An alternative is for all nodes to conduct processing, and only send along to other nodes what is relevant. This strategy is feasible because individual nodes in sensor networks have computation capabilities. Nodes can thereby trade off computation for communication. This strategy is called “in-network computation,” and how it is to be best done is an important issue.
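The trade-off between computation and communication can be illustrated with a minimal sketch that computes the maximum reading over a routing tree. The tree representation (a `parent` dictionary with the sink mapped to `None`), the cost measure (one transmitted value per hop), and all names are assumptions made for the example.

```python
def depth(node, parent):
    """Number of hops from `node` to the sink (the node whose parent is None)."""
    hops = 0
    while parent[node] is not None:
        node = parent[node]
        hops += 1
    return hops


def centralized_max(values, parent):
    """Forward every raw reading to the sink; return (max, values transmitted)."""
    transmissions = sum(depth(n, parent) for n in values)
    return max(values.values()), transmissions


def in_network_max(values, parent):
    """Each node sends its parent one value: the max over its own subtree."""
    partial = dict(values)
    transmissions = 0
    # Process deepest nodes first so children report before their parent does.
    for node in sorted(values, key=lambda n: depth(n, parent), reverse=True):
        p = parent[node]
        if p is not None:
            partial[p] = max(partial[p], partial[node])
            transmissions += 1
    sink = next(n for n in parent if parent[n] is None)
    return partial[sink], transmissions


# A five-node chain 4 -> 3 -> 2 -> 1 -> 0, with node 0 as the sink.
parent = {0: None, 1: 0, 2: 1, 3: 2, 4: 3}
values = {0: 3, 1: 7, 2: 1, 3: 9, 4: 4}
print(centralized_max(values, parent))  # (9, 10)
print(in_network_max(values, parent))   # (9, 4)
```

Both strategies return the same answer, but on this chain the in-network version transmits 4 values instead of 10, and the gap widens as the tree gets deeper.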

An early precursor is the communication complexity problem in distributed computing [220]. The goal is to exchange the minimum number of bits between two nodes which each possess the value of one variable, so that they can determine the value of a function of the two variables. Similar to block communication, one can compute several instances of the function, giving rise to the direct sum problem [221]. In information theory, a similar problem is source coding with side information. One variant is to require zero error for finite block lengths [222], [223].

The problem of computing a function corresponds to a rate-distortion problem with a particular choice of distortion measure, and the required capacity is the conditional graph entropy [224]. The problem of computing some symmetric functions, i.e., invariant to permutations of their arguments, has been considered in the context of a wireless sensor network in [225]. For a random wireless network with $n$ randomly located nodes, the shared aspect of the wireless medium is modeled by each wireless transmission consuming a certain interference footprint. The rate at which the *Average* of nodal values can be computed is $\Theta(1/ \log n)$, when each node chooses a communication range that leads to a connected graph. Interestingly, this problem does not benefit, up to order, from allowing block computation. In contrast, computing the *Maximum* does significantly benefit from allowing block computation; the computational rate is $\Theta(1/(\log\log n))$. Such symmetric functions are of interest because many statistical functions are symmetric, and because they embody the data-centric paradigm where only nodal values are relevant, and not nodal identities. The problem of computing divisible functions is addressed in [225], while the problem of computing divisible functions that are amenable to divide and conquer is considered in [225], [226], [227].

When data are random, one can consider *optimal* function computation to minimize the expected number of bits communicated. This has been considered for symmetric functions that are Boolean valued [228], [229], and some specific problems are solved optimally or near-optimally when nodes are collocated within one hop of each other.

One can also consider the problem of in-network computation from an information theoretic point of view; this has been done for two nodes in [230] and for collocated nodes in [231]. The problem of computing in noisy networks is considered in [232], [233], [234], [235], [236], [237], [238], [239]. Related information theoretic problems are studied in [240].

There are also interesting issues at the sensing end [241]. For example, sensing nodes may provide erroneous measurements about the environment, and it is important to detect that based on correlated sensor measurements from neighboring nodes. More generally, there is the problem of how nodes in a network can self-calibrate themselves [242].

SECTION VI

Security is a critical aspect of any safety-critical system, i.e., one where physical harm can be caused. Much remains to be done for security of CPSs. The case of an attack on a Supervisory Control and Data Acquisition system is described in [243]. There have been attacks on natural gas pipeline systems [244], trams [245], power utilities [246], and water systems [247]. More recently, the Stuxnet worm attacked control systems [248], [249], [250]. There has been much work on security at the computational and communication layers, but CPSs have additional challenges since they involve not only the communication and computation layers, but also the control layer and the physical system itself. At the same time, one can also exploit the features of CPSs to develop approaches to security.

Several new challenges and a research roadmap are presented in [251]. Due to the feedback processes between the physical and cyber parts, there are new communication “channels” that need to be secured. Some large-scale systems, e.g., the power grid, are federated. The systems are real time, yet can be geographically distributed. There is a multiplicity of time scales and the overall system is a system-of-systems.

The vulnerability of CPSs is increased for several reasons: controllers are computers prone to bugs and attacks; the communication networks are open and potentially of large scale; the increasing use of commodity solutions makes systems susceptible to the flaws of their components; protocols for control are becoming more open and accessible; and the increasing functionality provided by CPSs opens new vulnerabilities [252]. There are challenges and security mechanisms for prevention, detection and recovery, resilience, and deterrence of attacks [253]. Computer attacks can be detected by incorporating knowledge of the physical system under control [254]. Other results for detecting attacks can be found in [255] and [256].
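One way to picture detection based on knowledge of the physical system is a residual check: predict what the sensor should read from a model of the plant, and raise an alarm when the reported value deviates too far. The sketch below is only an illustration under stated assumptions and is not the method of [254]; the scalar model $x_{k+1} = a x_k + b u_k$, the fixed threshold, and all names are hypothetical.

```python
def detect_anomalies(measurements, inputs, a, b, threshold, x0):
    """Flag time steps where a sensor reading disagrees with a physical model.

    Assumed scalar model (hypothetical): x[k+1] = a * x[k] + b * u[k].
    The prediction is propagated from the model alone, so a falsified
    reading cannot pull the prediction along with it.
    Returns the indices k where |y[k] - x_pred[k]| exceeds `threshold`.
    """
    alarms = []
    x_pred = x0
    for k, (y, u) in enumerate(zip(measurements, inputs)):
        if abs(y - x_pred) > threshold:
            alarms.append(k)
        x_pred = a * x_pred + b * u
    return alarms


# The model predicts 0, 1, 1.5, 1.75, 1.875; the reading at k = 3 is falsified.
readings = [0.0, 1.0, 1.5, 3.0, 1.875]
controls = [1.0] * 5
print(detect_anomalies(readings, controls, a=0.5, b=1.0, threshold=0.2, x0=0.0))  # [3]
```

An open-loop prediction like this drifts when the model is imperfect, so a practical detector would more plausibly pair a state estimator with a statistical test on the residuals rather than a fixed threshold.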

Standardization efforts underway include those of the North American Electric Reliability Corporation [257], the National Institute of Standards and Technology [258], and ISA-SP99 [259].

SECTION VII

The importance of the computing system, especially software, in control system applications has been increasing. Since its first introduction in automobiles around 30 years ago, the amount of software has increased. Computing systems including software can take up almost half of the production costs of today's automobiles [260]. The same is true for many other control systems such as airplanes and factory automation systems. This trend is anticipated to continue due to the significant benefits provided by software technologies in control applications, with respect to functionalities, performance, and flexibility. Simultaneously, it is becoming more challenging to develop such control systems since the overall complexity of the system also increases. In fact, the performance, reliability, and production costs of control systems are becoming more dependent on those of computing systems, especially software systems. Hence, an important research issue in software technology is how to manage complexity to make it easy to design and implement software systems for reliable CPSs.

From past experience, it has been observed that one of the most effective approaches in managing complexity, and accordingly increasing productivity in software development, is to raise the level of abstraction. In the early stages of computing, assembler technology allowed us to step up from machine code to assembler code. Later in the 1970s, compiler technology raised the level of abstraction a step further from assembly language to high level programming languages such as C and Fortran, which make it significantly easier to write and understand software programs. We now have object-oriented programming languages such as C++ and Java, which allow us to develop software at even higher levels of abstraction than procedural programming languages such as C and Fortran.

The next level of abstraction beyond today's component and object-based programming can be model-driven development (MDD), as emphasized in [261], [262], [263], [264]. One of the important visions of MDD is that software developers can develop software systems by designing models in the application domain instead of writing computer programs at the implementation level, and can then transform the application domain design models into a real implementation. MDD can thereby significantly improve the productivity of the software development process. Another important benefit is improved productivity in the long term, since MDD helps developers build a software system that is less sensitive to changes in personnel, requirements, and implementation platforms [262].

Broadly, a model is a description of some aspect of a system for some purpose such as communication, analysis, or implementation. In principle, models relevant for software systems can be in any form depending on the purpose. For example, in the traditional software development process, the requirements and functionalities of a software system are typically described in text and picture format, resulting in documents for software developers to use. At the next stage, the system is designed based on the requirements, typically in the form of diagrams, e.g., class diagrams and activity diagrams of the Unified Modeling Language (UML). Finally, the design is implemented and tested by software developers in the form of computer programs. One of the issues in this process is that the models at various stages are only loosely connected and information contained in a model might not be correctly captured during the transition from one form of model to another. As an example, whenever there is some change in requirements, lower level models have to be manually updated to maintain consistency between models, and *vice versa*. Another important concern is that errors introduced at the design stage might not be detected until the implemented code is tested. Thus, it requires much cost and effort to maintain consistency between models in the traditional software development process.

To fully exploit benefits of MDD such as automatic generation of complete programs from application domain models and automatic verification of a system at design time, models, especially those at the application layer, should possess properties that enable seamless usage throughout the development process [261]. Key to MDD is that a model should be 1) an appropriate abstraction of the system, hiding irrelevant details; 2) represented so that it is easily understandable for improving productivity in design and maintenance; and 3) executable, so that it can help to predict the modeled system's properties at an early stage of the development process. Building such models is itself a great challenge in MDD. Major challenges in realizing the vision of MDD are categorized into three different aspects [263]: 1) modeling languages to support creating well-defined models; 2) separation of concerns to support modeling a system from multiple viewpoints; and 3) model manipulation and management, such as transformation between models, maintaining consistency between models, and model-level execution and debugging.

Model-driven architecture (MDA) [265] is a conceptual framework for software development defined by the Object Management Group (OMG), and is supported by standards for modeling and transformation between models such as UML, XML Metadata Interchange, and Meta-Object Facility. In particular, to improve flexibility for better support of evolving software systems, MDA describes a system using three different types of models: 1) a computation-independent model to capture system requirements; 2) a platform-independent model to represent a system with high-level designs that are independent of any form of implementation technology; and 3) a platform-specific model to represent a system in terms of some specific platform implementation technology.
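The step from a platform-independent model to platform-specific artifacts can be pictured with a toy sketch. It is not an MDA tool chain and uses none of the OMG standards named above; the `Entity` and `Attribute` classes, the two type-mapping tables, and the generator functions are all hypothetical constructs chosen to show one model feeding two platform-specific outputs.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    name: str
    kind: str  # platform-independent type name: "integer" or "real"


@dataclass
class Entity:
    """Toy platform-independent model: a named record with typed attributes."""
    name: str
    attributes: list


# One type-mapping table per target platform (both tables are illustrative).
C_TYPES = {"integer": "int32_t", "real": "double"}
JAVA_TYPES = {"integer": "int", "real": "double"}


def to_c_struct(entity):
    """Transform the platform-independent model into C source text."""
    fields = "".join(f"    {C_TYPES[a.kind]} {a.name};\n" for a in entity.attributes)
    return f"typedef struct {{\n{fields}}} {entity.name};\n"


def to_java_class(entity):
    """Transform the same model into Java source text."""
    fields = "".join(f"    public {JAVA_TYPES[a.kind]} {a.name};\n" for a in entity.attributes)
    return f"public class {entity.name} {{\n{fields}}}\n"


reading = Entity("SensorReading", [Attribute("id", "integer"), Attribute("temperature", "real")])
print(to_c_struct(reading))
print(to_java_class(reading))
```

Changing the target platform touches only the mapping table and generator, while the model itself stays unchanged. That separation is the flexibility the three-model split is intended to provide.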

Model-integrated computing (MIC) [264] is another well-known software development framework which supports the development paradigm envisioned by MDD. As in MDA, models are the main artifacts for software development and used in each stage of the development process, such as design, analysis, and test. However, while MDA adopts UML as one of its primary modeling languages, MIC emphasizes the framework for designing modeling languages, called domain-specific modeling language (DSML) [266]. DSML tool suites developed based on such an MIC concept are the Platform-Independent Component Modeling Language for component-based software system development and the Embedded Control Systems Language for distributed embedded automotive system development [267]. Another approach to MDD is Software Factories [268], which provides a software framework that can be used to create software development environments for rapid development of applications. The Architecture Analysis & Design Language (AADL) [269] is a Society of Automotive Engineers (SAE) standard model-based language that can be used for designing and analyzing structure and runtime behavioral properties such as performance, schedulability, and reliability of complex real-time embedded systems.

SECTION VIII

As can be seen, the research spectrum related to CPS is indeed quite broad, ranging from theories in various areas for analysis and design, to technologies for implementation. The impact of CPS research can be significant enough to bring revolutionary changes in how to design and develop engineering systems to meet societal needs in several domains such as energy, environment, and healthcare. In this section, we attempt to anticipate benefits that CPS research can potentially provide in some representative application areas. We also outline some of the challenges that need to be overcome.

Energy generation, transmission, and distribution for a clean and sustainable society are high-priority issues that need immediate research attention in many disciplines for the global public interest. Smart grid [270] is a next-generation infrastructure for electric power systems that can help to produce, distribute, and use electricity in a cleaner, more efficient, and more cost-effective manner through the integration of computing, communication, and control technologies. The production and distribution of electric energy can be made more responsive and reliable through real-time distributed sensing, measurement, and analysis. Furthermore, communication and information technology can contribute to improving efficiency of overall electric energy consumption by encouraging consumers to avoid consumption at peak times through dynamic pricing mechanisms and by providing useful real-time price information to consumers. Thanks to the infrastructure and mechanisms for bidirectional exchange of information and electricity, smart grid also allows traditional electric energy consumers to become providers. Electric energy that is stored or generated at residential and industrial facilities from renewable energy sources such as wind and solar can be sold to other consumers in the neighborhood or to electric power providers.

Computing, communication, and control technologies can play an important role in improving efficiency in home and office building energy consumption. Electric energy used in the buildings sector is approximately 70% of total electricity consumption in the United States [271]. Energy consumption for lighting, heating/cooling, and computing can be made more efficient through distributed sensing and intelligent management of energy consumption by dynamically reacting to circumstances such as human activities and weather conditions.

The development of vehicles, mass transit, and traffic systems to address sustainability, efficiency, congestion, and safety is an important research issue for the benefit of our environment, economy, and safety. Next-generation transportation systems can potentially integrate intelligent vehicles and intelligent infrastructures. Intelligent vehicles can be equipped with seamlessly integrated embedded computing systems and in-vehicle networking systems. Vehicles can exchange information through vehicle-to-vehicle and vehicle-to-infrastructure wireless communication. Through these capabilities, vehicles can assist drivers or even drive autonomously by monitoring and estimating traffic conditions, planning their behavior ahead, and implementing the plan through drive-by-wire functionalities such as stability control, speed control, braking, and steering. Intelligent traffic infrastructures can be operated to manage the throughput of entire traffic systems. Intelligent mass transit systems can be more adaptive to the needs of users.

It is an important challenge to design and develop medical devices and systems with better efficiency, reliability, intelligence, and interoperability. Medical devices need to be highly reliable, and moreover should be operated in a patient-specific manner since patients have different physiological characteristics. Formal models of patient physiological dynamics, and the hardware and software systems of medical devices, and their interactions, can play an important role in designing and verifying safety properties of devices. The integration of wireless networking and distributed sensing and computing infrastructure for interconnectivity and interoperability with medical devices enables the development of medical systems by which patient physiological conditions can be diagnosed and treated in a more integrated and intelligent manner.

The high level of complexity of CPSs in both structural and behavioral aspects poses many challenges for researchers in realizing the benefits envisioned in many application areas.

Fundamental theoretical frameworks that can address the dynamics of CPSs in an integrated manner need to be developed. Further development of theoretical foundations is needed to better understand and predict complex dynamical behaviors caused by tight interactions between cyber and physical domains. Significant further advancement is needed to develop theories which enable us to capture and analyze the dynamics of the communications, computation, control, and applications in a unified theoretical framework.

Much research remains to be done to address complexity and productivity issues in the design and development of CPSs. Languages to model various aspects of a system at different levels of abstraction for various application domains need fuller development. Further advances are also required to support automatic transformation between models in different semantic domains, model-level execution and debugging capabilities, composition of models to build an application, and incorporation of verification and validation capabilities.

Software platforms with well-defined and appropriate levels of abstractions and architecture are essential for the development of reliable, scalable, and evolvable CPSs in various application domains. They should hide unnecessary complexities inherent to CPSs, such as heterogeneity and distribution, and support rapid implementation of application and runtime reconfiguration and resource management to meet functional and nonfunctional requirements of an application.

Control methodologies need to be extended to much broader contexts since next-generation CPSs will be operated in much larger scales and in open environments. Algorithms and theories for high-level decision making based on information collected from different sources at different spatial and temporal scales are necessary for system-wide reliability, efficiency, security, robustness, and autonomy of CPSs.

Much important work remains to be done.

The authors would like to thank M. Caccamo, M. Franceschetti, S. Mitra, and P. Tabuada for their careful reading of the paper and valuable comments.

This work was supported in part by the National Science Foundation (NSF) under Contracts CNS-1035378, CNS-0905397, CNS-1035340, and CCF-0939370, by the United States Army Research Office (USARO) under Contracts W911NF-08-1-0238 and W-911-NF-0710287, and by the U.S. Air Force Office of Scientific Research (AFOSR) under Contract FA9550-09-0121.

The authors are with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128 USA (e-mail: kdkim@tamu.edu; prk@tamu.edu).
