A Roadmap Towards Resilient Internet of Things for Cyber-Physical Systems

The Internet of Things (IoT) is a ubiquitous system connecting many different devices - the things - which can be accessed remotely. Cyber-physical systems (CPS) monitor and control the things remotely. As a result, the concepts of dependability and security become deeply intertwined. The increasing level of dynamicity, heterogeneity, and complexity adds to the system's vulnerability and challenges its ability to react to faults. This paper summarizes the state of the art of existing work on anomaly detection, fault-tolerance and self-healing, and adds a number of other methods applicable to achieve resilience in an IoT. We particularly focus on non-intrusive methods ensuring data integrity in the network. Furthermore, this paper presents the main challenges in building a resilient IoT for CPS, which is crucial in the era of smart CPS with enhanced connectivity (an excellent example of such a system is connected autonomous vehicles). It further summarizes our solutions, work in progress and future work on this topic to enable "Trustworthy IoT for CPS". Finally, this framework is illustrated on a selected use case: a smart sensor infrastructure in the transport domain.


Introduction
Cyber-physical systems (CPS) [1][2][3][4] are emerging smart information and communications technologies (ICT) that are deeply influencing our society in several application domains. Examples include unmanned aerial vehicles (UAV), wireless sensor networks, (semi-)autonomous cars [5], vehicular networks [3] and a new generation of sophisticated life-critical and networked medical devices [6]. CPS consist of collaborative computational entities that interact tightly with physical components through sensors and actuators. They are usually federated as a system-of-systems, communicating with each other and with humans over the Internet of Things (IoT), a network infrastructure enabling the interoperability of these devices.

Motivation for Resilient CPS
The advent of the Internet has revolutionized the communication between humans. Similarly the CPS and IoT are reshaping the way in which we perceive and interact with our physical world. This comes at a price: these systems are becoming so pervasive in our daily life that failures and security vulnerabilities can be the cause of fatal accidents, undermining their trustworthiness in the public eye. Over the last years, popular mainstream newspapers have published several articles about CPS that are recalled from the market due to software and/or hardware bugs. For example in 2015, The New York Times published the news [11] about the finding of a software bug in Boeing 787 that could cause "the plane power control units to shut down power generators if they were powered without interruption for 248 days". The Washington Post has recently published an article [12] about Fiat Chrysler Automobiles NV recalling over 4.8 million U.S. vehicles for a defect that prevents drivers from shutting off cruise control, placing them in a potential hazard. The recent accident of Uber's self-driving vehicle killing a pedestrian shocked the world [13], raising several concerns about the safety and trustworthiness of this technology.
With the connection of a CPS to the Internet, security becomes a crucial factor, too, and is intertwined with safety ("if it is not secure it is not safe" [14]). The tight interaction between the software and the physical components in CPS enables cyber-attacks to have catastrophic physical consequences. The Guardian reported last year [15] that over half a million pacemakers have been recalled by the American Food and Drug Administration due to fears that hackers could exploit cyber security flaws to deplete their batteries or to alter the patient's heartbeat. In 2015 the BBC announced [16] that the black-out of the Ukraine power grid was the consequence of malware installed on computer systems at power generation firms, enabling the hackers to get remote access to these computers. In the same year two hackers proved in front of the media [17] that they could hijack a Jeep over the internet.
The rise of the IoT, which is forecast to grow to 75 billion devices in 2025 (Fig. 1), is exacerbating the problem by providing an incredibly powerful platform to amplify these cyber-attacks. An example is the Mirai botnet, which in 2016 exploited more than 400,000 devices connected through the IoT as a vehicle to launch some of the most potent distributed denial-of-service (DDoS) attacks in history [18].
Managing and monitoring such ultra-large-scale systems is becoming extremely challenging. A desired property of such a system is resilience, i.e., the service delivery (or functionality) that can justifiably be trusted persists when facing changes [19]. In other words, the system shall remain safe and secure in the event of faults and threats (see Fig. 2 for some examples in the automotive domain) that may even be unpredictable at design time or emerge during runtime [14,19].

State-of-the-Art
Resilience has been identified and discussed as a challenge in IoT [14,20,21,22]. However, it has been mostly studied in other areas of computer science (see Table 1). The majority of surveys focus on one building block of a resilient system, e.g., a CPS, or one attribute of resilience. For instance, some publications survey security by intrusion detection [23,24] (e.g., based on machine learning / data mining [25] or computational intelligence [26]). Recent surveys on the IoT (Table 2) review definitions, state IoT and research challenges or discuss technologies to enable interoperability and management of the IoT. However, to the best of our knowledge, resilience, adaptation and long-term dependability and security have not yet been discussed in the context of IoT for CPS.

Novel Contributions
This paper provides an overview of the state of the art of resilience (that is, dependability and security) for the IoT. We focus on resilience mechanisms that can be applied during runtime and may be extended to adapt, such that a system undergoing changes remains resilient. We discuss a roadmap to achieve resilience, and illustrate our recent work on this topic with a case study. In particular:
• We summarize state-of-the-art methods and discuss recent work on detection, diagnosis, recovery and/or mitigation of faults. Due to the expected heterogeneous architecture, we specifically target non-intrusive methods which reason and act in the communication network or at the interfaces of the IoT devices.
• We state the challenges of these techniques when applied in the IoT and depict a roadmap on how to achieve resilience in an IoT for CPSs.
• Besides discussing several new perspectives, we further demonstrate some of our key methods/solutions and ongoing works on providing high resilience for the information collected and employed by the IoT in an automotive case study.

Organization of the Paper
The rest of the paper is organized as follows (see also Fig. 3). The next two sections (Section 2 and Section 3) introduce the terminology around resilience, fault types and examples, building blocks of resilient systems and architectural layers to the readers. Section 4 collects state-of-the-art techniques for fault detection and recovery. Section 5 states research challenges for resilience, and particularly for the long-term dependability and security. Section 6 discusses challenges and our roadmap to resilience in IoT with several new perspectives. Section 7 presents some of our key solutions to this topic on the case study "resilient smart mobility". Section 8 finally concludes the paper with a discussion of the presented solutions and future work.

Resilience
In order to provide a better understanding of resilient IoT, we introduce resilience and its terminology in this section.

Attributes of Resilience
We desire the IoT for CPS to be dependable and secure throughout its entire life-cycle. Avizienis [27] defines the dependability property of a system to be the combination of the following attributes: availability (readiness for correct service), reliability (continuity of correct service), safety (absence of catastrophic consequences), integrity (absence of improper system alterations), and maintainability (ability to undergo modifications and repairs). Security includes availability, integrity and confidentiality (the absence of unauthorized disclosure of information). Robustness can be considered as another attribute of dependability. It has its roots in control theory or CPS, where a system is called robust if it continues to function properly under faults of stochastic nature (e.g., noise). In recent work on the concepts of cyber-physical systems-of-systems (CPSoS) [4], robustness is extended to also consider the security issues in CPS: "Robustness is the dependability with respect to external faults (including malicious external actions)". Figure 4 summarizes the attributes of a resilient system. A fault-tolerant system recovers from faults to ensure the ongoing service [27], i.e., it achieves dependability and robustness.

Table 3: Main classifications of faults by [27] with examples.
Fault Type | Examples
Physical | Broken connector (e.g., due to aging effects), radiation, noise, interference, power transients, power-down or short generated by an attacker, material theft (e.g., copper), denial-of-service by jamming / signal interference
Development | Hardware production defect, hardware design error ("errata"), software bug in program or data (memory leaks, accumulation of round-off errors, wrong set of parameters), unforeseen circumstances (of the system and/or its environment), vulnerabilities, aging effects like memory bloating or leaking
Interaction | Input mistake, message collision, spoofing (obscure identity), modify information with a Trojan horse, no or late message delivery (e.g., by replay attack), denial-of-service by flooding (e.g., bomb of connection requests), hacked sensor producing inaccurate or false data causing incorrect control decisions and actuator actions
Permanent | Design faults, broken connector, noise, stuck-at ground voltage due to a short, logic bomb carried by a virus slowing down or crashing the system, aging effects (e.g., electromigration)
Transient | Radiation, power transients, input mistake, intrusion attempt (via vulnerabilities, e.g., heating the RAM to trigger memory errors)
The term resilience is often used by the security community to describe the resistance to attacks (malicious faults). Laprie [19] defines resilience for a ubiquitous, large-scale, evolving system: resilience is "the persistence of service delivery that can justifiably be trusted, when facing changes". The author builds upon the definition of dependability by giving the following short definition of resilience: "the persistence of dependability when facing changes".
A ubiquitous, heterogeneous, complex system-of-systems will typically change over time, raising the need for the dependability and security established during design time to scale up. We therefore find the definition of resilience from Laprie [19] a good fit to express the needs of an IoT for CPS. A resilient IoT ensures the functionality even when facing unexpected failures. Moreover, it should scale dependability and security when it comes to functional, environmental and technological changes [49]; we refer to this capability as long-term dependability and security.
However, to ensure resilience in a system, two important factors need to be analyzed: i) possible faults (the sources of dependability and security threats, see Sec. 3) and ii) available detection and mitigation methodologies (techniques and actions to apply, see Sec. 4).

Faults, Errors, Failures and Attacks
A failure is an event that occurs when a system deviates from its intended behavior. The failure manifests due to an unintended state (the error) of one or more components of the system. The cause of an error is called the fault [27].
The source of a fault (Table 3) may be internal or external. Internal faults may be of physical nature (e.g., broken component connector) or introduced by the design (software/hardware bug). External faults originate from the environment (e.g., noise, radiation) or inputs (e.g., wrong or malicious usage of the system). Faults can be mainly classified into transient and permanent faults. Although a transient fault manifests only for a short period of time, it can cause an error and might lead to a permanent failure. Physical faults (internal/environmental) and inputs may be transient or permanent. Design faults are always permanent. Faults that cannot be systematically reproduced are often called intermittent faults (e.g., effects of temperature on hardware, a transient fault like a short in the circuit activated by a specific input). Such faults lead to so-called soft errors. A possible attack scenario (that is a malicious external fault) is often referred to as a (security) threat.
Consider the CPS/IoT infrastructure shown in Figure 5. Faults (e.g., radiation or a malicious signal for an actuator) may occur at different layers of the architecture (e.g., physical or control layer, respectively) [50]. The physical layer is vulnerable to disruption, direct intervention or destruction of physical objects (e.g., sensors, actuators and mechanical components). The network layer (here: the IoT) connects the devices. The monitors

Dependability Faults in IoT
The IoT is susceptible to communication failures, particularly due to its size and heterogeneity. Traditional CPS would avoid or mitigate such failures by verification and sufficient testing of the design and final implementation of the network component. However, the IoT will evolve in technology and grow in size over time. For instance, the following faults may occur per CPS layer:
• Physical Layer:
  - Interference: Disruption of a signal. The number of connected devices, and subsequently the radiation, increases, which may influence sensor measurements, transmitted messages or control signals [51].
• Network Layer:
  - Message Collision: Similarly to interference, the number of communicating devices might trigger communication failures, e.g., collisions or an overload of the network.
  - Protocol Violation: Wrong message content due to a different protocol version or a protocol mismatch.
• Control Layer:
  - Deadline Miss: Late control signal reception. Control loops still have to follow the timing constraints of a CPS application.
  - Misusage: Sending/setting wrong inputs to a component, e.g., due to wrong or incomplete syntactical and/or semantic information about the device.
• Information Layer:
  - Unavailability: Missing information caused by a technology update. Things might be connected, disconnected or updated in the IoT.

Security Threats in CPS
Security has been a topic since the beginning of computer networks, identifying vulnerabilities (that is, internal faults or weak points in a system enabling an attacker to alter the system [27]) and avoiding or mitigating malicious attacks in devices. However, in CPS additional vulnerabilities arise given the connection to the physical domain and the uncertain behavior of the physical environment [38,52]. For instance, the following attacks may be applied per CPS layer:
• Physical Layer: [53][54][55]
  - Information Leakage: Steal critical information from devices, e.g., secret keys or side channel parameters [56][57][58][59][60].
  - Denial of Service (DoS): Manipulate several parameters to perform a denial of service attack, e.g., hack the power distribution network to drain the energy [61,62], destroy the sensors or actuators (in case of physical access), add extra power/communication load.
• Network Layer: W.r.t. security, this is the most vulnerable layer in a CPS because of the vast possibilities of attacks on communication networks which emerged over the years [63][64][65][66].
  - Collision: Manipulate the timing, power and/or frequency of a network to trigger metastable states which eventually lead to data collision or violation of communication protocols [71][72][73][74].
  - Routing ill-direct: Manipulate the routing mechanism leading to data collision, data flooding and selective forwarding of data [75,76].
• Information Layer:
  - Eavesdropping: Steal or sniff information. This is one of the major threats related to privacy.
Moreover, information can also be manipulated to perform several attacks, e.g., jamming, collision or DoS.
The potential threats and consequences can be expressed in security threat models for CPS [38]. To define a certain threat model, the following factors have to be identified:
1. Source/Attacker: All the possible factors/actors which intentionally disturb or interrupt the behavior or functionality of the CPS [38].
2. Attack Methodology: The methodology or framework used to perform the attacks. It depends upon the attacker's capabilities (available computational power, access to CPS resources and layers, etc.), the motive (reason for the attack) [38] and the type of attack vectors.

Long-term Dependability and Security Threats
The IoT and CPS will undergo changes over time, especially when subjected to long operational durations (over decades, as in autonomous vehicles). The following aspects of change [49] might trigger faults (see examples per CPS layer in Fig. 5).
1. Environmental: Uncertainty of the physical world. Decay and aging of material and components.
2. Functional: Different and/or new applications and requirements. Dynamic system, i.e., connecting/disconnecting devices.

Fault Behavior
A failure manifests in a wrong content or timing (early, late or no message at all) of the intended service. Components may contain an error detection mechanism and additionally suppress wrong outputs; such components are called fail-silent. Some components automatically stop their execution on failures (halt or crash); these are so-called fail-stop components. However, an erroneous component may also provide wrong outputs, i.e., the service is erratic (e.g., babbling), which can cause other services to fail. In the worst case the behavior/output of the failed component is inconsistent to different observers (Byzantine failure) [27,28].

Techniques for Resilient IoT for CPS
There are various online and offline approaches to achieve resilience in a system. Developers may try to prevent faults (e.g., by an appropriate design, encryption or consensus), tolerate faults (e.g., by switching to a redundant component or another pre-defined configuration), remove/mitigate faults (e.g., isolate faulty components to avoid the propagation of faults) or forecast faults (e.g., to estimate the severity or consequences of a fault) [27]. We want to focus on the possibilities to fulfill the following requirements regarding resilience:
• R1: Detection and identification of faulty, attacked or failed components during runtime in the IoT. Faulty or already failed components shall be detected to be able to maintain or recover to a healthy system state providing correct system services.
• R2: Autonomously maintain resilience in the IoT. Ensure the functionality of a dynamic and heterogeneous system in the presence of faults, i.e., recover from failures in an automatic fashion.
The following two sections give an overview about methods split into detection and diagnosis, and recovery or mitigation of failures. They summarize background and terminology, highly-cited surveys (≥100 citations according to Google Scholar), recent surveys (≥2015), recent approaches not part of surveys / additional work, and examples (see distribution in Table 6) given the keywords in Table 5. Note that we tried to cite original publications and no derivations of basic fault-tolerant techniques.

Table 5: Keywords.
Detection and Diagnosis | anomaly detection, fault detection/diagnosis, security in CPS, intrusion detection, runtime monitoring, runtime verification, self-awareness
Recovery or Mitigation | self-healing, self-adaptation, software adaptation, runtime reconfiguration, fault-tolerance, fault recovery, threat mitigation
dependability, resilience

Detection and Diagnosis
Anomaly detection is the process to identify an abnormal behavior or pattern. The abnormal behavior or service failure (e.g., wrong state, wrong message content) is caused by a fault [27], e.g., a random failure, a design error or an intruder. Though this definition probably complies with all fault detection mechanisms listed in this section, the various communities use different keywords depending on the application or type of the mechanism. The related term monitoring is used in the field of runtime verification to refer to the act of observing and evaluating temporal behaviors [29]. In the security domain the phrase intrusion detection is used for reasoning about threats. Halting failures (fail-stop or fail-silent behavior) can be detected by simple methods like watchdogs or timeouts. Faults that manifest in erratic or inconsistent values or timing need a behavior specification, model or replica to compare against (we therefore focus on these methods). Such detection methods can be roughly separated w.r.t. the knowledge used to compare to the actual behavior (Fig. 6).
Another field of reasoning about failures is the root cause analysis or fault localization which identifies the reason why a fault occurs (e.g., a vulnerability of the system or the first failed component which caused other components to fail due to fault propagation).
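The timeout-based detection of halting failures mentioned above can be sketched in a few lines. This is a minimal illustration, not a method from the cited work: the class name `Watchdog`, the `timeout` parameter and the injectable `clock` are assumptions made for this example.

```python
import time


class Watchdog:
    """Flags a component as halted if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock              # injectable clock (assumption, eases testing)
        self.last_beat = clock()

    def heartbeat(self):
        """Call whenever a message from the monitored component arrives."""
        self.last_beat = self.clock()

    def expired(self):
        """True if the component has been silent for longer than the timeout."""
        return self.clock() - self.last_beat > self.timeout
```

Such a monitor only detects fail-stop or fail-silent behavior; erratic or inconsistent outputs pass unnoticed as long as messages keep arriving.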

Redundancy
Additional information sources can detect many types of faults [98]. A simple method to verify a message's content or intermediate result is plausibility checking or majority voting [28], e.g., by comparing a received message's content against redundant information sources (see also "agreement" in Sec. 4.2). Nevertheless, redundancy is typically the last resort to increase the resilience or to ensure a specific level of dependability because it is costly when it is added explicitly (e.g., triple modular redundancy, often deployed in avionics [28]).
In hardware, fault detection by redundancy is also known as lockstep execution where typically two computational units run the same operations in parallel to detect faults [99,100]. When three replicas are used, the fault can be masked by majority voting (under the assumption that only one component can fail at the same time), see also Triple Modular Redundancy (TMR) in Section 4.2.
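Majority voting over replicated outputs can be sketched as follows. The function name `majority_vote` and the choice to raise an error when no strict majority exists are assumptions for this illustration; real voters often compare within a tolerance rather than for exact equality.

```python
from collections import Counter


def majority_vote(replica_outputs):
    """Return (value, suspects): the majority value and the indices of
    replicas that disagree with it.

    Raises ValueError if no strict majority exists, i.e., the fault can be
    detected but not masked.
    """
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count * 2 <= len(replica_outputs):
        raise ValueError("no strict majority among replicas")
    suspects = [i for i, out in enumerate(replica_outputs) if out != value]
    return value, suspects
```

With three replicas a single deviating output is masked and its index is reported. With two replicas (lockstep) a mismatch raises the error, which reflects that a pair can detect a fault but cannot decide which unit is correct.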

However, some techniques exploit implicit or functional redundancy that is already available in the system. For instance, [92] combines anomaly detection with sensor fusion. Their approach uses a particle filter fusing data of different sensors and simultaneously calculating a value of trust of the information sources derived from the normalization factor, i.e., the sum of weights of the particles. When the weights of the particles are high, the information sources match the prediction and are rated trustworthy. The authors in [101] propose to use hard-wired local data of an automotive ECU to check the plausibility of a received control input. Our method presented in Section 7 also relies upon implicit (and explicit) redundancy.
[Figure: Specification in Signal Temporal Logic. Example requirement: between 2 s and 4.5 s the signal is between -2 and 2.]
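The idea of deriving a trust value from the normalization factor of a particle filter can be illustrated with a single measurement update. This is a simplified sketch and not the algorithm of [92]: the scalar state, the Gaussian likelihood, the parameter `sigma` and the division by the particle count (to map the trust value into [0, 1]) are all assumptions made here.

```python
import math


def gaussian_likelihood(measurement, predicted, sigma):
    """Unnormalized Gaussian likelihood; 1.0 for a perfect match."""
    return math.exp(-0.5 * ((measurement - predicted) / sigma) ** 2)


def weigh_particles(particles, measurement, sigma=1.0):
    """Return (normalized_weights, trust) for one measurement update.

    trust is the normalization factor (sum of the raw weights) divided by
    the number of particles, so it lies in [0, 1].
    """
    raw = [gaussian_likelihood(measurement, p, sigma) for p in particles]
    norm = sum(raw)
    trust = norm / len(particles)
    if norm > 0:
        weights = [w / norm for w in raw]
    else:  # measurement contradicts every particle: fall back to uniform weights
        weights = [1.0 / len(particles)] * len(particles)
    return weights, trust
```

A measurement close to the predicted particles yields a trust value near 1, whereas a measurement far from all particles drives it towards 0, which could then be used to down-weight or exclude that information source.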

Specification
Verification of Safety Properties
The IoT generally consists of spatially distributed and networked CPS. At design time, the CPS behavior can be modeled using hybrid systems, a mathematical framework that combines discrete transition systems capturing the computational behavior of the software component with continuous (often stochastic and nonlinear) ordinary differential equations (ODEs) describing the behavior of the physical substratum with which the software component is deeply intertwined.
Although there has been a great effort in the literature to provide efficient computational techniques and tools [102][103][104][105][106][107][108][109] to analyze safety properties in hybrid systems, exhaustive verification (i.e., model checking) is in general undecidable [91]. The approaches currently available to check safety properties are based on generating conservative over-approximations of the state variables' dynamics, called flow pipes [110], and on checking whether those intersect the unsafe regions of interest. However, these methods are generally limited to small-scale CPS models. This limitation becomes more evident when we want to study more complex emergent behaviors, which result from the interactions among system components and can be observed only by taking large-scale CPS into consideration.
Hybrid systems are approximate models of the real CPS behavior, so their analysis may not always be faithful due to inevitable approximation errors (especially of the physical behavior) in the modeling phase. Furthermore, CPS models are not always available due to intellectual property issues; in such cases CPS need to be studied as black-box systems whose internal behavior cannot be observed.
Runtime Verification
A complementary approach to exhaustive verification is to equip CPS with monitors that verify the correctness of their execution. Monitoring consists of observing the evolution of the discrete and continuous variables characterizing the CPS behavior and deciding whether the observed trace of values is good or bad. As Fig. 7 illustrates, these traces can be obtained by simulating the CPS design or can be observed during the CPS execution through the instrumentation of the system under test (SUT) (more details concerning instrumentation techniques can be found in [111]).
Runtime verification (RV) [29] is a specification-based monitoring technique that decides whether an observed trace of a SUT conforms to rigorous requirements written in a formal specification language. The main idea of RV consists in providing efficient techniques and tools that enable the automatic generation of a software- or hardware-based monitor [112,113] from a requirement. RV can provide useful information about the behavior of the monitored system, at the price of a limited execution coverage.
RV is nowadays a well-established technique, widely employed in both academia and industry, before system deployment for testing and verification, and after deployment to ensure reliability, safety, robustness and security.
A typical example of formal specification language is the Linear Temporal Logic (LTL) introduced by Pnueli in [114]. LTL provides a very concise and elegant logic-based language to specify sequences of Boolean propositions and their relations at different points in time. LTL considers only the temporal order of the events and not the actual point in time at which they really occur. For example, it is not possible to specify that a property should hold after one unit of time and before three and a half units of time.
Real-time temporal logics [115] overcome these limits by embedding a continuous time interval in the until temporal operator. Signal Temporal Logic [116,117] is a popular example of a real-time temporal logic suitable to reason about the real-time requirements for CPS which has been proposed for detection of threats [118].
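As an illustration of a real-time requirement, consider "between 2 s and 4.5 s the signal stays between -2 and 2", which can be written in STL as $\square_{[2,4.5]}(|x| \le 2)$. A minimal offline monitor for this one formula shape over a sampled trace might look as follows. The function name and the trace format (a list of `(time, value)` samples) are assumptions, and only the samples are inspected, i.e., the signal is not interpolated between them.

```python
def always_bounded(trace, t_start, t_end, bound):
    """Check G[t_start, t_end](|x| <= bound) on a sampled trace.

    trace: list of (time, value) samples.
    """
    window = [v for t, v in trace if t_start <= t <= t_end]
    if not window:
        raise ValueError("no samples inside the time window")
    return all(abs(v) <= bound for v in window)
```

Samples outside the interval do not influence the verdict, which is exactly what an untimed LTL formula cannot express.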
Although reasoning about a single trace can provide an insight about safety properties, this is generally not sufficient to capture important information-flow security properties [119] such as noninterference, noninference and information leakage. These properties are called hyperproperties, because two or more execution traces of the system need to be considered at the same time in order to verify them. In order to specify hyperproperties, LTL and STL have been extended to HyperLTL [120] and HyperSTL [121], respectively, by adding both universal and existential quantifiers over a set of traces to the syntax. Runtime verification of such specification languages is still an open challenge (some preliminary results appeared in [122]), since the majority of the available monitoring algorithms are developed to handle only a single trace.
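Why a hyperproperty needs more than one trace can be seen in a strongly simplified noninterference check. The sketch below is not HyperLTL or HyperSTL monitoring: the trace representation (dictionaries with the hypothetical keys `low_in`, `high_in` and `low_out`) and the pairwise comparison over a finite set of recorded traces are assumptions for illustration.

```python
from itertools import combinations


def violates_noninterference(traces):
    """Return a pair of traces witnessing a violation, or None.

    Simplified noninterference: any two executions that agree on the public
    (low) inputs must also agree on the public (low) outputs, regardless of
    the secret (high) inputs.
    """
    for a, b in combinations(traces, 2):
        if a["low_in"] == b["low_in"] and a["low_out"] != b["low_out"]:
            return a, b
    return None
```

Each trace on its own looks unobjectionable; the leak only becomes visible when two traces with equal public inputs but different public outputs are compared.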
Falsification-based Analysis and Parameter Synthesis
As illustrated in Fig. 7, the Boolean semantics of STL decides whether a signal is correct or not with respect to a given specification. However, this answer is not always informative enough to reason about the CPS behavior, since the continuous dynamics of these systems are expected to be tolerant with respect to the value of certain parameters, the initial conditions and the external inputs.
Several researchers have proposed to address this issue by defining a quantitative semantics for STL [123,124]. This semantics replaces the binary satisfaction relation with a quantitative robustness degree function that returns a real value (see Fig. 7) indicating how far a signal is from satisfying or violating a specification. A positive or negative sign of the robustness value indicates whether the formula is satisfied or violated, respectively.
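For the bounded-signal requirement $\square_{[a,b]}(|x| \le c)$, the robustness degree can be computed directly: the atomic predicate has robustness $c - |x(t)|$ and the "always" operator takes the minimum over the interval. The following sketch assumes a sampled trace given as a list of `(time, value)` pairs and evaluates only the samples; the function name is hypothetical.

```python
def robustness_always_bounded(trace, t_start, t_end, bound):
    """Robustness of G[t_start, t_end](|x| <= bound) on a sampled trace.

    The atomic predicate |x| <= bound has robustness bound - |x|; the
    'always' operator takes the minimum over the time window.
    """
    window = [bound - abs(v) for t, v in trace if t_start <= t <= t_end]
    if not window:
        raise ValueError("no samples inside the time window")
    return min(window)
```

For the trace `[(2.0, 1.5), (3.0, -1.0), (4.0, 0.5)]` with bound 2 the result is 0.5: the requirement holds, and the signal could be perturbed by up to 0.5 before violating it. Replacing the middle sample by -2.5 yields -0.5, i.e., a violation by a margin of 0.5.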
The notion of STL robustness was exploited in several tools [125,126] for falsification analysis [127] and parameter synthesis [128,129] of CPS models. On the one hand, minimizing the robustness [125] is suitable to search for counterexamples in the input space that violate (falsify) the specification. On the other hand, maximizing the robustness [126] can be used to tune the parameters of the system to improve its resilience. To this end, a global optimization engine is employed to systematically guide the search.
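Falsification by robustness minimization can be sketched on a toy example. Everything specific here is an assumption: the model `simulate` (a decaying oscillation with an amplitude input) is hypothetical, the requirement is the bounded-signal property $\square_{[2,4.5]}(|x| \le 2)$, and an exhaustive search over a list of candidate inputs stands in for the global optimization engine used by actual tools.

```python
import math


def simulate(amplitude, dt=0.1, horizon=5.0):
    """Hypothetical toy model: a decaying oscillation sampled every dt seconds."""
    steps = int(round(horizon / dt))
    return [(k * dt, amplitude * math.exp(-0.3 * k * dt) * math.cos(2.0 * k * dt))
            for k in range(steps + 1)]


def robustness(trace, t_start=2.0, t_end=4.5, bound=2.0):
    """Robustness of G[t_start, t_end](|x| <= bound) on the sampled trace."""
    return min(bound - abs(v) for t, v in trace if t_start <= t <= t_end)


def falsify(candidates):
    """Return (robustness, input) for the candidate with the lowest robustness."""
    return min((robustness(simulate(a)), a) for a in candidates)


if __name__ == "__main__":
    rob, amp = falsify([0.5 * i for i in range(21)])
    # A negative robustness means the input is a counterexample.
    print(f"lowest robustness {rob:.3f} at amplitude {amp}")
```

If the lowest robustness found is negative, the corresponding input falsifies the requirement. Maximizing instead of minimizing the same function over design parameters corresponds to the parameter synthesis direction.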
Signature-based Intrusion Detection
Signature-based intrusion detection compares the behavior observed during runtime against pre-defined behavior (known as golden behavior or signature) to identify abnormal events [23]. Although these techniques effectively identify intrusions with a small number of false positives, they require a precisely calibrated signature [93]. Therefore, such techniques are not feasible if designers and IP providers are not trusted. Such misuse-based intrusion detection typically cannot handle zero-day attacks, i.e., new, previously unknown attacks. It is therefore often combined with anomaly detection (e.g., in [130]).
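A misuse-based detector can be reduced to matching observed data against a database of known patterns. The sketch below is illustrative only: the signature names and byte patterns in `SIGNATURES` are made up for this example and do not stem from a real rule set.

```python
SIGNATURES = {  # hypothetical signature database: name -> byte pattern
    "default-credentials": b"root:root",
    "shell-injection": b"; rm -rf",
}


def match_signatures(payload, signatures=SIGNATURES):
    """Return the sorted names of all known signatures found in the payload."""
    return sorted(name for name, pattern in signatures.items()
                  if pattern in payload)
```

A payload that matches no entry returns an empty list, which also shows the limitation with respect to zero-day attacks: an attack without a stored signature is indistinguishable from benign traffic.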

Anomaly-based Detection
Statistical Techniques
In statistical anomaly detection the data is fit into a statistical model. If a test instance occurs in the low probability region of the model, i.e., it is unlikely to be generated by the model, then it is claimed to be an anomaly. Statistical models can be specified with parameters when the underlying distribution is known (e.g., is Gaussian). The parameters are trained by machine learning (ML) algorithms [31] or estimation [39] describing the correct behavior of the system. The inverse of the test instance's probability to be generated can directly be used as anomaly score. Statistical tests can also be used to label or score a test instance (e.g., box plot rule).
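A parametric detector with a Gaussian model can be sketched as follows. Instead of the inverse probability, this example uses the distance from the mean in standard deviations (z-score) as the anomaly score, which orders instances in the same way for a Gaussian; the function names and the threshold of three standard deviations are assumptions.

```python
import statistics


def fit_gaussian(training):
    """Estimate mean and standard deviation from fault-free training data."""
    return statistics.mean(training), statistics.stdev(training)


def anomaly_score(x, mean, std):
    """Distance of x from the mean in standard deviations (z-score)."""
    return abs(x - mean) / std


def is_anomaly(x, mean, std, threshold=3.0):
    """Flag instances in the low-probability tails of the fitted Gaussian."""
    return anomaly_score(x, mean, std) > threshold
```

The threshold trades false positives against missed anomalies and would have to be calibrated for the application at hand.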
The model can be expressed by the data itself, e.g., in a histogram, by kernel functions or particles, which is typically used when the distribution of the data is unknown. The test instances or samples may be evaluated by statistical hypothesis tests. For instance, the Wilcoxon signed-rank test [131] compares two related samples to determine if they have the same underlying distribution (which is unknown and does not have to be the normal distribution).
Principal component analysis (PCA) is used to project the data to lower dimensions, i.e., it reduces the dimensionality of the data to a set of uncorrelated variables. A test instance can be marked anomalous when the projection on the components results in a high variance, meaning that the test instance does not fit the typical correlation of the data.
However, simple tests, Gaussian models and histograms are nowadays mostly replaced by (deep) neural networks which stand out handling multivariate and non-linear data.
Machine Learning or Data Mining
Typical anomaly detection techniques based on machine learning can be used with data where no domain knowledge is available (e.g., black-box components like IP cores). The models may be updated during operation. When the desired behavior is known, it can be expressed as a formal model (specification-based monitoring).
Classification-based anomaly detection learns a model (SVM, neural network, Bayesian networks, rules or decision trees) given labeled training data (e.g., states and observations of the system) to cluster the test data into normal classes and anomalies or outliers [31]. Instead of labeling a test instance to a class, one may use scores representing the likelihood of a test instance being an anomaly. For instance, the authors in [132] use recurrent neural networks to detect anomalies in real-time data. The network models short and long term patterns of time series and serves as a prediction model of the data. The error between predicted and actual value serves as an anomaly score.
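The prediction-error score can be sketched as follows. Note that the recurrent neural network of [132] is replaced here by simple exponential smoothing, purely as a stand-in predictor; the function name, the smoothing factor `alpha` and the signal are assumptions made for illustration.

```python
def prediction_error_scores(series, alpha=0.5):
    """Score each sample (from the second one on) by |actual - predicted|.

    The predictor is simple exponential smoothing, a stand-in for a learned
    prediction model such as an RNN.
    """
    scores = []
    prediction = series[0]
    for value in series[1:]:
        scores.append(abs(value - prediction))
        # Update the one-step-ahead prediction with the new observation.
        prediction = alpha * value + (1.0 - alpha) * prediction
    return scores


# Hypothetical sensor signal with a single spike at index 4.
signal = [10.0, 10.1, 9.9, 10.0, 15.0, 10.1, 10.0]
scores = prediction_error_scores(signal)
```

The largest score belongs to the spike. The samples directly after it also score higher because the simple predictor needs a few steps to recover, which a learned model may handle better.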
Nearest-neighbor-based detection techniques measure the distance from a data instance under test to its k nearest neighbors to identify anomalies. Different metrics (e.g., Euclidean distance) are applied to specify an anomaly score, i.e., the likelihood of a data instance being an anomaly. Another approach is to measure the density, i.e., the number of instances within a given radius around the data instance under test. The complexity of nearest-neighbor techniques grows quadratically with the number of data instances. These techniques are unsupervised.
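Both variants can be written down in a few lines. The sketch below assumes the mean Euclidean distance to the k nearest neighbors as score (other aggregations, such as the distance to the k-th neighbor, are equally common); function names and data are illustrative.

```python
import math


def knn_score(instance, data, k=3):
    """Anomaly score: mean Euclidean distance to the k nearest neighbors."""
    distances = sorted(math.dist(instance, other) for other in data)
    return sum(distances[:k]) / k


def density(instance, data, radius):
    """Number of data instances within `radius` of the instance under test."""
    return sum(1 for other in data if math.dist(instance, other) <= radius)


# Hypothetical two-dimensional observations of normal behavior.
normal = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
```

Scoring every instance against all others with this brute-force search requires a number of distance computations that is quadratic in the size of the data set.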
Data instances are first distributed into clusters (by clustering algorithms, e.g., expectation maximization, k-means, self-organizing maps, many of which use distance or density measures). An anomaly is a data instance that does not fit into any cluster.
Information-theoretic By investigating the information content described by, e.g., the entropy of the information, one may draw conclusions about anomalies in the data (for information-theoretic measures characterizing regularity in data see [133]). When the entropy exceeds a threshold, the test instance is marked as an anomaly. The threshold is defined by the set of anomalies. In highly irregular data the gap between threshold and maximum entropy may be small (the set of true anomalies is small).
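As a minimal sketch of this idea, the Shannon entropy of the empirical symbol distribution in a window can be compared against a threshold. The symbol alphabet, the windows and any threshold value (e.g., 1.5 bits) are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_anomalous(window, threshold):
    """Mark a window as anomalous when its entropy exceeds the threshold."""
    return shannon_entropy(window) > threshold


# Hypothetical message-type sequences observed on a network link.
regular = list("ABABABABABABABAB")
irregular = list("ABQZABXKABWPABCM")
```

Here the regular window has an entropy of 1 bit, the irregular one of 3 bits, so a threshold between the two values separates them.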

Fault-Localization
When the fault detection only tells us that a failure happened in a subsystem, we need means to identify the exact part causing the failure. This is often performed by root cause analysis [134] or fault-localization [135][136][137][138][139][140][141]. In the software engineering community there is a considerable amount of literature about (semi-)automatic techniques assisting the developer to localize and to explain program bugs (for a comprehensive survey we refer to [141]). A well-established statistical approach is spectrum-based fault-localization (SFL) [139], a technique that provides a ranking of the program components that are most likely responsible for the observed fault.
This approach has recently also been employed to localize faults in Simulink/Stateflow CPS models [135][136][137][138][140], displaying an accuracy similar to that of the same method applied to software systems [137]. Although classical SFL is agnostic to the nature of the oracle and only requires knowing whether the system passes a specific test case, the authors in [135] have introduced a novel approach where the oracle is a specification-based monitor. This makes it possible to leverage the trace diagnostic method proposed in [142] and to obtain more information about the failed tests (for example, the segment of time where the fault occurred), improving the fault-localization.
Often this approach is only applied offline for debugging. However, it can also be used to isolate a failed HW/SW component from the system to avoid fault propagation or to trigger its recovery.
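To make the ranking step of SFL concrete, the following sketch computes a suspiciousness score per component from program spectra (which components were exercised in which run) and pass/fail verdicts. The Ochiai coefficient is used here as one commonly applied similarity measure; this choice, the function name and the component names are assumptions and not taken from the cited works.

```python
import math


def ochiai_ranking(spectra, verdicts):
    """Rank components by Ochiai suspiciousness (most suspicious first).

    spectra:  list of sets, the components exercised by each test run
    verdicts: list of bools, True if the corresponding run failed
    """
    total_failed = sum(verdicts)
    components = set().union(*spectra)
    scores = {}
    for comp in components:
        failed_cov = sum(1 for s, f in zip(spectra, verdicts) if f and comp in s)
        passed_cov = sum(1 for s, f in zip(spectra, verdicts) if not f and comp in s)
        denom = math.sqrt(total_failed * (failed_cov + passed_cov))
        scores[comp] = failed_cov / denom if denom > 0 else 0.0
    # Sort by descending score; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical spectra of four runs over three components.
spectra = [
    {"sensor", "filter"},
    {"sensor", "controller"},
    {"sensor", "filter", "controller"},
    {"sensor", "controller"},
]
verdicts = [True, False, True, False]
ranking = ochiai_ranking(spectra, verdicts)
```

In this example the filter is exercised only in failing runs and therefore ranks first, followed by the sensor and the controller.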

Recovery or Mitigation
Broadly speaking, a system can be adapted by changing its parameters or its structure (architecture) [32,36]. The following four action types of possible re-configurations are defined by [37] (splitting structural adaptation into further classes): re-parameterization to change the parameters of a component, re-instantiation to create and remove components, rewiring to redirect connections between components, and relocation to migrate functionality to another platform. The latter three action types require redundancy to some extent. We extend and refine these types in the following (Fig. 8). Unless otherwise stated, the adaptation can be applied on different architectural levels of the system. For instance, changing the clock speed or other hardware parameters is a re-parameterization on the physical level of a device. Changing the receiver of a software component's output is rewiring on the process/task level.

Re-Parameterization
In general, a re-parameterization (or reconfiguration) switches to another configuration of one or more components that is typically no longer the optimal setting, i.e., the quality of service is decreased (graceful degradation). Adaptation of parameters requires knowledge about the underlying algorithm of the erroneous component and is therefore typically performed by the component itself or within a subsystem. The configuration can be selected by optimization [143], or by a reasoner based on a set of rules, an ontology or a logic program [37]. Approaches from control theory use state observers or estimators to derive parameters to mitigate stochastic faults [39]. For instance, an adaptive Kalman filter (AKF) [144] changes its filter parameters during runtime based on the inputs: the measurement covariance can be increased when an input signal degrades or even fails permanently (cf. a traditional KF or state estimator, which mitigates noise and transient failures only).
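The effect of adapting the measurement covariance can be sketched with a scalar filter. The example assumes a random-walk state model and a simple innovation-based adaptation rule (inflate the measurement variance when the normalized squared innovation exceeds a gate); this rule, the class name and all numeric values are illustrative assumptions and not the specific AKF of [144].

```python
class AdaptiveKalmanFilter1D:
    """Scalar Kalman filter (random-walk state) with adaptive measurement variance."""

    def __init__(self, x0, p0, q, r, gate=9.0):
        self.x, self.p = x0, p0   # state estimate and its variance
        self.q, self.r0 = q, r    # process noise and nominal measurement noise
        self.gate = gate          # bound on the normalized squared innovation (assumed)
        self.r = r                # measurement variance currently in use

    def step(self, z):
        p_pred = self.p + self.q  # prediction (state model: x_k = x_{k-1})
        innovation = z - self.x
        s = p_pred + self.r0      # nominal innovation variance
        # Assumed adaptation rule: an implausibly large innovation inflates R
        # so that the innovation matches its expected variance.
        if innovation ** 2 / s > self.gate:
            self.r = max(self.r0, innovation ** 2 - p_pred)
        else:
            self.r = self.r0
        gain = p_pred / (p_pred + self.r)  # update
        self.x += gain * innovation
        self.p = (1.0 - gain) * p_pred
        return self.x


# Hypothetical measurements with one gross error (50.0).
measurements = [10.1, 9.9, 10.0, 50.0, 10.1]
adaptive = AdaptiveKalmanFilter1D(x0=10.0, p0=1.0, q=0.01, r=0.1)
plain = AdaptiveKalmanFilter1D(x0=10.0, p0=1.0, q=0.01, r=0.1, gate=float("inf"))
adaptive_est = [adaptive.step(z) for z in measurements]
plain_est = [plain.step(z) for z in measurements]
```

With the gate disabled the filter behaves like a traditional KF and is pulled far away by the erroneous measurement, whereas the adaptive variant nearly ignores it.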

Runtime Enforcement
Runtime enforcement [95,97] merges runtime verification with adaptation. This powerful technique ensures that a program conforms to its specification. A so-called enforcer acts on the interface of a component changing inputs or outputs to comply with a set of formal properties. The enforcer uses an automaton and/or rules to correct the IO in case of faults. This approach has been pioneered by the work of Schneider [145] on security automata which halt the program whenever it deviates from a safety requirement. Since then, there has been a great effort in the RV community to define new enforcement mechanisms with primitives [146][147][148][149][150] or that support more expressive specifications [151][152][153].
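As a minimal sketch of an enforcer acting on the output interface of a component, consider two safety properties on a command signal: it must stay within a range and must not change by more than a maximum step per cycle. The properties, the class name and the values are assumptions for illustration; the enforcement mechanisms in the cited literature are defined over automata and are considerably more general.

```python
class Enforcer:
    """Edits a component's output so that two safety properties always hold:
    (1) lower <= u_k <= upper and (2) |u_k - u_{k-1}| <= max_step.
    The initial value is assumed to lie within [lower, upper]."""

    def __init__(self, lower, upper, max_step, initial=0.0):
        self.lower, self.upper, self.max_step = lower, upper, max_step
        self.last = initial

    def enforce(self, u):
        # Limit the rate of change first, then clamp to the allowed range.
        u = min(max(u, self.last - self.max_step), self.last + self.max_step)
        u = min(max(u, self.lower), self.upper)
        self.last = u
        return u


# Hypothetical speed commands produced by a (possibly faulty) controller.
enforcer = Enforcer(lower=0.0, upper=30.0, max_step=5.0)
raw = [3.0, 20.0, 22.0, -50.0]
corrected = [enforcer.enforce(u) for u in raw]
```

Outputs that already satisfy the properties pass unchanged; violating outputs are replaced by the closest admissible value, so the component behind the enforcer only ever sees compliant commands.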

Redundancy
Redundant components ensure availability (passive) and increase reliability (active). Failed components can be re-instantiated, replaced by spares, mitigated by voting or fusion, rewired or relocated [37,41].
Re-Instantiation or Restart A straightforward fault-tolerance method is to restart a failed software component. The tasks or the system typically save checkpoints or output messages of components on a periodic basis to roll back to a healthy state [154]. The restart might be combined with a re-parameterization. Checkpointing/restart techniques are well studied for operating systems [155] and may be applied on fog nodes or cloud servers. The primary/backup approach activates a typically aperiodic backup task if the primary task fails [156]. Adaptations in hardware and software also mitigate reliability threats while considering the optimization cost constraints [157][158][159]. Similarly, runtime reconfiguration policies have also been proposed to mitigate the reliability threats in microprocessors [160].
Replacement or Cold/Hot Spares The simplex architecture [161] considers two redundant subsystems. A highly dependable subsystem takes over when the high-performance subsystem fails. Triple modular redundancy (TMR) replicates HW and/or SW components to mask failures (through a voter, i.e., it includes detection). In the best (but most costly) case, the replicas are diverse w.r.t. their design such that design and input errors can also be masked [28]. Such hardware redundancy is typically added during design time and used in closed, non-elastic systems. To exploit these techniques, several resilient microprocessor designs [162][163][164] and corresponding software layer controls [165][166][167] have been proposed to ensure resilience against reliability threats, i.e., soft errors. Typically, TMR-based solutions incur a large area and power overhead. However, adaptive-TMR solutions [168] can trade off the power budget against reliability threats. Similarly, software and hardware error masking techniques [163,169] exploit the dark silicon (under-utilized areas) in multi-core systems [170] to mitigate faults. In an elastic IoT, in contrast, an orchestrator can maintain a directory of available services and redirect resource requests if necessary.
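The masking and detection performed by a TMR voter can be sketched as a strict majority vote over the replica outputs. The function name and the exact-equality comparison are simplifying assumptions; real-valued outputs would need a tolerance or a fusion step instead.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (value, faulty_indices) for a list of replica outputs.

    `value` is None when no strict majority exists; in that case all
    replicas are reported as suspect.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

For example, `majority_vote([42, 42, 7])` masks the deviating third replica and reports its index, while three mutually different outputs cannot be masked.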
Implicit redundancy, like related observations in a system (in contrast to traditional redundancy, i.e., the explicit replication of components), can be exploited by structural adaptation. A substitute component is instantiated to replace the failed component, which also includes rewiring and possibly a relocation [171,172] (see Sec. 7 for an example).
Agreement / Voting or Fusion Byzantine failures (inconsistent failures to different observers), typically caused by malicious attacks, can be detected and tolerated using replicas (here: redundant services on different nodes of a distributed system) by agreement or consensus on the outputs [94]. The output of redundant components can be combined or fused, e.g., via filters or fuzzy logic [42]. Through recent implementations and usage in cryptocurrencies [173,174], attention has shifted towards smart contracts and blockchains, which ensure authentication and integrity of data [44,46,47,96]. Basically, a blockchain is a series of data records, each linked to its predecessor by a cryptographically secure hash, which makes it computationally infeasible to alter the chain unnoticed. However, blockchains suffer from complexity, energy consumption and latency, and therefore currently cannot be used for real-time anomaly detection or be run by simple nodes with low computational power and restricted battery budgets [96]. Nevertheless, they are already being examined as a means to manage access to data (authorization), to purchase devices or computing power, or to manage public-key infrastructure in the IoT [44,175,176].
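The integrity property of such a chain of records can be illustrated with a minimal hash chain. The sketch only shows the hash linking; consensus, signatures and smart contracts are deliberately omitted, and the record layout and function names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder predecessor hash of the first block (assumed)


def make_block(data, prev_hash):
    """Create a record whose hash covers its data and its predecessor's hash."""
    payload = json.dumps({"data": data, "prev": prev_hash}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"data": data, "prev": prev_hash, "hash": digest}


def build_chain(records):
    """Link the records into a chain, each block pointing to its predecessor."""
    chain, prev = [], GENESIS
    for record in records:
        block = make_block(record, prev)
        chain.append(block)
        prev = block["hash"]
    return chain


def verify_chain(chain):
    """Check every link and every hash; any altered record breaks the chain."""
    prev = GENESIS
    for block in chain:
        expected = make_block(block["data"], block["prev"])["hash"]
        if block["prev"] != prev or expected != block["hash"]:
            return False
        prev = block["hash"]
    return True


# Hypothetical sensor records.
chain = build_chain([{"sensor": "radar1", "x": 12.5}, {"sensor": "radar1", "x": 13.1}])
```

Changing the data of any block after the fact causes `verify_chain` to fail unless all subsequent hashes are recomputed, which is what the consensus mechanism of a real blockchain makes expensive.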
Rewiring or Redirection Broken links in mesh networks are typically reconfigured using graph theory considering node properties and application requirements [177]. A software component may route the task flow to a recovery routine [41].
Relocation Migration of software components or tasks is studied in the field of resource optimization, utilization and dynamic scheduling on (virtual) machines. Optimization algorithms [143], multi-agent systems [178] or reinforcement learning [179] find a new task configuration utilizing resources in case of a platform failure. Tasks may also be migrated in advance when the health state decreases [154]. Cloud applications drive the emergence of new technologies like containerization, resource-centric architectures and microservices, which ease service orchestration in complex and elastic systems. Dragoni et al. [180] predict increased dependability using microservices, which focus on small, independent and scalable function units (cf. fault containment units in Kopetz [28]); however, security remains a concern.

Long-Term Dependability and Security
During design time only a subset of failures and threats can be considered. Changes of the system itself or of the environment cannot be predicted and may lead to new fault scenarios. Moreover, over time (especially when considering systems deployed for several decades like autonomous vehicles), new attacks can emerge (e.g., adversarial machine learning, although ML theory is decades old), new vulnerabilities in the system can be uncovered (recent examples are Spectre and Meltdown in the decades-old technology of high-end processors), and attackers may get more powerful and intelligent (e.g., learning-based attacks).
We therefore believe that the IoT needs enhanced self-adaptation techniques (possibly cognitive in nature) to achieve long-term dependability and security. For instance, apart from traditional fault-tolerance like backup hardware/software components or checkpointing and restarting, self-healing is a promising approach which is related to self-adaptation and self-awareness. Self-aware systems learn the models of the system itself and its environment to reason and act (e.g., self-healing) in accordance with higher-level goals (e.g., availability) [36].
The key feature of self-* or self-X techniques is continuous learning and optimization which is performed during runtime to evolve the models upon system changes.
To design and build a long-term dependable and secure IoT of smart CPS, the following research questions need to be addressed first:
1) How to detect and separate subsystem failures and minimize the failure dependencies of the subsystems? How to guarantee the resilience of the system when applying machine learning and/or self-adaptation?
2) How to detect and recover compromised components with minimal performance and energy overhead? How to learn from unknown attacks on-the-fly and devise appropriate mitigation strategies online (e.g., on-demand isolation or new fail-safe modes), in addition to learning quickly from ongoing attacks to minimize the attack surface?
3) How to ensure the robustness of the resilience mechanisms themselves?
To address these challenges, the following techniques are envisioned to ensure long-term dependability and security.

Verification and validation
Ensuring the complex dependencies and integrity of several components and subsystems within a system is a very challenging research question. The state-of-the-art on dependability and security assurance is based on model-driven design that consists of specifying rigorously the structure and the behavior of the systems using formal models. These models are amenable to formal verification techniques [181][182][183][184] that can provide comprehensive guarantees about correctness of the system's properties. The accuracy of these models and the test coverage limit the validity of the assurance. The addition of data-driven learning-enabled subsystems introduces uncertainty in the overall design process and may result in an unpredictable emergent behavior. This is because the operational behavior of these subsystems is a function of the data they train upon and it is very difficult to predict.
This lack of predictability forces us to consider novel approaches for ensuring long-term dependability and security. Here we envision at least two interesting research directions to pursue.
One idea is to take inspiration from the natural immune systems that protect animals from dangerous foreign pathogens (i.e., bacteria, viruses, parasites). In our case, we can envision a specialized subsystem that learns both how the surrounding environment evolves and how to best react to attacks. However, this approach would leave the system vulnerable during the learning process.
Another possible direction is to provide mechanisms to enforce dynamic assurance of security and dependability at runtime. A similar approach in control theory can be found in the Simplex Architecture [161] (SA). SA consists of a plant and two versions of the controller: a pre-certified baseline controller and an uncertified high-performance controller. A decision module decides whether to switch between the two controllers depending on how close the high-performance controller is to violating the safety region. In our case, we could envision a dedicated decision module (for example a runtime monitor) granting a certain degree of autonomy and trust to its subsystems depending on how far the overall system is from violating its safe and secure operating conditions.
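The switching logic of such a decision module can be sketched as follows. The example assumes a scalar state, an interval as safety region, a one-step prediction model and a fixed safety margin; all of these, including the function names and numeric values, are illustrative assumptions rather than the design of [161].

```python
def predict(state, command, dt=0.1):
    """Assumed plant model: the state integrates the command over one step."""
    return state + command * dt


def decision_module(state, hp_command, baseline_command, safe_low, safe_high, margin):
    """Forward the high-performance command only if the predicted next state
    keeps a margin to the bounds of the safety region; otherwise fall back
    to the pre-certified baseline command."""
    next_state = predict(state, hp_command)
    if safe_low + margin <= next_state <= safe_high - margin:
        return hp_command, "high-performance"
    return baseline_command, "baseline"


# Hypothetical speed control: safety region [0, 30] with a margin of 0.5.
far_from_limit = decision_module(20.0, 8.0, -2.0, 0.0, 30.0, 0.5)
near_limit = decision_module(29.0, 8.0, -2.0, 0.0, 30.0, 0.5)
```

Far from the bounds the uncertified controller keeps control; close to the upper bound the same command would leave the margin, so the module selects the baseline command instead.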

Intelligent and Adaptive Systems
Long-term dependability and security can be achieved by intelligent and adaptive systems (i.e., so-called smart or cognitive systems) that consider uncertainties and changes throughout the lifecycle, and increase their inherent robustness levels on-the-fly autonomously through continuous self-optimization and self-healing.
Many of the techniques presented in the last section do not handle dynamic systems or systems that evolve over time. However, the approaches can be evaluated and extended by artificial intelligence, or new techniques developed to cope with the elasticity of the IoT.
Machine learning-based fault detection and recovery are replacing traditional (pre-configured) techniques because of their ability to extract new and hidden features from complex and enormous amounts of data [185,186]. To design intelligent and adaptive ML-based secure (sub)systems, first a trained model must be acquired with respect to safe, secure and dependable behavior while considering uncertainties, unforeseen threats and failures, and design constraints. Next, the trained model is integrated within these (sub)systems for online threat and fault detection under the area and power constraints. However, ML-based techniques typically do not consider the limited computational resources, complexity, possibly poor interoperability or real-time constraints of the IoT. Therefore, one has to apply scalable and/or distributed techniques.

Robustness
The subsystem ensuring resilience shall be robust against the elasticity or dynamicity of the system. The machine learning models need to be updated or reconstructed from the basic building blocks on system changes. Models like deep neural networks (DNNs) and recurrent neural networks (RNNs) are effective in classifying real-world inputs when trained over large data sets. Unfortunately, the decision-making systems using NNs cannot be analyzed and rectified due to the currently used black-box models of NNs.
However, AI systems used in industry (in particular safety-critical CPS) need to follow the strict regulations and are expected to explain the reasoning behind their decision-making which is not viable when using ML-based systems [187]. Recent advances in AI, e.g., biologically inspired NNs, may provide the necessary information to get certified. For instance, Hasani and Lechner et al. [188] are able to interpret the purpose of individual neurons and can provide bounds on the dynamics of the NN.
Moreover, an NN has several security and reliability vulnerabilities w.r.t. data, e.g., data poisoning, model stealing or adversarial examples [61,189]. To ensure the robustness of such NN-based decision systems, several countermeasures have been proposed. A common approach is to encrypt the data or the underlying model [190][191][192]. However, encryption works only as long as the encryption techniques and confidence vectors remain hidden from the adversary. Moreover, it requires additional computational resources for encryption and decryption. Other approaches are, e.g., watermarking [193,194], input transformation [195] and adversarial learning [196,197]. Note that these countermeasures protect the NNs against known attacks only. Therefore, to ensure robustness also under unknown attacks and unforeseen circumstances, formal verification-based approaches [198,199] are emerging as an alternative solution.

Roadmap
We are investigating techniques for anomaly detection and self-healing to ensure resilience in IoT for CPS.

Goals
The overarching goal of our research is to provide guidelines, methods and tools to enable a safe and secure IoT for CPS.
Our contributions are two-fold. We increase the dependability of the IoT (and in further consequence, the CPSs using it) by self-healing and the security by developing (semi-)automatic configuration, testing and threat detection. We plan to address the following research questions:
• How to improve the resilience of the IoT by fail-operational mechanisms?
• How to verify and monitor IoT components?
• How to detect anomalies in the IoT with minimum performance and energy overhead?
• How to ensure high resilience even under unpredictable attack and failure scenarios?
• What architectural requirements are necessary to ensure resilience with these mechanisms?
In summary, the key research goals of our contribution are:
• Propose novel design methodologies and architectures for scalable resilience in IoT for CPS.
• Propose an energy-efficient analysis (verification) and threat detection.
• Propose a framework to design low-power and ML-based run-time anomaly detection.
• Propose a methodology to identify and assert the runtime safety and security properties.
• Propose a self-healing mechanism for the IoT.

Challenges
The resilience of systems using anomaly-based detection and self-healing raises the following research questions and challenges.
• C1: Resource Limitations. The majority of IoT components are resource-constrained devices. The developer often has to trade off power, time and costs against resilience. Typically, small IoT devices like commercial off-the-shelf (COTS) microcontrollers may provide insufficient capabilities. Some technologies might therefore need hardware implementations (e.g., RV monitor) or should be designed as a lightweight and fully distributed, layered, or clustered service (e.g., a monitor per subsystem).
For instance, one major challenge in anomaly detection is data acquisition under power and design constraints. This raises the following research questions:
1. How to extract/acquire and analyze a particular characteristic during run-time while considering the design and power constraints?
2. How to reduce the area and energy overhead of the data acquisition, i.e., power-ports, for runtime measurement and modeling?
• C2-1: Interoperability and Complexity. The IoT is a large dynamic network of heterogeneous components. In particular, COTS components or components protected by intellectual property (IP) may not provide a proper specification of their behavior for some of the detection and adaptation methods. Furthermore, new devices or subsystems may introduce unknown interfaces (here: unknown to the resilience-enabling technologies). In particular, this raises the following research questions:
1. How to identify the reference communication behavior without any reference system?
2. How to model the communication behavior so that it can be used to identify anomalous behavior?
In anomaly detection, for instance, one of the major challenges is to identify an appropriate golden/reference behavior against which online/offline behavior can be compared. This raises the following research questions:
1. How to model/identify a reference/golden behavior that covers the key characteristics and is scalable?
2. How to obtain the labeled data for supervised training to extract the reference model?
3. Which modeling techniques and corresponding characteristics are appropriate to identify the anomalous behavior with complete coverage?
• C2-2: Interoperability and Sharing. The devices of a CPS are specified during design time having a specific application in mind. The things of an IoT will most likely be shared between applications while different fog/cloud applications might request different QoS of the devices, e.g., regarding dependability. The methods therefore must also consider and combine the requirements of different applications and the value of trust of the information (e.g., used to derive actions). Due to the vast size of an IoT, a central mechanism most likely will not be able to cope with all the input data necessary to achieve resilience (considering memory and time constraints).
• C3: Real-Time and Scalability. One major shift from sensor networks to the IoT is the control and manipulation of actuators from the distance, i.e., the IoT comprises a cyber-physical system. The CPS typically has to satisfy time constraints (rates, deadlines) in order to function correctly. In such real-time applications the probing of information by a monitor or changes in the system (e.g., connection of new things, updates, recovery) shall not influence the timing behavior of the CPS. Furthermore, the timeliness to detect and react to critical failures has to be considered.
However, the complexity and dynamicity of the network will leave the door ajar for some faults, e.g., physical faults, design errors or zero-day malware. Therefore a proper never-give-up strategy [28] to cope with unconsidered failures has to be developed.

Figure 9: A brief history of computer systems and our roadmap towards resilient IoT for CPS.

Figure 9 depicts the evolution of embedded systems (milestones as junctions), their goals and requirements (as lines). The lower part of Fig. 9 summarizes our milestones (1.i-iv, 2.i-iii, 3) given below.

Milestones
1) How to improve the resilience of the IoT by fail-operational mechanisms? How to monitor IoT components?
The IoT will most likely contain many heterogeneous components with different capabilities of resilience. We therefore consider fail-operational mechanisms that target the dependability of the information exchanged between IoT components. The mechanism shall be applicable within the fog and/or cloud running on an independent component or may be applied in an IoT device itself if the performance requirements for the mechanism are satisfied. We use given implicit redundancy of information provided by distinct IoT components to self-heal the IoT. To this end, the major effort lies in i) developing and extending our redundancy model, ii) implementing a self-adaptive fault detection, iii) applying fault diagnosis, and iv) recovery considering currently available information.
2) How to verify IoT components? How to detect anomalies in the IoT with minimum performance and energy overhead?
We propose a methodology which consists of the following phases: i) security vulnerability analysis, ii) low-power and iii) ML-based anomaly detection. The first phase of the proposed methodology is to analyze the IoT for CPS for security vulnerabilities. Unlike traditional simulation and emulation techniques, we plan to leverage formal verification for analyzing the security vulnerabilities. After identifying the security vulnerabilities and the corresponding parameters, i.e., communication and side-channel parameters, the next step is to use this information to develop online anomaly detection techniques. In this project, we plan to leverage two key characteristics, i.e., communication behavior and power (dynamic and leakage), to develop the low-power and ML-based anomaly detection techniques.
3) What architectural requirements are necessary to ensure resilience with these mechanisms?
Finally we collect the architectural requirements of our developed mechanisms to be added to design guidelines for resilient IoT.
In the following, we present a case study to demonstrate how the above mechanisms can be employed in a real-world use case to detect, diagnose and mitigate faults.

Case Study: Resilient Smart Mobility
To illustrate the effectiveness of our approach, we perform a case study on mobile autonomous systems, i.e., vehicle-to-everything (V2X) communication in automated driving. The network connects sensors, controllers and actuators, buildings, infrastructure and roadside systems.
In particular, let us consider vehicles driving on a highway (Fig. 10). Radar sensors are mounted along the street and form a collaborative sensor field. In order to improve object detection and classification, a multi-object tracking scheme is employed, which uses subsequent sensor measurements in the form of prediction and update cycles to estimate vehicle locations. The tracking data can be used for, e.g., traffic congestion forecast or accident investigations. A set of radar sensors is connected to a fog node, i.e., a computing unit and IoT gateway in the vicinity of the sensors. The tracker, a software component running on a fog node, tracks the vehicles on the road segment covered by the associated radars. Some vehicles (e.g., autonomous cars) are equipped with distance sensors like radar, lidar or depth cameras. The fog node(s) of these cars can connect to nearby fog nodes of the street (directly over a vehicular network called VANET, or via the mobile network over the cloud). Additional MEMS sensors can support energy management, health and comfort in road transportation.
We assume the IoT infrastructure (things, fog, cloud, network) is given and propose methods to increase the resilience of the IoT. Failures of the radar sensors in our example will lead to inaccurate or even unusable tracking results. Failure scenarios like communication crashes and dead batteries (fail-silent, fail-stop) are relatively easy to handle (e.g., watchdog/timeout). However, the sensor measurements received by the tracker running in the fog node may be erroneous due to noise (e.g., communication line, aging), environmental influences (e.g., dirtying of the radar) or a security breach (e.g., a hacked fog node that collects data of a group of sensors). To detect a failure of the sensor one has to create particular failure models for each possible hazard (cf. aging, dirtying and a security breach). A simple method for detecting a faulty sensor value in different failure scenarios is to check it against other information sources, i.e., to exploit redundancy. However, explicit redundancy, i.e., replicating observation components, is costly.
Self-healing can be applied to react also to failures not specifically considered during design-time. A very promising way of achieving self-healing is through structural adaptation (SHSA), by replacing a failed component with a substitute component by exploiting implicit redundancy (or functional and temporal redundancy) [200]. We use a knowledge base [171,172] modeling relationships among system variables given that certain implicit redundancy exists in the system and extract a substitute from that knowledge base using guided search (Sec. 7.2). The knowledge base can also be used to monitor the system by comparing the information of variables against each other, i.e., to detect failures (Sec. 7.1).
SHSA can be encapsulated in separate components listening and acting on the communication network of the IoT, e.g., as the tasks monitor, diagnose and recover running on a fog node (Fig. 11). SHSA monitors the information communicated between components (typically the sensor measurements or filtered/estimated observations), identifies the failed component and replaces the messages of the failed component delivering an erroneous output by spawning a substitute software component. SHSA considers the currently available information in the network, i.e., it can be applied in dynamic systems like the IoT (components may be added and removed during runtime). The knowledge base, in particular the relationships between the communicated information, can be defined by the application's domain expert or learned (approximated by, e.g., neural networks, SVMs or polynomial functions, see also [200]).
Alternatively, the monitor and diagnose task may be installed in the cloud analyzing the logged tracks to trigger maintenance of radar sensors. The requirements needed by SHSA regarding the architecture of the system (e.g., communication network) and a reference implementation of SHSA can be found in [200].

Detection and Diagnosis
In our future work, we want to use the SHSA knowledge base described below to perform plausibility checks upon related information.
As our focus is on adaptation of the software cyber-part in a CPS (cf. dynamic reconfiguration of an FPGA), we assume that each physical component comprises at least one software component (e.g., the driver of the radar in the vehicle) and henceforth, consider the software components only. The CPS implements certain functionality, e.g., a desired service (e.g., collision avoidance). The subset of components implementing the CPS' objectives are called controllers.

SHSA Knowledge Base
A system can be characterized by properties referred to as variables (e.g., the position and velocity of a tracked vehicle). The values of system variables are communicated between different components, typically via message-based interfaces. We denote such transmitted data associated with a variable as an information atom, or itom for short [201]. A variable can be provided by different components simultaneously (e.g., two radars with overlapping field of view). Each software component executes a program that uses input itoms and provides output itoms. An itom is needed when it is an input of a controller. A variable is provided when at least one corresponding itom can be received.
Variables are related to each other. A relation is a function or program (e.g., math, pseudo code or executable python code) to evaluate an output variable from a set of input variables.
The knowledge base is a bipartite directed graph (which may also contain cycles) with independent sets of variables and relations of a CPS. Variables and relations are the nodes of the graph. Edges specify the input/output interface of a relation. For instance, Fig. 12 models the relationships between the variables in the tracking use case (only relevant nodes, relationships and edge directions for the scenario in Fig. 13 are shown). The knowledge base can also be encoded by a set of rules, e.g., written in Prolog. It is then possible to further customize the model, e.g., to follow the requirements and constraints of a CPS application.
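The graph structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation from [200]: the class names `Relation` and `KnowledgeBase`, the variable names and the two relations are hypothetical and are not taken from Fig. 12.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class Relation:
    """Relation node: evaluates one output variable from its input variables."""
    name: str
    inputs: Tuple[str, ...]
    output: str
    func: Callable[..., float]


@dataclass
class KnowledgeBase:
    """Bipartite directed graph with variable nodes and relation nodes."""
    variables: Set[str] = field(default_factory=set)
    relations: List[Relation] = field(default_factory=list)

    def add(self, relation: Relation) -> None:
        """Insert a relation node together with its adjacent variable nodes."""
        self.relations.append(relation)
        self.variables.update(relation.inputs)
        self.variables.add(relation.output)

    def edges(self) -> List[Tuple[str, str]]:
        """Edges run variable -> relation (input) and relation -> variable (output)."""
        result: List[Tuple[str, str]] = []
        for r in self.relations:
            result += [(v, r.name) for v in r.inputs]
            result.append((r.name, r.output))
        return result

    def relations_for(self, variable: str) -> List[Relation]:
        """All relations that can evaluate the given variable."""
        return [r for r in self.relations if r.output == variable]


kb = KnowledgeBase()
# Hypothetical relations of a tracked vehicle; together they form a cycle.
kb.add(Relation("r_integrate", ("position_prev", "velocity", "dt"), "position",
                lambda p_prev, v, dt: p_prev + v * dt))
kb.add(Relation("r_differentiate", ("position", "position_prev", "dt"), "velocity",
                lambda p, p_prev, dt: (p - p_prev) / dt))
```

Storing the relation as a callable mirrors the statement that a relation may be executable code; a rule-based encoding (e.g., in Prolog) would replace the `func` field by a rule body.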
A proper data association identifies which itoms or measurements represent the same variable, e.g., links the different position itoms (x, y, v)|_* to each other. For instance, the GPS position (x, y, v)|_GPS of a vehicle (transmitted by the vehicle itself) has to be linked to the corresponding radar track (x, y, v)|_radar (provided by the radar).
Subsequently, the redundant itoms can be used, e.g., to monitor a radar sensor, to substitute a failed radar or to increase the accuracy of a tracking application by sensor fusion. The interested reader is referred to [172] and [171] for more details on the SHSA knowledge base.
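As an illustration of the data association step, the following sketch links each GPS position to the closest radar track. The text does not prescribe an association method; nearest-neighbor matching with a distance gate is our assumption here, and the function name `associate`, the identifiers and the gate value are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]  # (x, y) position in metres


def associate(gps: Dict[str, Point], tracks: Dict[int, Point],
              gate: float) -> Dict[str, Optional[int]]:
    """Link each GPS itom to the closest radar track within `gate` metres.

    Returns vehicle id -> track id, or None if no track is close enough.
    """
    links: Dict[str, Optional[int]] = {}
    for vehicle, position in gps.items():
        # Closest track by Euclidean distance (None if there are no tracks).
        best = min(tracks, key=lambda t: math.dist(position, tracks[t]),
                   default=None)
        if best is not None and math.dist(position, tracks[best]) <= gate:
            links[vehicle] = best
        else:
            links[vehicle] = None
    return links


gps_itoms = {"car_a": (0.0, 0.0), "car_b": (50.0, 0.0), "car_c": (100.0, 100.0)}
radar_tracks = {1: (0.5, 0.0), 2: (49.0, 1.0), 3: (200.0, 0.0)}
links = associate(gps_itoms, radar_tracks, gate=2.0)
```

This greedy variant may assign the same track to two vehicles in dense traffic; a global assignment would be needed to rule that out.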

Fault Detection by Redundancy
An itom has failed when it deviates from the specification. Our monitor uses the knowledge base to periodically perform a plausibility check to identify a failed itom. The automatic setup of a runtime monitor follows these steps:
• Select the variable to be monitored (typically the variable corresponding to the itom under test), e.g., the position of a vehicle.
• Collect the provided itoms (e.g., subscribe to all available messages). Note, the availability of variables may change from time to time which should trigger a new setup of the monitor.
• Extract relations of the monitored variable and available variables from the knowledge base (similar to the search of valid substitutions in Sec. 7.2).
The instantiated monitor for the position of a vehicle is depicted in Fig. 14. At each time step the relations are executed to bring the available itoms (provided variables) into the common domain (variable to be monitored), where the values are compared against each other. The monitor returns the fault status or a confidence / health / trust value for each itom used in the plausibility check. The confidence may be expressed by a distance metric or error between the itoms in the common domain. The trust or confidence of a radar may be accumulated from the individual confidence values of the tracked vehicles, i.e., the vehicles in the field of view of the radar. As soon as the confidence falls below a specific threshold for a specific amount of time, the status of the respective itom is classified as failed.
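The plausibility check can be sketched as follows. This is a simplified illustration under our own assumptions: the itoms are scalar, the reference in the common domain is the median of all transformed itoms, the error to that reference serves as the distance metric, and an itom is classified as failed after a fixed number of consecutive implausible steps. The names `plausibility_check` and `DebouncedStatus` and all numeric values are hypothetical.

```python
from statistics import median
from typing import Callable, Dict


def plausibility_check(itoms: Dict[str, float],
                       to_common: Dict[str, Callable[[float], float]],
                       max_error: float) -> Dict[str, bool]:
    """Transform every itom into the common domain and flag outliers.

    Returns itom name -> True if the itom is implausible at this time step.
    """
    common = {name: to_common[name](value) for name, value in itoms.items()}
    reference = median(common.values())  # assumed agreement rule
    return {name: abs(value - reference) > max_error
            for name, value in common.items()}


class DebouncedStatus:
    """Classify an itom as failed after `hold` consecutive implausible steps."""

    def __init__(self, hold: int) -> None:
        self.hold = hold
        self.counts: Dict[str, int] = {}

    def update(self, implausible: Dict[str, bool]) -> Dict[str, bool]:
        for name, bad in implausible.items():
            # Count consecutive implausible steps; reset on a plausible one.
            self.counts[name] = (self.counts.get(name, 0) + 1) if bad else 0
        return {name: count >= self.hold for name, count in self.counts.items()}


# Hypothetical relations into the common domain "position" (metres).
to_common = {
    "radar_position": lambda x: x,
    "gps_position": lambda x: x,
    "odometry_distance": lambda d: 100.0 + d,  # assumed start position 100 m
}
status = DebouncedStatus(hold=3)
for radar in (120.1, 135.0, 136.0, 137.0):  # radar drifts away after step 1
    flags = plausibility_check(
        {"radar_position": radar, "gps_position": 120.0,
         "odometry_distance": 20.2},
        to_common, max_error=1.0)
    failed = status.update(flags)
```

With three redundant itoms the median tolerates one faulty source; with only two, a mismatch can be detected but not attributed to one of them.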
The monitor can identify failed itoms in the common domain. However, when the output of a relation mismatches in the common domain, all inputs of the relation are marked faulty. To avoid additional monitors (a separate monitor for each input variable would be necessary to identify the failed itom), a fault localization can be performed.

Anomaly Detection
In addition to SHSA, we are developing a low-power runtime anomaly detection and an ML-based runtime anomaly detection [52] to ensure: i) a secure and safe platform for automated driving, and ii) secure V2X communication.
Low Power. To address the key challenge of power overhead in CPS, we propose a methodology that leverages the traditional low-power online anomaly detection techniques, in particular, assertions, sensor-based analysis and runtime monitoring. In this methodology, the first step is to identify an appropriate detection scheme based on the security threats, security metrics and design constraints. Second, based on the selected technique, the setup of corresponding assertions or sensor-based runtime monitoring is developed and implemented.
We propose to use communication behavior-based assertions to identify online anomalies with low power overhead. Fig. 15 shows that in case of a denial-of-service attack, the communication channel delivers fewer output packets than it receives as input. In case of flooding, jamming, and information leakage attacks, however, the traffic in the communication channel exceeds the injected input data. Therefore, it can be concluded that the communication behavior can be used to identify anomalous behavior. However, extracting the communication behavior without any golden circuit is not straightforward, which raises the following research challenges:
1. How to identify the reference communication behavior without any reference circuits/systems?
2. How to statistically model the communication behavior which can be used to identify the anomalous behavior?
3. How to measure and analyze the communication for low power runtime anomaly detection?
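A communication behavior-based assertion of the kind discussed above can be as simple as comparing packet counters of a channel within an observation window. The sketch below only illustrates the idea; the function name `classify_channel`, the tolerance parameter and the returned labels are hypothetical, and the open challenge of obtaining a reference behavior is not addressed by it.

```python
def classify_channel(packets_in: int, packets_out: int,
                     tolerance: int = 0) -> str:
    """Assertion on the packet balance of one channel in one time window.

    `tolerance` is an assumed slack for packets that are still in flight.
    """
    if packets_out < packets_in - tolerance:
        # Fewer output packets than input packets.
        return "suspected denial of service"
    if packets_out > packets_in + tolerance:
        # More traffic on the channel than was injected.
        return "suspected flooding, jamming or information leakage"
    return "nominal"


window_reports = [classify_channel(100, 100),
                  classify_channel(100, 60),
                  classify_channel(100, 180, tolerance=5)]
```

Two counters and two comparisons per window keep the power overhead of such a check small, which is the motivation for assertions in this setting.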
Machine Learning. With the increasing trend of connected devices, the number of communication channels also increases exponentially. Thus, communication behavior-based assertions are not feasible for handling the large number of communication channels and the corresponding communication data. Therefore, to increase the scope of the online anomaly detection for larger CPS with big data analysis, in this project, we propose to explore machine learning algorithms to extract the hidden features from the side-channel parametric and communication behaviors (i.e., power and communication behavior). The first step in developing an ML-based anomaly detection is to select an appropriate ML algorithm based on the design constraints, security threats and complexity of the measured data. The second step is to train and implement the ML algorithm based on measured data with minimum power and area overhead.
To illustrate the effect of intrusions on the power profile, we analyzed the MC8051 with and without the Trust-Hub benchmarks, i.e., MC8051-T200 and T400, in the Xilinx power analyzer. The experimental analysis in Fig. 16 shows that intrusions in the MC8051 have a significant impact on the power distribution with respect to different pipeline stages (see labels 1 to 4). Therefore, it can be concluded that the power profiling of the processing elements/controllers in CPS can be used to identify abnormalities. Though the power profiling of the microprocessor can be used to detect anomalous behavior, power-based ML training and runtime measurement are not easy. Therefore, the following research challenges must be considered while designing the ML-based online anomaly detection:
1. How to extract the power profiles of the processing elements (controllers in CPS) for efficient ML training?
2. How to reduce the area and energy overhead of power-ports for runtime measurement and modeling?
3. How to model the power behavior in such a way that it can be used during runtime measurements?

Fault Localization [135]
The fault detection mechanisms described in the last sections can identify failed data on the communication network. In order to recover the failed component responsible for the wrong information we have to apply fault localization.
Engineers often design CPS using the MathWorks Simulink toolset to model their functionalities. These models are generally complex hybrid systems that are often impossible to analyze using only the reachability analysis techniques described before. A popular technique to find bugs in Simulink/Stateflow models is falsification-based testing [125,126,202]. This approach consists of monitoring an STL property over traces produced by systematically simulating the CPS design with different sets of test cases. For each generated trace, the monitor returns a real value that indicates how far the trace is from violating the property. This information can be used to guide the test case generation towards an input sequence that falsifies the specification. However, this approach does not provide any information about which component has failed or about the precise moment in time that is responsible for the observed violation. To overcome this shortcoming, Bartocci et al. [135] have recently introduced a new procedure that aids designers in debugging Simulink/Stateflow hybrid system models, guided by STL specifications. This approach combines a trace diagnostics technique [142] that localizes the time segments and interface variables contributing to the property violations, a slicing method [203] that maps these time segments to the internal states and transitions of the model, and a spectrum-based fault-localization method [139] that produces a ranking of the internal states and/or transitions that are most likely to explain the fault.
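To make the real-valued verdict of the monitor concrete, the following sketch computes the robustness of the simple STL property "always (x < limit)" over a sampled trace and selects the most critical test case. It is a minimal example of quantitative STL semantics for one fixed property, not the tooling used in [125,126,202] or [135]; the function names, trace names and values are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def robustness_always_below(trace: Sequence[float], limit: float) -> float:
    """Robustness of 'always (x < limit)' over a sampled trace.

    Positive: the property holds with that margin.
    Negative: the property is violated by that amount.
    """
    return min(limit - x for x in trace)


def most_critical(traces: Dict[str, Sequence[float]],
                  limit: float) -> Tuple[str, float]:
    """Return the test case whose trace is closest to (or deepest in) violation."""
    scores = {name: robustness_always_below(trace, limit)
              for name, trace in traces.items()}
    worst = min(scores, key=scores.get)
    return worst, scores[worst]


# Hypothetical simulation traces of one signal for two test cases.
traces = {"gentle": [1.0, 2.0, 2.5], "aggressive": [1.0, 3.5, 2.0]}
case, rho = most_critical(traces, limit=3.0)
```

A falsification loop would generate new test cases near the most critical one until the robustness becomes negative, i.e., until the specification is falsified.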

Recovery or Mitigation
A failed itom can be replaced by a function of related itoms. To this end, SHSA searches the knowledge base for relationships that use provided variables and spawns a substitute.

Replacement
The substitute search algorithm traverses the knowledge base (Fig. 12) from the failed but needed information as root to find a valid substitution [172].
A substitution of a variable is a connected acyclic sub-graph of the knowledge base with the following properties: i) The output variable is the only sink of the substitution. ii) Each variable has zero or one relationship as predecessor. iii) All input variables of a relation must be included (it follows that the sources of the substitution graph are variables only).
A substitution is valid if all sources are provided, otherwise the substitution is invalid (Fig. 17). Only a valid substitution can be instantiated (to a substitute) by concatenating the relationships which take the selected itoms as input (e.g., best itoms of the source variables). Substitutions can be found by depth-first search of the knowledge base with the failed variable as root. The search may stop as soon as all unprovided variables are substituted [171]. In [172] we present a guided search approach using a performance measure for substitutions.
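The depth-first search for a valid substitution can be sketched as follows. This is a simplified illustration, not the algorithm of [171] or [172]: it returns the first valid substitution it finds, does not rank alternatives, and may list a relation twice if two branches need the same unprovided variable. The knowledge base `KB`, its variable names and the function name `find_substitution` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical knowledge base:
# output variable -> list of (relation name, input variables).
KB: Dict[str, List[Tuple[str, Tuple[str, ...]]]] = {
    "position": [("r1", ("gps_position",)),
                 ("r2", ("distance", "start_position"))],
    "distance": [("r3", ("speed", "time"))],
    "speed": [("r4", ("distance", "time"))],  # cycle: distance <-> speed
}


def find_substitution(variable: str, provided: Set[str],
                      path: Tuple[str, ...] = ()) -> Optional[List[str]]:
    """Depth-first search for a valid substitution of `variable`.

    Returns the relation names in evaluation order (inputs first), or None
    if no valid substitution exists. `provided` must not contain the failed
    variable. `path` holds the variables under substitution on the current
    branch, so that cycles of the knowledge base are not followed.
    """
    current_path = path + (variable,)
    for name, inputs in KB.get(variable, []):
        plan: List[str] = []
        for v in inputs:
            if v in provided:
                continue  # provided variable: a source of the substitution
            if v in current_path:
                break  # following this input would close a cycle
            sub = find_substitution(v, provided, current_path)
            if sub is None:
                break  # unprovided input that cannot be substituted
            plan += sub
        else:
            return plan + [name]  # every input is provided or substituted
    return None


direct = find_substitution("position", {"gps_position"})
chained = find_substitution("position", {"speed", "time", "start_position"})
invalid = find_substitution("position", {"time", "start_position"})
```

In the last call, the only way to obtain `distance` leads back to `distance` through `speed`, so the search reports that no valid substitution exists.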
The result of the search, i.e., the substitution, is instantiated in a substitute [200]. In particular, the substitute subscribes to the input itoms and concatenates the functions or programs from the relationships. The substitute then periodically publishes the output. To avoid inconsistencies and fault propagation, the failed component (probably publishing erratic messages) should be shut down as soon as possible.
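Concatenating the relations of a substitution into a single substitute amounts to function composition along the substitution graph. The sketch below shows this step only; subscribing to itoms and periodically publishing the output depend on the middleware and are omitted. The type alias `Relation`, the function `instantiate` and the example relations are hypothetical and do not reproduce the reference implementation in [200].

```python
from typing import Callable, Dict, List, Tuple

# (input variables, output variable, function) of one relation.
Relation = Tuple[Tuple[str, ...], str, Callable[..., float]]


def instantiate(plan: List[Relation]) -> Callable[[Dict[str, float]], float]:
    """Concatenate relations (given in evaluation order) into one substitute."""

    def substitute(itoms: Dict[str, float]) -> float:
        values = dict(itoms)  # copy, the received itoms stay untouched
        for inputs, output, func in plan:
            values[output] = func(*(values[v] for v in inputs))
        return values[plan[-1][1]]  # output of the last relation

    return substitute


# Hypothetical substitution for a failed position itom.
plan: List[Relation] = [
    (("speed", "time"), "distance", lambda s, t: s * t),
    (("distance", "start_position"), "position", lambda d, p0: p0 + d),
]
position_substitute = instantiate(plan)
estimate = position_substitute(
    {"speed": 2.0, "time": 5.0, "start_position": 100.0})
```

In a deployment, `position_substitute` would be called in the message callback of the subscribed input itoms and its result published in place of the failed itom.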

Conclusion
This paper summarized the state of the art in detection and recovery to react to failures in IoT for CPS. We further presented the main challenges and a roadmap towards a resilient IoT. The main challenges identified for existing and new resilience methods are:
• Limited resources of computation and power (e.g., for runtime data acquisition).
• Limited knowledge of device and interface semantics (e.g., to retrieve a reference behavior for anomaly detection or model the redundancies in the system).
• Preserve the (real-time) behavior of the system when adding or applying resilience techniques.
• Provide long-term dependability and security, that is, ensure resilience also after environmental, functional or technological changes of the system.
• Adaptation, verification, validation and robustness of the resilience techniques.
Moreover, we introduced some of our key solutions on an automotive example. The SHSA knowledge base presented in Section 7 describes implicit and explicit redundancy in a communication network. It can therefore be exploited to monitor, replace or fuse information. Because SHSA is based on redundancy, it can handle various fault scenarios. In particular, permanent faults in the IoT can be detected and recovered from, provided that some redundancy exists. As long as the failed components can be isolated and replaced by redundant information, the methods can handle physical, development or interaction faults manifested as failures at the components' interfaces.
The monitors tackle the requirement on fault detection by voting over redundant information or comparing it to some reference behavior (R1). An additional fault localization identifies the failed component and triggers its disconnection to avoid fault propagation. The substitution replaces failed information with redundant information (R2).
The presented techniques need a reference behavior, a common understanding of the information or access to relevant redundancy (C2). Therefore, the IoT should provide proper interoperability (e.g., in the form of standards). Under some constraints (bounded or static SHSA knowledge base, estimation of the worst-case execution time of relationships) SHSA is suitable for real-time applications [171]. However, solutions to increase scalability have to be investigated (C3). Moreover, the individual IoT devices might not have the resources to implement detection and recovery (C1). In future work we therefore want to focus on a distributed approach of the mechanism (e.g., by splitting the knowledge base for subsystems, or by monitoring in a distributed fashion, as agreement protocols do).