Driven by Data or Derived Through Physics? A Review of Hybrid Physics Guided Machine Learning Techniques With Cyber-Physical System (CPS) Focus

A multitude of cyber-physical system (CPS) applications, including design, control, diagnosis, prognostics, and a host of other problems, are predicated on the assumption of model availability. There are mainly two approaches to modeling: physics/equation-based modeling (model-based, MB) and machine learning (ML). Recently, there has been a growing consensus that ML methodologies relying on data need to be coupled with prior scientific knowledge (or physics, MB) for modeling CPS. We refer to the paradigm that combines MB approaches with ML as hybrid learning methods. Hybrid modeling (HB) is a growing field within both the ML and scientific communities and is recognized as an important, though still nascent, area of research. Several recent works have attempted to merge MB and ML models to exploit their combined potential. However, the research literature is scattered and unorganized. We therefore make a systematic attempt at organizing and standardizing the methods of combining ML and MB models. In addition, we outline five metrics for the comprehensive evaluation of hybrid models. Finally, we conclude by discussing the challenges of hybrid models on which the research community should focus in order to harness their full potential. An additional feature of this survey is that hybrid modeling work is discussed with a focus on modeling cyber-physical systems.


I. INTRODUCTION
Just as the microscope empowered our naked eyes to see cells, microbes, and viruses, thereby advancing the progress of biology and medicine; or as the telescope opened our minds to the immensity of the cosmos and has enabled humankind to explore and understand the space in spectacular detail, in the current century, ComputingScope (computational ''instruments'' for ''viewing'' and analyzing data) will help successfully decipher and navigate another infinite: the staggeringly complex multi-modal (text, image, video, and sensors) information in all facets of our lives.
The associate editor coordinating the review of this manuscript and approving it for publication was Fan-Hsun Tseng.
ComputingScope will give us a vision of the whole and help us synthesize actionable information. The recent success of Deep Learning (DL) [1]-[8] acts as a harbinger of the potential of ComputingScope. While DL has made a significant impact in domains such as natural language processing (NLP), computer vision, and speech recognition, its impact on cyber-physical systems (CPS) is only beginning to be understood. In order to realize the potential of DL frameworks in CPS applications, there is a growing consensus that ML methodologies relying on data need to be coupled with physics-based modeling (prior knowledge) techniques.
The technology push driving the integration of a myriad variety of sensors, computational intelligence, communication, and control to bring ''smartness'' into a variety of CPSs is the immediate next frontier for ComputingScopes. CPSs are complex systems capable of Cognition, Communication, Computation and Control (4C) [9]- [13]. Consequently, CPS has four main components with unique functions. They are physical infrastructure consisting of sensors for sensing the stimuli to curate raw data, actuators for responding to the stimuli, a computational intelligence platform for computing the response contingent upon the stimuli, and a network that interconnects all of them. Figure 1 illustrates the architecture and functioning of a CPS. The development of intelligent computational techniques (computation component of the CPS) that are capable of harnessing CPS sensor curated raw data to generate actionable intelligence (typically hidden in data) is the focus of this review. A multitude of CPS applications, including design, control, diagnosis, prognostics, and a host of other problems, are predicated on the assumption of model availability. There are mainly two approaches to modeling: Physics/Equation based modeling (Model-Based, MB) and Machine Learning (ML). Model-based methods assume the availability of an accurate system model, while data-driven methods are based on machine learning. Purely data-driven ML methods ignore any knowledge about the physical/abstract systems. Additionally, ML approaches require a large amount of labeled training data that is typically unavailable. MB approaches require excellent physics models and good specification of parameter values. When building models of complex CPSs, we are often limited by the unavailability of the parameters of the system components due to incomplete technical specifications, hidden physical interactions, or interactions that are too complex to model from first principles. 
Hence, we often make simplifying assumptions (e.g., linear approximations) and construct coarse models that imperfectly describe the behavior of the real system. A prudent approach is to use hybrid methods that use the physics of the system and prior knowledge about the domain to guide the construction of machine learning techniques such as Deep Neural Networks (DNNs). The principal goal of this review is to discuss challenges related to the development of hybrid methods that combine multi-physics equation-based models with data-driven machine learning models (such as DNNs) to enable predictive modeling of CPSs in the presence of imperfect models and sparse and noisy data.
Our review is complementary to the existing review in the hybrid modeling domain [14]. The key differences include but are not limited to (1) our focus is different and is primarily on CPS applications, (2) we outline a completely different framework for categorizing hybrid modeling techniques, (3) we provide an update on recent hybrid modeling papers in the last five years with a primary focus on deep-learning techniques, and (4) we outline several new directions for research in CPS hybrid modeling domain.
The paper is organized into eight sections. Section II discusses the existing techniques of modeling CPS. The hurdles faced by the existing CPS modeling techniques are elaborated in section III. The necessity of combining ML and MB for eliminating those hurdles is outlined in section IV. The methods for combining ML and MB models are discussed in section V. The metrics for evaluating the performance of the hybrid models are deliberated in section VI. The new challenges that may emanate from hybrid modeling are explored in section VII. Finally, section VIII concludes the paper. Figure 2 summarizes the main components of our work.
Progress in CPS automation can be seen as belonging to three eras. In the first era, the dirty and dangerous work [54] involving onerous human labor and industrial equipment was automated. This was succeeded by the automation of dull activities [54], which consisted of automated interfaces, clerical chores, service transactions, etc. Finally, in the current third era, CPS is being empowered to make decisions [54]. In the third era, AI in general, and machine learning in particular, is going to assist CPS in making automated decisions to accomplish physical tasks. Previously, automation captured the segment of the work that was more manual but less cognitively challenging. Automation powered by artificial intelligence has the potential to capture some part of the remaining segment of the work. It can help augment human capabilities by doing mental work along with menial work. While doing so, the CPS has to meet a few additional goals that are still open for research. The CPS must bring the precision, accuracy, and latency of the computing world to the physical world. The asynchrony in time and space must be fixed. The CPS must cater to the privacy and security of itself and the associated users. The intrinsic and extrinsic heterogeneity and complexity must be addressed. The internal, external, and cross-functional dynamics must be comprehended and modeled. It must bridge the gap between the discrete and logical functions of the virtual computing world and the continuous physical dynamics of the real world.
VOLUME 8, 2020
Several attempts have been made at modeling CPSs to harness their revolutionary potential. Depending on the intended goal and the characteristics of the CPS, the modeling techniques have broadly diversified into four categories: (1) physics-based equations; (2) state machine based; (3) rule and agent based; and (4) data-driven models. All these modeling paradigms are pictorially represented in Figure 3. These modeling techniques encompass white-box models, grey-box models, and black-box models [55], [56]. The physics-based equations and state machines are used to model the continuous and discrete dynamics of a system, respectively. They are purely based on the physical dynamics of the system. They are robust but lack intelligence. Rule and agent based models leverage the data generated by the system, along with the knowledge of CPS. Hence, rule and agent based models are more capable. The data-driven methods take advantage of data. The following subsections include a brief critical appraisal of these modeling paradigms.

A. PHYSICS EQUATION BASED CPS MODELING
The physical dynamics of all systems, including CPS, are continuous in nature [57]. They evolve with time. Such continuous dynamics are effectively captured by physics-based models, which usually take the form of equations. The equations can be simple dependence equations, ordinary differential equations, partial differential equations, or a combination of them [58], [59] (Figure 3(a)). Consider a time ($t$) dependent system of $n$ independent variables denoted by $x_i$, where $i = 1, 2, \ldots, n$, and $m$ target variables denoted by $y_j$, where $j = 1, 2, \ldots, m$. A simple dependence equation relates each target variable $y_j$ directly to the independent variables. If the dynamics take the form of ordinary differential equations [60], then the model will be

$$g\left(x,\; y_j,\; \frac{d^{k} y_j}{dx^{k}}\right) = 0.$$

In the case of complex systems, it is difficult to identify the individual impact of a parameter accurately. In such cases, the dynamics take the form of partial differential equations [61], in which the ordinary derivatives are replaced by partial derivatives of $y_j$ with respect to the individual $x_i$, and $k, l = 1, 2, \ldots, p$ are the indices of the differential operators. The equation-based models are limited to modeling continuous dynamics. Causality, conservation laws, objectivity, and composability are the main attributes of physics-based equations [62].
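As an illustration of an equation-based model, the following minimal sketch writes a first-order ordinary differential equation in the implicit form g(t, v, dv/dt) = 0 and integrates it numerically. The physical system (a ball falling with linear drag), the parameter values, and all function names are assumptions chosen for illustration; they are not taken from the works cited above.

```python
import math

# Hypothetical example system: ball of mass m falling under gravity g with
# linear drag coefficient c. Governing ODE: m * dv/dt + c * v + m * g = 0.

def residual(v, dv_dt, m=1.0, c=0.5, g=9.81):
    """Implicit form g(t, v, dv/dt) = 0 of the assumed drag model."""
    return m * dv_dt + c * v + m * g

def simulate_velocity(t_end, dt=1e-3, m=1.0, c=0.5, g=9.81):
    """Integrate dv/dt = -g - (c/m) * v from rest using explicit Euler steps."""
    v = 0.0
    for _ in range(int(round(t_end / dt))):
        dv_dt = -g - (c / m) * v
        v += dt * dv_dt
    return v

def exact_velocity(t, m=1.0, c=0.5, g=9.81):
    """Closed-form solution of the same ODE, used to check the integrator."""
    return -(m * g / c) * (1.0 - math.exp(-c * t / m))

if __name__ == "__main__":
    print(simulate_velocity(2.0), exact_velocity(2.0))
```

The discrete integrator only approximates the continuous dynamics, and the gap to the closed-form solution depends on the chosen step size `dt`.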

B. STATE MACHINES BASED CPS MODELING
State machines are used for modeling discrete dynamics [59], [63]- [65]. Discrete dynamics involve a set of discrete sequential steps, e.g., the number of times a ball bounces under gravity. The number is going to be a natural number, which is discrete in nature. A fundamental difference between the continuous and discrete dynamics is that unlike continuous dynamics which is inherently smooth, discrete dynamics is discontinuous. Discrete events are instantaneous in nature. So, the derivative is zero before and after the event but tends to hit infinity at that instant. Discrete dynamics can be modeled using finite state machines or extended state machines depending on the number of states. They can be event-triggered or time-triggered. Mathematically, a finite state machine is a tuple of five variables (States, Inputs, Outputs, Update, InitialState) [9], [65]. State machines can be broadly differentiated into Mealy machines and Moore machines. If the output is produced when a transition happens, it is a Mealy machine [66]. If the output is produced when the machine is in a particular state, it is a Moore machine [67]. They are convertible into each other. A simple hypothetical finite state machine model is depicted in Figure 3(b). State machines are an efficient way to model discrete dynamics. Causality, time invariance, linearity, stability, and memorylessness are some of the properties common to both physics-based equations and state machine based methods [9].
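The distinction between Mealy and Moore machines can be sketched with a small rising-edge detector. The state names, the input sequence, and the helper functions below are hypothetical and serve only to illustrate the (States, Inputs, Outputs, Update, InitialState) view; the Moore variant needs an extra state and reports the edge one step later.

```python
# Mealy machine: the output is produced on the transition (state, input).
def mealy_update(state, u):
    """Return (next_state, output); output is 1 only on a 0 -> 1 edge."""
    if state == "low":
        return ("high", 1) if u == 1 else ("low", 0)
    return ("high", 0) if u == 1 else ("low", 0)

def run_mealy(update, initial_state, inputs):
    state, outputs = initial_state, []
    for u in inputs:
        state, y = update(state, u)
        outputs.append(y)
    return outputs

# Moore machine: the output depends only on the current state.
def moore_next_state(state, u):
    if u == 0:
        return "low"
    return "edge" if state == "low" else "high"

def moore_output(state):
    return 1 if state == "edge" else 0

def run_moore(next_state, output_of, initial_state, inputs):
    state, outputs = initial_state, []
    for u in inputs:
        outputs.append(output_of(state))  # emitted while in the current state
        state = next_state(state, u)
    return outputs

if __name__ == "__main__":
    seq = [0, 1, 1, 0, 1]
    print(run_mealy(mealy_update, "low", seq))
    print(run_moore(moore_next_state, moore_output, "low", seq))
```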

C. RULE AND AGENT BASED CPS MODELING
CPS can also be represented using a set of agents and semantically defined rules. State machines are unlikely to capture the interdependencies among different states. Rule and agent based models promise to capture the intra- and inter-dependencies among the cyber and physical components along with the heterogeneity and complexity of the CPS. Such systems are also referred to as multi-agent systems (MAS) [68]-[70] and agent based models (ABM) [71], [72]. A MAS or ABM describes the system by describing its constituents. As a result, they are quite effective at modeling a whole system when the whole is bigger than the sum of its constituents due to their interactions [71]. A MAS is supported by a set of sensors that sense the environment and create data. The data is acted upon by smart agent(s) using decision-making algorithms (Figure 3(c)). The agents make decisions based on a semantically defined set of rules. As it uses data and semantic rules, a MAS is capable of extracting insightful information from the data. Temporal continuity, social skills, and autonomy are the main attributes of an agent [73], [74]. These intelligent MAS are not completely black box in nature. They incorporate the recognized semantics of the system into the model. For example, the semantics address the difference between the flows in the cyber (data) and physical domains (power in energy systems vs. vehicles crossing a toll booth). A simple example could be a parking lot where the CPS guides oncoming vehicles to an empty parking slot and stops accepting additional vehicles if the lot is full.
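The parking lot example can be sketched as a single rule-based agent. The class name, method names, and the two rules below are hypothetical simplifications for illustration; a real MAS would involve several interacting agents and sensed data rather than direct method calls.

```python
class ParkingLotAgent:
    """Hypothetical rule-based agent for the parking lot example."""

    def __init__(self, n_slots):
        # Sensed state of the lot: one occupancy flag per slot.
        self.occupied = [False] * n_slots

    def on_arrival(self):
        """Rule 1: guide the vehicle to the first free slot.
        Rule 2: reject the vehicle if no slot is free."""
        for slot, taken in enumerate(self.occupied):
            if not taken:
                self.occupied[slot] = True
                return ("guide", slot)
        return ("reject", None)

    def on_departure(self, slot):
        """Mark a slot as free when its vehicle leaves."""
        self.occupied[slot] = False

if __name__ == "__main__":
    agent = ParkingLotAgent(n_slots=2)
    for _ in range(3):
        print(agent.on_arrival())
```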
Physics-based and state machine based models fall short of the adaptability and interoperability [74] that CPS requires; rule and agent based systems do not share this limitation. The former methods are neither smart nor adaptable, whereas rule and agent based systems are both. This is why research on rule and agent based models of CPS has seen multifold growth in recent times. Even so, because rule and agent based models exploit the semantics of the system, complex systems pose a serious challenge to MAS [75], just as they do to physics-based and state machine based models.

D. DATA-DRIVEN CPS MODELING
The data-driven models rely completely on data. They obviate the need for knowledge of the system dynamics. The data-driven models must be able to model the inherent characteristics of the CPS, i.e., its continuous (regression) and discrete (classification) dynamics.
Attempts at data-driven modeling of CPS for capturing the hybrid dynamics have tried to exploit concepts from Bayesian learning [76], algebraic geometry [77], bounded error [78], compressive sensing [79], inference of state transition logics [80], a regression approach underpinned by recursive clustering and linear multi-class classification [81], symbolic regression [82], hybrid stable spline algorithms [83], etc. An excellent review of earlier methods of identifying hybrid systems can be found in [84]. The earlier methods [76]-[78], [85] mostly focused on simple piecewise affine systems, which use linear transition rules [80]. Recent methods try to capture the non-linear dynamics of relatively complex systems [79]-[83]. Recently, DNNs (Figure 3(d)) with non-linear activation functions have received widespread recognition for their ability to learn complex nonlinear functions [86]-[89]. They have also started to find use in modeling CPS [90], [91]. Currently, there is a consensus that only models powered by artificial intelligence can achieve acceptable levels of autonomy by coping with the uncertainties of the real world [92]. The DNNs, which come under the umbrella of artificial intelligence and have taken the canonical form of AI techniques, are best equipped to make the CPS autonomous. Nonetheless, deep learning techniques are plagued by their own limitations, which are briefly discussed in section IV.
The primary reasons for the poor performance of purely ML-based data-driven methods in CPS modeling for state assessment of the CPS are (1) physical systems work in dynamic environments where system performance is changing continuously, (2) the inability of ML methods to generalize beyond their initial set of training data, (3) sparse training data set, (4) ML methods are agnostic to underlying physics, resulting in predictions that are sometimes inconsistent with the laws of physics, and (5) in predictive CPS maintenance applications, the primary focus is on anomaly detection that can lie in the tail end of the data distribution. Hence, the current work focuses on combining conventional physics knowledge with neural networks (NNs) to overcome those limitations.

III. CHALLENGES IN MODELING CYBER-PHYSICAL SYSTEMS
The models of the CPS must model every component of the CPS. In other words, it should constitute the models of physical components and processes, computational algorithms, software, and networks. Interpretability and composability of the physics-based models are the main drivers behind their widespread adoption. Interpretability prevents failures. Composability facilitates the construction of complex CPS. Together they make the physics-based models of CPS ubiquitous and indispensable. But, the model-based systems are plagued by a diverse set of challenges such as solver dependency of the model, Zeno behavior, non-determinacy, heterogeneity of the system, maintaining the consistency among the components of the model, distributed behavior and misconnected model components [93].
Different numerical solvers used in a CPS model may use different numerical precision, leading to different behaviors of the CPS models [60], [93]. As simple components evolve into complex systems, understanding and modeling the interaction among the components becomes even more challenging. Ensuring consistent evolution among all the components is also difficult. In addition to this, the time lag created during computation and communication must also be modeled. The unevenness in measurements of time across the components, loss in communication, network delays, etc. must also be incorporated [93]. Our goal is not to discuss the methods to address all the above-mentioned challenges. Rather, we mainly focus on specific CPS modeling challenges that can be eradicated or minimized by combining model-based and data-driven methods.

A. DISCRETIZATION OF THE CONTINUUM
In cyber-physical systems, cyber components like microprocessors and digital communication networks portray discrete dynamics, whereas the physical elements of the CPS, like electro-mechanical, human, and chemical components, portray continuous behaviors [9]. Hence, modeling a CPS can be split into two parts: continuous behaviors and discrete behaviors. The models must address the semantic dichotomy between discrete and continuous behaviors and their interactions. For modeling any CPS, we must integrate continuous behavior with discrete behavior. The computational models, which are inherently discrete, must model the continuous behavior using some means such as numerical approximation or numerical integration [57], [60], [94]. The time precision of the computational models must allow modeling of the discrete dynamics. Hence, the time precision of the computational models should be finer than that of the discrete system dynamics.
However, the time precision of the computational model is fixed and cannot be refined beyond a certain limit. That poses severe challenges for computational models of the continuous dynamics portrayed by the physical elements when the continuous events become superfine. When the physical events become superfine, the time period of the events diminishes, and infinitely many events happen in a small time interval. For example, consider a ball bouncing freely under gravity. It is subjected to wind drag. As the collisions are not perfectly elastic, some energy is lost during each collision. Gradually, the amplitude (bouncing height) of the ball diminishes. After a certain point in time, the amplitudes become so small that it becomes difficult to differentiate two successive free falls of the ball. The time period of the falls also reduces considerably. Consequently, the frequency of the events (collisions) tends to infinity. Such behavior is called Zeno behavior, where infinitely many events occur in a finite time interval [95]. That leads to huge difficulties in modeling a system. The point after which the events can no longer be modeled by the computational model is called the Zeno point. Although the task is challenging [95], some prior work has attempted to determine the Zeno point of a cyber-physical system [96], but there is a dearth of work on modeling Zeno behavior.
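The accumulation of events can be sketched for an idealized bouncing ball. The sketch below assumes no drag and a constant coefficient of restitution `e` (both simplifications relative to the description above), and `min_interval` is a hypothetical stand-in for the time precision of the computational model. Under these assumptions the impact times form a geometric series that converges to a finite Zeno time, so any fixed time precision resolves only finitely many of the infinitely many collisions.

```python
import math

def zeno_time(h0, e, g=9.81):
    """Closed-form accumulation point of the impact times (assumes 0 < e < 1)."""
    return math.sqrt(2.0 * h0 / g) * (1.0 + e) / (1.0 - e)

def impact_times(h0, e, g=9.81, min_interval=1e-6, max_events=10_000):
    """Event-driven simulation: list the impact times that a model with time
    precision `min_interval` can still distinguish."""
    t = math.sqrt(2.0 * h0 / g)   # time of the first impact
    v = math.sqrt(2.0 * g * h0)   # speed at the first impact
    times = [t]
    while len(times) < max_events:
        v *= e                    # energy lost in the inelastic collision
        flight = 2.0 * v / g      # duration of the next free flight
        if flight < min_interval: # events are now too fine to resolve
            break
        t += flight
        times.append(t)
    return times

if __name__ == "__main__":
    coarse = impact_times(1.0, 0.8, min_interval=1e-3)
    fine = impact_times(1.0, 0.8, min_interval=1e-9)
    print(len(coarse), len(fine), fine[-1], zeno_time(1.0, 0.8))
```

Refining `min_interval` resolves more collisions, but the last resolved impact never reaches the Zeno time.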

B. ENVIRONMENTAL CHAOS AND LIMITS OF DETERMINISM OF MODELS
In addition to the challenge of discretization of the continuum, two more challenges prevent the models of CPS from becoming robust. They are the chaotic behavior of the environment and the limits of determinism of the models [95]. Deterministic systems are predictable. The output can be predicted accurately if the inputs, system parameters, and the initial conditions are known precisely. However, most nonlinear feedback systems encounter uncertainties. Even real-time systems that seem robust and non-chaotic, like the standard scheduling algorithms, can exhibit chaotic behavior in the presence of uncertainty [97]. Chaotic behavior is so fundamental that it weakens determinism. Determinism is an invaluable property, as it facilitates definitive analysis of the CPS and enables complex and reliable designs [95]. However, deterministic models are never complete. For example, a single-threaded imperative program is deterministic if only the output is taken into account. The same program becomes non-deterministic if the time of computation is also taken into account. So, indeterminacy is inherent. Limiting cases will always arise at which the deterministic models fail and are no longer valid. When the limiting cases are unidentifiable, non-deterministic models are the only way out. In such cases, Heisenberg's uncertainty principle seems more relevant [95]. However, the scope of Heisenberg's uncertainty principle is limited to quantum mechanics. It does not extend to Newtonian mechanics. For practical cases, though, even Newtonian mechanics is inadequate. It has to be supplemented by Mertonian laws to make it more practicable.

C. EVOLUTION FROM NEWTONIAN SYSTEMS TO MERTONIAN SYSTEMS
With the advent of data-driven modeling, Merton's systems have proliferated far more than Newton's systems. This is because of the close interaction between actual and artificial life. The CPS has become a cyber-physical social system (CPSS) [98]-[100]. Even though some preliminary low-fidelity MB models exist for the CPS, which can be adapted through modification, adapting them to processes that lack well-understood first principles (e.g., mobility behavior, advertisement click rates, energy consumption in apartments, and systems with human intervention) is infeasible [58]. Modeling such systems has become possible with the support of big data, cloud computing, the internet of things, NNs, reinforcement learning, etc. [101]. This evolution can be broadly restated as follows: traditional systems with small data and big Newtonian laws have been taken over by modern systems with big data but smaller Mertonian laws. The deterministic systems have receded into the background, and the foreground is dominated by statistical systems governed by probabilistic laws. Newton's laws alone will be inadequate for modeling CPSs; we have to introduce Merton's laws for modeling such systems. Merton's systems are governed by Merton's self-fulfilling prophecy, which says that a prediction directly causes itself to become true because of the feedback between belief and action [102], [103]. Such systems are governed by free will. Hence, they cannot be controlled but can be influenced to achieve the desired objective in a probabilistic environment. The uncertainty in the complex spatio-temporal dynamics of the environment creates a chasm between the deterministic model and the actual physical systems. Ordinarily, causality prevails in Newtonian systems. In the case of Merton's systems, causality is a luxury that is almost unachievable with the resources available for handling complexity, diversity, and uncertainty [101].
The gap can be mitigated by integrating the model-based approaches based on Newtonian laws and the data-driven approaches based on Merton's philosophy.

D. ADVENT OF BIG DATA
It is important to note that physics-based models do not actually need data to model a system. Rather, they need a deeper understanding of the causal relationships among the system's parameters and variables [104]. So, they are effective at modeling any system of simple to moderate complexity. More importantly, they are quite effective at modeling systems with a small to medium scale of data. However, recent years have seen a huge surge in the amount of collected data. More data directly translates into better information about the system. For complex systems, where inferring the causal relationship between the input and output variables could be a daunting task, a large amount of data helps. Data harnessing has seen a surge because of pervasive sensing, rapid computational capabilities, and the proliferation of the internet and its users. This led to the rise of big data. Big data needs bigger and more effective processing technologies. The processing technologies started with model-based algorithms but are shifting toward data-driven techniques. For a model-based approach, limiting cases will keep on accumulating as more and more data gets collected. More data will cover the state space better. Hence, it will produce accurate and robust models, depending upon the effectiveness of the models for processing the data.
CPS is going to be ubiquitous. It is going to touch and alter every part of life and the ecosystem we are living in. They will be pervasive. Data will be created, processed, and acted upon continuously in real-time. Continuous creation of data will shove the CPS into the scope of big data. Big data is characterized by its characteristic V's [105]- [110]. Although significant variation can be observed in the number of V's available in the literature, all of them have the first three V's: volume, variety and velocity. The volume refers to the huge amount of data that needs to be collected, processed, and stored. Variety is the complexity of the data. It can be structured, unstructured, or a mix of both. Typically, structured data can be organized into a tabular form, whereas unstructured data cannot. Velocity refers to the speed at which data is being generated, which in turn needs adequate processing capabilities. Although MB can be effective at handling the velocity, it will be inefficient at handling the volume and variety of big data. The variety increases the dimension of data. Every additional dimension can be ascribed to a new system parameter. That parameter may have an unanticipated influence on the model. That will contribute to increasing the complexity of the MB. Additionally, with a humongous volume of data, the state space will be covered almost completely. Hence, almost all the exceptional cases will be captured by the data. Every exceptional case is potent to uncover a new phenomenon. And, every phenomenon may require a new model altogether. So, a model encompassing all the exceptional cases will become intractable. The solution is data-driven methods. Although they are not likely to be explainable, they will definitely model the state space better.

IV. SYNERGY BETWEEN MACHINE LEARNING AND PHYSICS-BASED MODELS
Both the ML models and physics-based models can be used for system modeling purposes. As stated previously, MBs do not need data to model the system. They need an understanding of the system dynamics [57]. MBs need data only for calibrating their effectiveness. On the other hand, ML models need data for modeling the system. Because they rely on data without any consideration of the dynamics of the system [58], ML methods are applicable to different types of systems. However, data can never be complete because of the discrete nature of the input parameters and the output variables of the system [111]. Irrespective of whether data acquisition is expensive or not, data can never fill the state space. So, the chasm between the system and the data generated by it is ingrained in ML models. They tend to fail beyond the regime of the training data [58] and thus lack extrapolability. This is evident from the success of convolutional neural networks (CNNs) with the development of the ImageNet database [112], [113], which were prone to failure with older databases [114], [115]. So, it is wise to conclude that ML models, although accurate, can never be complete. The DNNs operate in a monolithic manner, hence inhibiting modularity, hierarchy, and composability. The whole network learns simultaneously. They only learn a mapping from the input to the output variables without any regard for the intermediate variables. Additionally, as ML models do not pay any heed to the causal relationships among the system variables, they lack interpretability. As the ML models learn, they are susceptible to learning spurious but accurate mappings from the input data to the output data. That makes ML a double-edged sword.
In contrast, MBs have the potential to be complete if all the system dynamics, from intrinsic to extrinsic and from micro to mega-scale, are taken into account. However, such dynamics are extremely difficult to capture even with delicate simulations and meticulous observations. They can be expensive in terms of cost, computation, and time. So, some dynamics are often ignored [57], and a few assumptions are made. That simplifies the model of the system by disregarding the impact of those dynamics, which makes the MB incomplete and inaccurate [111]. For example, the impact of wind drag, the energy lost due to the non-elastic nature of collisions, the elasticity of the material of the ball, and the buoyancy of the medium are often ignored in the case of the bouncing ball. Although inaccurate, MB models tend to extrapolate the system behavior [58]. If simulations are carried out delicately and observed meticulously to capture the dynamics accurately, then the MB can become complex [57]. Complexity compromises tractability. Tractability is of paramount importance in control applications. Sometimes the impact of some dynamics is even consciously ignored to make the model amenable. Even then, MB models extrapolate far beyond the operating conditions utilized for model validation, albeit with reduced accuracy [55].
Often, it is easier to acquire data from a system using simulations than understanding the system dynamics. It is quite easy to fit an ML model to the data originating from the system. Data is sparse. But, ML models discover a continuous function to fit the same sparse data. Hence, they fill the state space. So, although data can never fill the state space, the model can. With the same data, it is easier to validate an MB, provided it is available. But if the data does not include any anomaly, neither the ML can model it, nor the MB can be validated against it. E.g., consider the magnitude of acceleration of the ball bouncing under gravity. The magnitude of the acceleration is equal to the acceleration due to gravity when the ball is in air. The ball is in the air during most of its flight. But, the magnitude is different from the acceleration due to gravity at the moment when the ball hits the floor. It appears as an anomaly in the plot. If the data does not capture the moment, neither the ML nor the MB can guarantee consistent behavior. So, it is essential for the data to capture all the events. The same goes for the bouncing ball when it exhibits Zeno behavior. As the data can't be recorded after the Zeno point due to the limitation of the simulator, ML models cannot model Zeno behavior.
However, the causative agents behind the Zeno behavior can be identified, and MB can be used to model them. The causative agents for the Zeno behavior of the bouncing ball are the inhibiting forces due to the environment. Air drag, the viscosity of the medium, non-elastic collisions, and the deformable nature of the material of the ball contribute to the inhibiting forces. The initial height of the ball and the buoyancy of the medium are the driving forces. These aspects of the system can only be brought into the model using MB. Hence, it is sometimes prudent to complement the ML with MB [58], [111], [116]. By combining them, one can recover a complete model of the CPS. We feed the knowledge we have into the MB. The ML model discovers the knowledge that we do not have [117]. Nonetheless, all models are wrong, but some are useful [118], [119]. Hybrid models are comparatively more useful and more efficient as they harness the best of both worlds. The efficacy of hybrid models over pure MB and ML models is shown pictorially in Figure 4 for predicting the path of an oncoming quadrotor from a hexarotor in an autonomously operating swarm of unmanned aerial vehicles (UAVs) used in CPS applications. Additionally, hybrid models can be re-calibrated conveniently. As new data becomes available, the re-calibration of an MB can face the difficulties associated with solving the underlying highly complex optimization problems [120], whereas the ML gets calibrated in an incremental manner. That eases the re-calibration of a hybrid model.

V. HYBRID MODELING: INFUSING PHYSICS INTO DATA-DRIVEN MODELS
Consider a CPS predictive modeling task such as the path prediction of an oncoming quadrotor from a hexarotor in an autonomously navigating swarm of UAVs (Figure 4) for CPS applications. Given input state variables $X$ (Figure 4(d)), model parameters $P_{MB}$, and target variable $Y$ (state variables of the quadrotor at a future instant of time), the physics-based model $f_{MB} : [X, P_{MB}] \rightarrow Y$ maps input state variables and parameter values to the target variable $Y$. Predictive modeling in a purely model-based approach is primarily focused on ''calibrating'' the model parameters $P_{MB}$ using observational data. Typically, in model-based approaches, we make simplifying assumptions and build simple physics-based models of key system components. For example, we model the response of a mechanical CPS as the response of a linear mass-spring-damper component, although the behavior is typically nonlinear. Therefore, $Y_{Phy}$ (the output of the calibrated model) is typically an incomplete representation of the target variable due to abstraction, simplification, and missing physics in $f_{MB}$, and this causes the model output $Y_{Phy}$ to differ from the observations $Y$, as shown in Figure 4(a). The data-driven machine learning approach, however, is focused on training a model $f_{ML} : X \rightarrow Y$ over a set of training data to produce estimates $\hat{Y}$ of the outputs given inputs $X$. The ML model also cannot predict the path exactly (Figure 4(c)), as it has limited capability to capture the system dynamics. Both these approaches have their limitations.
In a hybrid modeling approach, one creates a hybrid combination of MB and ML approaches that combines the power of both (Figure 4(b) and 4(f)). One way to combine them is to use the output of $f_{MB}$ as an input to $f_{ML}$, giving $f_{Hybrid} : [X, Y_{Phy}] \rightarrow Y$. The machine learning component will thus encapsulate the remaining unmodeled complexity of the system in a lumped form. In this approach, the ML model acts in a complementary fashion to enable accurate model prediction. If the physics-based models are accurate, then the hybrid model will ensure $Y_{Phy} = \hat{Y}$. However, if the ML model offers more accurate predictions but lacks the mechanistic coherency of the underlying system (for example, it violates the physics of the overall system), it cannot be used in the CPS modeling domain. Therefore, we constrain the ML model to learn in a fashion that respects the physics of the underlying process. Additionally, one can use synthetic labeled data from physics-based models to complement the experimental training data and train the ML models to address the data sparsity issue. Furthermore, the output $\hat{Y}$ of a tuned $f_{Hybrid}$ can be used to generate additional data to fine-tune the parameters $P_{MB}$ (coarsely calibrated initially using sparse observations) of the MB physics equation in an intelligent fashion.
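The input-augmentation scheme described above can be sketched on a toy one-dimensional system. Everything in this example is an illustrative assumption rather than a model from the cited works: the function names (`f_mb`, `true_system`, `fit_hybrid`), the cubic term that the physics model ignores, and the use of linear least squares as a stand-in for a neural network.

```python
import numpy as np

def f_mb(x, p_mb=1.0):
    """Hypothetical simplified physics model: linear response only."""
    return p_mb * x

def true_system(x):
    """Hypothetical ground truth with a nonlinear term the MB ignores."""
    return 1.0 * x + 0.3 * x**3

def features(x, y_phy):
    """Stack a bias, the raw input, the MB output, and simple nonlinear terms."""
    return np.column_stack([np.ones_like(x), x, y_phy, x**2, x**3])

def fit_hybrid(x, y):
    """Fit f_Hybrid: [X, Y_Phy] -> Y by least squares (stand-in for a NN)."""
    w, *_ = np.linalg.lstsq(features(x, f_mb(x)), y, rcond=None)
    return w

def predict_hybrid(w, x):
    """Evaluate the fitted hybrid model on new inputs."""
    return features(x, f_mb(x)) @ w

rng = np.random.default_rng(0)
x_train = rng.uniform(-2.0, 2.0, 200)
y_train = true_system(x_train)

w = fit_hybrid(x_train, y_train)
x_test = np.linspace(-2.0, 2.0, 50)
mse_mb = np.mean((true_system(x_test) - f_mb(x_test))**2)
mse_hybrid = np.mean((true_system(x_test) - predict_hybrid(w, x_test))**2)
print(mse_mb, mse_hybrid)
```

In this toy setting the learned component only has to absorb the cubic term that the physics model leaves out, which is the "lumped" unmodeled complexity referred to in the text.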
There are two main novel aspects of the hybrid modeling approach: (i) the MB and ML components complement each other to create more precise predictive models that respect the physics of the underlying system, and (ii) both the physics-based models and sensor-acquired data are used to train the ML model and estimate the parameters of the MB model in a symbiotic fashion. There have been several works across a wide variety of applications that attempted to combine ML and MB models. The terminology for referring to such architectures is not consistent. The keywords used to refer to such architectures include 'using prior information' [121], [122], 'physics guided' [123]-[127], 'residual physics' [128]-[131], 'physics based' [132], 'physics fusion' [128], 'physics infused' [133], 'theory guided' [14], [134], 'hybrid' [135]-[137], 'physics informed' [138]-[146], 'physics constrained' [123], [147], [148], etc. However, irrespective of their names, the hybrid models share many commonalities. Depending on the nature of the interaction between $P_{MB}$ and $P_{ML}$ in a hybrid model, the existing work in the hybrid modeling domain can be broadly grouped into three categories: (1) approaches that use $P_{MB}$ as preprocessing or inputs to the machine learning model (such as deep learning) (Figure 5, label 1), (2) innovative network architectures that imbibe the physics of the problem at hand in $P_{ML}$ (Figure 5, label 2), and (3) loss functions that enforce physics infusion in the learned DL model (Figure 5, label 3). The three major categories are motivated by the control levers of a DNN for learning better functions to obtain accurate mappings. The three levers are better data, which can be acquired by better preprocessing techniques; infusing good features into the network; and better loss functions. A single DNN can incorporate all the control levers in an integrated fashion, as shown in Figure 5.
We lump other existing work in the hybrid modeling domain that does not fall exactly into one of three categories defined above into a fourth category termed ''Miscellaneous''. Additionally, we primarily focus on the work done in hybrid modeling, where the used ML approach is Deep Learning (NNs). However, similar ideas could also be used in conjunction with other ML approaches such as Support Vector Machines (SVM) [149]- [151] and Random Forest (RF) [152], [153] etc. Additionally, as the prior research work on hybrid modeling in the cyber-physical space is quite limited, we open up our scope of the hybrid models to all existing applications to make it comprehensive and complete.

A. PHYSICS BASED PREPROCESSING (PBP)
Almost all the methods for processing data follow three steps: data preprocessing, feature extraction, and feature processing. The sole purpose of preprocessing is to ease the use of data in the succeeding steps. Noise removal, data transformation, data reduction, instance selection, dimension reduction, etc. constitute the preprocessing techniques [154]. Unlike the feature extraction techniques, which could be very specific, the preprocessing techniques can be used in a wide variety of applications [155]. Preprocessing is essential for ML models. Normalization is the simplest form of preprocessing. Bandpass filtering, downsampling [156], channel selection, time window setting [157], data cleaning, and encoding [158] are a few more examples of simple preprocessing techniques. In the first class of hybrid architectures, the MB is used for preprocessing the data before feeding them into the ML model. The MB preprocesses the input data $x$ and converts it into $\tilde{x}$. The ML learns a mapping between the preprocessed data $\tilde{x}$ and the actual outputs $y$ and predicts the actual output as $\hat{y}$ with an error $\epsilon_i$ given by $\epsilon_i = y - \hat{y}$. Such physics-based preprocessing hybrid models are illustrated in Figure 6(a).
As the DNNs are capable of learning from pristine data, they have almost eliminated the need for data preprocessing. However, the elimination of data-preprocessing comes at the cost of deeper and complex networks. Again, insufficient and noisy data adds to the pain of the complexity of the ML models. Hence, preprocessing can help to alleviate the pain. Raw data is usually a time-series signal. Techniques like fast Fourier transform (FFT) can be applied to convert it into the frequency domain. Consequently, the features also get transformed. Ordinarily, frequency domain representations are used for stationary signals. Due to their high sensitivity, they are widely used for anomaly detection tasks [159]. For nonstationary signals, time-frequency representations are used.
The time domain and frequency domain representations are one-dimensional representations. Wavelet transform, empirical mode decomposition, and short-time Fourier transform can be used to convert the 1D representations into 2D time-frequency domain representations [160].
The preprocessing techniques for data analysis can be broadly split into time-domain analysis, frequency domain analysis, and time-frequency analysis. Mean-variance, empirical mode decomposition [161], and kurtosis estimation fall under time-domain analysis. Fast Fourier transform and bispectrum analysis are examples of frequency domain analysis. Short-time Fourier transform (STFT), wavelet transform [162], [163], Hilbert-Huang transform [163], sparse decomposition [164], and wavelet packet transform come under the umbrella of time-frequency based preprocessing techniques [165].
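As a small illustration of the time-domain and frequency-domain preprocessing named above, the sketch below computes the mean, variance, and kurtosis of a signal and locates its dominant spectral peak with an FFT. The test signal (a 50 Hz sine with additive noise), the sampling rate, and the helper names are assumptions made for this example only.

```python
import numpy as np

def time_domain_features(signal):
    """Mean, variance, and kurtosis of a 1D signal."""
    mean = np.mean(signal)
    var = np.var(signal)
    # Kurtosis as the normalized fourth central moment (3.0 for a Gaussian).
    kurt = np.mean((signal - mean)**4) / var**2
    return mean, var, kurt

def dominant_frequency(signal, fs):
    """Frequency (Hz) of the largest non-DC peak in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    peak = np.argmax(spectrum[1:]) + 1  # skip the DC bin
    return freqs[peak]

fs = 1000.0                 # assumed sampling rate in Hz
t = np.arange(1000) / fs    # one second of samples
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 50.0 * t) + 0.1 * rng.standard_normal(t.size)

mean, var, kurt = time_domain_features(signal)
f_peak = dominant_frequency(signal, fs)
print(mean, var, kurt, f_peak)
```

Features of this kind would form the preprocessed input that is passed on to the ML model in place of the raw time series.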
Hybrid architectures which use physics-based preprocessing have found applications in manufacturing industry [166]- [169], healthcare industry [170]- [172], energy industry [173], and for processing hydrological data [174]- [176], meteorological data [177] among several others. Compressed sensing was used for extracting low dimensional features from raw temporal data before passing it into a stacked denoising autoencoder DNN [178]. Other preprocessing techniques like spectral kurtosis [179] and envelope analysis [180] were used for extracting the frequency band with the most useful information. The output was further processed by using fast Fourier transform before feeding it to a CNN [166], [167]. Ensemble empirical mode decomposition (EEMD) [181] was used for decomposing the input data into a set of intrinsic mode functions (IMF) and a residual series [174]. Each IMF was then supplied to an ANN for further processing. Data preprocessed with empirical mode decomposition (EMD) were supplied to a radial basis function network [175] and Elman NN [176], to an autoencoder [170], into a long short-term memory (LSTM) network [173] and a deep belief network (DBN) [182]. EMD was also used for extracting low-level features before proceeding to a DNN for extracting high-level features [183]. The frequency spectrum of raw data was fed to a DNN based stacked auto-encoder [184]. Time-frequency domain data and energy spectrum features of the data were used for preprocessing before feeding the data into deep Boltzmann machines, stacked auto-encoders, and deep belief networks [185]. Nonlinear soft threshold and digital wavelet frames were used to denoise the data before feeding it into a stacked autoencoder [186]. Wavelet analysis was used for converting raw data into 2D time-frequency images, which were subsequently fed into a deep CNN [187]. Data transformed as wavelet packet energy image was supplied to a CNN [188]. 
The signal was decomposed using a Morlet wavelet to get a multiscale spectrogram image [189], which was passed into a CNN for further processing. Deep Boltzmann machines usually accept data preprocessed by wavelet transform or the Teager-Kaiser energy operator [190], [191]. A computationally inexpensive delayed error normalized least mean square (DENLMS) filter was used for preprocessing the data [171]. Additionally, discrete wavelet transform was used for extracting higher-level features like heart rate variability (HRV) before passing the data to the ML model [171]. The short-time Fourier transform is computationally less expensive than wavelet transform and Hilbert-Huang transform. Discrete wavelet transform was used for removing noise before feeding the data to a recurrent neural network (RNN) [169]. Figure 7 summarizes this class of hybrid models. The impact of a few preprocessing techniques like PCA, scaling, and normalization on the performance of autoencoders was studied in [192]. Several preprocessing techniques like envelope metrics, statistical metrics, frequency, and discrete domain metrics were reviewed in detail by [193] for feature extraction. For a comprehensive review of preprocessing techniques, the reader is referred to [194].
Although the preprocessing techniques encapsulate a wide variety of goals like noise removal, data transformation, feature extraction, most of them are shallow features. They do not add any system-specific information to the ML model.

B. PHYSICS BASED NETWORK ARCHITECTURES (PBNA)
The PBNA tries to explore the best combination/fusion of NN architecture and data emanating from the MB component. A plethora of such architectures have surfaced recently. The absence of answers to the questions 'how does the NN actually learn the features of a system?' and 'what is the best architecture for introducing MB into an ML model?' is providing the thrust for such architectures. The quest has germinated into a new research domain called neural architecture search [195]-[198]. Broadly, the architectures can be one or a combination of DNNs, CNNs, RNNs, and other NNs with physics. The innovative architectures can ingest the output of the MB into a particular node of a specific layer of the ML (Figure 6(b)), or the output of the MB can be embedded into the NN (Figure 6(c)). In addition to the output of the MB, they can also take raw input data as another input (Figure 6(b)).
Such physics-based architectures have found use in environmental data analysis [123], dynamical systems [133], biological systems [133], and image processing [132]. The output of the physics model was used as an additional input feature for a multilayer perceptron for modeling the temperature of a lake [123]. The output of the MB was fed into the last layer of an LSTM network along with the unprocessed inputs for modeling an inverted pendulum and the growth of cancer cells [133]. The physics-based motion model of aerial vehicles was sandwiched between two RNN modules for multistep prediction of the non-linear dynamics of the aerial vehicles [137], which enhanced the training and convergence speeds in addition to the accuracy of predictions. A data augmented residual model and a recurrent data augmented residual model [131] were formulated for creating a physical simulator offering universal uncertainty estimates. In TossingBot [129], the output of the physics-based controller was fed to FCN ResNet-7 [199] for robotic applications. A Deep Lagrangian Network (DeLaN) [200] was developed for learning the motion of mechanical systems in tasks like robot tracking control; it incorporated Lagrangian mechanics into the hybrid architecture. The fusion methods were categorized into series and parallel models depending on the way the MB and ML are connected [120], [201], [202]. A combination of series and parallel connections was used for predicting the failure of a turbine blade [201]. The MB was fused with an ANN and SVR [203] for modeling rainfall-runoff. The work in [132] attempted to recover the shape of an object by letting a CNN-based autoencoder figure out a way of combining the physical solution and polarized images of the object. A tensor basis neural network [204] was proposed where rotational invariance was fed to the last hidden layer of an MLP in the form of a tensor for modeling Reynolds stress anisotropy.
An adaptive neuro-fuzzy inference system [205] was infused with the predicted tool wear for prediction of remaining useful life [136].
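The idea of injecting the MB output into a late layer of the network, alongside features computed from the raw inputs, can be illustrated with a deliberately small sketch. It is not the architecture of any cited work: the hidden layer is random and fixed, only the output layer is trained (by least squares rather than backpropagation), and `f_mb` is a hypothetical small-angle physics model.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_mb(x):
    """Hypothetical physics model: small-angle (linear) approximation."""
    return x

def hidden_layer(x, w1, b1):
    """Fixed random hidden layer acting on the raw input."""
    return np.tanh(np.outer(x, w1) + b1)

def last_layer_inputs(x, w1, b1, use_physics):
    """Concatenate hidden activations with the MB output at the last layer."""
    h = hidden_layer(x, w1, b1)
    if use_physics:
        h = np.column_stack([h, f_mb(x)])
    return np.column_stack([h, np.ones_like(x)])  # bias column

n_hidden = 5
w1 = rng.standard_normal(n_hidden)
b1 = rng.standard_normal(n_hidden)

x = np.linspace(-3.0, 3.0, 200)
y = np.sin(x)  # assumed "true" response

results = {}
for use_physics in (False, True):
    a = last_layer_inputs(x, w1, b1, use_physics)
    w_out, *_ = np.linalg.lstsq(a, y, rcond=None)  # train output layer only
    results[use_physics] = np.mean((a @ w_out - y)**2)

print(results)
```

Because the physics output enters as one extra column at the last layer, the training error with it can never be worse than without it in this least-squares setting; whether it helps on unseen data depends on how good the physics model is.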
The architectures discussed so far directly fuse the output of the MB into the ML model. However, such direct fusion cannot be done in the case of spatio-temporal data and the accompanying spatio-temporal physics models. Such data are encountered in video, fluid, and thermal applications. In such cases, the output of the MB is deeply integrated into the ML model by embedding the output of the MB model at all steps of time and space (Figure 6(c)). Such deeply embedded hybrid models have been used in fluid flows [139], [140]. The outputs of the MB were fed into a NN as a set of collocation points. The number of collocation points can be increased to increase the influence of physics in the hybrid architecture. The architecture was developed for both continuous and discrete-time models and was used for both solving [139] and discovering [140] nonlinear PDEs. Invariance or symmetry attributes were embedded into a NN [206] by determining the invariant basis and training the NN on it; this was used for turbulence modeling and solid mechanics.
In addition to these, various works tried to encode the MB as a NN and fuse it with an ML model. An MB transformed into a NN coalesces better with a NN. A cellular NN [207] was used to solve the PDE in Hybrid-Net [135] before attaching it to a ConvLSTM network for fluid and thermal applications. A symbolic multilayered NN inspired by the equation learner [208], [209] was used for blending prior knowledge in [210] for learning accurate PDEs from data. In [211], the LEarned Alternating Direction Method of Multipliers (Le-ADMM) [212] method is unrolled into a NN for solving inverse problems. It was appended to another pure NN or U-Net [213] for reconstructing mask-based lensless images into traditional images from lensed cameras. Such a physics-based unrolled network has also been used for image reconstruction [214] and for optimizing coded illumination [215] in the case of an LED array microscope.
As figuring out the best architecture for a particular problem is still an unsolved problem, a physics-based neural architecture search [128] model has been developed for predicting the trajectory of a projectile and for estimating the velocities of rigid objects after a collision.

C. PHYSICS BASED REGULARIZATION
In this class of hybrid models, the physics-based equations are used as an additional regularization term in the loss function of the NN. The NN is constrained to acquiesce to the laws of physics. The principle behind this class of hybrid architectures is penalizing/minimizing the violation of physics-based constraints or governing equations. The physics equations $y = f(x)$ are restated as $y - f(x) = 0$ for using them as a constrained loss. Minimizing the loss $y - f(x)$ becomes an additional goal in the loss function (Figure 6(d)). The physics-based pre-processing and fusion architectures need the MB to be evaluated or require the PDEs to be solved. But, regularization based approaches directly engulf the equation as a constraint, thereby altering the computational cost.
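A minimal sketch of such a regularized loss is given here, assuming a free-fall relation as the governing equation. The function names, the weight `lam`, the initial height, and the choice of a squared residual as the penalty are illustrative assumptions; the cited works use their own constraints and weighting schemes.

```python
import numpy as np

G = 9.8  # assumed gravitational acceleration in m/s^2

def f_phys(t, h0=10.0):
    """Hypothetical governing relation: height of a body in free fall."""
    return h0 - 0.5 * G * t**2

def hybrid_loss(y_pred, y_obs, t, lam=1.0):
    """Data MSE plus a weighted penalty on the physics residual y - f(t)."""
    data_term = np.mean((y_pred - y_obs)**2)
    residual = y_pred - f_phys(t)  # zero when the physics is respected
    physics_term = np.mean(residual**2)
    return data_term + lam * physics_term

t = np.linspace(0.0, 1.0, 20)
rng = np.random.default_rng(0)
y_obs = f_phys(t) + 0.05 * rng.standard_normal(t.size)  # noisy observations

y_consistent = f_phys(t)      # prediction that obeys the physics
y_violating = 10.0 - 2.0 * t  # prediction that ignores the physics

loss_consistent = hybrid_loss(y_consistent, y_obs, t)
loss_violating = hybrid_loss(y_violating, y_obs, t)
print(loss_consistent, loss_violating)
```

In a real training loop, `y_pred` would be the network output and the combined loss would be minimized by gradient descent; the penalty term can also be evaluated on unlabeled inputs, since it does not need `y_obs`.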
Such hybrid models have been used for the analysis of environmental data [123], machine vision applications [148], and fluid dynamics [216]. Temperature-density and density-depth relations were used as constraints for modeling the temperature of a lake at different depths [123]. A weighted constraint function was used to penalize data that are inconsistent with prior knowledge in the case of the video of a projectile [148]. The loss function of the pix2pix architecture in a conditional GAN [217], [218] was penalized by using a constraint enforcing module for spatial sensitivity analysis for predicting urban land use [219]. The loss of a fully connected NN was made to include boundary conditions and residuals of the governing equations for surrogate modeling in hemodynamics [220]. Reinforcement learning was also used for learning non-parametric models under constrained state spaces in continuous environments [221]. It was used for dynamic applications like quadcopter navigation and the cart-pole problem. Physics has also been used as a prior in reinforcement learning [222]. As low fidelity MBs capture the general trend and high fidelity MBs additionally capture intricate local details, a multi-fidelity physics constrained NN was proposed for incorporating linear and nonlinear PDEs [223].
Approximation and monotonicity constraints were also incorporated into the loss function of a NN [224] for predicting oxygen solubility in water.
Various regularization-based hybrid approaches were used when little or no labeled data was available. A constrained convolutional autoencoder was used for surrogate modeling and uncertainty quantification in steady-state flow applications [147]. An architecture with both fusion and constrained regularization was used in [225] for multiphase flows. It also included an additional loss term for the difference between the output of the MB and the actual output. It performed well under data scarcity. A NN was trained on a video dataset without any labels, with the supervision provided purely by a physics-based constraint [148].

D. MISCELLANEOUS ARCHITECTURES
Accurate inference of the physical dynamics of a system by using conventional principles may need enormous effort even with plenty of data. The ML models only learn a mapping between the values of the input and output variables. So, neither the NN based ML models nor the hybrid architectures discussed so far can learn new interpretable physics. This is because of two factors. First, the data by itself is meaningless if not underpinned by physical dynamics. For example, compare the ball bouncing on earth ($g_{earth} = 9.8\ \mathrm{m\,s^{-2}}$) with the same ball bouncing in a viscous medium on the moon ($g_{moon} = 1.625\ \mathrm{m\,s^{-2}}$). The energy lost while bouncing on earth is purely due to the gravity of earth. On the moon, it will split into the energy lost due to the moon's gravity and the viscosity of the medium. Consider that the viscosity of the medium is controlled on the moon to imitate the ball's bouncing behavior on earth. Now, if the data is fed to ML models to model the dynamics, the ML models can learn the same mapping in both cases. This is because the ML models disregard the underlying physics. In the moon's case, the ML model can ignore the effect of viscosity, which accounted for the difference between the accelerations due to the gravity of the moon and the earth. So, the data doesn't know what it represents. Secondly, the activation functions used by NNs have little physical significance. Hence, the unknown physics that they are capable of discovering is viewed with considerable skepticism. So, several data-driven, physically meaningful ML methods were developed for discovering physics from data. They use physically significant mathematical operations or interpretable activation functions, or use classical NNs to unravel the physics-based equations underlying the system dynamics. In such cases, data can be used to extract physics, and the same physics can be used for complementing the ML models.
Sparse Identification of Nonlinear Dynamics (SINDy) [235] used a thresholded least-squares-based approach to establish a relationship between the target function and a library of candidate functions consisting of algebraic and transcendental functions. Dynamic Mode Decomposition (DMD) [236], [237], built on proper orthogonal decomposition (POD), can be used to extract spatio-temporal correlations. The DMD tries to infer a linear operator, at least for small time periods, that approximates the spatial data, which can have millions of degrees of freedom. The Koopman operator [238], an infinite-dimensional linear operator, was used for extending DMD to nonlinear dynamics [237], [239]. The inefficiency of linear operators for modeling non-linear dynamics led to extended dynamic mode decomposition (eDMD) [240], [241]. SINDy, DMD, and Koopman theory have also been extended for control applications in [242]-[244]. The conservation laws of a dynamic system were discovered by symbolic regression [245].
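The thresholded least-squares idea behind SINDy can be sketched in a few lines. This is a simplified illustration, not the reference implementation: the candidate library (constant, $x$, $x^2$), the threshold, the iteration count, and the test system $\dot{x} = -2x$ are all assumptions chosen for the example.

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares on a candidate library theta."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0            # prune weak candidate terms
        big = ~small
        if not np.any(big):
            break
        # Refit using only the surviving candidate functions.
        xi[big] = np.linalg.lstsq(theta[:, big], dxdt, rcond=None)[0]
    return xi

# Assumed test system: dx/dt = -2 x, sampled along x(t) = exp(-2 t).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0, 200)
x = np.exp(-2.0 * t)
dxdt = -2.0 * x + 0.01 * rng.standard_normal(t.size)  # noisy derivative

# Candidate library: constant, x, and x^2.
theta = np.column_stack([np.ones_like(x), x, x**2])
xi = stlsq(theta, dxdt)
print(xi)
```

The surviving nonzero coefficient identifies the term of the library that drives the dynamics, which is what makes the recovered model interpretable.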
In addition to the above, the equation learner [208], [209] used interpretable activation functions (sine, cosine, identity) in a DNN for learning the governing equations of the system dynamics. PDE-Nets [210], [246] exploited the correlation between convolutional filters and differential operators for learning the PDEs governing thermal dynamics. An n-gram model and a recursive neural network were used in an attribute grammar framework for discovering efficient mathematical identities [247]. Two DNNs were used for discovering non-linear dynamics in terms of non-linear PDEs for spatio-temporal data [248]. These techniques have been used to model dynamic and fluid systems. Variational Autoencoders are known to encode features like physical properties. The discriminator of a GAN can learn a feature level similarity metric. So, in 3D-PhysNet [249], a deep variational encoder with a discriminator was trained using the data from a finite element simulator for discovering the behavior of deformable objects under external forces.
In other systems where the dynamics of some components are unknown, a NN can be used to learn the dynamics of those components. The NN can be used as a module along with the modules corresponding to other components. In Neural Lander [250], the dynamics of a UAV were learned using a NN, and the MB was used for its control. A physics-infused DNN was used for learning each feature, and the learned features were then concatenated for predicting the properties of organic compounds [251].

VI. METRICS FOR HYBRID MODELS
Error metrics are essential components for evaluating the efficacy of any numerical modeling scheme. The intention of this section is to provide a summary of a variety of performance metrics that are suitable for the CPS hybrid modeling domain. The outlined summary will help improve general understanding of metrics for hybrid modeling and facilitate their selection in different applications, including the CPS domain.

A. UNIVERSAL ERROR MEASURES
The current section discusses the errors that are fundamental to the holistic evaluation of a model. Consider a collection of data $S = (X, Y)$ where each sample is represented by $(x_i, y_i)$, where $x_i$ represents the input variables and $y_i$ represents the output variables. $X$ and $Y$ represent the set of input and output variables respectively. Both $x_i$ and $y_i$ can be multivariate. The model learns a mapping $f_{Hybrid} : x_i \rightarrow y_i$ defined as $f_{Hybrid}(x_i) = \hat{y}_i$ such that $\hat{Y} \approx Y$ with minimum total error $E$. The total error $E = \sum_{i=1}^{n} \epsilon_i$ is the sum of the individual errors $\epsilon_i$, given by $\epsilon_i = y_i - f_{Hybrid}(x_i)$. There are several ways to define the total error. A detailed list summarizing several measures of error is outlined in Table 1. The mean square error (MSE) and the mean absolute error (MAE) are the most prominent and commonly used measures of error among all of them. The MSE and MAE are defined below:

1) MEAN SQUARE ERROR
The mean square error has taken the canonical form among all the loss functions. Mean square error is the mean of the sum of squares of the difference between the actual and the predicted values. It is defined as
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
It is closely related to the quadratic loss function. It is an even, continuous, and differentiable function. The MSE differentiates the error due to bias and variance in a smooth manner. It is the sum of the variance and the square of the error due to bias. Addressing the error due to bias and variance has been a prime concern in statistical data analysis.
The cross-validation technique widely used in machine learning is based on minimizing the overall error by reducing individual errors due to bias and variance. This is why the MSE has taken the canonical form.
It heavily penalizes the outliers. That is good in the case where failure can be fatal or expensive, e.g., manufacturing, autonomous driving, security, etc. However, if only MSE is used as a metric to accept or reject a dataset, a single distant outlier can outright reject an almost perfect dataset.
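The bias-variance split of the MSE mentioned above can be written out explicitly. The following is the standard decomposition for a prediction $\hat{y}$ of a fixed target $y$, with the expectation taken over training sets; the symbols $\mathbb{E}$, $\mathrm{Var}$, and $\mathrm{Bias}$ are introduced here for illustration and are not used elsewhere in the text.

```latex
\begin{align}
\mathbb{E}\big[(\hat{y} - y)^2\big]
  &= \mathbb{E}\big[(\hat{y} - \mathbb{E}[\hat{y}] + \mathbb{E}[\hat{y}] - y)^2\big] \\
  &= \mathbb{E}\big[(\hat{y} - \mathbb{E}[\hat{y}])^2\big] + (\mathbb{E}[\hat{y}] - y)^2 \\
  &= \mathrm{Var}(\hat{y}) + \mathrm{Bias}(\hat{y})^2
\end{align}
```

The cross term vanishes in the second step because $\mathbb{E}\big[\hat{y} - \mathbb{E}[\hat{y}]\big] = 0$.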

2) MEAN ABSOLUTE ERROR
The mean absolute error is the main contender against the mean square error for taking the canonical form. It is the mean of the sum of the absolute values of the difference between the actual and predicted values. It is defined as
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
It penalizes all the errors evenly. The MAE doesn't split into errors due to bias and variance directly. It is not differentiable everywhere in its domain. That limits its applicability.
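The different treatment of outliers by the two measures can be seen on a small hand-made example. The data values here are arbitrary and chosen only for illustration.

```python
import numpy as np

def mse(y, y_hat):
    """Mean of the squared differences between actual and predicted values."""
    return np.mean((y - y_hat)**2)

def mae(y, y_hat):
    """Mean of the absolute differences between actual and predicted values."""
    return np.mean(np.abs(y - y_hat))

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat_clean = y + 0.1              # uniform small error on every sample
y_hat_outlier = y_hat_clean.copy()
y_hat_outlier[-1] = y[-1] + 10.0   # one distant outlier

mse_ratio = mse(y, y_hat_outlier) / mse(y, y_hat_clean)
mae_ratio = mae(y, y_hat_outlier) / mae(y, y_hat_clean)
print(mse_ratio, mae_ratio)
```

A single outlier inflates the MSE by a far larger factor than the MAE, which is the sense in which the MSE penalizes outliers heavily while the MAE penalizes all errors evenly.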
Due to the evident advantages of MSE over MAE and other error measures, MSE has been used to illustrate the ideas in the subsequent sections.

B. UNIVERSAL METRICS OF A MODEL
Is training a hybrid model on the complete dataset really necessary when the speed of generating data has already surpassed the capacity for storing data [252]? If not, how much data does the model actually need for optimal performance? As noise is inseparable from the data, how should the robustness of the model to noise be evaluated? Accuracy is an indispensable metric for every model. CPS is no exception. But for a CPS, the latency between the stimuli and response is also of significant importance. So, how should the latency be gauged? The answers to all these questions lie in a systematic set of evaluation metrics.
If we disregard the system-specific evaluation metrics, up until now, all the models were only evaluated for accuracy within the range of the training dataset. Holistic evaluation has never been a priority. Even the cross-validation technique used for model selection in the case of NNs evaluates the model within the range of the training data. As the model can never be trained on the complete space generated by the CPS, can the same model perform well beyond the range of the training data? We propose five metrics for the evaluation of a hybrid model used for modeling CPS. They are as follows:

1) INTERPOLABILITY
The model learns a continuous function $f_{Hybrid}$ from the sparse dataset $S$. So, the complete manifold $M$ derived from the sparsely populated dataset $S$ may not exactly represent the subspace $S'$ of the space $\mathbb{S}$ that is actually carved out by the system (Figure 8(a) and 8(b)). Mathematically, $S \rightarrow M \approx S' \subset \mathbb{S}$. Hence, the interpolability metric is used to evaluate how accurately the manifold $M$ represents the system's subspace $S'$. It is defined as the mean of the sum of the squares of the difference between the predicted outputs and the actual outputs within the range of the training data (Figure 8(b)). Mathematically, the interpolability metric can be defined as
$$MSE_{Int} = \frac{1}{n_{Int}} \sum_{i=1}^{n_{Int}} (y_i - \hat{y}_i)^2$$
where the sum runs over the $n_{Int}$ unseen samples that lie within the range of the training data. A lower error translates into better accuracy of the model against unseen data within the range of the training set. Most deep learning techniques conclude their evaluation with the interpolability metric. That is why the earlier deep learning architectures failed drastically against unseen datasets until the development of the ImageNet database [112], [113], which included data from diverse categories. Still, they are vulnerable to failure against unseen data. Their success can be attributed to the diverse data they have been trained upon, which ultimately reduced the amount of data beyond the space of the training data to which they can ever be exposed.

2) EXTRAPOLABILITY
The data is always sparse and is likely to extend its horizon as more and more data is collected. As all the data can never be collected, it is wise to evaluate whether every model performs well beyond the range of the training data $S$. The extrapolability metric is defined as the mean of the sum of the squares of the difference between the predicted and the actual outputs beyond the range of the training data (Figure 8(b)). Mathematically, the extrapolability can be defined as
$$MSE_{Ext} = \frac{1}{n_{Ext}} \sum_{i=1}^{n_{Ext}} (y_i - \hat{y}_i)^2$$
where the sum runs over the $n_{Ext}$ samples that lie beyond the range of the training data. The interpolability metric and the extrapolability metric are mutually exclusive. A model with good interpolability and extrapolability can be said to be complete as it can perform well within and beyond the boundaries of the dataset. The interpolability and the extrapolability metrics together constitute the generalizability metric. However, the extrapolability of a model of a CPS is quite difficult to evaluate as it will be difficult to identify the exceptional behavior of the system, simulate it, and collect the extraneous data. A simple example could be expecting a model that has been trained for classifying the predecessors of human beings (e.g., monkeys, apes, and gorillas) to perform well on classifying human beings. The hybrid models have portrayed excellent extrapolability [133], [200], [215], [224], [225], [253].
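The two metrics can be computed side by side by splitting the test samples according to whether they fall inside the range of the training inputs. The sketch below assumes a one-dimensional input, for which the "range" is simply the interval between the smallest and largest training input; the sine target and the cubic polynomial model are illustrative choices.

```python
import numpy as np

def split_mse(model, x_train, x_test, y_test):
    """Return (MSE_Int, MSE_Ext) for a 1D input, split by the training range."""
    lo, hi = x_train.min(), x_train.max()
    inside = (x_test >= lo) & (x_test <= hi)
    err2 = (y_test - model(x_test))**2
    mse_int = np.mean(err2[inside])   # within the training range
    mse_ext = np.mean(err2[~inside])  # beyond the training range
    return mse_int, mse_ext

# Assumed system: y = sin(x), with training data confined to [-1, 1].
x_train = np.linspace(-1.0, 1.0, 50)
y_train = np.sin(x_train)

# Purely data-driven stand-in model: a cubic polynomial fit.
model = np.poly1d(np.polyfit(x_train, y_train, 3))

x_test = np.linspace(-3.0, 3.0, 301)
y_test = np.sin(x_test)
mse_int, mse_ext = split_mse(model, x_train, x_test, y_test)
print(mse_int, mse_ext)
```

The polynomial matches the system closely inside the training interval and diverges outside it, so a model judged only on $MSE_{Int}$ would look far better than it is.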

3) OPTIMAL SIZE OF DATA
Every dataset has different local and global behaviors. When data is sparse, it captures only the global behavior. When data is abundant, the local behavior is captured along with the global behavior. The local variation can, however, be due to noise in the system, so continuing to collect data to capture local behavior when one is actually capturing noise is unwise. But, as no clear demarcation exists between local and global behavior, it is difficult to know when to stop collecting data.
More data does not guarantee better information about the system, and it does not improve the model once the accuracy of the model has stagnated. The performance of a model is sensitive to the size of the data until it saturates, which happens when the complete global behavior is captured. As collecting data can be expensive, continuing to amass data after the accuracy of the model has stagnated is imprudent. So, the accuracy of the model, MSE_Int, should be evaluated on a running basis, and data collection from the same subspace should be stopped once the accuracy plateaus. Once MSE_Int converges, new data should be sampled beyond the convex hull of the training data. The stopping rule is expressed through MSE_Boost, the improvement in the running average of MSE_Int, where w is the window size, p is a random point, and ε is the convergence parameter. If the boost (improvement) in MSE is insignificant (MSE_Boost ≤ ε), then sampling from the current subspace should stop and move to a new subspace beyond the convex hull of the current subspace (Figure 8(b)). The running average is equivalent to a 1-dimensional convolution with average pooling. Hence, MSE_Boost is equivalent to the difference in the pooled averages of 1-dimensional convolutions. The size of the data at the stopping point is the optimal size of data required for modeling the system in the current subspace (Figure 8(c)).
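The stopping rule can be sketched as follows. This is an illustration under stated assumptions rather than the exact formula of the survey: MSE_Boost is taken here to be the difference between two consecutive running averages of MSE_Int, each over a window of size w, and the function names are hypothetical.

```python
import numpy as np

def running_mean(values, w):
    """Running average over a window of size w, computed as a 1-D
    convolution with a uniform kernel (average pooling with stride 1)."""
    kernel = np.ones(w) / w
    return np.convolve(values, kernel, mode="valid")

def stopping_index(mse_history, w, eps):
    """Index into mse_history (one MSE_Int value per data-collection step)
    at which the boost first drops to eps or below, or None if it never does."""
    avg = running_mean(np.asarray(mse_history, dtype=float), w)
    boost = avg[:-1] - avg[1:]          # improvement between consecutive windows
    for k, b in enumerate(boost):
        if b <= eps:
            return k + w                # last sample covered by the later window
    return None

# Hypothetical MSE_Int values recorded as more data is collected.
mse_history = [1.0, 0.5, 0.25, 0.125, 0.12, 0.119, 0.119, 0.119]
stop = stopping_index(mse_history, w=2, eps=0.01)
print("stop collecting in this subspace at step", stop)
```

With these numbers the running averages stop improving by more than ε between the fourth and fifth windows, so the rule fires at step 5. The amount of data collected up to that step plays the role of the optimal data size for the current subspace.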

4) ROBUSTNESS TO NOISY DATA
As hybrid models incorporate ML models, they are not immune to noise in the data. So, the model must be evaluated against noisy inputs. Noisy inputs tend to affect the local behavior of the model. The chaos of the environment could be a factor behind noisy data. Artificial noise (Figure 8(d)) should be introduced into the system during simulations to create noisy data, and the performance of the model must be evaluated against that noisy data. For an artificial noise N defined as N ∼ (µ, σ²), where µ and σ² are the mean and variance of the noise respectively, the robustness metric is the MSE of the model evaluated on the noise-corrupted data. The noise may or may not have a zero mean. The deployed CPS is likely to encounter noisy data, so CPS models must be robust against its impact. Robust models are those that are resistant to noise. Hybrid models have been shown to be more robust to noise than pure MB and ML methods [133].
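A minimal sketch of such a robustness check is given below. It assumes additive Gaussian noise applied to the inputs; the text above specifies only the mean and variance of the noise, so the Gaussian form, the function name noisy_mse, and the toy linear model are assumptions made for illustration.

```python
import numpy as np

def noisy_mse(model, x, y, mu, sigma, rng):
    """MSE of `model` when the inputs are corrupted by additive noise with
    mean mu and standard deviation sigma (Gaussian, by assumption)."""
    noise = rng.normal(loc=mu, scale=sigma, size=x.shape)
    y_pred = model(x + noise)
    return float(np.mean((y_pred - y) ** 2))

# Toy setup: the system is y = 2x and the model reproduces it exactly,
# so any error measured below is caused by the injected noise alone.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
y = 2.0 * x[:, 0]
model = lambda inputs: 2.0 * inputs[:, 0]

# Sweep the noise level to see how quickly the error grows.
results = {sigma: noisy_mse(model, x, y, mu=0.0, sigma=sigma, rng=rng)
           for sigma in (0.0, 0.1, 0.5)}
for sigma, err in results.items():
    print(f"sigma = {sigma}: MSE = {err:.4f}")
```

Sweeping σ (and, where relevant, a non-zero µ) gives a degradation curve rather than a single number, which makes it easier to compare how gracefully different models respond to noise.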

5) MODEL COMPLEXITY
The performance metrics discussed so far measure the accuracy of the model. They are sufficient for modeling offline systems. But, for online, deployable systems like CPS, additional performance metrics must also be included. An accurate CPS with a delayed response is not desirable. So, additional metrics for gauging the delay, which is mostly due to the complexity of the model, must also be included. Model complexity translates into processing time and memory requirements. A complex computational model will require more memory and more time for processing. The computation time can be estimated based on the number of floating-point operations (FLOPs). As CPSs are live systems, complex computational models can delay the response time of the CPS, thereby compromising the dynamic interaction of the CPS with the environment. Memory is required for storing the parameters of the model, e.g., weight matrices, bias matrices, and physics parameters. Bigger, deeper networks and complex physics models tend to need more parameters and thus necessitate more memory. Equipping miniature mobile CPS with huge memory may not be feasible. So, if the memory requirement of a model is high, deploying it online can be difficult. FLOPs and memory can serve as good implementation metrics.
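As a rough illustration of these two implementation metrics, the sketch below counts the parameters, FLOPs per input sample, and parameter memory of a fully connected network. The counting convention (one multiplication and one addition per weight, activations ignored, 4 bytes per float32 parameter) and the function names are assumptions chosen for simplicity, not a procedure prescribed by the survey.

```python
def dense_layer_cost(n_in, n_out):
    """Parameter and FLOP counts of one fully connected layer."""
    params = n_in * n_out + n_out   # weight matrix plus bias vector
    # n_in * n_out multiplications and n_in * n_out additions
    # (the additions include the bias add); activation cost is ignored.
    flops = 2 * n_in * n_out
    return params, flops

def mlp_cost(layer_sizes, bytes_per_param=4):
    """Totals for a multilayer perceptron given its layer widths."""
    total_params = 0
    total_flops = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        params, flops = dense_layer_cost(n_in, n_out)
        total_params += params
        total_flops += flops
    return {
        "params": total_params,
        "flops_per_sample": total_flops,
        "memory_bytes": total_params * bytes_per_param,
    }

# A small hypothetical network: 8 inputs, 16 hidden units, 1 output.
cost = mlp_cost([8, 16, 1])
print(cost)  # 161 parameters, 288 FLOPs per sample, 644 bytes
```

For a hybrid model, the cost of evaluating the physics component would have to be added to these totals; it is not covered by this sketch.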

VII. DIRECTIONS FOR FUTURE WORK IN HYBRID MODELING DOMAIN
In CPS, multiple simple components combine for a common bigger cause. Their combination compounds their interaction and impact. Consequently, their individual limitations diminish, and new challenges arise. The new challenges could be due to their interaction or because of an individual component that subverts the others. Similarly, in a hybrid model, the ML overcomes several shortcomings of the MB, as discussed in Section IV. Due to this, there has been a lot of interest in the hybrid modeling domain. Despite the significant recent progress, various challenges still hold back the growth of hybrid modeling. They must be addressed in order to exploit the full capacity of hybrid models. In the following subsections, we discuss a few relevant directions for future work in the hybrid modeling domain.

A. HYBRID MODELS: MODEL SELECTION
Model selection is crucial in the hybrid modeling domain. Model selection in the physics-based and machine learning domains is quite intuitive, but this is not so in the case of hybrid models. Appropriate MBs are selected based on the system, its dynamics, and associated parameters. The data generated by a system can be broadly classified into spatial data, sequential data, or a combination of them. Several MBs model them individually or collectively. Among the deep learning techniques for data processing, CNN and LSTM architectures are prominent. CNNs are effective at processing spatial data, and LSTMs are effective at processing sequential data. Their combination is used for processing a combination of spatial and sequential data. Autoencoders [6] are used to extract the abstract features of a dataset in an unsupervised manner. So, there is a clear intuition for model selection in both the physics and the deep learning space. However, there is no clear guideline for model selection in the hybrid modeling space. A well-defined set of guidelines for selecting hybrid models will improve the performance of hybrid models in all aspects and broaden their scope of applications.

B. BENCHMARK PROBLEMS IN HYBRID DOMAIN
A set of benchmark problems eases the evaluation and comparison of different algorithms. Benchmark problems are essential for the growth of every domain. Currently, the ImageNet database [112], [113] and the inverted pendulum [254], [255] serve as the benchmark problems for object detection and reinforcement learning tasks respectively. But no benchmark CPS problem or dataset exists for evaluating and comparing hybrid models. In addition, benchmark problems are specific to each application and evolve with the success of the state-of-the-art algorithms. The benchmark problems in object detection and recognition evolved from CalTech 101 [114] to CIFAR-10 [115] to ImageNet [112], [113]. In general, the growth of the state-of-the-art techniques can be directly inferred from the complexity of the benchmark problems. So, the benchmark problems for hybrid modeling in CPS should also evolve as expertise grows. Also, as CPS has diverse applications, it may require the formulation of unique benchmark problems for each domain.

C. REGRESSION PROBLEMS WITH LOW AMOUNT OF DATA
Supervised learning algorithms try to ''learn'' accurate models that can predict the value of the dependent attribute (output) from the attribute variables (inputs). Regression is one of the two main groups of supervised learning problems (classification being the other) in which the output variable is a real or continuous value. Regression problems manifest in a multitude of CPS domain problems. Modifying, training, and tuning hybrid models for a regression problem is non-trivial in comparison to classification problems.
Additionally, between the extremes of full physics and no physics, and of zero data and big data, there lies an area in the hybrid modeling domain that has limited data and limited physics. A large class of CPS modeling tasks belongs to this limited-data, limited-physics realm. Training accurate hybrid models for regression problems with a low amount of data is a fertile field for future research. Prior works [133], [200], [215], [224], [225] have shown that hybrid models considerably outperform both MB and ML in terms of interpolability and extrapolability in a variety of applications with limited data. However, the application problems in these existing works are limited to simple systems. They have not been specifically tested in CPS applications. Hence, a dedicated and comprehensive evaluation of hybrid models on CPS with limited data is required.

D. LEARNING WITH LESS LABELS
With enough data, one can build accurate hybrid or data-driven models. Accurate hybrid models currently require lots of labeled data. The commercial world (image, video, and text analytics) has curated large sets of labeled data for training models through cheap but effective crowdsourcing methods. Unfortunately, crowdsourcing techniques are not amenable to creating datasets for the CPS domain. Alternate means of data labeling can result in a 100x higher cost and a longer time to label. Developing novel hybrid models that learn with fewer labels opens up a new avenue of research and practical applications. Specifically, semi-supervised hybrid learning approaches that use a small amount of labeled data along with a huge amount of unlabeled data are an interesting avenue of research in the CPS hybrid modeling domain.
Moreover, as CPS finds applications in diverse domains, labeling may require domain-specific expertise. So, utilizing techniques like active learning [256] and semi-supervised learning [257] for hybrid modeling in general, and CPS in particular, is a fruitful avenue of research. Hybrid learning techniques are yet to use concepts from active and semi-supervised learning. Additionally, a set of metrics for unsupervised learning [258] in CPS should also be developed.
The success of unsupervised learning in CPS hybrid modeling can truly make the CPS ubiquitous.

E. INTELLIGENT COLLECTION OF DATASETS FOR HYBRID MODELS
Data collection is an expensive process. Moreover, data is always sparse and can never completely represent the state space. So, data should be collected in an intelligent manner to capture the global behavior of the system. As the model is trained on the set of acquired data, the model tends to improve its interpolability (MSE_Int). However, the performance of the model against new data, i.e., its extrapolability (MSE_Ext), is not guaranteed. The extrapolability of the model will be on par with the interpolability if the new test data is in the vicinity of the space of the training data, and worse otherwise (Figure 8(b)). Hybrid models are capable of learning the complete model of the system. Hence, to develop a complete model of the system, new data should be collected from the regions of the system's space that are far from the training set, where the model is likely to fail and MSE_Ext will be high. The system should be simulated to capture exceptional cases like system failure and abrupt behavior. Such data acquisition will be instrumental for systems where the consequences of failure can be catastrophic. The model should be trained on such data to reduce its bias, until its accuracy saturates. Then, a new region should be sought, and the process should be repeated. To conclude, the acquisition of new data and the training of the model should be conducted in a non-cooperative game-theoretic approach for influencing the mean square error. Additional techniques for data collection, as described in [259], can also be used. The accuracy of the model can be expected to improve in steps, where each step corresponds to a new region of data that is not in the proximity of the trained data.
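One simple way to choose where to sample next is sketched below. It uses a max-min distance (farthest-point) heuristic as a stand-in for "regions far from the training set"; this is a swapped-in illustration, not the game-theoretic procedure mentioned above, and the function name and toy coordinates are hypothetical.

```python
import numpy as np

def farthest_candidate(x_train, candidates):
    """Return (index, distance) of the candidate operating point whose
    nearest training point is farthest away (max-min distance heuristic)."""
    # Pairwise Euclidean distances, shape (n_candidates, n_train).
    diff = candidates[:, None, :] - x_train[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    nearest = dist.min(axis=1)       # distance to the closest training point
    idx = int(np.argmax(nearest))    # least-covered candidate
    return idx, float(nearest[idx])

# Toy state space: three training points and three candidate regions.
x_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
candidates = np.array([[0.5, 0.5], [3.0, 3.0], [1.0, 1.0]])
idx, gap = farthest_candidate(x_train, candidates)
print("sample next near", candidates[idx], "at distance", round(gap, 3))
```

In a full loop, the selected region would be simulated or measured, the model retrained until MSE_Int plateaus, and the selection repeated. Ranking candidates by a measured or estimated MSE_Ext instead of by distance is an alternative when reference outputs are available.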

F. GEOMETRIC MACHINE LEARNING IN CYBER-PHYSICAL SYSTEMS
CPS has opened up the gates for the flood of big data in every imaginable domain. Big data is accompanied by high dimensional input data [111]. When the dimensionality of the data is very high, the useful data is actually embedded in a low dimensional manifold. Learning such manifolds comes under the purview of geometric deep learning. Recent work in CPS [260], [261] has forayed into geometric deep learning, where the data is non-Euclidean. In a non-Euclidean setting, the properties of Euclidean geometry like translation invariance and stability against local deformations [262] no longer hold globally; they hold only locally. So, CNNs, which leverage these two properties for processing spatial data, can no longer be used. That invites novel concepts from geometric deep learning into the cyber-physical space. Among the geometric deep learning techniques, structured graph CNNs [261] and spectral clustering [263] have already been utilized for CPS. Other techniques like spectral CNNs [264], [265], graph CNNs [266], [267], geodesic CNNs [268], and manifold learning should be explored for CPS applications. Geometric deep learning techniques can be instrumental for processing big data [269] in a cyber-physical setup.

G. IMBALANCED DATA AND DATA AT TAIL
The data, in general, is likely to be imbalanced, i.e., not all the classes will have even representation. Its impact becomes critical in security applications, disease classification, fraud and fault detection, etc. In most cases, misclassifying the minority class (e.g., a disease or a perpetrator), which does not have enough data for the algorithm to learn from, is probable. The impact of misclassification in disease prediction or security applications can be fatal or catastrophic. The techniques for controlling the adverse impact of class imbalance manipulate the data, the learning algorithm, or both [270]. However, their performance at mitigating the impact of class imbalance is not as expected and hence debatable [270]-[272]. Additionally, data at the tail-end, for example, anomalies in CPS behaviors, are hard to curate. An anomalous data sample can be considered to be generated by a (tail-end) distribution that is significantly different from the distribution of normal behavior. In general, anomalies are too expensive, time-consuming, and sometimes even dangerous to record. Therefore, anomalies are not observed abundantly on real CPS. We still have a long way to go in correctly addressing imbalanced data and data at the tail in the context of CPS hybrid modeling.
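As one concrete instance of manipulating the learning algorithm rather than the data, the sketch below computes inverse-frequency class weights that a loss function could use to penalize minority-class errors more heavily. The weighting formula, the function name, and the toy label counts are assumptions chosen for illustration; as noted above, the effectiveness of such remedies is debated.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so that
    rare classes receive proportionally larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical CPS log: 95 normal samples and 5 fault samples.
labels = ["normal"] * 95 + ["fault"] * 5
weights = inverse_frequency_weights(labels)
print(weights)
```

Here each fault sample is weighted 19 times more heavily than each normal sample, which is the ratio of the class counts. Weighting does not create information about the tail that the data does not contain, so it complements, rather than replaces, the collection of anomalous samples.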

VIII. CONCLUSION
This is a diligent attempt at disseminating the collective knowledge of hybrid models in an organized manner. We have classified the hybrid models into physics-based preprocessing, physics-based network architectures, physics-based regularization, and miscellaneous categories based on the way the MB is brought into the hybrid architecture. We have also elucidated how their synergy can help eliminate the challenges that big data, the probabilistic nature of CPS, environmental chaos, and the limits of determinism pose to the existing modeling techniques of CPS. In addition, we have illustrated how the MB can overcome the limitations of ML models arising from the time resolution of the data acquisition system. Overall, the combination is a win-win situation. Moreover, we proposed five metrics for the all-round performance evaluation of a hybrid CPS model. The discussed metrics go beyond the conventional metric of accuracy within the training dataset. Finally, the challenges that deter the growth of hybrid models in the CPS domain are briefly discussed. We hope the current work becomes a stepping stone toward the organized dissemination and growth of CPS hybrid models.