Towards Data-driven Optimal Control: A Systematic Review of the Landscape

This literature review extends and contributes to research on the development of data-driven optimal control. Previous reviews have documented the development of model-based and data-driven control in isolation and have not critically reviewed reinforcement learning approaches for adaptive data-driven optimal control frameworks. The presented review discusses the development of model-based to model-free adaptive controllers, highlighting the use of data in control frameworks. In data-driven control frameworks, reinforcement learning methods may be used to derive the optimal policy for dynamical systems. Attractive characteristics of these methods include not requiring a mathematical model of complex systems, their inherent adaptive control capabilities, being an unsupervised learning technique and their decision-making abilities, which are both an advantage and motivation behind this approach. This review considers previous reviews on these topics, including recent work on data-driven control methods. In addition, this review shows the use of data to derive system dynamics, determine the control policy using feedback information, and tune fixed controllers. Furthermore, the review summarises various data-driven methods and their corresponding characteristics. Finally, the review provides a taxonomy, a timeline and a concise narrative of the development of model-based to model-free data-driven adaptive control and underlines the limitations of these techniques due to the lack of theoretical analysis. Areas of further work include theoretical analysis on stability and robustness for data-driven control systems, explainability of black-box policy learning techniques and an evaluation of the impact of the extension of system simulators to include digital twins.


I. INTRODUCTION
Control systems regulate the behaviour of various industrial systems by providing a control response to the system's current state. These regulations, referred to as the control policy within control frameworks, state the prohibitions and permissions of the system and govern the actions taken by the controller. Simply put, the control policy maps states to actions. Note that the system is also referred to as the plant, and the two terms are used interchangeably throughout this paper.
The field of control theory is vast, with fast-evolving literature and process controller designs. Traditionally, the design of controllers employed by various industrial systems has been model-based. Hence, the control policy is dependent on the mathematical model representing the physical system's dynamics to determine the actuation to be applied to the system, given its current state.
Before developing and studying model-free adaptive control frameworks, model-based frameworks were extended to complex and nonlinear systems by making assumptions to simplify the task of encapsulating and modelling both the system's physical dynamics and the experienced external disturbances. However, these approximations and simplifications are not practical and restrict the performance of these systems. Due to the challenges which come with precisely modelling complex systems, model-dependent control systems are neither widely applicable nor feasible [1]. Furthermore, fixed controllers, used in primitive control systems, use a predefined control policy that is applied to the system irrespective of any changes experienced. It is highlighted that control frameworks with fixed controllers and model-dependent control systems are not efficient in achieving the goal of adaptive control, which requires handling uncertainties or disturbances and predicting scenarios beyond the classified objects in the operational environment whilst prioritising safety [2], [3]. More recently, Data-Driven Control (DDC) frameworks have been developed to address some of these feats and have become prominent, given both the explosion of available data in various industries and the accessibility to the computational power of modern-day computers.
While the limitations and drawbacks of fixed controllers and model-based control frameworks led to the development of DDC frameworks, there is no mutual exclusivity in utilised methods. DDC methods may be model-independent or used to enhance model-dependent control systems. For example, some DDC frameworks which directly use only the input-output (I/O) information of a system to determine a control policy using learning-based or iterative techniques are independent of the considered system's mathematical model. This model-independence is both an advantage of, and a motivation behind, model-free control frameworks [4]. The uses of data in control systems include: developing a controller for a model-free framework, system identification, construction of stochastic or uncertainty models, and tuning of fixed controllers [1]. It is important to highlight that model-free control is a particular application of DDC methods, as a data-driven technique is used to remove the need for a mathematical model representation of a system by either deriving the policy directly from the I/O data of the system or identifying the system. In contrast, the construction of stochastic disturbance models and the tuning of fixed controllers use data-driven techniques in conjunction with model-based control frameworks.
Furthermore, another reason for the development of DDC frameworks is to construct an adaptive optimal control policy that finds a control strategy for a dynamical system over some time such that the objective function is optimised and the control policy evolves to adapt to changes. Reinforcement Learning (RL) is an example of an iterative unsupervised learning-based method with the inherent characteristics of adaptability, which is contrasting and advantageous compared to past fixed-controllers. These capabilities highlight the evolution and advancements of the process of controller designs. However, drawbacks include that the stability analysis of these methods is primitive and a formidable challenge [5], [6], and that during the exploration phase of determining the control policy, the RL agent may apply actions that do not satisfy the action constraints which may leave the safety of these techniques questioned. These learning-based techniques require modern-day compute power to provide realistic and computationally efficient responses to online feedback signals. Another promising research avenue is the synergy of model-based frameworks and data-driven learning-based techniques, which is further discussed by [7] and in this review. An overview of the use of data in control systems is given in Fig. 1.
This literature review summarises the use of data in control systems. The primary objective is to provide a concise narrative of the development timeline and taxonomy from traditional model-based control systems to model-free data-driven optimal adaptive control frameworks. This study hopes to provide a single review of DDC techniques which can be used by both intelligent control and RL communities.
The main challenge highlighted in previous autonomous control reviews to date is to develop a control framework that is robust to disturbances, such that the system converges to the desired target within a minimum time and for stability to be maintained. These are pertinent to address in the safety of the operation of these systems [8]. This literature review points to literature on related work to analyse control techniques and emerging directions in this field.
The literature review is structured as follows. Section II gives a brief technical introduction of the classification and terminology of control frameworks. Section III details the methodology and the procedure used to conduct this literature survey. Section IV describes the timeline and taxonomy of control systems from their primitive stages to current novel data-driven control techniques. A description and the development of model-based and model-free control frameworks are respectively discussed in Section V and Section VII, while controller tuning techniques are discussed in Section VI. Section VIII discusses emerging trends in this area of research, while Section IX draws some final conclusions.

II. TERMINOLOGY AND CLASSIFICATION OF CONTROL SYSTEMS
This section considers key concepts and terminology of control theory and highlights essential features on which control systems are classified. These characteristics include the number of inputs and outputs of a system, the type of I/O data, the techniques used by the controller, and the configuration of the information used by the controller.

A. CONTINUOUS-TIME AND DISCRETE-TIME CONTROL SYSTEMS
Whether the signal used in a control system is continuous or discrete determines whether the control system is a continuous-time or discrete-time system. A continuous-time control system has all the system's variables defined as functions of continuous time. Conversely, if the system variables are defined only at distinct discrete-time steps, then the system is a discrete-time control system [9].
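As a minimal sketch of this distinction, the Python snippet below converts a hypothetical first-order continuous-time plant dx/dt = a*x + b*u into a discrete-time model, assuming the input is held constant over each sampling interval (zero-order hold). The plant, its parameter values and the function names are illustrative assumptions and are not taken from [9].

```python
import math


def discretise_zoh(a, b, dt):
    """Exact zero-order-hold discretisation of dx/dt = a*x + b*u (requires a != 0)."""
    ad = math.exp(a * dt)        # discrete-time state coefficient
    bd = (ad - 1.0) / a * b      # discrete-time input coefficient
    return ad, bd


def simulate_discrete(ad, bd, x0, inputs):
    """Iterate x[k+1] = ad*x[k] + bd*u[k] and return every visited state."""
    xs = [x0]
    for u in inputs:
        xs.append(ad * xs[-1] + bd * u)
    return xs


# Hypothetical stable plant sampled every 0.1 s and driven by a unit step input.
a, b, dt = -1.0, 1.0, 0.1
ad, bd = discretise_zoh(a, b, dt)
xs = simulate_discrete(ad, bd, 0.0, [1.0] * 50)
```

For a constant input, the discrete-time states coincide with the continuous-time solution x(t) = 1 - exp(-t) evaluated at the sampling instants t = k*dt; between those instants the discrete-time model carries no information.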

B. SISO AND MIMO CONTROL SYSTEMS
Single Input and Single Output (SISO) control systems have one input signal and one output signal, whereas systems that have more than one input and more than one output are called Multiple Input and Multiple Output (MIMO) control systems [10].
A linear control system obeys the superposition theorem [11]: the system is governed by linear differential equations, and the output or the response varies linearly with respect to the input or the actuation. In contrast, nonlinear systems do not necessarily satisfy the superposition principle: the system is governed by nonlinear equations, and the outputs do not vary linearly with respect to the input [12], [13].
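The superposition property can be probed numerically. The sketch below, using two hypothetical discrete-time plants chosen purely for illustration, compares the response to a weighted sum of two input sequences with the weighted sum of the individual responses. Passing the check for one pair of inputs does not prove linearity, whereas a single failure is enough to show that a system is nonlinear.

```python
def response(step, inputs):
    """Response of a plant, started from a zero state, to an input sequence."""
    x, out = 0.0, []
    for u in inputs:
        x = step(x, u)
        out.append(x)
    return out


def linear_step(x, u):
    # Hypothetical linear plant: x[k+1] = 0.5*x[k] + 2*u[k]
    return 0.5 * x + 2.0 * u


def nonlinear_step(x, u):
    # Hypothetical nonlinear plant: the input enters through a cubic term
    return 0.5 * x + 2.0 * u ** 3


def satisfies_superposition(step, u1, u2, alpha=2.0, beta=-3.0, tol=1e-9):
    """Compare response(alpha*u1 + beta*u2) with alpha*response(u1) + beta*response(u2)."""
    combined = [alpha * p + beta * q for p, q in zip(u1, u2)]
    lhs = response(step, combined)
    rhs = [alpha * p + beta * q
           for p, q in zip(response(step, u1), response(step, u2))]
    return all(abs(l - r) <= tol for l, r in zip(lhs, rhs))


u1 = [1.0, 0.0, 0.5, -1.0]
u2 = [0.2, 0.4, -0.3, 0.1]
linear_ok = satisfies_superposition(linear_step, u1, u2)
nonlinear_ok = satisfies_superposition(nonlinear_step, u1, u2)
```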

E. MODEL-BASED AND MODEL-FREE CONTROL FRAMEWORKS
Model-based control systems use the physical dynamics of the system's structure, given in the form of a mathematical representation, in determining the actuation signal to be applied to the system. In contrast, model-free control systems use linearisation techniques and learning-based techniques to develop a controller based on historical data or the output of the plant at each iteration and not on any assumptions on the system model [14].

F. ONLINE AND OFFLINE DATA
Offline measured data is a historical data set, whilst online measured data is the information obtained from real-time channels. Online measured data allow for real-time updates, whilst the usage of offline measured data requires regular updates to account for new trends that can be seen in recently obtained measurements.

G. FIXED CONTROL AND ADAPTIVE CONTROL
A fixed control system has a predefined control architecture that is used in determining actuations to apply to the plant irrespective of any changes in the environment. In contrast, an adaptive control system adjusts the control method with respect to the control system's parameters. There are two particular adaptive control categories: direct and indirect. Direct adaptive methods respond directly to the output of the plant, thereby iteratively updating the control policy or the mapping of the I/O data. Indirect adaptive control methods estimate the parameters of the plant and use the estimated model to adjust the controller by fine-tuning the controller's parameters [15], [16].
In on-policy methods, the behavioural policy and the target policy are the same, which means that the agent learns directly from its experiences. When the behavioural and target policy are the same, the agent both selects the actuation and uses the selected actuation.
Off-policy methods have more freedom in exploring the environment than on-policy methods. Off-policy methods update the policy by estimating future rewards and actions from the generated data, independently of the actions the agent actually takes [17], [18]. In contrast to on-policy methods, in off-policy methods the agent does not need to follow the policy being learned but instead learns from exploration. Hence, the target policy and the behavioural policy are different. Q-learning is an example of an off-policy method.
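The difference between the two families shows up in the update target. The sketch below contrasts the tabular Q-learning update with the SARSA update, which is brought in here only as a familiar on-policy counterpart and is not discussed in the text above. The states, actions, reward and the learning-rate and discount values are hypothetical.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the greedy action in s_next,
    whatever action the behavioural policy goes on to take."""
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next that the
    behavioural policy actually selected in s_next."""
    target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])


# Identical hypothetical value tables and the same observed transition.
actions = ["left", "right"]
q_off = defaultdict(float, {(1, "left"): 1.0, (1, "right"): 5.0})
q_on = defaultdict(float, {(1, "left"): 1.0, (1, "right"): 5.0})

q_learning_update(q_off, 0, "right", 1.0, 1, actions)  # uses max over next actions
sarsa_update(q_on, 0, "right", 1.0, 1, "left")         # uses the exploratory action taken
```

With the same transition, the two tables diverge because the behavioural policy explored with the action "left" in the next state: the off-policy target ignores that choice, while the on-policy target depends on it.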

III. METHODOLOGY
The structure of this systematic literature review is primarily based on the guidelines provided by [19]. The review studies the use of data in control systems with a particular focus on the development of data-driven methods for model-free adaptive control. The literature review gives an overview of the uses of data in control system frameworks, the timeline and taxonomy of the development of controller designs from model-dependent designs to model-independent designs, reviews common adaptive control techniques and underlines their strengths and weaknesses, discusses the limitations of the literature, and indicates recent advances and emerging directions.
To keep the narrative of the development of control techniques from primitive stages to current control frameworks concise, this study only includes common data-driven control techniques and omits hybrid techniques. The reviewed methods are critically discussed to highlight both their applications and limitations. As summarised in the introduction in Fig. 1, the three categories of data in control systems are model design, controller tuning and policy derivation. These are common in similar literature surveys [20], [21]. Other classifications include adaptive or fixed controllers, but these topics have been discussed alongside model-based and model-free controllers, as this literature review focuses on adaptive controllers. Other characteristics on which control systems are classified include switching mechanisms and whether the control framework is a distributed system. The use of historical data in improving model design includes system identification and modelling of the stochasticity experienced by a plant. Methods considered for controller parameter tuning using data include Iterative Feedback Tuning (IFT), Virtual Reference Feedback Tuning (VRFT), Correlation Based Tuning (CBT) and non-Correlation Based Tuning (nCBT). This review also points to literature on the use of data in feature selection for feedback control. The primary focus of this survey is the use of data in policy derivation. The model-based techniques considered are Model Predictive Control (MPC) and its data-driven extension, Data-driven Model Predictive Control (DDMPC); the Model-free Adaptive Control (MFAC) techniques include Iterative Learning Control (ILC), Lazy Learning (LL), Dynamic Linearisation Techniques (DLT) and prominent RL-based methods such as Deep Q-network (DQN), Deterministic Policy Gradient (DPG), Deep DPG (DDPG), Trust Region Policy Optimisation (TRPO), and Proximal Policy Optimisation (PPO).
This review points to other intelligent control techniques used in control frameworks, such as Bayesian probability, fuzzy logic and evolutionary computation, in Section IV; however, it focuses on iterative learning-based methods, which may or may not be neural network (NN) based, for model-free policy derivation. Furthermore, the narrative is centred around discrete-time control systems, given their predominance: they are easier to integrate, have a lower computational cost than continuous-time control computations, and have a more comprehensive range of developed algorithms available to solve problems of this nature [22].
The literature appraisal and selection process entailed using the following search words: 'Data-driven control', 'Model-free adaptive control using data-driven techniques', 'Data-driven model predictive control', 'Intelligent control', and 'Learning-based control'. Publications between 2011 and 2021 from peer-reviewed journals and conference proceedings were considered. It must be noted that the Background, Section IV, includes earlier works dating back to the late 19th century. Furthermore, textbooks considered were not restricted to the mentioned time frame.
Searches for literature on the aforementioned keywords were performed in Google Scholar, Web of Science, IEEE Xplore, Science Direct, Annual Reviews and Springer Link. From the results returned by the various databases, survey papers were perused first, and then the abstracts, introductions and conclusions of the other returned articles were analysed. Articles that gave insight on topics considered under data in control were read in their entirety and included in this survey. Finally, only articles written in the English language were considered.

A. RECENT LITERATURE SURVEYS AND DEVELOPMENTS ON DATA USED IN CONTROL SYSTEMS
Seminal literature surveys on data-driven control methods have appeared since the late 20th century. Table 1 highlights the main contributions and topics discussed in the respective literature surveys conducted between 2011 and 2021 that are closely related to this review. These survey papers served as seeds in searching for literature included in this review. The main contributions of this review are also included in Table 1. This literature review aims to provide a single review that can be referenced by both the robotics and automatic or intelligent control communities to discuss the various uses of data for optimal and adaptive control. Data in control for various topics, such as controller tuning and both Neural Network (NN) based and non-NN based frameworks, have been reviewed [20]. Furthermore, this review provides an extension and contribution to developments since 2018 and provides a timeline and taxonomy of control frameworks that are dependent on and independent of the model dynamics.
Data usage in model-based and model-free frameworks is discussed in Sections V and VII, respectively; an overview of controller tuning using data-driven techniques is given in Section VI, and emerging trends are presented in Section VIII.

IV. BACKGROUND
From 1760 to 1840, society's once agrarian-handicraft economy slowly transitioned to one dominated by mechanised factory systems and machine tools, hence transforming societies to be more industrialised and urban. In modern history, this transition period is known as the Industrial Revolution. During the late 18th to 19th century, the industrial sector had not only become fast-growing but had also begun adapting the available technology. This initial progress was shortly followed by analysing the designs of continuously operating process systems to improve and optimise their performance. Various attempts at maintaining accurate control of these dynamical systems led to both practical and theoretical development in the field of Control Theory, as first proposed by [25]. The reader is referred to the survey papers [26], [27] on the early progression of control theory.
Control systems or control engineering is a discipline that practically applies control theory to design systems with desired behaviours in a given environment. Control systems can be formally described as a device that generates autonomous behaviour through computation and actuation [1]. Feedback systems are a particular process that may form a part of control systems to improve their performance by returning the output of the system to be utilised as a part of the system's input. Feedback controllers were widely used in the early years of the 20th century for voltage, current, and frequency regulation; boiler control for steam generation; electric motor speed control; ship and aircraft steering and auto stabilisation; and temperature, pressure, and flow control in the process industries [26]. As a result, the controllers' designs were tailored specifically to these applications. However, most of these controllers were designed without a thorough understanding of the control system's dynamics and the actuating control device. This lack of understanding was due to poor theoretical backing at the time, with no common language to discuss these types of problems. Fortunately, since the control system applications were simple regulation tasks, the undeveloped theoretical rigour was not detrimental. There were, however, more complex mechanisms involving complicated control laws being developed, such as the automatic ship-steering mechanism devised by Elmer Sperry in 1911, which incorporated Proportional Integral Derivative (PID) control and automatic gain adjustment to compensate for the disturbances caused when the sea conditions changed [26].
During World War I (1914-1918), major developments emerged in stochastic systems, including the fire control work done by [28]. From 1935 to 1940, advances in the understanding of control system analysis and design were being made by several independent groups around the globe. However, the beginning of the transition period leading to the formalisation of modern control theory took form after the conference on "Automatic Control" held in July 1951 at Cranfield, England, and the "Frequency Response Symposium" held in December 1953 in New York [29]. The wartime experience during World War II (1939-1945) demonstrated the power of the frequency response approach to the design of feedback systems and revealed the weaknesses of any design method based on the assumption of linear deterministic behaviour. The two assumptions which facilitate control algorithm design are that there is no human interaction with the system and that the environment with which the system interacts is precisely known [1]. However, these are not practical assumptions when considering industry scenarios. Real systems are not necessarily linear, real measurements contain errors and are contaminated by noise, and both the process and the environment are uncertain. In order to have the best-suited controller for the system, the design techniques used should consider the following behaviours: linear and nonlinear, deterministic and non-deterministic, and the presence of noise or measurement error. In the 1980s, post World War II, research began to make optimal feedback logic more robust to disturbances and variations in the measurements received from the systems [27], [30]. This research topic has grown rapidly since its conception and remains active to date.
The development of control system frameworks through key historical events such as the Industrial Revolution, World War I, and World War II highlighted the absence of systematic methods to handle hard constraints imposed on control systems. This led to a reliance on ad hoc methods, such as single loop controllers augmented by various selectors. The birth of MPC brought about a means to accommodate the requirement of having controllers take imposed constraints into account. MPC has a predictive capability and can better encapsulate the dynamic characteristics of dynamical systems than traditional PID controllers. In addition, adaptive control techniques were developed to account for uncertainties or adaptations of control systems, with these being either model-based or data-driven approaches [20].
The evolution of controller objectives and designs is highlighted in this section. In summary, initially, control systems performed predefined actions based on the system's current state response. However, this objective was satisfied by fixed controllers as the understanding of control theory grew and model-based techniques were developed to encapsulate the system dynamics better. Due to the complexities of the considered plants, the possibility of accounting for disturbances brought about the idea of model-free adaptive control. Essentially, none of these methods is explicitly model-free as the system dynamics are captured through various function approximation methods. These methods include traditional statistical methods and learning-based methods.

VOLUME V, 2021

TABLE 1
Reference | Model-Based | Model-Free | Controller Tuning

[...] in designing the controller. The use of data for both controller design and controller tuning is summarised.
• Distinguishes and classifies the various methods based on their characteristics.

[20] (2018) ✓ ✓ ✓ ✓ ✓
• Classifies adaptive control techniques based on whether or not the methods are model-based or data-driven and describes the approach used in deriving the policy.
• Model-based approaches include a discussion on 'adaptive regulation', which considers unknown disturbances without explicitly modelling the system.
• Discusses using learning-based techniques to improve controllers in a model-based framework.

[1] (2018) ✓ ✓
• Proposed formulations are given of the use of data to formulate uncertainty in the model design. Prediction models, which include environment models that are used to navigate and manipulate objects, can be deterministic, stochastic or scenario-based.
• The review includes an overview of predictive control frameworks. Ensuring recursive feasibility, convergence, robustness, constraint satisfaction, and computational tractability of these methods is discussed.
• The impact of the prediction horizon length is analysed.
• The properties of linear MPC state-feedback policies with or without disturbances are presented.

[23] (2019) ✓ ✓
• Gives an overview of the recent progress of RL for process control.
• Highlights best-suited systems and underlines constraints or limitations of the various applications.
• Compares the characteristics of MPC to RL.

[24] (2020) ✓ ✓
• Highlights, critically compares and reviews RL methods used in process control.
• Classifies the various RL methods.
• Underlines the shortcomings of RL, which include unestablished stability theory and not accounting for constraints in model-free frameworks.

This review
• Provides a detailed chronological description of the evolution of control frameworks from primitive model-based techniques to DDC techniques for adaptive optimal control.
• Provides a single review of reference which is used to discuss the ensemble of data uses in control systems, with a primary focus on policy derivation from data.
• Points to recent works on the development of theoretical analysis of model-free methods, which include studies and formalisation on stability, robustness and convergence, and to the need for studies on the explainability and interpretability of black-box algorithms.
• Points to emerging directions of work in this field of data-driven optimal control, including the required development of high-fidelity simulators to be used in the process of agent training, and the digital twin to optimise the end-to-end process of the development of control frameworks.
Intelligent probabilistic and statistical methods include fuzzy logic [31], [32], Kalman filters, particle filters [33], [34], and Bayesian optimisation [35], amongst others. Since the conception of the fuzzy logic method, stability analysis of the technique has been formalised, and the technique has been applied in both model-dependent and model-independent control frameworks and to problems in a range of industries [36]. The reader is referred to the following surveys and applications of this technique to control problems [37]-[40]. Although fuzzy logic in control theory has shown success in several applications, in some cases its drawbacks limit its application to control systems. These drawbacks include it not being considered a systematic approach to solving problems, inconsistent performance, and significant training and validation requirements. There are multiple applications of Kalman filters to control problems, as reviewed in [41], as they are computationally efficient in terms of memory use. However, they assume that both the system and the observations are linear. Bayesian optimisation has been applied both directly to control problems and as an optimisation technique for hyper-parameter tuning. Bayesian optimisation is sensitive to the parameters used, and the difficulty of estimating the Bayesian optimisation model is itself a drawback.
This literature review focuses on the development from MPC to DDMPC and learning-based model-free adaptive control techniques. These are discussed further in this section and throughout the paper.

A. MODEL PREDICTIVE CONTROL
MPC is a feedback control algorithm that uses the model representation to forecast behaviours by solving an online optimisation problem to select the most suitable control action, such that the system being acted upon (plant or process) is driven towards the desired target. This advanced model-based process control method was born in the petrochemical industry in the late 20th century. This class of model-based control methods requires an explicit dynamic model of the plant to predict the impact of future actuations of the control variables based on the feedback or output from the plant. MPC is also commonly known as Receding Horizon Control (RHC) because, in brief, at each discrete time step, the future actuations to be applied to the plant are determined. This set of actuations is obtained using the dynamic model, and at each sampling time, the set of future actuations is updated based on the updated feedback from the system. For details on the early development of MPC, the reader is referred to the survey paper [42]. MPC is a MIMO advanced process control method, whilst the PID controller is traditionally SISO, although it has been extended and applied to MIMO systems. Furthermore, the ability to acknowledge constraints and the predictive capability of the MPC framework are seen as improvements and advantages in comparison to traditional PID controllers. PID controllers, in contrast, are model-free, whereas MPC frameworks are model-based.
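The receding-horizon idea can be sketched for a hypothetical scalar plant x[k+1] = a*x[k] + b*u[k] with a quadratic cost. The snippet below is an illustrative simplification, not a full MPC implementation: it ignores constraints, so each finite-horizon plan can be computed in closed form by a backward Riccati recursion, whereas a constrained MPC would typically solve a numerical optimisation problem at every step. All names and parameter values are assumptions made for the example.

```python
def plan_inputs(x0, a, b, q, r, horizon):
    """Unconstrained finite-horizon plan for x[k+1] = a*x[k] + b*u[k]
    minimising the sum of q*x^2 + r*u^2 (terminal weight q)."""
    p = q                      # terminal cost weight
    gains = []
    for _ in range(horizon):   # backward Riccati recursion (scalar case)
        k = (a * b * p) / (r + b * b * p)
        p = q + a * a * p - a * b * p * k
        gains.append(k)
    gains.reverse()            # gains[0] now belongs to the current step
    x, inputs = x0, []
    for k in gains:            # roll the model forward to obtain the planned inputs
        u = -k * x
        inputs.append(u)
        x = a * x + b * u
    return inputs


def run_mpc(x0, a_model, b_model, a_plant, b_plant, steps,
            horizon=10, q=1.0, r=0.1):
    """Receding-horizon loop: plan, apply the first input, measure, replan."""
    x, xs = x0, [x0]
    for _ in range(steps):
        u = plan_inputs(x, a_model, b_model, q, r, horizon)[0]
        x = a_plant * x + b_plant * u   # plant response used as feedback
        xs.append(x)
    return xs


# Hypothetical open-loop unstable plant (a > 1) with a perfect model.
trajectory = run_mpc(x0=5.0, a_model=1.2, b_model=1.0,
                     a_plant=1.2, b_plant=1.0, steps=20)
```

Only the first element of each planned sequence is applied; the remaining planned inputs are discarded and recomputed from the next measurement, which is what allows the loop to react to a mismatch between the model and the plant.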
The statement made in [43] encapsulates the objectives of MPC: "One technique for obtaining a feedback controller synthesis from knowledge of open-loop controllers is to measure the current control process state and then compute very rapidly for the open-loop control function. The first portion of this function is then used during a short time interval, after which a new measurement of the process state is made and a new open-loop control function is computed for this measurement. The procedure is then repeated." This statement guided the development of the family of MPC controller designs into mature techniques, with a strong theoretical basis, for tackling control problems in industry. The MPC model was designed to solve multi-variable, constrained, infinite horizon, and possibly nonlinear optimal control problems via finite horizon solutions with a receding horizon implementation. These finite horizon solutions involve optimising the objective function over the (finite) prediction horizon, where the predictions are based on a mathematical model of the dynamical system to be controlled in real-time. Some of the earliest work on MPC, which laid the foundation of this field and of the applications of MPC in industry, includes the description of successful applications of Model Predictive Heuristic Control (MPHC) in 1978 [44], which was later known as Model Algorithmic Control (MAC), and the outline of Dynamic Matrix Control (DMC) [45], [46]. Both algorithms, MAC and DMC, make explicit use of a dynamic process model.
Having a theoretical foundation set up for MPC in the late 20 th century, the early 2000s focus was on the development of the MPC controller design to reduce orders of magnitude of computation time to compute online optimisation efficiently. Such real-time responses could be given to the technology to which it was applied, thus requiring fastsampling rates. Initially, explicit MPC control laws were determined offline to achieve speed up through a customised algorithm, which proved to be orders of magnitudes faster than the generic solver. However, as the horizon size or states and constraints increased, the number of polyhedral regions scaled, making the lookup task in a table difficult to implement in practice. Hence, [47] proposed methods include a combination of table storage and online optimisation, or simplifying the problem by imposing equality constraints as proposed in [48], or using approximate primal barrier interior point method adorned with several customised features like fast Newton step computation and a fixed barrier parameters as suggested in [49]. The online approach is imperative and provides an added advantage of weighted parameters horizon size on model parameters which can be changed as required, unlike explicit methods where entirely new lookup tables would have to be constructed.
Given its potential, MPC has been widely applied in fields including power electronics [50], [51], data centre cooling [52]-[54] and unmanned autonomous vehicles (UAVs) [55], amongst others. The reader is referred to [56] for a detailed review on the development of MPC. A detailed description of the MPC method is given in Section V-A.

B. DATA-DRIVEN CONTROL
In recent years, information has been available in abundance. For example, data recorded from plants have been used to model system dynamics, design stochastic models representing noise [1], fine-tune controllers [21], and derive the control law merely using I/O data and learning methods [20], [21]. Control frameworks that use data-driven approaches may be applied to model-based or model-free systems and may use online data, offline data or both. The definition of DDC varies throughout the evolution of this field and in the literature. In some instances, DDC refers to a model-free framework that uses data with intelligent algorithms to derive the control policy. In other instances, DDC refers to the general use of data in control, irrespective of the framework's dependence on a mathematical model of the system [21]. In this literature review, DDC is taken in the latter sense.

1) Data-Driven Model Predictive Control
MPC is a powerful technique; however, its performance is determined by the accuracy of the dynamic model used and by the assumption that there is no external disturbance. It is not realistic to encapsulate the dynamics of complex nonlinear systems in a model representation and assume that there are no external interactions with the system. Thus, DDMPC frameworks are studied, as they use data-driven techniques to extend the MPC framework. Historical data from the system is used to model the dynamics of the plant for use in the MPC framework [4], and data-driven approaches have been used to formulate stochastic MPC models that encapsulate the uncertainty the system endures, so as to autonomously improve the performance of repetitive tasks [1]. If system identification is omitted and the control policy is determined solely from the data or the feedback information, the method is commonly referred to as data-driven optimal control and can form a part of model-free frameworks.
Extensions of the MPC framework using both unsupervised and supervised learning techniques have been studied. Unsupervised learning techniques include clustering algorithms [57], a mixture Gaussian learning method to detect false data points in a smart grid estimation framework [58], and non-Bayesian learning for fast convergence [59]. Supervised learning techniques include applying regression [60] in conjunction with online modelling methods to estimate the mathematical model of nonlinear time-varying systems. The reader is referred to [61] for the stability analysis of the DDMPC framework.

2) Data-Driven Controller Tuning
In control systems with fixed controller architectures, data-driven approaches have been used to fine-tune the controller parameters. Some of the earliest works in this field include the tuning of the PID controller [62]. Prominent iterative methods for controller tuning include IFT and CbT. Non-iterative methods include VRFT and nCbT. These methods are discussed in Section VI.

3) Learning-based Data-Driven Control
Adaptive control methods were initially designed for model-based frameworks, which use the plant's dynamical system representation to make decisions whilst handling uncertainties. However, model-free learning-based methods for adaptive control were seen as promising, as they do not rely on exact physics and mathematical modelling of the considered system. Instead, the aim is to use learning-based methods to iteratively adapt the control law, which better handles the negative effects of disturbances and the effects of parameter variation. Although this approach has potential, it has the drawbacks of slow convergence and the possibility that the learned control law cannot be interpreted [63].
Model-free DDC methods, which use learning-based methods and data to derive the control law, may be NN based or non-NN based. DLT, LL and ILC [64]-[66] are non-NN based methods. DLT is a DDC method which is considered a fundamental tool for discrete-time nonlinear systems [64]. LL is classified as a non-NN based machine learning method. ILC is a learning-based method that updates the control policy for repetitive tasks through successive iterations. Although first proposed in 1978 [65], ILC did not initially draw much attention as it was published in Japanese; the work was published in English in 1984 [66]. For more details on ILC, the reader is referred to a survey [67] and various industrial applications of ILC [68], [69].
Model-free DDC frameworks have become prevalent amongst the control and robotics communities in the recent past. In particular, NN based learning methods have been used to develop model-free DDC frameworks; these include RL [70] and learning from demonstration (LfD) [71], [72]. RL approaches have shown the capability to realise optimal control; commonly used methods or frameworks are Q-learning and Actor-Critic (AC) architectures. A comprehensive review of some of the earliest works in this field is given in [73], [74]. Data-driven policy derivation methods are further discussed in Section VII.
The birth of RL can be attributed to the culmination of trial-and-error search psychology in the animal kingdom, Dynamic Programming (DP) and optimal value functions. DP optimises the input trajectory by using a function where the unknowns are also functions generated by the system's state information in conjunction with a value function [75]. However, the optimisation problem, once reformulated, could potentially be intractable due to the curse of dimensionality. This is a drawback of DP, hence the proposition of Approximate DP (ADP) [76]- [78]. ADP approximates the control policy by using an offline iteration algorithm or an online update algorithm [79]. RL leverages one such ADP method to solve for the optimal policy offline. The design of ADP may take one of many forms that are dependent on the structure of the agent.
RL was formulated with the aim of minimising the loss function over time for dynamical control systems [80], [81]. RL, an area of machine learning developed as an optimal sequential decision-making method, is considered an adaptive control algorithm as it can account for uncertainty without relying on a finite number of formulated stochastic models, as in the DDMPC framework [23], [82]. Unlike the MPC and DDMPC frameworks, which rely on mathematical models, RL is model-independent. This is advantageous particularly for industrial processes that are nonlinear, MIMO, or both, as it is not a trivial task to model their complexities mathematically [83]. A drawback of MPC is that its performance depends on the length of the prediction horizon: for more complex systems, the prediction horizon is shortened to ensure computational feasibility, which could lead to sub-optimal results in the long term [23]. In contrast, some RL algorithms overcome this challenge by pre-computing the optimal solution offline [24], [84]. Furthermore, unlike MPC, RL does not have the online computational demands of trajectory optimisation methods. The development of RL for control systems is discussed in [23], [24].
In summary, this section gives the timeline and development from primitive model-based control techniques to current day model-free control techniques. The taxonomy of this section is summarised in Fig. 3, which classifies methods based on their dependence on a model of the system, the use of data and if the controller tuning methods are iterative or non-iterative.

[Fig. 3: taxonomy of the control methods discussed in this section; top-level branches: Model-Based and Model-Free.]

V. MODEL-BASED CONTROL
Model-based control techniques, MPC and its data-driven extension, DDMPC, are discussed in Section V-A and Section V-B, respectively.

A. MODEL PREDICTIVE CONTROL
The main objectives of the MPC controller are to prevent the violation of input and output constraints, maintain outputs within specified boundaries whilst propelling the system to the desired reference trajectory, and control as many process variables as possible with limited available sensors or actuators [93], [94]. The basic structure of the MPC framework is summarised in Fig. 4, and the corresponding MPC trajectory for a SISO system is given in Fig. 5. The MPC model predicts the changes caused to the system by independent variables. The predicted trajectory may or may not be followed due to disturbances. Independent variables that the controller cannot adjust are taken as disturbances, and dependent variables in these processes are other measurements that represent either control objectives or process constraints. Since MPC is iterative, as a result of the inherent nature of feedback algorithms, the output obtained after applying the first input from the set of actions allocated over the prediction horizon, Output, is fed back into the controller, which updates the Dynamic Model with respect to the reference signal, Reference, the objective function, Objectives, and the constraints, Constraints. Based on the residual, the difference between the measured output and the reference, the prediction horizon is re-initialised, and the next set of control actions is determined. This process is repeated to drive the system towards the desired behaviour. Formally, repeatedly solving a constrained optimisation problem to choose the control action, whilst accounting for predictions of future costs, disturbances, and constraints over a moving time horizon, is known as RHC. The prediction horizon is iteratively shifted forward, hence MPC is commonly known as the RHC method. The idea of receding horizons dates back to the 1960s [43] and was used to ensure that constraints are satisfied and that limits on control variables and sophisticated feedforward action are maintained.
MPC's advantages include its predictive capability, that is, its ability to optimise over the current horizon while accounting for the future, which is obtained by iterative optimisation over a finite horizon, and its ability to take model constraints into account [95]. However, the drawbacks of MPC include its computational cost, which arises from the complexity of the algorithm and grows as the system dynamics scale [95], and its dependence on the dynamic model of the system. The cost, time and effort of capturing an accurate dynamic model of a system are the largest obstacles in MPC [96], [97].
The predicted control trajectory in an MPC framework is iteratively updated at each instant $t$ over the interval $[t, t + N]$, where $t$ is the current time, and $N$ is the number of discrete future time-steps, which is also known as the prediction horizon length. The corresponding predicted control inputs, $\hat{u}(t + k|t)$ for $k = 1, \ldots, N$, and outputs, $\hat{y}(t + k|t)$ for $k = 1, \ldots, N$, are determined based on the plant's dynamic model and the current state $x_t$. From the set of predicted actuations, only the first actuation is applied to the plant. The plant's state is then re-sampled, and the future predicted trajectory is recalculated [98], [99]. MPC relies on the discrete-time state-space model of the plant to predict the plant's future actuations over the receding horizon, which is used in the design of the controller and can be expressed by

$$x_{t+1} = A x_t + B u_t,$$

and the corresponding measured output is given by

$$y_t = C x_t,$$

where $A$, $B$, and $C$ are the discrete state-space plant model dynamic matrices, $u$ is the control input, which is also known as the manipulable variable, $y$ is the measured output vector, and $x$ is the state variable vector. The objective is to find the control sequence that minimises the quadratic cost function, written here in a standard form consistent with the surrounding definitions,

$$J = \sum_{k=1}^{N} \left( x_{t+k}^{\top} Q \, x_{t+k} + u_{t+k}^{\top} R \, u_{t+k} \right),$$

where $Q$ and $R$ are respectively the state and the control cost weight matrices. This objective function is subject to the linear inequality constraints on the system inputs:

$$u_{\min} \le u_{t+k} \le u_{\max}, \quad k = 1, \ldots, N,$$

$$\Delta u_{\min} \le \Delta u_{t+k} \le \Delta u_{\max}, \quad k = 1, \ldots, N,$$

where $u_{\min}$ and $u_{\max}$ respectively are the minimum and maximum bounds of the control actions, and $\Delta u_{\min}$ and $\Delta u_{\max}$ are respectively the minimum and maximum control increments. This general MPC model can be reformulated to be more realistic and include noise, or be developed with an infinite prediction horizon or a terminal constraint for a more robust model.
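The receding-horizon procedure described above can be sketched for a scalar plant as follows. This is a minimal illustration rather than the formulation of any cited work: the plant parameters `a` and `b`, the weights `q` and `r`, the function names and the input bounds are all assumptions chosen for the example, the model is assumed to match the plant exactly, and the input bounds are enforced by simply clipping the unconstrained optimum, which stands in for the constrained optimisation a full MPC solver would perform.

```python
def first_step_gain(a, b, q, r, horizon):
    """Backward recursion for a scalar finite-horizon quadratic problem.

    Minimises sum_{k=0}^{N-1} (q * x_{k+1}**2 + r * u_k**2) subject to
    x_{k+1} = a * x_k + b * u_k and returns the gain K such that the
    optimal first input is u_0 = -K * x_0 (unconstrained case).
    """
    p = 0.0      # cost-to-go weight with zero steps remaining
    gain = 0.0
    for _ in range(horizon):
        s = q + p                            # total weight on the next state
        gain = s * a * b / (r + s * b * b)   # minimiser of the one-step problem
        p = s * a * a * r / (r + s * b * b)  # cost-to-go with one more step remaining
    return gain


def receding_horizon_control(x0, a, b, q, r, horizon, u_min, u_max, steps):
    """Apply only the first input of each finite-horizon plan, then re-plan."""
    xs, us = [x0], []
    x = x0
    for _ in range(steps):
        gain = first_step_gain(a, b, q, r, horizon)  # re-solve at every sample
        u = -gain * x                                # first element of the plan
        u = max(u_min, min(u_max, u))                # crude handling of input bounds
        x = a * x + b * u                            # plant update (model assumed exact)
        us.append(u)
        xs.append(x)
    return xs, us


# Hypothetical open-loop unstable plant with a bounded actuator.
xs, us = receding_horizon_control(
    x0=5.0, a=1.2, b=1.0, q=1.0, r=0.1, horizon=10,
    u_min=-2.0, u_max=2.0, steps=20,
)
print(f"final state: {xs[-1]:.3e}, largest |u|: {max(abs(u) for u in us):.2f}")
```

Because the assumed model is time-invariant, the recomputed gain is the same at every sample; the re-solve is kept in the loop only to mirror the measure, optimise, apply-first-input cycle of the text.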
The reader is referred to [56], [94], [100], for further details on MPC models, to Table 2 for an overview of the literature on the development of MPC and Table 3 for applications of MPC to control problems.

B. DATA-DRIVEN MODEL PREDICTIVE CONTROL
MPC's potential is limited by the accuracy of the model representation of the system and the available actuations. DDMPC is an extension to MPC which aims to enhance the powerful MPC framework by using data for system identification [1], [110], [111] and by encapsulating disturbances in the model of the plant through data-driven stochastic model predictive control, in order to satisfy constraints in the presence of uncertainty and achieve recursive feasibility [1].
The model identification step, which estimates the nominal model of the system using data, has been prominent for linear systems, but more recently, system identification has been studied for nonlinear systems [112]-[114]. The development of data-driven stochastic MPC, used to encapsulate disturbance in the model, is described in [115]-[118] and summarised in [119]. DDMPC is an adaptive control technique that combines the model-based MPC method with data-driven learning techniques. DDMPC extends the MPC framework by learning from the trajectory data of the system at every time step to construct a safety set, which is used to learn in which region of the state-space the system should operate [4], [72], [119]-[121]. Although DDMPC is an extension of MPC, it shares the drawback that the framework depends on a mathematical model of the system. However, the advantages of DDMPC are that it may be easier to model the system and its uncertainties using data than through physics alone, and that it is simpler to integrate into control frameworks than MPC [122].

Table fragment (reference, year, application):
[131] (2021): Energy management for a semi-closed greenhouse.
[132], [133] (2021): Data center cooling.
[134] (2021): Quadcopter trajectory tracking.
[135] (2019): Trajectory tracking.
Applications of the DDMPC framework range from mechatronics [123] to home assistance appliances [124]. The reader is referred to [119], [125], [126] for details on the robustness guarantees of the DDMPC framework. An overview of the literature on DDMPC is given in Table 4, and the applications of DDMPC to control problems are tabulated in Table 5.

VI. DATA-DRIVEN CONTROLLER TUNING
Data-driven methods used to tune the parameters of fixed controllers include IFT, VRFT, nCbT and CbT.
VRFT and nCbT are offline, direct, non-iterative data-driven methods used to optimise the controller. The optimal parameters of the controller are thus identified using a single I/O data set of the controlled plant. VRFT [136] and nCbT [137] are both used to select the parameters for linear time-invariant (LTI) systems. VRFT formulates the controller tuning problem by introducing a virtual reference signal for parameter identification. The nCbT method, in contrast, does not introduce a virtual signal and, because it uses a correlation-based approach, performs better than VRFT when the data is noisy.
Given that VRFT and nCbT are offline methods, if any changes are made to the plant, the controller's parameters must be re-tuned. A drawback of VRFT is that its performance relies on whether the system dynamics are sufficiently captured in the data set collected through the plant's sensors. For extensions of VRFT, the reader is referred to: the application of VRFT to nonlinear systems [138], which, in contrast to the linear implementation, is an iterative method; an extension of VRFT to MIMO systems [139]; and the study of the robustness and other extensions of this method [140].
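The virtual reference idea can be illustrated with a deliberately small case. The sketch below is an assumption-laden simplification, not the algorithm as published in [136]: it tunes a single proportional gain `theta`, omits the prefilter normally used in VRFT, assumes noise-free data, and uses a hypothetical integrator plant and a first-order reference model with parameter `alpha`. All function and variable names are illustrative.

```python
def vrft_proportional_gain(u, y, alpha):
    """Tune theta in u = theta * (r - y) from one open-loop data set.

    Assumed reference model: y_d(t+1) = (1 - alpha) * y_d(t) + alpha * r(t).
    """
    num, den = 0.0, 0.0
    for t in range(len(u)):
        # Virtual reference: the r(t) for which the reference model would
        # reproduce the measured output y(t+1).
        r_virtual = (y[t + 1] - (1.0 - alpha) * y[t]) / alpha
        e_virtual = r_virtual - y[t]      # virtual tracking error
        num += u[t] * e_virtual           # least-squares fit of u(t) = theta * e(t)
        den += e_virtual * e_virtual
    return num / den


# Hypothetical experiment on an integrator plant y(t+1) = y(t) + b * u(t).
b = 0.5
u_data = [1.0, -0.5, 0.25, 2.0, -1.0, 0.75]
y_data = [0.0]
for u_t in u_data:
    y_data.append(y_data[-1] + b * u_t)

theta = vrft_proportional_gain(u_data, y_data, alpha=0.2)
print(f"tuned gain: {theta:.4f}")
```

For this plant and reference model the matching gain is `alpha / b`, so the single batch of data recovers it without any iteration and without the plant model ever being identified, which is the property the text attributes to non-iterative tuning.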
IFT and CbT are iterative data-driven controller tuning methods. IFT [141] is a model-free method which, at each successive iteration, optimises the fixed-structure controller's parameters using the feedback received from the closed-loop system. This technique is suited to precise, repetitive tasks. IFT applies the quasi-Newton method, a gradient-based method that has its own drawbacks [142]: the convergence rate relies on how well the positive-definite matrix is approximated, and the method is demanding with respect to both storage and computation [143]. CbT is a correlation-based tuning method and is closely related to IFT. However, it differs in how the gradient estimates are obtained, and CbT uses only one experiment per iteration. The reader is referred to [144], [145] for the extension of CbT to MIMO systems.
A summary of the controller tuning methods is tabulated in Table 7, and the literature on controller tuning is tabulated in Table 6.

VII. MODEL-FREE CONTROL
MFAC [148], as the name suggests, does not require precise quantitative knowledge of the system. This DDC method has been favoured as it simply uses online or offline I/O data measurements of the controlled system to determine the control policy and has the potential to adapt to environmental changes or disturbances [149], without the explicit use of parametric or non-parametric models of the system to be controlled during adaptation [21]. Properties of MFAC include not requiring system identification, controller tuning, controller design specific to the process, or an exact mathematical model representing the system's dynamics (including nonlinear dynamics). In addition, closed-loop stability analysis is available to guarantee stability [149].

Table fragment (reference, year, approach):
[141] (1998), [146] (2002): IFT.
[136] (2000): VRFT.
[139] (2004), [138] (2006): Extension of VRFT for MIMO and nonlinear systems.
[144] (2004): CbT.
[140]: Study of robustness on VRFT.
[147] (2012): nCbT.
Given the many advantages of model-free adaptive controllers, the potential of extending their capabilities to various applications in the automatic control industry is currently being studied and applied. An example is the direct replacement of PID controllers used in SISO systems by model-free adaptive controllers, with the advantage of omitting the controller tuning step [149]. The applicability of the MFAC framework to MIMO systems is both an attractive characteristic and the reason behind the attention these frameworks are currently receiving.
A summary of the advantages of the data-driven learning methods for policy derivation discussed in this section includes that they do not rely on the exact physics and mathematical modelling of the considered system and can adapt the control law, which better encompasses dealing with the disturbances' negative effects and the effects of parameter variation. However, irrespective of the potential of data-driven control methods for policy derivation, they come with the drawbacks of slow convergence and the possibility of not being able to interpret the learned control law [63].
A comparison table is presented in [21] that classifies various control methods based on the following characteristics: whether online data, offline data or both are used; whether the method is suitable for SISO or MIMO systems; whether the design captures nonlinear dynamics or only LTI systems; whether the optimal policy is iteratively updated or directly learnt from a single data set; whether the RL algorithm is on-policy or off-policy; whether the algorithm is NN based; and their respective computational demands.
A particular distinction between model-free control techniques is whether or not the technique is NN based. Non-NN based and NN based techniques are discussed in Section VII-A and Section VII-B, respectively.

A. NON-NEURAL NETWORK BASED METHODS
Prominent non-NN based methods used in policy derivation include DLT, ILC and LL, which are discussed in Section VII-A1, Section VII-A2 and Section VII-A3 respectively.

1) Dynamic Linearisation Techniques
Earlier work on MFAC studied the application of DLT for discrete-time SISO nonlinear systems [148], [150]- [153]. Given that this is a model-free framework, a sequence of identical local dynamic linearisation data models were built along the closed-loop system's dynamic operation points using a DLT, with a pseudo-partial derivative (PPD). The I/O measurement data of a controlled plant is used to estimate the time-varying PPD, which is iteratively updated. The DLT includes compact-form dynamic linearisation (CFDL), partial-form dynamic linearisation (PFDL), and full-form dynamic linearisation (FFDL). The reader is referred to [149], [153] for details on these methods, for which stability and convergence can be proven under certain assumptions. Most of these methods have been designed for SISO nonlinear plants; however, they cannot be directly extended and applied to MIMO systems without addressing input coupling. These are discussed in [153]- [155]. These methods are favourable as they do not require external training or testing. However, their computational burden and the impractical assumptions made to prove stability and convergence discourage them from being used.
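The interplay between the PPD estimate and the control input in the compact-form case can be sketched as below. The update laws are the projection-type estimator and one-step-ahead control law commonly stated for CFDL in the MFAC literature; since the text does not reproduce them, their exact form here, the parameter names (`eta`, `mu`, `rho`, `lam`, `eps`), the reset rule and the test plant are all assumptions made for illustration. The plant is linear only so that the example is easy to check; the controller uses nothing but its measured inputs and outputs.

```python
def cfdl_mfac(plant, setpoint, steps, eta=1.0, mu=1.0, rho=0.5, lam=1.0,
              phi_init=1.0, eps=1e-5):
    """Compact-form dynamic linearisation MFAC for a SISO plant (sketch).

    Assumes the true PPD is positive, so phi_init > 0 is used as the reset value.
    """
    y = [0.0, 0.0]    # y(k-1), y(k)
    u = [0.0, 0.0]    # u(k-2), u(k-1)
    phi = phi_init    # current PPD estimate
    outputs = []
    for _ in range(steps):
        du_prev = u[-1] - u[-2]
        dy = y[-1] - y[-2]
        # Projection-type PPD estimate from the latest I/O increments.
        phi += eta * du_prev / (mu + du_prev ** 2) * (dy - phi * du_prev)
        # Reset keeps the estimate positive and avoids stalling on tiny increments.
        if phi <= eps or abs(du_prev) <= eps:
            phi = phi_init
        # Control law: move the input in proportion to the tracking error.
        u_new = u[-1] + rho * phi / (lam + phi ** 2) * (setpoint - y[-1])
        y_new = plant(y[-1], u_new)   # measured response; the model is never used
        u = [u[-1], u_new]
        y = [y[-1], y_new]
        outputs.append(y_new)
    return outputs


def hypothetical_plant(y, u):
    """Stand-in for the real process; unknown to the controller."""
    return 0.5 * y + u


outputs = cfdl_mfac(hypothetical_plant, setpoint=1.0, steps=200)
print(f"output after 200 steps: {outputs[-1]:.6f}")
```

The penalty `lam` limits how aggressively the input changes, which is the main tuning handle in this sketch; larger values slow the response but make the loop less sensitive to a poor PPD estimate.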

2) Iterative Learning Control
ILC [65], [66] is well-suited to systems that perform repetitive operations, as it tracks the output errors and actuations from previous iterations. Subject to assumptions on the plant, ILC guarantees convergence as the number of iterations approaches infinity. ILC is a model-free data-driven adaptive control method that requires very little knowledge of the plant and uses both online and offline data to directly determine and update the control policy. The reader is referred to the following literature surveys on ILC [68], [156]-[160].
Critically reviewing the ILC method, it is highlighted that the performance of this method with respect to convergence to the desired trajectory relies on unrealistic assumptions, making it an unrealistic method to apply to plants with significant uncertainty [161]- [164].
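The trial-to-trial update at the heart of ILC can be illustrated with a proportional-type (P-type) learning law, one of the simplest variants. The sketch below is an assumption for illustration only: the first-order plant in `run_trial`, the ramp reference, the learning gain of 0.8 and the identical initial condition at the start of every trial are all hypothetical, and the last of these is exactly the kind of idealised assumption criticised above.

```python
def run_trial(u, a=0.5, b=1.0):
    """One repetition of the task on a hypothetical plant, starting from rest."""
    y = [0.0]
    for u_t in u:
        y.append(a * y[-1] + b * u_t)
    return y


def p_type_ilc(reference, iterations, learning_gain=0.8):
    """Refine the whole input trajectory from trial to trial using the last error."""
    horizon = len(reference) - 1
    u = [0.0] * horizon
    max_errors = []
    for _ in range(iterations):
        y = run_trial(u)
        # The error observed at t+1 is attributed to the input applied at t.
        e = [reference[t + 1] - y[t + 1] for t in range(horizon)]
        max_errors.append(max(abs(e_t) for e_t in e))
        u = [u[t] + learning_gain * e[t] for t in range(horizon)]
    return u, max_errors


reference = [t / 10 for t in range(11)]   # ramp from 0 to 1 over 10 steps
u_learned, max_errors = p_type_ilc(reference, iterations=30)
print(f"max error, first trial: {max_errors[0]:.3f}; last trial: {max_errors[-1]:.2e}")
```

The learning law never uses the plant parameters `a` and `b`; it only needs the gain to be small enough relative to the plant's first input-to-output response for the error to contract between trials.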

3) Lazy Learning
LL is a class of supervised machine learning algorithms that has been applied to the control field [165]. LL was developed to build a relationship between the input and output data. Historical data is used as the training set. In addition, LL algorithms use online data for real-time updating. Examples of LL methods include K-nearest neighbours, local regression and lazy naive Bayes rules. LL is a powerful technique. However, its drawbacks are its high computational cost, its requirement for large amounts of training data, the impact of noisy training data on the training phase, and the lack of theoretical analysis [166].
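As a concrete instance of the lazy approach, a K-nearest-neighbours regressor defers all computation to query time and is updated online by simply storing new samples. The sketch below is a minimal one-dimensional illustration; the function name `knn_predict` and the data are hypothetical.

```python
def knn_predict(inputs, outputs, query, k=2):
    """Lazy prediction: nothing is fitted until a query arrives."""
    # Rank the stored samples by distance to the query point.
    nearest = sorted(range(len(inputs)), key=lambda i: abs(inputs[i] - query))[:k]
    # Local model: plain average of the k nearest stored outputs.
    return sum(outputs[i] for i in nearest) / k


# Hypothetical historical I/O data used as the training set.
inputs = [0.0, 1.0, 2.0, 3.0]
outputs = [0.0, 1.0, 2.0, 3.0]
print(knn_predict(inputs, outputs, query=1.4))

# Real-time updating: a new online sample is appended, with no retraining step.
inputs.append(1.5)
outputs.append(1.5)
print(knn_predict(inputs, outputs, query=1.4))
```

Every query scans the whole stored data set, which makes the computational cost noted above visible: the work grows with the amount of data retained rather than being paid once during training.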

B. NEURAL NETWORK BASED LEARNING
NN parameterised model-free adaptive controllers use NN structures to implicitly represent the system's dynamics. The development of NN based optimal control techniques is commonly classified as RL for optimal control, event-based control, signal processing, machine intelligence for control and intelligent control, amongst others. NNs are used in MFAC by creating a multilayer perceptron NN with weight factors updated as the controller's behaviour varies. The adaptation of the weights assists in iteratively reducing the error value. The 'memory' characteristic of the controller is valuable and provides adaptive characteristics which make it suitable for learning-based techniques.

[Table 7 caption: Overview of data-driven controller tuning techniques, distinguished based on the use of online and offline data, if the method is iterative or non-iterative, applicable to fixed or adaptive controllers, and if the controller tuning method is suitable for linear or nonlinear systems.]
RL, derived from neutral stimulus and reaction, is a machine learning method that envelops both supervised and unsupervised learning. The increased popularity of RL algorithms is attributed to their success in addressing sequential decision-making problems [18]. RL algorithms aim to develop agents that learn how to take favourable actions in an environment so as to maximise the cumulative reward. RL methods are particularly used when the state-action space is too large to be completely known but experience samples are available, or when the model is unknown but experiences can be sampled to determine a policy. RL uses NNs to approximate this policy function or a value function.
The three methods used in RL to determine the optimal policies are Dynamic Programming (DP), Monte Carlo (MC) methods and Temporal Difference (TD) methods. Of these three, DP is mathematically well established but is model-based. The MC method is model-free but does not use online data, hence the estimate of the value function is updated only at the end of the episode [167], [168]. The TD method is model-free and is implemented using online data that can be used to update the value function.
1) Dynamic Programming: Given that the model precisely encapsulates the plant dynamics, DP can deterministically find the optimal policy; however, it is unrealistic to expect an accurate model of non-trivial systems. Popular DP methods include policy iteration and value iteration methods [24], [75].
2) Monte Carlo Method: MC finds the optimal policy by estimating the average returns for different policies by sampling multiple sequences of states, actions and rewards under the determined policy. MC is most suitable for systems that have finite tasks with explicit terminal states [24], [169].
3) Temporal Difference: The TD method is widely used in RL as it has a relatively cheap computational cost and can learn from experiences (like MC methods) with bootstrapping (like DP methods). Furthermore, TD is a model-free method and instead learns the dynamics from interactions with the system. Another favourable characteristic of TD is that it does not require waiting until the end of a training episode to update the value function [24], [170].
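The difference in when MC and TD methods update their value estimates can be made concrete on a toy problem. The sketch below is only an illustration under stated assumptions: the three-state deterministic chain, the rewards, the discount factor of 0.9, the step size and all function names are hypothetical, and both routines evaluate the single fixed policy that the chain admits.

```python
GAMMA = 0.9
REWARDS = [0.0, 0.0, 1.0]   # reward for leaving states 0, 1, 2; state 3 is terminal


def run_episode():
    """Hypothetical deterministic chain 0 -> 1 -> 2 -> terminal."""
    return [(s, REWARDS[s], s + 1) for s in range(3)]   # (state, reward, next state)


def td0(episodes, alpha=0.1):
    """TD(0): update after every transition, bootstrapping from v[next state]."""
    v = [0.0, 0.0, 0.0, 0.0]   # v[3] belongs to the terminal state and stays 0
    for _ in range(episodes):
        for s, r, s_next in run_episode():
            v[s] += alpha * (r + GAMMA * v[s_next] - v[s])
    return v[:3]


def monte_carlo(episodes):
    """Every-visit MC: update only once the episode's returns are known."""
    v = [0.0, 0.0, 0.0]
    visits = [0, 0, 0]
    for _ in range(episodes):
        g = 0.0
        for s, r, _ in reversed(run_episode()):
            g = r + GAMMA * g                 # return from state s onwards
            visits[s] += 1
            v[s] += (g - v[s]) / visits[s]    # running average of observed returns
    return v


print("TD(0):", [round(x, 3) for x in td0(episodes=500)])
print("MC:   ", [round(x, 3) for x in monte_carlo(episodes=1)])
```

Both estimates approach the same values (0.81, 0.9 and 1.0 for this chain). MC needs the complete episode before it can update, whereas TD(0) adjusts each state immediately but relies on its own earlier estimates of the successor state.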
Adaptive Dynamic Programming (ADP) [76], [77] is an extension of DP and an optimal control scheme [171] that is suitable for linear plants with quadratic objective functions over an infinite horizon. This method can be extended to nonlinear plants, models with different cost functions, and systems defined for finite horizons. The reader is referred to the literature surveys [172], [173] for the development of ADP. The application of NNs to DP problems was proposed to derive the value function, such that the framework is model-free and robust to disturbance. ADP is a TD learning method that updates the current estimate of the value function at each iteration or over a few iterations, rather than only at the end of a full episode [170], which is an attractive characteristic. Some prominent ADP NN based schemes with an adaptive critic structure include Q-learning [174]-[177], SARSA [167] and AC methods.

DP finds the optimal policy in a deterministic fashion; however, it is model-based and computationally demanding for complex tasks. Asynchronous or offline DP methods have been developed; however, they perform poorly when less common states are encountered. Both TD and MC approximate DP solutions using less computational power and are model-independent. The MC method finds the optimal value by averaging the value function over the sampled trajectories of states, actions and rewards; unfortunately, the variance across the sampled trajectories is high. TD combines the ideas of DP and MC methods into one unifying algorithm. TD methods learn from sampled data, as in MC methods, while also performing mid-trajectory learning, as in DP. However, TD methods experience high bias due to estimating values from previously estimated values, which is commonly referred to as bootstrapping. For a comprehensive introduction to these methods, the reader is referred to [18].
A summary of these methods' characteristics is tabulated in Table 8.
Data-driven optimal control is where RL meets control theory. The controller is designed using input-output data from the system, which is passed through NN based control methods or intelligent methods, commonly referred to as 'black-box' approaches, which implicitly learn the system's dynamics. In contrast to model-based control systems, the explainability, robustness and stability provided by deterministic models are either not available or still being studied [21]. RL methods commonly model the problem as a Markov Decision Process (MDP). An MDP is a multi-stage discrete-time representation of the stochastic optimal control problem and a classical formulation of sequential decision making in which both immediate and future rewards are considered [18], [178]. MDPs can be expressed as a tuple ⟨S, A, P, R, γ⟩, where S is the set of states s, A is the set of actions a, P is the set of state transition probabilities p, R is the set of rewards r, and γ ∈ [0, 1] is the discount factor applied to the rewards [179]. The sets of states and actions are specific to time t; hence, at any given time t, the set of states s_t is a subset of S, and similarly for actions, state transition probabilities and rewards. The reader is referred to [18], [24] for details on the three different MDPs: fully observable MDPs (FOMDP), partially observable MDPs (POMDP) and semi-MDPs (SMDP).
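To make the tuple concrete, the sketch below stores the five elements ⟨S, A, P, R, γ⟩ in a small container and shows how the discount factor weights immediate against future rewards. The class layout, the two-state example and every name in it are assumptions made for illustration; the text does not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list          # S
    actions: list         # A
    transitions: dict     # P: (s, a) -> {next state: probability}
    rewards: dict         # R: (s, a) -> immediate reward
    gamma: float          # discount factor in [0, 1]


def transitions_are_valid(mdp):
    """Each (state, action) pair must define a probability distribution."""
    return all(abs(sum(dist.values()) - 1.0) < 1e-9
               for dist in mdp.transitions.values())


def discounted_return(reward_sequence, gamma):
    """Sum of r_t, r_{t+1}, ... with the k-th future reward weighted by gamma**k."""
    total = 0.0
    for r in reversed(reward_sequence):
        total = r + gamma * total
    return total


# Hypothetical two-state process.
two_state = MDP(
    states=["low", "high"],
    actions=["wait", "push"],
    transitions={
        ("low", "wait"): {"low": 1.0},
        ("low", "push"): {"high": 0.8, "low": 0.2},
        ("high", "wait"): {"high": 0.5, "low": 0.5},
        ("high", "push"): {"high": 1.0},
    },
    rewards={("low", "wait"): 0.0, ("low", "push"): -1.0,
             ("high", "wait"): 2.0, ("high", "push"): 1.0},
    gamma=0.5,
)
print(transitions_are_valid(two_state))
print(discounted_return([1.0, 1.0, 1.0], two_state.gamma))
```

With γ = 0.5, three unit rewards are worth 1 + 0.5 + 0.25 = 1.75; setting γ closer to 1 makes the agent weigh long-term rewards more heavily, while γ = 0 makes it consider only the immediate reward.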
The RL paradigm, as shown in Fig. 6, consists of two components, the agent and the system. Compared with the closed-loop controller depicted in Fig. 2a, the controller is simply replaced with an agent in the RL paradigm. The agent, which is the decision-maker, is continuously learning and updating its policy. The agent attempts to learn and conquer the system through meaningful sequential interactions with the system. The system comprises everything the agent cannot arbitrarily change.
Relating to the overview of process control, Fig. 2a, the agent would be the controller's logic, and everything else would make up the system. The RL algorithm's decision-making process is formalised in the MDP.
The optimal solution to a RL problem refers to the policy that generates the highest reward over a trajectory. Formally, the optimal policy must satisfy the principle of optimality, which is defined as: the policy $\pi^*$ is optimal if and only if $V^{\pi^*}(s) \ge V^{\pi}(s)$ for all $s \in S$ and all policies $\pi \neq \pi^*$ [180].
Two main model-free methods used in RL algorithms are value-based and policy-based methods. AC approaches are hybrid approaches that employ both value functions and policy search [130].
1) Value-Based Methods: Value-based methods do not store an explicit policy but rather a value function from which the policy can be implicitly obtained. The value function V returns the expected value of the return R of being in an initial state s and subsequently following the policy π, is defined by the state-value function as follows The optimal state-value function is the corresponding state-value function for the optimal policy π * , defined by Using V * (s), the optimal policy could be derived by choosing all the actions available at s t and selecting the action a that maximises E st+1∼τ (st+1|st,a) [V * (s t+1 )].
The transition dynamics τ is not available, hence the state-action function is constructed. The state-action function returns the expected value given the initial action a and the policy π is subsequently followed from the initial state, the state-action value function is defined as Given the state-action value function Q π (s, a), the optimal policy can be retrieved by greedily choosing the action with the highest value. Under this policy, the value function can be defined by maximising Q π (s, a): V * (s) = max a Q π (s, a) [181]. Prominent value-based methods are SARSA and Qlearning. Value-based methods are best suited for when using a finite set of actions, rather than continuous action space problems. 2) Policy-Based Methods: Policy-based methods directly learn the optimal control policy π * and do not need to maintain a value function model. Frequently, a parameterised policy with respect to θ, π θ is chosen. The parameter are selected to maximise the expected return E [R|θ] using either gradient-based or gradientfree optimisation [181]. Successfully trained NNs with encoded policies are discussed for both gradient-based methods in [182] and gradient-free methods in [183].
Policy-based methods are discussed in detail by [184].
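As a hedged illustration of gradient-based policy search, the sketch below maximises the expected return of a softmax policy over its parameters using sampled episode returns (a REINFORCE-style update). The two-action task, its reward values, the learning rate and the function names `softmax` and `train_policy` are assumptions made for the example. Each episode has a single step, so the MC return is simply the immediate reward.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences theta into probabilities pi_theta(a)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(rewards, episodes=2000, lr=0.1, seed=0):
    """Gradient ascent on E[R | theta] using sampled episode returns."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(episodes):
        probs = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        episode_return = rewards[action]  # one-step episode: return = reward
        # d/d theta_k of log pi_theta(action) is 1[k == action] - pi_k
        for k in range(len(theta)):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * episode_return * grad_log
    return softmax(theta)


# Hypothetical two-action task: action 1 yields the larger return.
probs = train_policy(rewards=[0.2, 1.0])
print(probs)
```

No value function is stored at any point; only the policy parameters are updated, and only after the return of a completed episode is known.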
Policy-based methods are useful when the action space is continuous or stochastic. One disadvantage of policy-based methods is that they use the MC technique, which relies on the total reward of an episode. As a result, the agent has to traverse an entire episode before any learning occurs, which potentially results in a high variance when there are drastic changes.
3) Actor-Critic Methods: The AC method, shown in Fig. 7, combines value functions with policy search methods. AC methods are TD methods with two independent memory structures representing the policy and the value function. The actor-network determines how the agent behaves (policy-based) by returning a probability distribution over the actions that the agent can take in a given state. The critic-network measures how good the action taken is (value-based). AC methods are TD learning methods that do not use the total reward. Instead, a critic model approximates the value function at each discrete timestep, unlike policy-based methods based on MC, which speeds up learning. The value function replaces the reward function of a policy gradient algorithm that calculates rewards at the end of the episode [181], and is instead updated within the episode. AC is an on-policy method with two separate parametric structures represented by NNs: the actor-network for optimal policy evaluation and the critic-network for the value function. The actions taken by the agent, or actor-network, are evaluated by the critic, which represents the reward function and the objective function, using the TD approach [168].
Q-learning and AC methods are prominently used in data-driven learning-based MFAC systems. Q-learning [185] is an RL method that aims to learn the value of applying an action in a particular state. Q-learning is considered an adaptive control method because of its inherent ability to handle stochastic transitions and rewards without requiring adaptation.
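The interplay between actor and critic in the AC discussion above can be sketched with a tabular one-step actor-critic. This is a minimal illustration under stated assumptions, not the NN-based structure of Fig. 7: the three-state chain, the discount factor, the learning rates and the names `step`, `action_probs` and `train` are all hypothetical, and lookup tables stand in for the actor-network and critic-network.

```python
import math
import random

GOAL = 2     # terminal state of a hypothetical 3-state chain
GAMMA = 0.9  # discount factor (assumed)


def step(state, action):
    """Action 1 moves right, action 0 moves left; reward 1 on reaching GOAL."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def action_probs(prefs):
    """Actor: softmax over the two action preferences of one state."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(episodes=1000, lr_actor=0.5, lr_critic=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(GOAL)]  # actor memory structure
    value = [0.0] * (GOAL + 1)                 # critic memory; V(GOAL) stays 0
    for _ in range(episodes):
        state = 0
        for _ in range(100):                   # cap on episode length
            probs = action_probs(theta[state])
            action = rng.choices([0, 1], weights=probs)[0]
            nxt, reward = step(state, action)
            # TD error: the critic's within-episode evaluation of the action
            td_error = reward + GAMMA * value[nxt] - value[state]
            value[state] += lr_critic * td_error
            for a in (0, 1):
                grad_log = (1.0 if a == action else 0.0) - probs[a]
                theta[state][a] += lr_actor * td_error * grad_log
            state = nxt
            if state == GOAL:
                break
    return theta, value


theta, value = train()
print([action_probs(t)[1] for t in theta])  # probability of moving right
```

Both structures are updated at every timestep from the TD error, so learning does not wait for the episode to finish.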
Q-learning is formulated for a finite-state, finite-action MDP and derives the optimal policy by maximising the expected value of the total reward over successive iterations. Q-learning and Deep Q Network (DQN) are off-policy methods with a slow convergence rate but high efficiency. Unlike the value-based method, Q-learning and AC methods guarantee convergence for nonlinear methods, have a reduced variance estimate of the expected value, and sample efficiently via the TD updates [168].
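A minimal tabular sketch of Q-learning on a finite-state, finite-action problem is given below. The three-state chain, the hyperparameters and the names `step`, `epsilon_greedy` and `q_learning` are assumptions chosen for illustration. The update is off-policy in the sense that the agent behaves epsilon-greedily while the TD target uses the maximum over next-state action values.

```python
import random

GOAL = 2  # terminal state of a hypothetical 3-state chain


def step(state, action):
    """Action 1 moves right, action 0 moves left; reward 1 on reaching GOAL."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def epsilon_greedy(q_row, epsilon, rng):
    """Behaviour policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(GOAL)]  # Q(s, a) for non-terminal states
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = epsilon_greedy(q[state], epsilon, rng)
            nxt, reward = step(state, action)
            # TD target uses the greedy value of the next state (off-policy)
            best_next = 0.0 if nxt == GOAL else max(q[nxt])
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = nxt
    return q


q = q_learning()
greedy_policy = [max((0, 1), key=lambda a: row[a]) for row in q]
print(q, greedy_policy)
```

The greedy policy is read off the learned table, so no explicit policy is stored.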
RL algorithms use the three aforementioned methods, DP, MC and TD, to solve for the optimal policy coupled with value-based, policy-based and AC methods. DQN [186], Deterministic Policy Gradient (DPG) [187], Deep DPG (DDPG) [188], Trust Region Policy Optimisation (TRPO) [189] and Proximal Policy Optimisation (PPO) [190] are major contributions to the field of RL and have been widely applied to control systems in determining the optimal policy. A summary of the characteristics of these RL methods is described in Table 10. The reader is referred to the following reviews on RL methods [20], [23], [24], [191].
A summary of the properties of the MPC method and an array of the data-driven optimal control methods are tabulated in Table 9, a summary of notable related literature on DDC for MFAC is given in Table 11, and Table 12 tabulates the learning-based data-driven applications.

VIII. EMERGING TRENDS
The ultimate goal of automated control would be to develop a uniform data-driven framework that is based solely on the I/O measurements and is widely applicable to various industries. RL and deep RL do hold promise in this regard; however, they are still too immature to achieve this for complex systems.
[153] (2011) ‡: DLT, PPD, CFDL and PFDL.
[199]: An overview of model-based and data-driven adaptive control.
[201] (2013) ‡‡, [202] (2016) ‡‡, [203] (2017) ‡‡: RL and machine intelligence reviews for optimal control.
[204] (2020) ‡‡: A survey on the recent advances in robot LfD.
† Key methods used in the development of the MFAC framework. ‡ Non-NN based. ‡‡ NN based.

Recently, complementary model-based and data-driven control frameworks like DDMPC, and the use of data in the study of controllers, system identification and uncertainty modelling, have been prominent over modular methods. The gap in the literature is the application and development of multi-scale and hierarchical learning structures, such as using learning methods alongside model-based controllers or pre-processed offline data, which could be used in feature extraction. Furthermore, the literature on the handling of uncertainty accounts only for measurement noise and not for irregularities such as time delays or feedback over varying time intervals.
It is highlighted that the theoretical analysis of model-free methods has not been established. Stability, robustness and convergence guarantees are necessary properties in process control. However, proving stability for nonlinear systems and for model-free frameworks [70] is not trivial. Challenges in proving stability and convergence under stochastic conditions include proving effectiveness in terms of performance, learning rate and the utilised reward function [18], [167]. This is one of the main challenges with RL. Recent works on the formalisation and theoretical analysis include [5], [129], [212].
Several other areas of improvement of RL methods include accounting for data inefficiency, constraint handling, means to discourage policies from arriving at intractable states, and the construction of representative simulators. Data inefficiency refers to the lengthy periods of training data required for policy derivation and initial agent training, especially if simulators cannot be used in the training process due to their inaccuracies. Emerging approaches to inject prior knowledge into the agent and improve data efficiency include transfer learning [213], [214]; a replay buffer, or experience replay [186], [215], [216], as used by DQN; and eligibility traces, which increase learning efficiency by combining TD and MC methods into a unifying algorithm that allows agents to update multiple value estimates per iteration, like MC, without waiting for the termination of an episode [18]. Alternative methods to increase the rate of the training process include exploiting heuristics for RL, such as heuristically accelerated RL (HARL) [217], [218], and meta RL [219]-[221], which use simulations to train the agent; RNN is a common algorithm used in this regard. Finally, alternative methods suggest using two modular structures instead, one for offline decision making and another for online high-level RL.
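The replay buffer mentioned above can be sketched as a fixed-capacity store of past transitions that is sampled uniformly during training. The class name `ReplayBuffer`, its method names and the tuple layout are assumptions for this illustration and are not taken from the DQN implementation in [186].

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        """Record one observed transition."""
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniformly sample stored transitions without replacement."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Usage: five transitions pushed into a buffer that keeps only the latest three.
buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.push(t, 0, 0.0, t + 1, False)
batch = buffer.sample(2)
print(len(buffer), batch)
```

Reusing stored transitions in this way lets each interaction with the plant or simulator contribute to several updates, which is the data-efficiency argument made above.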
Another critical challenge of using RL in process control is scalability. Emerging trends include using multi-agent RL methods [222] and LfD [4]. Since exact methods are not feasible for problems with more than 100 states [24], [186], [189], [216], recent work on problems with numerous states has used multi-agent RL to achieve optimality [223]. Q-learning and other deep RL methods have been useful for various industrial applications.
The promise of RL agents in a plethora of industries can be unlocked with the development of robust, well-performing agents. The objective of RL algorithms is to develop an agent that takes actions to maximise the cumulative reward. The training of these agents requires a high volume of trial-and-error episodes in a given environment to optimise for the given reward function. In light of safety and cost considerations, high-fidelity simulators are necessary, especially to drive research on the development of RL-based algorithms [224]-[228]. With the increase in computing power and the availability of vast amounts of data, it is possible to develop simulators that apply a mathematical function to input data and return an output. Commonly used simulators include MATLAB Simulink and ANSYS for engineering problems, and Gazebo and MuJoCo for robotics and physics-based simulations; Bottleneck simulators, which are model-based RL simulators, have also been proposed [229], amongst others.
Extensions to simulators include digital twins [230]-[232]. Digital twins provide a real-time virtual counterpart of physical systems or processes. A digital thread is a data pipeline used to obtain data through sensors from the design stage, through the build stage and, finally, to the operation of the physical system or end product. The obtained data is then fed back to the digital twin. By amalgamating the information from the digital thread with the digital twin, performance information can be extracted, and credible updates can then be applied along the design, production, and end product or system stages. This provides a means to holistically optimise the end-to-end process. Both manufacturing and engineering industries are moving from knowledge-based intelligent processes to data-driven and knowledge-enabled smart processes [233], [234]. The former has been used for informed decision making, whilst the latter uses real-time transmission and analysis of data across the end-to-end process with the aid of simulators and optimisation mechanisms, providing positive impacts throughout the process. These techniques are used not only to improve the performance of end-to-end cycles of engineering or manufacturing, but are also suggested for building resilient models by incorporating preventative measures that account for disruption risks. These frameworks make use of cyber-physical integration and digital twins. The reader is referred to [214] for manufacturing applications using digital twins and cyber-physical systems, [235] for the discussion of managing disruption risks, and [236] for a survey on digital twin technologies, techniques and engineering perspectives.
Through the evolution of controller frameworks, NN-based techniques have been particularly prominent for MFAC. These NN-based control policy derivation techniques have been critically discussed in this review. In most cases, these black-box techniques improve the control method and thereby the system performance; however, they provide little insight into the updates and developments made through the stages of training and adaptation. White-box models or techniques which are explainable and interpretable in both design and inner logic are crucial to unleashing further enhancements in controller designs, making context-based recommendations and increasing user trust through transparency [237]-[241]. This area of research is commonly referred to as Explainable Artificial Intelligence (XAI).
A summary of references related to emerging trends is given in Table 13.

IX. CONCLUSION
The development from model-based predictive control to data-driven control techniques is motivated by eliminating the step of mathematically modelling plants, especially complex nonlinear ones, by developing policies robust to disturbances directly from I/O data, and by using data to fine-tune fixed-design model-based controllers.
It is highlighted that model-based frameworks are restricted by the accuracy of the mathematical model representing the plant. However, if the model accurately represents the plant, the model-based framework with fixed controllers is robust. The paradigm of learning the control policy directly from the feedback signal has been prominent in the recent past, as it discards the requirement of modelling the physics of the plant; as a result, however, it has to explore a greater search space in deriving the optimal policy.
In this review, the taxonomy and timeline of data-driven control techniques were given, and the corresponding references have been summarised in the respective sections. It is noted that there is an overlap of studies between the control and the RL communities working on developing robust adaptive optimal policies using online I/O data from the controlled plant. Drawbacks of these methods include having to optimise weight functions, parameters and other coefficients of the learning functions to improve their performance. These methods are powerful and hold promise, but their potential is restricted by the limited theoretical analysis of convergence, stability and robustness.
Future research in this field would focus on providing the theoretical analysis for RL-based methods; constructing high-fidelity simulators, which in turn would catalyse development and research in this field; providing insight into learning-based black-box techniques such that they are interpretable and explainable, commonly referred to as XAI; and optimising the end-to-end process of developing and actualising control frameworks with the aid of digital twins and digital threads. The ultimate goal is to develop a uniform framework that can be used for adaptive optimal control across various applications: a framework that is independent of controller tuning and system identification.
KRUPA PRAG is a postgraduate student at the University of the Witwatersrand, Johannesburg, South Africa. She is an Associate Lecturer in the School of Computer Science and Applied Mathematics at the University of the Witwatersrand. Her research interests include optimisation, optimal control theory and computational intelligence.
MATTHEW WOOLWAY received the Ph.D. degree in process engineering from the University of the Witwatersrand, Johannesburg, South Africa. He is currently an industry data scientist and a Research Associate in the Faculty of Engineering and the Built Environment at the University of Johannesburg. Broadly, his research interests comprise computational intelligence, artificial intelligence and optimisation.
TURGAY CELIK received the second Ph.D. degree from the University of Warwick, Coventry, U.K., in 2011. He is currently a Professor of Digital Transformation and the Director of the Wits Institute of Data Science at the University of Witwatersrand, Johannesburg, South Africa. His research interests include signal and image processing, computer vision, machine intelligence, robotics, data science, and remote sensing. He is an Associate Editor of IET ELL, IEEE Access, IEEE GRSL, IEEE JSTARS, and Springer SIVP. VOLUME V, 2021