Active Inference Integrated with Imitation Learning for Autonomous Driving

Classical imitation learning methods struggle to learn hierarchical policies when the imitating agent faces a state never observed by the expert. To address these drawbacks, we propose an online active learning approach, based on active inference, that encodes the expert's demonstrations as observation-action pairs to improve the learner's future motion prediction. For this purpose, we provide a switching Dynamic Bayesian Network modelling the dynamic interaction between the expert agent and another object in its surroundings as a reference model, which we exploit to initialize an incremental probabilistic learning model. This learning model grows and matures dynamically, at discrete and continuous levels, in an online active learning phase based on the free-energy formulation and the message passing of active inference. In this scheme, the generalized states of the learning world are represented as a distance vector, namely the learner's observation of its interaction with a moving object. Since the distance vector entails intentions, it enables action prediction to be evaluated in a prospective sense. We illustrate these points using simulations of intelligent driving agents. The learning agent is trained using long-term predictions from the generative learning model to reproduce the expert's motion while learning how to select suitable actions through new experiences. Our results affirm that a dynamic Bayesian optimal approach provides a principled framework and outperforms conventional reinforcement learning methods. Furthermore, it endorses the general formulation of action prediction as active inference.


I. INTRODUCTION
In recent years, the demand for an intelligent agent (IA) that learns by mimicking expert behavior has grown substantially. Advances in active learning and communication technology have improved the potential of IAs to make intelligent decisions and to adapt and refine actions in various situations. Many future directions in technology rely on the ability of an IA to behave as a human would when presented with the same situation. Examples of such fields are intelligent transportation [1], autonomous systems [2]-[4] and sports tracking data [5]. In these applications, and in many intelligent tasks, the problem is to execute an action given the agent's current state and its surroundings. The number of possible scenarios in a dynamic environment is too large to cover by explicit programming, and an efficient IA must be able to handle unobserved scenarios. While such a task may be expressed as an optimization problem, transferring knowledge from an expert agent is more effective and efficient than exploring and learning from scratch [6], [7]. In addition, learning by trial and error requires a supervision signal that indicates the goal of the expected behavior. Typically this supervision comes from a reward function designed precisely for each task; since the number of actions grows exponentially in a dynamic environment, defining rewards for such problems is difficult, and in some cases even unknown. One intuitive alternative is to mimic expert behavior by transferring knowledge through observations and by following the demonstrations step by step [8].

This paper presents a learning model that performs complex sequences of actions through imitation. Imitation learning (IL) [9] works by extracting information from the expert agent's behavior and its interaction with the surrounding environment to build a mapping between observations and demonstrated behavior. As in traditional supervised learning, the instances in IL are pairs of states and actions; the state represents the agent's pose and, if one exists, its status with respect to an attractor. Markov decision processes (MDPs) [10] are therefore commonly used to represent expert demonstrations in an IL context. The Markov property dictates that the next state depends only on the previous state and action, eliminating the need to include earlier states in the state representation [11]. A typical IL procedure encodes the collected expert demonstrations as state-action pairs and uses them for policy training. However, a direct mapping between state and action is not enough; it can fail due to issues such as insufficient demonstrations or a task that differs because of environmental changes, such as obstacles. Therefore, IL frequently involves a further step in which the learner agent refines the estimated policy based on its current situation. This self-improvement can be driven by a quantifiable reward or learned from instances, and many such approaches fall under reinforcement learning (RL). RL allows encoding desired behavior - such as reaching the target and avoiding collisions - and does not rely only on perfect expert demonstrations. In addition, RL maximizes the overall expected return over an entire trajectory, while IL treats every observation independently [12], which conceptually makes RL superior to IL. The critical drawbacks of IL are that the learned policy never exceeds the suboptimal expert performance and that the performance of IL remains highly reliant on the quality of the expert policy.
As real-world applications often need high sample efficiency, it is crucial to find a way to integrate IL and RL effectively. Several recent efforts have attempted to combine RL and IL [13]-[15]. These approaches incorporate the cost information of the RL problem into the imitation process, so the learned policy can both improve faster than its RL counterpart and outperform the expert policy. Despite reports of improved empirical performance, the theoretical understanding of these combined algorithms is still fairly limited [16], [17]. The integration of both modalities, RL and IL, enables the learning of complex skills from raw sensory observations [18]. RL is widely used in IA learning to develop decision-making sequences that maximize the reward toward a future goal. It specifies which states and actions are desirable through sequential interaction with the environment and other agents [19]. While such reward specifications can be sufficient to produce optimal behavior, they represent a significant barrier to the broader applicability of RL in complex observation settings where multiple factors affect the reward signal [20]. Inverse reinforcement learning (IRL) [21] bypasses this issue by assuming that an agent receives sequences of observation-action tuples. It tries to learn how to map observations to actions from these sequences by estimating a reward function. By approximating this function rather than directly learning the state-action mapping, the apprentice can learn, in new scenarios, a reward function that explains the observed expert behavior. Moreover, it allows adapting to perturbations in a dynamic environment [22]. However, imitating each step often becomes impracticable when the learning agent and the environment differ from those in the demonstration. Also, using IL to track an agent in motion is still a challenging task; in many cases, the agent does not have to follow the expert unconditionally but must instead care about the demonstrator's intention, i.e., goal-based imitation [23]. In parallel, Active Inference (AIn) [24] suggests a framework in which the agent learns to minimize the divergence between expectation and evidence (i.e., surprise or abnormality) by selecting actions based on probabilistic decision making. Surprise is an information-theoretic quantity that can be approximated with the variational Free Energy (FE) [25]. FE explains perception, action, and model learning in a Bayesian probabilistic way and provides an upper bound on the negative log-evidence, or surprise [26]. The notion of AIn translates predictive coding into an embodied context and argues that surprise (or abnormality) can be minimized in two ways: either by optimizing internal predictions about the world (perception) or by acting on the world to change sensory samples so that they match internal predictions (action) [27]. This work proposes a framework integrating AIn with IL (AIL) for autonomous driving. IL is used as a pre-training step to encode an expert demonstration in a coupled Dynamic Bayesian Network (DBN) for a specific task (e.g., overtaking a dynamic obstacle). The coupled DBN is a probabilistic graphical model explaining the dynamic interactions among multiple environmental agents (an expert agent and a dynamic object).
Due to its hierarchical nature, the DBN can express temporal relationships among high-level variables capturing abstract semantic information about the environment and low-level distributions capturing rough sensory information, together with their respective evolution through time. Since IL suffers from a fundamental problem known as distributional shift (i.e., the distributions over states observed during training differ from those observed during testing), the agent might fail to reach the goal in an unseen environment. The proposed AIL framework enables the agent to exploit IL by mimicking the expert behaviour under normal circumstances (when predictions match observations), i.e., selecting the same set of actions as the expert, and to explore new actions through AIn under abnormal situations. In this work, the exploration phase is intended to select new actions that allow avoiding surprising states in the future, i.e., moving towards the expert reference model. In both exploitation and exploration, the agent aims to occupy environmental states that minimize the FE (equivalent to maximizing rewards in RL). The main contributions of this work can be summarized as fourfold:
• Inference of sensory signals and decision making.
• The exploration-exploitation dilemma is guided by predictive and diagnostic messages. During exploration, the imitator agent incrementally learns a new set of configurations and actions that allow it to approach the expert's reference model.
• No explicit reward signal from the environment is needed. The reward is substituted by FE measurements based on the agent's beliefs about environmental states and its actual observations.
• The proposed approach is validated on a real dataset consisting of sensory information collected from two autonomous vehicles. Results show that the proposed approach outperforms conventional RL methods in the number of selected actions, successful travel rate, collision probability, out-of-boundary probability, and imitation loss.

II. RELATED WORKS
There are different methods for learning a policy from expert demonstrations. Direct learning, which fits a supervised model to the demonstrations, is the most straightforward way: the goal is to learn a mapping from states to actions that mimics the demonstrator [33], [34]. Supervised learning methods are categorized into classification methods, when the learner's actions can be grouped into discrete classes [35], [36], and regression methods, which are used to learn actions in a continuous space [37]. Direct imitation is often not adequate to reproduce suitable behavior due to errors in the demonstration [36]. Indirect learning can complement direct learning by refining the policies based on expert demonstrations and learner experiences, making them more accurate in unseen scenarios. The crucial role of RL in minimizing this distinction in IL, formulated as probabilistic inference, has been discussed extensively in the literature [38]-[41]. Most of the existing methods develop RL approaches based on different divergence metrics, showing that optimal control can be formulated as probabilistic inference in a graphical model that minimizes the divergence between reward and policy distributions over trajectories [39], [42]. Motion prediction plays a prominent role in maximizing the convergence between expectation and evidence: an efficient prediction provides the agent with the ability to learn appropriate transitions to reach the target autonomously [43]. In [44], a motion planning approach is proposed to avoid collisions by using a specified sparse reward function; this leads to inefficient learning when facing dynamic obstacles if the learning agent does not observe enough to reinforce its actions based on the changes in the environment due to the obstacle's motion. Our work draws significant inspiration from these prior works in RL and aims to provide a probabilistic dynamic learning model that can anticipate future changes in the environment. IRL algorithms have shown promising results in defining a policy that minimizes cost functions or maximizes the entropy of the distribution over state-actions under the learned policy [45]. Early works in IL through IRL operated by matching desired features between policies and the expected expert demonstration [29], [46]. Furthermore, Energy-Based IL [30], [40] is an IRL framework that estimates the unnormalized probability energy of the expert's occupancy measure through score matching, and then uses this energy to construct a reward function as a guide for learning the desired policy. Recent scalable approaches to Max-Ent IRL [28], [31], [32], motivated by adversarial approaches to generative modelling [47], present a common view of IL from a divergence-minimization perspective. Our work generalizes these objectives based on insights from free-energy minimization [48] and further provides a learning model based on probabilistic interaction between the agents in a dynamic environment. Table 2 and Table 3 summarize the comparison between the proposed framework (AIL) and some existing methods from the literature.

III. PROPOSED FRAMEWORK
This section involves two main phases, the offline learning phase and the online active learning phase. In the former phase, we first provide a situation model encoding the dynamic interaction between an Expert agent (E) and a dynamic Object (O). Consequently, we provide the Learner agent (L) with a First-Person (FP) model, where we assume that L tries to learn sub-optimal behavior by observing the E demonstration. In the latter phase, we present an Active First-Person (AFP) model that the L can use to update its knowledge while interacting with an object (Ô) in a continuous dynamic environment. All of the mentioned models (i.e., situation, FP and AFP) are Probabilistic Graphical Models (PGMs) that employ a graph-based representation to encode various multidimensional random variables and represent causal relationships among them [49].
In this work, we propose to use a particular type of PGM, namely, the Dynamic Bayesian Network (DBN) [50]. Due to its hierarchical nature, the DBN can express the temporal relationship between high-level variables (capturing abstract semantic information of the world) and low-level distributions (capturing rough sensory information of the environment), with their respective evolution through time. State variables describing the system's state at a specific time instant k can be categorized as either hidden variables (discrete or continuous), representing the causes affecting the evolution of the system's state, or measured variables expressing noisy measurements [51]. Since the network size increases over time, performing inference using the entire network would be intractable for all but trivial time durations. Fortunately, efficient recursive algorithms have been developed to perform exact inference on specific types of DBNs, or approximate inference on more general DBN varieties [52], [53]. Recent works studied several algorithms for inference in PGMs following a data-driven approach [54], [55]. A modern inference mechanism, namely the Markov Jump Particle Filter (MJPF) presented in [54], can be employed to facilitate the generation of behavior based on DBN models learned computationally from data.

A. OFFLINE LEARNING PHASE
The aim of the offline learning process is to learn the situation model (i.e., the reference model that the L can use for initialization) based on the E behavior. Initialization is conducted by mapping the reference DBN structure onto the L moving reference system as an FP reference model.

1) Situation model
The situation model consists of a switching DBN [56] representing the interaction of two dynamic entities, E and O. The model is described by means of a set of observation and state variables that describe the state of the two interacting agents at a given time instant k. It is assumed that the agents' (E and O) observations are represented by the variables $Z^E_k$ and $Z^O_k$, respectively (① in Fig. 1). At a higher level, hidden continuous Generalized States (GSs) [57] can be formed describing the agents' instantaneous dynamics up to a chosen n-th order temporal derivative. Thus, a joint GS ($\tilde{X}_k$) (② in Fig. 1) incorporating the dynamics of multiple agents (i.e., E and O) at each time instant k can be defined as follows:
$$\tilde{X}_k = [\tilde{X}^E_k \; \tilde{X}^O_k]^{\top},$$
where $\tilde{X}^E_k$ and $\tilde{X}^O_k$ denote the GSs of E and O, respectively. Here, a GS related to agent i (i.e., $\tilde{X}^i_k$) is defined as a vector composed of the agent's state and its first-order temporal derivative, such that $\tilde{X}^i_k = [x \; \dot{x}]^{\top}$, where $x \in \mathbb{R}^d$, $\dot{x} \in \mathbb{R}^d$, $i \in \{E, O\}$ and d stands for the dimensionality of the state vector. Each observed sensor variable $Z^i_k$ is assumed to be related to the corresponding agent's hidden state variable $\tilde{X}^i_k$ by a linear relationship according to the following observation model:
$$Z^i_k = H\tilde{X}^i_k + v_k,$$
where $H = [I_d \; 0_{d,d}]$ is the observation matrix that maps hidden GSs ($\tilde{X}^i_k$) to measurements ($Z^i_k$) and $v_k$ is the measurement noise, assumed to be zero-mean Gaussian with covariance R, such that $v_k \sim \mathcal{N}(0, R)$. To learn the dynamic interaction models, we first assume that there is no external force influencing the evolution of the GSs of the observed agents, under the static equilibrium assumption described by the following model:
$$\tilde{X}^i_k = A\tilde{X}^i_{k-1} + w_k, \qquad (2)$$
where $A \in \mathbb{R}^{d \times d}$ is the dynamic matrix and $w_k$ is the process noise, assumed to be zero-mean Gaussian with covariance Q, such that $w_k \sim \mathcal{N}(0, Q)$. This implies a null acceleration.

FIGURE 1. Vertical arrows describe the causalities between the continuous and discrete levels of inference and the observed measurements. Horizontal arrows explain temporal causalities between hidden variables. In particular, the orange arrow encodes the interaction of couples of agents, and blue arrows represent the influence at the continuous level.

The learning approach
consists of observing deviations from this hypothesized equilibrium through an active approach, namely the Null Force Filter (NFF). An NFF can be interpreted as a generalized Kalman Filter (KF) [58], which uses the innovations obtained by observing an input data sequence $Z^i_k$ to estimate a new situation model that describes interactions between the observed agents in the GS space. The innovations can be seen as mismatches between observations (obtained by observing the interaction) and predictions (based on the assumption that the observations should be quasi-static), defined as follows:
$$\tilde{\upsilon}^i_k = Z^i_k - H A \tilde{X}^i_{k-1}.$$
The couples obtained by the NFF along the interaction time series are defined as generalized errors (GEs):
$$GE^i_k = \big(\tilde{X}^i_k, \tilde{\upsilon}^i_k\big). \qquad (4)$$
Those GEs can be clustered using an unsupervised method. We employ the Growing Neural Gas with utility measurement (GNG-U) [59], which outputs a set $S^i$ of (switching) discrete variables (i.e., clusters) representing the discrete level of the switching DBN (⑤ in Fig. 1). Each cluster describes in which region of the GS space, with which difference in the dynamic motion (w.r.t. the hypothesized absence of external forces) and at what time a specific interaction has occurred. The joint vocabularies of switching variables obtained from the agents' GEs, E and O, describe a specific type of interaction among the agents at multiple levels (i.e., discrete and continuous). Each discrete state represents a region where quasi-linear models are valid to represent the interactive dynamical system over time. Vocabularies are defined as:
$$S^i = \{s^i_1, s^i_2, \ldots, s^i_{L_i}\},$$
where $L_i$ is the total number of clusters associated with agent i and $s^i_l \in S^i$ is a specific cluster describing the agent's motion.
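As a rough illustration of how the NFF produces the generalized errors that GNG-U later clusters, the following sketch runs a Kalman filter whose transition model assumes a static (null-force) state, so the innovations capture the unmodelled interaction dynamics. The function name, the 2-D layout and the noise values are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np

def null_force_filter(Z, R=0.1, Q=0.01):
    """Kalman filter under a null-force (static equilibrium) assumption.

    Z: (T, d) array of generalized observations for one agent.
    Returns the filtered states and the innovations (generalized errors).
    """
    T, d = Z.shape
    A = np.eye(d)                 # static model: no expected motion
    H = np.eye(d)                 # observations map directly to states
    x = Z[0].astype(float)        # initialize with the first observation
    P = np.eye(d)
    states, innovations = [], []
    for k in range(1, T):
        # Predict assuming no external force acts on the agent
        x_pred = A @ x
        P_pred = A @ P @ A.T + Q * np.eye(d)
        # Innovation: mismatch between observation and quasi-static prediction
        v = Z[k] - H @ x_pred
        S = H @ P_pred @ H.T + R * np.eye(d)
        K = P_pred @ H.T @ np.linalg.inv(S)
        x = x_pred + K @ v
        P = (np.eye(d) - K @ H) @ P_pred
        states.append(x.copy())
        innovations.append(v.copy())
    return np.array(states), np.array(innovations)

# Toy usage: a constantly moving agent yields non-zero innovations,
# which play the role of the generalized errors clustered by GNG-U.
Z = np.cumsum(np.ones((50, 2)) * 0.2, axis=0)
states, ge = null_force_filter(Z)
print(ge[:3])
```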
Since each superstate $s^i$ is supposed to follow a multivariate Gaussian distribution, it can be represented by its sufficient statistics, namely the covariance matrix $\tilde{\Sigma}_{s^i_k}$ and the generalized mean values
$$\tilde{\mu}_{s^i} = [\mu^{Pos}_{s^i} \; \mu^{V}_{s^i}]^{\top},$$
where $\mu^{Pos}_{s^i}$ and $\mu^{V}_{s^i}$ represent the mean value of the states (positions) and the mean value of the corresponding derivatives (velocities), respectively. At a time instant k, each agent i is represented by an active superstate $s^i_k \in S^i$. Joint active superstates from different agents occurring simultaneously form an interaction configuration defined as
$$D_k = \{s^E_k, s^O_k\}.$$
Consequently, an additional vocabulary of dictionary configurations can be defined and included in the DBN at a higher hierarchical level, such that:
$$\mathcal{D} = \{D_1, D_2, \ldots, D_M\}, \qquad (6)$$
where M is the total number of configurations and $D_m \in \mathcal{D}$ encodes a given identified configuration composed of the position and velocity features of both agents, defined as:
$$D_m = [\mu^{Pos}_{s^E} \; \mu^{V}_{s^E} \; \mu^{Pos}_{s^O} \; \mu^{V}_{s^O}].$$
The inter-slice links at multiple levels among consecutive time instants are also learned to define the DBN completely. It has to be noted that the learned switching variables are associated with corresponding dynamic models at the continuous GS level. As the NFF clusters similar innovations into compact regions of the state space, in each region it is possible to estimate the interaction force for a given agent by modifying the dynamic model of (2). Given the linearity and Gaussianity of the NFF dynamic model, the dynamic model of each agent inside a cluster $s^i$ is estimated based on the quasi-constant velocity that depends on the state and derivative mean values of the GEs clustered in each $s^i$, such that:
$$\tilde{X}^i_k = A\tilde{X}^i_{k-1} + B\mu^{V}_{s^i_k} + w_k, \qquad (8)$$
where $B \in \mathbb{R}^{d \times d}$ is a control model matrix that maps the agent's velocity estimation into the following states. The variable $\mu^{V}_{s^i_k}$ is a control vector encoding the agent's motion when it is found in a region $s^i_k$, formulated as:
$$\mu^{V}_{s^i_k} = [\dot{x}_{s^i_k} \; \dot{y}_{s^i_k}]^{\top},$$
where $\dot{x}_{s^i_k}$ and $\dot{y}_{s^i_k}$ are the velocity components of agent i associated with $s^i_k$. The transition model defined in (8) corresponds to a cluster-dependent motivated dynamics whose effects are encoded in $\mu^{V}_{s^i_k}$ and switched according to the activated configuration. The probabilistic law that regulates switching among the different local forces captured by different interaction configurations can be estimated in different ways (e.g., frequentist or geometrical) and encoded in a Transition Matrix (TM). Learning the TM involves estimating the transition probabilities $P(D_{k+1}|D_k)$ of switching from a current configuration ($D_k$) to another one ($D_{k+1}$), defined as:
$$TM_{\mathcal{D}}(m, m') = P(D_{k+1} = D_{m'} \,|\, D_k = D_m), \quad m, m' \in \{1, \ldots, M\}. \qquad (10)$$
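As a minimal illustration of the frequentist option mentioned above, the sketch below estimates the transition matrix over configurations from a labelled training sequence; the function name and the smoothing constant are illustrative assumptions.

```python
import numpy as np

def learn_transition_matrix(config_sequence, num_configs, smoothing=1e-6):
    """Frequentist estimate of P(D_{k+1} | D_k) from a sequence of
    configuration indices observed during the expert demonstration."""
    counts = np.full((num_configs, num_configs), smoothing)
    for d_k, d_next in zip(config_sequence[:-1], config_sequence[1:]):
        counts[d_k, d_next] += 1.0
    # Normalize each row so it is a proper conditional distribution
    return counts / counts.sum(axis=1, keepdims=True)

# Toy usage with 4 configurations
seq = [0, 0, 1, 2, 2, 3, 0, 1, 2, 3]
TM = learn_transition_matrix(seq, num_configs=4)
print(TM.round(2))
```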

2) First Person model
The First Person (FP) model can be seen as a situation model transformed in such a way that allows a learning agent L to directly use its own observations and generate state series describing its relative state with respect to another interacting dynamic agent Ô. It provides the L agent with the capability to imitate the expert motions by generating transformed sequences from the situation model (Fig. 2). The mapping implies defining all DBN nodes of the new FP model (discrete and continuous) and the probabilistic dependency models starting from the situation model nodes and links. Therefore, the FP model can be considered as an initialization generative switching model represented by a Generalized DBN (GDBN), which can be used to predict interaction states from the perspective of a learning agent. FP model initialization. The FP model depicted in Fig. 2-(right side) is initialized to allow the L agent to exploit the switching DBN corresponding to the situation model (i.e., pure IL from expert demonstrations). The discrete variables at the configuration level (top level of the FP model) represent the learned set of configurations $D_m \in \mathcal{D}$. In the FP model, the L is assumed to take the role of the E. Therefore, all clusters related to E should correspond to the clusters describing L states in a certain configuration. By providing a biunivocal mapping between the clusters of E and L, the transition probabilities learned in the situation model can be reused directly by the FP model. At the continuous level, $\tilde{X}$ represents the generalized relative distance (consisting of the relative distance and relative velocity) between E and O (or between L and Ô in an ideal IL setting) which are interacting in the environment. The generalized relative distance can be seen as the difference of the joint GSs describing the interaction at the continuous level of the two agents in a specific configuration ($D_m$), defined as:
$$\tilde{X}_k = \tilde{X}^E_k - \tilde{X}^O_k.$$
The relative positions of E and O in the situation model are illustrated in Fig. 3. The relative distance vector is highlighted as the difference in absolute coordinates and velocities of the two objects. The distance vector in the FP model is shown in Fig. 4, where the relative learner reference system is depicted to highlight the information captured in the FP model. Moreover, the observation ($Z_k$) of the L and Ô can be mapped onto the observations ($Z^i_k$) of both agents (E, O) according to the following equation:
$$Z_k = Z^E_k - Z^O_k.$$
To this end, a configuration $D_m \in \mathcal{D}$ at the discrete level of the FP model is represented by a joint superstate of each agent at time instant k, i.e., $D_k = \{s^L_k, s^O_k\}$. Thus, the model can predict the expected future configurations based on the dynamic transition rules encoded in the transition matrix $TM_{\mathcal{D}}$ and predict GSs based on the following dynamic model:
$$\tilde{X}_k = A\tilde{X}_{k-1} + B\mu^{V}_{D_k} + w_k,$$
which is characterized by the conditional probability $P(\tilde{X}_k \,|\, \tilde{X}_{k-1}, D_k)$.
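The following sketch illustrates, under simplifying assumptions, how the FP model turns the situation model's joint quantities into learner-centric ones: the generalized relative distance is the difference of the two agents' generalized states, and one prediction step uses the velocity mean of the active configuration. Function names, the 2-D state layout and the way the control term is applied are assumptions made for illustration.

```python
import numpy as np

def relative_generalized_state(x_e, x_o):
    """Generalized relative distance between expert/learner and object:
    difference of positions and velocities, stacked as [dx, dv]."""
    return x_e - x_o   # both are [position, velocity] vectors

def predict_fp_state(x_rel, mu_v, dt=1.0):
    """One prediction step at the continuous level of the FP model:
    the relative position evolves with the velocity mean of the active
    configuration (quasi-constant velocity assumption)."""
    d = x_rel.size // 2
    A = np.eye(2 * d)
    A[:d, d:] = dt * np.eye(d)          # position += dt * velocity
    B = np.zeros((2 * d, d))
    B[d:, :] = np.eye(d)                # control acts on the velocity part
    return A @ x_rel + B @ (mu_v - x_rel[d:])

# Toy usage in 2-D: expert/learner and object states as [x, y, vx, vy]
x_e = np.array([0.0, 0.0, 1.0, 0.0])
x_o = np.array([5.0, 1.0, 0.5, 0.0])
x_rel = relative_generalized_state(x_e, x_o)
mu_v = np.array([0.4, 0.0])             # velocity mean of the active configuration
print(predict_fp_state(x_rel, mu_v))
```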

B. ONLINE ACTIVE LEARNING PHASE
In this section, we propose a hybrid mechanism allowing the L agent to learn how it should behave in a dynamic environment by integrating imitation learning with active inference [60]. The FP model can be used during the active learning stage to provide suitable predictions and to learn the best set of actions that the L agent should take. However, the FP model must be integrated with active states describing how the agent can act in the environment to change its sensory signals so that they match the internal predictions of the FP model, allowing the agent to imitate efficiently.

1) Active First Person model
During active learning, the L agent has to interact with another dynamic agent Ô in real time. The L agent starts by assessing the current situation and evaluates whether the E agent has experienced the same situation by relying on the FP model that encodes the dynamic interaction between E and O. If the L is facing the same situation faced by the E, then L tries to imitate the same actions the E has performed, which are captured in the FP model (i.e., imitation learning). The L agent can observe the Ô through its exteroceptive sensor, which provides the relative distance $Z_k$ (a vector incorporating the differences in positions and velocities) between the current origin of the L reference system and the other agent Ô in the environment. The proposed AIn approach integrated with IL (AIL) involves three main steps: 1) prediction and perception, 2) action selection, and 3) FE calculation and action updates.

2) Prediction and Perception
The L employs a Particle Filter (PF) to predict the configurations $D_k$ (visited by the E) and consequently to estimate the relative distance $\tilde{X}_k$ from the Ô at each time step k. At the first iteration (k = 1), L relies on the prior probability distributions ($P(\tilde{X}_1)$, $P(D_1)$) to predict the relative distance ($\tilde{X}_1$) from the Ô and the expected configuration ($D_1$). In the successive iterations (k > 1), L relies on the interactive transition matrix TM to predict future configurations, which guides the prediction of the relative distance at the lower level. The PF propagates a set of N equally weighted particles using a specific row ($\pi(D_k)$) of TM as a proposal distribution, such that $\{D_{k,n} \sim \pi(D_k),\, W_{k,n} = \frac{1}{N}\}$. For each particle n representing the predicted configuration $D_{k,n}$, the expected hidden states ($\tilde{X}^E_{k,n}$, $\tilde{X}^O_{k,n}$) of E and O can be estimated according to the following dynamic equations:
$$\tilde{X}^E_{k,n} = A\tilde{X}^E_{k-1,n} + B\mu^E_{D_{k,n}} + w_k, \qquad \tilde{X}^O_{k,n} = A\tilde{X}^O_{k-1,n} + B\mu^O_{D_{k,n}} + w_k,$$
where $\mu^E_{D_{k,n}}$ and $\mu^O_{D_{k,n}}$ are associated with the clusters $\tilde{S}^E_{k,n}$ and $\tilde{S}^O_{k,n}$, respectively, such that $\{\tilde{S}^E_{k,n}, \tilde{S}^O_{k,n}\} \in D_{k,n}$. Then, the relative distance from O can be approximated as follows:
$$\tilde{X}_{k,n} = \tilde{X}^E_{k,n} - \tilde{X}^O_{k,n}.$$
Thus, this approximation depends on the hypothesised configuration, which implicitly explains the conditional probability $P(\tilde{X}_{k,n}|\tilde{X}_{k-1,n}, D_{k,n})$. In this sense, the L agent associates itself with a specific configuration ($D_{k,n}$) and predicts the relative distance from the current dynamic object Ô with which it is dealing. The L agent receives observations ($Z_k$) through its exteroceptive sensor and realizes actions through its actuators. Once a new $Z_k$ is given - describing the relative distance between the L and Ô - the L can evaluate whether the situation it is experiencing has already been faced by the E in order to make a decision on actions (i.e., the decision between Exploitation and Exploration). Diagnostic messages ($\lambda(\tilde{X}_k)$ and $\lambda(D_k)$) propagated from the bottom level towards higher levels inside the AFP allow defining an abnormality measurement to evaluate how much the current observation supports the predictions, as well as updating the belief in the hidden variables. The model computes the anomaly (Ω) by measuring the cosine similarity ($\cos(\theta)$) between the observed relative distance ($Z_k = d^L_z$) and the predicted relative distance ($\tilde{X}_{k,n}$) associated with each propagated particle as follows:
$$\cos(\theta_{k,n}) = \frac{Z_k \cdot \tilde{X}_{k,n}}{\|Z_k\| \, \|\tilde{X}_{k,n}\|}.$$
The lower the angle θ, the lower the abnormality value, i.e., the higher the similarity. Particles gain weight according to their similarity with the observation: a particle with high similarity (lower angle) gains more weight ($W_{k,n}$) than particles with low similarity. The message $\lambda(D_k)$ is used to update the particles' weights, and it is obtained from
$$\lambda(D_{k,n}) = P(Z_k | D_{k,n}) \propto \lambda(\tilde{X}_k),$$
where $\lambda(\tilde{X}_k) = P(Z_k|\tilde{X}_{k,n})$ is a multivariate Gaussian distribution such that $\lambda(\tilde{X}_k) \sim \mathcal{N}(Z_k, v_k)$ and $\lambda(D_k)$ is a discrete probability distribution. Consequently, the particles' weights can be updated as follows:
$$W_{k,n} \propto W_{k-1,n}\,\lambda(D_{k,n}). \qquad (22)$$
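A compact sketch of the prediction-perception cycle described above: particles sample configurations from the transition matrix, each particle predicts a relative distance using the velocity mean of its configuration, and weights are updated with a cosine-similarity likelihood. For brevity the sketch predicts the relative distance directly rather than the two agents' states separately; function names, the likelihood shape and the numerical settings are illustrative assumptions.

```python
import numpy as np

def pf_step(particles, weights, TM, mu_v, x_rel_prev, z_k, dt=1.0):
    """One PF iteration over discrete configurations and relative distances.

    particles : (N,) int array of configuration indices at k-1
    weights   : (N,) normalized particle weights
    TM        : (M, M) transition matrix over configurations
    mu_v      : (M, d) velocity means associated with each configuration
    x_rel_prev: (N, 2d) previous generalized relative distance per particle
    z_k       : (2d,) observed generalized relative distance
    """
    d = mu_v.shape[1]
    # 1) Prediction at the discrete level: sample next configurations
    new_particles = np.array(
        [np.random.choice(TM.shape[0], p=TM[p]) for p in particles])
    # 2) Prediction at the continuous level: quasi-constant velocity model
    x_pred = x_rel_prev.copy()
    x_pred[:, :d] += dt * mu_v[new_particles]       # positions evolve
    x_pred[:, d:] = mu_v[new_particles]             # velocities set by the cluster
    # 3) Perception: cosine similarity between prediction and observation
    cos = (x_pred @ z_k) / (np.linalg.norm(x_pred, axis=1)
                            * np.linalg.norm(z_k) + 1e-12)
    new_weights = weights * np.clip(cos, 1e-12, None)
    new_weights /= new_weights.sum()
    return new_particles, new_weights, x_pred

# Toy usage: 3 configurations in 2-D, 10 particles
M, d, N = 3, 2, 10
TM = np.full((M, M), 1.0 / M)
mu_v = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
particles = np.zeros(N, dtype=int)
weights = np.full(N, 1.0 / N)
x_prev = np.tile(np.array([5.0, 0.0, 1.0, 0.0]), (N, 1))
z = np.array([6.0, 0.1, 1.0, 0.0])
p, w, xp = pf_step(particles, weights, TM, mu_v, x_prev, z)
print(w.round(3))
```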

3) Action selection
The updated particles' weights allow the L agent to decide whether to exploit actions by imitating the E's behaviour or to explore new actions that may yield lower FEs (higher rewards) in the future. The decision between exploration and exploitation is based on two parameters, namely the exploration rate (ϵ) and a varying threshold (ρ). The former is defined in terms of α, the largest weight among all the N particles, which measures the likelihood between the current L configuration and the reference configuration:
$$\alpha_k = \max_{n} W_{k,n},$$
where $0 \le \alpha \le 1$. So, if $\alpha_k$ is near 1, $\epsilon_k$ becomes very low, which means that the current observation matches L's expectation and so it can exploit the same actions performed by the E. However, in other cases it might happen that α is not high (e.g., below 0.5). In this case, it is required to evaluate the anomaly level associated with the particle index that has the maximum weight and to define ρ through a trial-and-error process. Thus, the action generation process depends on the decision made by the L agent on whether to explore or exploit:
$$a_k = \begin{cases} a^E \sim P(a \,|\, D^{\beta}_k), & \text{exploitation},\\ a^{+} \in A^{+}, & \text{exploration},\end{cases}$$
where $a_k$ are the active states (i.e., actions) realizing the top level of the AFP model, $A^E = \{a^E_1, a^E_2, \ldots, a^E_Y\}$ is the set of actions performed by the Expert and encoded in the situation model that the learner aims to imitate during exploitation, and $A^{+} = \{a_1, a_2, \ldots, a_8\}$ is a set of predefined actions realizing 8 different directions used during exploration. In addition, $D^{\beta}_k$ is the reference configuration most similar to the observed one, and β is the particle index with the maximum weight associated with (22), defined as:
$$\beta = \arg\max_{n} W_{k,n}.$$
Moreover, during exploration, the L saves the new configurations $D^{+}_k$ (not seen by the E) it is experiencing, along with the performed actions $a^{+}_k \in A^{+}$, in a set (C). After finishing a certain experience, the L clusters all the pairs $[D^{+}_k, a^{+}_k]$ saved in C by employing the GNG. The latter outputs a set of clusters representing the new configurations ($D^{++}$) that can be appended incrementally to the probabilistic Q-table (Q) (25), whose rows correspond to configurations and whose entries are the action selection probabilities, subject to the normalization constraints:
$$\sum_{y=1}^{Y} P(a^E_y|D_m) + \sum_{e} P(a^{++}_e|D_m) = 1 \quad\text{and}\quad \sum_{y=1}^{Y} P(a^E_y|D^{++}) + \sum_{e} P(a_e|D^{++}) = 1,$$
such that $m \in M$ and $y \in Y$, where $a^{++} = \mu^{V}_{D^{++}}$ are the newly explored actions that can be exploited in the future. In addition, the L agent updates the transition model defined in (10) by incrementally adding new rows and columns related to the new configurations. In Exploitation, if the current configuration has been observed by the E, the L takes the adapted expert action from the FP model by activating, at run time, the reference configuration most similar to the current L configuration ($D_k = D^{\beta}_k$) and consequently selects the suitable action (i.e., representing the L's motion) according to $P(a_k|D^{\beta}_k)$ encoded in Q. After that, by adapting the expected motion $P(\tilde{X}_k|D_k)$ at time k through the active states $P(a_k|D_k, \tilde{X}_k)$, the L agent transits to a new configuration realized by $P(D_{k+1}|a_k, D_k)$. Thus, the conditional prior $P(a_k|D_k, \tilde{X}_k)$ is maximized, after having been initialized according to the demonstration, to select the best action $a_k$ given the current configuration and state. Besides, in Exploration, if the mismatch between predictions and observation is too high, the model cannot apply direct imitation of the E agent by taking a learned action from the situation model. Instead, if the L agent faces an anomaly, the learning model considers it as an unseen situation.
Hence, the newly explored configurations are added to the reference configurations (incremental learning model). Moreover, the model associates a set of possible actions, with equal selection probabilities, to the newly added configuration, which the L agent can take randomly to move in the environment. The selection probabilities are modified throughout the online learning phase. The presented learning procedure aims at converging to an optimal policy by lowering the probability of taking a random action over time as the agent becomes more confident in its estimations. During exploration, the L aims to take the best set of actions that can bring it closer to the reference configurations (i.e., the reference vocabulary realizing the expert's behaviour in dealing with a dynamic object in the environment).
The AFP model improves the L's behavior during training by minimizing the divergence between the situation model and the AFP model, thereby decreasing the imitation loss, while also dealing with abnormalities to reduce the probability of collision or of going out of boundaries. During active learning, the selection probabilities related to each movement are recorded in $P(a_k|D_k)$ and updated at each time instant. The L agent needs to exploit the expert demonstrations to minimize the global FE by modifying the transition policies. It also needs to explore through new experiences to make better action selections in the future. The L agent must modify its actions several times to gain reliable predictions with a low imitation loss cost and low FE on a stochastic task while switching between exploration and exploitation.
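To make the exploration-exploitation decision described above more concrete, the sketch below computes the exploration rate from the largest particle weight, exploits the expert action encoded in the probabilistic Q-table when the observation matches expectations, and otherwise picks one of the predefined exploratory actions. The exact form of ϵ, the 0.5 cut-off and the random exploratory choice are assumptions for illustration, not the rule used verbatim in the paper.

```python
import numpy as np

def select_action(particles, weights, anomalies, Q, rho, rng=None):
    """Decide between exploitation and exploration.

    particles : (N,) configuration index per particle
    weights   : (N,) particle weights after the perception update
    anomalies : (N,) abnormality value associated with each particle
    Q         : (M, A) probabilistic Q-table P(a | D_m) over configurations
    rho       : anomaly threshold chosen by trial and error
    """
    if rng is None:
        rng = np.random.default_rng()
    beta = int(np.argmax(weights))          # particle with the maximum weight
    alpha = float(weights[beta])            # match with the reference model
    epsilon = 1.0 - alpha                   # exploration rate (assumed form)
    D_beta = int(particles[beta])           # most similar reference configuration
    if epsilon < 0.5 or anomalies[beta] <= rho:
        mode = "exploit"                    # imitate the expert's encoded action
        action = int(np.argmax(Q[D_beta]))
    else:
        mode = "explore"                    # try one of the predefined directions
        action = int(rng.integers(Q.shape[1]))
    return action, mode, D_beta

# Toy usage: 3 configurations, 8 directional actions, 4 particles
Q = np.full((3, 8), 1.0 / 8)
a, mode, d_beta = select_action(np.array([0, 1, 2, 1]),
                                np.array([0.1, 0.6, 0.2, 0.1]),
                                np.array([0.3, 0.1, 0.5, 0.4]),
                                Q, rho=0.2)
print(a, mode, d_beta)
```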


4) Imitation cost
The IA faces multiple tasks that need an ample action space; it is therefore challenging to acquire an appropriate policy through a single cumulative reward strategy, and we cannot rely on such a strategy alone to train the IA to learn effective behavior policies. This work suggests training the learner agent by imitating expert manipulations, applying IRL to IL under the postulate that the expert behavior optimizes the expected motion of the learner agent over time. Most IRL methods formalize the underlying decision-making problem as an MDP, a model of a discrete-time process wherein an agent's actions may stochastically influence its environment. IRL aims at finding a reward function R that could explain the expert policy from demonstrations. The proposed approach endows the IA with the capability of estimating the imitation cost (i.e., reward) in terms of FE at multiple levels. Minimizing the FE (i.e., maximizing rewards in RL) ensures a dynamic equilibrium between the L and its prevalent environment. The FE measurements are based on the messages of the AFP hierarchy (messages passed top-down and bottom-up). The messages (λ) passed from lower nodes to upper nodes (see Fig. 5) have a diagnostic ability used to adjust the expectations (predictions through the inter-slice links π) given a sequence of observations. Comparing predictive and diagnostic messages allows detecting whether new observations are similar to previously learned situations encoded in the FP model. If predictions from the FP model are not compliant with observations, the model considers the current experience as anomalous, and so it should adapt by learning new situations and generating new semantic information.
The diagnostic messages evaluate the distinction between expectation and evidence at two abstraction levels. We theoretically extend the FE measurement by estimating the prior policy and the posterior policy at both the continuous and discrete levels. The goal is to allow the L to maximize the likelihood by using the FE as a control metric. Under the FE principle, L uses the likelihood estimation of the prior hidden states based on the active reference configuration (D) and the observations. The prior determined by the hidden states and actions at the previous time instant can change the L's future policy. FE measurement at the continuous level. The AFP model allows evaluating how much the sensory measurements support the predictions, and thus whether the selected actions were good or bad, by relying on the FE. The FE at the continuous level can be computed by evaluating the distinction between the predictive message $\pi(\tilde{X}_k)$ and the diagnostic message $\lambda(\tilde{X}_k)$ after taking an action $a_{k-1}$, under both exploration and exploitation. Thus, the taken action ($a^L_{k-1}$) guides the system to calculate the expected FE [61] at the continuous level ($\dot{F}$) based on the Kullback-Leibler Divergence ($D_{KL}$) [62] between $\pi(\tilde{X}_k)$ and $\lambda(\tilde{X}_k)$. Hence, the expected FE can be expressed as:
$$\dot{F}_k = D_{KL}\big(\pi(\tilde{X}_k)\,\|\,\lambda(\tilde{X}_k)\big). \qquad (26)$$
Our goal is to find a policy such that the learner's behavior matches the reference demonstrations. For this purpose, our objective is to minimise the divergence between what the L expects to observe after taking a certain action and what it is really observing. The L believes that a certain action allows it to imitate the E's behavior correctly during exploitation, or to approach the E's reference vocabulary as quickly as possible during exploration. FE measurement at the discrete level. The FE at this level (F) is computed by employing the Mahalanobis distance ($D_M$) [63] to calculate the distinction between the action selected by the L ($a^L_k$) and the E's estimated action ($a^E_k$) from the activated reference configuration $\mu^{V}_{D}$, defined as:
$$F_k = D_M\big(a^L_k, a^E_k\big), \qquad (27)$$
where $a^E_k = \arg\max Q(:, D_k)$.

Global FE. The Global FE (G) is based on the losses computed at both the continuous and discrete levels (defined in (26) and (27)). If the L agent is in an observed configuration (exploitation case), the GFE is computed as in (28); otherwise, if it experiences a new configuration or is improving the action selection over the recorded explored states, the GFE is computed as in (29).
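The sketch below illustrates how the free-energy terms described above could be computed: a KL divergence between the predicted and observed Gaussian beliefs at the continuous level, a Mahalanobis distance between the selected and the expert's action at the discrete level, and their sum as a global score. The Gaussian parameterisation and the simple sum used to combine the two terms are assumptions made for illustration; they do not reproduce (28)-(29) exactly.

```python
import numpy as np

def kl_gaussian(mu_p, cov_p, mu_q, cov_q):
    """KL divergence D_KL( N(mu_p, cov_p) || N(mu_q, cov_q) )."""
    d = mu_p.size
    inv_q = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(inv_q @ cov_p)
                  + diff @ inv_q @ diff
                  - d
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

def mahalanobis(a, b, cov):
    """Mahalanobis distance between the learner's and the expert's action."""
    diff = a - b
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def global_free_energy(pred_mu, pred_cov, obs_mu, obs_cov,
                       a_learner, a_expert, act_cov):
    F_cont = kl_gaussian(pred_mu, pred_cov, obs_mu, obs_cov)   # continuous level
    F_disc = mahalanobis(a_learner, a_expert, act_cov)         # discrete level
    return F_cont + F_disc                                     # combined score

# Toy usage in 2-D
pred_mu, obs_mu = np.array([1.0, 0.0]), np.array([1.2, 0.1])
cov = np.eye(2) * 0.1
print(global_free_energy(pred_mu, cov, obs_mu, cov,
                         np.array([1.0, 0.0]), np.array([0.8, 0.1]), np.eye(2)))
```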

5) Action update
The AFP model takes advantage of both the discrete and continuous levels dynamically to decrease the imitation loss by improving the action selection through the online learning procedure. Our objective is to minimise the long-term cost by reducing the global FE measurements defined in (28) and (29). The L agent adapts the action selection process by updating the Q-table defined in (25) based on the global FE, using a learning rate η, which controls how quickly the learning agent adapts to the explorations imposed by the environment, the normalized global FE measurement G (ranging from 0 to 1), and a discount factor γ, as in the general case of RL algorithms. Since the Q table used in this work is a probabilistic table, the update is applied in probabilistic form, so that each row of Q remains a probability distribution over the actions.
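A minimal sketch of this action-update step: the row of the probabilistic Q-table associated with the current configuration is adjusted using the normalized global FE in place of a reward and then renormalized so that it remains a probability distribution. The exact rule in (30)-(31) is not reproduced here; the Q-learning-style form below, with a learning rate η and a discount factor γ, is an assumption for illustration.

```python
import numpy as np

def update_q_row(Q, d_k, a_k, d_next, G_norm, eta=0.1, gamma=0.9):
    """Probabilistic Q-table update driven by the normalized global FE.

    Q      : (M, A) probabilistic table, each row sums to one
    d_k    : index of the current configuration
    a_k    : index of the action just taken
    d_next : index of the configuration reached after acting
    G_norm : normalized global FE in [0, 1]; low FE = good action
    """
    # Q-learning-style target where (1 - G_norm) plays the role of a reward
    target = (1.0 - G_norm) + gamma * np.max(Q[d_next])
    Q[d_k, a_k] += eta * (target - Q[d_k, a_k])
    # Keep the row a valid probability distribution over actions
    Q[d_k] = np.clip(Q[d_k], 1e-8, None)
    Q[d_k] /= Q[d_k].sum()
    return Q

# Toy usage: 3 configurations, 4 actions, uniform initialization
Q = np.full((3, 4), 0.25)
Q = update_q_row(Q, d_k=0, a_k=2, d_next=1, G_norm=0.2)
print(Q.round(3))
```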

IV. EXPERIMENTAL RESULTS
A. EMPLOYED DATA SET
The proposed framework is validated using a real dataset consisting of multisensorial information collected from two autonomous vehicles, 'iCab 1' and 'iCab 2' [64]. The vehicles' positional information and the corresponding velocities are obtained from the odometry module. This work considers two scenarios:
• lane-keeping scenario (following behavior): iCab 2 follows another agent (iCab 1), as shown in Fig. 8-(a), and aims to keep a safe distance from iCab 1. The latter plays the role of a dynamic obstacle in the environment with a higher speed than iCab 2.
• lane-changing scenario (overtaking behavior): iCab 2 overtakes iCab 1 (considered as a dynamic obstacle) to change the lane without collision. This scenario consists of two cases, overtaking from the left side and overtaking from the right side, as depicted in Fig. 8-(b) and Fig. 8-(c). In this scenario, iCab 2 has a higher speed than iCab 1.
Sensory data representing positional information from these experiments are used to learn the dynamic interaction between iCab 1 (which plays the role of a dynamic object, i.e., O) and iCab 2 (which plays the role of an expert, i.e., E), encoded in the situation model that the learner agent L will use to imitate the E.

B. OFFLINE LEARNING PHASE
This section shows the process of learning the situation model from data during different scenarios. The NFF is used as an initial filter employed on the data collected in the lane-keeping and lane-changing scenarios. NFF outputs the GEs defined in (4) which can be clustered using GNG that outputs a set of discrete clusters representing the discrete regions of the trajectories generated by E and O. The joint clusters define the set of configurations (defined in (6)) that encode the dynamic interaction among the two agents. The total number of clusters and configurations is 36 each.

C. ONLINE LEARNING PHASE
1) Experimental setting
During the online active learning phase, the AFP model relies on the FP model, which has been initialized using the situation model. Thus, the discrete level in the three models represents the configurations learned during the offline phase. The initial Q table contains only the learned configurations, with the corresponding action probabilities initialized from the expert demonstrations. For each iteration during an episode, L is trained to learn how to behave with another moving agent Ô in a dynamic environment. Each episode consists of 10 iterations, i.e., L performs 5k iterations over 500 different start positions to learn the policies.
Estimate the safe distance for the learner agent. The FE measurement at the continuous level helps the learner determine a safe distance from the moving object. The safe distance expresses the possibility that the learner agent can continue lane-keeping without risk of collision. At each time instant, the system finds the minimum and the mean value of $\dot{F}$, calculated through the KL divergence defined in (26). Then, by computing the differential of the corresponding distance vector's length with respect to these values ($|\Delta d_i|$), the measured safe distance determines a threshold for the L agent, which changes dynamically at each time instant during the learning phase until the completion of training. The model uses the safe distance to record the estimations in two Q-tables, one for the safe zone and one for the warning zone. In the safe zone, the higher transition probability relates to lane-keeping; in the warning zone, the higher transition probability leads the agent to lane-changing to reduce the collision probability. The estimations are separated based on the L's situation during the online learning phase to facilitate and accelerate decision making when exploiting the learned tables. We evaluate the performance of the proposed method in different experiments and compare it with four learning algorithms from the literature, namely, the general value-based Q-learning, double Q-Network, IRL (when an optimal expert is available) and self-learning in the RL context (when optimal expert data is not available). Performance evaluation involves two main issues, action selection and imitation loss.

2) Action Selection
The L predicts the configurations ($D_m$) visited by the E by employing the PF and then estimates the relative distance from Ô to decide whether to imitate the E's actions (i.e., exploitation) or to explore new actions. Initially, the PF propagates N = 10 equally weighted particles ($W = \frac{1}{N} = \frac{1}{10}$) by relying on the TM (at the first time instant, k = 1, the PF generates samples from a uniform distribution). Action selection realizes an essential process to reach the goal targeted by the agent (e.g., following or overtaking the dynamic obstacle). The number of taken actions describes the effort made by the agent to reach the goal: a good policy requires fewer actions and less time to reach the goal, while a poor policy requires more actions and time. Fig. 10-(a) shows the mean number of actions taken by the L in each episode during the online learning phase using different methods. From the figure we can observe that the L adopting the proposed approach (AIL) performs fewer actions compared to the other methods. This can be explained by the fact that initializing the FP model using the situation model decreases the exploration rate. Moreover, exploiting sub-optimal expert demonstrations at similar states plays a vital role in reaching the goal in a shorter time than exploring the environment from scratch. The threshold ρ has a great impact on the exploration rate; we train the L agent 11 times with different ρ values in the range [0, 1]. By considering the success rate obtained by each ρ value, we pick the best ρ value providing the maximum success rate, as shown in Fig. 11. In addition, updating the particles' weights to adjust the action selection procedure allows the L to avoid abnormalities and adapt to new experiences. Fig. 12 demonstrates how the exploitation and exploration rates affect the FE during the learning phase; refining the action selection can adapt to new experiences and minimise the FE. Balancing exploration and exploitation is one of the most challenging tasks in RL. An imbalance between exploration and exploitation might lead to adverse effects on learning performance. On the one hand, the domination of exploration would prevent the agent from maximizing short-term reward, i.e., explorative actions could lead an agent to collect a higher negative reward in the short run. On the other hand, if a learning approach is dominated by exploitation, an agent performs actions that could get it stuck in local minima or suboptimal solutions.

FIGURE 14. An example illustrating the learner agent in exploitation mode. Purple lines show the relative distance from the most probable configuration, while the green line represents the relative distance from the activated configuration. The learner exploits the activated configuration, leading to a lower divergence (θ between the blue and green distances).

FIGURE 15. An example illustrating the learner agent in exploration mode. The purple lines show the relative distance vector from the most probable configuration, while the green line is the relative distance from the activated configuration. The learner takes an action to explore the environment because the divergence between the learner configuration and the activated one is more than ρ.

Fig. 13 shows the frequency of the exploratory actions; the L is trained to have an equal opportunity to gain new
knowledge from the environment's dynamics and to follow the expert demonstrations to accomplish its mission (see Fig. 14 and Fig. 15). Improving the action selection skill leads the L to perform more successful movements in the dynamic environment, as shown in Fig. 16. When the L enters the exploration stage in a certain episode, it saves all the newly explored configurations along with the performed actions. Then, the L clusters those saved pairs (i.e., new configurations and actions) as discussed in Section III-B3. The newly explored configurations and actions are clustered for two reasons: to calculate the mean action value of the corresponding clusters in order to have data comparable with the FP model, and to avoid recording too many configurations in the Q-table. Fig. 17 and Fig. 18 describe the clustering process of the new configurations in two scenarios related to lane-keeping and lane-changing. In each step, the newly learned clusters are appended incrementally to the model to modify and improve the action selection by exploiting the newly appended actions through the online learning phase and resolving the L's uncertainty about the surrounding environment. Fig. 19 and Fig. 21 illustrate the clusters of the reference FP model (circles in gray) and the newly learned ones appended to the reference model (circles in yellow) in two different examples in which the L aims to overtake the dynamic object Ô. The corresponding TMs are updated by adding new rows and columns representing the newly learned configurations, as shown in Fig. 20(b) and Fig. 22(b); the TMs of the reference FP model are shown in Fig. 20(a) and Fig. 22(a). Comparing sub-figures (a) and (b) in each figure (Fig. 20 and Fig. 22) shows how the TMs of the FP model are expanded after the L has explored and learned new situations, allowing it to better predict the environmental dynamics in the future and consequently select effective actions. Such an incremental learning process under active inference endows the L with the capability of understanding the best set of actions it should perform to avoid surprising states.
The L adopting the proposed AIL method achieves more successful movements than IRL, SL, Q-learning and DQN, as depicted in Fig. 10-(b). Two factors directly affect the success of the learner's travel in each episode: the cumulative probability of going out of boundary and the collision probability. As mentioned earlier, each episode includes ten steps (ten full paths). With both factors decreasing at each step, the growth of the successful steps leads to an increase in the cumulative successful travel in each episode. By way of explanation, during exploration the model minimizes the FE measurement at the discrete level (F) at time k, which drives the resemblance between predictions and evidence at the continuous level. In total, by optimizing the global FE defined in (29) in the unseen situations, the learner can manage to avoid a collision with another agent or going out of boundary. Fig. 23-(a) shows the cumulative collision probabilities in each episode; we observe that the collision probability decreases as the number of episodes increases. Fig. 23-(b) presents the cumulative probability of going out of boundary, which starts at 62% and dramatically declines to 0% during learning. Fig. 23-(a)-(b) justify the L behavior in Fig. 10-(b). The experimental results demonstrate that the proposed method enables the L to learn better driving skills than the other RL methods. Integrating IL with IRL gives the L a prior driving experience, which accelerates the learning rate and improves the driving policy. The presented quantitative results show that the proposed method improves IL using expert demonstrations by taking advantage of sub-optimal reference data (exploitation) and by dynamically involving FE measurements at both the discrete and continuous levels to minimise the distinction between the situation model and the AFP model. Furthermore, qualitative results show the ability to manage critical situations. Fig. 24 shows some representative cases of different scenarios. The L's activated motion, the dynamic candidate motions and the expert driving action (the ground truth) are displayed with blue, grey and green arrows, respectively. The probabilities associated with the candidate motions are reported in Table 4. In each decision case (lane-keeping, change left, and change right), the motion most similar to the expert's is selected, i.e., the one with the highest probability among the candidates. Table 4 shows the probability percentage of the activated actions in all three cases.

FIGURE 25. Fig. (a) is at the beginning of training, when the learner tries to experience new actions; in Fig. (b) the FE has declined because the action selection has improved; and in Fig. (c) the learner has decreased the distinction with the expert configurations.

FIGURE 26. The three trajectories corresponding to the measurements in Fig. 25. In Fig. (a), the learner cannot yet balance exploration and exploitation. By decreasing the imitation loss and improving the explored actions, the learner can finish the travel by taking fewer actions (Fig. (b)), and Fig. (c) shows a successful travel with suitable actions with respect to the dynamic object's situation.

3) Imitation loss
Our goal is to find the best set of actions that minimizes the imitation loss in terms of FE. Fig. 25 shows that the normalized global FE (G) drops rapidly in fewer than 50 training episodes, and after 200 episodes its value continues to decrease below 0.1. Moreover, Fig. 27 shows the behavior of G considering different preferences of the L, i.e., keeping on following the other dynamic agent Ô, overtaking from the left side, or overtaking from the right side. Two main factors affect the global FE: the motion distinction at time k and the divergence at time k + 1 after the L agent performs a specific action. Fig. 28 illustrates the imitation loss during the online active learning phase. We show that our method can minimize the motion distinction (F), which is controlled by the action selection at each time instant. Further, improving the action selection process leads to minimizing the divergence ($\dot{F}$) between prediction and evidence. Therefore, by minimizing the imitation loss in both cases, the L agent learns to maximize the likelihood with the E behavior and to overcome unobserved situations. In addition, Fig. 28 shows that the proposed AIL is capable of achieving higher imitation rates compared with the other learning methods. Fig. 29 and Fig. 30 present the performance of the proposed method (AIL) in terms of success rate, collision rate and out-of-boundary rate during training and testing, respectively, and provide a comparison with the other methods. The proposed method (AIL) performs best among all methods, during both training and testing, which is attributed to the effectiveness of the decision making while dealing with dynamic changes in the environment, improving the success rate by preventing going out of boundary and avoiding collisions. Besides, during testing, results show that after 5k training episodes the agent can change lane to overtake the other dynamic agent in the environment effectively, while other methods still have high collision probabilities, as shown in Fig. 31.

V. CONCLUSION
In this work, we proposed a novel framework to integrate Active Inference with Imitation Learning (AIL) for autonomous driving. The proposed AIL framework is based on learning a situation model, encoded in a coupled Dynamic Bayesian Network (DBN), explaining the dynamic interactions between two moving agents (i.e., an expert agent and a dynamic object). The situation model is used to initialize a first-person (FP) model, which the learner agent can use to predict expert-object dynamic interactions and evaluate the situation. During the online process, the learner agent is equipped with an Active-FP model consisting of the FP model and active states representing actions, thus enriching the learner agent with the capability to predict the expert dynamics and the expected relative distance from a moving object in order to perform efficient actions. The learner agent relies on an abnormality indicator that measures how much observations support its expectations to decide whether to imitate the expert's behaviour under normal situations or to explore new actions in abnormal situations (i.e., unseen by the expert). Under the active inference approach, we showed how the learner can incrementally learn a new set of configurations and actions that allow it to jointly optimise internal predictions (about the surrounding environment) and action selection (to come near the situation model), leading to free-energy minimization. Experimental results have shown that perceptual learning and inference are required to induce prior expectations about how new experiences and abnormalities unfold, and that action is taken to resample the world in order to meet these expectations. This places perception and action together, so that driving is based solely on the FE policies; the experiments address the general applicability of the approach to autonomous driving and its generalization across different changes in dynamic environments. In addition, the results have indicated that the proposed approach outperforms reinforcement learning (RL) methods such as Q-learning, Double Q-learning (DQN) and Inverse RL (IRL) in terms of the number of selected actions, successful travel rate, collision probability, out-of-boundary probability, and imitation loss. Future work will focus on integrating Generalized Filtering into the Active-FP model to better utilize the updated transition matrices and improve the predictive abilities at multiple levels, endowing the learner agent with the capability to explain abnormal situations and how they can be avoided in the future.

CARLO REGAZZONI is full professor of Cognitive Telecommunications Systems at DITEN, University of Genoa, Italy. He has been responsible for several national and EU funded research projects. He is currently the coordinator of international PhD courses on Interactive and Cognitive Environments involving several European universities. He has served as general chair of several conferences and as associate/guest editor of several international technical journals. He has served in many roles in governance bodies of the IEEE SPS and served as Vice President Conferences of the IEEE Signal Processing Society in 2015-2017. He is author/coauthor of more than 100 papers in international scientific journals and of more than 300 papers at peer-reviewed international conferences.