Explanation-Aware Experience Replay in Rule-Dense Environments

Human environments are often regulated by explicit and complex rulesets. Integrating Reinforcement Learning (RL) agents into such environments motivates the development of learning mechanisms that perform well in rule-dense and exception-ridden environments such as autonomous driving on regulated roads. In this paper, we propose a method for organising experience by means of partitioning the experience buffer into clusters labelled on a per-explanation basis. We present discrete and continuous navigation environments compatible with modular rulesets and 9 learning tasks. For environments with explainable rulesets, we convert rule-based explanations into case-based explanations by allocating state-transitions into clusters labelled with explanations. This allows us to sample experiences in a curricular and task-oriented manner, focusing on the rarity, importance, and meaning of events. We label this concept Explanation-Awareness (XA). We perform XA experience replay (XAER) with intra and inter-cluster prioritisation, and introduce XA-compatible versions of DQN, TD3, and SAC. Performance is consistently superior with XA versions of those algorithms, compared to traditional Prioritised Experience Replay baselines, indicating that explanation engineering can be used in lieu of reward engineering for environments with explainable features.


I. INTRODUCTION
HUMAN-REGULATED environments often rely on legislation and complex sets of rules. At present, Reinforcement Learning (RL) methods are usually tested in environments with relatively sparse rules and exceptions [17]. Denser regulations appear in applications of RL for autonomous vehicles research, but such rulesets are often fixed in terms of complexity [20]. We are interested in observing how learning can be affected by the depth of the rulesets governing such systems. With large numbers of corner cases arising as a consequence of dense rulesets, generating a sufficiently diverse set of experiences and exposing these exceptions to an RL agent can be challenging. Some works in the literature propose to sample past experiences related to those exceptions, heuristically revisiting potentially important events. Among them, the technique of Prioritised Experience Replay (PER) [29] over-samples the experiences that are most poorly captured by the agent's learned model. However, this mechanism does not necessarily focus on the cause of events or their exceptional nature.
In this letter, we pursue the intuition that explanations are a pivotal mechanism for human intelligence, and that this mechanism has the potential to boost the performance of RL agents in complex environments. This is why we draw inspiration from user-centred explanatory processes for humans [31], and design a set of heuristics and mechanisms for prioritised experience replay to explain complex regulations to a generic off-policy RL agent. A central design challenge towards this goal is integrating explanations into computational representations. Approaches such as encoding the ruleset (or part of it) into the agent's observation space may incur severe re-training overhead even under minimal ruleset changes, as the semantics of the regulation are explicitly provided as input [18].
This minimises compatibility with extant methods and may obscure whether differences in performance are due to changes to the architecture or the complexity of the ruleset. We propose a solution that is agnostic to explicitly engineering state and observation spaces, using an explanation-aware experience replay mechanism.
In our approach, we avoid explicit representations of the ruleset (i.e. rule-based explanations [7]) by instead representing the meaning of the regulations as organised collections of examples (i.e. case-based explanations [1]). These explanations do not need to be understood by the agent in the traditional sense, but can still convey meaning if the example was labelled/explained in a semantic and meaningful process. In a ludic example, suppose a young man, called Luke, is taking hyperspace flight lessons from his exasperated friend Chewbacca. However, he does not understand a single word of Shyriiwook, the tutor's language. With sufficient repetition, Luke can associate distinct Wookiee growls (and punishments) to categories of experienced episodes, even if the content of the message is in an unknown language. Eventually, Luke would learn the meaning of the most relevant utterances by associating them to the experienced consequences. Hence, our approach modifies conventional experience replay structures by partitioning the replay buffer (or memory) into multiple clusters, each representing a distinct explanation associated with a collection of experiences that serve as examples. We call this process Explanation-Aware Experience Replay (XAER) (see Figure 1) and integrate this technique into three seminal learning algorithms: Deep Q-Networks (DQN) [23], Twin-Delayed DDPG (TD3) [11], and Soft Actor-Critic (SAC) [13].
In summary, we state the following contributions:
• We show how distinct types and instances of explanations can be used to partition replay buffers and improve the rule coverage of sampled experiences.

II. RELATED WORK
In this section we give the necessary background to understand our proposed solution and the following experiments, along with prior related work.

A. Model-Free Reinforcement Learning
A Reinforcement Learning problem is typically formalised as a Markov Decision Process (MDP). In this setting, an agent interacts at discrete time steps with an external environment. At each time step $t$, the agent observes a state $s_t$ and chooses an action $a_t$ according to some policy $\pi$, that is, a mapping (a probability distribution) from states to actions. As a result of its action, the agent obtains a reward $r_t$, and the environment passes to a new state $s' = s_{t+1}$. The process is then iterated until a terminal state is reached.
The future cumulative reward $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the total accumulated reward from time $t$ onwards. $\gamma \in [0, 1]$ is the discount factor, representing the difference in importance between present and future rewards. The goal of the agent is to maximise the expected cumulative return starting from an initial state $s = s_t$. The action value $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[R_t \mid s = s_t, a = a_t]$ is the expected return for selecting action $a$ in state $s_t$ and following policy $\pi$ thereafter. Given a state $s$ and an action $a$, the optimal action value function $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ is the best possible action value achievable by any policy. Similarly, the value of state $s$ given a policy $\pi$ is $V^{\pi}(s) = \mathbb{E}_{\pi}[R_t \mid s = s_t]$ and the optimal value function is $V^{*}(s) = \max_{\pi} V^{\pi}(s)$.
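As a worked illustration of the return defined above, the following minimal sketch computes the discounted sum for a finite reward sequence. The function name `discounted_return` and the sample rewards are illustrative assumptions, not taken from the paper; the infinite horizon is truncated to the rewards actually observed.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite list of rewards."""
    total = 0.0
    # Accumulate backwards, using the recursion R_t = r_t + gamma * R_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-step episode: 1.0 + 0.5 * 0.0 + 0.25 * 2.0
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

With `gamma = 0` only the immediate reward counts, while `gamma = 1` weighs all future rewards equally.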
Two of the major approaches to RL are value-based and actor-critic algorithms. Value-based algorithms, such as Deep Q-Networks (DQN) [23], use temporal difference learning, where policy extraction is done after an optimal value function is found. Actor-critic methods, such as Twin-Delayed DDPG (TD3) [11] and Soft Actor-Critic (SAC) [13], rely on evaluating and improving a policy (via gradient descent) together with a state-value function. DQN is one of the very first value-based deep RL algorithms, designed to work on discrete action-spaces only. Adaptations for continuous action-spaces include DDPG and its successor TD3, which addresses value overestimation by means of clipped double Q-learning, delayed updates of target and policy networks, and target policy smoothing. However, one of the main limitations of TD3 is that it randomly samples actions using a predefined distribution. To overcome this limitation, Soft Actor-Critic (SAC) also learns the distribution from which actions are sampled, allowing the agent to explore a wider range of strategies through entropy maximisation.

B. Explanations in RL
The most important field studying explanations in AI and RL is eXplainable AI (XAI) [3]. Among the many surveys on XAI, a common dimension used to classify explanations is the representative format used to convey them. Within this domain, explanations are commonly conveyed via textual/visual descriptive representations of the decision criteria (i.e. rule-based), or with similar examples (i.e. case-based). An example of a rule-based explanation is 'you will get a penalty for reaching 75, which is above the speed limit of 50', based on the rule 'if speed is above 50, you will get a penalty'. An example of a case-based explanation, by contrast, is 'you get a penalty because you are in a situation similar to this other vehicle that reached speed 74 and was previously penalised'.
Dietterich and Flann [9] frame explanation-based RL as a case-based explanatory process where prototypical trajectories of state-transitions are used to tackle similar but unseen situations, while Chow et al. [8] implement a rule-based method, constraining the Markov Decision Process by means of Lyapunov functions.
Generally speaking, many rule-based methods for explaining to RL agents fall under the umbrella of a subdiscipline called Safe RL [12]. Safe RL includes techniques both for encoding rules in the optimality criterion [8], [28] and for incorporating such external knowledge into the action/state space [4]. Although not generating explicit explanations, those methods engineer safety rules into the learning process, implicitly explaining to the agent what not to do. Alternatively, a famous example of case-based methods for explaining to RL agents is that of Imitation Learning [15], where demonstrations (as trajectories of state-transitions generated by a human or expert algorithms) are used to train the RL agent. These can be seen as high-quality cases/examples provided by an expert human or algorithm. However, access to human expert data may not scale well to every domain, and not all problems have accessible expert algorithms.
We are interested in sampling the most useful experiences to cover a particular agent's gap in knowledge. An agent-centred explanatory process is an iterative process that follows the agent through the process of learning and selects the most useful explanations for it, at every time-step. Below, we look at how experience replay techniques tackle this issue in off-policy RL.

C. Prioritised Experience Replay
Algorithms such as DQN, TD3, and SAC aim to find a policy that maximises the cumulative return, by keeping and learning from a set of expected returns estimates for some past policy $\pi$. This set of expected returns is kept in an experience buffer, enabling experience replay. Experience replay [29] consists of reusing information from the space of sampled experiences. The agent's experiences at each time-step $t$ are stored as transitions $e_t = (s_t, a_t, r_t, s_{t+1})$, where $s_t$, $a_t$, $r_t$ represent the state, action, and reward at time $t$, followed by the next state $s_{t+1}$. These transitions are pooled over many episodes into a replay memory, which is usually randomly sampled for a mini-batch of experiences.
Experience sampling can be improved by differentiating important transitions from unimportant ones. In Prioritised Experience Replay (PER) [29], the importance of transitions with high expected learning value is measured by the magnitude of their temporal-difference (TD) error. Experiences with larger TD error are sampled more frequently, as the TD error quantifies the unexpectedness of a given transition [19]. This prioritisation can lead to a loss of diversity and introduce biases. Bias in prioritised experience replay occurs when the sampling distribution is changed in an uncontrolled way, which in turn changes the solution that the estimates will converge to. This bias can be corrected through importance-sampling (IS) weights.
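The prioritisation and bias correction described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the exponents `alpha` and `beta`, the constant `eps`, and their values come from the standard proportional PER formulation rather than from this text, and all function names are hypothetical.

```python
import random


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """IS weights (N * P(i)) ** -beta, normalised by the largest weight."""
    n = len(probs)
    raw = [(n * p) ** -beta for p in probs]
    largest = max(raw)
    return [w / largest for w in raw]


# Hypothetical TD errors for four stored transitions.
td_errors = [0.1, 2.0, 0.5, 0.0]
probs = per_probabilities(td_errors)
weights = importance_weights(probs)

# Draw a mini-batch of indices; transitions with large TD error appear more often.
rng = random.Random(0)
batch_indices = rng.choices(range(len(td_errors)), weights=probs, k=2)
```

The transition that is sampled most often receives the smallest weight, so its larger sampling frequency is compensated by a smaller contribution to each update.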
Many approaches to Prioritised Experience Replay (PER) in RL [29] can be re-framed as mechanisms for achieving agent-centrality, re-ordering experience by relevance in the attempt of explaining to the agent and selecting the most useful experience, as indirectly suggested by Li et al. [19]. Over the years, many human-inspired intuitions behind PER drove researchers towards improved, more sophisticated and agent-centred mechanisms to RL [32], [33], [34]. Among these works, the closest to a fully agent-centred explanatory process is Experience Replay Optimisation [34], which moves towards agent-centrality by providing an external black-box mechanism (or experience sampler) for extracting arbitrary sequences of information out of a flat (no abstraction involved) experience buffer. The experience sampler is trained to select the most 'useful' ones for the learning agent. However, due to its non-explainable nature, it is not clear whether the benefits given by Experience Replay Optimisation are due to the overhead given by the experience sampler increasing the number of neurons in the agent's network.
Another work trying to achieve agent-centrality in this sense is Attentive Experience Replay [32], which proposes prioritising uncommon experience that is also on-distribution (related to the agent's current task). However, this work, like the previous one, falls short of explicitly organising experience in an abstract-enough way by conveying human-readable explanations to the agent. Hierarchical Experience Replay [33] has attempted to address the abstraction issue in an attempt to simplify the task for the agent, decomposing it into sub-tasks. However, it does not do so in an agent-centred and goal-oriented way, given that its sub-task selection is uniform and not curricular.
On the other hand, a curricular approach for training RL agents was proposed by Ren et al. [27], exploiting PER and the intuition that simplicity is inversely proportional to TD-errors, but not exploiting any abstract and hierarchical representation of tasks. Similarly to ours, [30] aims to organise experience abstractly, based on its explanatory content, framed as the ability to answer how good/bad a sequence of state-transitions is with respect to average experience. This work only considers explanations about the immediate performance of the agent (i.e. HOW explanations), and lacks any consideration of other and richer types (i.e. WHY), as well as curricular prioritisation facilities.

III. EXPLANATION-AWARENESS
Our use of explanations is aligned to Holland's [14] and Achinstein's [2] philosophical theories of explanations. In fact, in the former, the act of explaining is framed as a process of revising belief whenever new experience challenges it. In the latter, explaining is the attempt to answer questions (such as 'why', 'what', etc [31]) in an agent-centred way. Specifically, we propose a transformation of rule-based explanations (e.g. given by a ruleset/culture) to case-based explanations (experience), which are compatible with experience replay. Leaning on the concept of Explanation-Awareness (XA), our heuristics facilitate information acquisition via the organisation of experience buffers.
Drawing from an epistemic [22] interpretation of explanations, we argue that a central aspect of providing case-based explanations to an RL agent comes from meaningfully reordering experience to a greater degree. The intuition behind how we construct our case-based explanations is: 'a simple set of relevant state-transitions representing abstract-enough aspects of the problem to be solved.' This intuition motivates the heuristics of abstraction, relevance, and simplicity (ARS, in short). We adapt these heuristics from prior work [31] in the HCI domain, where they are presented in greater abstraction to form a higher-level taxonomy and knowledge graph for an interactive explanatory process.
Consider a problem where an RL agent has to learn a policy to optimally navigate through an environment with sophisticated rules and exceptions (e.g. a real traffic regulation with exceptions for special types of vehicles). Let the state-transition $\tau = (s_t, a_t, r_t, s_{t+1})$ denote the transition from state $s_t$ to state $s_{t+1}$ by means of action $a_t$, yielding a reward $r_t$. We assume the environment is imbued with explanatory capabilities via an explainer. Note that the explanations generated by the explainer can have virtually any representation, be it human-understandable or not, provided they are distinct and serve the purpose of labelling different clusters.
Definition 1 (Explainer): The explainer $: \Omega \to ES$ is a function that maps a list of state-transition tuples $\tau \in \Omega$ to an explanation $e_r \in ES$, where $\Omega$ is the space of possible state-transitions and $ES$ is the explanatory space, i.e., the space of all possible explanations.
An agent who has more diverse experiences with regard to the reasons (explanations) associated with rewards will have a better chance at converging towards a policy that better represents the underlying ruleset. Therefore, we posit that the more complex the environment is in terms of rules, the more useful Explanation-Awareness (XA) should be, as it would ensure a more even distribution of experiences with regard to different reasons justifying rewards. This diversity of explanations culminates in a clustering that is semantic by nature, and transitions are partitioned according to the explanation that represented their reward.
Definition 2 (XA Clusters): Let $\tau_e = (s_t, a_t, r_t, e_{r_t}, s_{t+1})$ be an XA state-transition represented by the explanation $e$, where $\tau_e : \tau \times e_r$, $\tau \in \Omega$ and $e \in ES$. Let $\Omega$ be the set of all state-transitions. We say $C = \{C_{e_1}, \ldots, C_{e_k}\}$ is the set of XA clusters seen in $\Omega$, where $k$ is the number of different explanations seen.
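Definitions 1 and 2 can be illustrated with a small sketch. Everything named here is a hypothetical stand-in: `speed_explainer` is a toy explainer for a speed-limit rule, applied to one transition at a time for brevity, and states are plain dictionaries rather than the observation format of the environments introduced later.

```python
from collections import defaultdict


def speed_explainer(transition):
    """Toy explainer: return a label explaining the reward of one transition."""
    state, action, reward, next_state = transition
    if next_state["speed"] > next_state["limit"]:
        return "above_speed_limit"
    return "within_speed_limit"


def build_xa_clusters(transitions, explainer):
    """Partition transitions into clusters keyed by their explanation."""
    clusters = defaultdict(list)
    for transition in transitions:
        state, action, reward, next_state = transition
        explanation = explainer(transition)
        # Store the XA transition (s_t, a_t, r_t, e, s_{t+1}) in the cluster of e.
        clusters[explanation].append((state, action, reward, explanation, next_state))
    return dict(clusters)


# Hypothetical transitions (state, action, reward, next_state).
transitions = [
    ({"speed": 40, "limit": 50}, "accelerate", 0.8, {"speed": 45, "limit": 50}),
    ({"speed": 45, "limit": 50}, "accelerate", -1.0, {"speed": 75, "limit": 50}),
    ({"speed": 30, "limit": 50}, "keep", 0.6, {"speed": 30, "limit": 50}),
]
clusters = build_xa_clusters(transitions, speed_explainer)
```

The labels only need to be distinct; the agent never reads them, they serve solely to decide which cluster a transition is stored in.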
We introduce our adaptation of ARS, below.

A. Abstraction: Clustering Strategies
The purpose of the abstraction heuristic is to regulate the level of granularity of the explanations, hence of the experience clusters. Our abstractions are based on the understanding that explanations are indeed answers to questions. Hence, explanations may have different granularity defined by the level-of-detail of the question they answer.
In more detail, the HOW explanations we consider answer the question 'How well is the agent performing with this reward?'. This type of explanation can be produced by studying the average behaviour of an agent. For example, if an episode has a cumulative reward that is greater than the running mean, then the explanation indicates that the agent is behaving better than average. Hence, these HOW explanations do not need to be designed with any specific domain knowledge, as they are governed exclusively by the performance of the agent. On the other hand, the WHY explanations we consider answer the question 'Why did the agent achieve this reward?'. These WHY explanations could depend on an explainer function with task/domain knowledge that can distinguish and cluster types of transitions (see Example 1, below). Furthermore, WHY and HOW explanations (or any other type) can be combined so that the explanation answers both of the associated questions.
In order to compose the experience buffer, represented by the set of experience clusters $C = \{C_{e_1}, \ldots, C_{e_k}\}$, we consequently devise the following clustering strategies, for each explanation type:
1) HOW: The experience buffer is divided into 2 clusters $C_{better}$ and $C_{worse}$, where $C_{better}$ contains batches with rewards greater than the running mean of rewards, and vice-versa (given a sliding window of a defined size).
2) WHY: The number of clusters is equivalent to the number of distinct explanations available. If a batch can be explained by multiple explanations simultaneously, we select the explanation associated with the smallest cluster (most under-represented) and the batch is associated to the corresponding cluster.
3) HOW+WHY: a combination of HOW and WHY strategies. There are two custom $C_{better}$ and $C_{worse}$ clusters for every WHY explanation, formed after their concatenation.
Example 1: Suppose a hypothetical football environment with a WHY explainer function. This function could either be part of the environment (a logical mechanism that recognises when certain states are reached and produces a state label), or an external mechanism that receives state-transitions as input and produces explanations. The explanations could be generated by the rules of the game, such as 'goal', 'offside', or 'foul'. The corresponding WHY clusters would be $C = \{C_{goal}, C_{offside}, C_{foul}, \ldots\}$, where each cluster would contain a set of state-transitions associated with each label. If HOW+WHY were used, clusters would be $C = \{C_{goal\_better}, C_{goal\_worse}, C_{offside\_better}, \ldots\}$.
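The three clustering strategies can be sketched as labelling functions. This is an illustrative reading under stated assumptions: the class and function names are hypothetical, the running mean is taken over the rewards seen before the current batch, and the window size of 100 is arbitrary.

```python
from collections import deque


class HowLabeller:
    """Label a batch 'better' or 'worse' against a sliding-window running mean."""

    def __init__(self, window=100):
        self.recent = deque(maxlen=window)

    def label(self, batch_reward):
        # Assumption: the mean excludes the current batch; the first batch
        # is compared with itself and is therefore labelled 'worse'.
        mean = sum(self.recent) / len(self.recent) if self.recent else batch_reward
        self.recent.append(batch_reward)
        return "better" if batch_reward > mean else "worse"


def why_label(candidate_explanations, cluster_sizes):
    """Among the applicable explanations, pick the one with the smallest cluster."""
    return min(candidate_explanations, key=lambda e: cluster_sizes.get(e, 0))


def how_why_label(why, how):
    """Concatenate both labels, e.g. 'goal' and 'better' give 'goal_better'."""
    return f"{why}_{how}"


# Hypothetical usage with the football labels of Example 1.
how = HowLabeller(window=3)
how_labels = [how.label(r) for r in [1.0, 2.0, 0.5]]
chosen = why_label(["goal", "foul"], {"goal": 10, "foul": 2})
combined = how_why_label(chosen, how_labels[1])
```

Choosing the smallest applicable cluster in `why_label` is what steers ambiguous batches towards the most under-represented explanation.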
After clustering state-transitions using the prior clustering strategies, we propose mechanisms for assessing the relevance of specific state-transitions during learning.

B. Relevance: Intra-Cluster Prioritisation
Prioritisation mechanisms are used for organising information given their relevance to the agent's objectives.
The priority of a batch is usually estimated by computing its loss with respect to the agent's objective [29]. In DQN, TD3, and SAC, relevance is estimated by the absolute TD-error of the agent. The closer to 0, the lower the loss and the relevance. The intuition is that batches with TD-error equal to zero are of no use since they represent an already solved challenge. In our method, this relevance heuristic can be combined with the aforementioned clustering strategy by sampling clusters in a prioritised way (by summing the priorities of all its batches) and then performing prioritised sampling of batches from the sampled cluster.
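The two-level sampling described above (first a cluster, then a batch inside it) can be sketched as follows. The data layout, a dictionary mapping each cluster label to a list of `(batch, priority)` pairs, and the function name are assumptions made for this illustration.

```python
import random


def sample_batch(clusters, rng):
    """Sample a cluster by summed priority, then a batch by its own priority."""
    labels = list(clusters)
    # Cluster priority: sum of the priorities of all its batches.
    cluster_priorities = [sum(p for _, p in clusters[label]) for label in labels]
    label = rng.choices(labels, weights=cluster_priorities, k=1)[0]
    batches = clusters[label]
    batch, _ = rng.choices(batches, weights=[p for _, p in batches], k=1)[0]
    return label, batch


# Hypothetical buffer: priorities stand in for absolute TD-errors.
buffer = {
    "goal": [("g1", 0.2), ("g2", 0.1)],
    "foul": [("f1", 1.5), ("f2", 0.3)],
}
rng = random.Random(0)
label, batch = sample_batch(buffer, rng)
```

A cluster whose batches all have priority close to zero is rarely selected, which matches the intuition that already-solved cases carry little learning value.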

C. Simplicity: (Curricular) Inter-Cluster Prioritisation
Occam's Razor [5] states that when presented with two explanations for the same phenomenon, the simplest explanation should be preferred. In human explanations, simplicity is a common heuristic [16], [24]. We will adhere to those principles and select minimal and simple explanations, following a curricular approach.
Clustered prioritised experience replay changes the real distribution of tasks by means of over-sampling. Assuming that the whole experience buffer has a fixed and constant size $N$, and that the experience buffer contains $|C|$ different clusters, let $S_{min}$ and $S_{max}$ be the minimum and maximum size of a cluster. When the buffer is full, a new experience is added by removing the oldest one among the clusters having more elements than $S_{min}$.
If all the clusters have the same size (therefore $S_{min} = S_{max}$), replaying the task's cluster with the highest (TD-error) priority might push the agent to tackle the exceptions before the most common tasks, preventing the agent from learning an optimal policy faster. The assumption here is that exceptional tasks (exceptions) are less frequent.
On the other hand, if $S_{min} = 0$ and $S_{max} = \infty$, the size of a cluster would depend only on the real distribution of tasks within a small sliding window, as in traditional PER, thus preventing over-sampling. The presence of clusters helps over-sampling batches likely related to under-represented tasks, and learning to tackle potentially hard cases more efficiently.
Consequently, we posit that $S_{min}$ shall be large enough for effective over-sampling, while $S_{max} > S_{min}$ shall depend on the real distribution of tasks. This will push the agent towards tackling the most frequent and relevant tasks first, analogously to curricular learning. We define a hyperparameter to control the cluster size proportion.
Definition 3 (Cluster Size Proportion): In order for all clusters to have a size $S_{min} \le S \le S_{max}$, we set $S_{max} = S_{min} + (\xi - 1) \cdot |C| \cdot S_{min}$, where $\xi \ge 1$ represents the cluster size proportion. Therefore, $S_{min} = \frac{N}{|C| \cdot \xi}$ can be easily controlled by modifying $\xi$. We enforce $S_{min} < S_{max}$ when $\xi > 1$. Consequently, for curricular prioritisation, if the cluster's priority is (for example) computed as the sum of the priorities of its batches, and $\xi > 1$ is not too large (e.g. $\xi = 5$), the resulting cluster's priorities will reflect the real distribution of tasks while smoothly over-sampling the most relevant tasks. This avoids over-estimation of the priority of a task. As $\xi$ gives us control of the degree of on-policyness, different values of $\xi$ might perform better depending on the algorithm and environment. Higher values of $\xi$ mean that the distribution of state-transitions reflects more transitions seen within the current policy, thus being advantageous for entropy-maximisation algorithms such as SAC [19]. Likewise, fully off-policy algorithms such as DQN may exhibit superior results with low values of $\xi$ (e.g. $\xi = 1$).
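As a worked example of Definition 3, the sketch below computes both bounds from the buffer size, the number of clusters, and the cluster size proportion. The function name and the numbers (a buffer of 1000 transitions split into 4 clusters) are illustrative assumptions.

```python
def cluster_size_bounds(buffer_size, n_clusters, xi):
    """Return (S_min, S_max) for a buffer of N items, |C| clusters and proportion xi."""
    if xi < 1:
        raise ValueError("xi must be at least 1")
    s_min = buffer_size / (n_clusters * xi)
    s_max = s_min + (xi - 1) * n_clusters * s_min
    return s_min, s_max


# xi = 1: every cluster has the same fixed size N / |C|.
equal_bounds = cluster_size_bounds(1000, 4, xi=1)

# xi = 5: each cluster is guaranteed 50 slots and may grow up to 850.
curricular_bounds = cluster_size_bounds(1000, 4, xi=5)
```

In the second case one cluster at its maximum (850) plus the three others at their minimum (3 × 50) fill the buffer exactly, so the guaranteed slots take up a fraction 1/ξ of the buffer and the rest follows the observed distribution of tasks.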
With those mechanisms in place, we propose new environments to evaluate the performance of agents when subjected to complex rulesets.

D. Annealing the Bias
Similarly to PER [29], sampling state-transitions from prioritised clusters might produce unwanted bias. The standard debiasing function of PER weighs expected values using the normalised weight $\frac{P(\bar{\tau})}{P(\tau)} \in [0, 1]$, where $P(\tau)$ is the probability of sampling a state-transition $\tau$ from the whole buffer and $\bar{\tau}$ is the state-transition with the lowest probability for the whole buffer. We adapted the debiasing function of PER by changing the formula to consider the fact that state-transitions are sampled from clusters (which are in turn sampled). Therefore, the debiasing function of XAER computes the joint probability of sampling both a cluster $c$ and a state-transition $\tau$. More precisely, considering that the two events are not independent, we compute this joint probability as $P(c) \cdot P(\tau \mid c)$. Hence, the normalised weights produced by the debiasing function of XAER are given by $\frac{P(c \cap \bar{\tau})}{P(c \cap \tau)}$, where $P(c \cap \bar{\tau})$ is the lowest possible probability, considering any couple of clusters and state-transitions.
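A minimal sketch of this debiasing step is given below. It reads the normalised weight as the lowest joint probability divided by the joint probability of the sampled pair, which is the reading that keeps the weight in [0, 1]; the function name, the dictionary layout, and the probabilities are assumptions made for illustration, and all probabilities are assumed to be strictly positive.

```python
def xaer_debias_weights(cluster_probs, within_probs):
    """Return weights keyed by (cluster, index), normalised by the lowest joint probability.

    cluster_probs: label -> P(c)
    within_probs:  label -> list of P(tau | c) for the transitions in that cluster
    """
    joint = {
        (label, index): cluster_probs[label] * p
        for label, probs in within_probs.items()
        for index, p in enumerate(probs)
    }
    lowest = min(joint.values())
    return {key: lowest / p for key, p in joint.items()}


# Hypothetical buffer with two clusters sampled with equal probability.
cluster_probs = {"goal": 0.5, "foul": 0.5}
within_probs = {"goal": [0.5, 0.5], "foul": [1.0]}
debias = xaer_debias_weights(cluster_probs, within_probs)
```

Here the single transition in the small cluster is twice as likely to be drawn as either transition in the larger one, so it receives half their weight.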

IV. ENVIRONMENTS
Real-life air/sea/road traffic regulations are often complex, and their mastery is a crucial aspect of orderly navigation. Many realistic settings have a number of exceptions that must be taken into consideration (e.g. ambulances are not subjected to some rules when in emergencies, sailing boats have different priorities if on wind power, etc). To evaluate the effect of XAER in a diverse configuration space of environments, we developed modular environments that allow us to systematically change their properties in evaluation. Our environments allow for agents to experience the same rules (our Easy, Medium, and Hard rulesets) in both discrete and continuous state-action spaces, and with frequent and sparse rewards. In these, the agent must understand the complex regulation governing the penalty system. To implement our rulesets, we use cultures [25], [26]: a mechanism to encode human rulesets as machine-compatible argumentation frameworks, imbued with fact-checking mechanisms. These can serve as explainer functions to produce rule-based explanations from an agent's behaviour.
Fig. 2. Diagrams representing our proposed GridDrive and GraphDrive environments. In GridDrive, the agent has a discrete action space and must observe the properties of neighbouring cells to make a decision that is compatible with the ruleset, choosing one direction and a fixed speed. GraphDrive is a harder environment, where the agent's action and observation spaces are continuous. In it, kinematics are taken into consideration and the agent must not only learn the rules governing penalties, but also to accelerate and steer without going off-road. The goal in both environments is to visit as many new roads as possible without infringing rules.

The environments are:

A. GridDrive -Discrete
A 15×15 grid of cells, where every cell represents a different type of road (see Figure 2, left), with base types (e.g. motorway, school road, city) combined with other modifiers (roadworks, accidents, weather). Each vehicle will have a set of properties that define which type of vehicle they are (emergency, civilian, worker, etc). Complex combinations of these properties will define a strict speed limit for each cell, according to the culture.
Rewards. Let $0 < s \le 1$ denote the normalised speed of the agent in that step. Rewards are given at every step, given the following criteria:
if on previously-visited cell
$s$, otherwise (new cell, within speed limit)

B. GraphDrive -Continuous
A Euclidean representation of a planar graph with $n$ vertices and $m$ edges (see Figure 2, right). The agent starts at the coordinates of one of those vertices and has to drive between vertices (called 'junctions') in continuous space with Ackermann-based non-holonomic motion. Edges represent roads and are subject to the same rules and properties as those seen in GridDrive, plus a few extra rules to encourage the agent to stay close to the edges. The incentive is to drive as long as possible without committing speed infractions. In this setting, the agent must learn a control input that not only keeps the vehicle on the road, but also respects speed limits and restrictions that may vary on a case-by-case basis. We test two variations of this environment: one with dense and another with sparse rewards.
Observations. A sample in the observation space for GraphDrive is a tuple $(o_v, o_r, o_j)$, where $o_v$ denotes a concatenation of the vehicle's properties (car features, position, speed/angle, distance to path, junction status, number of visited junctions), $o_r$ is the concatenation of the properties of the closest road to the agent (likely to be the one the agent is driving on), and $o_j$ is the concatenation of the properties of roads connected to the next junction.
Rewards (dense version). Let $0 < s \le 1$ denote the normalised speed of the agent in that frame, and let $n$ be the number of unique junctions visited in the episode. Rewards are given at every frame, given the following criteria:
if on junction or previously-visited road
$s$, otherwise (on road, within speed limit)
Rewards (sparse version). In this version, the agent will get null (zero) reward when moving correctly. Positive rewards only appear when the agent manages to acquire a new junction. Therefore, the agent will have to drive entire roads correctly to get any positive reward. Rewards are given according to the following criteria:
, if breaking the speed regulation
$-1$ (terminal), if off-road or U-turning outside junction
$0$, driving normally or on acquired junction
$1$, the instant a new junction is acquired
Every episode begins with an initialisation of the grid or graph (for GridDrive or GraphDrive, respectively) with random roads, along with randomly-sampled agent properties. The agent is encouraged to drive for as long as possible until it either reaches a maximum number of steps or breaks a rule (terminal state). All environments will be instantiated in versions with 3 different cultures (rulesets), according to their levels of complexity: Easy, Medium, and Hard.

V. EXPERIMENTS
In this section we describe our experimental setup and present results obtained in our proposed environments with XAER versus traditional PER. We trained 3 baseline agents with traditional PER (DQN/Rainbow, SAC, and TD3). For each of the 3 baseline algorithms, we train 3 XAER versions with different clustering strategies, using HOW, WHY, and HOW+WHY explanations (see Section III). Additionally, we show results for HOW+WHY explanations without the simplicity heuristic (prioritised clustering) -i.e. clusters are sampled uniformly. For a total of 12 XA agents, we call the XAERequipped versions of DQN, SAC, and TD3 XADQN, XASAC, and XATD3, respectively. DQN and XADQN agents are applied to GridDrive (discrete), whilst SAC, TD3, XASAC, and XATD3 3 were trained separately on GraphDrive with dense and sparse rewards (continuous).
The neural network adopted for all the experiments is the default one implemented in the respective baselines (although better ones can be certainly devised), and it is characterised by fully connected layers of few units (e.g. 256) followed by the output layers for actors and/or critics, depending on the algorithm's architecture. XAER methods introduce the cluster size proportion (ξ) hyperparameter. We perform ablation experiments to choose values of ξ, and arrive at ξ = 1 for XADQN and XATD3, and ξ = 3 for XASAC. We omit the detailed ablation study for brevity, but full plots and auxiliary results can be found in our GitHub page 4 .
As the environments presented in Section IV have different levels of rule density/complexity, we are interested in observing whether XAER exhibits superior performance compared to traditional PER in tasks that involve learning sophisticated and exception-heavy regulations. We trained all agents up to 4.0 × 10^7 steps sampled on all environments, for a total of approximately 1.6 × 10^8 training steps. Our reported scores are obtained by segmenting the curve of mean episode rewards into 20 regions containing 5% of steps each. We select the best region (highest median) for each agent, so that agents are compared at their respective best performances. We report those medians in Table I, as well as the 25-75% inter-quartile range for the selected region.
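The scoring procedure described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used for Table I: the function name best_region_score is hypothetical, and the sketch assumes that the curve of mean episode rewards is logged at evenly spaced step intervals, so that 20 equal-length slices each cover 5% of the steps.

```python
import numpy as np

def best_region_score(mean_episode_rewards, n_regions=20):
    """Return (median, q25, q75) for the region with the highest median.

    Assumption: the curve is sampled at evenly spaced training steps,
    so equal-length slices correspond to equal shares of the steps.
    """
    curve = np.asarray(mean_episode_rewards, dtype=float)
    # array_split tolerates lengths that are not a multiple of n_regions.
    regions = [r for r in np.array_split(curve, n_regions) if r.size > 0]
    medians = [float(np.median(r)) for r in regions]
    # Select the region where the agent performs best.
    best = regions[int(np.argmax(medians))]
    q25, q75 = np.percentile(best, [25, 75])
    return float(np.median(best)), float(q25), float(q75)

if __name__ == "__main__":
    # Toy curve: performance peaks in the middle of training, then degrades.
    toy_curve = [0.0] * 50 + [10.0] * 5 + [0.0] * 45
    print(best_region_score(toy_curve))
```

Selecting the best region rather than the final one means that an agent whose performance degrades late in training is still credited with its peak, which matches the stated aim of comparing agents at their respective best performances.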
Results in Table I show that, across all tasks and methods, XAER versions only lose to the PER baseline against DQN/Rainbow in GridDrive Easy, by 0.4%. For GridDrive Medium and Hard, XADQN with HOW+WHY explanations exhibits significantly higher performance (57% and 81%, respectively). WHY and HOW+WHY exhibit similar performance in GraphDrive, being bested by HOW in the Medium and Hard Sparse cases only. Although HOW+WHY explanations have consistently good results across environments, the version without the simplicity heuristic exhibited consistently inferior results. Neither baseline SAC nor baseline TD3 managed to learn a policy in GraphDrive Hard Sparse (our hardest environment). XATD3 also failed to learn a policy in this environment, but XASAC was able to achieve positive results.

VI. DISCUSSION AND CONCLUSION
Our results indicate a significant benefit achieved via explanation-aware experience replay. In one case (TD3 Hard), endowing the agent with XAER enabled it to learn where it would otherwise fail entirely. XAER allowed agents to learn in Medium and Hard difficulty settings, obtaining significantly higher rewards with the same hyper-parameters and number of learning steps.
The choice of explanation type also affected results: when superior, HOW+WHY explanations exhibited larger margins of improvement over other XAER methods. In the cases where they were bested by WHY explanations, HOW+WHY explanations still achieved very close results, and were thus consistently satisfactory in most cases. Importantly, although HOW explanations exhibited lower performance than their XAER counterparts in most environments, they do not require an explainer and could in theory be used in any environment. The consistency of HOW+WHY results suggests that the act of explaining may involve answering more archetypal questions, not just causal ones, as also hypothesised in [31].
The frequency and magnitude of rewards are important factors to be considered in XAER clustering. When negative rewards are more frequent (with similar magnitude to positive rewards), and there are more negative than positive clusters, oversampling may cause the agent to tackle situations with negative rewards more frequently, preventing it from maximising cumulative rewards. This effect can be particularly pronounced with very sparse rewards, such as the ones seen in the sparse version of GraphDrive.
Intuitively, this is akin to the notion that if there are few opportunities to explain, one must choose one's explanations well. The notion of explanation engineering surfaces as a mechanism to orient the learning agent by selecting, through abstractions, which experiences (and explanations) are more important to the task at hand. Being explainable by design, explanation engineering can be an intuitive and semantically-grounded alternative to reward engineering, as the meaning of the rewards matters just as much as their magnitude. A few examples include increasing the number of positive clusters, or organising clusters hierarchically.
With regard to relevance, if the cumulative priority of the state-transitions of a whole cluster is low, it may indicate that the agent has already learned to handle the task represented by the cluster, so it may not need it as an explanation (the cluster is thus less relevant). Conversely, if the cumulative priority is high, it could indicate a need for additional explanations. The cluster might represent either a generic or a non-generic task. If the agent needs explanations for a generic task, it should also need them for a non-generic task; in that case, the generic task is prioritised over the non-generic one. The benefits of inter-cluster prioritisation (simplicity) are higher in environments with harder rulesets, and proportional to the complexity of the culture [25]. This suggests that uniformly selecting an explanation type to replay is less beneficial than selecting the simplest and most relevant explanation.
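The relevance criterion discussed above can be illustrated with a short sketch of two-level sampling: a cluster is first drawn with probability proportional to the cumulative priority of its state-transitions, and a transition is then drawn within that cluster according to its own priority. This is only one plausible reading and not the implementation of Section III. The function name sample_transition and the dictionary layout are hypothetical, and the sketch deliberately omits the simplicity ordering between generic and non-generic tasks.

```python
import random

def sample_transition(clusters, rng=random):
    """Draw (label, transition) with inter- and intra-cluster prioritisation.

    clusters: dict mapping an explanation label to a non-empty list of
    (transition, priority) pairs with non-negative priorities.
    Assumption: relevance is the cumulative priority of a cluster.
    """
    labels = [label for label, items in clusters.items() if items]
    # Inter-cluster step: weight each cluster by its cumulative priority.
    totals = [sum(p for _, p in clusters[label]) for label in labels]
    if sum(totals) > 0:
        label = rng.choices(labels, weights=totals, k=1)[0]
    else:
        # All priorities are zero: fall back to uniform cluster sampling.
        label = rng.choice(labels)
    # Intra-cluster step: weight each transition by its own priority.
    transitions, priorities = zip(*clusters[label])
    if sum(priorities) > 0:
        transition = rng.choices(transitions, weights=priorities, k=1)[0]
    else:
        transition = rng.choice(transitions)
    return label, transition

if __name__ == "__main__":
    buffer = {
        "already_learned": [("t1", 0.1), ("t2", 0.1)],
        "poorly_learned": [("t3", 2.0), ("t4", 3.0)],
    }
    print(sample_transition(buffer, random.Random(0)))
```

Under this reading, a cluster whose transitions all have low priority is rarely replayed, which corresponds to the agent no longer needing that explanation. Replacing the cumulative-priority weights with equal weights recovers the uniform cluster sampling used in the variant without the simplicity heuristic.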
This work opens several avenues for further investigation. For one, further experiments could include the development of explainer functions to evaluate the performance of WHY explanations in popular benchmarks. Additionally, future work may observe the effect of XAER with on-policy algorithms, such as PPO. Lastly, the illocutionary effect of explanations deriving from further archetypal questions [31] (i.e. WHAT, WHERE, WHEN) could be explored in advanced explanation engineering for experience clustering.