Evolutionary Learning of Interpretable Decision Trees

In the last decade, reinforcement learning (RL) has been used to solve several tasks with human-level performance. However, there is a growing demand for interpretable RL, i.e., there is the need to understand how an RL agent works and the rationale of its decisions. Not only do we need interpretability to assess the safety of such agents, but also we may need it to gain insights into unknown problems. In this work, we propose a novel optimization approach to interpretable RL that builds decision trees. While techniques that optimize decision trees for RL do exist, they usually employ greedy algorithms or do not exploit the rewards given by the environment. For these reasons, these techniques may either get stuck in local optima or be inefficient. On the contrary, our approach is based on a two-level optimization scheme that combines the advantages of evolutionary algorithms with the benefits of $\mathcal{Q}$-learning. This method allows decomposing the problem into two sub-problems: the problem of finding a meaningful decomposition of the state space, and the problem of associating an action to each subspace. We test the proposed method on three well-known RL benchmarks, as well as on a pandemic control task, on which it proves competitive with the state-of-the-art in both performance and interpretability.


Introduction
While machine learning is achieving very promising results in a variety of fields, there is an emerging need to understand what happens inside the learned models, for testing, security and safety purposes.
There are mainly two approaches that try to address this problem: explainable AI (XAI) and interpretable AI (IAI) (which is actually a subset of XAI).
The field of explainable AI, in recent years, has seen a significant increase in the number of scientific contributions related to the topic. It is important to note, however, that these techniques are not applicable to all tasks. In fact, as stated by [1], it is not safe to apply XAI techniques to safety-critical or high-stakes systems. This is because explanations are usually approximations of the actual models and, as a consequence, do not represent them exactly.
Interpretable AI techniques, instead, are based on the use of interpretable models, i.e. models that can be easily understood and inspected by a human operator [2]. These techniques, besides enabling the assessment of the security and safety of the produced models, can also serve to better understand a problem. In fact, by looking at an interpretable model (with good performance), a human operator can extract knowledge about the problem faced.
However, interpretable AI techniques are not widely used in practice, due to their (usually) lower performance. In fact, it is widely accepted (although not proven) that a trade-off between interpretability and performance exists.
Recent work has addressed the problem of building interpretable reinforcement learning models. In [3], the authors implement a differentiable version of decision trees and optimize them by using backpropagation. Dhebar et al. [4] propose nonlinear decision trees to approximate and refine an oracle policy.
While the results of these approaches seem very promising, the structure of the tree must be defined a-priori. This requires us to either perform a trial-and-error search or to include prior knowledge.
In this work, we present a novel approach to the training of interpretable reinforcement learning agents that combines artificial evolution and lifelong reinforcement learning. This two-level optimization algorithm allows us to reduce the amount of prior knowledge given to the algorithm. Our approach generates agents in the form of decision trees that learn both a decomposition of the state space and the state-action mapping.
The contributions of this paper are the following:
• We propose a two-level optimization approach that optimizes both the topology of the tree and the decisions taken for each state
• We perform experimental tests on classic reinforcement learning problems: CartPole, MountainCar and LunarLander
• We compare the produced agents with both the interpretable and the non-interpretable state-of-the-art
• We quantitatively measure the interpretability of our solutions and compare it to the state-of-the-art
• We interpret the solutions produced to understand how the agents work

This paper is structured as follows. In Section 2 we give some background on the field and related work, while Section 3 describes the method used in our approach. Then, in Section 4 we present the results of our experiments. In Section 5 we discuss our results by comparing them to the interpretable state-of-the-art, performing an ablation study and interpreting the produced solutions. Finally, in Section 6 we draw the conclusions of this work.

Related work
In this section we will give some background on the research problem being faced.
The use of decision trees to learn in reinforcement learning tasks has been explored in several previous works.
McCallum, in [5], proposes U-Trees: a kind of tree able to perform reinforcement learning while handling the following sub-problems: choice of the memories, selective perception and value function approximation. In [6], the authors extend U-Trees in order to make them able to cope with continuous environments.
They propose two novel tests that are used to create new conditions that split the state-space. They test the proposed approach in two environments, a continuous one and an ordered-discrete one, and their results show that their approach is competitive with respect to other approaches.
Pyeatt and Howe, in [7], propose a novel splitting criterion to build trees that are able to perform value function approximation. In their experiments, they compare the performance obtained by using their approach to the ones obtained using other splitting criteria, a table-lookup approach and a neural network. The results show that the proposed approach usually achieves better performance than all the other approaches.
In [8] the authors propose a method that predicts the gain obtained by adding a split to the tree and select the best split to grow the tree. The experimental results show that this method is more effective than the method proposed in [7] on the tested environment.
Silva et al. [3] propose an approach to interpretable reinforcement learning that uses Proximal Policy Optimization on differentiable decision trees. Moreover, they provide an analysis of the learning process while using either Qlearning or PPO. The experimental results show that this approach is able to produce competitive solutions in some of the tasks. However, it is also shown that when discretizing the differentiable decision trees into typical decision trees, the performance may heavily decrease.
In [4], the authors used evolutionary algorithms to evolve non-linear decision trees. By non-linear, the authors mean that each split does not define a linear hyperplane in the feature-space. The experimental results show that this approach is able to obtain competitive performance with respect to a neural network based approach.
In [9], the authors use the grammatical evolution algorithm [10] to evolve behavior trees (tree-based structures that allow more complex operations than a decision tree) for the Mario AI competitions. The proposed agent can perform basic actions or pre-determined combinations of basic actions. Their solution achieved the fourth place in the Mario AI competition. However, in this work the authors only evolve a controller, not exploiting the rewards given by the environment to increase the performance of the agent.
Hallawa et al., in [11], use behavior trees as evolved instinctive behaviors in agents that are then combined with a learned behavior. While behavior trees are usually interpretable, the authors did not explicitly take into account the interpretability of the whole model, which comprises both a behavior tree and either a neural network or a Q-learning table.
Several works have applied the evolutionary computation paradigm to evolve tree-based structures outside the reinforcement learning domain.
Krȩtowski, in [12], proposes a memetic algorithm based on genetic programming [13] and local search to optimize decision trees. The results presented show that this approach is able to obtain performance that is comparable to the state-of-the-art, while keeping the size of the tree significantly lower.
In [14], the authors propose a multi-objective EA to evolve regression trees and model trees. They use a Pareto front to optimize RMSE, the number of nodes and the number of attributes. The experimental results show that this approach is able to obtain performance that is comparable to or better than the state-of-the-art while using fewer nodes and fewer attributes.
In [15,16], the authors use the Genetic and Artificial Life Environment (GALE) to evolve decision trees. Their results show that GALE is able to produce decision trees that are competitive with the state-of-the-art.

Method
In this work, we aim to produce interpretable agents in the form of decision trees. Decision trees are trees (usually binary trees) where each inner node represents a "split" (i.e. a test on a condition) and each leaf node contains a decision. A representation of the proposed decision trees is shown in Figure 1.
When using decision trees for reinforcement learning tasks, two problems need to be addressed:
1. How do we choose the splits?
2. Given a leaf, which action do we assign to it?
Of course, there is an important relationship between splits and decisions taken in the leaves, so changing one of these without changing the other may lead to significant changes in performance.
Several works [8,5,6,7] use greedy heuristics to induce the trees. However, these approaches have the following drawbacks:
• They use greedy rules to expand the trees: since inducing decision trees is an NP-complete problem [17], this may cause the induction of sub-optimal trees (i.e., stuck in local optima) of poor quality [5,18].
• They use tests to expand the trees: this makes these algorithms suffer from the curse of dimensionality because, for each expansion of the tree, all the input variables need to be tested [5,6].
Other works [9] (and [12,14,15,16], even if they are not applied to reinforcement learning tasks) induce trees by means of evolutionary approaches.
However, these approaches rely only on the evolutionary algorithm. In RL tasks, not exploiting the reinforcement signals given by the environment may slow down the evolution and thus result in a less efficient process.
Our approach, instead, aims to combine artificial evolution and reinforcement-learning methods to take the best of both worlds. We propose a Baldwinian-evolutionary approach to simultaneously optimize the structure of the tree and the state-action function. Baldwinian evolution is an evolutionary theory according to which the knowledge acquired by an individual during its lifetime may be an evolutionary advantage that modifies the fitness landscape.
We do so by using an evolutionary algorithm to evolve the structure of the decision tree, while using Q-learning to learn the state-action function. This way, we search for trees that decompose the state-space in such a manner that, when taking optimal actions, maximize the reward of the agent.
The evolutionary algorithm we use is Grammatical Evolution (GE) [10].
This evolutionary algorithm evolves programs by means of a (context-free) grammar expressed in Backus-Naur Form. Figure 2 shows a block diagram that clarifies the inner working of the proposed algorithm. The blue-colored parts are the processes inherent to the evolutionary part of our algorithm, while the red-colored ones are the processes inherent to the reinforcement-learning part.

Evolutionary algorithm
To evolve decision trees, we evolve genotypes that are mapped to trees through an associated grammar, similarly to the approach described in [10]. In this subsection we describe our algorithm design, highlighting the differences with respect to the original Grammatical Evolution.

Figure 2: A scheme of the inner working of the proposed algorithm. The blue blocks are the ones that derive from the evolutionary part of our algorithm, while the red blocks are the ones that derive from the Q-learning part.

Individual encoding
The genotype of an individual is encoded as a list of codons (represented as integers). However, differently from [10], the genotype has fixed length.

Mutation operator
Instead of the mutation operator described in [10], we use a classical uniform mutation. This operator mutates each gene according to a given probability, drawing the new value of the gene uniformly from the range of variation of the variable.
However, since the grammar may have a different number of productions for each rule, we uniformly draw a random number between 0 and a number larger than the maximum number of productions in the grammar. Then, by using the modulo operator, we choose the production from the production rule.
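The two operations just described can be sketched as follows. This is a minimal illustration, not the exact implementation; the function names, the codon range and the list-based grammar representation are our assumptions:

```python
import random

def uniform_mutation(genotype, gene_prob, codon_range):
    """Mutate each codon independently with probability `gene_prob`.

    The new value is drawn uniformly from [0, codon_range); since the
    production choice later uses the modulo operator, `codon_range`
    only needs to exceed the largest number of productions per rule.
    """
    return [
        random.randrange(codon_range) if random.random() < gene_prob else c
        for c in genotype
    ]

def choose_production(codon, productions):
    """Pick one production of a rule from a codon via the modulo operator."""
    return productions[codon % len(productions)]
```

For instance, with a rule that has 3 productions, the codon 7 selects production 7 % 3 = 1, regardless of how large the codon range is.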

Crossover operator
As a crossover operator, we use the standard one-point crossover operator.
This operator sets a random cutting point and creates two individuals by mixing the two sub-strings of the genotypes. Note that we do not prune the individuals that have genes that are not expressed in the phenotype.
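A minimal sketch of this operator on fixed-length genotypes (the function name is ours; genotypes of length greater than one are assumed):

```python
import random

def one_point_crossover(parent_a, parent_b):
    """Standard one-point crossover on two fixed-length genotypes.

    A single cut point is drawn and the tails are swapped. Genes that
    are never expressed in the phenotype are kept (no pruning).
    """
    assert len(parent_a) == len(parent_b) > 1
    cut = random.randrange(1, len(parent_a))  # cut strictly inside the genotype
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b
```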

Replacement of the individuals
Instead of replacing all the individuals with their offspring (i.e., the copies that undergo mutation/crossover), we replace a parent only when the fitness of its offspring is better than the fitness of the parent. Moreover, when an offspring has two parents (i.e., it is a product of crossover), it replaces the parent with the lowest fitness. In case both offspring have better fitness than only one of the parents, the best one replaces the worst parent.
This mechanism allows us to preserve diversity among the individuals and, at the same time, makes the fitness trend monotonically non-decreasing.
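The replacement rule can be sketched as follows (a simplified illustration assuming fitness maximization; the function name and the greedy pairing of best offspring against worst parent are our reading of the rule above):

```python
def replace_parents(parents, offspring, fitness):
    """Replace a parent only if an offspring is strictly better.

    Offspring are considered from best to worst; each one competes
    against the currently worst individual in the pool, so the best
    offspring replaces the worst parent first.
    """
    pool = list(parents)
    for child in sorted(offspring, key=fitness, reverse=True):
        worst = min(range(len(pool)), key=lambda i: fitness(pool[i]))
        if fitness(child) > fitness(pool[worst]):
            pool[worst] = child
    return pool
```

For example, with parents of fitness (1, 5) and offspring of fitness (3, 7), only the offspring with fitness 7 enters the pool, replacing the parent with fitness 1.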

Fitness evaluation
The fitness evaluation process consists of the following steps. First, the genotype is translated into the corresponding phenotype. Then, at each timestep, the policy encoded by the tree is executed and the reward signals obtained from the environment are used to update the Q-values of the leaves.
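The per-leaf learning during fitness evaluation can be sketched as follows: the current observation is routed to a leaf, the leaf picks an ε-greedy action, and its Q-values are updated with the reward plus the best Q-value of the next visited leaf. This is a minimal illustration, not the exact implementation; the class and the default parameter values are our assumptions:

```python
import random

class Leaf:
    """A leaf of the decision tree: it stores one Q-value per action.

    Q-values are randomly initialized and then updated online by
    Q-learning while the evaluation episodes are running.
    """
    def __init__(self, n_actions, epsilon=0.05, alpha=0.1, gamma=0.9):
        self.q = [random.uniform(-1.0, 1.0) for _ in range(n_actions)]
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma

    def act(self):
        # epsilon-greedy action selection over the leaf's Q-values
        if random.random() < self.epsilon:
            return random.randrange(len(self.q))
        return max(range(len(self.q)), key=self.q.__getitem__)

    def update(self, action, reward, next_leaf):
        # One-step Q-learning update using the best value of the next leaf.
        target = reward + self.gamma * max(next_leaf.q)
        self.q[action] += self.alpha * (target - self.q[action])
```

The fitness of the tree is then the mean episode return collected while the leaves learn in this way.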

Results
To test our approach, we evaluate it in the following OpenAI Gym [19] environments:
• CartPole-v1
• MountainCar-v0
• LunarLander-v2
In this section, we show the results obtained and compare them to the state-of-the-art using two metrics: the score given by the simulator and a metric of complexity (from the interpretability point of view).
We adopted the interpretability metric proposed in [20]:

φ = 79.1 − 0.2·l − 0.5·n_o − 3.4·n_nao − 4.5·n_naoc

where:
• l is the size of the formula (i.e., the sum of constants, variables and operations)
• n_o is the number of operations
• n_nao is the number of non-arithmetical operations
• n_naoc is the number of consecutive compositions of non-arithmetical operations

However, this metric was meant to be bounded between 0 and 100, so we modified it in order to make it work as a complexity. For this purpose, the metric used is the following:

M = −0.2 + 0.2·l + 0.5·n_o + 3.4·n_nao + 4.5·n_naoc

The changes we made yield the following properties:
• By changing the sign of all the terms, we obtain that a model with a higher complexity is harder to interpret
• We replaced the constant with −0.2, so that when the model is a single constant (the best case from the point of view of interpretability) its complexity becomes 0

Furthermore, it is important to note that this metric is in line with what Lipton states in [21]. In fact, we can easily note that huge decision trees will be as interpretable as black-box methods, because the terms l, n_o, n_nao and n_naoc will have a high magnitude. Also, M seems to be (loosely) in line with what the authors say in [22]. In fact, by using the number of operations (although there are other variables in the metric we use), we loosely capture the computational complexity of the model that we are executing.
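Assuming the modified metric takes the form M = −0.2 + 0.2·l + 0.5·n_o + 3.4·n_nao + 4.5·n_naoc (i.e., the sign-flipped version of the formula from [20], with the constant chosen so that a bare constant scores 0), the complexity can be computed as follows; the function and argument names are illustrative:

```python
def complexity(l, n_o, n_nao, n_naoc):
    """Complexity M of a model (sketch of the modified metric).

    l:      size of the formula (constants + variables + operations)
    n_o:    number of operations
    n_nao:  number of non-arithmetical operations
    n_naoc: number of consecutive compositions of non-arithmetical operations

    The constant -0.2 makes a single constant (l = 1, no operations)
    have complexity exactly 0; every extra term increases M.
    """
    return -0.2 + 0.2 * l + 0.5 * n_o + 3.4 * n_nao + 4.5 * n_naoc
```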
To assess the statistical repeatability of our experiments, we perform 10 independent runs for each setting. For each run, as required by [19], we test the best model for each run over 100 independent episodes to assess its performance.
By "testing", we mean that the policy is executed in 100 unseen episodes.

Simplification mechanism
To make our solutions even more interpretable, we introduce a simplification mechanism that is executed on the final solutions. The simplification mechanism is the following. First of all, we execute the given policy for 100 episodes in a validation environment. Here, we keep a counter for each node of the tree that is increased each time the node is visited. Then, once this phase is finished, we remove all the nodes that have not been visited. Finally, we iteratively search for nodes in the tree whose leaves correspond to the same action. Each time such a node is found, it is replaced with a leaf that contains the action common to the two leaves. The iteration stops when the tree does not contain nodes of this type.
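The two pruning steps above can be sketched as a single recursive pass over the tree. This is an illustrative implementation under our own tree representation (a `Node` with `left`, `right`, an optional leaf `action`, and a visit counter filled during the validation episodes):

```python
class Node:
    def __init__(self, left=None, right=None, action=None):
        self.left, self.right, self.action = left, right, action
        self.visits = 0  # incremented while running the validation episodes

    def is_leaf(self):
        return self.action is not None

def simplify(node):
    """Simplify a tree after the validation episodes (sketch).

    1) a split with an unvisited child is replaced by the other child;
    2) a split whose two children are leaves with the same action is
       collapsed into a single leaf with that action.
    The bottom-up recursion repeats the merges until no such node remains.
    """
    if node.is_leaf():
        return node
    node.left, node.right = simplify(node.left), simplify(node.right)
    if node.left.visits == 0:      # branch never taken -> drop it
        return node.right
    if node.right.visits == 0:
        return node.left
    if (node.left.is_leaf() and node.right.is_leaf()
            and node.left.action == node.right.action):
        return node.left           # both leaves agree -> collapse the split
    return node
```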

Description of the environments
In this subsection we will describe the environments used and their properties.

CartPole-v1
In this task the agent has to balance a pole that stands on top of a cart by moving the cart either to the left or to the right.
Observation space. The state of the environment is composed of the following features:
• The position of the cart (x)
• The velocity of the cart
• The angle of the pole (θ)
• The angular velocity of the pole
Action space. The agent can perform one of the following actions:
• Push the cart to the left by applying a force of 10N (move_left)
• Push the cart to the right by applying a force of 10N (move_right)
Rewards. The agent receives a reward of +1 for each timestep.
Termination criterion. The simulation terminates if: • The cart position lies outside the bounds for the x variable • The angle of the pole lies outside the bounds for the θ variable Resolution criterion. This task is considered as solved if the agent receives a mean total reward R ≥ 475 on 100 runs.

MountainCar-v0
In this environment the agent has to drive a car, which is initially in a valley, up on a hill. However, the engine of the car is not powerful enough so the agent has to learn how to build momentum by exploiting the two hills.
Observation space. The state of the environment consists of the following variables:
• The position of the car along the x-axis
• The velocity of the car

LunarLander-v2
In this task the agent has to land a lander on a landing pad.
Observation space. The state of the environment consists of 8 variables:
• The coordinates of the lander (x, y)
• The linear velocities of the lander
• The angle of the lander
• The angular velocity of the lander
• Two boolean flags indicating whether each leg is in contact with the ground

Table 1: Grammar used to evolve orthogonal decision trees in the CartPole-v1 environment. The symbol "|" denotes the possibility to choose between different symbols. "comp op" is a short version of "comparison operator" and "lt" and "gt" are respectively the "less than" and "greater than" operators. input var represents one of the possible inputs in the given environment. Note that each input variable has a separate set of constants.

Table 2: Grammar used to evolve oblique decision trees in the CartPole-v1 environment. The symbol "|" denotes the possibility to choose between different symbols. "lt" refers to the "less than" operator.

Parameter | Value
Population size | 200
Generations | 100
Genotype length | 1024
Crossover probability | 0
Mutation probability | 1
Mutation type | Uniform, with gene probability = 0.1

The number of episodes used for Q-learning is quite low. This is because, since this is a "simple" environment, we want to lower the computational cost of the search by exploiting the randomness used to initialize the state-action function. This means that, in this case, Q-learning is used to "fine-tune" the state-action function instead of learning it from scratch.

Results
The results are shown in Tables 5 and 6. These tables show an interesting result. In fact, while the orthogonal grammar is able to solve the task in 100% of the cases, the test score was optimal (500 ± 0) only in 40% of the cases. On the other hand, the oblique grammar solves the task in 90% of the cases, but achieves the optimal score in 80% of the runs. This suggests that, while the oblique grammar makes the search space more complex, it usually leads to more stable (in the sense of Lyapunov stability) solutions. In Figures 3 and 4 we show the mean best fitness for each generation, averaged across all the runs. We observe that, while both settings are able to converge towards the optimal fitness, the oblique grammar converges more quickly than the orthogonal one.
In Figure 5, a comparison of the solutions obtained by using the orthogonal and oblique grammars is shown. We can easily observe that, usually, the solutions obtained with the oblique grammar have a lower M than the ones obtained by using the orthogonal grammar. This is due to the fact that most of the produced oblique trees use only one split, resulting in a lower number of non-arithmetical operations.
To better assess the hypothesis made earlier, i.e., that oblique trees are more stable than orthogonal ones, in Tables 7 and 8 we compare all the trees produced by the two grammars on a modified environment that has a maximum duration of 10^4 timesteps instead of 500. These results confirm our hypothesis, showing that all the oblique trees obtain significantly better scores, often achieving a perfect score (i.e., 10^4 ± 0) also in this setting. Moreover, we also tested the robustness of the produced agents with respect to noise on the inputs received from the sensors. In Figure 9 we show how the performance of the two best agents varies with respect to additive input noise (distributed as N(0, σ²)). The orthogonal tree was robust to noise with σ in the order of twice the sampling step used for the constants. On the other hand, the oblique tree proved to be significantly more robust, being able to cope with noise whose σ is about 50 times bigger than the sampling step used for the constants. Finally, in Table 9 and the corresponding figure we compare our solutions with the state-of-the-art.
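The robustness evaluation described above can be sketched as follows: the policy is run for several episodes while independent Gaussian noise N(0, σ²) is added to every observation component. The function name and the Gym-like `reset()`/`step()` interface (returning observation, reward, done) are simplifying assumptions:

```python
import random

def noisy_score(policy, env, sigma, episodes=100):
    """Mean episode return of `policy` under additive input noise.

    `policy` maps an observation (list of floats) to an action; `env`
    is a placeholder environment exposing reset() and step(action) ->
    (obs, reward, done).
    """
    totals = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            # Perturb every input independently with N(0, sigma^2) noise.
            noisy = [x + random.gauss(0.0, sigma) for x in obs]
            obs, reward, done = env.step(policy(noisy))
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)
```

Sweeping σ and plotting the resulting mean score yields curves like the ones discussed for Figure 9.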

Method | Score | M
Deep Q Network [23] | 327.30 | 1157.20
Tree-Backup(λ) [23] | 494.70 | 1157.20
Importance-Sampling [23] | 498.70 | 1157.20
Qπ [23] | 489.90 | 1157.20
Retrace(λ) [23] | 461 | 1157.20

Table 9: Comparison of the solutions obtained by using the proposed approach with respect to the state-of-the-art. The results from [23] are averaged over ten independent runs. The results from [3] regard the discretized tree shown in Figure 23. The tree has been simplified by using the technique used in our work.

In the MountainCar-v0 environment, we normalized the inputs. This is because the ranges of variation of the two variables are quite different.
Moreover, a preliminary experimental phase confirmed that it was hard to obtain good results by not normalizing the inputs.
The parameters used for the Grammatical Evolution are shown in Tables 12 and 13. The settings for the Q-learning algorithm are shown in Tables 14 and 15. Also in this case, we set the number of episodes to 10 to exploit the randomness of the initialization, since this is considered a "simple" task.

Results
The results obtained by the best solution for each run are shown in Tables 16 and 17. In Table 17 some values are reported in parenthesis. This is because, given the difference in performance between training and testing scores, we proceeded with further investigation of the results. We deduced that in the latest steps of the training of such agents a change happened in the Q-values that degraded their test performance; the values in parenthesis report the scores measured before this change.

Table 10: Grammar used to evolve orthogonal decision trees in the MountainCar-v0 environment. The symbol "|" denotes the possibility to choose between different symbols. "comp op" is a short version of "comparison operator" and "lt" and "gt" are respectively the "less than" and "greater than" operators. input var represents one of the possible inputs in the given environment. Note that each input variable has a separate set of constants.

Moreover, this change has only been taken into account in this table. For the remainder of this work, we will assume that their test score is −200.

As we can see from Table 16, the solutions obtained by using the orthogonal grammar solve the task in 70% of the cases. On the other hand, as we can see from Table 17, oblique trees perform poorly on this problem. This suggests that this problem is harder to solve by using oblique trees than orthogonal ones.
While this may seem counter-intuitive, since oblique trees are a generalization of orthogonal trees, it may be because our grammar (the one used to produce oblique trees) makes it difficult to obtain an orthogonal decision tree.
The fitness trends for the best individual, averaged over all runs, are shown in Figures 13 and 14 for the orthogonal and oblique cases, respectively.
To compare the two approaches, we compare the robustness to input noise of both versions. The result is shown in Figure 15. In this case, both approaches proved to be not very robust to noise. Surprisingly, we can observe that the orthogonal tree was not even robust to input noise with σ < min_i step_i, where step_i is the sampling step for the constants of the i-th variable.
Finally, we perform a comparison of our solutions w.r.t. the state-of-the-art.
In Table 18 and Figure 16 we report this comparison.

In the LunarLander-v2 environment, we were not able to find a configuration that gave satisfying results with orthogonal trees. For this reason, in this case we will show only the results obtained by using an oblique grammar.
The grammar used for this task is shown in Table 19, while the parameters used for the grammatical evolution and the Q-learning are shown in Tables 20 and 21, respectively.
In this case, as shown in Table 21, we significantly increased the number of episodes used for the training. This is due to the following reasons:
• The LunarLander-v2 environment is not as easy to solve as the previous environments.
• In this case, we did not use a random initialization of the leaves, so Q-learning alone has to learn the state-action function.
Moreover, as shown in Table 21, we used a slightly different Q-learning approach for this environment. In fact, in this case we use a decay for the ε parameter, in order to better explore the search space. The decay works as follows: at the k-th visit to a leaf, ε = ε_0 · decay^k is used. The learning rate has been set to 1/k, where k is the number of visits to the state-action pair. This guarantees that the state-action function converges to the optimum as k → ∞. Finally, to save computation time, we implemented an early stopping criterion: if the mean score over the current period is smaller than the one obtained in the previous period, the training is stopped. This is based on the following assumption: if the current mean score is worse than the previous one, we can assume that the state-action function is converging to its optimum, so the small oscillations due to randomness made it worse than the previous mean.
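The ε decay and the 1/k learning-rate schedule can be sketched as follows (an illustrative implementation; the dictionary-based bookkeeping and the function names are ours):

```python
def epsilon(eps0, decay, k):
    """Exploration rate at the k-th visit to a leaf: eps0 * decay**k."""
    return eps0 * decay ** k

def q_update(q, visits, key, reward, next_q_max, gamma=0.99):
    """One Q-learning step with learning rate 1/k.

    `key` identifies a state-action pair (here, a (leaf, action) pair)
    and k counts its visits. With alpha = 1/k the stored value is the
    running average of the observed targets, which satisfies the usual
    stochastic-approximation conditions for convergence as k -> inf.
    """
    visits[key] = visits.get(key, 0) + 1
    alpha = 1.0 / visits[key]
    target = reward + gamma * next_q_max
    q[key] = q.get(key, 0.0) + alpha * (target - q.get(key, 0.0))
    return q[key]
```

With γ = 0 and targets 1.0 then 3.0 on the same pair, the stored value moves from 1.0 to 2.0, i.e., the running average of the targets.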

Results
The results obtained in this environment are summarized in Table 22 and plotted in Figure 19. We can easily observe that there is a local Pareto front between score and complexity. In this case, our approach is able to solve the task in 100% of the cases. A comparison of our two best solutions (w.r.t. score and interpretability) and the state-of-the-art is shown in Table 23 and Figure 21. As we can observe, even though we do not achieve (in absolute terms) the best performance and the best M, our solutions represent the best compromise between the two metrics.
Moreover, in Figure 21 we can observe that a Pareto front that explains the trade-off between interpretability and performance seems to exist. However, our best solution achieves a performance comparable to the best score of the state-of-the-art, while having a substantially smaller complexity. Our best solution is shown in Figure 22.

Discussion
In this section we briefly describe the interpretable techniques proposed in literature and we discuss our results in comparison to them. Then, we will perform an ablation study and, finally, we will interpret the produced trees.

CartPole

Differentiable Decision Trees
Silva et al. [3] propose an approach based on differentiable decision trees, i.e. decision trees that replace hard splits with sigmoids. This means that they refactor the conditions from variable > constant to σ(variable − constant). By replacing hard splits with sigmoids, the decision of the tree can be seen as the sum of all the leaves weighted by the product of the outputs of the sigmoids for that path (i.e. the product of all the σ(variable−threshold) for the true branch and (1−σ(variable−threshold)) for the false branch for each split encountered).
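The soft evaluation described above can be sketched as follows; this is an illustration of the idea, not the implementation of [3], and the dictionary-based tree representation is our assumption:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_tree_output(node, x):
    """Evaluate a differentiable decision tree on observation x.

    Inner nodes are dicts {"var": i, "thr": t, "true": ..., "false": ...};
    leaves are lists of per-action weights. Each leaf contributes its
    values weighted by the product of sigmoid(x[var] - thr) along its
    "true" edges and 1 - sigmoid(...) along its "false" edges.
    """
    if isinstance(node, list):          # leaf: vector of action weights
        return node
    s = sigmoid(x[node["var"]] - node["thr"])
    hi = soft_tree_output(node["true"], x)
    lo = soft_tree_output(node["false"], x)
    return [s * a + (1.0 - s) * b for a, b in zip(hi, lo)]
```

At the threshold the sigmoid outputs 0.5, so both branches contribute equally; far from it, the soft tree behaves like a hard decision tree.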
They optimize the splits of the tree and the actions taken by using PPO [30] and backpropagation. The solution proposed for the CartPole-v1 environment is the decision tree shown in Figure 23. It is interesting to observe that their optimization process "selects" the same variables that have been selected in our case by artificial evolution.
Moreover, the tree they propose is slightly more complex than our best tree. In fact, while our best tree has a maximum depth of 2 conditions, theirs has a maximum depth of 3 conditions. This increase in complexity is reflected in the difference in the M measure.
Moreover, since the performance of this solution is not satisfactory, the authors provided us, in a personal communication, with the solution coming from a follow-up work. This solution is shown in Figure 24. Besides the performance comparison performed in Table 9, we compare here the robustness to noise, similarly to what we did in Figure 9.
As we can see in Figure 25, the orthogonal trees obtained by Silva et al. have a robustness that is comparable to that of our orthogonal tree. This suggests that orthogonal trees may be intrinsically less robust than oblique ones.

MountainCar
Most of the approaches we used for the comparison in the MountainCar-v0 environment come from the OpenAI Gym leaderboard.

Zhiqing Xiao
The system proposed by this entry 19 consists of a closed-form policy. However, it is not clear whether the policy has been derived by a human or learned by a machine.
Anyway, this solution achieves the best performance (excluding our solutions) on this task while also having the best degree of interpretability (according to our modification of the metric proposed in [20]). The policy is the following:

While M is lower for this policy than for our best tree, it may be a bit harder to interpret this model. We think that this is due to the fact that the M metric has been proposed to evaluate the interpretability of mathematical formulae, while we are interested in interpreting hyperplanes. While hyperplanes are defined by mathematical formulae, the interpretability of a hyperplane may also depend on the number of non-linear operations that are used to determine it.

Amit
This entry 20 uses SARSA to solve the task.
While tabular approaches like SARSA and Q-learning are transparent, their interpretability depends heavily on the number of states and actions. Table 18 shows that, even though this approach is transparent and easily interpretable, our solutions achieve a better degree of interpretability. In our opinion, this is due to the fact that using decision trees as function approximators "groups" some of the states of the table used in tabular approaches. This is especially useful when we want to extract knowledge: by grouping states, we take into account only the variables and the thresholds that have a big impact on the policy, discarding irrelevant details.

Dhebar et al.
Dhebar et al., in [4], propose an approach to reinforcement learning that uses nonlinear decision trees. They first approximate an oracle policy and then they fine-tune it by using evolutionary algorithms. The policies obtained in these two phases are called "open-loop" and "closed-loop" policies.
In this case, we only had access to the open-loop policy for the MountainCar-v0 environment, which is shown in Figure 26. Also in this case, while M for this solution is better than that of our best solution (w.r.t. test score), it seems harder to interpret, due to the non-linearity of the hyperplanes. In fact, in our solution M is higher due to the higher number of splits in the tree, but the metric does not take into account the fact that, in our case, the hyperplanes that divide the feature space are simpler than the ones proposed in [4].

20 github.com/amitkvikram/rl-agent/blob/master/mountainCar-v0-sarsa.ipynb
Finally, we compare the robustness to input noise of our solutions with the one provided by "Zhiqing Xiao" and the one provided by Dhebar et al. In [3] the authors, besides regular trees, also use decision lists. A decision list is an extremely unbalanced tree, i.e., a tree that collapses to a list. Figure 28 shows the solution obtained. However, as shown in Table 23, it does not achieve satisfactory performance. This is because, while the differentiable tree achieves better performance (even though it does not solve the task), its discretization modifies the final distribution of the actions.
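As an illustration, a decision list reads as a single if/elif chain; the conditions and actions below are placeholders, not the ones learned in [3]:

```python
# A decision list is a maximally unbalanced decision tree: every
# internal node has a leaf as one of its children, so the policy
# collapses to an if/elif chain. Conditions and actions here are
# purely illustrative.

def decision_list_policy(x, v):
    if x < -0.9:
        return 2      # leaf 1: accelerate right
    elif v < 0.0:
        return 0      # leaf 2: accelerate left
    elif x > 0.3:
        return 2      # leaf 3: accelerate right
    else:
        return 1      # default leaf: no acceleration
```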

Malagon et al.
In [29] the authors use the Univariate Marginal Distribution Algorithm to evolve a neural network without hidden layers in the LunarLander-v2 domain. Since the neural network has no hidden layers, the whole system reduces to a linear mapping from the observations to the output neurons, where i refers to the output neurons and the chosen action corresponds to the neuron with the highest activation.
This results in an easy-to-interpret system, according to both [21] and the metric M.
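A no-hidden-layer policy of this kind can be sketched as follows (the weights below are illustrative placeholders, not those of [29]):

```python
# Sketch of a policy network with no hidden layers: the chosen action
# is argmax_i (w_i . s + b_i), one weight vector per output neuron.
# The weights are placeholders for illustration only.

def linear_policy(weights, biases, obs):
    scores = [sum(w * o for w, o in zip(row, obs)) + b
              for row, b in zip(weights, biases)]
    return scores.index(max(scores))  # index of the winning neuron

# Toy example: 2 actions, 2 observation variables.
W = [[1.0, 0.0],
     [0.0, 1.0]]
b = [0.0, 0.0]
```

Every action is thus justified by a single weighted sum of the observation variables, which is what makes the system easy to interpret.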

Dhebar et al.
In [4] the authors propose a nonlinear decision tree that achieves a mean testing score of 234.98 points. However, the rules associated with this tree are not shown, so we only had access to the 3-level-deep NLDT.
The tree is shown in Figure 29. It is important to note that, even if the solution obtained is a tree, its interpretation is not easy, since the hyperplanes contained in each split are not linear.
Also in this case, we performed a comparison on the robustness to input noise; the result is shown in Figure 30. However, for this comparison we could not include the results from Malagon et al., since the weights were not publicly accessible.

Ablation study
In order to assess whether our two-level optimization approach is advantageous with respect to a single-level optimization approach, we perform an ablation study in which we use Grammatical Evolution alone to evolve decision trees.
Moreover, we perform statistical tests to assess whether the differences are statistically significant, fixing a threshold for the p-value of α = 0.05.
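For reference, the U statistic underlying these tests counts, over all pairs, how often a value from one sample exceeds a value from the other (ties count 0.5); a minimal sketch:

```python
# Minimal Mann-Whitney U statistic. The p-values reported in the paper
# would additionally require the null distribution of U, e.g. via
# scipy.stats.mannwhitneyu(xs, ys, alternative="two-sided").

def mann_whitney_u(xs, ys):
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

The test is distribution-free, which makes it suitable for comparing episode scores that are far from normally distributed.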

CartPole
Orthogonal trees. To evolve orthogonal trees, we used the grammar shown in Table 24, with the same parameters described in Table 3. Also in this case, the fitness was computed as the mean score over 10 episodes.
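The fitness computation can be sketched as follows; a stub environment with a simplified gym-like reset/step interface stands in for CartPole-v1 (an illustrative assumption, not the actual evaluation code):

```python
# Sketch of the fitness evaluation: the mean episode return over 10
# episodes. StubEnv is a toy stand-in (CartPole-like: reward 1 per
# step, episodes of fixed length).

class StubEnv:
    def reset(self):
        self.t = 0
        return (0.0, 0.0, 0.0, 0.0)  # observation

    def step(self, action):
        self.t += 1
        done = self.t >= 5           # toy episodes last 5 steps
        return (0.0, 0.0, 0.0, 0.0), 1.0, done

def fitness(policy, env, episodes=10):
    total = 0.0
    for _ in range(episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(policy(obs))
            ep_return += reward
        total += ep_return
    return total / episodes
```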
The results are shown in Table 25. As we can observe, while in most cases the evolution is able to produce agents that achieve a perfect training score, these agents have poor generalization capabilities. In our opinion, this is akin to overfitting.
In fact, in this case, the agents did not learn the "value" of reaching a certain state, but just learned a rule that worked in the tested cases. Moreover, a two-tailed Mann-Whitney U-Test gives us a p-value of 9 · 10^-3, which allows us to reject the null hypothesis (i.e., that the mean testing scores come from the same distribution) with threshold α = 0.05.
Oblique trees. We perform the same test also in the oblique setting, using the grammar shown in Table 26 and the parameters shown in Table 4.
The results are shown in Table 27. We can easily observe that, in this case, the results are similar to those shown in Table 6. To check whether this similarity is statistically significant, we perform a two-tailed Mann-Whitney U-Test, whose null hypothesis states that the results obtained by using GE with Q-learning and GE alone come from the same distribution. The p-value obtained with this test is 0.73, so we are not able to reject the null hypothesis with threshold α = 0.05. For this reason, we assume that they come from the same distribution.
This suggests that, since oblique trees seem to be both more robust to noise and more stable than orthogonal trees, an agent can learn good policies in simple environments without the need for Q-learning.

Table 24: grammar used to evolve orthogonal decision trees (rules of the form ⟨dt⟩ → if ⟨condition⟩ then ⟨action⟩ else ⟨action⟩ and ⟨condition⟩ → input_var ⟨comp_op⟩ ⟨const⟩, plus productions for input_var and ⟨action⟩).
Table 25: Results obtained by evolving orthogonal decision trees for the CartPole-v1 environment by using Grammatical Evolution alone.

MountainCar
Orthogonal trees. We evolve orthogonal trees for the MountainCar-v0 environment by using the grammar shown in Table 28 and the parameters shown in Table 10. Since in this case the number of episodes is low and the environment is harder to explore than CartPole, we expect GE alone to perform comparably to our approach.
The results are shown in Table 29. As expected, the performances are quite similar. To ensure that there are no statistically significant differences between the two approaches, we performed a two-tailed Mann-Whitney U-Test on the mean testing scores obtained by the two approaches, which confirmed that the null hypothesis (i.e., that the scores come from the same distribution) cannot be rejected with threshold α = 0.05.
Oblique trees. We also perform the test using oblique trees, with the grammar described in Table 30 and the parameters shown in Table 13.
The results are shown in Table 31. While these results seem better than those shown in Table 17, the differences are not statistically significant according to a two-tailed Mann-Whitney U-Test (p-value 0.38). Thus, this seems to confirm our hypothesis that solving MountainCar-v0 with oblique trees is harder than with orthogonal trees (with the proposed grammar).

LunarLander
Finally, we perform the same test on LunarLander-v2, using only oblique trees, with the grammar described in Table 32. Since this environment is more complex than the previous two, we expect GE alone to perform worse than our approach.
The results of this experiment are shown in Table 33. In line with our expectations, the task is solved in only 60% of the cases. Moreover, we perform a two-tailed Mann-Whitney U-Test to assess the statistical significance of the differences between the two approaches (on the mean testing score). We obtain a p-value of 0.017, which allows us to reject the null hypothesis. We can thus hypothesize that the use of the two-level optimization technique gives a performance boost in complex environments such as LunarLander-v2.

Interpretation of the solutions
In this subsection, we look at the agents produced and try to interpret their policies.

CartPole
Orthogonal tree. The tree shown in Figure 11 is extremely easy to interpret. In fact, this agent moves the cart to the left if ω < 0.074 ∧ θ < 0.022 (1) and otherwise moves the cart to the right. Note that there is a case in which the pole is falling to the right but the agent moves the cart to the left: θ ∈ [0, 0.022) rad ∧ ω ∈ [0, 0.074) rad/s. This is not a problem, because this action increases the angular velocity of the pole, resulting in a "move right" action in the subsequent steps.
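This policy translates directly into code (thresholds taken from condition (1) above; actions follow the CartPole-v1 encoding, 0 = push left, 1 = push right):

```python
# The orthogonal CartPole policy, written as code. Thresholds are the
# ones reported in the text.

def cartpole_orthogonal_policy(theta, omega):
    # Move left only when both the pole angle and the angular
    # velocity are below their thresholds.
    if omega < 0.074 and theta < 0.022:
        return 0  # push left
    return 1      # push right
```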
Oblique tree. In this case, the interpretation of the policy is a bit harder. The condition used by the agent to discriminate between the two states is: where k refers to the current timestep. To simplify the process, we rewrite Equation 1 as follows: First of all, we want to analyze the role of the constant t in the policy, which can be obtained as follows: where n is the index of the last timestep. For simplicity, let us assume that x_n = −t/a. This means that we can rewrite Equation 3 as follows: We can then perform further steps and obtain: Then, by noting that we can rewrite Equation 8 as: where Now, by observing that we obtain Finally, noting that, usually, in the first 50 timesteps of a simulation the velocities are high (max_k |v_k| < 1.5) and that afterwards the velocities become small (max_k |v_k| < 0.55) because the pole is balanced, we can write: |Σ_{j=k+1}^{n} g_j v_j| ≤ (b/τ) · 1.5 + 49 · 1.5 + 450 · 0.55 = 420 (15), where the equality approximately holds in the worst case (i.e., k = 0 and all the velocities have the same sign). However, considering that in our observations the magnitude of the velocities was usually significantly smaller than the maximum, and that the summation is multiplied by τ² (τ = 0.02 in this environment), we can safely consider only the term with the highest magnitude, i.e., (b/τ) v_k. Moreover, using only v_k sets x_n ≈ 0, which makes the system easier to understand intuitively.
Then, we obtain Approximating the constants, we set b = 0.543 ≈ 0.5, c = 0.904 ≈ 1, d = 0.559 ≈ 0.5, so the final policy is to move right when the approximated condition holds and to move left otherwise. A dimensionally consistent policy is θ_k + ½(v_k/l + ω) n_ts τ > 0, where l = 1 is the pole length and n_ts is the number of timesteps that we take into consideration to balance the pole (in our case n_ts = 1). This policy can be interpreted as follows: if the sum of the current angle and the mean angle variation given by the two contributions (i.e., the linear velocity of the cart and the angular velocity of the pole) is positive (a kind of "prediction" of the future angle), then move the cart to the right, because the pole is going to fall to the right; otherwise, move the cart to the left.
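A sketch of this simplified oblique policy (constants from the text: l = 1, τ = 0.02, n_ts = 1):

```python
# Dimensionally consistent CartPole policy: push right when
# theta + 0.5 * (v / l + omega) * n_ts * tau > 0, i.e. when a
# one-step "prediction" of the pole angle is positive.

TAU, POLE_LENGTH, N_TS = 0.02, 1.0, 1

def cartpole_oblique_policy(theta, v, omega):
    predicted_angle = theta + 0.5 * (v / POLE_LENGTH + omega) * N_TS * TAU
    return 1 if predicted_angle > 0 else 0  # 1 = push right, 0 = push left
```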

MountainCar
Orthogonal tree. Also in this case, the orthogonal tree (Figure 17) is easy to interpret. In fact, looking at the leaves, we see that the agent accelerates to the left only in two cases:
• when it is going towards the hill on the left to build momentum and it is far from the border (x > −0.9), so that it maximizes the potential energy of the car;
• when the velocity is positive but not large enough (v < 0.035) and the car is near the valley.
In all the other cases, the agent accelerates to the right.
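A hedged sketch of this policy: the thresholds x > −0.9 and v < 0.035 come from the text, while the "near the valley" test (x < −0.4) is an illustrative assumption, as the exact split structure is in Figure 17 (actions follow MountainCar-v0: 0 = accelerate left, 2 = accelerate right):

```python
# Paraphrase of the orthogonal MountainCar policy described above.
# The x < -0.4 check is an assumed stand-in for "near the valley".

def mountaincar_orthogonal_policy(x, v):
    if v < 0 and x > -0.9:
        return 0  # build momentum on the left hill, away from the border
    if 0 <= v < 0.035 and x < -0.4:
        return 0  # velocity positive but too low near the valley
    return 2      # in all the other cases, accelerate right
```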
Oblique trees. In this case, the agent accelerates to the left when both conditions are false, which means that we have to solve the following system of two inequalities: This means that the agent accelerates to the left when v ≤ 7.5799 · 10^-2 · x + 6.6955 ∧ v ≤ 1.1516 · 10^-2 · x + 5.495 · 10^-3. This corresponds to the decision regions shown in Figure 31.
It is important to note that the lack of robustness of this solution does not allow us to further approximate the constants of the two hyperplanes.
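For illustration, the two half-plane checks can be written as follows (coefficients as printed above; the action taken when either inequality fails is assumed to be "accelerate right"):

```python
# The oblique MountainCar policy reduces to two half-plane checks.

A1, B1 = 7.5799e-2, 6.6955      # first hyperplane (as printed)
A2, B2 = 1.1516e-2, 5.495e-3    # second hyperplane

def mountaincar_oblique_policy(x, v):
    if v <= A1 * x + B1 and v <= A2 * x + B2:
        return 0  # accelerate left
    return 2      # otherwise, assumed to accelerate right
```

Since MountainCar-v0 keeps |v| ≤ 0.07, the second hyperplane is the one that actually shapes the decision regions.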

LunarLander
In this case, since the oblique tree (Figure 22) has 4 conditions and 8 unknowns, it is a bit harder to interpret.
First condition. This condition, when it evaluates to False, turns on the right engine for a timestep. So, we turn on the right engine when the following holds: where a, b, ..., h replace the constants shown in Figure 22.
To simplify the analysis, let us assume c_l = c_r = 0, since they can assume only two values: 0 and 1. This simplification does not affect the generality of our analysis, since we are only assuming that there is no contact with the ground. So, we can rewrite condition 18 as follows: By merging some terms we obtain: We analyzed the terms in parentheses and discovered that they approximate the position (or the angle) at the following timestep. To understand how this condition works, let us suppose that p_x^{k+1} ≈ 0 (i.e., the lander is horizontally centered). Then, if p_y^{k+1} ≈ 1 (i.e., near the starting point), we fire the right engine if θ^{k+1} ≤ −b/e ≈ −0.15 rad, i.e., the lander is going to fall to the right. When p_y^{k+1} ≈ 0 (i.e., near the landing pad), the agent fires the right engine if θ^{k+1} ≤ 0; so we can say that the farther the lander is from the landing pad (vertically), the more margin we have on the threshold of the angle. Let us now suppose that p_y^{k+1} = 0, to study the effect of p_x^{k+1} on the policy. Then, the agent turns on the right engine when θ ≤ (a/e) p_x^{k+1}; so, when the lander is on the right part of the environment, the agent uses a linear threshold to activate the engine, in order to avoid both high angles and high displacements from the landing pad location.
Similarly, when p_x^{k+1} is negative, the threshold is negative, so the agent tries to compensate both negative angles (which would move it farther to the left) and the distance from the landing point.
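The "next-timestep" approximation underlying this analysis is a forward Euler step, e.g. p_x^{k+1} ≈ p_x^k + τ v_x^k, with the experimentally measured τ = 0.05; as a minimal sketch:

```python
# One-step forward Euler prediction of a state variable from its
# current value and derivative, with tau = 0.05 as measured in the
# text for LunarLander-v2.

TAU = 0.05

def predict_next(value, derivative):
    return value + TAU * derivative
```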
Second condition. The second condition, when it evaluates to True, leads to the firing of the left engine. Also in this case, let us neglect the terms c_l and c_r. We can write the condition as: Of course, in this case the coefficients a, ..., f are different from the previous ones. By grouping the terms as before we obtain: Also in this case, the constants seem to have the same role (i.e., some lead to an overestimation of the next position and some to a better estimate), so we can write: This condition is easy to understand given the previous one: it is its opposite. This means that we can apply the same reasoning used above to understand this condition.
Third condition. This condition handles the firing of the main engine. For this reason, we expect it to work differently from the previous two. In fact, we can easily observe that the signs of the terms in x and y are inverted. Moreover, the two angular terms do not have the same sign. Also in this case, let us use a, ..., f to rename the constants and ignore c_l and c_r. This leads to: By grouping the variables similarly to the previous two conditions we obtain: Then, by denoting with v^{k+1} and v^{k−1} the value of the variable v at the next and the previous timestep respectively, we can write: An experimental measurement of the timestep length led us to set τ = 0.05. By multiplying all the members by τ we obtain: Then, by noting that τa ≈ 5 · 10^-3, τb ≈ 6.7 · 10^-3, τc ≈ 3.5 · 10^-2, τd ≈ 2.6 · 10^-2 and τe ≈ 10^-2, we can decide to neglect the effects of the first two terms. So we have: By merging the terms in θ and θ^{k−1} we obtain: −c v_x + d v_y + (f − τe)ω + τe θ^{k−1} < 0. By moving all the terms except the one in ω to the second member we get: Then, by noting that for all the states tested in this condition c|v̄_x| ≈ 5d|v̄_y| and c|v̄_x| ≈ 120e|θ̄^{k−1}| (where v̄ denotes the mean value of the variable v), we can neglect (as shown by experimental results) the effects of v_y and θ^{k−1}.
Finally, the rule used to fire the main engine is: While we expected the main engine to depend on p_y or v_y, by analyzing the activation of the condition in several episodes we found that this rule represents the landing phase. In fact, the goal of this rule is to balance the angular velocity and the linear velocity, to make the agent stop gently on the landing pad.
Fourth condition. This condition, when it evaluates to True, does not fire any engine. On the other hand, when it evaluates to False, it fires the main engine.
The condition is the following (also in this case we replace the constants with letters): a p_x − b p_y − c v_x − d v_y − e θ + f ω < 0 (33). By analyzing the mean values of the variables and their coefficients we obtain: a|p_x| ≈ 8.5 · 10^3, b|p_y| ≈ 8.7 · 10^3, c|v_x| ≈ 7 · 10^2, d|v_y| ≈ 2.5 · 10^2, e|θ| ≈ 1.3 · 10^2, f|ω| ≈ 4.7 · 10^2. This suggests that we can neglect the terms in p_x, p_y and θ, because their mean values are low w.r.t. the maximum. The experiments confirmed that these variables have a low impact on the performance of the agent.
So, the agent does not fire any engine when: This seems an extension of what we obtained for the previous condition, where we also have a dependency on v_y. Moreover, it is important to note that this check is performed only when the third condition is false. Finally, from the experiments we observed that this condition is usually true when the agent has successfully landed. In this case, the terms in c_l and c_r can be seen as a further margin for the agent, so that when a leg touches the ground the agent is more likely not to fire any engine.
In the opposite case, i.e., when ω ≥ c v_x + d v_y, the agent turns on the main engine to balance its high angular velocity. Note, again, that if the angular velocity is too low, it is handled by the previous condition.
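A sketch of this simplified behavior, with placeholder coefficients (the text only reports the products of the coefficients with the mean variable magnitudes, so the values of c and d below are illustrative assumptions; actions follow LunarLander-v2: 0 = do nothing, 2 = fire main engine):

```python
# Simplified fourth condition: after dropping the low-impact terms,
# the agent idles when omega < c*v_x + d*v_y and fires the main
# engine otherwise. C and D are placeholders, not the values of
# Figure 22.

C, D = 1.0, 1.0  # illustrative coefficients

def fourth_condition_action(v_x, v_y, omega):
    if omega < C * v_x + D * v_y:
        return 0  # no-op: do not fire any engine (e.g. after landing)
    return 2      # fire the main engine to damp the angular velocity
```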

Considerations
In this subsection, we interpreted the policies produced in various settings.
We showed that the decision trees produced are interpretable and give an understanding of how the agent works. It is important to note that, in several cases, we performed approximations to ease the understanding process. However, this is not a limitation of the method, because more exact interpretations can be obtained by not neglecting details. This is especially important in high-stakes or safety-critical settings, where humans need a thorough understanding to validate and trust the systems produced.
Finally, while some solutions may seem hard to interpret (e.g., oblique decision trees), it is important to see them in a bigger context: while they may not be easy to interpret at first sight, their analysis is fairly straightforward (as shown earlier). On the other hand, black-box models (such as deep neural networks) are much harder to inspect, due to the significantly larger number of operations performed in the decision-making process.

Conclusions
While in recent years AI has made huge progress, the need to understand how a model works is becoming more and more important. To address this issue, significant effort has been put into advancing the XAI field. However, XAI is not always a suitable solution: XAI techniques suffer from some problems that make their use unsafe in safety-critical or high-stakes processes.
Interpretable AI, instead, consists in using transparent approaches in order to have a complete understanding of what happens in the model. However, these models are not widely used in practice because of their presumed lower performance.
In this paper, we propose a two-level optimization method that makes it possible to induce decision trees that can perform reinforcement learning.
Our results show that the proposed approach is able to generate decision trees that are comparable to, or even better than, the non-interpretable state-of-the-art (from the performance point of view), while having significantly better interpretability. Furthermore, the results obtained in this work suggest that the widely assumed performance-interpretability trade-off does not always hold (as suggested by [1]) and that interpretable models can be competitive with state-of-the-art techniques. For this reason, research in this field should be encouraged.
Moreover, we compared the solutions obtained with the state-of-the-art from the point of view of interpretability. While the metric of interpretability does not perfectly suit our purpose, we can easily observe the difference in complexity with respect to black-box models. While we expect that changing the interpretability metric would not significantly affect the difference w.r.t. black-box models, we think that future work should focus on the study of interpretability metrics tailored to machine learning models.
Since it is important for practical applications, we also compared our solutions with the interpretable (and publicly available) state-of-the-art w.r.t. robustness to input noise. The results show that our approach is comparably robust or more robust than the other solutions.
Finally, we demonstrated that the produced agents can be interpreted, practically showing the advantage of interpretable models w.r.t. black boxes.
Other future developments include: experimental tests on more complex reinforcement learning domains; the extension of the proposed method to the imitation learning domain; the development of a method that can automatically tune the constants, reducing the prior knowledge that must be included in the grammar; and a flexible grammar that easily allows oblique trees to become orthogonal, to automatically choose the appropriate type of split depending on the problem.