Why did I fail? A Causal-based Method to Find Explanations for Robot Failures

Robot failures in human-centered environments are inevitable. Therefore, the ability of robots to explain such failures is paramount for interacting with humans to increase trust and transparency. To achieve this skill, the main challenges addressed in this paper are I) acquiring enough data to learn a cause-effect model of the environment and II) generating causal explanations based on that model. We address I) by learning a causal Bayesian network from simulation data. Concerning II), we propose a novel method that enables robots to generate contrastive explanations upon task failures. The explanation is based on setting the failure state in contrast with the closest state that would have allowed for a successful execution. This state is found through breadth-first search and is based on success predictions from the learned causal model. We assessed our method in two different scenarios I) stacking cubes and II) dropping spheres into a container. The obtained causal models reach a sim2real accuracy of 70% and 72%, respectively. We finally show that our novel method scales over multiple tasks and allows real robots to give failure explanations like 'the upper cube was stacked too high and too far to the right of the lower cube.'


I. INTRODUCTION
One important component in human interactions is the ability to explain one's actions, especially when failures occur [1], [2]. It is argued that robots need this skill if they are to act in human-centered environments on a daily basis [3]. Moreover, explainability has been shown to increase trust and transparency in robots [1], [2], and the diagnostic capabilities of a robot are crucial for correcting its behavior [4].
There are different types of failures, e.g., task recognition errors (an incorrect action is learned) and task execution errors (the robot drops an object) [5], [6]. In this work, we focus on explaining execution failures. For example, a robot is asked to stack two cubes (see Fig. 1). The robot will first pick up a cube and move its gripper above the goal cube. However, due to sensor and motor inaccuracies, the robot places its gripper slightly shifted to the left, which results in an imperfect alignment between the cubes. Therefore, the upper cube lands on the goal cube but bounces off onto the table. In such a situation, we expect the robot to reason about what went wrong and generate an explanation based on its previous experience, e.g., 'I failed because the upper cube was dropped too far to the left of the lower cube.'

Fig. 1. [...] A contrastive explanation is generated upon task failures (steps 3, 4). Finally, the obtained models are evaluated on two different tasks (cube stacking and sphere dropping) and transferred to two different robots that provide explanations when they commit errors.
Typically, explanations are based on the concept of causality [7]. Obtaining a causal model of the environment is often addressed through statistical methods that learn a mapping between possible causes (preconditions) and the action outcome [4], [8]. However, such statistical models alone are not explanations in themselves [1] and require another layer that interprets these models to produce explanations. Another problem is that a considerable amount of data is needed to learn cause-effect relationships. In this case, training such models in a simulated environment allows for faster and more extensive experience acquisition [4].
In this paper, we propose a method for generating causal explanations of failures based on a causal model that provides robots with a partial understanding of their environment (see Fig. 1). First, we learn a causal Bayesian network from simulated task executions, tackling the problem of knowledge acquisition. We also show that the obtained model can transfer the acquired knowledge (experience) from simulation to reality and is agnostic to several real robots with different embodiments. Second, we propose a new method to generate explanations of execution failures based on the learned causal knowledge. Our method is based on a contrastive explanation comparing the variable parametrization associated with the failed action with its closest parametrization that would have led to a successful execution, which is found through breadth-first search (BFS). Finally, we analyze the benefits of this method in two different scenarios: I) stacking cubes and II) dropping spheres into a container.
To summarize, our contributions are as follows:
• We present a novel method to generate contrastive causal explanations of action failures based on causal Bayesian networks.
• We demonstrate how causal Bayesian networks can be learned from simulations, exemplified in a cube stacking and a sphere dropping scenario, and provide extensive real-world experiments that show that the obtained causal models are transferable from simulation to reality without any retraining.
Our method is agnostic to various robot platforms with different embodiments and scales over multiple tasks and scenarios. We thus show that the simulation-based model serves as an excellent prior experience for the explanations, making them more generally applicable.

II. RELATED WORK

A. Causality in Robotics
Despite being acknowledged as an important concept, causality is relatively underexplored in the robotics domain [9], [2]. Some works explore causality to distinguish between task-relevant and task-irrelevant variables [10]. For example, CREST [11] uses causal interventions on environment variables to discover which of the variables affect an RL policy. They find that excluding the irrelevant variables improves generalizability and sim-to-real transfer. In [12], a set of causal rules is defined to learn to distinguish between unimportant features in physical relations and object affordances. Brawer et al. present a causal approach to tool affordance learning [9]. Some works explore Bayesian networks, for example, to learn statistical dependencies between object attributes, grasp actions, and a set of task constraints from simulated data [13]. While the main objective is to use graphical models to generalize task executions, these works do not look into the question of how these models can be utilized for failure explanations. A different paper [14] investigates the problem of learning causal relations between actions in household-related tasks. They discover from human demonstrations, for example, that there is a causal connection between opening a drawer and retrieving plates. They only retrieve causal links between actions, while we focus on causal relations between different environment variables, like object features and the action outcome.

B. Learning explainable models of cause-effect relations
In the planning domain, cause-effect relationships are represented through (probabilistic) planning operators [15]. Mitrevski et al. propose the concept of learning task execution models, which consists of learning symbolic preconditions of a task and a function approximation for the success model [4], based on Gaussian Process models. They noted that a simulated environment could be incorporated for a faster and more extensive experience acquisition, as proposed in [13]. Human virtual demonstrations have been used to construct planning operators to learn cause-effect relationships between actions and observed state-variable changes [15]. However, even though symbolic planning operators are considered human-understandable, they are not explanations in themselves, thus requiring an additional layer that interprets the models and generates failure explanations.
Some other works also aim to learn probabilistic action representations from experience to generalize the acquired knowledge, for example, by learning the probabilistic action effects of dropping objects into different containers [8]. Again, the main objective is to find an intelligent way of generalizing the probability predictions for a variety of objects, e.g., bowl vs. bread box, but their method does not include any understanding of why there is a difference in the dropping success probabilities between these different objects.

C. Contrastive Explanations
Contrastive explanations are deeply rooted in the human way of generating explanations [1]. This has also had a significant impact on explanation generation in other fields like Explainable AI Planning (XAIP) [16]. In XAIP, typical questions that a machine should answer are 'why was a certain plan generated vs. another one?' or 'why does the plan contain a particular action $a_1$ and not action $a_2$?' [16], [17]. We, however, are mostly interested in explaining why specific actions failed based on environment variables like object features. A method for explaining synthesis failures of high-level robot task specifications (encoded through LTL formulae) is presented in [18]. However, the causes need to be explicitly modeled (violations of specification constraints), while, in our approach, the causes are automatically detected during the BN learning process. Das et al. generate verbal failure explanations [19] by learning an encoder-decoder network that maps current information about the robot and environment state into a vector of words. However, the method does not scale well since it requires data with annotations about each failure cause. Our approach only requires annotations regarding the action success, which can be binary and are generally easier to obtain. Additionally, we encode the explanations directly in the causal structure of the different state variables instead of learning a black-box model. In a follow-up study [20], the authors use MOTIFNET [21] to autonomously detect spatial relationships and object attributes in a given scene. Then, pairwise ranking is used to filter out the subset of relevant relations for a particular explanation. Annotations for pairwise preferences of one relation over another need to be provided for training an SVM, which cannot be easily automated since they require human input. Due to the close relation to our approach, we discuss these works [19], [20] in more detail in Sec. V-D.

III. OUR APPROACH TO EXPLAINING FAILURES
Our proposed approach consists of three main steps: A) identification of the variables used in the analyzed task; B) learning a Bayesian network, which requires 1) learning a graphical representation of the variable relations (structure learning) and 2) learning the conditional probability distributions (parameter learning); and C) our proposed method to explain failures, based on the previously obtained model.

A. Variable definitions and assumptions
Explaining failures requires learning the connections between possible causes and effects of an action. We describe an action via a set of random variables $X = \{X_1, X_2, \ldots, X_n\}$, which need to be defined by the experiment designer during the experiment setup. We require $X$ to contain a set of treatment variables $C \subset X$, which describe potential causes, and outcome (effect) variables $E \subset X$. Then, the goal of causal inference is to estimate the effect of $C$ on $E$ [22].
Data samples for learning the causal model can, in principle, be collected in simulation or the real world. A data sample $d$ consists of a particular parametrization of $X$, which we define as $d = \{X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\}$, where $n$ denotes the number of variables. It is important to sample values for the possible causes $C$ randomly. Randomized controlled trials are referred to as the gold standard for causal inference [3] and allow us to exclude the possibility of unmeasured confounders. Consequently, all detected relations between the variables $X$ are indeed causal and not merely correlations. Besides the apparent advantage of generating truly causal explanations and avoiding the danger of possible confounders, causal models can also answer interventional questions. In contrast, non-causal models can only answer observational queries. The experiment setup must satisfy the sampled variable values before the action is executed for data collection. $E$ is measured at the end of the experiment.
We define another set $X_{goal} = \{d_{goal_1}, d_{goal_2}, \ldots, d_{goal_h}\}$ that contains all possible variable parametrizations that denote a successful action execution. Then, an action is successful iff its parametrization $d \in X_{goal}$. Note that it is out of the scope of this paper to discuss methods that learn $X_{goal}$; we rather assume $X_{goal}$ to be provided a priori. In other words, we assume that the robot knows how an unsuccessful task execution is defined in terms of its outcome variables $E$ and is thus able to detect it by comparing the action execution outcome with $X_{goal}$. Note, however, that the robot has no a priori knowledge about which variables in $X = \{X_1, X_2, \ldots, X_n\}$ are in $C$ or $E$, nor how they are related. This knowledge is generated by learning the Bayesian network.
To efficiently learn a Bayesian network, some assumptions are needed to handle continuous data [23], mainly because many structure learning algorithms do not accept continuous variables as parents of discrete/categorical variables [24]. In our case, this means that some effect variables in E could not have continuous parent variables in C, which would likely result in an incorrect Bayesian network structure. As a preprocessing step, we therefore discretize all continuous random variables in X into intervals with an equal number of samples.
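The equal-frequency discretization described above can be sketched as follows. This is a minimal illustration, not the preprocessing code used in the paper: the function names `equal_frequency_edges` and `discretize`, the toy `x_off` values, and the choice of four intervals are all assumptions made for the example.

```python
from bisect import bisect_right


def equal_frequency_edges(values, n_bins):
    """Return the n_bins - 1 inner cut points of an equal-frequency binning."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut point k is the sample at the k/n_bins position of the sorted data.
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def discretize(value, edges):
    """Map a continuous value to an interval index in 0..len(edges)."""
    return bisect_right(edges, value)


# Hypothetical x-offsets (in cm); real data would come from the simulation.
x_off = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
edges = equal_frequency_edges(x_off, n_bins=4)
bins = [discretize(v, edges) for v in x_off]
```

With this toy sample, each of the four intervals receives two values, which is the property the preprocessing step aims for.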

B. Our proposed pipeline to learn the causal model
Formally, Bayesian networks are defined via a graphical structure $G = (V, A)$, which is a directed acyclic graph (DAG), where $V = \{X_1, X_2, \ldots, X_n\}$ represents the set of nodes and $A$ is the set of arcs [24]. Each node $X_i \in X$ represents a random variable. Based on the dependency structure of the DAG and the Markov property, the joint probability distribution of a Bayesian network can be factorized into a set of local probability distributions, where each random variable $X_i$ only depends on its direct parents $\Pi_{X_i}$:

$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \Pi_{X_i})$ (1)

Learning a Bayesian network from data consists of two steps:
1) Structure Learning: The purpose of this step is to learn the graphical representation of the network $G = (V, A)$, which can be achieved by a variety of different algorithms. An extensive survey of potentially equally valid structure learning algorithms, like [25], is presented in [22]. For the remainder of this paper, we choose the Grow-Shrink [26] algorithm (gs) to learn $G$. gs falls into the category of constraint-based algorithms, which use statistical tests to learn conditional independence relations (also called "constraints") from the data [27]. Note that learning plausible assumptions about causal relations is one of the biggest challenges in the whole process of causal inference [28]. For example, in some cases it is challenging to determine the direction of causal relations purely from the joint distribution of the observational data (thus without additional interventional experiments, additional domain knowledge, or certain assumptions about the data distribution) [29]. Structure learning is an active field of research [28], and this paper uses the learned structure to generate causal-based explanations of failures. Therefore, we assume that the outcome of the structure learning step is indeed the correct Bayesian network graph $G$, or has been manually revised based on domain knowledge.
2) Parameter Learning: The purpose of this step is to fit functions that reflect the local probability distributions of the factorization in formula (1). We utilize the maximum likelihood estimator for conditional probabilities (mle) to generate a conditional probability table based on the previously obtained network structure.
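For discrete variables, maximum likelihood estimation of a conditional probability table reduces to counting relative frequencies. The following sketch illustrates this for a single node. The function `fit_cpt` and the four toy samples are hypothetical and only serve to show the principle; they are not the implementation or data used in the paper.

```python
from collections import Counter, defaultdict


def fit_cpt(samples, child, parents):
    """Maximum-likelihood CPT: P(child | parents) as relative frequencies."""
    joint = Counter()
    parent_totals = Counter()
    for sample in samples:
        key = tuple(sample[p] for p in parents)
        joint[(key, sample[child])] += 1
        parent_totals[key] += 1
    cpt = defaultdict(dict)
    for (key, value), count in joint.items():
        # Relative frequency of the child value within this parent setting.
        cpt[key][value] = count / parent_totals[key]
    return dict(cpt)


# Hypothetical discretized samples: xOff interval index and stacking outcome.
samples = [
    {"xOff": 0, "onTop": 1},
    {"xOff": 0, "onTop": 1},
    {"xOff": 0, "onTop": 0},
    {"xOff": 1, "onTop": 0},
]
cpt = fit_cpt(samples, child="onTop", parents=["xOff"])
```

Here `cpt[(0,)][1]` is the estimated probability of a successful stack when xOff falls into interval 0. Parent settings that never occur in the data have no entry, which is one reason why a large number of samples is helpful.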

C. Our proposed method to explain failures
Our proposed method to generate contrastive failure explanations uses the obtained causal Bayesian network to compute success predictions and is summarized in Algorithm 1. In line 2 of Alg. 1, a matrix is generated that defines the transitions for every single-variable change over all possible variable parametrizations. For example, given two variables $X_1$, $X_2$ with two intervals each, the valid transitions from a node are the parametrizations that differ from it in the interval of exactly one variable. Lines 5-15 of Alg. 1 describe the adapted BFS procedure, which searches for the closest variable parametrization that fulfills the goal criterion $P(d \in X_{goal} \mid \Pi_{child}) > \epsilon$, where $\epsilon$ is the success threshold, which can be set heuristically. The concept of our proposed method is to generate contrastive explanations that compare the current variable parametrization associated with the execution failure, $x_{current_{int}}$, with the closest parametrization that would have allowed for a successful task execution, $x_{solution_{int}}$. Consider Figure 2 for a visualization of the explanation generation, exemplified on two variables $X$ and $Y$, which both causally influence the variable $X_{out}$. Furthermore, it is known that $x_{out} = 1 \in X_{goal}$. The resulting explanation would be that the task failed because $X = x_1$ instead of $X = x_2$ and $Y = y_4$ instead of $Y = y_3$.

Algorithm 1 Failure Explanation
Input: failure variable parametrization $x_{failure}$, graphical model $G$, structural equations $P(X_i \mid \Pi_{X_i})$, discretization intervals of all model variables $X_{int}$, success threshold $\epsilon$, goal parametrizations $X_{goal}$
Output: solution variable parametrization $x_{solution_{int}}$, solution success probability prediction $p_{solution}$
6: node ← POP(q)
7: v ← APPEND(v, node)
8: for each transition t ∈ P(node) do
9:     child ← CHILD(P, node)
10:     if child ∉ q, v then
11:         $p_{solution}$ = $P(d \in X_{goal} \mid \Pi_{child})$
12:         if $p_{solution} > \epsilon$ then
13:             $x_{solution_{int}}$ ← child
14:             RETURN($p_{solution}$, $x_{solution_{int}}$)
15:         q ← APPEND(q, child)
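The search procedure can be sketched in a few lines of Python. This is a simplified illustration under several assumptions and not the authors' implementation: states are tuples of interval indices, a transition moves exactly one variable to an adjacent interval (adjacency is an assumption of this sketch), a single `seen` set replaces the separate queue and visited checks, and `toy_success_prob` is a hypothetical stand-in for the success prediction of the learned Bayesian network.

```python
from collections import deque


def neighbours(state, n_intervals):
    """Yield states that differ from `state` by one step in one variable."""
    for i, (value, size) in enumerate(zip(state, n_intervals)):
        for step in (-1, 1):
            if 0 <= value + step < size:
                yield state[:i] + (value + step,) + state[i + 1:]


def explain_failure(failure, n_intervals, success_prob, threshold):
    """Breadth-first search for the closest state predicted to succeed."""
    queue = deque([failure])
    seen = {failure}
    while queue:
        node = queue.popleft()
        for child in neighbours(node, n_intervals):
            if child in seen:
                continue
            seen.add(child)
            p = success_prob(child)
            if p > threshold:
                return child, p
            queue.append(child)
    return None, 0.0  # no parametrization exceeds the threshold


def contrast(names, failure, solution):
    """List the variables whose interval differs between failure and solution."""
    return [
        f"{n} = {f} instead of {n} = {s}"
        for n, f, s in zip(names, failure, solution)
        if f != s
    ]


def toy_success_prob(state):
    # Hypothetical model: only the parametrization (1, 2) is likely to succeed.
    return 0.9 if state == (1, 2) else 0.1


failure = (0, 3)
solution, p_solution = explain_failure(failure, (4, 4), toy_success_prob, 0.8)
explanation = contrast(["X", "Y"], failure, solution)
```

Because the search expands states level by level, the first state that exceeds the threshold requires the smallest number of single-interval changes. The toy example mirrors the two-variable case of Figure 2 with zero-based interval indices.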

IV. EXPERIMENTS
We evaluate our method to find causal explanations of failures based on two different scenarios. The goal of experiment 1 is to stack one cube on top of another. The goal of experiment 2 is to drop a sphere into different containers.

A. Experiment 1: Stacking Cubes
In the cube stacking scenario, the environment contains two cubes: CubeUp and CubeDown (see Fig. 3). The goal of the stacking action is to place CubeUp on top of CubeDown. We define six variables as follows: X = {xOff, yOff, dropOff, colorDown, colorUp, onTop}. Both cubes have an edge length of 5cm.
1) Cube Stacking simulation setup: We run the simulations in Unity3d, which bases its physics behavior on the Nvidia PhysX engine. For training the Bayesian network, we generate 20,000 samples on 500 parallel table environments (see Fig. 1). We randomly sample values for xOff, yOff ∼ U [...]
[...] These 36 points were empirically chosen because they cover an area where the ideal conditions of the simulation (e.g., collisions without any rotation due to the gripper motors) could potentially have the most significant effect on behavior discrepancies. Once the upper cube is too far outside (> 2.5 cm), it does not matter how it was dropped, so we have not included points with offsets larger than 2 cm. For each unique stacking setup instantiation, we conduct five iterations. After each trial, the operator re-adjusts the cubes into a consistent pre-stack position. The stacking outcome (onTop value) was also determined by the operator. Note that the purpose of the robot experiments is not to modify the causal model that we learned from the simulation but to evaluate the model transferability to the real environment.

B. Experiment 2: Dropping spheres into containers
In our second experiment, the robot needs to drop spheres into different containers. The environment contains a Sphere and one of several possible Containers, which are shaped like a plate, bowl or glass (see Fig. 4). We define eight variables as follows: X = {xOff, yOff, inCont, contHeight, contSize, contType, contCurvature, contColor}. This new list of variables is required because experiment 2 contains a different set of objects and new randomization parameters like the size, which we did not consider in experiment 1. The sphere has a diameter of 6.6cm and the size of the containers is randomized. We chose a constant dropping height of 0.4m.
1) Sphere Dropping simulation setup: For training the Bayesian network, we generate 800,000 samples on 400 parallel table environments (see Fig. 4.c) in Unity3d. We randomly sample values for xOff, yOff ∼ U[−6, 6] (in cm), contColor = {Red, Blue, Green, Orange}, and contType = {Glass, Plate, Bowl}. We do not directly set contHeight and contSize, but manipulate these variables via a scalingFactor ∼ U[0.4, 1]. In Unity3d, the scaling factor can be utilized to manipulate the size of objects. We use the same scaling factor in all three dimensions; thus, the height (contHeight) to diameter (contSize) ratio is constant for each object type. inCont = {True, False} is not sampled but automatically determined after the dropping process.
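The randomization of the treatment variables can be illustrated with the following sketch. The sampling ranges follow the text, whereas the base dimensions in `BASE_DIMS` are made-up placeholder values and `sample_scene` is a hypothetical helper. The outcome inCont is not part of the sketch, since it is determined by the physics simulation.

```python
import random

# Hypothetical base (height, diameter) per container type, in metres.
BASE_DIMS = {"Glass": (0.15, 0.08), "Plate": (0.03, 0.25), "Bowl": (0.08, 0.20)}
COLORS = ["Red", "Blue", "Green", "Orange"]


def sample_scene(rng):
    """Draw one randomized parametrization of the treatment variables."""
    cont_type = rng.choice(sorted(BASE_DIMS))
    scale = rng.uniform(0.4, 1.0)  # same factor in all three dimensions
    base_height, base_size = BASE_DIMS[cont_type]
    return {
        "xOff": rng.uniform(-6.0, 6.0),  # cm
        "yOff": rng.uniform(-6.0, 6.0),  # cm
        "contColor": rng.choice(COLORS),
        "contType": cont_type,
        "contHeight": scale * base_height,
        "contSize": scale * base_size,
    }


rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(1000)]
```

Since contHeight and contSize share one scaling factor, their ratio stays fixed per container type, so both variables depend on contType in the sampled data.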
2) Sphere Dropping robot experiment setup: We run and assess the sphere dropping experiment on the TIAGo service robot (as introduced in Sec. IV-A.2). As a sphere, we use a regular tennis ball, which weighs around 58 g and has a diameter of around 6.6 cm. We chose three containers, each of which has a height and size parametrization that was captured in the model (Fig. 4.d). We evaluate each container at nine different points, where xOff, yOff = {0, 3, 6} (in cm). For each unique sphere dropping setup instantiation, we conduct five iterations. Similar to the prior experiment, the containers are re-adjusted by a human operator, who also determines the dropping outcome.

A. Analysis of the obtained causal models
We first present and discuss the learned causal model of the cube stacking scenario. 10-fold cross-validation reports an average loss of 0.10269 and a standard deviation of 0.00031, and Figure 5.a displays the resulting DAG. The graph indicates that there are causal relations from xOff, yOff and dropOff to onTop, while the two color variables colorDown and colorUp are independent. In other words, it makes a difference from which position the cube is dropped, but the cube color has no impact on the stacking success. We obtained the following dropOff intervals (in cm): [...] The conditional probabilities P(onTop = 1 | Π_onTop) are visualized in Fig. 6. These plots allow us to conclude that the stacking success decreases with a greater drop-offset and with a greater offset in both the x- and y-direction. In particular, there is a diminishing chance of stacking success for the values |xOff| > 1.8 or |yOff| > 1.8, no matter the dropOff height. The obtained DAG for the sphere dropping experiment is visualized in Fig. 5.b. The algorithm detected causal links between contHeight, contSize, contType, and contCurvature (marked in blue), but, initially, was not able to direct these four edges (which is a common problem in structure learning, as discussed in Sec. III-B.1). Since it is not possible to fit a conditional probability table based on an undirected graph, we directed these edges manually based on our domain knowledge of the task. 10-fold cross-validation reports an average loss of 0.184 and a standard deviation of 0.000052. The graph indicates causal links between contType and inCont via contHeight and contSize, while contColor and contCurvature do not impact the dropping success.
Querying the causal model for the sphere dropping success given the object type reveals that the task is most likely successful for bowls (69%), followed by plates (59%) and glasses (25%). The model also indicates that larger-sized objects are more tolerant to x/y-offsets and yield a higher chance of dropping success (Fig. 7). Surprisingly, a similar trend cannot be observed for the container height (Fig. 7). The reasons are twofold: First, the container height depends not only on the size but also on the type of container (e.g., glasses typically have a larger height than plates). As a result, not all object types are represented in all height intervals, e.g., plates are only covered in h1 and h2, whereas cups are distributed among h2-h4. Second, cups and plates were found to contribute to a higher dropping success chance. As a result, the largest success chances can be obtained in interval h2.
Overall, we conclude that the obtained success probabilities resemble our intuitive understanding of the physical processes for both scenarios. Nevertheless, real-world experiments have a higher complexity due to the many environment uncertainties. We, therefore, expect the simulation to be less conservative than reality, as we have higher control over the variables involved in the stacking process.

B. sim2real accuracy of the causal models
To evaluate how well the causal model and the real-world match, we introduce the sim2real accuracy score. It is defined as the normalized difference in predicted probabilities over the set of points that were evaluated in real-world experiments.
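Since the definition above is given only verbally, the following sketch shows one possible reading of the score: one minus the mean absolute difference between the predicted and the measured success probabilities over the evaluated points. This formula is an assumption made for illustration, and the numbers in the example are made up.

```python
def sim2real_accuracy(predicted, measured):
    """One minus the mean absolute difference between two probability lists."""
    if not predicted or len(predicted) != len(measured):
        raise ValueError("need two non-empty lists of equal length")
    total = sum(abs(p - m) for p, m in zip(predicted, measured))
    return 1.0 - total / len(predicted)


# Hypothetical values: model predictions vs. measured success rates
# (e.g., the fraction of successful trials per evaluated point).
predicted = [0.9, 0.5, 0.1]
measured = [1.0, 0.4, 0.4]
score = sim2real_accuracy(predicted, measured)
```

Under this reading, a score of 1 means that the model and the real-world measurements agree at every evaluated point, and the score decreases linearly with the average deviation.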
The results for the real-world experiments of the cube stacking scenario are presented in Fig. 8, where the black points indicate the nine stacking locations (all possible combinations of x- and y-offset values) for each of the four drop-off heights. The plots show the contours of the probabilities, meaning that the stacking success probabilities are interpolated between the nine measurement points. The sim2real accuracy amounts to 71% for the TIAGo and 69% for the UR3. The largest discrepancy between model and reality is observed for the higher drop-off positions. For the real-world measurements, the stacking success drops earlier, at around 2 cm or 3.5 cm. It is also interesting to compare the probability outcomes of the two differently embodied robots. The correspondence concerning the 36 measured positions amounts to 85%.
The results for the sphere dropping experiments are presented in Fig. 9. We obtained 72% sim2real accuracy for the tested data points. We observed the most significant discrepancy for the plate, which was predicted to have a much larger success probability by the model. One reason could have been, for example, that the surface of the real plate was not perfectly flat as in the simulations.
We can conclude that for both tested scenarios, the probability model obtained from simulated data matches reasonably well with reality and thus can be utilized for the explanation of failures that occur in the real world. Furthermore, the model generalizes well to differently embodied robots. We want to emphasize that the causal model was not retrained or adapted when the real scenarios were tested. If we had obtained a lower sim2real accuracy or more significant differences between the two robots, it would be advisable to include robot-specific variables (such as the gripper type and orientation) and adapt the model with real-world data. But even then, the model that we obtain from the simulation can be used as an excellent experience prior, allowing for faster applicability and learning.

C. Explainability capabilities
Finally, we provide several concrete examples to showcase which kind of explanations our method finds for robot failures. We set the probability threshold that distinguishes a failure from success to $\epsilon = 0.8$ for all examples. Tab. I provides three examples for each of the two scenarios of stacking a cube and dropping a sphere. Cube Stacking - Example 2 is particularly interesting, as it showcases that there are often multiple correct explanations for the error. In this case, it would have been possible to achieve a successful stacking by either going from dropOff = $z_4$ to dropOff = $z_3$ or by changing xOff = $z_4$ to xOff = $z_3$ (the search tree is visualized in Fig. 10). Which solution is found first depends on the variable prioritization within the tree search due to the used BFS algorithm. ...

TABLE I (fragment): [...] 0.957. Explanation: The container was too small and too high (interpretation: take a bowl instead of a cup, which is bigger but has less height). Sphere Dropping - Example 2: xOff = 0.059, yOff = 0.059, contSize = 0.21, contHeight = 0.09 [...]

Our closest solution will lead to a minimal number of interval changes and thus provides the 'simplest' solution in terms of Occam's razor principle [1]. For instance, in Sphere Dropping - Example 2, it would have also been possible to change the container to a larger bowl. But instead, the search process found it was easier to adapt the xOff position. The advantage of the current uninformed BFS is that this principle is always applicable and does not require any human domain knowledge.

D. Comparison of our failure explanation approach with baseline methods
We compare our method of finding explanations of robot task failures with the two closely related methods of Context-Based History (CB-H) explanations [19] and the ranked Semantic Scene Graph method (SSG-R) [20], based on the criteria that are summarized in Tab. II. For CB-H, all failures and their causes need to be manually defined in the form of Fault Trees. In SSG-R, failures are not modeled, but explained in the form of a list of spatial relations (like close to or occluded) and object features (like fragile or heavy), automatically detected through the semantic scene graph model MOTIFNET [21]. We explain failures via contrastive variable parametrizations. Due to these differences in failure representation, all three methods have different requirements during the learning step. For learning the encoder-decoder network that generates language failure explanations for CB-H, simulations must be annotated with the respective failure cause. In [19], 2100 annotated time-steps were used to train for six different failure causes. However, the number of required samples will drastically increase for the two discussed examples of cube stacking and sphere dropping due to the increased number of failure possibilities. Additionally, samples are more expensive than in our method since it is required to label the failure cause instead of a simple binary action success label. In SSG-R, pairwise ranking distinguishes between relevant and irrelevant relations. Pairwise relation preferences must be provided via domain knowledge of the failure scenario, which makes them more expensive than the automatically retrievable binary action success labels of our method. Another difficulty in terms of applicability to the presented scenarios of cube stacking and sphere dropping is posed by the continuous variables (e.g., contSize or xOff), which are discretized into more than two categories (as opposed to binary object relations). For these variables, MOTIFNET is not applicable.
While, in principle, a range of variables was detected to influence the action outcome causally, it is due to a specific variable parametrization that they lead to the action failure. Our method automatically discerns between relevant and irrelevant relations. Last but not least, neither CB-H nor SSG-R learns an action success model, which can be useful for other tasks beyond failure explanation, e.g., failure prediction and prevention.
To conclude, both of these methods would require significant changes and adaptation to find explanations for the experiment scenarios discussed in this paper. One of the most significant differences between both methods and ours is the requirement of failure-cause labels instead of action success labels, which are typically easier to obtain.

VI. CONCLUSION
This paper presents our novel approach to finding causal explanations for robot failures. First, we learn a causal Bayesian network from simulated task executions. We show that the model is transferable to the real world with 70% and 72% accuracy over the two tasks of stacking cubes and dropping a sphere into different containers, and that it is agnostic to differently embodied robots. Furthermore, we propose a new method to generate explanations of execution failures based on the causal model. This method finds a contrastive explanation comparing the action parametrization of the failure with its closest parametrization that would have led to a successful execution, which is found through breadth-first search (BFS). For future work, we would like to incorporate a language model that automatically encodes the contrastive failure explanations into a vector of words, such that they can be communicated more intuitively to a wide range of potential users. Furthermore, we want to investigate how the obtained causal models can also be used to predict and prevent failures from happening.