Towards Human-Centered Explainable AI: A Survey of User Studies for Model Explanations

Explainable AI (XAI) is widely viewed as a sine qua non for ever-expanding AI research. A better understanding of the needs of XAI users, as well as human-centered evaluations of explainable models, are both a necessity and a challenge. In this paper, we explore how human-computer interaction (HCI) and AI researchers conduct user studies in XAI applications, based on a systematic literature review. After identifying and thoroughly analyzing 97 core papers with human-based XAI evaluations over the past five years, we categorize them along the measured characteristics of explanatory methods, namely trust, understanding, usability, and human-AI collaboration performance. Our research shows that XAI is spreading more rapidly in certain application domains, such as recommender systems, than in others, but that user evaluations are still rather sparse and incorporate hardly any insights from cognitive or social sciences. Based on a comprehensive discussion of best practices, i.e., common models, design choices, and measures in user studies, we propose practical guidelines on designing and conducting user studies for XAI researchers and practitioners. Lastly, this survey also highlights several open research directions, particularly linking psychological science and human-centered XAI.


INTRODUCTION
Artificial Intelligence (AI) is driving digital transformation and is already an integral part of various everyday technologies. Recent developments in AI are essential to progress in fields such as recommendation systems [98,99,100], autonomous driving [101,102,103], or robotics [104,105,106]. Moreover, AI's success story has not excluded high-stakes decision-making tasks like medical diagnosis [107,108,109], credit scoring [110,111,112], jurisprudence [113,114], or recruiting and hiring decisions [115,116]. However, the behavior and decision-making processes of modern AI systems are often not understandable, so they are frequently considered black boxes. Deploying such black-box models presents a serious dilemma in safety-critical domains, for instance, public health or finance [117]. This is due to the necessity for a transparent and trustworthy AI system, which is required by both practitioners (to gain better insights into system functioning) and end users (to rely on model decisions).
• Y. Rong, Tina Seidel, Gjergji Kasneci, and Enkelejda Kasneci are with Technical University of Munich, 80335, Munich, Germany. E-mail: {yao.rong,tina.seidel,gjergji.kasneci,enkelejda.kasneci}@tum.de
• Tobias Leemann, Thai-trang Nguyen, and Lisa Fiedler are with University of Tübingen, 72076, Tübingen, Germany. E-mail: {tobias.leemann}@uni-tuebingen.de, {thai-trang.nguyen,lisa.fiedler}@student.uni-tuebingen.de
• Peizhu Qian and Vaibhav Unhelkar are with Rice University, 77005, Houston, USA. E-mail: {pq3, vaibhav.unhelkar}@rice.edu
Manuscript received 29 January 2023. Corresponding author: yao.rong@tum.de

Methods to increase the interpretability and transparency of an AI system are developed in the research area of Explainable AI (XAI). Specifically, human-centered XAI, which addresses the importance of human stakeholders to AI systems, has been proposed and discussed [118,119]. While a huge number of model explanations are available, the question of how to transparently evaluate their quality is still open, and hence, extensively studied in recent years. A popular taxonomy of evaluation strategies for XAI methods proposes three categories: functionally-grounded evaluation, application-grounded evaluation, and human-grounded evaluation [120]. While functionally-grounded measures do not require human labor, the other two involve human subjects and are more costly to conduct. Many functionally-grounded measures have been proposed to evaluate XAI algorithms (see [121] for a review); however, the limited comparability between different automatic evaluation measures is a common problem [122,123]. Another drawback of automated measures is that there is no guarantee that they truly reflect humans' preferences [40,124]. Consequently, user studies in XAI, especially when moving towards real-world products, are inevitable if one wishes to test more general claims about the quality of explanations [16]. However, only a small portion (about 20 %) of XAI evaluation
projects consider human subjects [121]. There exist efforts in developing taxonomies or introducing the definitions or implications of different human-centric evaluations [125,126,127], but the recent generation of user studies and their findings have not been systematically discussed yet. Moreover, Yang et al. [128] point out that XAI is growing separately and treated differently in different communities (e.g., machine learning and HCI). Hence, effective guidance in XAI user study design is crucial to help both XAI algorithm and application designers recognize users' real needs. This work aims to bridge this research gap in modern XAI user study design by distilling practical guidelines for user studies through a comprehensive and structured literature review.
Therefore, we reviewed highly relevant papers that include user studies from top-tier HCI and XAI venues.
Specifically, we included the last five years of CHI, IUI, UIST, CSCW, FAccT, ICML, ICLR, NeurIPS, and AAAI. As we aim at analyzing human user evaluation of advanced model explanations, we ran search queries involving keywords from the two groups "explainable AI" and "user study", as listed in Table 2. We selected the papers containing at least one keyword from each group, resulting in over one hundred papers. We then thoroughly studied these papers and filtered out those that did not fulfill the criteria: (1) deploying explainable models or techniques and (2) conducting an assessment with human subjects. We identified a total of 97 core papers for this survey (see Table 1 for an overview of core papers with respect to their measured quantities in user studies). Based on these core papers, we performed a comprehensive analysis to fill the research gap by offering a systematic overview of user studies in XAI. We highlight the main contributions:
1) To offer an overview of the foundational work of user studies in XAI, we investigated the references of all 97 core papers in a data-driven manner. Likewise, we analyzed follow-up works building on these core papers (identified through citations of core papers) to reveal the fields impacted by XAI user evaluations (Section 3).
2) We present a summary of the design details in XAI user studies with a particular focus on the deployed models and explanation techniques, experimental design patterns, participants, as well as concrete measures, providing inspiration on how to collect human assessments (Section 4).
3) We discuss the impact of using explanations on different aspects of user experience (Section 5), which can serve as an overview of the effectiveness of current XAI technology and a summary of the state of the art.
4) Based on the examined user study details and their best-practice findings, we synthesize guidelines for designing an effective user study for XAI (Section 6).
5) Beyond the user study design, we discuss potential paradigms of AI systems understanding humans in the context of, e.g., theory of mind, as well as other future research directions (Section 7).
Through a data-driven bibliometric analysis, our study highlights under-investigated areas in current user-centered XAI research, such as cognitive or psychological sciences. Together with our proposed guidelines, we believe that this work will benefit XAI practitioners and researchers from various disciplines and will help to approach the overarching goal of human-centered XAI.

RELATED WORK
As a vast number of explanation methods have been proposed, many researchers seek a systematic overview of the ever-growing field of XAI. In [129,130,131,132,133,134], the authors aim to cover many facets of XAI technologies, ranging from problem definitions, goals, and AI/ML model explanations to evaluation measures, while in [135] the authors emphasize the research trends and challenges in Human-Computer Interaction (HCI) applications. A large body of XAI surveys focuses mainly on the interpretability of a particular family of models and corresponding explanation techniques. For instance, [136,137,138] investigate explanations for Deep Neural Networks (DNNs), where models often take images as input [136,137]. Joshi et al. [138], however, provide an extensive review for DNNs with multimodal input, for instance, joint vision-language tasks. Causal interpretable models have been gaining attention recently, and Moraffah et al. [139] provide a literature review for causal explanations. A systematic literature review on explanations for advice-giving systems is conducted in [140]. Among these surveys focusing on general XAI technologies, evaluation measures are only briefly examined.
One challenge in XAI research is to evaluate and compare different explanation methods, due to the multidisciplinary concepts in interpretability/explainability [120,121,141]. Evaluation measures can be divided into two groups: human-grounded measures that rely on human subjects and functionally-grounded metrics that can be computed without human subjects [120,121]. Many researchers seek solutions to evaluate explanations automatically. A comprehensive literature review with a focus on these functionally-grounded evaluation methods (without human subjects) can be found in [121]. Explainability is an inherently human-centric property; therefore, the research community should recognize, and has started to recognize, the need for human-centered evaluations when working on XAI [120,142].
For instance, Chromik and Schuessler [126] propose a taxonomy on XAI evaluations involving humans. Mohseni et al. [127] summarize four groups of human-related evaluation metrics: mental model (e.g., user's understanding of the model), user trust, human-AI task performance, and explanation usefulness and satisfaction (i.e., user experience). Hoffman [125] places more focus on psychometric evaluations by proposing a conceptual model of the XAI process and specifying key components that should be evaluated: explanation goodness and satisfaction, (users') mental models, curiosity, trust, and performance. Beyond assessing evaluation methods, XAI applications are designed to eventually support decision-making and benefit end users. A recent review by Lai et al. [143] considers studies on collaborative human-AI decision-making, which may include AI agents providing explanations. Success in human-AI decision-making tasks can be seen as one among many ways to evaluate the effect of explanations. Ferreira and Monteiro [144] present a review of the user experience of XAI applications to answer who uses XAI, why, and in which context (what + when) the explanation is presented.

Fig. 1: Roadmap of our literature analysis. We find the foundational works of core papers and their application domains using a data-driven method introduced in Section 3. Three main research questions in user studies are distilled from core papers. Methods related to measures of each category are discussed in Section 4, and findings of the research questions are summarized in Section 5. Based on the findings, we propose future directions to further promote human-centered XAI in Section 7. We distill important messages in this figure, but refer to the discussion in the corresponding sections for more details.
Closer to our focus on user studies concerning XAI, Liao et al. [142] study user experiences with XAI to reveal pitfalls of existing XAI methods, underscoring the important role of humans in XAI development. As suggested by Doshi-Velez and Kim [120], a human-subject experiment needs to be carefully designed to reduce confounding factors. In contrast to previous surveys on XAI, we aim to provide XAI researchers and practitioners with a comprehensive overview of the research questions explored in user studies, along with thorough information on experimental design. To this end, we present practical guidelines for user study design, which can be used as a starting point for future exploration of human-centric XAI applications.

METHODOLOGY
To analyze the collected papers related to user studies on XAI, we first categorize them into four groups based on their objectives. From these studies, we distill three main research questions concerning the effects of model explanations on each objective. We then summarize the methods used in these studies to quantify these objectives. Important findings from the papers are discussed, and we propose future directions based on these findings. Additionally, we examine the foundational works upon which these user studies are based (i.e., their references) and the follow-up papers that cite them, shedding light on both the foundations of and emerging trends in human-centered XAI studies. Figure 1 presents a roadmap of our analysis.
In this section, we first describe the criteria used for their categorization. We then discuss the foundational and application domains of these papers, providing a broader view before diving into their detailed analysis.

Categorization of User-Study Objectives
Since the core papers cover various factors of model explanations, we categorize them into different clusters to better study their commonalities and differences. In [120], interpretability in the context of ML systems is defined as the ability to explain or present model predictions in terms understandable to a human. Beyond fostering comprehension, the authors argue that interpretability can assist in qualitatively ascertaining whether other desiderata, such as usability and trust, are met. During a thorough study of the previously selected literature, we identified four sensible categories, derived from the dependent variables considered in user studies (desiderata of interpretability). These four categories are trust, understanding, usability, and human-AI collaboration performance. In Table 1, the studied papers are categorized according to the measured quantities. As each measure can usually be assigned to only one of these categories, we found this distinction to be intuitive.
These categories reflect different functionalities (goals) of XAI. As interpretability is defined as "the ability to explain or to present in understandable terms to a human" [120], humans' "understanding" is the direct goal of XAI. To be concrete, understanding in the context of interacting with an ML model refers to a user's grasp or "mental model" of how the model operates, and this knowledge grows from using the system and from clear explanations about it [142]. "Usability" is commonly studied in human-computer interaction [145] and is one of the desiderata of XAI [120].
According to [146], usability is the extent to which users can utilize a product to successfully, efficiently, and satisfactorily accomplish their intended objectives. Thus, this category encompasses user studies that employ model explanations to support users in achieving specific tasks. Within usability, different aspects are measured, for instance, whether the system is easy to use or how much cognitive load it requires. The aspect "undesired behavior detection" relates to use cases where explanations uncover discriminatory model behaviors, such as the utilization of undesired features. "Trust" in AI is summarized as a combination of the user's confidence in a model's accuracy, a personal comfort level with understanding and using it, and the willingness to let the model make decisions [141]; it thus encompasses more requirements than understanding alone. Human-AI collaboration performance relates to scenarios where the AI system provides its predictions, but humans retain the final decisions [89]. In this case, model explanations are deployed to reach a performance superior to that of the AI system or the human decision-maker alone. These categories cover the dependent variables of interest in the reviewed user studies, primarily related to how XAI methods function. These functions mainly tie to the models' reasoning and knowledge representation. A wider perspective on XAI, which assesses generalization or robustness, remains an important field for future exploration through user studies.

Foundations of User Studies
Based on a data-driven bibliometric analysis of the references in core papers, we highlight significant research topics within the "Foundational Domain" in Figure 1. It is evident that model explanations and interpretability are pivotal components. This includes papers that introduce explanation methods such as LIME [147], SHAP [148], and other attribution methods. These are a frequent subject of study in works measuring understanding and usability. Additionally, convolutional networks, which are commonly employed in experiments, use tools like GradCAM [149] and various saliency maps to generate model explanations. Notably, many research papers appear within the domain of recommender systems, because many XAI user studies are conducted in the context of recommendation solutions. The EU's General Data Protection Regulation (GDPR) [150] is frequently mentioned in core papers due to the ongoing debate on the "right to explanation" [151]. This debate has significantly influenced the shift in modern AI systems towards explainability. While the ultimate consumers of model explanations are humans, well-established research domains that focus on human understanding are underrepresented. For instance, only a few papers related to "Cognition" are cited compared to those on other algorithmic topics. Millecamp et al. [18] suggest enhancing XAI theory with insights from social sciences, including cognitive science and psychology. Given the scant references to psychology, it appears that only a handful of XAI user studies delve into evaluating XAI from a psychological standpoint. We highlight a nascent research domain of XAI frameworks based on theories of human cognition and behavior [142]. This theoretical guidance can also offer conceptual tools for better evaluating XAI from user perspectives. More details about common references can be found in Appendix ??.

Impact of User Studies
Figure 1 presents applications that make use of (and thus are the consumers of) the findings from core papers. We noticed that studies on user understanding and trust span a wide range of applications. For example, trust is frequently addressed in the contexts of medical diagnosis and transportation, indicating its significance in high-risk scenarios. Recommendation systems emerge as a primary focus in follow-up works. Papers on usability have a significant impact on fields like data visualization, software development, and education. In these areas, models frequently serve as tools to ease the burden on end users. Human-AI collaboration measures particularly promote the further development of robotics and natural language processing. The prominence of recommendation systems in both foundational works and their impact implies that XAI is an integral component of contemporary recommendation systems. A comprehensive overview of the foundational works and application domains can be found in Appendix ??.

COMPREHENSIVE USER STUDY ANALYSIS
In this section, we present details of the covered XAI user studies. We first introduce commonly used AI models and explanation techniques (Section 4.1), followed by a discussion of application domains and measures with respect to the four measured quantities. The experimental designs, as well as analysis tools, are presented in Section 4.3.

Models and Explanations
As our selected core papers comprise a large spectrum of AI models, data modalities, and explanation approaches, we first list the models and explanation techniques deployed, along with the corresponding core paper references, in Table 3. It presents the explanation types in columns and model types in rows. The explanation methods are organized according to the taxonomy by Molnar [152]. First, there are intrinsically interpretable models, also known as white-box models, for instance, decision trees and linear models. Second, there are black-box models that provide no parameter access or are too complex to be explained in a human-understandable way [153]. These include ensembling techniques such as Random Forests or neural models. As for explanation techniques, we identified five key types in the scope of the surveyed papers (columns of Table 3). Most frequently used are feature-based (attribution) explanations, for instance, SHAP (Shapley Additive Explanations [148]) and LIME (Local Interpretable Model-Agnostic Explanations [147]). There is a clear differentiation between local, instance-wise explanations and global explanations that apply to the model in its entirety. For instance, the weights of a linear model have a global scope. This differentiation is common among feature-based explanations, with most papers using local explanations. Other popular explanation types are example-based explanations, counterfactual explanations, which aim at providing actionable suggestions for attaining a user-preferred prediction by changing certain input features, and concept-based explanations, which use meaningful high-level concepts such as objects or shapes to explain a prediction.
Besides these four main types of explanations, there are other explanations such as rules [11,88] or game strategies [7,10] when AI plays games. More details about concrete models and explanations can be found in Appendix ??.
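The local/global distinction above can be made concrete with a linear model: its weight vector is a global explanation of the whole model, while the per-instance products of weight and feature value act as a local, feature-based attribution. A minimal sketch in Python (the model, feature names, and values are hypothetical):

```python
# Sketch: global vs. local explanations for a linear model (hypothetical data).
# The weight vector is a global explanation; per-instance contributions
# w_i * x_i act as a local, feature-based attribution for one prediction.

weights = {"income": 0.8, "debt": -0.5, "age": 0.1}  # global explanation
bias = 0.2

def predict(x):
    return bias + sum(weights[f] * x[f] for f in weights)

def local_attribution(x):
    # Contribution of each feature to this particular prediction.
    return {f: weights[f] * x[f] for f in weights}

applicant = {"income": 1.5, "debt": 2.0, "age": 0.3}
score = predict(applicant)                # model output for this instance
attribution = local_attribution(applicant)  # local explanation for this instance
```

Here the same weight dictionary answers global questions ("which features matter overall?"), while the attribution answers local ones ("why this score for this applicant?").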

Measurements
The effectiveness of explanations can be characterized from several angles. We specifically identified the categories of trust, understanding, usability, and human-AI collaboration performance. In this section, we give an overview of the contexts in which each of these variables is studied and the measures used to quantify them.

Trust
User trust is studied in decision-making applications such as image classification [13,17], (review) deception detection [25], or loan approval [27]. Besides decision making, [5,8,16,18,19,23] study user trust in the domain of recommendation systems. Whether explainable ML models can increase user trust in the medical domain is studied in [1,6,9]. Moreover, Colley et al. [3] measure user trust in an autonomous driving application with and without explanations.
Trust measures used in much of the existing research can be divided into two groups: self-reported and observed trust [156]. Self-reported trust is commonly measured by asking users to fill out questionnaires, whereas observed trust is quantified by humans' agreement with the model's decisions. In ?? in the Appendix, trust measures in these two groups are listed. The agreement rate of users with model decisions is commonly used [9,11,12,25] as a measure of observed trust. Parallel to observed trust measurement, van der Waa et al. [157] ascribe users' alignment behaviors to the persuasive power of model explanations, i.e., the capacity to convince users to follow model decisions regardless of their correctness. As an extension, trust calibration is defined based on this measure: a high agreement rate with wrongly made decisions represents overtrust, while a low agreement rate with correct decisions means undertrust [12]. In self-reported measurements, researchers either utilize well-established questionnaires or self-designed ones, with the exception of [4], which conducts a semi-structured interview to explore user opinions. Several works [6,11,13,16,17,18,19,24,27] propose their own questionnaires. Among these, a subgroup [13,16,18,19,24] simply asks users to rate a single statement such as "I trust the system's recommendation/decision", which is termed one-dimensional trust by [8]. When deploying previously proposed questionnaires [2,3,5,7,8,10,21,22,23,158], Trust in Automation [159] is the most commonly used one; it explores the underlying constructs of trust between humans and computerized systems.
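The observed-trust measures above reduce to simple counting over pairs of (model correctness, user agreement). A minimal sketch of the agreement rate and the over-/undertrust split (trial data hypothetical):

```python
# Sketch: observed trust as agreement with model decisions (hypothetical data).
# Each trial records whether the model was correct and whether the user
# followed (agreed with) the model's decision.
trials = [
    {"model_correct": True,  "user_agreed": True},
    {"model_correct": True,  "user_agreed": False},
    {"model_correct": False, "user_agreed": True},
    {"model_correct": False, "user_agreed": False},
    {"model_correct": True,  "user_agreed": True},
]

def agreement_rate(trials):
    # Share of trials in which the user followed the model's decision.
    return sum(t["user_agreed"] for t in trials) / len(trials)

def calibration(trials):
    # Overtrust: agreeing with wrong decisions; undertrust: rejecting correct ones.
    wrong = [t for t in trials if not t["model_correct"]]
    right = [t for t in trials if t["model_correct"]]
    overtrust = sum(t["user_agreed"] for t in wrong) / len(wrong)
    undertrust = sum(not t["user_agreed"] for t in right) / len(right)
    return overtrust, undertrust
```

For the hypothetical trials above, agreement is 3/5, while the calibration split separates the one agreement with a wrong decision (overtrust) from the one rejection of a correct decision (undertrust).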

Understanding
An important goal of explanation techniques is to foster users' understanding of complex ML systems. A key distinction has to be made between users' perceived understanding and their actual comprehension of the underlying model, as the two often do not agree [35,40]. Cheng et al. [22] explicitly differentiate between objective understanding and self-reported understanding, which we term subjective understanding in this work. While subjective understanding is usually measured through questionnaires, measuring objective understanding requires a proxy task in which users' understanding is put to the test. Additionally, user studies can be run to assess how well users understand the explanation itself (and not the underlying model). This can be an important sanity check and is particularly used in the domain of conceptual explanations [62,160], where the intelligibility of concepts needs to be verified. We refer to this third category as understanding of explanations but defer its detailed findings to Appendix ??.
Objective Understanding. Works in the subdomain of objective understanding deploy proxy tasks to verify users' understanding of a model's inner workings. The most commonly considered domain in works on understanding is finance [35,39,40,47,48,49,53], followed by image classification [13,21,52]. One of the most critical design choices when assessing objective understanding is the selection of a suitable proxy task. Doshi-Velez and Kim [120] argue that the task should "maintain the essence of the target application" that is anticipated. One of the most prominent tasks is forward simulation [120,141]. In this task, subjects are given an input and asked to simulate, i.e., predict, the model's output. The extent to which participants can successfully provide the model's output is also referred to as simulatability [141]. However, scholars have designed many more tasks to quantify understanding and applied them across a variety of data modalities (cf. Table ?? in the Appendix for an exhaustive listing).
We briefly describe other common tasks below. A special variant of forward simulation is called relative simulation. In this task, users predict which example out of a predefined choice set will have the highest prediction score (or class probability). A manipulation or counterfactual simulation task [120] asks users to manipulate the input features in such a way that a certain model outcome (counterfactual) is reached. Users' performance on this task can serve as a proxy for their understanding. Lipton [141] pointed out that simulatability can only be a reasonable measure if the model is simple enough to be captured by humans, and that simpler tasks are required otherwise. An example could be a feature importance query, where users have to tell which features are actually used by the model. A directed and more local version of this task is the marginal effects query, where subjects predict how changes in a given input feature will affect the prediction (e.g., "Does increasing feature X lead to a higher prediction of Y being class 1?"). Because explanations should allow the identification of weaknesses in models, the task of failure prediction measures how accurately users can anticipate cases where the model prediction is wrong.
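Forward simulation scores boil down to comparing participants' predicted model outputs against the model's actual outputs. A minimal sketch of such a simulatability score (responses hypothetical):

```python
# Sketch: simulatability as the accuracy of a participant's forward-simulation
# answers against the model's actual outputs (hypothetical data).
model_outputs = ["A", "B", "A", "A", "B"]  # model's predictions per instance
user_guesses  = ["A", "B", "B", "A", "A"]  # one participant's guesses

def simulatability(guesses, outputs):
    # Fraction of instances where the guess matches the model's output.
    hits = sum(g == o for g, o in zip(guesses, outputs))
    return hits / len(outputs)

score = simulatability(user_guesses, model_outputs)
```

Relative simulation and failure prediction are scored analogously, only over different answer sets (a ranking choice, or the subset of instances where the model errs).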
Subjective Understanding. Besides objective understanding, which is supported by performance indicators, understanding of a model may be subjective, i.e., it may depend on a user's own perception. The applications that most commonly measure subjective understanding are various recommendation system setups [16,33,34,38].
Most of the works assess the subjective understanding of a user with a post-task questionnaire. Guo et al. [7] adapted a popular questionnaire designed for recommendation systems by Knijnenburg et al. [161], while Bell et al. [39] adapted the questionnaire by Lim and Dey [162], originally intended to measure the intelligibility of different explanations. Alternatively, agreement with simple subjective statements such as "I understand this decision algorithm" [22], "I understand how the AI..." [13,17], or "The explanation(s) help me to understand..." [33] can be collected to assess subjective understanding.

Usability
Usability is a key concern of every HCI system and thus applies to almost all domains. This is reflected in the surveyed papers, where usability is studied in a wide range of setups and contexts. We also include application-specific performance measures in this category.
Based on the measurements in the user studies, we refined usability into measures of helpfulness, workload (cognitive load), satisfaction, ease of use, and the detection of undesired system behaviors, as shown in Table 1.
Using model explanations to audit models is one purpose of explainability [130]. Some of the surveyed works study how model explanations can assist users in detecting undesired model behaviors. In the studied papers, these issues mainly include (perceived) unfairness in model decision-making [38,74,78,79], biases in models [72] or features [57], and wrong decisions (failures) [24]. A detailed summary of the types of undesired behaviors is listed in Table 6. In undesired behavior detection, the effectiveness of explanations is evaluated by objective performance measures, such as the number of bugs identified [71], the share of participants that identify a certain bias [57, First Experiment], or the deviations between model predictions and human predictions for unusual samples [53]. The perception of users regarding fair treatment by a system has primarily been researched in high-stakes applications such as granting loans [27] or granting bail to criminal offenders [73,74,75]. For example, [73,74,75] investigate the fairness of COMPAS, a commercial criminal risk estimation tool that was used in the US to help make judicial bail decisions. Fairness is also considered in everyday use cases such as news [38] and music [77] recommendations, or career suggestions [76], where a bias in the underlying system can be to the detriment of the user. As the assessment of fairness is a very subjective matter, questions regarding perceived fairness are prevalent, e.g., whether "how the software made the prediction was fair" [74], which can be answered on 5- or 7-point Likert scales [2,27,38,73,74,75]. Among these works, an explanation is considered effective if it shifts fairness perceptions in either direction, since the aim of explanations is to reveal fairness or unfairness. An exhaustive overview of measures for usability is given in ?? of the Appendix.

Human-AI Collaboration Performance
The goal of human-AI teaming is to improve the performance in AI-supported decision-making above the bar set by humans or an AI alone [89]. Improving human performance with the help of AI has been considered in games [10,88], question answering tasks [89,91], deception detection [25,90], and topic modeling [29,30].
The most common assessment is to rate AI-aided human performance by the percentage of correctly predicted instances in the decision-making process [25,89,90]. Paleja et al. [10], however, define performance as the time to complete the task. In [88], performance is measured in a game-based application, chess, using a winning percentage (as commonly used in sports) as well as a percentile rank of player moves.
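The teaming goal above can be phrased as a simple criterion, sometimes called complementary performance: the team's accuracy should exceed both the human-alone and the AI-alone baselines. A minimal sketch (per-instance correctness records hypothetical):

```python
# Sketch: checking for complementary human-AI team performance
# (hypothetical per-instance correctness records, 1 = correct decision).
human_alone = [1, 0, 1, 1, 0, 1, 0, 1]
ai_alone    = [1, 1, 0, 1, 1, 0, 1, 0]
team        = [1, 1, 1, 1, 1, 0, 1, 1]

def accuracy(records):
    return sum(records) / len(records)

def is_complementary(team, human, ai):
    # True if the team beats both the human-alone and AI-alone baselines.
    return accuracy(team) > max(accuracy(human), accuracy(ai))

complementary = is_complementary(team, human_alone, ai_alone)
```

In this hypothetical record, both baselines reach 5/8 accuracy while the team reaches 7/8, so the criterion is met.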

Experimental Design and Analysis
There are three common experimental settings when conducting user evaluations: between-subjects (or between-groups) designs, within-subjects designs, and mixed designs that combine elements of both. An overview of the designs found in the core papers and their participant numbers is presented in Table 4 and Figure 2, respectively.

Between-subjects
With slightly above 55 % of the user studies conducted in a between-subjects manner, i.e., each subject is exposed to only one condition, this design choice is the most common in the XAI literature. The number of participants in between-subjects studies usually starts at around 30, but may go up to 1070 in total for 3 conditions, as in [17], and to 1250 for 5 conditions, as in [53]. However, the number of participants can be limited when the studied application is designed for specific groups of lay persons, who cannot be easily recruited from Internet platforms such as Amazon Mechanical Turk. For instance, Ooge et al. [8] use 12 school students per condition. Some authors place particular emphasis on participants being similar to the average demographic [73,75].
The conditions usually comprise the different explanation techniques in combination with other parameters used as independent variables, such as the model, data set, data modality, or the number of features. Note that a full grid design with many independent variables may quickly result in a very high number of conditions, which in turn requires many participants. The outcome variable of interest is commonly measured on a numerical or ordinal scale right away; in the fairness domain, however, qualitative analyses are sometimes obtained through interviews or written responses [2,27,73].
The statistical analysis follows directly from this design. If one is interested in identifying significant differences between the groups, common statistical hypothesis tests are used. For overall comparisons, one- or two-way ANOVA tests are the most commonly used statistical tools. Post-hoc comparisons between two groups can be made with a standard t-test, if the data is normally distributed with equal variance, or with non-parametric tests such as the Wilcoxon rank-sum test (also known as the Mann-Whitney U-test) for comparing two populations (e.g., [57]) or the Tukey HSD test (e.g., [49]) for multiple populations. When running multiple post-hoc tests, some works apply the Bonferroni correction (e.g., [57]).
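The analysis pipeline described above can be sketched with SciPy. This is a minimal example on hypothetical 7-point trust ratings for three between-subjects conditions, not data from any surveyed study:

```python
import numpy as np
from scipy import stats

# Hypothetical trust ratings for three between-subjects conditions.
rng = np.random.default_rng(0)
control = rng.normal(4.0, 1.0, size=40)         # no explanations
feature = rng.normal(4.6, 1.0, size=40)         # feature-importance explanations
counterfactual = rng.normal(4.3, 1.0, size=40)  # counterfactual explanations

# Overall comparison across all conditions: one-way ANOVA.
f_stat, p_anova = stats.f_oneway(control, feature, counterfactual)

# Post-hoc pairwise comparisons: t-test under normality/equal variance,
# otherwise the non-parametric Mann-Whitney U (Wilcoxon rank-sum) test.
t_stat, p_t = stats.ttest_ind(control, feature)
u_stat, p_u = stats.mannwhitneyu(control, feature)

# With three pairwise tests, a Bonferroni correction divides alpha by 3.
alpha_corrected = 0.05 / 3
print(p_anova, p_t < alpha_corrected, p_u < alpha_corrected)
```

The Tukey HSD test mentioned above is available as `pairwise_tukeyhsd` in `statsmodels.stats.multicomp` and, in recent SciPy versions, as `scipy.stats.tukey_hsd`.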

Within-subjects
Around 30 % of the papers use the within-subjects design, where each participant sequentially passes through all conditions and provides feedback. Fewer participants are recruited in within-subjects experiments than in between-subjects ones. Hence, they are particularly popular when participants with restrictive characteristics, such as domain-specific professional expertise, are required. For example, Suresh et al. [9] and Rong et al. [26] recruit fourteen medical professionals and five radiologists in their user studies, respectively. The small number of medical experts contributing to a user study is a limitation [26]; however, it is often the case in expert user research. Gegenfurtner et al. [166] evaluate 73 sources and point out that the majority of these studies include only five to ten experts.
Besides the medical domain, other works [3,4,19,21] also invite subjects with particular professions, such as engineers in a technology company. When no specific knowledge is required, however, participant numbers reach up to 740 even in within-subjects designs [93].
For within-subjects designs, the Wilcoxon signed-rank test (e.g., used by [35,52]) is the most common method to compare paired samples for significant differences. Repeated-measures ANOVA is a common analysis tool when multiple comparisons are required (see, e.g., [35]).
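A minimal sketch of the paired comparison above, on hypothetical within-subjects ratings (each participant is measured once per condition, so the samples are paired):

```python
import numpy as np
from scipy import stats

# Hypothetical paired ratings: 25 participants each rate the system
# once without and once with explanations.
rng = np.random.default_rng(1)
without_expl = rng.normal(4.0, 1.0, size=25)
with_expl = without_expl + rng.normal(0.5, 0.5, size=25)

# The Wilcoxon signed-rank test compares the paired samples without
# assuming the rating differences are normally distributed.
w_stat, p_value = stats.wilcoxon(without_expl, with_expl)
print(p_value)
```

For more than two repeated conditions, `statsmodels.stats.anova.AnovaRM` implements the repeated-measures ANOVA mentioned above.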

Mixed
The smallest group of studies, about 15 %, use a mixture of between- and within-subjects settings. In these works, subjects are first assigned randomly to one group, in which they are exposed to multiple conditions. Anik and Bunt [2] use background knowledge in machine learning as a between-subjects factor to divide the participants into three groups (expert, intermediate, and beginner), while inside each group participants interact with explanations in the context of four different scenarios (e.g., facial expression recognition or automated speech recognition). Dominguez et al. [16] make the presence of explanations a between-subjects condition and the different types of explanations a within-subjects factor in the group with model explanations. A particular challenge for such a study design is that statistical tools from both the independent-samples and dependent-samples categories need to be combined.
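One common way to handle this combination is a linear mixed-effects model, which treats the between-subjects factor as a fixed effect and accounts for repeated measurements with a random intercept per participant. A sketch on hypothetical data (the factor names and effect sizes are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical mixed design: expertise is a between-subjects factor,
# explanation type a within-subjects factor (two ratings per subject).
rng = np.random.default_rng(2)
rows = []
for subject in range(30):
    group = "expert" if subject % 2 == 0 else "beginner"
    for condition in ("textual", "visual"):
        rating = (4.0 + 0.5 * (group == "expert")
                  + 0.3 * (condition == "visual") + rng.normal(0, 0.5))
        rows.append({"subject": subject, "group": group,
                     "condition": condition, "rating": rating})
df = pd.DataFrame(rows)

# Random intercept per subject accounts for the repeated measurements;
# the fixed effects capture both factors and their interaction.
model = smf.mixedlm("rating ~ group * condition", df, groups=df["subject"])
result = model.fit()
print(result.params)
```

The fixed-effect coefficients then play the role of the between- and within-subjects comparisons that would otherwise require two separate families of tests.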

FINDINGS OF USER STUDIES
In this section, we summarize the primary findings from the core papers. Table 5 lists findings with respect to the four measured quantities. To build an overview of the findings, we divide papers according to their evaluation dimensions, i.e., the independent variables in the user studies. When the presence of explanations is used as the evaluation aspect, the findings are summarized in Table 5. The listed impacts of using explanations are to be seen in comparison with a control group without explanations. Effects are divided into two groups: (1) positive effects, for example, increasing user trust or understanding; (2) non-positive effects, where the effect is negative, not significantly positive (neutral), or a mixture of different effects (e.g., feature-based explanations have positive effects but counterfactual explanations do not). Beyond the explanations themselves, other evaluation dimensions might have an impact on the perception of XAI, for instance, AI technology literacy, model performance, or the dimensionality of the data. Instead of using the mere presence of explanations, many works compare different explanation techniques with each other (see the Appendix for more details).
As the 97 core papers address various research questions and findings, and many papers compare explanation types in order to choose a preferable one, it is not possible to cover all results in one table. Based on them, we outline some notable trends in the effectiveness of explanations on user experience: (1) explanations are effective in improving users' subjective understanding; (2) the effectiveness of explanations in increasing user trust and the usability of models is not clear; (3) explanations are not good at convincing users that models are fair; (4) interactivity of the model has a positive impact on user trust, understanding, and model usability. The first three statements can be validated by counting the papers that obtain positive or non-positive effects in each category, while the last finding is extracted from the Appendix, which details findings on other independent variables. We encourage the reader to consider the short summary of primary findings in the tables and to check further details according to their specific interests. In the following, we highlight some findings for each category of measurement. Trust. Among the papers comparing the effect of using explanations to using no explanations, or placebo (randomly generated) explanations [8,25], about half of the papers validate that explanations have a positive impact on user trust [1,8,10,13,16,25,27,28], while the other half cannot verify this hypothesis [3,11,12,21,22,24]. For instance, Colley et al. [3] investigate explanations in an autonomous driving task and discover that trust improves in simulation but not with real-world footage. Another example of a mixed effect of using explanations is found in [12], where (minimal) evidence is found that feature-based explanations help increase appropriate trust, but counterfactual explanations do not.
Apart from using explanations as independent variables, users' personalities or expertise may also affect their perceptions [2,17,18,22,23,30]. Millecamp et al. [18] capture personal characteristics such as the Locus of Control defined by Rotter ("the extent to which people believe they have power over events in their lives"), Need for Cognition ("a measure of the tendency for an individual to engage in effortful cognitive activities"), or Tech-Savviness ("the confidence in trying out new technology"). However, no significant interaction effect could be found between the personal characteristics and trust. Liao and Sundar [5] study a recommendation system that asks for users' personal data with different explanations. They hypothesized that explanations in a "help-seeker" style using the pronoun "I" would gain more user trust than explanations formulated in a "help-provider" style. However, the opposite result is found: using self-referential expressions results in lower affective trust. Model performance together with model explanation is studied in [17] for an image recognition task. The authors find that when images are correctly recognized (high model performance), users perceive the system as more capable ("capability" being one belief underlying trust). Understanding. The fundamental question in this subdomain is which explanation technique is most beneficial for increasing the user's understanding of a machine learning model. As pointed out earlier, understanding can be measured both in a subjective and an objective manner.
We first discuss results on objective understanding. The goal of increasing objective understanding was explicitly posed by Alqaraawi et al. [54], who report that saliency maps have a positive effect on understanding. Wang and Yin [12] show that counterfactual explanations and feature importance increase users' objective understanding. On the contrary, Sixt et al. [57] find none of their examined explanation techniques (counterfactuals, conceptual explanations) superior to a baseline technique consisting of example images for each class, and the work by Hase and Bansal [40] reveals that many explanations (including anchors and prototypes) have no effect on objective understanding, with LIME on tabular data being the only exception. Apart from the explanation itself, several other factors have been identified as affecting objective understanding. Hase and Bansal [40] suggest that the data modality may have a non-negligible impact on how different explanation techniques increase understanding. Some results highlight that the choice of proxy task is influential. Arora et al. [50] show that their manipulability task reveals differences that remain hidden when forward simulation is used. Relatedly, Buçinca et al. [13] underline that the preferred explanations in a real-world application may differ from those in a simulated one. Regarding the type of model, there is disagreement on whether white- or black-box models lead to increased objective understanding. While black-box models without explanations resulted in higher simulation performance than white-box models with SHAP values in [39], Cheng et al. [22] observe that white-box models increase simulatability and also conclude that interactivity is an important factor when it comes to objective understanding.
Compared with objective understanding, the research question in the subdomain of subjective understanding is how explanations impact users' perceived understanding [7,12,17,22,32,33,34,37,56]. There exists a trend of using model explanations to improve subjective understanding [13,16,17,28,34,38,168]. However, Chromik et al. [35] challenge the improvement in perceived understanding with the cognitive bias named the illusion of explanatory depth (IOED) [169], meaning that laypeople often have an overconfidence bias in their understanding of complex systems. Their results confirm the IOED issue in XAI: questioning users' understanding by asking them to apply their understanding in practice consistently reduces their subjective understanding. Explanations can also have different impacts on subjective and objective understanding [22], where white-box explanations increase objective understanding but have no significant impact on subjective understanding. (Table 5 presents user study findings when using model explanations as evaluation dimensions: effects of explanations, compared to the baseline (control group) of "no explanations", on the measured quantities; effects are divided into "Positive" and "Non-positive / Mixed", with negative impacts marked by underlines.)
Similar disagreements have been observed in multiple other works [40,168]. Radensky et al. [33] examine the joint effects of local and global explanations in a recommendation system, and their results provide evidence that both together are better than either alone.
Usability. Similar to trust, it is not clear whether explanations are effective in improving users' perceptions of helpfulness, satisfaction, or other dimensions of usability. For instance, in [16,30,47] the explanations have a positive effect on satisfaction, while no significant effects on satisfaction are observed in [18,19,29,69]. Parallel to trust, Smith-Renner et al. [29] provide evidence that showing explanations by highlighting the important words in a text classification task can harm user trust and satisfaction. A strong correlation between self-reported trust and satisfaction can also be observed in [3], where explanations have a positive impact in a simulated driving environment but no significant effects on real-world data. Beyond explanations, Nourani et al. [56] study the order in which users observe system weaknesses and strengths, revealing that encountering weaknesses first results in lower usage of system explanations than encountering strengths first. Schoeffer et al. [27] find that showing feature importance scores or counterfactual explanations (or a combination of both) for explaining decisions helps increase perceived fairness, whereas highlighting important features without scores does not. However, several studies do not show a significant difference between scenarios with and without explanations [27,38,78]. Effects of explanations may also depend on the input samples, as shown in [67].
The authors show that both Debiased-CAM and Biased-CAM improve helpfulness for a weakly blurred image, but yield no significant improvement for unblurred or strongly blurred images. When used to assist users in detecting undesired behaviors, model explanations are likely to reveal various types of problems that exist within models or data, as demonstrated by [57,71,72]. However, successful detection is not guaranteed. For example, Poursabzi-Sangdeh et al. [53] show that users with model explanations are less able to identify incorrect predictions. A limitation of current detection methods is that users may assess the features used by models for decision-making differently, for example regarding perceived unfairness or irrelevance [53,71,73]. Due to this limitation, the effectiveness of methods assessed through self-reported data may face challenges in generalizability, as discussed in [73]. Yet, these methods generally offer a one-size-fits-all solution, failing to account for variations in individual assessments.

Human-AI Collaboration Performance.
A strand of works [25,88,90,91,95,96,97] shows that viewing explanations can improve human accuracy in decision-making, especially with feature-based explanations taking text data as input [25,90,91]. When example-based explanations are used in text classification, there is no improvement in human performance [25]. Likewise, utilizing explanations has no significant impact on human performance in [89,92], but simply showing model predictions has a positive effect in [92]. Experts and novices perceive explanations differently; for example, Feng and Boyd-Graber [91] conclude that the performance gains of novices and experts come from different explanation sources. Paleja et al. [10] reveal that explanations can improve novices' performance but decrease experts' performance. Additionally, less complex models with explanations are better at convincing humans of correct decisions [90].

A GUIDELINE FOR XAI USER STUDY DESIGN
Learning from the best practices of previous works, we summarize a handy guideline for XAI user studies, which serves as a checklist for XAI practitioners. This guideline contains suggestions to avoid pitfalls that researchers can easily overlook. We introduce our guidelines in the order of before, during, and after the user study, reflecting study design, execution, and data analysis, respectively.
Before the User Study. When designing a user study, the first step is to decide what to measure. To define the measured quantities, one can consider two alternatives: using a general definition or an application-based quantity that is specific to the application at hand. The former refers to a quantity borrowed from previous well-established research, such as "trust in automation" [2,3,21] or "general trust in technology" [7,23]. To further operationalize "trust" as a quantitative measurement, one needs to examine how existing work has conceptualized "trust" both in the social sciences and in the XAI and technical context [170]. The application-based quantity depends on the application goal; for instance, in a chess game [88], the measurement is the human winning percentage with the help of model explanations (human-AI collaboration).
From Table 5, we can see that previous works have frequently struggled to prove the effectiveness of XAI even against a control group without explanations. When only different explanation techniques are compared, one explanation will always emerge as the winner, but the overall benefit of explanations remains undisclosed (see examples in the Appendix). Therefore, it is important to compare against a baseline without explanations to rigorously show the strength of XAI. When a comparative design is explicitly desired, baselines such as random explanations can be used [28,41,62].
When deploying a proxy task, its difficulty should be gauged and monitored carefully. In the past, the forward simulation task has been criticized as being unrealistically complex for domains such as computer vision [54]. Thus, other proxy tasks such as feature importance queries [57] or manipulability checks [32,50] have been proposed. Another important point is to choose a proxy task that is simplified but features many characteristics of the intended application [120]. Notably, the proxy task should be designed to closely resemble the final anticipated application, as even slight differences between the tasks may void the validity of findings on proxy tasks in the real world [13].
The measurement often depends on the definition of the measured quantity. For instance, in [58], objective understanding is measured as failure prediction (the accuracy of user predictions when the model prediction is wrong). For subjective measurements such as subjective understanding or trust, one-dimensional measures (i.e., simply rating a single question such as "Do you trust the model explanation?") have the drawback that they cannot completely reflect the different constructs of the measured quantities [8]. Moreover, subjective questions and behavioral measurements often appear to be only weakly correlated. For example, users state that they trust the model, but they do not actually follow the model's suggestions [11]. Similar findings have been made with respect to objective and subjective understanding [12,35,40]. To overcome this limitation, both self-reported and observed measures should be used in parallel.
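The agreement between a self-reported and a behavioral measure can be checked directly, for example with a rank correlation. A sketch on hypothetical data (the variable names and rating scales are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# Hypothetical data: stated trust on a 7-point Likert scale vs. the share
# of model suggestions each of 50 participants actually followed.
rng = np.random.default_rng(3)
stated_trust = rng.integers(1, 8, size=50)
followed_share = rng.binomial(10, 0.6, size=50) / 10.0

# Spearman's rank correlation quantifies how well the self-reported
# measure agrees with the behavioral one; a value near zero would
# replicate the dissociation reported in the surveyed studies.
rho, p_value = stats.spearmanr(stated_trust, followed_share)
print(rho, p_value)
```

Reporting both measures, together with their correlation, lets readers judge whether the two capture the same construct.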
Besides the measures introduced in Section 4.2, there are several psychological constructs that can be deployed to evaluate multiple facets of the interaction between humans and XAI. For instance, the subjective task value in the expectancy-value framework is often used to analyze subjective motivation to take action [171], which has not yet been thoroughly studied in the XAI experience. The subjective task value consists of intrinsic value (enjoyment), attainment value (importance for one's self), utility value (usefulness), and cost (the amount of effort or time needed) [171,172]. A good explanation interface should be positively correlated with the subjective task value, consequently boosting one's interest and motivation to use the model explanation. With regard to the cost of using model explanations, cognitive load is commonly measured in the current literature with conventional Likert scales [163,173]. Cognitive load researchers study the validity of different visual appearances of rating scales beyond numerical Likert scales, i.e., pictorial scales such as emoticons (faces with different emotions) or embodied pictures of different weights [174]. Their results demonstrate that numerical scales are more appropriate for complex tasks, while pictorial scales suit simple ones.
Pre-registration using online platforms such as AsPredicted has become a common practice in recent years [175]. In this process, researchers submit a document detailing their planned study online before initiating the data collection. Among other details, the pre-registration includes the measured variables and hypotheses, data exclusion criteria, and the number of samples that will be collected. An exhaustive pre-registration can provide evidence against the findings being a result of selective reporting or p-hacking [176] and thus strengthen the credibility of a study. Expert interviews and pre-studies following a think-aloud protocol [177], e.g., in [32,46], are often mentioned as helpful tools to develop the explanation system and the study design, gain first qualitative insights, or complement the qualitative analysis [13,65].
Fig. 3: Summary cards of the guidelines extracted from past XAI user studies.
When preparing for a user study, it is important to plan explicit steps and to have a backup plan for different situations. Before participants arrive, it is helpful to provide them with information such as where the researchers will meet them, what they need to bring, and how they can prepare for the study. If conducting the experiment in person, send participants a reminder the day before and provide them with your contact information in case they cannot find the experiment site or need to cancel the session. Once participants arrive, make sure the researchers have a plan that covers all stages of the experiment. The protocol should cover small details (e.g., where participants should leave their backpacks, water bottles, and lunch boxes) and plans for unexpected situations (e.g., uncooperative participants and malfunctioning systems). How to obtain participants' consent should be an important part of the procedure. Additional procedures are required when working with vulnerable populations (e.g., children and pregnant women), in which case alternative consent procedures might take place. Another benefit of pre-designing the experiment script is the opportunity to fine-tune the language to avoid inadvertent cues. Researchers can unintentionally pass their expectations on to participants through verbal and nonverbal behavior, which might skew participants' performance toward the researchers' expectations [170]. To ensure a sound experimental procedure and to protect the integrity of the data, it is worthwhile to invest effort in designing a detailed experiment script. During the User Study. A sufficient number of participants is the prerequisite of a solid user study analysis. To get a rough estimate of common sample sizes, we refer the reader to the participant statistics in Figure 2, where we analyze the subject numbers for the different experimental designs. For instance, around 350 users without any specific expertise are recruited on average in between-subjects experiments. However, we would like to underline that the required number of participants is highly specific to the study design and should be determined individually, for instance by conducting a statistical power analysis [178]. Additionally, the recruited participants should have the same knowledge background as the end users for whom the application is designed. For instance, when evaluating an interface that explains loan approval decisions to bank customers, it is not appropriate to include only computer science students, since they may have prior knowledge of how model explanations work. Note that the design of an AI application involves different audiences across the project cycle, so model explanations need to evolve as well [179].
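Such an a priori power analysis can be run in a few lines, e.g., with statsmodels. A minimal sketch for a two-group comparison, assuming a medium effect size as an illustrative target:

```python
from statsmodels.stats.power import TTestIndPower

# Participants needed per group to detect a medium effect
# (Cohen's d = 0.5) at alpha = 0.05 with 80 % power.
n_per_group = TTestIndPower().solve_power(effect_size=0.5,
                                          alpha=0.05, power=0.8)
print(round(n_per_group))  # → 64
```

Smaller expected effects or stricter corrected alpha levels (e.g., after a Bonferroni correction) increase the required sample size accordingly.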
To uphold high quality standards for the collected data, attention or manipulation checks are essential to filter out careless feedback. This particularly applies to long surveys or online surveys with lay users. Kung et al. [180] justify the use of these checks without compromising scale validity. In within-subjects experiments, a random order of conditions is necessary to avoid order effects [1]. Participants can carry over knowledge of data or examples shown in previous conditions; Tsai et al. [6] therefore choose a Latin square design to avoid this learning effect.
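A Latin square like the one mentioned above can be generated by cyclic rotation, so that every condition appears exactly once in each presentation position. A minimal sketch (the condition labels are placeholders):

```python
# Cyclic Latin square for counterbalancing condition order: each
# condition occurs once per row (participant order) and once per
# column (presentation position).
def latin_square(conditions):
    n = len(conditions)
    return [[conditions[(row + pos) % n] for pos in range(n)]
            for row in range(n)]

orders = latin_square(["A", "B", "C", "D"])
for row in orders:
    print(row)
# Participant i is then assigned the order orders[i % len(orders)].
```

Note that a plain cyclic square balances positions but not carry-over between adjacent conditions; balanced (Williams) designs address the latter.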
After the User Study. After the data collection, statistical tests are run to find significant effects. The applicable tests are determined by the experimental design and the form and distribution of the data. Generally, ANOVA tests and t-tests are used when comparing distributions between different conditions. Structural Equation Models (SEM) or multi-level models are used for mediation analysis. More details on statistical tools can be found in Section 4.3. Distributional assumption checks should be applied. When Likert-type data is collected, as in most questionnaires, non-parametric tests such as the paired Wilcoxon signed-rank test, or the Kruskal-Wallis H test for multiple groups, can be used to avoid normality assumptions.
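The Kruskal-Wallis H test for multiple groups of ordinal responses is a one-liner in SciPy. A sketch on hypothetical 5-point Likert data:

```python
import numpy as np
from scipy import stats

# Hypothetical 5-point Likert responses from three independent groups.
rng = np.random.default_rng(4)
group_a = rng.integers(1, 6, size=30)
group_b = rng.integers(2, 6, size=30)
group_c = rng.integers(1, 5, size=30)

# Kruskal-Wallis compares more than two groups on ordinal data
# without assuming normal distributions.
h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print(h_stat, p_value)
```

A significant H statistic would then be followed by pairwise non-parametric post-hoc tests with an appropriate multiple-comparison correction.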
If multiple measures are aggregated into a single instrument, it is important to assess the validity of this aggregation with reliability measures such as the tau-equivalent reliability (also known as Cronbach's α). For example, if objective and subjective measures of a quantity such as understanding are combined, it is necessary to verify that there is sufficient agreement. If multiple items (e.g., data samples or visualizations) are rated by several subjects, statistics such as Cohen's κ, or Fleiss' κ for more than two raters [181], can be used to assess agreement beyond chance between these raters and serve as an indication of the reliability of the ratings.
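Cronbach's α follows directly from its definition and can be computed without any specialized library. A sketch on a hypothetical 5-item instrument whose items share one latent construct:

```python
import numpy as np

def cronbach_alpha(ratings):
    """Tau-equivalent reliability for a participants x items rating matrix."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    item_var = ratings.var(axis=0, ddof=1).sum()     # sum of item variances
    total_var = ratings.sum(axis=1).var(ddof=1)      # variance of sum scores
    return k / (k - 1) * (1.0 - item_var / total_var)

# Hypothetical 5-item scale: items load on one shared construct, so the
# instrument should be internally consistent (alpha > 0.7 is a common bar).
rng = np.random.default_rng(5)
latent = rng.normal(0.0, 1.0, size=(100, 1))
items = latent + rng.normal(0.0, 0.5, size=(100, 5))
print(cronbach_alpha(items))
```

If α falls below the conventional threshold, the offending items should be inspected or the measures reported separately rather than aggregated.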
In the final writing phase, it is essential to report sufficient details to allow readers to estimate the explanatory power of the study. On the level of participants, this should include the total number of participants and how many are assigned to each treatment group, their recruitment, consent, and incentivization, and the exact treatment conditions they are subjected to. Furthermore, descriptive statistics of the collected data can help readers assess its characteristics and the adequacy of the statistical tools used. Regarding the analysis, we found it important to state how the underlying assumptions of the statistical tests were checked and to name the exact variant of the test used (e.g., stating that "a two-way ANOVA with the independent variables X and Y" is used instead of just mentioning an ANOVA test).

FUTURE RESEARCH DIRECTIONS
Our survey of recent and ongoing XAI research also helps us identify research gaps and distill a few directions for future investigations. In this section, we highlight these directions and summarize our findings.

Towards Increasingly User-Centered XAI
We advocate that user-centered methods should be used not only to assess XAI solutions (e.g., through user studies) but also to design them (e.g., through user-centered design). By explicitly modeling and involving users in the design phase, and not just in a post-hoc manner during the evaluation phase, we expect the development of XAI solutions that better respond to user needs. As discussed in [118], there are two aspects of human-centered AI: (1) AI systems that understand humans and their sociocultural background, and (2) AI systems that help humans understand them. The former point can guide the design of AI systems. In this section, we discuss XAI research that leverages this insight.
The process of explaining a machine's decisions to human users can be viewed as a teaching-learning process where the XAI system is the teacher and the human users are the students. From a user-centered perspective, designing effective teaching methods that enhance the student's (i.e., user's) learning outcomes is essential to human-centered XAI algorithms. To leverage human abilities and address users' unique needs, it is important to review studies and findings from psychology and education. These studies provide insights into how humans perceive other intelligent agents (humans or artificial agents) and how they utilize limited information to infer and generalize. Understanding how humans think and learn will help XAI developers build and design systems that are not only informative but also user-friendly to people with different backgrounds. In this section, we discuss three pedagogical frameworks, namely (1) the expectancy-value motivation theory, (2) the theory of mind, and (3) hybrid teaching, to shed light on incorporating such methods into computational approaches. Inspired by existing work in pedagogy and XAI, we provide implications for designing future transparent AI systems and human-centered evaluations.
Expectancy-value Motivation Theory. Human interaction with XAI interfaces can be viewed as an activity in which humans learn about the model's inner workings through explanations and thereby achieve an understanding of the model. The question of how to enhance the efficiency and the outcome of this human learning process is of high importance [182]. This research problem is widely considered in educational psychology through the lens of expectancy-value motivation theory. For instance, Hulleman et al. [172] propose to utilize interventions that increase the perception of usefulness (utility value) and subsequently increase motivation and final performance. Intervention here refers to identifying the relevance of model explanations to the user's own situation, which can be a prompt question while working with the interface. Moreover, when utilizing model explanations in human-AI collaboration, explanations can be seen as a type of "scaffolding" (a prompt during a task), a concept proposed in a conceptual framework in education.

Theory of Mind.
When interacting with XAI systems, humans form mental models of the machine learning algorithms that reflect their beliefs about how the algorithms work. These mental models are formed by observing explanations or examples given to the human, who often subconsciously generalizes observations from a few examples to a broader understanding of the whole machine learning system. This remarkable ability to infer, rationalize, and summarize other intelligent agents' decisions is known as the Theory of Mind (ToM) in psychology. Based on this theory, the Bayesian Theory of Mind (BToM) provides a probabilistic framework to predict the inferences that people make about the mental states underlying other agents' actions. Recent work at the intersection of XAI and robotics indicates that humans also attribute ToM to artificial agents that they observe or interact with. Guided by these user-centered results, several works at the intersection of XAI and robotics have utilized BToM to create a simulated user and then use it to generate helpful explanations. Hybrid Teaching. Teaching strategies for the human-to-human setting have been widely studied, and many categorizations exist. One way of categorizing these strategies is through the following three concepts: (1) direct teaching, (2) indirect teaching, and (3) hybrid teaching. Direct teaching utilizes direct instructions that are teacher-centered, involve clear teaching objectives, and are consistent with classroom organization. In XAI applications, direct teaching methods generate explanations by selecting representative examples of an agent's decisions to convey the patterns in its policy. In contrast, indirect teaching is student-centered and encourages independent learning. From the XAI perspective, methods utilizing indirect teaching provide users with tools to actively and independently explore an AI system. Technically, direct teaching focuses on providing guidance (using a computational approach) to assist users in building an understanding of a machine, whereas indirect teaching (often through a user interface) enables users to address individual learning preferences and mitigate individual confusion about the AI. To leverage the advantages of the two teaching strategies, hybrid teaching has been widely used in human-to-human teaching with an emphasis on interactivity. Recent work [183] indicates that hybrid teaching reduces the amount of time a user needs to understand an agent's policy compared to direct and indirect teaching, and is more subjectively preferred by participants. Building on this, future XAI systems can consider hybrid teaching methods that (i) generate direct instructions to guide the user's understanding of an AI system and (ii) provide means for users to interact with the agent.

Explanations through Large Language Models (LLMs).
The recent rise of Large Language Models [184,185] naturally opens up new research directions. There is growing interest in leveraging their unprecedented capabilities [186] to offer explanations for model decisions [187,188]. Through their natural language interface, LLMs offer the possibility to build interactive explainers [189]. Intriguingly, textual explanations can also be fed back to LLMs as inputs, which may help to solve subsequent problems and result in superior performance [190]. This technique, referred to as chain-of-thought reasoning [191], opens up an interesting research territory combining interpretability and performance considerations.
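To make the feedback loop concrete, the following minimal sketch shows how a chain-of-thought prompt can be composed and how a previously generated textual rationale is fed back as input to the next query. The function names and the `query_llm` stub are hypothetical placeholders for any real LLM API, not an interface from the surveyed works.

```python
from typing import Optional

def build_cot_prompt(question: str, rationale: Optional[str] = None) -> str:
    """Compose a prompt eliciting step-by-step reasoning; a previously
    generated rationale can be fed back in as additional context."""
    prompt = f"Question: {question}\nLet's think step by step."
    if rationale is not None:
        prompt = f"Previous reasoning: {rationale}\n\n" + prompt
    return prompt

def query_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call; returns a canned rationale.
    return "6 * 7 = 42, so the answer is 42."

first = build_cot_prompt("What is 6 * 7?")
followup = build_cot_prompt("Is the result even?", rationale=query_llm(first))
```

Here the rationale produced for the first question becomes explicit context for the follow-up question, which is the mechanism behind the interpretability/performance combination discussed above.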

Automatic vs. human-subject evaluations
With automatic evaluations, we refer to evaluation methods that do not require human subjects, corresponding to the functionally-grounded metrics discussed in [120,121]. These metrics test desiderata around the "faithfulness"/"fidelity"/"truthfulness" of model explanations [121,122,192]. An explanation is faithful if it is indicative of the truly important features in the input [192]. Automatic evaluations aim to capture a general objectivity that is independent of downstream tasks, while human evaluations are contextualized with specific use cases. Generally speaking, automatic and human evaluations tackle different research challenges: the former objectively examines how truthfully explanations reflect the model, while the latter measures how humans perceive models through explanations (although there exist algorithms for automated evaluation designed to align with human evaluations, which we discuss later). All explanations used in human-subject experiments should perform satisfactorily in automatic evaluations, i.e., the explanations should faithfully unbox the model. This verification step is essential to guarantee the validity of the empirical user study and to ensure that users are not tricked by unfaithful explanations. However, in most current human-subject experiments, the functional faithfulness of explanations is not thoroughly verified beforehand. Using unfaithful explanations risks measuring only a placebo effect of explanations. Ideally, a good explanation should be faithful to the model as well as understandable by users.
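One common family of automatic faithfulness checks works by deletion: if an explanation ranks the truly important features highly, removing its top-ranked features should change the model's output the most. The sketch below illustrates this idea with a toy linear "black box"; it is a generic illustration of the deletion principle, not the specific protocol of any work cited above.

```python
import numpy as np

def model(x: np.ndarray) -> float:
    # Toy linear "black box"; the true feature importances are the weights.
    w = np.array([3.0, -2.0, 0.5, 0.0])
    return float(w @ x)

def deletion_score(x, attribution, k=2, baseline=0.0):
    """Remove the k features the explanation ranks highest and measure
    the resulting change in model output (larger change = more faithful)."""
    top = np.argsort(-np.abs(attribution))[:k]
    x_del = x.copy()
    x_del[top] = baseline
    return abs(model(x) - model(x_del))

x = np.array([1.0, 1.0, 1.0, 1.0])
faithful = np.array([3.0, -2.0, 0.5, 0.0])   # matches the true weights
unfaithful = np.array([0.0, 0.1, 0.2, 5.0])  # ranks irrelevant features first

gap = deletion_score(x, faithful) - deletion_score(x, unfaithful)
```

A faithful attribution yields a larger deletion score than an unfaithful one, which is exactly the kind of pre-verification a user study could run before showing explanations to participants.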

Identifying and handling confounders
Existing research underscores the vulnerability of model explanation studies to significant confounding effects. For instance, Papenmeier et al. [156] reveal that user trust can be influenced more by model accuracy than by the faithfulness of the explanation itself. Similarly, Yin et al. [193] demonstrate that both the accuracy perceived by users and the accuracy shown to them contribute to trust formation.
A different problem is that good explanations also reveal weaknesses of the model. When seeing unexpected explanations, however, users may express their negative feelings about the model through negative ratings of the explanations. Therefore, good model explanations should help users calibrate their trust [26,194], i.e., trust the model's decision when it is correct but distrust it otherwise. There is disagreement on how to handle such cases: when evaluating model fairness, several works [2,27,38,73,75] regard an increase in perceived fairness as positive, while Dodge et al. [74] define a decrease as positive. Other factors, such as the temporal occurrence of model errors (Nourani et al. [56]) and the dimensions of the models (Ross et al. [32], Poursabzi et al. [53]), also come into play.
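The trust-calibration idea can be made operational with simple bookkeeping over a study's trials: reliance is "appropriate" when a participant agrees with correct model decisions and disagrees with incorrect ones, while overtrust and undertrust capture the two failure modes. The terminology loosely follows the appropriate-trust/overtrust/undertrust distinction referenced in this survey; the exact scoring below is an illustrative sketch, not a measure taken from a specific cited paper.

```python
def trust_calibration(model_correct, user_agrees):
    """Classify each trial and return the fraction per category."""
    counts = {"appropriate": 0, "overtrust": 0, "undertrust": 0}
    for correct, agrees in zip(model_correct, user_agrees):
        if agrees == correct:
            counts["appropriate"] += 1  # relied when right / rejected when wrong
        elif agrees and not correct:
            counts["overtrust"] += 1    # relied on a wrong decision
        else:
            counts["undertrust"] += 1   # rejected a correct decision
    total = len(model_correct)
    return {k: v / total for k, v in counts.items()}

scores = trust_calibration(
    model_correct=[True, True, False, False],
    user_agrees=[True, False, True, False],
)
```

On this toy log, half the trials show appropriate reliance and the rest split evenly between overtrust and undertrust; reporting such per-category rates makes calibration visible where a plain agreement rate would not.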
In summary, these confounding elements suggest that users might be led to put more trust in oversimplified, deceptive, or simply unfaithful explanations. To mitigate this, we recommend meticulous analysis, control, and reporting of potential confounders, such as explanation faithfulness and model accuracy, across the various test conditions. More advanced measures have been suggested as well. For instance, Schoeffer and Kuehl [79] propose appropriate fairness perceptions, a measure of whether people increase or decrease their fairness perceptions depending on the algorithmic fairness of the underlying model. Nevertheless, the thorough investigation of confounding factors remains a challenge. Calibrated measures that are less prone to confounding would be a valuable step forward.

Mitigating personal biases for XAI
Most XAI techniques and the corresponding user studies provide one-size-fits-all solutions. Individual bias, rooted in a user's mental framework, influences the user's perception of a model and should be considered in XAI design, development, and evaluation procedures. Several studies that aim to explain reinforcement learning policies utilize cognitive science theories to create a model of the human user [182,183,195,196]. They then generate explanations based on this human model and verify the benefits of tailoring explanations to individual user models. Within the scope of XAI, [197,198] utilize a Bayesian Teaching framework to capture human perception of model explanations. In user studies, participants are likely to give different feedback depending on their cultural and educational background [31]. This kind of personal bias can be mitigated by deploying a large sample size and recruiting participants who are representative of the target audience. We advocate that personal biases be taken into account in XAI development.

Human-in-the-Loop and sequential explanations
In several relevant cases, such as online recommendation systems, users are not confronted with an explanation only once but instead view decisions and potential explanations repeatedly. Recent work in this domain [35] has shown that the order of decisions and explanations may indeed have an effect on user perception and understanding. The AI model may continue to shape the user's mental model over time. The differences between the single-use and the sequential setting still remain to be thoroughly investigated.

Proxy tasks should be close to real-world tasks
When using proxy tasks to evaluate models, for instance to measure subjective understanding, there is a great choice of tasks in the literature. A good proxy task should have the following features: (1) it has close real-world connections [120]; (2) users or participants have some background knowledge of the task, but not so much that it affects their judgment or performance during the task; (3) the task is not too complicated to implement, or an implementation already exists but was used for different purposes (i.e., not for XAI); and (4) it has connections to existing work. Yet, the link between evaluations through different proxy tasks and real-world applications has not been made very explicit to date. Buçinca et al. [13] show that the outcomes of proxy evaluations can differ from those of a real-world task. More specifically, the widely accepted proxy tasks, where users are asked to build mental models of the AI, may not predict performance in actual decision-making tasks, where users make use of the explanations to assist in making decisions. Their results show that users trust the system differently in the proxy task and in the actual decision-making task. Therefore, we argue that further research is required to uncover the links between current proxy tasks and on-task performance, or to devise new proxy tasks with a verified connection to actual tasks.

Simulated evaluation as a cost-efficient solution
As human-subject experiments are costly to conduct, Chen et al. [199] propose a simulated evaluation framework (SimEvals) to select candidate explanations for user studies by measuring the predictive information provided by explanations. Concretely, the authors consider three use cases where model explanations are deployed: forward simulation, counterfactual reasoning, and data debugging. Human performance is measured for these three tasks with different explanations. If there is a significant performance gap between two types of explanations, the simulated evaluation observes a similar gap under the same task settings. Meanwhile, first attempts to simulate human textual responses in a given context using large language models show that models can provide surprisingly anthropomorphic answers [200]. Undoubtedly, and also affirmed by Chen et al. [199], it is not yet realistic to replace human evaluation with the simulated framework, as other factors, e.g., cognitive biases, can affect human decisions. To better simulate human evaluations, more effort should be directed towards modeling human cognitive processes. Concurrently, and with appropriate caveats, XAI researchers should also leverage existing, approximate models of human cognition to enable rapid prototyping and assessment of explanations. Section 7.1 discusses several candidate human cognition models and highlights recent XAI works [182,183] that utilize this "Oz-of-Wizard" paradigm.
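The core intuition behind such simulated evaluation can be sketched in a few lines: a simple learned "agent" performs a proxy task once with an informative explanation feature and once with an uninformative one; a large accuracy gap suggests the explanation carries task-relevant information worth testing with real users. Everything below is a toy illustration of that intuition, not the actual SimEvals implementation of Chen et al. [199].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, n)
informative = labels + rng.normal(0, 0.3, n)  # explanation correlated with label
noise = rng.normal(0, 1.0, n)                 # uninformative explanation

def agent_accuracy(feature, labels):
    """Simple threshold agent: predict 1 when the feature exceeds its mean."""
    preds = (feature > feature.mean()).astype(int)
    acc = (preds == labels).mean()
    return max(acc, 1 - acc)  # allow the agent to flip its decision rule

gap = agent_accuracy(informative, labels) - agent_accuracy(noise, labels)
```

The simulated agent exposes the difference between the two explanation conditions at negligible cost, which is the screening role SimEvals plays before an expensive human-subject study.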

CONCLUSION
In recent years, there has been a proliferation of XAI research in both academia and industry. Explainability is a human-centric property [142], and therefore XAI should preferably be studied by taking humans' feedback into account. In this work, we investigated recent user studies for XAI techniques through a principled literature review. Based on our review, we found that the effectiveness of XAI in users' interaction with ML models was not consistent across different applications, suggesting a strong need for more transparent and comparable human-based evaluations in XAI. Furthermore, relevant disciplines, such as cognitive psychology and the social sciences in general, should become an integral part of XAI research.
We comprehensively analyzed the design patterns and findings of previous works. Based on best-practice approaches and measured quantities, we propose general guidelines for human-centered user studies as well as several future research directions for XAI researchers and practitioners. Thereby, this work represents a starting point for more transparent and human-centered XAI research.

Black-box models dominate the current human-AI interaction research area, as considerably more black-box than white-box models are studied. Local feature explanations such as LIME [147] and SHAP [148] are popular. Figure 5 gives a chronological overview of the XAI techniques for black-box models frequently adopted in the user studies of the surveyed papers. However, there are many specific explanation types for certain applications. For recommendation systems, content-based and hybrid explanations are widely used. A content-based explanation is a single-style explanation coming from a content-based recommendation system, while a hybrid explanation contains multiple explanation styles, such as user-based or item-based, and is provided by a hybrid recommendation system [69,223,224]. For instance, Dominguez et al. [16] provide a content-based explanation such as "Painting A is 85% similar to Painting B that you like". Tsai et al. [70], in contrast, use hybrid explanations in textual and visual formats.

APPENDIX C MEASUREMENT DETAILS C.1 Trust
Table 9 lists the trust measures. Most works deploy questionnaires to measure user trust (self-reported), commonly on a 7-point or 5-point Likert scale. Many works design their own questionnaires [6,11,13,16,17,18,19,24,27]. To measure trust objectively, many works use the agreement rate between humans and the model [9,11,12,25].

C.2 Usability
Table 10 lists the measures used for the usability of explanations. We divide usability into five sub-categories: workload (cognitive load), helpfulness, satisfaction, undesired behavior detection, and ease of use and others. User perceptions of workload, helpfulness, satisfaction, and ease of use are subjective and often measured with questionnaires. For debugging tasks, however, usability can be measured objectively, e.g., via the accuracy with which users confirm the correctness of answers from a question-answering model and the time needed to solve this task [24].

Fig. 5: Chronology of commonly used XAI methods from the reviewed papers ([149], IG [213], SmoothGrad [226], INN [227], TCAV [160], ConceptSHAP [61], ProtoNet [228], MAME [229], Dr.XAI [155], SECA [230], VIBI [42], CLUE [51]).

C.3 Understanding of Explanations
For novel or cognitively challenging types of explanations, it makes sense to verify whether users can make use of the information provided through the explanation. Usually, these types of tests are conducted in combination with other measures to establish whether the explanations are correctly understood by users and can thus be processed as intended.
In the domain of conceptual explanations [160,241], such understanding questions are common, e.g., to assess the semantic coherence of automatically discovered concepts [61,62,63,64]. Assignment tasks, where novel instances should be assigned to existing clusters, are commonly used as a proxy to measure intelligibility [49,61,62,64]. Another option is to assess how well a cluster can be described in natural language, often referred to as describability [62,63,64]. Apart from conceptual explanations, Zhang et al. [46] ask multiple-choice questions to verify whether users understand the differences between the acoustical cues presented and to evaluate which cue differences were most noticeable. Wang et al. [65] explicitly ask users whether they found the explanation easy to understand.

[Table 9 rows displaced here: Weight of Advice (WOA): degree to which the algorithmic suggestion influences the participant's estimate [9,11,12,15,25,26]; agreement rate: percentage of cases in which participants agree with the model, where [12] defines appropriate trust, overtrust, and undertrust; [28] uses the Unified Theory of Acceptance and Use of Technology model (UTAUT) [234] on a 5-point Likert scale.]
Research questions and findings. Laina et al. [64] found that feature vectors obtained by contrastive learning approaches such as MoCo [242] or SeLa [243] allow for clusters that are almost as interpretable as human labels. Leemann et al. [63] show that the similarity of ResNet-50 embeddings allows predicting how semantically coherent users find a cluster of images. For the acoustical cues, Zhang et al. [46] found that shrillness and speaking rate were most often recognized. Wang et al. [12] found that users reported understanding all types of explanations well, without significant differences.
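The assignment-task intelligibility proxy described above can be scored with a simple agreement function: participants assign held-out instances to the discovered concepts or clusters, and agreement with the intended assignment serves as the intelligibility score. The data below are toy placeholders, not a protocol from any specific surveyed study.

```python
def intelligibility(assignments: list, ground_truth: list) -> float:
    """Fraction of held-out instances assigned to the intended cluster."""
    correct = sum(a == g for a, g in zip(assignments, ground_truth))
    return correct / len(ground_truth)

# 10 held-out instances, 3 discovered concepts; one participant's answers:
participant = [0, 0, 1, 2, 2, 1, 0, 2, 1, 1]
truth       = [0, 0, 1, 2, 2, 1, 0, 1, 1, 1]
score = intelligibility(participant, truth)
```

Averaging this score across participants and comparing it against a chance baseline (here 1/3 for three clusters) gives a minimal objective measure of how coherent the discovered concepts are to users.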

APPENDIX D FINDINGS
When using explanation types as the evaluation dimension, many works compare their effects without including a control group (baseline) without explanations. Anik et al. [2] argue that many works have already proven the usefulness of explanations, so that such a control group is not needed. Table 12 summarizes the findings of the comparisons among different explanations. Table 11 lists the results of using evaluation dimensions other than explanations.

APPENDIX E TOWARDS INCREASINGLY USER-CENTERED XAI
In this section, we provide a detailed literature review of existing work on pedagogical frameworks, which has implications for designing future transparent AI systems and the human-centered evaluations in Sec. 7.1.

E.1 Expectancy-value Motivation Theory
Human interaction with XAI interfaces can be viewed as an activity in which humans learn about a model's inner workings through explanations and thereby achieve an understanding of the model. The question of how to enhance the efficiency and the outcome of this human learning process is of high importance [182]. This research question is widely considered in educational psychology through the lens of expectancy-value motivation theory [172,244,245]. For instance, Hulleman et al. [172] propose interventions that increase the perception of usefulness (utility value) in order to subsequently increase motivation and final performance. Intervention here refers to identifying the relevance of model explanations to the user's own situation, which can be a prompt question while working with the interface. Moreover, when utilizing model explanations in human-AI collaboration, explanations can be seen as a type of "scaffolding" (a prompt during a task), as proposed in a conceptual framework in education [246,247]. Bisra et al. [248] summarize guidelines for effective scaffolding. For instance, different disciplinary descriptions can be used in the scaffolding (explanation prompt) to enhance the user's intuition. Another important, yet often overlooked, point is the role of personality traits in the perception of explanations. For instance, Conati et al. [249] show that the need-for-cognition characteristic, which indicates users' openness towards cognitively challenging tasks, is a determining factor for explanation effectiveness in an intelligent tutoring system. Considering these findings, we see personalized XAI as a relatively underexplored but sorely needed research direction.

E.2 Theory of Mind
When interacting with XAI systems, humans form mental models of the machine learning algorithms that reflect their beliefs about how the algorithms work. These mental models are formed by observing the explanations or examples given to the human, who often subconsciously generalizes observations from a few examples to an understanding of the whole machine learning system. This remarkable ability to infer, rationalize, and summarize other intelligent agents' decisions is known in psychology as the Theory of Mind (ToM) [250,251]. Based on this theory, the Bayesian Theory of Mind (BToM) [252] provides a probabilistic framework to predict the inferences that people make about the mental states underlying other agents' actions [253]. Recent work at the intersection of XAI and robotics indicates that humans also attribute ToM to artificial agents that they observe or interact with [254,255]. Guided by these user-centered results, several works at this intersection have utilized BToM to create a simulated user, which is then used to generate helpful explanations. Towards this goal, Huang et al. [196] provide a greedy algorithm for selecting explanations that maximize the simulated user's knowledge of the agent's policy (a self-driving car in their domain), and Lee et al. [256] provide a related approach in which the user is modeled as an inverse reinforcement learner. In addition to selecting the most informative explanations, Qian and Unhelkar [183] utilize a variation of Monte Carlo tree search as a computationally tractable approach to identify the most informative sequence of explanations, based on the assumption that some explanations might be more effective initially. Thus, while some existing works evaluate the effectiveness of the selected explanations through experiments with human users, the community still lacks an understanding of how robust or realistic BToM is compared to a human's cognitive process, particularly for XAI. We also advocate for more probabilistic and computational cognitive models to be utilized in XAI designs. To achieve this, we need experts from across disciplines to address individual users' needs in an XAI system from cognitive, psychological, and computational perspectives. Lastly, we also encourage XAI researchers to develop solutions to explain AI-enabled systems, for instance robots and autonomous vehicles, which require grounded and user-centered solutions.

TABLE 10: Measures of usability. The measurement is divided into five categories. The studied core papers using the same measurement are grouped together. The name and the paper reference of the used metric are listed in the columns "Metric" and "Definition Source", respectively. "-" in the column "Definition Source" means that the source is the studied paper. More details about the metrics are given in the last column.
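The BToM idea underlying these simulated-user approaches can be sketched as Bayesian belief updating: the simulated user maintains a posterior over candidate goals the agent might be pursuing and updates it from observed actions via Bayes' rule. The goals and likelihood values below are purely illustrative, not taken from any of the cited systems.

```python
def btom_update(prior: dict, likelihoods: dict) -> dict:
    """posterior(goal) is proportional to prior(goal) * P(observed action | goal)."""
    unnorm = {g: prior[g] * likelihoods[g] for g in prior}
    z = sum(unnorm.values())
    return {g: p / z for g, p in unnorm.items()}

belief = {"goal_A": 0.5, "goal_B": 0.5}  # simulated user's uniform prior
# The observed action is much more probable under goal_A:
belief = btom_update(belief, {"goal_A": 0.8, "goal_B": 0.2})
belief = btom_update(belief, {"goal_A": 0.8, "goal_B": 0.2})
# After two consistent observations, the belief concentrates on goal_A.
```

An explanation-selection method in this spirit would pick the explanation (observation) that moves such a posterior fastest toward the agent's true goal, which is the role the greedy and tree-search strategies above play.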

E.3 Hybrid Teaching
Teaching strategies for the human-to-human setting have been widely studied and many categorizations exist [257,258,259]. One way of categorizing these strategies is through the following three concepts: (1) direct teaching, (2) indirect teaching, and (3) hybrid teaching. Direct teaching utilizes direct instructions that are teacher-centered, involve clear teaching objectives, and are consistent with classroom organizations. In XAI applications, direct teaching methods generate explanations by selecting representative examples of an agent's decisions to convey the patterns in its policy [256,260,261,262,263,264]. In contrast, indirect teaching is student-centered and encourages independent learning. From the XAI perspective, methods utilizing indirect teaching provide users with tools to actively and independently explore an AI system. Although the goal of direct and indirect teaching methods is the same, namely explaining an AI system to human users, the computational problems solved by these methods differ. Direct teaching focuses on providing guidance (using a computational approach) to assist users in building an understanding of a machine, whereas indirect teaching (often through a user interface) enables users to address individual learning preferences and mitigate individual confusion about the AI. To leverage the advantages of the two teaching strategies, hybrid teaching has been widely used in human-to-human teaching with an emphasis on interactivity [265,266,267]. In XAI-related work, Qian and Unhelkar [183] provide a hybrid teaching framework by introducing an AI Teacher to enable guided interactivity between RL-based AI agents and a user. Their results indicate that hybrid teaching reduces the time a user needs to understand an agent's policy compared to direct and indirect teaching, and is subjectively preferred by the participants. Building on this, future XAI systems can consider hybrid teaching methods that (i) generate direct instructions to guide users' understanding of an AI system and (ii) provide means for users to interact with the agent or model, enabling active learning.

TABLE 11: User study findings when using other aspects (other than the presence of explanation) as evaluation dimensions. Effects on measured quantities are divided into "Positive", where explanation information is given, and "Non-positive / Mixed", where negative impact is marked with underlines.

Fig. 2 :
Fig. 2: Distribution of participant numbers in the surveyed user studies by design and participant type (each bar represents one study). Per-design means are indicated in bold.

TABLE 2 :
Keywords for our paper search query. Two groups of keywords were used.

TABLE 3 :
Models and explanations in core papers (columns include White-box, Black-box, Other, and Feature-based local).

TABLE 7 :
Fundamental works of the core papers (categorized according to topics).

TABLE 8 :
Works measuring objective understanding grouped by proxy task/data modality