A Review on Human–Machine Trust Evaluation: Human-Centric and Machine-Centric Perspectives

As complex autonomous systems become increasingly ubiquitous, their deployment and integration into our daily lives will become a significant endeavor. Human–machine trust relationship is now acknowledged as one of the primary aspects that characterize a successful integration. In the context of human–machine interaction (HMI), proper use of machines and autonomous systems depends both on the human and machine counterparts. On one hand, it depends on how well the human relies on the machine regarding the situation or task at hand based on willingness and experience. On the other hand, it depends on how well the machine carries out the task and how well it conveys important information on how the job is done. Furthermore, proper calibration of trust for effective HMI requires the factors affecting trust to be properly accounted for and their relative importance to be rightly quantified. In this article, the functional understanding of human–machine trust is viewed from two perspectives—human-centric and machine- centric. The human aspect of the discussion outlines factors, scales, and approaches, which are available to measure and calibrate human trust. The discussion on the machine aspect spans trustworthy artificial intelligence, built-in machine assurances, and ethical frameworks of trustworthy machines.


I. INTRODUCTION
A S AUTONOMOUS systems become increasingly complex, the interaction between these systems and human users/operators relies heavily on how much and how well the users/operators trust them. Overtrust and lack of trust on the users' behalf lead to disuse and misuse of autonomous systems, respectively [1]. The following incidents demonstrate examples of the above-mentioned situations. In 2018, an Uber car operating in self-driving mode fatally struck a pedestrian on a bicycle in Arizona, USA [2]. The investigation by National Transportation Safety Board (NTSB) found that the vehicle's automatic system failed to identify the pedestrian and her bicycle as an imminent collision danger, as they were supposed to. NTSB has also found that during the accident, the human driver in the car had been streaming a television episode leaving the self-driving mode unattended. This suggests an overreliance on the system leading to misuse of autonomy. Another example is an incident with Southwest Airlines Flight 1455 that traveled from Las Vegas to Burbank, California, in March 2000 [3]. During the landing approach, cockpit warning signals alerted the captain and first officer that the flight speed and angle of descent were well outside the glide path. These warnings were ignored. As a result, the plane overran the runway, and crashed through a fence and wall. This incident fortunately had no fatalities, but illustrates how lack of trust may lead to disuse of autonomous systems. These examples also imply the need for appropriate and calibrated reliance on such systems. The seminal works [4] and [5] laid the motivational foundation by not only identifying trust as a major factor for the interaction between humans and autonomous systems involving uncertain situations, but also the necessity to calibrate trust correctly.
Trust has also been recognized to be one of the important factors for effective interaction and use of autonomous systems. According to [5] and [6], the critical outcome of trust is described to be reliance. While using these increasingly complex systems, we may not be concerned about trust itself, but the ultimate behaviors that trust is likely to produce-reliance or absence of it, i.e., nonreliance. Thus, the importance of trust in facilitating HMI and integration of autonomous systems into everyday use is noteworthy and necessary.
Trust in automated and autonomous systems has been explored in earlier research along with the notions of use, misuse, and disuse of automation [1]; the theoretical foundations of modeling human trust that can be used for empirical studies of human intervention in automated systems [4]; and trust consideration in automation design for appropriate reliance in [5]. Other topics ranging from trust dynamics in autonomous vehicles to trust etiquette and trust-distrust definitions were discussed in [7]- [12]. Furthermore, several survey papers have discussed human-machine trust with regard to multidisciplinary definitions of trust, trust frameworks, and factors affecting trust [13].
Other survey papers include studies of algorithmic assurances in human-autonomy trust relationships [14] and computational trust and reputation models [15] that focused on models applied to nonengineering fields. Notably, the review presented This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ discussed how to conceptualize trust variability among various other research outputs on trust in automation.The model revealed three layers of human-automation trust variability. This layered and structured approach comprises dispositional, situational, and learned trust. Another review [17] summarized the research concerning autonomous systems by considering four categories of such systems, namely decision support systems, robots, self-driving cars, and autopilot systems. This article provided a quantitative definition of trust that appeared in the authors' prior work [18]. However, the quantitative definition provided has not been utilized as a criterion for evaluating any of the subsequent discussions presented there. Most recently, Kok and Soh [19] addressed how trust in robots can be gained, maintained, and calibrated. This article also outlined the challenges in terms of measurement of trust, and trust models in real-world scenarios.
Despite the fact that previous studies covered important themes in relation to trust in automation, there still remains much ongoing discussion regarding trust measurement and cal ibration. This includes standardized measurement techniques, standardized scales, and exploring trust calibration efforts that consider both the human and machine aspects of trust. In this article, a review of research outputs pertaining to human trust measurement approaches and trustworthiness of machines is provided. This article intends to provide a broader review of the topic focusing particularly on the following: 1) Human and Machine Aspects of Trust: This brings attention to a dual perspective of trust in human-machine teams. In the HMI context, an improved and effective use of machines/autonomous systems depends on both the human and machine counterparts. Proper calibration of the interaction requires a functional understanding of both aspects, and hence the relevance of this contribution.

2) Human Trust Models, Measurements, and Calibration:
This includes a review of models used for studying trust evaluation with respect to autonomous systems, summary of various trust measurement, and experimental and empirical ways of trust calibration. 3) Machine Trustworthiness: It aims to cultivate machine trustworthiness and acceptance in society and a review of ethical frameworks, which lays the foundations of design, development, and deployment of trustworthy autonomous systems (TASs) to cultivate machine trustworthiness and acceptance in society. Also, a discussion on properties of trustworthy autonomy, verification of trustworthiness, and methods of trust calibration and assurance are also presented. The remaining sections of this article are organized as follows. Section II starts with a definition of trust that subsumes the context of its use in the reviewed materials. Section III primarily focuses on the discussion of the human trust measurements and calibration along with validity and reliability testing requirements. In Section IV, a review of machine aspects of trust and trust influence based on the trustworthiness of autonomous systems are discussed. In Section V, the challenges of measuring and calibrating trust in light of the reviewed material are also discussed. Finally, Section VI presents the conclusions and the road map ahead for more studies.

II. BACKGROUND AND DEFINITIONS
With the increased complexity of autonomous systems, understanding all the underlying operations of those systems becomes difficult. Yet, users will continue to delegate tasks, trusting their machine partners to various degrees in different settings. Therefore, it is important to discuss a definition of trust and autonomy that relates to the majority of the works reviewed in this article. We start by providing the definition of autonomy and follow up with a definition of trust.
Although it is quite difficult to provide a governing definition of autonomy without the situational context of the application, Fisher et al. [20] defined autonomous systems as those systems that decide for themselves what to do and when to do it. In explaining this idea, Bradshaw et al. [21] emphasized that autonomy entails at least two dimensions: 1) self-directedness, which describes independence of an agent from its physical environment and social context; and 2) self-sufficiency, which describes self-generation of goals by the agent.
Trust has been studied in a broad range of fields of interest, including psychology, sociology, cognitive sciences, economics, computer science, human factors, and engineering. As a result, a unifying definition for trust is lacking. In particular, within the framework of HMI, the definitions span from the humaninspired notion of trust [22] to a computational model of trust accounting for cyber and network security [23], [24]. However, there exists a commonality across usage, which includes three components. These common components are: a trustor, a trustee, and a risk or uncertainty [16]. The definitions describe the relationship between the trustor (a user/operator) and the trustee (a machine) depending on the nature of the task, consequences, and conditions.
Trust is described in [5] as "the attitude that an agent will help achieve an individual's goals in a situation characterized by uncertainty and vulnerability." This definition of trust is by far the most widely used in subsequent studies of trust in automation [25]. In [19], this definition is elaborated to include the following: 1) consideration of trust arising in situations that entail risk; 2) the multidimensional and latent nature of trust; 3) trust will also account for the relationship between past experiences with the trustee and subsequent acts of reliance. Combining these, [19] provided a definition of trust as "a multidimensional latent variable that mediates the relationship between events in the past and the trustor's subsequent choice of relying on the trustee in an uncertain environment." A literature review of research covering trust in autonomous systems, mostly on materials published in the last decade, was conducted to build on the previous studies carried out in [16] and [17] using combinations of key terms, such as trust, trust in autonomous systems, trust measurement, trust scales, trustworthiness, and trust calibration.
In this article, the functional understanding of humanmachine trust is viewed from two perspectives-the human aspect and the machine aspect. The human aspect of the discussion outlines trust models that have been developed and adopted through the years to characterize trust and facilitate trust measurement toward trust calibration. The discussions on the machine aspect span trustworthy artificial intelligence (AI), built-in machine assurances, and ethical frameworks of trustworthy machines.

III. HUMAN TRUST MEASUREMENTS
The U.S. Department of Defense's unmanned systems integrated roadmap [26] and the Institute of Defense Analysis's roadmap to trustworthy autonomy [6] emphasized the role of trust as an important determinant of reliance on autonomous systems. Establishing the appropriate levels of trust and trust calibration capabilities are part of the challenge areas identified in those roadmaps. Recalling whether someone trusts a system depends on the nature of the task, the consequences, and conditions [5]. Measurements of trust at those levels of task and conditions will be required to assess the situational calibration of operator trust. To facilitate our discussion of human trust measurement and calibration attempts, we subsequently outline trust models, validity considerations of trust measurement scales, and different types of measurement approaches.

A. Trust Models
Researchers modeled human trust based on social interpersonal interactions [27], [28] to represent the psychological state or attitude toward delegation of tasks under implied risk. Various models for trust evaluation and explanation providing reproducible and methodical treatment of trust in HMI have been put forth by several research outputs. Table I summarizes some of these models that have been widely adopted. The early models of trust include those developed in [4] and [5]. These models adopted and modified the notion of interpersonal trust for use in automated systems. In [4], a qualitative model used to explain human-automation interaction to make behavioral predictions about a human operator toward automation and calibration of trust was provided. Furthermore, in [5], Lee and See devised a model that captures 1) the dynamic interaction among contextual awareness of individuals and the environment and 2) the design aspects of the interface between the automation and the operator.
Other widely used models of trust include those introduced in [9], [16], [30], and [31]. These models describe an increasingly detailed representation of trust and autonomy, factors affecting the relationship and trust formation under these interactions, and accounts of the multidimensionality of trust as a latent behavior.

B. Trust Measurement-Scales and Approaches
Due to its latent and multifaceted nature, trust cannot be directly measured. As a result, measuring trust depends on capturing other factors or underlying constructs. As different types of trust are investigated: trust propensity [32], dispositional trust [16], history-based trust [33], and affective and cognitive trust [5], what is measured continues to vary widely within the existing literature. Trust measurements are mainly done via self-reports by the human subjects under study [34]- [38]. Another approach that has grown in recent years is the use of psycho-physiological sensors to measure neural and physical correlates of trust [10], [39]- [43].
In [44], Wei et al. pointed out that, despite the availability of a large number of research in trust measurement, it is not clear what psychometric level of measurement is most appropriate for trust in automation. They discussed the various levels of psychometric scales (nominal, ordinal, interval, and ratio) and investigated what could be a step toward having a standardized level of measurement for trust. They did so by carrying out a trust measurement using an experimental setup and measurement approach, which allows a permissible transformation between two levels of measurement. They found out that the interval level psychometric scale, adopted by the majority of trust measurement studies, might be a probable candidate for determining the most appropriate level of trust measurement. Also, Brzowski and Nathan-Roberts [45] iterated the need to validate trust scales. They indicated that most measurements are created with the exact application in mind and are validated in ways that are specifically relevant to the application (e.g., simulation of a decision aid) used for measurement purposes. Validity refers to the degree to which a construct is correctly measured in a quantitative study [46]. Another issue related with trust measurements is the reliability of those measurements. Reliability refers to the consistency of measurement [47]. It is the way in which a trust scale consistently produces the same results when used repeatedly in the same scenario. According to [19], validity, reliability, and related measurement variances need to be given the necessary attention to have reproducible trust measurement practices. Next, we discuss various subjective selfreport measures and psychophysiological trust measurements.
Self-Report Measurements: These approaches measure trust by collecting responses in the form of surveys and questionnaires from participants. There is currently no standard for assessing human-machine trust [44], [48]. At present, it is measured by researchers using custom scales or validated measurements. This makes cross-study comparisons difficult to determine whether automation facilitates appropriate trust levels. Researchers have used a wide variety of approaches in measuring trust with selfreported accounts from human subjects under study [34]- [38]. In the following, the widely used self-report based scales are discussed.
In [37], Jian et al. developed one of the most frequently used scales, i.e., the Scale of Trust in Automated Systems, where they devised a 12-item scale to measure trust. They developed an experiment using a word elicitation study, a questionnaire study, and a paired comparison study. The 12 factors characterizing trust between people and machines were identified based on a cluster analysis carried out on the experimental data. In their conclusion, they particularly noted that the trust scale developed should be validated in experiments designed for a specific study of trust in automation.
In [49] and [50], a validated scale for assessing changes in a person's trust in a robot was developed. The 40-item trust instrument, developed over the course of six experiments, was designed to measure trust perceptions in the context of human-robot interaction. For this item pool, two experiments identified the robot and perceived functional features. The scale was reduced to those items from an original pool of 172 items, using item pool reduction techniques and content validation by subject matter experts. The scale was then validated through two final experiments.
The multidimensional nature of trust was employed in [51] to incorporate capacity trust (reliable, capable) and moral trust (sincere, ethical) aspects as factors. They developed a Multi-Dimensional Measure of Trust (MDMT) instrument, which captures the two trust factors. Repeated cycles of principal components analysis and item analysis resulted in study items distributed across four factors: reliable, capable, sincere, and ethical. The MDMT provided a tool that brings the interpersonal trust construct (moral) and a human-machine construct (capacity) into an amalgamated scale. This instrument, however, is yet to be validated.
The Trust of Automated Systems Test (TOAST) measuring trust based on the theory of trust formation was proposed in [48]. Scale structure reliability and criteria validity were evaluated with civilian and military-affiliated samples. There, TOAST demonstrated a reliable, two-factor model representing system performance and understanding. The scores obtained through confirmatory factor analysis on the two factors demonstrated strong positive correlations within those factors.
Other trust scales include the perfect automation schema scale in [52], the psychometric instrument designed to measure cognitive and affective components of human-computer trust in [53], and the early scales include operators' trust and prediction of trust scale in [4] and the trust evolution and reliance scale in [5].
Psychophysiological Measurements: Although self-report trust measurements provide valuable insight into understanding users' trust in automation, they are unable to objectively evaluate user trust or trust correlates and are, therefore, not appropriate for real-time trust assessment [54]. Progress with sensing technology has resulted in the development of inexpensive and efficient psychophysiological sensors and a shift away from self-reporting scales toward more objective methods for assessing trust correlates using psychophysiological signals. Measurements using physiological traits, such as employing heart-rate variability (HRV) measurement [55], evaluation of brain activity using fMRI [56], [57], and fNIRS [58], and electroencephalogram (EEG) [59]- [63], have been used to study and evaluate trust via its psychophysiological correlates.
In [64], Gupta et al. investigated participants trust toward an auditory assistance system for a search task in a virtual reality (VR) environment. They found that the trust inferred this way is subject to different cognitive loads and agent accuracy levels. Participants data were gathered using a variety of sensors, including EEG, HRV, and Galvanic skin response (GSR) devices. The study identified that physiological and behavioral measurements can be used to assess human trust in virtual agents. It also showed that researchers can use a VR environment to simulate realistic environmental scenarios. Furthermore, they studied the effect of cognitive load on trust. Human-like cues were also demonstrated to play an important part in the neural response to an agent's technical capability [65]. Subjects played matrix games with a computerized agent while event-related potentials from EEG were averaged for each subject and trial. By analyzing brain signals, this study creates an environment that capability-based trust levels can be computed from those measurements.
In [59], an empirical trust sensor model was proposed using data from GSR and EEG and showed that psychophysiological signals could be used for real-time human trust measurement inferences in agents. In particular, Akash et al. [59] developed a binary classification model for trust and distrust based on the data they collected in this manner. They further expanded their experiments [66] to dynamically vary and calibrate automation transparency to optimize HMIs.
The application of psychophysiological signals to assess trust via its correlates is complex [67]. It is usually impossible to draw a one-to-one correspondence between the signals from these sensors and trust states [54]. Subsequently, the interpretation of the relevant signals to draw a causal relationship with trust is still an elusive task [68], [69]. In [54], it was suggested that researchers adjust the trust measurement scale, the number and types of physiological signals, trust relationship type, and the analysis technique used to analyze the data to infer the trust state when assessing trust using these approaches.

A. TASs and Trust
Trustworthiness is a property of trusted agents or organizations which engenders trust in other agents or organizations [70].
Any AI-driven autonomous system that embodies such properties is categorized under the term "Trustworthy Autonomous System (TAS)." TAS cements human trust in AI-based systems so that societies can assuredly design, develop, and reap the benefits of these technologies [71].
The integration of AI and machine learning is pervasive in many fields and affects various aspects of our daily lives. The lack of trust in AI systems is often a typical yet justifiable hindrance in achieving the next level of autonomous system ubiquity. Trust in intelligent systems is often compromised due to consistent failure in a broad range of systems and the black box behavior inherent in some data-driven models that obscure internal decision-making processes.
Although intelligent system benefits are far-reaching and undeniable, there are many safety-critical scenarios where the failure of AI systems could have detrimental consequences. The general category of reasons that lead to failure includes maliciously compromising system functionalities executed by unethical people ("intentionally"), engineering shortcomings, and even environmental factors [72]. Autonomous road vehicles are, for instance, exposed to adversarial attacks, which involve adding visual perturbations to stop signs to mislead the interpretations of their trained classifiers and ultimately compromise the vehicles' safety [73].
A timeline consisting of first occurrence of intelligent systems failure, excluding deliberate attacks, is presented in [72]. It is critical to deliver on the advantages of intelligent systems and remit the drawbacks by building trustworthy systems. Additionally, system trustworthiness should be ensured throughout each stage of a product's life cycle, including the technical design, development, implementation, testing, and deployment stages.

B. Ethical Frameworks for TASs
Amidst the ever-evolving dynamics of HMI, it is essential to continuously monitor and ensure the development of intelligent technology aimed at benefiting both individuals and society at large. This has inspired the development of several frameworks and guidelines that inform key principles underlining the ethical operation of AI-based systems.
These frameworks are also relevant to trust, accentuated by references to trustworthy AI design and development principles for entailing customers' trust [74]. Ethical frameworks and guidelines for building trustworthy intelligence list a set of qualities that should be adopted by AI-based systems to be deemed trustworthy.
Stakeholders of these technologies, including industries, governmental agencies, and academia, have released guidelines and frameworks around ethical AI. Some international organizations are instituting AI expert committees to draft guidelines. For instance, the High-Level Expert Group on Artificial Intelligence (AI HLEG) mandated by the European Commission [75] defined seven requirements that systems must meet to realize trustworthiness under its proposed trustworthy AI framework. These include technical robustness and safety, transparency, accountability, privacy and data governance, human agency and oversight, diversity, nondiscrimination, and fairness, as well as societal and environmental well-being. The guideline presented by AI HLEG also provides technical and nontechnical methods to meet these requirements. Industries too are establishing guiding principles and frameworks on ethical AI. Deloitte's Trustworthy AI Framework [76] is an effort that proposes six pillars (transparent/explainable, robust/reliable, fair/impartial, privacy, secure/safe, and accountable/responsible) to consider when designing, developing, and deploying AI-based systems. The similarity in the essence of these principles suggested in both guidelines is reasonably vivid.
It is relevant to capture the global convergence among diverse principles and reconcile their existing differences to reach a consensus on trustworthy innovations. A comprehensive survey of guidelines is provided in [77] by investigating the overlap and divergence of principles and interpretations. This study identified 47 principles across frameworks and later grouped those related principles culminating in the following eight overarching themes: 1) safety and security; 2) transparency and explainability; 3) human control of technology; 4) professional responsibility; 5) promotion of human values; 6) fairness and nondiscrimination; 7) privacy; 8) accountability.

C. Elements of Autonomous Systems Affecting Trust
Autonomous systems garner human trust in their capabilities in more than one way. Several of these methods are discussed in the following section.
1) Trustworthiness Properties: Robustness and Safety. Technical robustness in autonomous systems refers to the ability to counteract adverse conditions. Some autonomous systems are built to make sophisticated decisions in safety-critical scenarios. These systems must ensure their users' safety and exhibit acceptable performance, especially in dynamic environments. There are various angles for addressing robustness in autonomous systems, such as security, reliability, safety, and resilience. Security is a sensitive issue in autonomous systems, given that the systems are more susceptible to attacks.
A survey of security in autonomous systems in [78] highlights major security threats in autonomous systems and provides insights on available cyber-security solutions along with their constraints. Maintaining the security of autonomous vehicles by verifying the integrity of sensor data is proposed in terms of LIDAR [79] and Radar [80] sensors, which use the quantization index modulation based data hiding technique. The research in [81] ensures the safety of autonomous vehicles in unstructured environments and robustness against adversarial attacks by developing controller-focused anomaly detection and systemfocused anomaly detection techniques to enhance anomaly monitoring in sensor data and the overall system. Seemingly, minor adversarial perturbations to an AI model's inputs can severely undermine the model's reliability when used in safety-critical domains. To this effect, robust visual adversarial perturbations under changing environmental conditions are applied to road sign images causing intentional misclassification in autonomous driving deep neural networks [73], demonstrating how vulnerable these systems can be.
2) Explainability, Transparency, and Interpretability: These concepts relate to the degree to which the operation and decisionmaking processes of AI-based systems are relayed in ways comprehensible to a human user. Humans are reluctant to espouse techniques that are not interpretable and trustworthy [82]. Any explanation of the decisions made by an AI-based system to engender trust in the system is enrolled in the concept of explainable AI (XAI). The nature of interpretability, whether inherent or aided, establishes the two facets of XAI. AI models with inherent interpretability are classified under transparent models, and those with aided interpretability are known to exhibit post-hoc explainability [83]. Transparent models have a level of intrinsic interpretability, whereas post-hoc explainability is due to deliberate incorporation of qualitative or quantitative explainable solutions. Visual [84] and natural language [85] explanations can be used to rationalize and elaborate the decisions of a system to external users who are unfamiliar with the system's inner workings. Black-box models in a data-driven domain such as neural networks adopt model-simplification techniques, like the DeepRED algorithm [86] and block-chain solutions [87], to ascribe transparency to the model. A computational explanation approach is used in [88] where feature coefficients are evaluated to understand their effect on the explainability of interpretable models using decision lists or decision trees. Researchers are often faced with tradeoff scenarios between the increasing complexity of algorithms to accommodate improved performance and the system's interpretability quotient [89].
3) Verification of Trustworthy Autonomy: Technological characteristics of trustworthy autonomy, such as reliability and performance efficiency, do not necessarily implicate positive trust responses nor ensure proper trust calibration. People's perceptions of trustworthiness can be affected by the environment, inclination toward technology, complexity of system (uncorroborated with explanations) and different interactions [70]. Trustworthy systems alone do not necessarily impose trust unless provided with the means to verify their capabilities. The importance of providing verifiable claims regarding the robustness, safety, fairness, and privacy protection of AI systems is emphasized as a prerequisite to building trust for those systems [90]. Thus, we draw attention to the notion of "Verified AI," whose objective is to provide evidential assurances for satisfying correctness levels in an AI-based system as a more substantial attempt to warranty trust.
An appeal to represent correctness justifications using mathematical specifications has led to a growing interest in extending formal verification to AI systems. However, traditional formal verification techniques are not always equipped to handle and generalize effectively in advanced intelligent systems. There are five key challenge areas derived from traditionally adopting formal verification methods in autonomous systems [91]. The first challenge is the presence of unknown and stochastic variables in modeling an environment addressed using probabilistic [92] and data-driven [93], [94] approaches to formally model uncertainties in human behavior and physical environments. High-dimensional input and state spaces in machine learning components are another challenge whose modeling has been facilitated using abstractions [95]. Third, formal specification concerns in learning systems are addressed by accommodating a wider mode of task specifications. Quantitative formulations are designed to specify quantitative properties [91]. Algorithmic improvisations [96] and counter example-guided training data generation [95] are some formal method-based research efforts used to optimize data specifications. The fourth challenge, which is designing scalable and efficient computational engines, was maneuvered using modular reasoning [97]. Finally, the theory of formal inductive synthesis [98] is an emerging solution used to address the challenge of synthesizing "correct-by-construction" design of models for a learning system.
Machine learning-based empirical verification of other intelligent systems has been suggested as an alternative method to justify claims. Although it offers no full guarantees, machine learning is better equipped to handle the probabilistic nature of other machine learning-based systems. An interesting review of using one machine learning approach to verify another machine learning model's capability is provided in [99]. In that work, adaptive stress testing is deployed to validate the performance of the next-generation aircraft collision avoidance software (ACAS X). Reinforcement learning is used to estimate and simulate the likeliest near mid-air collision events, which opens the door to validating whether the software responds as expected.
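The stress-testing idea can be caricatured with a toy example. The one-dimensional "separation" dynamics and the random-search loop below are stand-ins invented for illustration; the cited work instead uses reinforcement learning against an aircraft simulator:

```python
# Toy stress testing sketch (assumed dynamics, not ACAS X): search over
# disturbance sequences for one that drives the separation between two
# aircraft below a failure threshold.
import random

def simulate(disturbances, sep0=10.0):
    """Toy dynamics: each disturbance nudges the aircraft separation."""
    sep = sep0
    for d in disturbances:
        sep += d
    return sep

def stress_test(trials=800, horizon=5, threshold=1.0, seed=0):
    """Random-search stress testing: return a disturbance sequence that
    causes a failure (separation < threshold), or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        seq = [rng.uniform(-3.0, 0.5) for _ in range(horizon)]
        if simulate(seq) < threshold:
            return seq
    return None
```

A found sequence is a concrete counterexample scenario that the system under test should be exercised against; adaptive stress testing improves on this blind search by biasing it toward the most likely failure trajectories.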
An open challenge in the verification of intelligent systems is the basic difficulty of specifying some abstract trustworthy AI properties, such as transparency. Conversely, trustworthy properties that are routinely verified in autonomous vehicles include safety, robustness, fairness, and privacy. The safety of autonomous vehicles is formally verified online [100]-[102] by continually monitoring vehicle maneuvers using reachability analysis of all possible behaviors, given some knowledge of the initial states and a bounded uncertainty model. A safety analysis of whether a neural network-based controller prevents an unmanned underwater vehicle from colliding with a static object is provided in [103] using an efficient overapproximate reachability scheme. The fairness of machine learning systems is probabilistically verified using a scalable algorithm and tool (FairSquare) in [104] and [105], respectively. The robustness of machine learning models is formally specified in [106], which can serve as the basis for verification. An analyzer, AI2, was developed in [105] to certify the robustness of large neural networks. Furthermore, the work in [107] formulated ethical policies for unmanned aircraft and formally verified whether an autonomous agent makes ethical decisions.
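To make the online reachability idea concrete, here is a minimal sketch assuming a one-dimensional vehicle with bounded velocity uncertainty. Interval arithmetic over-approximates the reachable positions, so a "safe" verdict is sound but possibly conservative; the model and numbers are illustrative, not those of [100]-[103]:

```python
# Minimal online safety monitoring via interval reachability (assumed
# 1-D kinematics with bounded velocity uncertainty).

def reachable_intervals(x0, v_min, v_max, dt, steps):
    """Over-approximate the reachable position interval at each time step."""
    lo = hi = x0
    intervals = []
    for _ in range(steps):
        lo += v_min * dt   # slowest admissible motion
        hi += v_max * dt   # fastest admissible motion
        intervals.append((lo, hi))
    return intervals

def is_safe(x0, v_min, v_max, dt, steps, obstacle_at):
    """Certify the maneuver safe if no reachable interval reaches the obstacle."""
    return all(hi < obstacle_at for _, hi in
               reachable_intervals(x0, v_min, v_max, dt, steps))
```

Because the interval is an over-approximation of all possible behaviors, `is_safe` returning True guarantees collision-freedom under the assumed bounds, whereas False only flags a potential, not certain, collision.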
Formal verification techniques have also been applied to evaluate human-machine interactive systems [108], [109]. They can exhaustively evaluate various categories of usability properties (e.g., reachability, visibility, reliability, and task-related properties) and mode-confusion-related specifications (e.g., unintended side effects, operator authority limits, and inconsistent behavior) associated with formally modeled human-machine interfaces [108]. Formal verification thus helps find failure points overlooked by other traditional analysis techniques (e.g., simulations and human subject trials). The analysis of safe operation leads to design patterns that promote successful HMI characterized by calibrated trust. The limitations of formal verification discussed earlier still hold for HMIs involving autonomous systems: state-space explosion, the expressive power of modeling formalisms, and model validity all determine the effectiveness of a formal verification method [108]. An additional limitation is introduced in [110], which alludes to the difficulty of formalizing a model that evolves and adapts to changes in human behavior. The Symbolic Analysis Laboratory model checker is used in [111] to formally verify the interaction between a human operator and a single unmanned aerial vehicle (UAV). The enhanced operator function model [109] was also used to model human operator behavior.

4) Methods of Trust Calibration and Assurance: A user's trust must be consistent with an autonomous system's capabilities to ensure that the system is used within the bounds of its intended purpose. The concepts of calibration, resolution, and specificity are provided in [5] to describe the mismatch between user trust levels and automation capabilities. Calibration is defined as "the correspondence between a person's trust in automation and the automation's capabilities" [112], [113]. Resolution refers to how well the range of automation capabilities maps onto the range of trust levels [114]. Specificity describes the level of interdependence between trust and an element of a trustee. Following these definitions, the perspective in [5] emphasizes the importance of sound calibration, high specificity, and high resolution of trust in mitigating mismatch and guiding the design and evaluation of HMI.
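These definitions suggest simple operational metrics. The sketch below is one possible quantification, not taken from [5] or [112]-[114]: calibration error as the mean gap between normalized trust ratings and measured capability, plus an over-trust rate flagging misuse risk:

```python
# Hedged sketch of trust (mis)calibration metrics. Both trust ratings and
# system capability are assumed to be normalized to [0, 1] per episode.

def calibration_error(trust, capability):
    """Mean absolute trust/capability mismatch; 0 means perfectly calibrated."""
    assert len(trust) == len(capability)
    return sum(abs(t - c) for t, c in zip(trust, capability)) / len(trust)

def over_trust_rate(trust, capability):
    """Fraction of episodes where trust exceeds capability (misuse risk)."""
    return sum(t > c for t, c in zip(trust, capability)) / len(trust)
```

Under this reading, calibration corresponds to a small `calibration_error`, while resolution would additionally require the trust ratings to vary monotonically with capability rather than merely matching on average.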
The gradual process by which users build their trust in a particular system and become conditioned to its behavior suggests that trust in the system has low temporal specificity. Users have difficulty instantly building a mental model of a complicated system and thus require several interactions with it. A faster approach to informing users was proposed for a human-machine setting that exposes users to the critical states of the system's policy, yielding an improved understanding of the system's performance and, as a result, efficiently calibrating users' trust [115]. Israelsen [116] proposed an algorithmic approach, known as algorithmic assurances, to influence calibrated human trust in autonomous systems. An example of such an approach is an unmanned ground vehicle communicating its competency boundaries for a given task to a human counterpart through the self-confidence score generated by Factorized Machine Self-Confidence (FaMSeC) [117].
A few other studies on trust calibration include the following. In [118], Robinette et al. showed how providing a robot's operational information, together with an indication that the robot would perform better on upcoming tasks, can serve as a trust repair mechanism. In [119], the authors investigated how user interfaces that communicate both internal and external system awareness may increase drivers' perceived and measured awareness, as well as their trust in the system. In [60], Shahrdar et al. combined immersive virtual reality experimentation with self-reporting software to gauge and measure subjects' trust in a self-driving autonomous vehicle. Their findings illustrated how trust can be reduced, escalated, mutated, and rebuilt by adjusting system performance, such as the success rate of carrying out a task. Notably, [66] managed to dynamically vary and calibrate automation transparency, as well as the amount and utility of information provided to the human, to optimize HMIs using feedback control policies.

V. DISCUSSION
In the following, we discuss the challenges concerning human trust measurement and calibration efforts. These ideas reflect both the human-centric aspect of measuring trust and the machine-centric aspect of trustworthiness and its effect on measurement and calibration.

A. Trust Measurement
Due to the latent nature of trust, it has been challenging to provide a definitive solution toward trust measurement and calibration. As discussed earlier, there have been notable developments that provide both subjective measurements of trust in terms of self-reporting and objective measurements of trust correlates.
The self-report instruments in [37], [48], [49], and [51]-[53] are the most widely used scales in trust measurement. The need to validate these scales for use in application settings other than those they were developed under still largely remains. Moreover, most measurements reported on these scales lack a confirmatory factor analysis justifying the relationship between the identified trust factors and the latent construct, trust. The structural equation modeling method [120] provides one such rigorous factor analysis approach, accounting for how a latent construct like trust is affected by the identified predictors. It uses multiple regression and factor analysis to identify relationships between latent variables measured by multiple items. However, this method extracts only linear relationships among the factors.
The shift away from reliance on self-reporting for trust measurement has paved the way toward objective measurement of trust correlates based on psycho-physiological sensors such as EEG, fMRI, and GSR. These approaches enable real-time sensing of trust variation in response to interaction with the machine in use, and they help us study what the psycho-physiological signals measured this way reveal about a human's trust state while interacting with the machine. These signals provide a rich set of frequency-, time-, and spatial-domain features that can be used to classify trust and distrust responses. However, as with self-reporting measurements, there is a lack of confirmatory factor analyses and validity studies to cement these methods as definitive approaches for trust measurement and subsequent calibration. Additionally, it is difficult to establish a one-to-one relationship between such psycho-physiological correlates and trust states [67]-[69], [121]. Furthermore, the signals from these sensors are typically nonstationary [122], whereas the majority of machine learning classification algorithms assume stationarity and independent data samples [123]; these algorithms do not perform as well as expected on data collected in this manner [122]. Hence, on top of addressing the issues of validity, reliability, and measurement invariance, it is also worth investigating the signal processing aspects of psycho-physiological sensors, as they offer a way to examine trust formation, measurement, and calibration from a different angle by sensing directly observable correlates of trust.
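A typical first step in such pipelines is extracting frequency-domain features (e.g., band power) from windowed sensor signals before classification. The sketch below is an assumed, generic pipeline fragment, not drawn from any specific cited study:

```python
# Extracting a band-power feature from a (toy) physiological signal with a
# naive DFT -- a common precursor to trust/distrust classification.
import math

def band_power(signal, fs, f_lo, f_hi):
    """Power of `signal` within [f_lo, f_hi] Hz, via a direct DFT sum."""
    n = len(signal)
    power = 0.0
    for k in range(n // 2):
        freq = k * fs / n
        if f_lo <= freq <= f_hi:
            re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(signal))
            im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(signal))
            power += (re * re + im * im) / (n * n)
    return power

# Example: a 10 Hz sine sampled at 128 Hz carries its power in the
# alpha band (8-13 Hz), not the theta band (4-7 Hz).
fs = 128
sig = [math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]
alpha_power = band_power(sig, fs, 8, 13)
theta_power = band_power(sig, fs, 4, 7)
```

Features like these, computed per sliding window, would then feed a classifier; the nonstationarity concern above means the feature distributions can drift across windows, violating the classifier's i.i.d. assumption.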

B. Trustworthiness and Trust Calibration
The lack of standardized trust measurements has hindered research progress in developing effective trust calibration techniques. We have yet to completely characterize the extent and magnitude of how the trustworthiness properties of machines influence trust. Users' trust should be appropriately calibrated to match the system's capabilities in a given situation to ensure safe and effective HMI. The effects of technical robustness, transparency, fairness, and other properties of a TAS are intuitively linked with trust realization and calibration. Empirical research has surfaced in recent years demonstrating that this cause-and-effect relationship between the notions of trustworthiness and trust is more intricate and less crisp than anticipated. For instance, studies [4], [124] uphold the significance of system transparency in instigating regulated trust levels. Users may fall into an overtrust or undertrust category with respect to a system, and Okamura and Yamada [125] assert that transparency alone is not enough to recover from such trust states. Instead, they developed an adaptive trust calibration technique aided by cognitive cues that signal users when improper trust calibration is detected, prompting them to adjust their level of trust. Seppelt [126] demonstrated that continuous system feedback promotes trust calibration in automation. In contrast, after investigating various feedback levels, Mackay et al. [127] contend that more information does not always correlate positively with trust, allowing for the possibility of negative cognitive load effects. The way a system communicates its decisions to its users should also be tailored to user variability for a competent effect [31]. The composition of the feedback relayed to users has been investigated and automatically generated using XAI principles to induce desired trust effects in automated vehicles [128].
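The adaptive cueing idea can be sketched as a simple monitoring rule. The reliance/reliability gap and the tolerance threshold below are illustrative assumptions, not the actual mechanism of [125]:

```python
# Hedged sketch of adaptive trust calibration: compare the user's observed
# reliance rate with the system's measured reliability and emit a cognitive
# cue when the gap suggests over- or under-trust. The tolerance is assumed.

def calibration_cue(reliance_rate, reliability, tol=0.15):
    """Return a cue prompting trust adjustment, or None if trust looks calibrated."""
    gap = reliance_rate - reliability
    if gap > tol:
        return "over-trust: verify system outputs before accepting them"
    if gap < -tol:
        return "under-trust: the system has been more reliable than your usage suggests"
    return None
```

The point of such a rule is that the cue is issued only when miscalibration is detected, rather than flooding the user with continuous feedback and risking the cognitive load effects noted above.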
Although Israelsen et al. [117] devised a quantitative approach for trust calibration in which an autonomous vehicle reports its self-confidence level depending on its capabilities, the application is highly domain-specific, and the competency boundary of the system is probabilistic. There is a persistent gap in understanding how a machine's trustworthiness maps to the human-trust variable. Quantitatively outlining this association in a generalized manner remains a long stride away, but it shows the greatest promise toward effectively designing and implementing autonomous systems that are truly trustable.
Another challenge we may draw from the discussions in previous sections is that although there are hundreds of guidelines from different stakeholders outlining ethical principles that AI-based systems should adopt, less effort has been put into translating those principles into practice. A call for a "practical AI ethics" field urges the development of action-oriented frameworks and guidelines [129]. Accordingly, recent guidelines [75], [130] proposed recommendations on performing this translation. Implementations based on these recommendations demand input from researchers across disciplines and must ultimately draw stakeholders (with conflicting priorities) toward a consensus.
Much work has been presented on quantifying the impact that the trustworthy properties of autonomous systems have on human trust. The difficulty of interpreting these properties in terms of machine-specific functions is a key impediment. Additionally, prior attempts to define and model trustworthy properties have been context-specific and require generalization. A Bayesian network-based trust model is used in [131], where the trust impact of modifying system behavior is evaluated using a utility function; the model enforces the system behaviors that yield the highest trust utility value depending on the status of context variables. Despite this effort to computationally relate system properties to trust, the system investigated in [131] is not a learning system, and the system properties reflect user perceptions of them rather than actual measurements.
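Utility-driven behavior selection in the spirit of [131] can be sketched as expected-utility maximization over candidate behaviors. The contexts, behaviors, probabilities, and utilities below are invented for illustration and are not taken from that work:

```python
# Hedged sketch: pick the system behavior with the highest expected trust
# utility given the current context variable. P(trust increases | behavior,
# context) is assumed here; in a Bayesian network model it would be inferred.

P_TRUST = {
    ("explain", "novice"): 0.85, ("explain", "expert"): 0.60,
    ("silent",  "novice"): 0.30, ("silent",  "expert"): 0.70,
}

def best_behavior(context, utility_gain=1.0, utility_loss=-1.0):
    """Expected-utility maximization over candidate system behaviors."""
    def expected_utility(behavior):
        p = P_TRUST[(behavior, context)]
        return p * utility_gain + (1 - p) * utility_loss
    return max(("explain", "silent"), key=expected_utility)
```

Under these assumed numbers, the policy explains its decisions to novices but stays quiet with experts, mirroring the idea that trust-optimal behavior depends on context variables.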

VI. CONCLUSION
The evaluation, measurement, and calibration of trust in relation to HMI can facilitate and improve the effective integration of complex autonomous systems into everyday use. Without an appropriately calibrated level of trust, autonomous systems tend to be misused because of overtrust, or they may fall into disuse because of distrust. To make the best use of the ever-increasing interactions between humans and complex systems, trust measurement and appropriate calibration have become critical tasks to solve. This survey has reviewed a large pool of research output that provides varying solutions toward the measurement and calibration of trust. Most importantly, the survey discusses the issue from both a human-centric trust measurement perspective and a machine-centric trustworthiness perspective. The survey leads the authors to conclude that proper calibration of trust depends on the appropriate factors affecting human trust being properly accounted for; on the relative importance of those factors being correctly measured and validated; and on how properly the information about the machine's operations, including why and how it carries out a task the way it does, is relayed to the human user in an understandable way.
On the one hand, concerning the human-centric aspect of trust measurement and calibration, state-of-the-art trust measurements have shifted from self-reporting approaches to more real-time sensing of trust using psycho-physiological sensors. Measurements using these sensors provide an understanding of how human cognitive and physiological responses correlate with the level of trust projected toward how an autonomous system carries out a task. On the other hand, concerning the machine-centric assessment of the same, research on the critical aspects of machine trustworthiness influencing trust has gained further momentum to include detailed properties such as robustness, fairness, and transparency. The review of these influences implies that providing cognitive cues about the machine's state (i.e., a continuous but appropriate level of system feedback about the machine's decision-making processes) and reporting the machine's self-confidence yield promising results toward adaptive trust calibration.