Prediction of Inter-Personal Trust and Team Familiarity From Speech: A Double Transfer Learning Approach

Speech classification is one of the most convenient objective measures of internal state exhibited during a problem-solving task that requires verbal communication. This study investigates the hypothesis that speech acoustic characteristics are indicative of trust between team members and of team members' familiarity with each other. Speech recordings from 27 dyadic teams (26 males and 28 females) were made during a distributed threat perception task in which teams determined safe points along a route through a town to be visited by a VIP. Before the threat detection mission, 26 team members knew each other, and the remaining 28 had no prior knowledge of their partners. Two levels (Low Trust and High Trust) of two trust constructs, TTP (Trust, Trustworthiness, Propensity to trust) and RIS (Reliance Intentions Scale), were estimated based on numerical responses to pre- and post-mission surveys. Speech recordings of individual speakers were divided into 1-second intervals and converted into RGB images of amplitude spectrograms. The images were classified using a pre-trained convolutional neural network, ResNet-18, fine-tuned to recognize either the trust level or familiarity. In the baseline classification scenario, the speech was classified using single transfer learning into Low/High-trust categories separately for the RIS and TTP constructs before and after the mission, yielding an average classification accuracy of 82%-86%. Single transfer learning classification into Known/Unknown-partner categories led to 85% accuracy. Application of double transfer learning, i.e., first tuning the ResNet-18 on Known/Unknown labels and then on Low/High-trust labels, increased the trust classification accuracy up to 89%. When tuning the ResNet-18 on Low/High-trust labels and then on Known/Unknown labels, the accuracy of partner familiarity recognition was also increased up to 89%.
These results support the hypothesis of speech acoustics being indicative of trust and familiarity between team members and show that by adding prior related knowledge to the model, more efficient learning can be achieved without increasing the training data size.

INDEX TERMS Inter-personal trust prediction, inter-personal familiarity prediction, speech classification, transfer learning, convolutional neural networks.


I. INTRODUCTION
Interpersonal trust is commonly defined as the willingness to accept vulnerability to another person's actions or decisions. Trust is a vital component of daily human interactions and decision-making processes with a partner, and it has high importance in all aspects of life. Understanding the subjective and objective factors affecting trust is an active topic of research in psychology, social studies, and recently in artificial intelligence and machine learning. A comprehensive review of interpersonal trust research from the perspective of behavioral psychology can be found in [1]. Recently investigated objective trust indicators include emotional states [2], facial expressions [3], and speech attributes [4]. From the machine learning perspective, speech as an objective factor affecting interpersonal trust is of particular interest. Through the mapping of social attributes of trustworthiness into synthetic speech and machine-made decision-making processes, higher social acceptance of all kinds of human-machine interactions can be achieved. Synthetic speech produced by robots can be perceived as more friendly when characterized by acoustic and verbal signs of empathy, affection, and trustworthiness. Verbal human-machine communications containing acoustic traits of trust are vital for achieving high-performing human-machine teams working on high-responsibility tasks of public security and defense.
This paper investigates the hypothesis of acoustic characteristics of speech being indicative of inter-personal trust and team members' familiarity with each other. Two fundamental research questions underlie this study. Firstly, we wanted to find out if it is possible to predict the trust and familiarity of team members working on a distributed problem-solving task from the acoustic characteristics of their speech. Secondly, we wanted to determine if having a prerequisite knowledge of familiarity can improve the prediction of trust and vice versa.
In general, there are two types of trusters [3]: individuals who do not trust others until they gather clear evidence that the trust can be granted, and individuals who exhibit a high degree of prior trust in others until they obtain evidence that their trust can no longer be granted. These two groups were identified in this work by conducting pre- and post-mission surveys determining how much team members trust each other. Given the two trust labels (pre- and post-mission), an automatic prediction of pre- and post-mission trust was conducted using speech collected from team members during the problem-solving mission.
The remaining parts of the paper are organized as follows. Section II provides an overview of related studies. Section III describes the speech data and the methodology of speech classification. Section IV describes the experiments and discusses the results. Section V concludes the paper.

II. RELATED WORKS
Trust has been analyzed from the perspective of several disciplines, including sociology, psychology, philosophy, management, economics, automation, and communication networks. In general, trust can be defined as a relationship between two entities, denominated the trustor (evaluator) and the trustee (evaluatee), based on a given criterion [5]. Trust is a multidisciplinary concept that can be applied to human-human interactions as well as to other types of relationships, such as human-machine interactions.
Although there is neither a multidisciplinary trust model nor a standard metric for the quantification of trust measurements, there are some common tendencies in trust investigations. In particular, there is a recently growing interest in studies of trust between humans and artificial agents (human-robot interactions). As a result, several trust models have been proposed, which can have an online or offline structure [6]. In behavioral economics, the "Trust Game," designed in [7], is commonly applied to evaluate the level of trust in economic decisions. One of the most widely used trust models was developed by Mayer et al. [8] in the context of organizational management. This model defines trust using three dimensions: ability, benevolence, and integrity (ABI framework). A questionnaire-based study of gender differences in interpersonal trust [9] revealed that male college students have more difficulty trusting other students than female college students. In a search for objective predictors of trust, facial expressions have been considered as a guide to trustworthiness. It has been observed that physical facial attributes, as well as the degree of similarity to oneself, can affect the perceived trustworthiness of a person. In general, people more similar to oneself are perceived as more trustworthy [10]. However, acoustical qualities that might establish trust are easier to observe for remote teams and individuals. In the absence of cues indicating similarity, as is the case in our current world of remote work and teleconferencing, increased pitch and intensity are perceived as indicators of deception or untrustworthiness [11]. However, as described further, trust perception in speech has many viable indicators, and the content of the shared speech may matter as well. There is a difference between intended deception and determining the trustworthiness of a partner who harbors no malintent.
Several studies have investigated the relationship between trust perception and speech [12][13][14][15][16]. In [12], voice models were evaluated to investigate the correlation between basic acoustical parameters and the perception of low and high trust. Results showed that a significant change in intonation is related to an increase in the perception of trust. Similar studies conducted in [13] found an inverse correlation between vocal pitch and trust.
Experiments conducted in [14] reported that a low harmonic-to-noise ratio, a low mean fundamental frequency, and a fast speech rate are the most significant acoustic characteristics present in the perception of high trust. Additionally, tests comparing positive valence speakers against negative valence ones, younger voices against older voices, and female against male speakers showed that a higher perception of trust is achieved for positive valence speakers, younger voices, and female speakers, respectively.
Recently, studies conducted in [15] investigated how the perception of trust is affected by the voice and accent of a speaker. Trust was evaluated as whether a statement was believable or not, spoken in confident and doubtful voices. The study was conducted from the perspective of an English Canadian listener. Three accent groups of speakers were established: "in-group" (native Canadian English speakers), "out-group/regional" (Australian English accent), and "out-group/foreign" (Canadian French accent). The results indicated that the accent does not affect the preferential trust given to confident statements over doubtful ones. However, confident statements produced by the out-group/foreign accent were evaluated as less believable than the statements produced by the English accent groups. Evaluation of basic acoustic features showed that doubtful statements present lower speech rates and amplitude ranges than confident expressions in all accents.
Acoustic speech properties of high and low public trust politicians have been recently investigated in [4]. Statistical analysis and classification of low-level acoustic speech parameters revealed strong gender differences. In general, high trust females appeared to have higher spectral energy for both voiced and unvoiced speech within 0-500 Hz compared to low trust females (a sonorant sound). It was also revealed that trust in females is related to the timbre and non-verbal attributes of speech that are often linked to emotions [17][18]. High trust males, on the other hand, had slightly lower energy concentration at low frequencies (a more fricative or constrained sound) compared to low trust males. The main differences between spectral energies of high and low trust males were found to be in the unvoiced speech within the low-frequency band 500-1500 Hz. High trust males appear to have a smaller spectral slope within this range (i.e., smoother transitions between slowly changing and fast-changing speech components). In addition, high trust males showed larger spectral flux (i.e., larger differences between spectra of consecutive frames). An automatic speech classification into low and high-trust politicians using a multi-layer perceptron neural network led to an average accuracy of 81%. The outcomes of this study indicated that the acoustic speech properties of politicians are good indicators of public trust.
The recent advent of deep learning (DL) neural network techniques has been particularly helpful in creating very efficient prediction models. The availability of cloud storage and graphical processing units (GPUs) provides support for dealing with the large amounts of data and vast numbers of computations needed to train DL models. Classical signal processing pipelines, including complex pre-processing of speech followed by feature extraction and feature selection, were simplified by the application of end-to-end neural network approaches. Convolutional neural network (CNN) models, for example, calculate their own features through the iterative generation of correlated data, removing the need to compute problem-related features. Available pre-trained neural network models reduce the computational and data requirements, allowing specific problem-related fine-tuning to be performed on a relatively small amount of available data. The network's features, also known as network embeddings, provide a highly competitive alternative to handcrafted features for classical classifiers such as the support vector machine (SVM), k-nearest neighbors (KNN), or the Gaussian mixture model (GMM).
Although automated analyses of social signals can provide supplementary information and discover trends that human beings are not able to detect, machine learning techniques for social signal processing have rarely been used in studies of human-human trust interaction. An approach to an automatic prediction of trust in human-human interaction was presented in [19]. The investigation did not involve speech analysis; only features of the non-verbal behavior of the participant (movement of the head, arms, and eyes, touch, and smile signals) were analyzed. Hidden Markov models were implemented to investigate the temporal relationships between trust and social signals. Trust in human-machine-human interactions was analyzed in [20][21] using an automated approach that combines subjective labels and physical features. Two scenarios were evaluated using two scripted dialogues. The first dialogue was video recorded during a business conversation, and the second one during a fire rescue. A multimodal analysis was performed using facial expressions, basic acoustic features (fundamental frequency, voice/silence ratio, mean voice power, and energy entropy), and heart rate variability. An ensemble classification was then applied to predict three levels of trust using a neural network (NN) with a fuzzy decision rule as the classifier.
In this study, we provide a twofold contribution to automatic trust recognition from speech using convolutional neural networks (CNNs). Firstly, we demonstrate that both interpersonal trust between team members and team members' familiarity with each other can be efficiently predicted from speech. Secondly, we show that a transfer of team familiarity and inter-personal trust knowledge between deep CNN models can enhance the models' performance.

1) DATA COLLECTION
The speech data was collected at the Air Force Research Laboratory (AFRL) from dyadic team partners working on a distributed problem-solving task aided by a computer video simulation program. The task was to approve the safety of the travel route for a VIP. During the task, team members were conducting verbal communication, and their speech was recorded. When working on the task, team members were required to make a few intermediate decisions by weighing their own information against information obtained from the team partner. Since the information available to each team member was different, a certain amount of trust or mistrust in the partner was required from team members to make the decisions.

2) DATA LABELING
Participants were asked to provide responses to pre- and post-mission psychology surveys with the same questions used in both. The survey responses were used to evaluate two trust-related construct measures called TTP (Trust, Trustworthiness, Propensity to trust) and RIS (Reliance Intentions Scale) [22]. Table I shows the questions used to derive the TTP trust construct measure. The response to each TTP question was given by the participants as an integer number on a scale from 1 to 5. PreTTP and PostTTP scores were calculated for each participant as

PreTTP, PostTTP = (A + B + C + D) / 4, (1)

where A, B, C, and D denote pre- or post-mission scores given by the participants in response to the questions listed in Table I. Speech of individuals whose PreTTP and PostTTP scores were above the average estimated across all participants was labeled as "TTPHighTrust," and speech representing scores falling below this average was labeled as "TTPLowTrust." A similar speech labeling procedure was applied to the RIS trust construct. However, in this case, the response to each RIS question was given by the participants as an integer number on a scale from 1 to 7, and the PreRIS and PostRIS scores were calculated for each participant as

PreRIS, PostRIS = (A + B + C + D + E + F + G + H) / 8, (2)

where A through H denote pre- or post-mission scores given by the participants in response to the questions listed in Table II. Speech of individuals whose PreRIS and PostRIS scores were above the average estimated across all participants was labeled as "RISHighTrust," and speech representing scores falling below this average was labeled as "RISLowTrust."
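As a hedged illustration, the labeling procedure can be sketched as follows; the mean-based scoring and the example responses are assumptions for illustration, not the study's data.

```python
# Hypothetical sketch of the survey-based labeling: a participant's construct
# score is the mean of their item responses, and speech is labeled High/Low
# trust relative to the group average.

def construct_score(responses):
    """Mean of a participant's integer survey responses (e.g., A..D for TTP)."""
    return sum(responses) / len(responses)

def label_trust(scores, prefix="TTP"):
    """Assign 'HighTrust'/'LowTrust' labels relative to the group average."""
    avg = sum(scores) / len(scores)
    return [f"{prefix}HighTrust" if s > avg else f"{prefix}LowTrust" for s in scores]

# Example: four participants' PreTTP scores derived from responses A..D (1-5 scale)
pre_ttp = [construct_score(r) for r in ([5, 4, 4, 5], [2, 1, 2, 2], [3, 3, 4, 3], [1, 2, 1, 1])]
labels = label_trust(pre_ttp, prefix="TTP")
```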

3) DATA CHARACTERISTICS
Speech recordings were collected from 27 dyadic teams (i.e., 54 individuals) participating in the problem-solving task. The data distribution across genders and team members' familiarity with each other is summarized in Table III. The term "Known" indicates participants who knew their partners before the data collection, and "Unknown" refers to participants who did not know their partners before the mission. The duration of the audio recordings collected from each participant was 10-15 minutes. Fig. 1 shows the data distribution across the two trust levels (High/Low) and participant familiarity (Known/Unknown) in four cases (PreTTP, PostTTP, PreRIS, and PostRIS). It shows apparent differences in trust assessment between the pre- and post-mission surveys, indicating that the problem-solving task had an effect on interpersonal trust between team members. The trust assessment also differs between the TTP and RIS surveys and depends on team members' familiarity. Given these observations, we wanted to determine if these differences can be predicted through automatic speech classification. To do so, we conducted speech classification experiments on the following binary categories: trust (Low vs. High), team familiarity (Known vs. Unknown), and trust change during the problem-solving task (Change vs. No Change). In addition, we investigated multilabel classification combining two labels (trust and team familiarity) and three labels (trust, team familiarity, and trust change). Since team familiarity and interpersonal trust are likely to be correlated, we also wanted to know if a classification system capable of familiarity prediction can be trained more efficiently to determine trust compared to a system trained only to predict trust without prior knowledge of familiarity, and vice versa.

1) SPEECH CLASSIFICATION FRAMEWORK
Given the limited size of available data and computational resources, the option of training a designated trust-prediction CNN "from scratch" was not feasible. We have, therefore, adopted a classification approach proposed in [23] for the prediction of speech emotions. However, instead of AlexNet, the more advanced convolutional neural network (CNN) structure of ResNet-18 [24] was applied. Fig. 2 shows an overview of the classification procedure consisting of the training and testing stages. The following paragraphs provide brief descriptions of each processing step shown in Fig. 2.

2) PRE-PROCESSING
The pre-processing step included speech normalization into the range -1 to +1 and removal of silence intervals. The silence was detected using an empirically chosen energy threshold.
No speech pre-emphasis filter was applied. Since the signal was recorded from personal headset microphones within a quiet room, the signal-to-noise ratio was relatively high, and there was no need for the removal of noise or any other interference. The sampling rate was 16 kHz, giving an 8 kHz signal bandwidth.
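A minimal sketch of this pre-processing step is given below, assuming a hypothetical frame length and energy threshold (the paper's threshold was chosen empirically and is not reported here):

```python
import numpy as np

# Sketch: peak-normalize speech into [-1, 1], then drop low-energy (silence)
# frames using a fixed energy threshold. Frame length and threshold are
# illustrative assumptions.

def preprocess(signal, frame_len=160, energy_threshold=1e-3):
    x = signal / np.max(np.abs(signal))          # peak-normalize to [-1, 1]
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)        # per-frame mean energy
    voiced = frames[energy > energy_threshold]   # keep frames above threshold
    return voiced.reshape(-1)

# Example: a tone followed by near-silence; only the tone frames survive
t = np.arange(3200) / 16000.0
tone = 0.5 * np.sin(2 * np.pi * 200 * t)
out = preprocess(np.concatenate([tone, np.zeros(3200)]))
```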

3) SPLITTING INTO BLOCKS
Pre-processed waveforms of speech samples were divided into short 1-second blocks to conduct block-by-block spectrogram computation. A short, 10-millisecond stride was applied between subsequent blocks. For each block, a single spectral amplitude array was calculated and converted into a color RGB image. Waveform remainders that did not fill a full 1-second block were discarded. The 1-second block duration was consistent with the previously reported durations used in speech-based prediction of speakers' states [24][25][26]. The stride between subsequent blocks was chosen arbitrarily; by changing the stride duration, a smaller or larger number of CNN input images could be generated. The 10-millisecond stride allowed a relatively large number of training images to be generated, ensuring sufficient generalization of the CNN model.
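The block-splitting step can be sketched as follows (the random waveform is a stand-in for real speech; parameters match those stated above):

```python
import numpy as np

# Split a waveform into 1-second blocks (16000 samples at 16 kHz) taken with
# a 10-millisecond (160-sample) stride; the trailing remainder that cannot
# fill a full block is discarded. The heavy overlap between consecutive
# blocks is what multiplies the number of training images.

def split_into_blocks(x, fs=16000, block_sec=1.0, stride_sec=0.01):
    block = int(fs * block_sec)      # 16000 samples per block
    stride = int(fs * stride_sec)    # 160 samples between block starts
    starts = range(0, len(x) - block + 1, stride)
    return np.stack([x[s:s + block] for s in starts])

x = np.random.randn(16000 + 5 * 160)   # 1 s of "speech" plus 50 ms extra
blocks = split_into_blocks(x)
```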

4) GENERATION OF SPECTROGRAM IMAGES
A short-time Fourier transform was performed on each 1-second block of the speech waveform. The resulting pairs of real and imaginary outputs were converted into real-valued spectral magnitudes and concatenated across all subsequent frames within a given block to form a time-frequency spectral magnitude array of 257 x 259 real-valued numbers representing that block, where 257 was the number of discrete-frequency values and 259 was the number of discrete-time values for which the spectrogram arrays were evaluated. The time scale of the magnitude spectrograms was linear, spanning the range from 0 to 1 second. The frequency scale, on the other hand, was logarithmic, spanning the entire bandwidth from 0 to 8 kHz.

FIGURE 1. Participant characteristics based on responses to TTP and RIS questions pre- and post-experiment. The term known indicates team members who knew each other before the experiment, and the term unknown refers to team members that did not know each other before the experiment.

The logarithmic frequency scale was previously reported to give superior performance compared to linear, mel, and ERB scales [16][20]. The advantage of the logarithmic scale is that, compared to the other scales, it compresses the high-frequency details and expands the low-frequency range, where the vital information regarding the speaker's fundamental frequency and first formant values resides. Spectral magnitude arrays were converted into the RGB image format represented by three color-component arrays. As shown in [23], RGB images of amplitude spectrograms provide higher speech classification performance than greyscale images or raw spectrogram arrays. This is largely because conversion to the RGB format decomposes the spectrogram array into three amplitude components, which are then passed into the three parallel processing channels of the network. This way, complementary information is analyzed by each channel rather than a copy of the same information being given to each channel, as is the case when raw spectrogram arrays (or greyscale images) are processed. To normalize the color intensity range across all images, the minimum and maximum values of the spectral magnitude were estimated over the entire speech database and mapped into the normalized dynamic range of the RGB images, spanning from Min [dB] to Max [dB]. The normalization step was critical in achieving good visualization of the speech spectral components. The Min and Max values were chosen to maximize the subjective visibility of contours outlining the time-frequency evolution of the fundamental frequency (F0), its harmonic components, and the speech formants.
Since the required input size for ResNet-18 was 224 x 224 pixels, the original image arrays of 257 x 259 pixels were resized. Consistent with [23][25][26], the amount of resizing was small, causing no significant distortion to the spectrogram images and having no effect on the model training results. Each color component of the RGB images was passed as an input to a separate channel of ResNet-18.
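A simplified sketch of the spectrogram-image pipeline is given below. The window size, hop size, dB limits, nearest-neighbor resizing, and toy colormap are all assumptions for illustration; the paper's log-frequency warping and exact RGB mapping are not reproduced.

```python
import numpy as np

# Sketch: magnitude STFT of a 1-second block (512-point FFT -> 257 frequency
# bins, hop chosen to yield 259 frames), dB normalization against assumed
# global Min/Max values, nearest-neighbor resize to ResNet-18's 224 x 224
# input, and a toy three-channel color mapping.

def block_to_rgb(block, n_fft=512, hop=60, db_min=-80.0, db_max=0.0):
    n_frames = 1 + (len(block) - n_fft) // hop
    spec = np.abs(np.stack([
        np.fft.rfft(block[i * hop:i * hop + n_fft] * np.hanning(n_fft))
        for i in range(n_frames)
    ]).T)                                          # shape (257, 259)
    db = 20 * np.log10(spec + 1e-10)
    norm = np.clip((db - db_min) / (db_max - db_min), 0.0, 1.0)
    rows = np.arange(224) * norm.shape[0] // 224   # nearest-neighbor resize
    cols = np.arange(224) * norm.shape[1] // 224
    small = norm[rows][:, cols]
    rgb = np.stack([small, small ** 2, 1.0 - small], axis=-1)  # toy colormap
    return rgb

img = block_to_rgb(np.random.randn(16000) * 0.1)
```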

5) CNN MODEL
The classification was achieved using a pre-trained ResNet-18 CNN model, which has a residual network block architecture comprised of 18 layers. The network is defined by 11.5 million parameters and requires a storage space of 44 MB. Natural image classification experiments based on the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) dataset [28], containing 1.2 million images with 1000 classes, demonstrated that the ResNet-18 model offers a very good balance between the computational cost and the quality of performance [29]. In comparison with other pre-trained CNNs, ResNet-18 achieves higher accuracy than popular models with linear architectures such as AlexNet [30]. It is also significantly faster than other residual block networks (ResNet-50, ResNet-101) without compromising the classification accuracy [24]. ResNet-18 has shown good performance in comparison with other pre-trained CNN models in speech classification tasks such as intoxication detection [31], depression detection [32], and speech command recognition [33]. In the current study, the original last fully connected layer of ResNet-18, the Softmax layer, and the output layer were modified to have the number of class outputs required by a given experiment.

6) TRAINING AND TESTING THE CNN MODEL
The entire dataset of speech samples was divided into a training/validation subset consisting of 80% of the data and a testing subset composed of 20% of the data. In both subsets, all speakers were represented in a balanced way. The testing data was not used in the training process. Each of the classification experiments described in Section IV was repeated three times with different mutually exclusive training and testing sets, and the results were given as average values over all repetitions.
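A sketch of this speaker-balanced, repeated 80/20 splitting; the speaker IDs and image counts are hypothetical:

```python
import random

# Each speaker contributes 80% of their images to training and 20% to testing,
# so all speakers are represented in both subsets; the partition is redrawn
# for each of the three repetitions.

def split_per_speaker(images_by_speaker, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    for speaker, images in images_by_speaker.items():
        imgs = images[:]
        rng.shuffle(imgs)
        k = int(len(imgs) * test_frac)
        test += imgs[:k]
        train += imgs[k:]
    return train, test

data = {f"spk{i}": [f"spk{i}_img{j}" for j in range(10)] for i in range(4)}
for repeat in range(3):                  # three repetitions, as in the paper
    train, test = split_per_speaker(data, seed=repeat)
```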

C. CLASSIFICATION PERFORMANCE MEASURES
To determine the quality of the classification outcomes, the accuracy ACC_i was estimated for each class c_i (i = 1, ..., N) using (3) [27]:

ACC_i = (TP_i + TN_i) / (TP_i + TN_i + FP_i + FN_i), (3)

where N denotes the number of classes, and TP_i and TN_i are the numbers of true-positive and true-negative classification outcomes, respectively. Similarly, FP_i and FN_i denote the numbers of false-positive and false-negative classification outcomes, respectively. In cases when the classification was based on unbalanced data, the F-score was calculated as

F_i = 2 * P_i * R_i / (P_i + R_i), (4)

where P_i denotes the precision parameter, given as

P_i = TP_i / (TP_i + FP_i), (5)

whereas R_i is the recall parameter, given as

R_i = TP_i / (TP_i + FN_i). (6)
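These per-class measures (accuracy, precision, recall, F-score) can be computed with a small pure-Python helper; the label sequences below are illustrative, not the study's data:

```python
# Per-class classification metrics from true/predicted label sequences.

def per_class_metrics(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    tn = sum(t != cls and p != cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, precision, recall, f_score

y_true = ["High", "High", "Low", "Low", "High", "Low"]
y_pred = ["High", "Low",  "Low", "Low", "High", "High"]
acc, p, r, f = per_class_metrics(y_true, y_pred, "High")
```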

A. EXPERIMENTAL SCHEDULE
This section describes experiments validating the concept of speech-based prediction of interpersonal trust and familiarity between team members. The experimental scenarios are explained in the following sub-sections.

B. PREDICTION OF PRE-AND POST-MISSION TRUST USING SINGLE TRANSFER LEARNING
In this experiment, we wanted to determine if inter-personal trust can be predicted directly from the acoustic characteristics of speech. Speech recordings were automatically classified into two categories, Low/High trust, using the pre-trained CNN model ResNet-18. The ResNet model was pre-trained on over one million images to perform a general image classification task of object recognition. Here, we fine-tuned this model to solve the more specific and more abstract task of trust prediction from our much smaller collection of spectrogram images. Due to the pre-existing relevant image classification knowledge embedded in the model, it was possible to accomplish the training (fine-tuning) process within a relatively short time and with a much smaller number of training images compared to what would be required to train the CNN model "from scratch" (i.e., without any prior knowledge built into the network). We refer to this approach as "single transfer learning." The training and classification tasks were performed in four separate cases corresponding to the RIS and TTP trust construct labels made before and after the mission (i.e., PreRIS, PreTTP, PostRIS, and PostTTP). The numbers of spectrogram images per class used to train the model were: 27661 for PreTTP, 27654 for PostTTP, 20935 for PreRIS, and 39087 for PostRIS. Fig. 3 shows the results of this experiment. General observations stemming from this figure are:
- In all cases of single transfer learning presented in Fig. 3, the trust classification accuracy is above 80%, varying from 82% to 86%. This indicates that acoustic speech characteristics are correlated with interpersonal trust; therefore, speech classification can provide a quite efficient prediction of trust.
- Prediction of trust before the mission is only slightly lower than after the mission.
- Before the mission, the RIS-based prediction is 3.8% higher than the TTP-based prediction. However, after the mission, both constructs give very similar outcomes.
- The best performing single transfer learning trust prediction models are given by the CNN trained on the PostTTP trust construct and the CNNs trained on the PreRIS or PostRIS constructs; in both cases, the achieved accuracy was about 86%.

C. PREDICTION OF TEAM FAMILIARITY USING SINGLE TRANSFER LEARNING
Our aim in this experiment was to determine if acoustic speech characteristics are indicative of team familiarity. As in the first experiment, single transfer learning was applied to predict if team members knew each other before the mission. However, this time, the pre-trained ResNet-18 model was fine-tuned to perform binary classification of speech into the two categories Known/Unknown. The number of spectrogram images per class used to train the model was 34956. Fig. 4 shows that after training, the model could predict team members' familiarity with an accuracy of 85.16%. This supports the proposition that acoustic speech characteristics are correlated with team members' familiarity.

D. PREDICTION OF PRE-AND POST-MISSION TRUST USING DOUBLE TRANSFER LEARNING
In this experiment, the aim was to determine if the addition of closely related prerequisite knowledge to the model can improve the prediction of trust. From the previous two experiments (Sections IV-B and IV-C), we knew that both interpersonal trust and team members' familiarity influence speech acoustics. Therefore, if there was a dependency between trust and familiarity, a network model having prerequisite knowledge of team members' familiarity could hypothetically outperform a model without such knowledge. To investigate this assumption, we conducted a double transfer learning experiment. The CNN model generated through transfer learning from ResNet-18 (described in Section IV-C) to predict prior knowledge of team members was further trained to predict trust between team members. The transfer of knowledge during this process occurred twice: the first time when transferring the general image object classification knowledge from ResNet-18 to train the CNN model to recognize whether the team members knew each other, and the second time when this knowledge was transferred to further train the CNN model to recognize trust between team members. We refer to this approach as double transfer learning. The training and classification tasks within this scenario were conducted in four cases: the prediction of RIS and TTP constructs labeled before and after the mission, described by four types of binary trust labels (PreTTP, PostTTP, PreRIS, and PostRIS). The numbers of spectrogram images per class used to train the model were: 27661 for PreTTP, 27654 for PostTTP, 20935 for PreRIS, and 39087 for PostRIS. The results, presented in Fig. 3, give the following general observations:
- In all four cases, double transfer learning provided an improvement upon the single transfer learning approach. For PreTTP trust the improvement was the highest, by 4.1%; for PostTTP by 2.1%; for PreRIS by 2.2%; and for PostRIS by 3.3%. This indicates a strong correlation between interpersonal trust and prerequisite familiarity between team members; therefore, a network model having knowledge of familiarity provided a better prediction of trust than a network predicting without this knowledge.
- As in the case of single transfer learning, the double transfer learning prediction of trust before the mission appears to be slightly lower than after the mission.
- Before the mission, the prediction of trust is 1.9% higher when using the RIS construct than when using the TTP construct. This difference is smaller compared to single transfer learning (Section IV-B). After the mission, both constructs give similarly high outcomes of about 89%.
- The best performing double transfer learning trust prediction model was given by the CNN fine-tuned on the PostRIS trust construct; it achieved an accuracy of 89%.

E. PREDICTION OF TEAM FAMILIARITY USING DOUBLE TRANSFER LEARNING
This experiment was a reversed version of the experiment described in Section IV-D. Our aim was to find out whether the addition of related prerequisite knowledge of interpersonal trust can improve the prediction of the pre-mission familiarity between team members.

F. JOINT PREDICTION OF TRUST AND TEAM FAMILIARITY USING SINGLE TRANSFER LEARNING
To observe the difference between double transfer learning between trust and familiarity and a simultaneous prediction of these two categories, we trained a separate CNN model to perform a joint prediction of trust and familiarity based on single transfer learning. Unlike double transfer learning, where there was a gradual buildup of cognition components, in the multilabel classification all components were generated simultaneously. The speech labels and the numbers of spectrogram images for each class used in training are shown in Table IV. The results in Fig. 5 show the following:
• The achieved accuracy ranged between 80% and 84%, depending on the type of the applied trust construct. Given that, for the four classes, the chance level was 25%, these results are relatively high. This confirms that both trust and familiarity affect speech acoustics.
• The RIS-based prediction was higher than the TTP-based prediction (by 5% for the PreRIS and by 1.2% for the PostRIS).
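The four joint classes arise from crossing the two binary categories. A small illustrative sketch (class names and the labeling helper are hypothetical, not from the paper) of forming such joint labels and the corresponding chance level:

```python
# Hypothetical joint labeling: trust level x partner familiarity -> 4 classes.
trust_levels = ["Low", "High"]
familiarity = ["Known", "Unknown"]

joint_classes = [f"{t}-{f}" for t in trust_levels for f in familiarity]
# ['Low-Known', 'Low-Unknown', 'High-Known', 'High-Unknown']

def joint_label(trust, familiar):
    """Map a (trust, familiarity) pair to a single class index."""
    return joint_classes.index(f"{trust}-{familiar}")

# With four equiprobable classes, random guessing yields 25% accuracy,
# the baseline against which the 80%-84% results are judged.
chance = 1.0 / len(joint_classes)
```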

G. PREDICTION OF CHANGE IN TRUST
One of the observations given by the data characteristics in Fig. 1 is that some of the participants changed their trust in partners after the mission. The corresponding class labels are listed in Table V. The results in Fig. 6 show that the trust change prediction based on the RIS construct achieved 86% accuracy, thus outperforming the TTP-based prediction by 3.59%. In both cases, the prediction accuracy was above 80%, confirming that speech classification can be used to obtain a good prediction of trust change.
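Trust change labels can be derived by comparing each participant's pre- and post-mission survey scores. A minimal sketch, assuming a simple thresholded difference (the function, threshold, and label names are illustrative; the paper's exact labeling rule is not reproduced here):

```python
def trust_change_label(pre_score, post_score, eps=0.0):
    """Hypothetical labeling of trust change from pre/post survey scores.
    `eps` is an optional dead-band treating small shifts as no change."""
    delta = post_score - pre_score
    if delta > eps:
        return "Increased"
    if delta < -eps:
        return "Decreased"
    return "Unchanged"
```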

H. JOINT PREDICTION OF TRUST, TEAM FAMILIARITY, AND TRUST CHANGE
In this experiment, we tested a more complex classification scenario of simultaneous prediction of trust, familiarity, and trust change, given eight different class labels, as listed in Table VI. Because there was no data available to represent some of the classes, the classification was imbalanced; therefore, the results of this experiment shown in Fig. 7 include both accuracy and F-scores. Both the accuracy and F-score values in Fig. 7 are within a narrow range of 80%-82%, indicating that in all four cases (PreTTP, PostTTP, PreRIS, and PostRIS) the classification outcomes were very similar. Given that the accuracy and F-score values were very close to each other, the class imbalance did not significantly affect the overall performance, and there was a good balance between false positive and false negative classification outcomes. As expected, the increased number of classes led to a slightly reduced classification accuracy (by about 1%) compared to the joint prediction of the two categories (trust and familiarity) in Fig. 5.
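Reporting both metrics matters because, under class imbalance, accuracy alone can be misleading while the macro F-score penalizes a model that ignores minority classes. A self-contained toy illustration (the data are fabricated for the example, not from the study): a majority-class predictor on a 9:1 split scores 90% accuracy but a macro F1 of only about 0.47, whereas the two metrics agree when all classes are handled well, as in Fig. 7.

```python
def accuracy(y_true, y_pred):
    """Fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Imbalanced toy data: predicting only the majority class inflates accuracy.
y_true = ["A"] * 9 + ["B"]
y_pred = ["A"] * 10
```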

V. CONCLUSION
We have investigated acoustic speech characteristics as indicators of interpersonal trust and familiarity. A number of CNN models were trained to differentiate between classes of trust, team members' familiarity with each other, and trust change after a team-based problem-solving mission. The experimental results provided a strong indication that all three categories can be efficiently predicted from speech. In addition, we have shown that models with prerequisite knowledge of team members' familiarity can be trained more efficiently to predict trust between team members than models without such knowledge. Similarly, models with prerequisite knowledge of trust between team members can be trained more efficiently to predict familiarity.
In general terms, the results of our study indicate that the more related prerequisite knowledge is embedded in the model, the faster and less data-consuming the learning of a new task becomes. Using an analogy with human learning, a person who already knows three languages is likely to learn a new language faster and from fewer examples than a person who knows only one; likewise, a tennis player can most likely learn baseball faster than a person who has never played any sport. In both examples, prior, task-related knowledge is the key factor increasing learning efficiency.
One important general implication of our observations is that embedding prerequisite, related knowledge into the model can be viewed as a way of dealing with the scarcity of task-specific training data. Instead of increasing the training set size by generating synthetic or augmented data, the model can be enriched with prerequisite knowledge and learn from a small but real dataset.
Our future research will investigate applications of the findings described here to human-machine interaction. We will also look into potential applications of multiple transfer learning to counteract data imbalance in the automatic monitoring of human-machine interactions.