Multimodal Assessment of Schizophrenia Symptom Severity From Linguistic, Acoustic and Visual Cues

Accurately assessing the condition of every schizophrenia patient normally requires lengthy and frequent interviews with professionally trained clinicians. To alleviate the time and manual burden on these mental health professionals, this paper proposes a multimodal assessment model that predicts the severity level of each symptom defined in the Scale for the Assessment of Thought, Language, and Communication (TLC) and the Positive and Negative Syndrome Scale (PANSS) from the patient's linguistic, acoustic, and visual behavior. The proposed deep-learning model consists of a multimodal fusion framework and four unimodal transformer-based backbone networks. A second-stage pre-training is introduced so that each off-the-shelf pre-trained model learns the patterns of schizophrenia data more effectively and extracts the desired features from the viewpoint of its own modality. Next, the pre-trained parameters are frozen, and light-weight trainable adapter modules are inserted and fine-tuned to keep the number of trainable parameters low while maintaining strong performance. Finally, the four adapted unimodal backbones are fused into a single multimodal assessment model through the proposed multimodal fusion framework. For validation, we train and evaluate the proposed model on interviews with schizophrenia patients recruited from National Taiwan University Hospital; it achieves 0.534 MAE and 0.685 MSE, outperforming related works in the literature. The experimental results, ablation studies, and comparisons with related multimodal assessment works demonstrate both the superior performance of our model and the effectiveness of our approach to extracting and integrating information from multiple modalities.


I. INTRODUCTION
SCHIZOPHRENIA is a severe psychotic disorder. According to the Diagnostic and Statistical Manual of Mental Disorders, fifth edition (DSM-5), published by the American Psychiatric Association, schizophrenia causes abnormalities in one or more of the following domains: delusions, hallucinations, disorganized thinking, grossly disorganized or abnormal motor behavior, and negative symptoms [1]. Distorted thinking and perception, the characteristic symptoms of schizophrenia, can manifest in a variety of ways, including through the content of a patient's speech, their prosody, their expressive gestures, and their facial expressions during conversation.
Symptom assessment of schizophrenia is primarily done through the psychiatric interview, in which a clinician has a one-on-one, face-to-face conversation with a patient. The Positive and Negative Syndrome Scale (PANSS) [2] and the Scale for the Assessment of Thought, Language and Communication (TLC) [3] are the two scales commonly used to rate the symptom severity of schizophrenia in both clinical and research settings. Each item in these scales quantitatively measures a different feature of schizophrenia and assigns a score based on the severity of the symptom. After the interview, the clinician completes the ratings, and these scales serve as a basis for diagnosis and for measuring treatment efficacy.
However, the interview normally takes up to half an hour and requires a professionally trained medical expert to conduct the follow-up assessment. Also, since schizophrenia is typically a chronic mental disorder with a high relapse rate, regular interviews are needed to track the status of symptoms. Thus, the time and workforce of mental health professionals can hardly meet the need to comprehensively assess all schizophrenia patients, now or in the future, including those with suspected early psychosis. To address these issues, many works on automatic schizophrenia symptom assessment systems have been proposed [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. These works utilize machine learning techniques to automatically evaluate the severity of schizophrenia symptoms based solely on the audio or video recordings of interviews with psychiatric mental health nurses, without the involvement of human experts. This significantly alleviates the burden on mental health professionals, so they can take better care of inpatients in acute episodes. On the other hand, for many schizophrenia symptoms the severity rating depends on each clinician's subjective judgment, and hence different clinicians may arrive at different results. Thus, by using the same automatic assessment system, the problem of low inter-rater reliability can potentially be avoided.
Generally, clinicians rely on three different modalities, namely linguistic, acoustic, and visual cues, to assess the severity of symptoms during patient interviews. Linguistic information plays a crucial role in investigating hallucinations and delusions, as patients tend to verbally express their subjective experiences during mental status examinations. Acoustic and behavioral cues provide valuable insights into the impact of symptoms on patients' emotions, behaviors, and judgments. For instance, blunted affect can be observed through monotonous speech, restricted facial expressions, and decreased gestures, observations to which skilled interviewers are particularly attuned. Alogia, characterized by reduced spontaneity and verbal output, can also be evaluated through linguistic cues during conversation. In summary, evaluating the core psychopathology of schizophrenia patients requires the integration of linguistic, acoustic, and visual behavioral information, enabling mental health professionals to comprehensively assess the severity of symptoms.
Although various methods have been proposed for automatic schizophrenia symptom assessment, most of them focus on only one or two of the modalities, with the exception of [21], which analyzed all three modalities simultaneously. Furthermore, to the best of our knowledge, almost all previous works are based on relatively simple traditional machine learning techniques rather than deep learning, which has driven the recent success of artificial intelligence. In current clinical practice, one cannot comprehensively evaluate the full spectrum of schizophrenia symptoms without considering information from all three modalities. Moreover, without recent advanced deep learning techniques, an automatic assessment system will not have sufficient capacity to model the complexity of schizophrenia.
Given the above observations, this research aims to build a transformer-based deep learning model that automatically assesses the severity of schizophrenia symptoms based on all three essential modalities, i.e., linguistic, acoustic, and visual cues, collected in psychiatric interviews. Specifically, the inputs to the model are a video recording, an audio recording, and a text transcript, and the output is the corresponding predicted TLC and PANSS ratings indicating symptom severity.
The objectives of this paper are as follows.
• Fine-grained assessment of severity scales beyond negative symptoms: Existing works on the automatic assessment of schizophrenia focus on detection, e.g., whether a person has schizophrenia or whether symptoms are observable, which are coarse-grained binary classification tasks. Instead, we aim to regress the fine-grained severity levels of the symptoms in the TLC and PANSS scales.
• Transformer-based assessment model based on linguistic, acoustic, and visual cues: We incorporate all three modalities present in the interview recordings into an automatic schizophrenia assessment model. In addition, transformer-based architectures are well suited to processing sequential data, and the external knowledge they acquire during pre-training in different fields can further enhance our model.
• A novel multimodal fusion strategy for assessing symptoms of schizophrenia: The linguistic features, including semantics and syntax, are the most important cues for schizophrenia. On the other hand, the acoustic and visual features serve as auxiliary cues that can provide complementary information to the linguistic features. Thus, we introduce a novel text-centered fusion strategy to effectively integrate the unimodal information.
• Parameter-efficient fine-tuning: Since we incorporate one large pre-trained model for each modality, naively fine-tuning them all together is prohibitively expensive in terms of computational resources, and the model would easily overfit our small schizophrenia dataset. Hence, we propose to train our multimodal model with parameter-efficient fine-tuning to address these problems.

II. RELATED WORK
In this section, we briefly review related works on the automatic assessment of schizophrenia: using speech recordings in Section II-A, and using both speech and video as a multimodal system in Section II-B.

A. Automatic Assessment Using Speech
Several works have focused on the automatic assessment of schizophrenia based on information extracted from speech recordings of conversations between patients and clinicians during psychiatric interviews. Mota et al. [4] and Xu et al. [14], [15] proposed to classify individuals as either schizophrenia patients or healthy controls using binary classifiers such as linear regression, Naïve Bayes, and support vector machines (SVM) based on linguistic features computed from transcripts of recordings, including speech graphs [22], linguistic inquiry and word count (LIWC) [23], Diction [24], and Doc2Vec [25]. Instead of extracting linguistic features from transcripts, Tahir et al. [16], [17] proposed evaluating schizophrenia symptom severity with an SVM based on acoustic features, including prosodic and conversational cues computed directly from speech recordings, e.g., volume, Mel-frequency cepstral coefficients (MFCC), response time, mutual silence percentage, and interruptions. Since both linguistic and acoustic features contain information related to schizophrenia symptoms, recent works [18], [19] have combined both methods and achieved more robust performance. In the assessment model proposed by Huang et al. [10], besides the linguistic semantics of transcripts and the acoustic features of audio recordings, additional syntactic features are incorporated to capture the grammatical patterns of the transcripts. In addition, unlike previous works using conventional machine learning, they fine-tune entire large pre-trained transformer-based deep learning models to learn all the features and predict schizophrenia symptom severity levels in an end-to-end manner. However, fine-tuning entire models requires extensive computational resources and is prone to overfitting on small datasets.

B. Automatic Assessment Using Multimodalities
Recently, a few works have started focusing on automatic schizophrenia assessment based on both acoustic features from speech recordings and visual features from video recordings simultaneously, which is more comprehensive and closer to how clinicians evaluate patients' symptoms. Siriwardena et al. [9] proposed to classify schizophrenia patients from healthy subjects using facial action units (FAU) as visual features and vocal tract variables (VTV), which describe the movements of speech organs such as the lips and tongue during speaking, as acoustic features.
Siriwardena et al. [7] extended this work with deep learning techniques. They adopted two convolutional neural networks (CNNs) as backbone networks to process the vocal tract variables and FAU respectively, and the intermediate features from both CNNs were then concatenated and passed through fully-connected (FC) layers to predict the output. Saga et al. [8] directly adopted pre-defined features such as the frequency, bandwidth, and volume of voice signals as acoustic features, which are concatenated with FAU as input features for a random forest classifier to distinguish schizophrenia patients from healthy subjects. Xu et al. [21] built an ensemble model based on linguistic, acoustic, and visual low-level hand-crafted features, which is the first work to incorporate tri-modal information into schizophrenia assessment. However, although these automatic assessment systems consider multimodal information, they are all based on relatively simple traditional machine learning techniques rather than deep learning, which has driven the recent success of artificial intelligence. Hence, in this work, we aim to build a multimodal transformer-based model that takes linguistic (including semantic and syntactic), acoustic, and visual cues as inputs to comprehensively assess schizophrenia.

III. PROPOSED METHOD

A. Overview
The overall structure of the proposed multimodal model for schizophrenia symptom severity assessment is illustrated in Figure 1. It is composed of four unimodal large pre-trained transformer-based backbone networks, i.e., semantic language backbone (purple), syntactic language backbone (blue), audio backbone (orange), and vision backbone (red), and a multimodal fusion framework. The three inputs to the multimodal assessment model come from acoustic, visual and linguistic modalities, respectively, i.e., audio recording, video recording, and transcription. The output of the model is a 31-dimensional vector, where each element is a real number that corresponds to one predicted symptom severity score defined in TLC or PANSS scales.
In the following sections, we first describe the four unimodal backbone networks. The semantic language backbone (Adapter-BERT) is described in Section III-B, the syntactic language backbone (Adapter-ELECTRA) is described in Section III-C, the audio backbone (Adapter-TERA) is described in Section III-D, and the visual backbone (Adapter-Swin) is described in Section III-E. Finally, we describe the multimodal fusion framework, including the fusion layer and prediction head, and the parameter-efficient fine-tuning in Section III-F.

B. Semantic Language Backbone Network
Since schizophrenia is characterized by an abnormal state of mind, the semantic meaning of what patients say in response to the interviewer's questions is one of the most important cues for assessing the severity of most schizophrenia symptoms. To this end, as shown in the first training stage in Figure 3, we first adopt Chinese-BERT-wwm [26], a large Chinese pre-trained BERT model containing 12 layers of transformer encoder blocks with an embedding size of 768 dimensions, to process the textual input. The textual input is a sequence of sentences that alternate between the patient and the clinician.
It is worth noting that there is an obvious domain gap between the pre-training data and the downstream data, which can make the pre-trained representations unsuitable for direct use in the downstream supervised task of assessing schizophrenia symptom severity. As a result, as shown in the second training stage in Figure 3, we conduct a second-stage self-supervised pre-training using our schizophrenia data before supervised fine-tuning, so that the domain gap can be mitigated by this intermediate training step. Following the original pre-training task [27], we adopt masked language modeling (MLM) as the second-stage self-supervised pre-training task for Chinese-BERT-wwm.
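To make this step concrete, the following is a minimal sketch of such a second-stage MLM pre-training run, assuming the HuggingFace checkpoint "hfl/chinese-bert-wwm" and the standard token-level MLM collator (a simplification of whole-word masking); `TranscriptDataset` and `transcript_sentences` are hypothetical stand-ins for our transcript data, not the released implementation.

```python
from torch.utils.data import Dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

class TranscriptDataset(Dataset):
    """Tokenized interview sentences for second-stage MLM pre-training (hypothetical)."""
    def __init__(self, sentences, tokenizer, max_len=128):
        self.enc = tokenizer(sentences, truncation=True, max_length=max_len)
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}

tokenizer = BertTokenizerFast.from_pretrained("hfl/chinese-bert-wwm")
model = BertForMaskedLM.from_pretrained("hfl/chinese-bert-wwm")

transcript_sentences = ["..."]  # placeholder: de-identified patient/clinician sentences
dataset = TranscriptDataset(transcript_sentences, tokenizer)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert_sspt", num_train_epochs=3,
                         per_device_train_batch_size=64, learning_rate=5e-5,
                         warmup_steps=30)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```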
However, each backbone contains too many trainable parameters to naively fine-tune the entire multimodal assessment model. Instead, we conduct parameter-efficient fine-tuning for the multimodal assessment model to drastically reduce the number of trainable parameters. To do so, each unimodal large pre-trained model is modified with the adapter modules proposed by Houlsby et al. [28], which were originally designed for BERT in natural language processing tasks, before being incorporated into the multimodal model.
As shown in the third training stage in Figure 3, for the semantic language backbone network described in this section, we take Chinese-BERT-wwm after second-stage pre-training as the initialization, and freeze all the pre-trained parameters in every self-attention layer and multi-layer perceptron (MLP) inside each transformer encoder block. Then, light-weight trainable adapter modules are inserted right after every self-attention layer and MLP, as shown in Figure 2. The adapter is a bottleneck structure consisting of two trainable fully-connected layers with a non-linear activation function in between and a residual connection. Hence, only the adapter modules and the layer normalization layers in the modified pre-trained BERT model remain to be fine-tuned, and we refer to this modified BERT as Adapter-BERT. The bottleneck dimension is set to 48, so that the total number of trainable parameters is reduced to 1% of the original pre-trained parameters, allowing Adapter-BERT to be fine-tuned together with the other three backbones in the multimodal assessment model without exceeding the computational limit or risking overfitting on the small downstream dataset.
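The following is a minimal PyTorch sketch of such an adapter module and the freezing step, under the description above (a bottleneck of two fully-connected layers, a non-linearity, and a residual connection, with bottleneck dimension 48 for Adapter-BERT); class and function names are our own illustrations, not the paper's released code.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection, non-linearity, up-projection, residual."""
    def __init__(self, hidden_dim=768, bottleneck_dim=48):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project to the bottleneck
        self.act = nn.GELU()                               # non-linear activation
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))         # residual connection

def mark_trainable(model):
    """Freeze everything except adapters and LayerNorm parameters (assumes the
    adapter submodules are registered under names containing 'adapter')."""
    for name, p in model.named_parameters():
        p.requires_grad = ("adapter" in name) or ("LayerNorm" in name)
```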
In order to aggregate all the sentence-level features into a single conversation-level semantic feature that summarizes the entire dialogue, a conversation-level self-attention layer is added on top of Adapter-BERT; its i-th input token is the sentence-level semantic feature vector of the i-th sentence in the dialogue.
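A hedged sketch of this conversation-level aggregation is given below: the sentence-level feature vectors are treated as the tokens of a single self-attention layer, and the final mean pooling is one simple choice of summary vector assumed here for illustration.

```python
import torch
import torch.nn as nn

conv_attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

sentence_feats = torch.randn(1, 40, 768)     # e.g., 40 sentence-level vectors from one interview (dummy)
conv_feats, _ = conv_attn(sentence_feats, sentence_feats, sentence_feats)
dialogue_embedding = conv_feats.mean(dim=1)  # one conversation-level semantic vector (assumed pooling)
```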

C. Syntactic Language Backbone Network
Syntax refers to the grammatical arrangement of words and phrases in sentences. As abnormal thought, language, and communication behaviors are characteristic of schizophrenia, the syntactic patterns of how patients talk are, besides semantics, another important cue for assessing the severity of many schizophrenia symptoms. To this end, as shown in the first training stage in Figure 4, we adopt Chinese-ELECTRA [29], a Chinese pre-trained ELECTRA model containing 12 layers of transformer encoder blocks with an embedding size of 256 dimensions, as the initialization of the second language backbone network to process the textual input. The textual input is the same as for the semantic language backbone.
However, the self-supervised pre-training task of ELECTRA does not specifically guide the model to learn the syntactic representations that we want the syntactic language backbone to focus on. Therefore, as shown in the second training stage in Figure 4, following the training strategy proposed by Kitaev et al. [30], we conduct a second-stage supervised pre-training on the constituency tree parsing task to train Chinese-ELECTRA to extract the syntactic structure of an input sentence. The annotated dataset is Chinese Treebank 8.0 [31], which contains annotated and parsed sentences collected from Chinese articles.
Finally, as shown in the third training stage in Figure 4, we take Chinese-ELECTRA-small after second-stage pre-training as the initialization, and freeze all the pre-trained parameters inside each transformer encoder block. Then, light-weight trainable adapter modules are inserted in the same way as for Adapter-BERT, shown in Figure 2. We refer to this modified ELECTRA as Adapter-ELECTRA and make it the syntactic language backbone in our multimodal assessment model. The bottleneck dimension of the adapter modules inserted in Adapter-ELECTRA is set to 16.

D. Audio Backbone Network
The acoustic characteristics of schizophrenia patients' speech can serve as an important auxiliary cue for assessing the severity of some symptoms. Although, as mentioned in the previous two sections, the linguistic information of the conversation content is the most important cue for detecting the presence of schizophrenia symptoms, there are other modalities of information that cannot be recorded in transcripts, including the acoustic information on which this section focuses. To this end, as shown in the first training stage in Figure 5, we adopt TERA [32], a large pre-trained transformer-based speech model containing 3 layers of transformer encoder blocks with an embedding size of 768 dimensions, to process the acoustic input. The acoustic input is an audio recording of a psychiatric interview. TERA is responsible for extracting acoustic information related to schizophrenia from the audio recording and outputting it as a feature vector.
Following the same idea of second-stage self-supervised pre-training for our semantic language backbone, as described in Section III-B, we also conduct a second-stage self-supervised pre-training for TERA on our schizophrenia audio recordings before fine-tuning it for the supervised symptom severity assessment, as shown in the second training stage in Figure 5.
As in the original pre-training, we adopt altered acoustic frame reconstruction as the second-stage self-supervised pre-training task for TERA. Its objective is to reconstruct acoustic frames of the schizophrenia audio recordings that are randomly altered along three different dimensions, i.e., time, frequency, and magnitude. By reconstructing the altered blocks of the spectrogram from their past and future context, the TERA model learns contextualized representations of audio signals.
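The following is a deliberately simplified sketch of this alteration-and-reconstruction objective, showing only the time-alteration case on a Mel-spectrogram batch; the small stand-in encoder replaces the actual pre-trained TERA model, and the block size and alteration ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def time_mask(spec, block=7, p=0.15):
    """Zero out random contiguous time blocks of a (batch, time, mel) spectrogram."""
    altered = spec.clone()
    t = spec.size(1)
    for b in range(spec.size(0)):
        for _ in range(int(p * t / block)):
            start = torch.randint(0, max(1, t - block), (1,)).item()
            altered[b, start:start + block, :] = 0.0
    return altered

# Stand-in encoder; in the actual model this role is played by the pre-trained TERA transformer.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=8, batch_first=True), num_layers=2)

spec = torch.randn(4, 400, 80)        # dummy Mel-spectrogram batch
recon = encoder(time_mask(spec))      # reconstruct the original frames from the altered input
loss = F.l1_loss(recon, spec)         # reconstruction objective
```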
Finally, as shown in the third training stage in Figure 5, we take TERA after second-stage pre-training as the initialization, and freeze all the pre-trained parameters inside each transformer encoder block. Then, light-weight trainable adapter modules are inserted in the same way as for Adapter-BERT, shown in Figure 2. Hence, in the modified pre-trained TERA, only the adapter modules and layer normalization layers need to be fine-tuned. We refer to this modified TERA as Adapter-TERA, and make it the audio backbone in our multimodal assessment model. The bottleneck dimension of the inserted adapter modules is set to 48.

E. Vision Backbone Network
Visual information about the physical manifestations of schizophrenia patients' affective tone and emotional responsiveness can also serve as an important auxiliary cue for assessing the severity of some symptoms. In addition to the acoustic information described in Section III-D, the visual modality is another source of information not captured in the textual transcript. To this end, as shown in the first training stage in Figure 6, we adopt Swin [33], a large pre-trained transformer-based vision model, to process the visual input. The visual input is a video recording of the front view of a patient's face, shoulders, and upper body during a psychiatric interview, i.e., a sequence of video frames. We take one video frame as one input sample to Swin. Swin is then responsible for extracting visual information related to schizophrenia from the video recording and outputting it as a feature vector. Thus, after all the video frames are processed by Swin, we obtain a sequence of feature vectors whose length equals the number of video frames in the video recording.
Following the same idea of second-stage self-supervised pre-training as for our semantic language backbone and audio backbone, we also conduct a second-stage self-supervised pre-training for Swin on our schizophrenia video recordings before fine-tuning it for the supervised symptom severity assessment, as shown in the second training stage in Figure 6. We adopt masked image modeling [34] as the second-stage self-supervised pre-training task for Swin. Input patches are randomly selected with 50% probability and masked to zero, and the objective is to correctly predict the original pixel values of the masked patches based on their neighboring patches.
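A simplified sketch of this masked-image-modeling step is shown below; the patch size and the `swin`/`decoder` handles in the commented usage lines are hypothetical, while the 50% masking ratio and the loss on masked patches follow the description above.

```python
import torch
import torch.nn.functional as F

def mask_patches(frames, patch=4, ratio=0.5):
    """frames: (B, C, H, W). Zero out a random `ratio` of non-overlapping patches."""
    b, _, h, w = frames.shape
    mask = (torch.rand(b, 1, h // patch, w // patch) < ratio).float()
    mask = F.interpolate(mask, scale_factor=patch, mode="nearest")  # back to pixel resolution
    return frames * (1.0 - mask), mask

frames = torch.randn(2, 3, 224, 224)   # dummy video frames
masked, mask = mask_patches(frames)
# pred = decoder(swin(masked))                    # pre-trained Swin plus a light reconstruction head (hypothetical)
# loss = F.l1_loss(pred * mask, frames * mask)    # loss computed on the masked patches only
```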
As shown in the third training stage in Figure 6, we take Swin after second-stage pre-training as the initialization, and freeze all the pre-trained parameters in every self-attention layer and MLP inside each transformer encoder block. Then, light-weight trainable adapter modules are inserted in the same way as for Adapter-BERT, shown in Figure 2. Hence, in the modified pre-trained Swin, only the adapter modules and layer normalization layers need to be fine-tuned. We refer to this modified Swin as Adapter-Swin and make it the vision backbone in our multimodal assessment model. The bottleneck dimension of the inserted adapter modules is set to 48.

F. Multimodal Fusion Framework
Now, the four unimodal large transformer-based backbone networks have all been pre-trained in two stages and adapted to be ready for parameter-efficient fine-tuning. The final step for building our multimodal assessment model is to fuse the unimodal representations extracted by each of these four backbones and then fine-tune them jointly to predict the schizophrenia symptom severity based on the fused multimodal features.
As stated in the previous sections, each modality carries different importance for assessing schizophrenia symptoms. Since schizophrenia is characterized by an abnormal state of mind, the linguistic features, including semantics and syntax, are the most important cues for symptom assessment. On the other hand, the acoustic and visual features serve as auxiliary cues that can provide supplementary or complementary information to the linguistic features, which is often useful for symptoms related to prosody and emotion.
Therefore, as illustrated in the green part of Figure 1, we fuse the acoustic and visual features into an auxiliary embedding, and then pass it into the adapter modules of the two language backbones as a piece of additional information to the semantic and syntactic features. These enhanced linguistic embeddings are the key features of schizophrenia patients, and finally the prediction head regresses the symptom severity scores based on these two multimodal linguistic features.
The bottom part of the fusion framework is shown in Figure 7. To fuse the acoustic and visual features into an auxiliary embedding, all the hidden states output by the same transformer layer in the audio or video backbone are averaged into an aggregated feature vector, and a learnable weighted sum then integrates all the aggregated feature vectors into a fused auxiliary feature vector containing both acoustic and visual information. The upper part is shown in Figure 8. To enhance the semantic and syntactic language features with additional information from the acoustic and visual modalities, the fused auxiliary feature vector is linearly projected down to the same dimension as the adapter bottleneck by learnable fully-connected layers, and it is then added to the linguistic hidden representations at the bottleneck of each adapter in the semantic and syntactic language backbones.
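The following PyTorch sketch illustrates this fusion path under the description above: per-layer hidden states from the auxiliary backbones are mean-pooled over the sequence dimension, combined by a learnable weighted sum, projected to the adapter bottleneck dimension, and added at the bottleneck of a language adapter. The dimensions and the assumption that the auxiliary backbones share a feature size are illustrative, not the exact configuration.

```python
import torch
import torch.nn as nn

class AuxiliaryFusion(nn.Module):
    """Mean-pool each layer's hidden states, weighted-sum across layers, project to the bottleneck."""
    def __init__(self, num_layers, feat_dim, bottleneck_dim=48):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # learnable weighted sum
        self.proj = nn.Linear(feat_dim, bottleneck_dim)             # down-projection to the bottleneck

    def forward(self, layer_states):
        # layer_states: list of (batch, seq_len, feat_dim), one entry per transformer layer
        pooled = torch.stack([h.mean(dim=1) for h in layer_states], dim=0)  # (L, B, D)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1)
        return self.proj((w * pooled).sum(dim=0))                           # (B, bottleneck)

class FusedAdapter(nn.Module):
    """Adapter whose bottleneck also receives the fused auxiliary embedding."""
    def __init__(self, hidden_dim=768, bottleneck_dim=48):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x, aux):
        z = self.act(self.down(x) + aux.unsqueeze(1))   # add the auxiliary cue at the bottleneck
        return x + self.up(z)

fusion = AuxiliaryFusion(num_layers=6, feat_dim=768)
aux = fusion([torch.randn(1, 100, 768) for _ in range(6)])   # dummy audio/video layer states
out = FusedAdapter()(torch.randn(1, 64, 768), aux)           # enhanced linguistic hidden states
```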
Finally, the two features output by the semantic and syntactic language backbones, which contain the primary linguistic information and the auxiliary acoustic and visual information, are concatenated and passed into a prediction head, a three-layer MLP, to regress the schizophrenia symptom severity scores. We then calculate the mean absolute error loss against the ground-truth labels annotated by psychiatrists.
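A minimal sketch of this prediction head, assuming 768-dimensional semantic and 256-dimensional syntactic conversation-level features and an illustrative hidden width of 512, is given below.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Three-layer MLP regressing the 31 TLC/PANSS severity scores."""
    def __init__(self, sem_dim=768, syn_dim=256, hidden_dim=512, num_scores=31):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sem_dim + syn_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_scores),
        )

    def forward(self, sem_feat, syn_feat):
        return self.mlp(torch.cat([sem_feat, syn_feat], dim=-1))

head = PredictionHead()
pred = head(torch.randn(1, 768), torch.randn(1, 256))
loss = nn.L1Loss()(pred, torch.randn(1, 31))   # MAE against the psychiatrist-rated scores (dummy targets)
```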

IV. RESULTS
In this section, the dataset details are first presented in Section IV-A, and the training and implementation details are described in Section IV-B. In Section IV-C, we compare the performance of our model with related works on multimodal assessment for schizophrenia patients. In Section IV-D, we conduct a series of ablation studies to evaluate the effectiveness of each design in our model.

A. Datasets
To collect the data for automatic assessment of schizophrenia symptom severity, 37 interviews of 26 schizophrenia patients recruited from the psychiatry adult day care unit of National Taiwan University Hospital were recorded. The protocol and consent were approved by the IRB of National Taiwan University and National Taiwan University Hospital. The patients' ages range from 23 to 54 years, with an average of 34.3; 9 of them are male and 17 are female. All the subjects met the criteria of schizophrenia defined in DSM-5 [1], and none of them was in an acute episode. For each subject, we recorded the psychiatric interview with a camera to collect audio and video data, and the textual transcription of the conversation was then manually transcribed from the audio recording. Figure 9(a) depicts the setup of the recording environment during the clinical interview, showing both the top-down view and the lateral view. Figure 9(b) illustrates the raw signals that are input to our system. In the visual modality, we take multiple frames obtained by downsampling the original video as input to the vision backbone network. For the acoustic modality, we first convert the raw audio signal into the format defined by Adapter-TERA, namely the Mel spectrogram, and treat it as the input to the audio backbone network. In the linguistic modality, the textual input to the semantic and syntactic backbone networks consists of a transcript, which is a Chinese dialogue between an occupational therapist and a patient. After the interview, a psychiatrist rated the severity scores for selected items in the TLC and PANSS scales, as listed in Table I and Table III.
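For illustration, the following sketch shows one way to prepare the raw inputs described above with torchaudio and torchvision; the file names, Mel-spectrogram settings, and one-frame-per-second downsampling rate are assumptions, not the exact preprocessing used in our pipeline.

```python
import torch
import torchaudio
import torchvision

# Acoustic modality: raw audio to a log Mel spectrogram for Adapter-TERA.
waveform, sr = torchaudio.load("interview_001.wav")                       # hypothetical file
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=80)(waveform)
log_mel = torch.log(mel + 1e-6).transpose(1, 2)                           # (channels, time, mel bins)

# Visual modality: downsample the video into a sequence of frames for Adapter-Swin.
frames, _, info = torchvision.io.read_video("interview_001.mp4", pts_unit="sec")
frames = frames[:: int(info["video_fps"])]                                # keep roughly one frame per second
```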

B. Implementation Details
For the second-stage pre-training, we adopt AdamW [35] as the optimizer for updating the model parameters, where the learning rate is linearly warmed up to 5e-5 over the first 30 steps. We pre-train the backbones for 3 epochs, and set the batch size to 64 for BERT and ELECTRA, 16 for TERA, and 32 for Swin, so that each pre-training setup fits on a single 1080Ti GPU. For parameter-efficient fine-tuning, we also adopt AdamW as the optimizer for updating the trainable parameters in the multimodal assessment model. The learning rate is set to 0.001 and no learning rate scheduler is used. Our multimodal model is trained for 240 epochs with a batch size of one sample, so that this gigantic multimodal model composed of four large pre-trained transformer-based backbones can also be trained on a single 1080Ti GPU.
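The two optimization setups can be expressed as in the sketch below; the placeholder module and the warmup-only scheduler are illustrative assumptions based on the settings stated above.

```python
import torch

model = torch.nn.Linear(768, 31)   # placeholder for the backbone / multimodal model being trained

# Second-stage pre-training: AdamW with the learning rate linearly warmed up to 5e-5 over 30 steps.
pretrain_opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
warmup = torch.optim.lr_scheduler.LambdaLR(
    pretrain_opt, lr_lambda=lambda step: min(1.0, (step + 1) / 30))

# Parameter-efficient fine-tuning: AdamW with lr 0.001, no scheduler, batch size of one sample.
finetune_opt = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)
```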

C. Experimental Results
Due to the small size of the dataset, we trained our model using leave-one-out cross-validation and took the average to validate its generalization ability; the related works were reproduced in the same way. Specifically, we re-implement the bimodal assessment models [7], [8], [9], [10], as introduced in Section II, by replacing their classification parts with regression heads. To compare the overall performance of our multimodal model with the related works, we calculated the mean absolute error (MAE) and mean squared error (MSE) of each symptom, and then averaged the MAEs and MSEs over the 16 TLC symptoms, the 15 PANSS symptoms, and all 31 symptoms in both scales, respectively; the results are shown in Table I. We outperform all the related works and achieve an average of 0.534 MAE and 0.685 MSE over all 31 symptom severity scores predicted by the assessment model. The detailed results of every TLC and PANSS symptom are listed in Table VIII.
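The evaluation protocol can be sketched as follows; `build_model`, `interviews`, and the fit/predict interface are hypothetical stand-ins for our training pipeline and data, not the actual code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

def loocv_scores(interviews, labels, build_model):
    """labels: (n_samples, 31) array of psychiatrist ratings; returns averaged MAE and MSE."""
    preds = np.zeros_like(labels, dtype=float)
    for train_idx, test_idx in LeaveOneOut().split(interviews):
        model = build_model()                                        # fresh model per fold
        model.fit([interviews[i] for i in train_idx], labels[train_idx])
        preds[test_idx] = model.predict([interviews[i] for i in test_idx])
    mae = np.abs(preds - labels).mean(axis=0)                        # per-symptom MAE
    mse = ((preds - labels) ** 2).mean(axis=0)                       # per-symptom MSE
    return mae.mean(), mse.mean()                                    # averaged over the 31 symptoms
```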
The performance of the related methods in the first three rows is significantly worse than the others, possibly because they only consider the acoustic and visual modalities, not the linguistic modality, which is the most important basis for assessing schizophrenia symptoms. Since [10] considers both the semantic and syntactic language features and incorporates large pre-trained models as backbones, its performance is significantly better than the first three related works even though it does not include visual information. This shows the importance of textual inputs and large pre-trained models when training models to assess schizophrenia symptoms using a small amount of data. Furthermore, we take the visual modality into consideration and adopt the training techniques of second-stage self-supervised pre-training and parameter-efficient fine-tuning, which allow us to achieve the best performance among all related works.

TABLE VIII: DETAILED RESULTS FOR TLC AND PANSS REGRESSION
It is reasonable that the system performed better on TLC than on PANSS. Professionally trained doctors also find it more comfortable to rate TLC scores based on video clips and textual transcriptions, because the definition of each item in TLC is clear and simple and the scoring system is highly operationalized. In contrast, symptoms in PANSS are rather complex. For example, "P2 conceptual disorganization" is almost a summary score of TLC. In addition, rating PANSS symptoms requires information from other aspects, such as emotional and other non-verbal cues, the quality of the interaction between interviewer and interviewee, and the interview techniques that determine whether specific symptoms can be elicited during the interview.
D. Ablation Studies
1) Influence of Each Modality: In our multimodal assessment model, we extract information from three modalities, i.e., linguistic, acoustic, and visual cues. To investigate the contribution of each modality to predicting the symptom severity scores, we train the backbone networks of each modality separately, evaluate their performance, and compare them with the full version of the multimodal assessment model. The results are shown in Table IV, where the "All" modality in the last row represents the complete version of our multimodal model using all three modalities.
It can be observed that the model consisting of the language backbones achieves the lowest MAE and MSE among the three modalities, which indicates that the language modality is the most important source of information for predicting symptom severity. The acoustic and visual modalities achieve similar performance and can only serve as auxiliary features for prediction. These observations are consistent with the characteristics of schizophrenia.
2) Influence of Second-Stage Pre-Training: Before incorporating the off-the-shelf pre-trained models as the backbones of our multimodal assessment model, we conducted second-stage pre-training (SSPT) for each unimodal pre-trained model. To investigate the effectiveness of second-stage pre-training, we compare the performance of unimodal and multimodal models initialized with weights from off-the-shelf checkpoints versus weights learned after second-stage pre-training. The results are shown in Table V, where ✗ in the SSPT column represents no second-stage pre-training, and ✓ represents that the backbones have been tuned with second-stage pre-training.
It can be observed that second-stage pre-training indeed helps the models achieve better performance on the downstream task, especially for the multimodal model, which has more than one backbone containing weights learned in second-stage pre-training. These observations highlight the importance of second-stage pre-training of the backbones, since their pre-trained weights are frozen during adapter-based fine-tuning.
3) Influence of Adapter-Based Fine-Tuning: Since our goal is to fine-tune a gigantic multimodal model consisting of multiple large pre-trained backbones, we modified each backbone with adapter modules and trained the entire multimodal model by parameter-efficient fine-tuning, also called adapter-based fine-tuning, allowing the entire training process to fit on a single 1080Ti GPU. In theory, to investigate the influence of adapter-based fine-tuning on model performance, we should naively fine-tune all the pre-trained parameters in the multimodal model for comparison with the adapter version. However, that would increase the number of parameters to be fine-tuned by a factor of more than a hundred, which cannot be handled by our limited computational resources. Therefore, we instead compare the performance of standard fine-tuning and parameter-efficient fine-tuning using only the unimodal models, which have fewer parameters in total. The results are shown in Table VI, where ✓ in the Adapter column represents parameter-efficient fine-tuning, and ✗ represents fine-tuning all the pre-trained parameters.
It can be observed that parameter-efficient fine-tuning improves the unimodal models' performance, as fewer trainable parameters reduce the chance of overfitting on a small dataset like ours. Parameter-efficient fine-tuning also allows multiple large pre-trained models to be fine-tuned simultaneously as a single model without the need for excessive computing power.
4) Influence of Proposed Fusion Framework: Instead of treating each modality as equally important in multimodal fusion, we first fuse the acoustic and visual features into auxiliary features, which are then projected down to the bottleneck dimension to extract only the useful information and inserted into the language backbones to enhance the linguistic features for schizophrenia symptom severity assessment. To investigate whether the proposed multimodal fusion, which treats each modality with different importance, is better than simply concatenating all the features from every unimodal backbone, we compare the performance of these two multimodal fusion methods. The results are shown in Table VII.
It can be observed that the proposed multimodal fusion framework achieves better performance for schizophrenia symptom severity assessment, which is consistent with the characteristics of schizophrenia. This demonstrates the effectiveness of designing multimodal fusion methods based on the domain knowledge of downstream tasks.

V. CONCLUSION
In this work, we proposed a multimodal model for assessing the symptom severity of schizophrenia patients based on linguistic, acoustic, and visual cues. The model takes as inputs the textual transcription, audio recording, and video recording of a psychiatric interview between a clinician and a schizophrenia patient. The prediction targets are the severity scores of 31 TLC and PANSS symptoms rated by psychiatrists.
To extract multimodal information from the inputs, we adopt four unimodal large pre-trained transformer-based models and conduct a second-stage pre-training for each of them. After that, light-weight trainable adapter modules are inserted for parameter-efficient fine-tuning, and the unimodal representations are fused through the proposed multimodal fusion framework. Through the ablation studies, we verify the effectiveness of the proposed methods and show that all three unimodal feature types contribute to the prediction of both TLC and PANSS.
In future work, we plan to incorporate physiological signals recorded during psychiatric interviews for more rigorous analysis, more accurate measurement, and an even more comprehensive assessment.