What Can Facial Movements Reveal? Depression Recognition and Analysis Based on Optical Flow Using Bayesian Networks

Recent evidence has demonstrated that facial expressions can be a valid and important cue for depression recognition. Although substantial work has been done on automatic depression recognition, it remains a challenge to explore the inherent nuances of facial expressions that might reveal the underlying differences between depressed patients and healthy subjects under different stimuli. There is also a lack of unobtrusive systems for monitoring depressed patients' mental states in free-living scenarios. This paper therefore takes a step towards building a classification model in which data collection, feature extraction, depression recognition and facial action analysis are conducted to infer the differences in facial movements between depressed patients and healthy subjects. We first present a scheme for dividing facial regions of interest and extracting optical flow features of facial expressions for depression recognition. We then propose facial movement coefficients based on the discrete wavelet transformation. Specifically, Bayesian Networks are learnt with structural constraints derived from Pearson correlation coefficients of these wavelet-based coefficients, which allows the movements of different facial regions to be analysed. We evaluate our method on a clinically validated dataset of 30 depressed patients and 30 healthy control subjects; the experimental results achieve an accuracy of 81.7% and a recall of 96.7%, outperforming the features compared against. Most importantly, the Bayesian Networks built on the coefficients under different stimuli may reveal facial action patterns of depressed subjects, which have the potential to assist the automatic diagnosis of depression.


I. INTRODUCTION
Mental health is of great importance in daily life and activities. An estimated 280 million people worldwide suffer from depression, which may even lead to suicide [1]. As a complicated mood disorder, Major Depressive Disorder (MDD) varies in severity and may last for years, resulting in severe social and psychological distress that can impair a patient's ability to function in daily and social life [2], [3], [4].
Early detection of MDD is of great importance to patients, as it makes it possible to provide proper antidepressant medication and psychological counseling before the symptoms of depression worsen. Common diagnostic approaches include self-report questionnaires [5], [6] and interview-based assessments [7]. Depression diagnosis based on self-assessment scales is widely used in various scenarios, such as primary health care, but its disadvantages cannot be ignored [8], [9]. Scale results cannot reflect the clinical meaning of the feedback, and since the content of a scale is fixed, it is impossible to properly account for individual characteristics or to distinguish other psychiatric and medical comorbidities from depression. Furthermore, clinical diagnosis conducted by experts such as psychologists and psychiatrists usually consists of a comprehensive assessment of the patient's life circumstances, symptoms, and so on, which relies heavily on the expert's subjective experience and risks a range of subjective biases [10]. As the number of patients grows, this experience-dependent diagnostic procedure demands intensive expert labor, and follow-up treatment is made harder by a lack of resources and trained health care providers. It is therefore important to develop diagnostic models that combine machine learning with traditional evaluation methods to improve the effectiveness of depression diagnosis [11], [12], [13]. This motivates us to explore depression recognition strategies based on objective indicators, which are highly needed to assist clinicians in objective and efficient depression diagnosis [14], [15]. DSM-V [16] states that the diagnosis of mental disorders should help clinicians determine the patient's treatment plan and anticipate the development of their mental state.
To compensate for the inadequacies of traditional depression diagnosis, such as its high reliance on doctors' experience and its risk of subjective bias, an objective and efficient method is urgently needed to assist doctors in the clinical diagnosis of depression. As applications of artificial intelligence advance [17], [18], [19], machine-based depression assessment helps ensure that depressed patients receive timely and effective clinical treatment, improving their well-being [20].
Within the past several years, depression recognition based on facial expressions has received increasing attention [21]; it has the potential to enable an effective and objective diagnostic approach for people suffering from depression, so that valid treatment can be applied in time. DSM-V [16] notes that patients' facial expressions and actions can provide evidence for inferring the presence of depression. As a significant channel for expressing emotion, facial expression has been explored by psychologists with various modeling strategies, such as the "Facial Action Coding System" (FACS) [22], built on the well-known idea that each facial expression can be visually represented as a combination of different facial muscle actions. The activation of each Action Unit (AU) largely takes place in a visible way through changes in facial geometry and appearance. A study [23] exploring whether facial movements can identify depression concluded that models achieve similar depression detection accuracy using FACS parameters or an active appearance model, which suggests that observable differences exist between the facial expressions of depressed patients and healthy individuals.
Although significant research has been conducted on depression detection based on facial expressions, the complexity of depression symptoms leaves specific challenges, which this paper addresses with the following contributions: • A facial expression dataset was collected, comprising MDD patients and healthy subjects, to capture facial differences and optimize depression recognition.
• Optical Flow (OP) and wavelet transformation coefficients were extracted to reveal the subtle nuances of facial movements.
• Bayesian Networks (BN) were utilized to analyze the inherent correlations between specific facial regions, which have the potential to reveal underlying facial movement patterns in depressed patients.

The remainder of the paper is organised as follows. Section II briefly introduces the related work, and Section III describes the data set and our proposed models for depression recognition. Section IV presents our experimental settings, results and analysis. Finally, we conclude the paper in Section V.

II. BACKGROUND AND RELATED WORK

A. Depression Recognition Based on Facial Expressions
For the past few years, depression recognition [24], [25], [26] has received much attention, one branch of which is automated depression recognition based on facial expressions. Numerous algorithms have been proposed to build novel and efficient face detectors [27], [28], [29]. For instance, early face detection with hand-crafted features, such as the Viola-Jones detector [30] and the Dlib library [31], could locate significant facial regions or landmarks. Moreover, neural networks have been incorporated into face detection algorithms, generating a series of promising face detectors [32], [33], [34].
Feature extraction is the process of information representation, aiming to obtain a robust descriptor and reduce data redundancy [35]. Sariyanidi et al. [36] summarized facial features such as the Local Binary Pattern (LBP) [37], which excels at describing image texture and highlighting certain facial areas, including the mouth, eyebrows and eyes [38]. OP [39], [40] extracts and evaluates the magnitude and direction of a target object's motion at the pixel level, which can effectively measure spatio-temporal intensity changes of facial landmarks or extracted ROIs. Moreover, statistical analyses such as the Discrete Wavelet Transformation (DWT) can locate key frames without the time consumption and unavoidable redundancy of the continuous wavelet transformation [41]. DWT has been extensively applied to feature extraction from one-dimensional signals, with classical applications in depression recognition [42], [43] and facial analysis [44], [45]. The approximation and detail coefficients derived from the DWT can serve as effective features capturing the information of signals at different scales and frequencies, and these feature vectors can be used as inputs to machine learning algorithms for classification and recognition. In facial image analysis in particular, the approximation and detail coefficients have been widely employed for tasks such as facial expression and action recognition: they capture both the global trends and the local details of the images, which is crucial for extracting specific features from facial images.

B. Bayesian Networks Construction
BNs are probabilistic graphical models: directed acyclic graphs that can represent the joint probability distribution of variables in an efficient form. A network consists of nodes, which represent the variables of interest, and edges, which represent the conditional dependencies between them. The advantage of BNs is that they provide a compact and intuitive representation of complex probabilistic relationships between variables. They can also be used for probabilistic inference, i.e. computing the probability distribution of a subset of variables given evidence about others. BNs have been widely applied in uncertainty reasoning [46], [47], with the advantages of graphical representation and probability [48]. However, the structure learning of BNs from data is NP-hard [49], and the effectiveness of a model depends largely on the BN structure [50]. To find the globally optimal structure using the whole data set, many studies leverage dynamic programming [51], but such methods have high time complexity and memory consumption, making them unsuitable for large numbers of variables. Perrier et al. [52] proposed learning over an optimal undirected graph with structural constraints to reduce the search space, and Kojima et al. [53] extended the idea using variable clustering and ancestral constraints. Constraint-based structure learning is an effective strategy to reduce the search space of BNs [54]: a measure is computed for every pair of variables to determine whether an edge should be added between them. Li et al. [55] used a constraint-based BN learning strategy to explore AU correlations.
Moreover, researchers have explored compound emotion categories by measuring AUs involved in facial expressions [56].

C. Action Units and Facial Expressions
Facial expression recognition has attracted increasing attention recently [57]. FACS uses combinations of facial action units to describe a host of facial expressions. In previous studies [58], the spatio-temporal correlations and interactions between facial movements were rarely explored, yet those correlations could help us better understand and analyze facial expressions [55]. FACS divides the facial area into 46 AUs, each anatomically related to the contractile movements of a specific set of facial muscles. In general, AUs representing different facial muscle movements combine to form natural facial expressions that reveal emotions [59]. As depicted in FACS [60], the intrinsic relationships between AUs can provide information for better analysis of facial expressions.

III. MATERIALS AND METHODS
A. Participants and Data Acquisition

1) Experimental Paradigm: To explore the facial movements of depressed patients and capture nuances of expression, three kinds of emotional stimuli are required. We recruited participants and collected data using video stimuli from the "Chinese Affective Video System", comprising positive, neutral, and negative movie clips of about 2 minutes each. The whole process can be seen in Fig. 1. At the beginning of the experiment, the subject is instructed to prepare to watch the videos. At the end of each video, the subject takes a break of about 2 minutes to adjust his/her emotional state, which helps eliminate the emotional experience produced by the previous video. Each participant was required to watch all three video clips once while their facial expressions were recorded.

2) Subjects: Depressed subjects were recruited on the basis of a clinical interview that met the diagnostic criteria for MDD as described in DSM-IV. For all subjects, the inclusion criteria were as follows: a) no psychotropic drug treatment could have been taken within the two weeks prior to diagnosis, and b) no drug treatment could have been started after diagnosis. The exclusion criteria were: a) chronic somatic illness, b) long-term use of cortisol drugs, sedatives, analgesics, drugs for hypertension and heart disease, sleep medications, antiepileptic drugs, etc., c) a family history of mental disorders within two generations (including paternal and maternal lines), d) neurological problems, or alcohol or psychotropic substance abuse or dependence in the past year, and e) pregnancy, breastfeeding, or use of birth control pills.
In accordance with the above inclusion and exclusion criteria, this paper collected data from 34 depression outpatients diagnosed by clinical psychiatrists at the Second People's Hospital of Gansu Province and 32 healthy controls, totaling 66 subjects. After excluding invalid data, such as recordings in which part of the face disappeared for long periods or frequent body movements left the face incomplete, we retained 30 outpatients (10 males, 20 females) and 30 healthy controls (16 males, 14 females).
3) Preprocessing: To extract more effective OP features, each video is cropped into a sequence of images, and we delete video sequences with incomplete faces. For each frame, we use the Dlib face detector to detect the face in the image and then leverage 68 facial landmarks to calibrate each part of the face and crop it. The whole process is shown in Fig. 2.
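The cropping step above can be sketched as follows. This is a minimal illustration that assumes the 68 landmarks have already been produced by Dlib's shape predictor (the `crop_face` helper and its `margin` parameter are our own, introduced only for illustration); the crop itself is plain array slicing.

```python
import numpy as np

def crop_face(frame, landmarks, margin=0.1):
    """Crop a face region from `frame` using 68 (x, y) landmarks.

    `frame` is an H x W (x C) image array; `landmarks` is a (68, 2)
    array such as the output of Dlib's 68-point shape predictor.
    A relative margin is added around the landmark bounding box.
    """
    xs, ys = landmarks[:, 0], landmarks[:, 1]
    w, h = xs.max() - xs.min(), ys.max() - ys.min()
    x0 = max(int(xs.min() - margin * w), 0)
    y0 = max(int(ys.min() - margin * h), 0)
    x1 = min(int(xs.max() + margin * w), frame.shape[1])
    y1 = min(int(ys.max() + margin * h), frame.shape[0])
    return frame[y0:y1, x0:x1]
```

Clamping the box to the frame boundaries guards against landmarks near the image edge, one of the failure cases that led to sequences being discarded.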

B. ROIs Extraction
We found that an AU-based approach had potential for the extraction of facial ROIs. Combining the association between AUs and facial expressions, we propose a specific scheme for extracting facial ROIs according to the facial regions involved in AUs [61], as shown in Table I. We also visualised the magnitude of OP in the 14 ROIs of the proposed partition scheme. The preliminary facial area extraction scheme is shown in Fig. 3, which presents box plots of the normalized OP magnitude in each facial region under the three different stimuli. It can be seen that the activation of different facial regions differed significantly, indicating distinct movements in different regions.

C. Feature Extraction

1) ROIs-Based Optical Flow Feature: OP [39], [62] aims to evaluate the magnitude and direction of motion from a video frame or image sequence, leveraging a series of facial landmarks. Generally, for an object, OP-based algorithms compute the pixel-level movements occurring between two frames. OP methods depend on the selection of the initial tracking points and their position changes, and are highly sensitive to factors such as noise and occlusion [63].
For an image sequence, the OP infers the motion of a target object by detecting the changes of pixels between two frames over time. For each image, we used the 68 facial landmarks to compute our 14 ROIs with reference to the action units. Given a facial landmark $L_i$, $i = 1, 2, \ldots, 68$, derived from the detected face, with coordinates $L_i(x, y)$ and pixel intensity $I_{L_i}(x, y, t)$ at time $t$, suppose it moves a distance $(\Delta x, \Delta y)$ between two frames over a time $\Delta t$. According to the brightness constancy constraint of the OP, the OP of this landmark can be described as:

$$I_{L_i}(x, y, t) = I_{L_i}(x + \Delta x, y + \Delta y, t + \Delta t).$$

Assuming that the displacement of the pixel is small, the OP constraint equation for the pixel intensity $I_{L_i}(x, y, t)$ at the $i$-th point $L_i(x, y)$ can be developed according to the Taylor series as:

$$I_{L_i}(x + \Delta x, y + \Delta y, t + \Delta t) = I_{L_i}(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t + \tau,$$

where $\tau$ is a higher-order infinitesimal. From these equations, we can obtain:

$$\frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t = 0,$$

and, dividing by $\Delta t$, we can conclude that:

$$\frac{\partial I}{\partial x} V_x^{L_i} + \frac{\partial I}{\partial y} V_y^{L_i} + \frac{\partial I}{\partial t} = 0,$$

where $V_x^{L_i} = \Delta x / \Delta t$ and $V_y^{L_i} = \Delta y / \Delta t$ are the velocities corresponding to the pixel intensity $I_{L_i}(x, y, t)$, i.e. the $x$ and $y$ components of the OP. Thus, between two frames separated by $\Delta t$, the OP of a facial landmark $L_i(x, y)$ at time $t$ is represented as a two-dimensional vector:

$$V_{L_i}^t = \left(V_x^{L_i},\ V_y^{L_i}\right).$$

The Euclidean coordinates can then be converted into polar coordinates:

$$\rho_{L_i}^t = \sqrt{\left(V_x^{L_i}\right)^2 + \left(V_y^{L_i}\right)^2}, \qquad \theta_{L_i}^t = \arctan\!\left(\frac{V_y^{L_i}}{V_x^{L_i}}\right),$$

where $\rho_{L_i}^t$ and $\theta_{L_i}^t$ are the magnitude and direction of the OP at the $i$-th landmark at time $t$, respectively. Given an image sequence $f_1, f_2, \ldots, f_m$, for each frame $f_j$, $j = 1, 2, \ldots, m$, we compute the OP of each ROI $R_k^j$, $k = 1, 2, \ldots, 14$, where $m$ is the total number of frames and $k$ indexes the ROIs. The normalized OP feature of a sequence, based on the ROIs, is then represented by:

$$F_{OP} = \left\{ \bar{\rho}_{R_k}^{\,j},\ \bar{\theta}_{R_k}^{\,j} \right\}_{j = 1, \ldots, m;\ k = 1, \ldots, 14},$$

where $\bar{\rho}_{R_k}^{\,j}$ and $\bar{\theta}_{R_k}^{\,j}$ denote the averages of $\rho_{L_i}^t$ and $\theta_{L_i}^t$ over the landmarks belonging to $R_k^j$.
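The per-ROI aggregation above can be sketched as follows. This is a minimal sketch that assumes the per-landmark flow vectors have already been estimated (e.g., with a Lucas-Kanade tracker such as OpenCV's `cv2.calcOpticalFlowPyrLK`); the ROI-to-landmark mapping is a hypothetical stand-in for the partition of Table I, which is not reproduced here.

```python
import numpy as np

def roi_optical_flow(velocities, roi_landmarks):
    """Convert per-landmark flow vectors to polar form and average per ROI.

    `velocities` is an (n_landmarks, 2) array of (Vx, Vy) for one frame
    pair; `roi_landmarks` maps each ROI index to the landmark indices it
    contains. Returns normalized per-ROI magnitudes and the per-landmark
    flow directions.
    """
    vx, vy = velocities[:, 0], velocities[:, 1]
    rho = np.hypot(vx, vy)        # magnitude of the flow per landmark
    theta = np.arctan2(vy, vx)    # direction of the flow per landmark
    mags = np.array([rho[idx].mean() for idx in roi_landmarks.values()])
    # Normalize magnitudes so ROIs are comparable across subjects
    total = mags.sum()
    return (mags / total if total > 0 else mags), theta
```

`np.arctan2` is used rather than a plain arctangent so that the direction is resolved to the correct quadrant even when $V_x^{L_i}$ is zero or negative.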

2) Facial Movement Coefficients Based on Discrete Wavelet Transformation: Wavelet analysis provides a highly flexible strategy for extracting features at specified times or frequencies [64]. We use the approximation and detail coefficients of the DWT as our facial movement coefficients. We first extract optical flow from sliding windows as input signals and apply the DWT to obtain their time-frequency representations, aiming to capture the temporal dynamics and variations of the optical flow features at different time scales. We then take the approximation and detail coefficients as the facial movement coefficients: the approximation coefficients capture the global trends of the optical flow features, while the detail coefficients capture their local variations. The DWT-based facial movement coefficients offer several advantages. First, they provide a multi-scale representation, enabling us to analyze the variations of the optical flow features across different frequency ranges; this helps capture subtle dynamic patterns and extract features relevant to individual subjects or emotional states. Second, the approximation and detail coefficients describe both the overall trends and the local details of the optical flow features, which is crucial for understanding their global dynamics and local changes. Findings on the application of the discrete wavelet transform to one-dimensional signal feature extraction and image analysis provide important insights for our study. Applying the DWT to the OP helps us better characterize, across multiple time scales and frequencies, the nuances in facial movements exhibited by depressed patients and controls under the experimental stimuli, and, moreover, uncover their potential implications for individual-specific and invariant feature representation.
For each image sequence $f_1, f_2, \ldots, f_m$, we use a sliding window of size $n$ and step $s$. The sub-sequence obtained by the sliding window is

$$\gamma_{[a,b]} = \{f_a, f_{a+1}, \ldots, f_b\}, \qquad b = a + n - 1,$$

where $a$ and $b$ are the start and end frames of the sub-sequence, and the number of windows $\eta$ can be computed as:

$$\eta = \left\lfloor \frac{m - n}{s} \right\rfloor + 1.$$

For each windowed sequence, we compute the average optical flow feature and then apply the discrete wavelet transformation to obtain the approximation and detail coefficients, denoted $C_a$ and $C_d$, respectively. The pywt package from the Python library was employed for the DWT calculations using the db3 wavelet, and the mean values of the approximation and detail coefficients were calculated separately, yielding the final facial movement coefficients. We concatenate these coefficients $C = (C_a, C_d)$ with the OP feature to form our final feature. The two facial movement coefficients extracted in this paper characterize the motion of the ROIs. Specifically, the approximation coefficients derived from the sliding-window data reflect the evolution of the subjects' facial OP throughout the experiment and are well suited to representing the overall changes, while the detail coefficients reveal the sharper parts of the changes, depicting the different facial responses of the two groups under the experimental stimuli. The whole procedure is summarized in Algorithm 1.
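The per-window coefficient computation can be sketched with the pywt package the paper names. This is a minimal sketch: the window size and step values are illustrative defaults, not the paper's settings, and the helper name is our own.

```python
import numpy as np
import pywt

def movement_coefficients(flow_series, n=32, step=8, wavelet="db3"):
    """Compute mean DWT approximation/detail coefficients per window.

    `flow_series` is a 1-D array of per-frame optical-flow magnitudes
    for one ROI. For each sliding window, a single-level DWT yields
    approximation (C_a) and detail (C_d) coefficients; their means
    form the facial movement coefficients.
    """
    ca_means, cd_means = [], []
    for a in range(0, len(flow_series) - n + 1, step):
        window = flow_series[a:a + n]
        ca, cd = pywt.dwt(window, wavelet)  # single-level db3 DWT
        ca_means.append(ca.mean())
        cd_means.append(cd.mean())
    return np.array(ca_means), np.array(cd_means)
```

The loop bound reproduces the window count $\eta = \lfloor (m-n)/s \rfloor + 1$ given in the text.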
3) Pearson Correlation Coefficients: The Pearson correlation coefficient measures the linear correlation between two variables. Studies exploring the correlations between different expressions and action units have received much attention in recent years. For example, Wang et al. [65] analyzed the correlation between the six basic expressions and AUs based on the Extended Cohn-Kanade (CK+) dataset [66]. Following these studies, we build and analyse the correlations between our ROIs and various AUs using Pearson correlation analysis, which helps us understand, from the perspective of the facial muscles, the nuances generated by different emotions.
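Computing the full pairwise correlation structure over the ROI series is a one-liner with NumPy; the sketch below shows the shape of the computation (the function name and the feature layout are our own illustrative assumptions).

```python
import numpy as np

def roi_correlations(roi_features):
    """Pearson correlation matrix between ROI feature series.

    `roi_features` is a (num_windows, num_rois) array; each column is
    the movement-coefficient series of one ROI. Returns the symmetric
    (num_rois, num_rois) correlation matrix, with 1.0 on the diagonal.
    """
    return np.corrcoef(roi_features, rowvar=False)
```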

IV. RESULTS

A. Classification
Several classifiers were evaluated, including k Nearest Neighbors (kNN), Support Vector Machines (SVM), Naive Bayes (NaiveB), and Decision Trees (DT). Depression recognition was performed under the different kinds of stimuli for comparison, and all classifiers underwent a grid parameter-tuning step. For instance, for the decision tree, we used grid search to optimise the parameter selection, including max_depth, min_samples_split, and min_samples_leaf. For SVM, we used the sigmoid kernel and performed grid tuning of the hyper-parameters. We conducted leave-one-subject-out cross-validation: in each fold, we used 59 subjects to train the model and the remaining one to test. After 60 folds, we obtained the confusion matrix. The standard deviation was calculated from the results of each fold. The results with standard deviations are shown in Table II, and the receiver operating characteristic (ROC) curves of the optimized models are shown in Fig. 6, where C_a and C_d denote the facial movement coefficients.
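The evaluation protocol above can be sketched with scikit-learn. This is a hedged sketch, not the paper's exact pipeline: the parameter grid values are illustrative, and only the decision-tree branch is shown.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def loso_evaluate(X, y, subjects):
    """Leave-one-subject-out evaluation with a grid-tuned decision tree.

    `X` is (n_samples, n_features), `y` the binary labels, and
    `subjects` the subject ID of each sample; every fold holds out one
    subject, mirroring the paper's 60-fold protocol.
    """
    logo = LeaveOneGroupOut()
    grid = {"max_depth": [3, 5, None], "min_samples_split": [2, 4]}
    correct = 0
    for train, test in logo.split(X, y, groups=subjects):
        # Tune hyper-parameters on the training subjects only
        clf = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=3)
        clf.fit(X[train], y[train])
        correct += int((clf.predict(X[test]) == y[test]).sum())
    return correct / len(y)
```

Nesting the grid search inside each fold keeps the held-out subject unseen during tuning, avoiding the leakage that tuning on the full set would introduce.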
Additionally, we selected some facial features for comparison. For LBP [67], we extracted the LBP features of each frame and used them as classification features. For the Histogram of Oriented Gradients (HOG) [68], we divided the images into blocks of cells, computed the HOG features for each frame, and took the average values across all cells within each block. The Main Directional Mean Optical Flow (MDMO) [39] and the Transitional Optical Flow under Stimuli (TOFS) [58] are two optical flow features that were also compared; the results with standard deviations are shown in Table III.

From the above results, we can see that depression recognition based on multiple classifiers achieved a high recall, consistent with our previous experiment [58]. The high recall of some classifiers indicates that our proposed feature performs better for recognising depressed patients, while the diversity of facial activity in healthy subjects might lead classifiers to misjudge. Moreover, among the classifiers in Table II, the DT performed significantly (p<0.05) better than kNN and NaiveB. Among the features in Table III, our feature performed significantly (p<0.05) better than MDMO, LBP and HOG.

Under positive stimuli, several ROIs of the depressed group showed high correlations with the regions corresponding to brow raising and eyelid lift. We speculate that the higher correlations of these ROIs reveal external brow elevation under positive stimuli. Under neutral stimuli, ROI6 of the depressed group shows high correlations with the other ROIs, which suggests a facial movement of nasal wrinkling in the depressed subjects. Moreover, ROI7 correlates with most of the other ROIs more strongly in the control group than in the depressed group. Under negative stimuli, depressed subjects generally had higher correlation coefficients between ROI7 and the other ROIs, whereas controls did not show this relationship. In order to analyse those patterns in depth, we built BN and examined the correlations of the various ROIs under different stimuli.

B. Bayesian Networks
In addition to the Pearson correlation coefficients used to analyze the differences in facial activity between the two groups, and in order to further analyze the correlations between the ROIs and thus explore the facial movement characteristics of the depressed and control groups under the three paradigm stimuli, we constructed BN based on the facial movement coefficients. To learn the structure of the BN, we used the Pearson correlation coefficients as a structural constraint: for each group of data, we take the second quartile (median) as a threshold, and if the Pearson correlation coefficient of a pair of ROIs is smaller than this threshold, we treat them as independent variables. We do not add edges manually, because determining the direction of edges requires substantial domain knowledge about facial expressions that does not necessarily apply to depressed patients; instead, we impose independence constraints between ROIs based on the Pearson correlation coefficients. The final constructed BN under the different stimuli are shown in Fig. 7. It should also be noted that there are various structure learning algorithms for BN, and the optimal network structure may vary with the algorithm used. In this paper, we construct the networks from the detail coefficients under the different paradigm stimuli using the independence constraints, and use those networks to analyse the characteristics of facial movements in the two groups.
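The derivation of the independence constraints can be sketched as follows. This is an illustrative reading of the thresholding step, with the helper name our own; the resulting forbidden-edge list is the kind of black list a constraint-aware structure learner (for example, pgmpy's hill-climbing search) can consume.

```python
import numpy as np

def independence_constraints(corr):
    """Derive forbidden edges for BN structure learning.

    `corr` is the (num_rois, num_rois) Pearson matrix. Pairs whose
    absolute correlation falls below the second quartile (median) of
    the off-diagonal values are treated as independent, i.e. excluded
    as edges in the learned structure.
    """
    n = corr.shape[0]
    off_diag = np.abs(corr[~np.eye(n, dtype=bool)])
    threshold = np.median(off_diag)
    forbidden = [(i, j) for i in range(n) for j in range(n)
                 if i != j and abs(corr[i, j]) < threshold]
    return forbidden, threshold
```

Using the median of the off-diagonal entries, rather than a fixed cutoff, adapts the constraint to each group's overall correlation level, which matches the per-group thresholding described in the text.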

C. Analysis and Evaluation
Based on the BN, we explored the facial movements of depressed and control subjects under different stimuli. Fig. 7 shows that the depressed group exhibited a series of movements associated with ROI9, corresponding to the facial action of lip corner pulling (AU12). Additionally, ROI7 and ROI14 showed high correlations with other ROIs, both corresponding to lip movements. Under neutral stimuli, the network structure of the depressed subjects showed more connections than under positive stimuli, indicating richer interconnections among ROIs. Depressed subjects also showed a close relationship between ROI10 and ROI7, which correspond to the muscle movements of lip corner depression and upper lip raising, respectively. The BN of the control group under negative stimuli showed a multiplicity of connectivity for ROI9 similar to that exhibited by depressed subjects under positive stimuli; related studies have shown that the lip corner pulling and tightening corresponding to ROI9 occurs in compound classes of emotions, such as the expression [56] of happy and disgust. Furthermore, under negative stimuli the depressed group showed strong correlations of facial movements with ROI7 (lip muscle movement) and ROI9. In conclusion, the lip movements of depressed subjects under all three stimuli revealed higher correlations with other facial movements, while the ROIs of the control group showed multiple connections.
In some of our cases, the Bayesian Networks exhibit independent subnetworks, which might be attributed to several causes, such as a lack of correlation among variables, subdomain specificity, and limitations of the learning algorithm. Factors such as dataset characteristics, sample size, and algorithm constraints could also contribute to the formation of separate subnetworks. To address these issues, future experiments will involve larger numbers of participants and an optimized learning algorithm, yielding a more comprehensive BN structure and an improved understanding of the dependencies among variables.

V. CONCLUSION
Facial expression data contain rich emotional information, and an increasing number of studies in recent years have combined facial expressions with depression recognition, using the information revealed by patients' faces to help psychiatrists diagnose depression. In this study, we extracted sparse OP features and conducted a series of statistical analyses to explore their time-frequency characteristics and recognise depression. The experimental results demonstrated the validity of our model. Furthermore, we built BN based on wavelet transformation coefficients to analyse facial movements. To extract more specific and unique movement patterns of the depressed group, the ROIs should be explored further in future work with other partition strategies, which could improve the performance of depression recognition models.

VI. LIMITATION AND FUTURE WORK
In this work, we proposed a sparse OP feature based on facial movement coefficients, and the classifier constructed on it achieved effective depression recognition. Reviewing the whole process of depression recognition, certain aspects still deserve further improvement and exploration. For example, our model achieved a high recall rate, which might be caused by some control subjects producing expression feedback similar to that of depressed individuals. Moreover, beyond the action units, various partition strategies need to be explored to extract more effective face regions. Furthermore, it is worth exploring how to derive the independence constraints and threshold values so as to learn the BN structure more effectively.