Depression Recognition Using Remote Photoplethysmography From Facial Videos

Depression is a mental illness that can severely affect an individual's health. Early detection of mental health disorders and a precise diagnosis are critical to avoid social, physiological, or psychological side effects. This work analyzes physiological signals to determine whether different depressive states have a noticeable impact on the blood volume pulse (BVP) and the heart rate variability (HRV) response. Although HRV features are typically calculated from biosignals obtained with contact-based sensors such as wearables, we propose a novel scheme that extracts them directly from facial videos, based solely on visual information, removing the need for any contact-based device. Our solution relies on a pipeline that extracts complete remote photoplethysmography (rPPG) signals in a fully unsupervised manner. From these rPPG signals we compute over 60 statistical, geometrical, and physiological features, which are then used to train several machine learning regressors to recognize different levels of depression. Experiments on two benchmark datasets indicate that this approach offers results comparable to other audiovisual modalities based on voice or facial expression, potentially complementing them. In addition, the proposed method shows promising and solid performance that outperforms hand-engineered methods and is comparable to deep learning-based approaches.


I. INTRODUCTION
Major depressive disorder (MDD), also known as clinical depression, is a common mental disorder that contributes significantly to the global healthcare burden and can lead to severe consequences for individuals both personally and socially. In addition, several studies suggest long-term and clinically significant depression as a trigger for other serious medical conditions and physiological changes such as cardiovascular disease, diabetes, osteoporosis, aging, pathological cognitive changes, including Alzheimer's disease and other dementias, and even an increased risk of earlier mortality [1]. Currently, depression screening is usually based on medical interviews following the Diagnostic and Statistical Manual of Mental Disorders (DSM-5), but it depends on the subjectivity and experience of the psychiatrist and on the subjective memory of the patient, which can lead to misdiagnosis, with its consequential social, physiological, or psychological side effects due to undertreatment or overtreatment of the illness.
In recent years, the assessment of depression from facial videos has aroused interest in the scientific community, since the clinical literature has documented particular visual cues and behaviors on faces and facial expressions triggered by major depressive disorder [2]. These facial signs include reduced facial movements, eyebrow activity, eye gaze, head pose changes, occurrence of mood expressions, body gestures, and eyelid activity, among others. In addition, this discipline allows the development of a noninvasive and unobtrusive technology and modality that can support the medical diagnosis while the physician focuses exclusively on the patient. The literature based on facial visual information has concentrated mainly on three ideas: extracting features from textures and dynamic textures using handcrafted textural descriptors, extracting features from the facial geometry and morphology, and using deep learning approaches, which represent the state-of-the-art methods nowadays.
On the other hand, other objective biomarkers have been shown to be useful for physicians to evaluate and assess the level of depression of the patient in a more confident and precise manner. Recent studies have demonstrated the impact of depression on physiological biomarkers, such as heart rate variability (HRV) calculated from the electrocardiogram (ECG) [3], [4], HRV calculated from PPG signals [5], or electrodermal activity (EDA) [6].
In this article, and based on these findings, we propose a novel approach for automatic depression screening using physiological signals extracted from facial videos and machine learning for the first time. Our main contributions can be summarized as follows:
• We assess depression scores by extracting remote photoplethysmographic (rPPG) signals and using them to compute a set of statistical and heart rate variability (HRV) features, including non-linear geometrical parameters from the blood volume pulse (BVP), feeding them to machine learning regressors based on Random Forests and Multilayer Perceptrons (a minimal sketch of this stage is given below).
• To demonstrate the validity of our approach, we evaluate our methods on two publicly available video-based benchmark datasets.

The experiments are performed on a computer with an AMD® Ryzen™ 3700X 8-core processor at 3.6 GHz, 64 GB of RAM, a 4 TB SSD, and two NVIDIA GeForce® RTX™ 2080 GPUs. We have also used the Puhti supercomputer at the IT Center for Science (CSC) in Finland to extract the visual texture-based features. We used Python 3.8 as the programming language.
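To make the first contribution concrete, the following is a minimal sketch of the HRV feature extraction and regression stage, assuming the inter-beat intervals (IBIs) have already been derived from the rPPG signal. The feature subset shown is a small illustrative sample of the 60+ features used in this work, and the variable names (`ibi_windows`, `y`) and regressor settings are assumptions of the example, not the exact configuration of our pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

def hrv_features(ibi_ms):
    """A few standard time-domain and geometrical (Poincare) HRV parameters."""
    diffs = np.diff(ibi_ms)
    sdnn = np.std(ibi_ms)                       # overall IBI variability
    rmssd = np.sqrt(np.mean(diffs ** 2))        # short-term variability
    pnn50 = np.mean(np.abs(diffs) > 50) * 100   # % of successive diffs > 50 ms
    sd1 = np.std(diffs) / np.sqrt(2)            # Poincare plot minor axis
    sd2 = np.sqrt(max(2 * sdnn ** 2 - sd1 ** 2, 0.0))  # Poincare major axis
    return np.array([sdnn, rmssd, pnn50, sd1, sd2])

# ibi_windows: list of IBI arrays (in ms), one per video window (assumed given)
# y: depression score labels, one per window (assumed given)
X = np.vstack([hrv_features(w) for w in ibi_windows])
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
mlp = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000).fit(X, y)
```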

C. Performance metrics
To evaluate the performance of these models and make a fair comparison with the state-of-the-art methods, we report the two most common metrics in the automatic depression assessment literature, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). The overall predicted depression score for each input video is obtained by averaging the estimation scores for all its windows.
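As an illustration, a minimal sketch of both metrics together with the per-window-to-per-video averaging described above; the variable names (`window_preds`, `y_true`) are assumptions for the example.

```python
import numpy as np

def video_level_scores(window_preds):
    """Average the window-level estimates into one score per video."""
    return np.array([np.mean(p) for p in window_preds])

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# window_preds: list of per-window prediction arrays, one per test video
y_pred = video_level_scores(window_preds)
print(f"MAE={mae(y_true, y_pred):.2f}, RMSE={rmse(y_true, y_pred):.2f}")
```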

D. Experimental results
In this section, we evaluate the performance and validity of the proposed modality and approach through a series of experiments on the benchmark databases, comparing them with other modalities and state-of-the-art approaches.
1) Performance in AVEC2013 and AVEC2014: Tables I and II show the performance of the proposed approach using HRV and BVP features extracted from facial videos for AVEC2013 and AVEC2014, respectively. In addition, we also explore multimodal fusion by combining the heart-related features with textural and deep features to complement the results.
We observe that the individual models (using HRV features and textural features) achieve similar performance on the AVEC2013 and AVEC2014 test sets, with a slight advantage for the features from temporal visual descriptors based on textures. The most remarkable result is the combination of features from the textural and physiological modalities, which achieves the best performance among the methods not based on deep learning features. For AVEC2014, the regression models work slightly better for the Freeform task than for the Northwind task, as expected according to the baseline results [24]. When the data from both tasks are joined, the results of the individual models (using HRV or textural features) are similar in both datasets. We show results both for individual modalities and for the fusion of HRV features with textural and deep features. In AVEC2014, score-level fusion also yields better performance than feature-level fusion, although the improvement is slightly smaller than in AVEC2013.
Although the deep learning-based approach (ResNet-50) obtains better individual results than the models trained with handcrafted features extracted from either textural descriptors or rPPG signals, its combination with rPPG features at score level further improves the results. This shows that, in the same manner as the textural and rPPG modalities, deep models provide information that is complementary to that extracted from physiological signals. However, the best results are obtained when fusing all three data modalities at score level.
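A minimal sketch contrasting the two fusion strategies discussed here; the model class, weights, and variable names (`X_hrv`, `X_texture`, `X_deep`, and their test counterparts) are placeholders for the example, not the tuned configuration used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Feature-level fusion: concatenate per-window feature vectors, train one model.
X_fused = np.hstack([X_hrv, X_texture, X_deep])
fused_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_fused, y)

# Score-level fusion: one regressor per modality, then combine the predicted
# depression scores (here an unweighted average; weighted schemes also work).
train = {"hrv": X_hrv, "texture": X_texture, "deep": X_deep}
test = {"hrv": Xte_hrv, "texture": Xte_texture, "deep": Xte_deep}
models = {m: RandomForestRegressor(n_estimators=200, random_state=0).fit(Xtr, y)
          for m, Xtr in train.items()}
fused_scores = np.mean([models[m].predict(test[m]) for m in models], axis=0)
```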
2) Error analysis: To further analyze the performance of the features, Figure 4 shows the error distribution on the AVEC2014 benchmark: the absolute error for each of the 100 test videos, ordered from smallest to largest. We can observe that the error distributions of the HRV and deep models have similar shapes. While the deep model shows a smaller overall error, the HRV model shows a smaller maximum error, always below 25. In contrast, the textural model shows a higher proportion of videos with a very small error (10 or less), but jumps to very high errors for a significant portion of the videos.
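A short sketch of how such error curves can be produced, assuming per-video predictions for each modality are available (the variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# y_true: ground-truth scores; y_hrv, y_tex, y_deep: per-video predictions
curves = {name: np.sort(np.abs(y_true - preds))  # ascending absolute errors
          for name, preds in {"HRV": y_hrv, "Texture": y_tex, "Deep": y_deep}.items()}
for name, errs in curves.items():
    plt.plot(errs, label=name)
plt.xlabel("Test videos (sorted by error)")
plt.ylabel("Absolute error")
plt.legend()
plt.show()
```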
3) Qualitative evaluation: For a qualitative evaluation of the models, we show the predictions per window for three example videos, depicted in Figure 5. We can observe that inference using rPPG-based features is relatively stable and shows less variance across the time windows that make up a single video. This is in contrast with the inferences obtained from regressors trained with visual textural features, which show high variability.
In summary, the feature modalities explored for video-based depression assessment can be grouped as follows:
• Geometric features, related to the morphology of both the image and the facial landmarks. The approaches and methods that use these features focus primarily on translating the temporal information of the landmarks or head pose into images such as spectral heat maps, motion history images, or motion maps. Other approaches use temporal and morphological information from facial landmarks, gaze, or Action Units (AUs) to regress the level of depression.
• Texture features, mostly associated with the static visual appearance of a single frame. These approaches use handcrafted visual descriptors such as LPQ or LBP, or deep learning features based on the facial appearance of one frame, to infer an instantaneous level of depression.
• Dynamic texture features, which include the temporal information based on visual features from a sequence of frames. This is the most explored feature modality, since it is known that temporal facial reactions or expressions convey more information about a person's emotional state. Approaches focused on this modality have explored different features such as handcrafted spatiotemporal visual descriptors (LGBP-TOP, LPQ-TOP), different deep learning architectures that encode temporal information, or low-level deep learning features extracted from sequences of images.
• And finally, to the best of our knowledge, we have introduced a new data (feature) modality that can be used on RGB videos. It consists of the extraction of physiological signals (BVP) from faces using the temporal RGB information. We use remote photoplethysmographic waveforms to extract features related to the pulse signal, such as heart rate variability and fractal analysis, which have been shown to have a significant impact on the monitoring and diagnosis of mental health disorders such as depression, stress, or anxiety.

From the comparative results, it can be seen that visual information seems to offer better cues for the assessment of depression than audio information. In particular, deep features that combine both spatial and temporal information offer the best overall performance, while other modalities such as geometrical features, behavioural signals, and remote physiological signals (HRV) can offer complementary information, further improving the performance. For audio, deep models also outperform those created using handcrafted features. Overall, the multimodal combination of audio and video shows the best performance.
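To illustrate the idea behind this new modality, the following is a deliberately simplified sketch of rPPG extraction: spatial averaging of the green channel over a facial skin region, band-pass filtering in the human heart-rate band, and peak detection to obtain the inter-beat intervals that feed the HRV features. The actual pipeline used in this work is fully unsupervised and more elaborate; the ROI cropping, RGB channel order, and frame rate here are assumptions of the example.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def rppg_from_frames(face_rois, fps):
    """face_rois: (T, H, W, 3) array of RGB facial skin crops; fps: frame rate."""
    # Spatially average the green channel (index 1, assuming RGB order),
    # which carries the strongest pulsatile component.
    green = face_rois[..., 1].reshape(len(face_rois), -1).mean(axis=1)
    green = (green - green.mean()) / (green.std() + 1e-8)   # normalize
    # Band-pass 0.7-4.0 Hz (~42-240 bpm) to isolate the pulse signal.
    b, a = butter(3, [0.7, 4.0], btype="bandpass", fs=fps)
    return filtfilt(b, a, green)                            # BVP estimate

fps = 30.0                                     # assumed frame rate
bvp = rppg_from_frames(face_rois, fps)         # face_rois assumed given
peaks, _ = find_peaks(bvp, distance=fps * 0.4) # >= 0.4 s between beats
ibi_ms = np.diff(peaks) / fps * 1000.0         # inter-beat intervals in ms
```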

F. Comparison with previous work
For modalities based only on visual information, we compare the results of our proposed method against state-of-the-art methods on the AVEC2013 and AVEC2014 datasets and show them in Table VII and Table VIII. The previous works can be divided into two broad groups: those based on hand-engineered representations and those based on deep learning. In general, deep learning methods outperform methods that use handcrafted features. However, their black-box nature could result in decreased interpretability, missing cues that show where and when manifestations of depression appear, information that could make such systems more useful as tools for medical practitioners.
Tables VII and VIII show, respectively, the performance of several of these methods on AVEC2013 and AVEC2014, both for (data) monomodal and multimodal approaches. The results of these methods seem to improve when using a multimodal approach with different feature modalities [38], where geometric and texture features are combined. Our proposed method builds on similar ideas, but combines novel physiological features with typical dynamic texture features to exploit the mostly complementary visual and physiological temporal information provided by each subject. The learning-based methods also mostly rely on exploiting temporal information, using different deep learning architectures that search for temporal cues in the stream of frames, potentially exploiting spatio-temporal relationships in the videos that could be indicative of depression.
For AVEC2013, the proposed modality outperforms the hand-engineered "traditional" methods, even as a (data) monomodal approach, achieving an MAE of 7.54. In addition, it performs similarly to one of the first learning-based methods proposed to compute the depression level, based on two DCNNs [41].

For AVEC2014, our method, using exclusively the HRV features as the data modality, also outperforms traditional methods using handcrafted features from the RGB videos, and is very close to some deep learning-based methods such as Zhu et al. [41]. When we combine the features derived from the rPPG signal with deep or visual texture-based features, we achieve results comparable to the state-of-the-art methods in the detection of depression. The improvement from modality fusion at the score level is smaller than on AVEC2013, probably due to a smaller amount of data.

IV. CONCLUSION
This paper introduced the extraction of remote biosignals from RGB videos for the automatic screening of depression levels from facial videos, a novel visual data modality explored here for the first time. In this context, we have proposed a novel scheme that extracts physiological signals in an unsupervised manner, based solely on visual information, removing the need for any contact-based device or reference signal. We have used these signals directly to compute physiological features, such as blood volume pulse features and heart rate variability parameters, training different machine learning regression models. We evaluated our approach using the AVEC2013 and AVEC2014 benchmark databases. Our results show that our method provides information that can help in the assessment of depression, and that it can be combined with other visual data modalities to further improve the performance. In our analysis, we have shown graphical examples suggesting that the inference of the models trained with this feature modality is slightly more stable than that of other models, such as those that exploit textural or deep features. Extensive experiments indicated the usefulness of this modality in comparison with different methods from the literature.

V. ACKNOWLEDGEMENTS
This research has been supported by the Academy of Finland 6G Flagship program under Grant 346208 and PROFI5 HiDyn under Grant 326291. The authors wish to acknowledge CSC – IT Center for Science, Finland, for computational resources.