An Augmented Teleconsultation Platform for Depressive Disorders

Depressive disorders are a leading cause of morbidity and disability in the working population worldwide. Psychiatric assessment and treatment of depressive disorders may be improved by using objective behavioral and physiological biomarkers. In this work we present a psychiatry telemedicine platform where two important biomarkers of depression, eye-blinking rate and heart rate variability, are computed from the video stream and shown to the clinician in real time during teleconsultation. In order to validate our video processing methods, we use a public annotated dataset with image and photoplethysmography information, which we complemented with eye-blinking annotations. For eye-blinking detection, we obtained 94%±16% precision in the best scenario. For the estimation of %LF and LF/HF, the mean errors obtained are 16.9 NU and 1.1, respectively, outperforming the state of the art. These results show the potential of the proposed telemedicine platform.


I. INTRODUCTION
Depressive disorders are one of the leading causes of disability in society, with an estimated 322 million people affected [1]. The COVID-19 pandemic highlighted the impact of increased social isolation and restricted access to care, leading to higher rates of depression symptoms in the general population [2]. Depressive disorders are diagnosed based on a psychiatric history and mental state examination by a trained doctor. Notwithstanding, biomarkers could aid this approach by providing objective measurements, dissecting heterogeneity within the same diagnosis, helping predict response to treatment and developing targets for new interventions [3]. As telepsychiatry expands in the post-COVID era [4], new digital tools could help implement systematic, non-intrusive, automatic measurement and recording of depression biomarkers in routine clinical practice.
Psychomotor retardation is common in people with depressive disorders [5], [6], thus being one of the nine symptoms that are part of the Diagnostic and Statistical Manual of Mental Disorders 5th edition (DSM-5) criteria for Major Depressive Disorder (MDD) diagnosis [7]. It can be manifested as slowed speech, dysfunctional cognition and decreased movement, affecting hands, legs, torso and head [8]. This psychomotor retardation has also been observed to influence blinking [9].
Heart Rate (HR) is defined as the number of heart beats per minute and Heart Rate Variability (HRV) concerns the fluctuation in the time intervals between adjacent heartbeats [10]. HR variation ensures optimal adaptation to environmental challenges [11] and is a reflection of the many physiological factors modulating the normal rhythm of the heart, namely the coordination of autonomic, respiratory, circulatory, endocrine and mechanical influences over time [12]. It has been reported that a lower HRV may be a predictor of depressive disorders [13], [14], and also that abnormal HRV patterns may be in general associated with mental disorders [15], social cognition [16], executive function [17] and emotional regulation [18]. A reason for HRV to become popular in psychology is that it can be obtained from low-cost, non-invasive, and accessible equipment. Furthermore, in 2008, remote Photoplethysmography (rPPG) research officially started with reports that a subject's face contains a signal sufficiently rich to measure HR under ambient light, using only a digital camera and signal processing [19]. This simple setup makes it possible to estimate HRV from a teleconsultation webcam video. By recognizing that blinking and HRV metrics are related to depression, that they can be extracted from teleconsultation video and that the number of teleconsultations is growing, we extend the state of the art by proposing a novel augmented teleconsultation platform to support assessment of depressive disorders. It contains the canonical functionality of any teleconference application, with additional functionalities especially designed for psychiatric consultation of depressive disorders.
The latter consist of real-time detection and quantification of the patient's eye-blinking rate, HR and HRV during the session, using only a video camera as a data source. These metrics are obtained in each session and stored by the platform, and may be used by the clinician to observe the evolution of the patient. In order to validate the method used to extract these metrics, we assessed its performance using the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC-RPPG) public dataset.
The remainder of this work is organized as follows: Section II describes related work, Section III details the system architecture of the proposed platform, Section IV describes the methods used to extract the biomarkers, Section V presents and discusses our results, Section VI provides an outline of the main findings and suggests next steps.

II. RELATED WORK
Computer-aided diagnosis (CAD) is the use of a computer-generated output to assist clinicians in the decision-making process. CAD work concerning objective metrics to support psychiatric assessment has been developing models to classify subjects into depression groups and regression models to estimate depression scores [20], [21]. Biomarkers that may support the diagnosis come from different data sources such as behavioural smartphone data [22], social media text [23], physiological signals [24], speech [25] and video of the subject. Regarding visual cues, it is common to use features from the subject's face such as facial expression, head movement, and characteristics of the eyes and mouth [26]. For instance, Zhou et al. [27] developed an integrated multimodal system that uses webcam video, social media content and keyboard/mouse user interaction to classify depressed patients into two severity levels. However, their sample included only 5 depressed subjects out of 10, which is a small sample size.
Alghowinem et al. [9] found that in depressed subjects, the average distance between the eyelids was significantly smaller and the average duration of blinks was significantly longer. Al-gawwam et al. [28] proposed a decision support system to classify subjects into a depressed or non-depressed state based on eye blink features such as blinks per minute, blink duration and blink amplitude. In the dataset where the subjects read a passage, the classifier obtained 92% accuracy. This suggests that eye blinking features can be used as metrics to support clinical assessment of depressive disorders.
Studies that investigate the relationship between HRV and depression typically use time-domain and frequency-domain metrics. These metrics are derived from the normal-to-normal (NN) intervals, which represent the time between two consecutive normal heartbeat detections. The most used time-domain metrics are the HR, the standard deviation of the NN intervals (SDNN) and the root mean square of the successive differences (RMSSD) of the NN intervals. Frequency-domain methods estimate the spectral power of bands of interest from the NN interval signal, and three spectral components are distinguished in short-term recordings: very low frequency (VLF) in a range of 0.0033-0.04 Hz, low frequency (LF) in a range of 0.04-0.15 Hz and high frequency (HF) in a range of 0.15-0.40 Hz. The LF and HF bands are known to reflect changes in autonomic modulation of the heart period and can be represented in normalized units to emphasize the balanced behavior of the two branches of the autonomic system [29]. An LF/HF ratio can be computed from the LF and HF spectral powers, with the underlying assumption being that LF power is mainly generated by the sympathetic nervous system and HF power is mainly produced by the parasympathetic nervous system. Thus, a low LF/HF ratio can be observed in tend-and-befriend behaviors (parasympathetic dominance) and a high LF/HF ratio occurs in fight-or-flight situations (sympathetic dominance) [30].
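Given the LF and HF band powers, the normalized units and the LF/HF ratio described above follow directly. The sketch below (function name and example powers are ours, not from the platform) illustrates the common convention of normalizing each band by the combined LF + HF power:

```python
def hrv_frequency_metrics(lf_power, hf_power):
    """Normalized LF and HF (in NU) and the LF/HF ratio from absolute
    band powers. Normalization divides each band by the total LF + HF
    power (VLF is excluded, as usual for short-term recordings)."""
    total = lf_power + hf_power
    lf_nu = 100.0 * lf_power / total
    hf_nu = 100.0 * hf_power / total
    return lf_nu, hf_nu, lf_power / hf_power

# Illustrative powers only: LF twice the HF power -> sympathetic dominance.
lf_nu, hf_nu, ratio = hrv_frequency_metrics(lf_power=600.0, hf_power=300.0)
```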
Recently, comprehensive analyses of effect sizes for seven distinct resting-state HRV measures [31] were made, namely HF, LF, VLF, LF/HF, RMSSD, SDNN and mean NN interval duration. Results suggest that patients with depressive disorders are likely to display small reductions in HF, LF, VLF, RMSSD (largest effect size), SDNN and NN intervals and an increase in the LF/HF ratio. The authors conclude that depressive states are not associated with alteration of a specific indicator of HRV, but rather with abnormalities in several time- and frequency-domain measures. In [32], the authors found correlations between 15-min ECG-based HRV measures and the clinical state of depressive disorders. With 62 depressed patients and 65 non-depressed controls and testing several HRV measures, they observed a decreased HF%, among others, to be highly correlated with a diagnosis of depressive disorders.
To the best of our knowledge, this work is the first to propose a real-time system that displays depression-related biomarkers to the clinician using only the subject's teleconsultation video as input. Furthermore, this is the first work to combine eye-blinking rate with HRV for assessment of depression, in contrast to other works which either use these features alone or with other depression-related biomarkers.

III. SYSTEM ARCHITECTURE
The system is composed of two main parts: the web application and the video processing. In the web application, the clinician is able to start a teleconsultation and then invite the patient to join. After the patient joins the teleconsultation, the clinician runs the video processing script. This script performs the biomarker analysis and stores the results in the database in the background during the teleconsultation. Finally, the clinician is able to select in the web application which biomarker analysis results to visualize. A general overview of the system architecture can be seen in Fig. 1a.

A. WEB APPLICATION
The web application includes authentication, video meeting integration and charts for data visualization. The technological stack we used to develop the web application was Bootstrap, JavaScript and Firebase. We used the Bootstrap framework to develop the graphical layout and the SB Admin 2 Bootstrap template was used as a starting layout for the web application. Furthermore, the authentication was done using Firebase Authentication with an email address/password sign-in method. The web application was deployed using Firebase Hosting.
We chose Cloud Firestore from Firebase as our application's database, which was used to store the results of the biomarker analysis. There is a collection called users that stores the patient's identification string and username, which follows the anonymous pattern PSI.TEC.APP1 (for patient 1). The association between the username and the patient's real name is kept in an Excel file stored on the clinician's computer. There is another collection called consultations that stores data about all the consultations performed in the platform. For each consultation we store the identification strings of the patient and the clinician, the date of the consultation and the results obtained from the biomarker analysis. This data allows retrospective comparison between consultations and examination of the overall longitudinal course of the patient's depression.
In order to allow the creation of a teleconsultation inside the web app we used the Zoom Web SDK. First, we created a Zoom developer account in Zoom Marketplace and then registered it as a JSON Web Tokens (JWT) App. After the registration, we obtained the API key and API secret which allow the authentication in the Zoom server when the request to start a teleconsultation is made. Then, we integrated the Zoom Web SDK by including the dependencies in the teleconsultation HTML file. Finally, we built a structure for the process of meeting creation with three components: 1) browser client; 2) proxy server; 3) Zoom server. The proxy server is a Firebase Cloud Function that exists because the API key and secret should not be stored in the browser client for security reasons. The process consists of HTTP messages exchanged between the components, with the requests to start the meeting and the responses carrying the necessary information to do so (Fig. 1b). These messages carry the necessary authentication tokens to allow the communication with Firebase and with Zoom. The settings of the meeting are defined in the proxy server.
We added the charts using D3.js. The eye-blinking rate, the SDNN and the HR are represented by line charts in the web application, while the LF/HF ratio is represented by a doughnut chart. In the line chart representation, the vertical axis represents the magnitude of the biomarker, such as the number of blinks. The horizontal axis represents the time instant at which the biomarker was obtained, with the axis ticks displayed every minute. In addition to the line, every data point is also displayed as a circle with a radius larger than the line's width. If the mouse is hovered over this circle, a tooltip shows the value and the instant associated with that circle in a blue box. An example of the charts' visualization can be seen in Fig. 1c.

B. VIDEO PROCESSING
The video processing consists of three main steps: 1) capture the frames from teleconsultation video; 2) perform the biomarker analysis; and 3) send results of the analysis to the database.
The processing is done by a Python script running in the background on the computer, which must be started by the clinician. In the main loop, the process goes from frame capture (Step 1) to biomarker analysis (Step 2) sequentially and uninterruptedly. Each frame is obtained by capturing the portion of the screen containing the teleconsultation video (in this case the Python library MSS was used). Every time a frame is captured, each biomarker analysis is run on it. Finally, the results of the biomarker analysis are sent (Step 3) every 15 seconds to the database in parallel, so the main loop is not affected.
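The three-step loop above can be sketched as follows. The `analyze_frame` and `flush_results` callbacks are hypothetical stand-ins for the platform's biomarker analysis and database write, and the capture region is a placeholder, as the real coordinates depend on the Zoom window layout:

```python
import time

def should_flush(last_flush: float, now: float, interval_s: float = 15.0) -> bool:
    """True when 15 s have elapsed and results should be sent to the database."""
    return now - last_flush >= interval_s

def capture_loop(analyze_frame, flush_results, region=None):
    """Capture -> analyze -> periodically flush, as in Steps 1-3.
    `analyze_frame(frame)` and `flush_results()` are caller-supplied
    callbacks (hypothetical names); `region` selects the part of the
    screen showing the patient's video."""
    import mss          # screen-capture library used by the platform
    import numpy as np
    region = region or {"top": 0, "left": 0, "width": 640, "height": 480}
    last_flush = time.monotonic()
    with mss.mss() as sct:
        while True:
            frame = np.array(sct.grab(region))[:, :, :3]  # BGRA -> BGR
            analyze_frame(frame)                          # Step 2
            now = time.monotonic()
            if should_flush(last_flush, now):
                flush_results()   # Step 3, done in parallel in the platform
                last_flush = now
```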

IV. BIOMARKER ANALYSIS
The biomarker analysis is the stage in which we take a series of frames and extract measures that may be clinically relevant. In this work, the focus was on measures that can be extracted from the face, more specifically the eye-blinking rate, the heart rate and its variability.
For every frame, the first two steps are face and landmark point detection, needed to compute the biomarkers. The face is detected using the dlib frontal face detector [33]. The detector receives a frame in grayscale and returns a square that fits the face of the subject. Then, the detected face is given as input to the dlib facial landmark detector [33], which outputs 68 facial landmark points, as can be seen in Fig. 2. These points can be indexed to identify specific areas of the face such as the eyes, nose, mouth and eyebrows, which can then be used to extract biomarkers. The facial landmark detector [34] is pre-trained on the 300-W dataset [35]. This dataset covers a large variation of identity, expression, illumination conditions, pose, occlusion and face size, which is relevant during a teleconsultation due to the diversity of patients and possible variations in the consultation environment.
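A minimal sketch of this detection step with the dlib API is shown below; the eye index ranges follow the standard 68-point annotation, and the model file name is the usual pre-trained predictor distributed for the 300-W landmarks:

```python
import numpy as np

# 68-point landmark indexing (0-based, as returned by dlib):
# left eye -> points 36-41, right eye -> points 42-47.
LEFT_EYE = list(range(36, 42))
RIGHT_EYE = list(range(42, 48))

def detect_landmarks(frame_gray, detector, predictor):
    """Return a (68, 2) array of landmark coordinates for the first
    detected face, or None if no face is found.

    `detector` is dlib.get_frontal_face_detector() and `predictor` a
    dlib.shape_predictor loaded with the pre-trained
    'shape_predictor_68_face_landmarks.dat' model."""
    faces = detector(frame_gray, 0)   # 0 = no upsampling, for speed
    if not faces:
        return None
    shape = predictor(frame_gray, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])
```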

A. EYE-BLINKING RATE
The eye-blinking rate is the number of blinks per time interval, which was defined as 15 seconds in this work. In order to obtain this value, three main steps were followed: 1) signal acquisition; 2) signal filtering; and 3) blink detection.
There are six landmark points per eye, as can be seen in Fig. 3. These landmark points represent (x, y) coordinates in pixels in the frame, which means that distances can be calculated between them. As the eyes open and close, the vertical distances vary over time, as noted by Soukupová and Čech [36]. In their work, a ratio of how open or closed the eye is, the Eye Aspect Ratio (EAR), is measured using the following formula:

EAR = (||p2 − p6|| + ||p3 − p5||) / (2 ||p1 − p4||),

where p1, . . . , p6 are the six landmark points, as represented in Fig. 3. As the eye closes, the vertical distances represented by the numerator decrease (Fig. 3a), thus decreasing the EAR. On the other hand, as the eye opens, the vertical distances increase (Fig. 3b) and so does the EAR. The EAR is calculated for each eye and the two values are then averaged. Calculating the EAR for each frame generates an EAR signal, as can be seen in Fig. 4. The EAR signal is noisy and needs to be filtered. We chose a low-pass filter as it is fast and inexpensive, which is necessary in real-time applications. The eye-blink duration has been characterized as lasting 572 ± 25 ms [38] and thus the cutoff frequency was heuristically set to 3 Hz.
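The EAR computation can be sketched as follows; the toy eye coordinates are ours, chosen only to illustrate the open/closed contrast (in the platform, the average of the two per-eye EAR values is used):

```python
import numpy as np

def eye_aspect_ratio(eye):
    """EAR for one eye, given its six landmark points p1..p6 as an
    array of shape (6, 2), ordered as in Soukupova & Cech."""
    p1, p2, p3, p4, p5, p6 = eye
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

# Open eye: large vertical distances -> high EAR; closed eye: EAR near 0.
open_eye = np.array([(0, 0), (1, -2), (3, -2), (4, 0), (3, 2), (1, 2)], float)
closed_eye = np.array([(0, 0), (1, 0), (3, 0), (4, 0), (3, 0), (1, 0)], float)
```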
To identify the blinks we use a detector with an adaptive threshold. Every time the EAR is smaller than the threshold, the detector predicts a new blink. The average open EAR and the EAR when a subject blinks vary between individuals, and thus a threshold that adjusts automatically given the past signal characteristics is ideal for this detection. The detection threshold is updated given the previous maximum and minimum EAR values. More specifically,

d_thr[n] = EAR_min[n] + r_thr (EAR_max[n] − EAR_min[n]),

where d_thr[n] is the detection threshold at frame n, EAR_max[n] is the maximum EAR in the filtered signal until frame n, EAR_min[n] is the minimum in the filtered signal until frame n, and r_thr is a parameter that controls how much of the difference is added to EAR_min[n]. In Fig. 5, the threshold can be observed in blue, as the sum to the red line of 20% of the difference between the green and red lines.
The detection threshold requires calibration with the first blink; it is therefore initialized only once the difference between EAR_max[n] and EAR_min[n] exceeds a certain value that we call the starting window (s_w). The r_thr and s_w parameters were chosen experimentally, as explained below.
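The adaptive-threshold detector described above can be written as a streaming class (our formulation; the platform's implementation may differ in detail). Detection stays disabled until the observed EAR range exceeds s_w, which implements the calibration with the first blink:

```python
class AdaptiveBlinkDetector:
    """Streaming blink detector with an adaptive threshold.

    r_thr: fraction of the (max - min) EAR range added to the minimum.
    s_w:   starting window; detection begins only once the observed
           EAR range exceeds this value (i.e. after the first blink).
    """
    def __init__(self, r_thr=0.2, s_w=0.07):
        self.r_thr, self.s_w = r_thr, s_w
        self.ear_max = float("-inf")
        self.ear_min = float("inf")
        self.in_blink = False

    def update(self, ear):
        """Feed one filtered EAR sample; return True on a new blink onset."""
        self.ear_max = max(self.ear_max, ear)
        self.ear_min = min(self.ear_min, ear)
        if self.ear_max - self.ear_min < self.s_w:
            return False                    # still calibrating
        thr = self.ear_min + self.r_thr * (self.ear_max - self.ear_min)
        blink_started = ear < thr and not self.in_blink
        self.in_blink = ear < thr
        return blink_started
```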

B. HEART RATE AND HEART RATE VARIABILITY
The extraction of real-time HR and HRV from video is accomplished in seven steps: 1) Region of Interest (ROI) selection; 2) RGB signal extraction; 3) resampling; 4) rPPG signal extraction; 5) filtering; 6) peak detection; and 7) HR and HRV estimation.
The ROI selection is based on facial landmarks and makes it possible to segment skin regions with an accessible pulse signal. Here, the regions comprising the cheeks, the nose and the forehead are considered, while the regions containing the beard and the eyes are removed (Fig. 6).
The starting point is the computation of the average intensity of each color component within the ROI for each frame n, a_n = (r_n, g_n, b_n). Because the Zoom session does not guarantee a uniform frame rate, a resampling step is introduced. To obtain a uniformly sampled color average, linear interpolation is used.
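The resampling step can be sketched with per-channel linear interpolation; the 30 Hz target rate in the default is an assumption for illustration, not a platform constant:

```python
import numpy as np

def resample_uniform(timestamps, rgb, fs=30.0):
    """Resample an irregularly-timed RGB mean signal onto a uniform grid.

    timestamps: (N,) frame capture times in seconds (non-uniform).
    rgb:        (N, 3) per-frame ROI color averages (r, g, b).
    fs:         target uniform sampling rate in Hz (assumed value).
    """
    t_uniform = np.arange(timestamps[0], timestamps[-1], 1.0 / fs)
    resampled = np.column_stack(
        [np.interp(t_uniform, timestamps, rgb[:, c]) for c in range(3)]
    )
    return t_uniform, resampled

# Irregular timestamps with linearly growing intensities.
t_in = np.array([0.0, 0.5, 1.0])
rgb_in = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]], float)
t_out, rgb_out = resample_uniform(t_in, rgb_in, fs=4.0)
```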
To extract the one-dimensional rPPG signal from the three-dimensional RGB signal, the POS method is used [39]. This is a chrominance-based method, which means it uses skin-tone knowledge a priori, thus requiring less knowledge of the rPPG signature while staying tolerant to distortion. The POS method is a sliding-window algorithm. Here, the window size, POS_l = f_s · POS_s, was chosen to correspond to POS_s = 1.6 seconds, which encapsulates one cardiac cycle, as recommended by the authors. This choice introduces only a 1.6-second delay from acquiring the frame to extracting the corresponding facial blood volume, which is short enough for real-time applications.
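A compact sketch of the POS algorithm follows. It uses the projection matrix and overlap-add scheme from [39], but it is our simplified reading of the method, not the platform's exact code:

```python
import numpy as np

def pos_rppg(rgb, fs, win_s=1.6):
    """Plane-Orthogonal-to-Skin (POS) pulse extraction, sketched from [39].

    rgb: (N, 3) uniformly sampled ROI color means; fs: sampling rate.
    Returns a 1-D rPPG signal of length N (overlap-added windows)."""
    n_frames = len(rgb)
    l = int(win_s * fs)                   # window length, ~one cardiac cycle
    h = np.zeros(n_frames)
    proj = np.array([[0.0, 1.0, -1.0],    # projection plane orthogonal
                     [-2.0, 1.0, 1.0]])   # to the skin-tone direction
    for n in range(l, n_frames + 1):
        c = rgb[n - l:n]
        cn = c / c.mean(axis=0)           # temporal normalization
        s = cn @ proj.T                   # two chrominance signals
        alpha = s[:, 0].std() / (s[:, 1].std() + 1e-12)
        p = s[:, 0] + alpha * s[:, 1]     # alpha-tuned combination
        h[n - l:n] += p - p.mean()        # overlap-add
    return h
```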
The next step is filtering the raw rPPG signal. Filtering is performed with a second-order Infinite Impulse Response (IIR) Butterworth bandpass filter, [0.8, 2.5] Hz. These cut-off frequencies correspond to a human heart rate range of 48 to 150 beats per minute (bpm). See Fig. 7 for the rPPG signal and the respective filtering.
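This band-pass step maps onto SciPy directly; the sketch below (function name ours) uses a causal `lfilter`, since a zero-phase `filtfilt` would be non-causal and unsuitable for the real-time setting:

```python
from scipy.signal import butter, lfilter

def bandpass_rppg(signal, fs, low=0.8, high=2.5, order=2):
    """Band-pass the raw rPPG signal to the plausible heart-rate band
    (0.8-2.5 Hz, i.e. 48-150 bpm) with a Butterworth IIR filter.
    A causal filter is applied so the step works in real time."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return lfilter(b, a, signal)
```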
To detect the peaks of the rPPG signal, the mountaineer's method is used [40], which does not depend on the amplitude of the signal. From the signal processing point of view, this is a point-by-point, windowless algorithm, which is very convenient for a real-time application.
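For illustration only, the following streaming local-maximum detector captures the point-by-point, amplitude-independent spirit of this step; it is a simplified stand-in, not the exact mountaineer's method of [40]:

```python
class StreamingPeakDetector:
    """Point-by-point peak detector: climbs while the signal rises and
    reports a peak one sample after it occurs, when the signal starts
    descending. No amplitude threshold and no window are involved."""
    def __init__(self):
        self.prev = None
        self.rising = False
        self.index = -1

    def update(self, value):
        """Feed one sample; return the index of a newly found peak, or None."""
        self.index += 1
        peak = None
        if self.prev is not None:
            if value > self.prev:
                self.rising = True
            elif value < self.prev and self.rising:
                peak = self.index - 1   # previous sample was a local maximum
                self.rising = False
        self.prev = value
        return peak
```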
From the detected peaks, we extract Pulse-to-Pulse (PP) intervals as the difference between pairs of consecutive peaks:

PP[n] = PT[n] − PT[n − 1],

where PP[n] is the n-th time interval and PT[n] and PT[n − 1] are the n-th and (n − 1)-th peak times, respectively. Here, photoplethysmography-based PP intervals are taken as the equivalent of electrocardiography-based NN intervals. Finally, statistical and spectral features can be computed from the collection of PP intervals falling within a given time window.
Assuming that the PP intervals are in seconds, the heart rate, in bpm, is computed as follows [41]:

HR = 60 / ⟨PP⟩, with ⟨PP⟩ = (1/N) Σ_{n=1}^{N} PP[n],

where ⟨PP⟩ is the mean of the PP intervals and N is the number of PP intervals. The choice of time window should reflect the user's requirement (e.g. instantaneous HR/HRV, long-term HR/HRV).
Two time-domain HRV metrics are considered: SDNN and RMSSD. For a given time window with N PP intervals, the formulas are as follows [29]:

SDNN = √[ (1/(N − 1)) Σ_{n=1}^{N} (PP[n] − ⟨PP⟩)² ],
RMSSD = √[ (1/(N − 1)) Σ_{n=2}^{N} (PP[n] − PP[n − 1])² ].

Three frequency-domain metrics are considered: normalized LF power, normalized HF power and the LF/HF ratio. The Lomb-Scargle method [42] is used for deriving the frequency-domain estimates. This method is less computationally expensive than others [43], which suits the real-time scenario.
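The time-domain metrics can be sketched as follows (the population rather than the sample standard deviation is used for SDNN here, a minor convention choice); for the frequency-domain metrics, SciPy's `scipy.signal.lombscargle` provides the Lomb-Scargle periodogram:

```python
import numpy as np

def hrv_time_domain(peak_times):
    """HR (bpm), SDNN (ms) and RMSSD (ms) from rPPG peak times in
    seconds. PP intervals play the role of ECG NN intervals."""
    pp = np.diff(peak_times)              # PP[n] = PT[n] - PT[n-1]
    hr = 60.0 / pp.mean()                 # HR = 60 / <PP>
    sdnn = 1000.0 * pp.std()              # spread of the PP intervals
    rmssd = 1000.0 * np.sqrt(np.mean(np.diff(pp) ** 2))  # successive diffs
    return hr, sdnn, rmssd

# Peaks exactly one second apart -> 60 bpm and zero variability.
hr, sdnn, rmssd = hrv_time_domain(np.arange(0.0, 10.0, 1.0))
```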

V. BIOMARKER VALIDATION
The proposed methods were validated in the UBFC-RPPG video dataset [37]. The dataset includes 40 videos with a frame rate of 30 frames-per-second and a resolution of 640 × 480. In the videos the subject sits in front of the camera (about 1 meter away from the camera) with his/her face visible for more than one minute, which is a similar setting to a teleconsultation. The dataset includes ground truth signals acquired using finger PPG, running parallel to the video capture. Even though the dataset was created for the rPPG field, it is also a good fit for blinking research. In this sense, we manually labeled the dataset, providing for each video a text file containing the frames matching blinking events. We make this ground truth available at Zenodo [44].

A. EYE-BLINKING
In order to evaluate the performance of the blink detection model, we compared the blinking ground truth of the dataset with our model's predictions. We obtained the ground truth by manually retrieving the number of blinks and the frame associated with each blink. We considered the frames where the eyes were most closed, which coincide with the negative peaks of the signal. Then, our blink detection method created a list with the predicted blinking events for each video. In order to check whether ground truth and predicted events coincide, a five-frame temporal tolerance window was used, corresponding approximately to the duration of a blink.
When comparing both, three situations were considered: true positives, false negatives and false positives. A true positive occurs when the model predicts a blink that did occur. A false positive occurs when the model predicts a blink that did not occur. A false negative occurs when a blink occurred but the model did not predict it. True negatives, which occur whenever the model correctly predicts no blink, were ignored, as they would yield a very large number of events. In Fig. 8 we can see two true positive events in Subject 1, with the predictions represented as red peaks and the ground truth represented as green peaks. The metrics used to evaluate the performance are precision, recall and F1-score, as they are not affected by the imbalance in events. The performance of the proposed detector was assessed with five different r_thr values and seven different s_w values. We concluded that the best detector uses s_w = 0.07 and r_thr = 0.2, as it was the one presenting the highest F1-score. We only took into account the predicted events after the first true blink, ignoring the false positives that occur before the detection threshold calibration.
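The matching and scoring procedure can be sketched as follows; this is our formulation of the comparison described above, using greedy one-to-one matching within the tolerance window:

```python
def match_blinks(pred_frames, gt_frames, tol=5):
    """Greedily match predicted and ground-truth blink frames within a
    +/- tol frame window; returns (tp, fp, fn)."""
    gt_left = list(gt_frames)
    tp = 0
    for p in pred_frames:
        hit = next((g for g in gt_left if abs(g - p) <= tol), None)
        if hit is not None:
            tp += 1
            gt_left.remove(hit)   # each ground-truth blink matched once
    fp = len(pred_frames) - tp    # predictions without a matching blink
    fn = len(gt_left)             # blinks the model missed
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```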
The quality of the facial landmark prediction in the dataset decreased for subjects who wear glasses, have hair occluding the eyes or half-close the eyes for long periods of time. Considering this, the model's performance was assessed on the whole dataset (Dataset A1, 39 subjects) and on two subsets. The first excludes subjects with glasses (Dataset A2, 29 subjects) and the second additionally excludes subjects affected by the remaining conditions (Dataset A3, 25 subjects). The average (AVG) precision was 75% in Dataset A1, 84% in Dataset A2 and 94% in Dataset A3. These results can be seen in Table 1.
Examining the precision results of each subject individually, we observed that the standard deviation (SD) was very large (Table 1). In order to better understand this, the precision results of each individual for the three datasets were sorted in descending order and plotted in Fig. 9. In this plot, it can be seen that the percentage of subjects for which the model achieved 100% precision is approximately 60% in Dataset A1, 70% in Dataset A2 and 80% in Dataset A3. This shows that the proposed detector performs very well on subjects under normal conditions. The obtained results were not compared to other works because, to the best of our knowledge, this is the first work to validate a blinking method on the UBFC-RPPG dataset.

B. HEART RATE AND HEART RATE VARIABILITY
The performance of the proposed rPPG method is reported in terms of the mean absolute error between estimated and ground-truth HR and HRV features. For each video of the dataset, we computed the average HRV metrics for the full record and the HR for 15-second windows, 30-second windows and the full record. This was done for both the face rPPG signal (estimations) and the finger PPG signal (ground truth). Table 3 compares the proposed method against [41], SSF [45], CHROM [45], [46], [47] and PulseGAN [47].
Then, we computed the mean absolute error between the HR estimations and the ground truth HR values (Table 2).
Comparing against the state of the art in HR estimation (Table 3), the proposed method (mean absolute HR error of 2.16 bpm) performs extremely well, competing with the best previous non-supervised and supervised methods. Regarding HRV estimation, the method performs well when compared to state-of-the-art reports (see Table 3). For SDNN and RMSSD, the method performs in line with the state of the art, in particular [41], which is the best-performing method in the literature. For %LF (NU) and LF/HF estimation, the proposed method outperforms all reported methods.
According to [30], average short-term human HRV features have the following ranges: SDNN, 32-93 ms; RMSSD, 19-75 ms; %LF (NU), 30-65; and LF/HF, 1.1-11.6. For SDNN, RMSSD and %LF, the proposed method performs with an average error of 20.5 ms, 39.2 ms and 16.9 NU, respectively. This suggests that the method can only provide a qualitative assessment of these features. For the LF/HF ratio, the mean error is the smallest compared to its respective human range, making it the most reliable HRV biomarker produced by the proposed method.

VI. CONCLUSION
The psychiatric approach to depressive disorders is based on an interaction between patient and clinician that relies on subjective measures for diagnosis and follow-up. Biomarkers can aid this approach by providing objective measurements. This paper presents a telemedicine platform that provides depression biomarkers in real time during a teleconsultation. The platform allows the clinician to start a teleconsultation with a patient using the teleconference platform Zoom and to visualize, in near real time, the results of the blinking and heart rate variability analysis. The results are shown in charts and the clinician is able to select which results to visualize. The results of the analysis are stored in a database and can be analysed after the teleconsultation ends.
The video processing methods were validated on the UBFC-RPPG dataset. The videos in the UBFC-RPPG dataset resemble a teleconsultation setting, which is why we chose this dataset to develop our method. HRV estimation from webcam video looks promising, and the proposed real-time method performs very well compared to the state of the art. Hence, rPPG can become an important tool for teleconsultation and research on depressive disorders.
Future work will aim to test our psychiatry telemedicine platform and validate the incorporated biomarker measurements in patients diagnosed with depression. Emphasis will also be given to studying the effect of head movement on the proposed distance-based (eye aspect ratio) and luminosity-based metrics (HR and HRV).