Introduction
Major depressive disorder (MDD) is linked to diminished social functioning, along with a variety of chronic physical conditions [1], [2]. MDD is estimated to have a lifetime prevalence ranging from 2 to 21% worldwide [3]. Despite widespread agreement that effective depression therapy is essential for a patient’s health as well as for lowering the global burden of disease, the burden has remained constant over the past few decades [4]. This is partially due to treatment options being chosen through trial and error, without methods to forecast the treatment response [5]. MDD recovery is a dynamic process and non-response is determined differently for various treatment measures [6]. A typical course of treatment for MDD may last up to 12 weeks or longer [7]. The delay in receiving effective treatment often translates into unnecessary personal suffering and burden. If there were methods to identify patients who will not respond in the first few weeks, this burden might be alleviated.
Although strong inter-individual variability can be seen in the recovery of symptoms after treatment, it is hypothesized that there are predictive patterns that are crucial for successful treatment [8]. A well-established finding in outpatients is that early treatment response (e.g., within 2 weeks) is highly predictive of the longer course [9]. A review by Kudlow et al. [10] found that people who show early partial symptomatic improvement are more likely to respond to therapy later on.
There is increasing evidence that digital phenotyping data collected throughout daily life can be used to monitor and manage MDD [11]. Sensors built into smartphones and wearable sensors can track core parts of daily living such as social communication, movement patterns, sleep, and physical exercise. These data can be used to extract characteristics corresponding to typical daily behaviors and thereby describe behavioral depression symptoms [12], [13], [14]. However, up until now, the majority of research on identifying and inferring depression has focused on routinely measuring depressive status (e.g., bi-weekly) [15]. One unavoidable challenge in digital phenotyping is preserving important behavioral information while processing and aggregating raw data that contain missing values [16]. Most previous clinical studies with digital phenotyping are correlation-based analyses [17], while learning-based methods are often applied to a more general population; moreover, their research questions did not involve predicting treatment response.
Therefore, we hypothesize that whether a treatment is effective can be inferred from the early trend of the treatment response, that this trend can be captured through digital phenotyping, and that the final treatment response can thereby be predicted weeks in advance. In this study, we utilized digital phenotyping with sequence modeling to predict treatment responses in people with MDD. Extending our earlier work, which utilized passively collected data to track and monitor mood stability in MDD, here we incorporate sequence modeling methods that handle missing values, together with clinically relevant features, for treatment response prediction. To the best of our knowledge, this is the first attempt to predict treatment response 10 weeks in advance using digital phenotyping and sequence modeling with missing values. We highlight the main contributions of our work as follows:
The feasibility of treatment response prediction in MDD based on passive sensing is investigated. The passive sensing data were collected over a year and constitute a considerably large dataset for clinical digital phenotyping research. The results show that passive sensing data aid the prediction task;
A sequence modeling method based on a variant of GRU with a decay mechanism is proposed to handle missing values in passive sensing;
A comprehensive evaluation shows that the proposed method achieves the best performance in comparison with other recurrent neural networks and machine learning classifiers, indicating that the missing pattern is informative;
A feature ablation study shows that the effectiveness of each feature set is comparable, but combining all features indeed improves the performance.
Related Work
A. Treatment Response Prediction in MDD
As a heterogeneous illness, MDD responds differently to a wide range of treatments, resulting in varying degrees of success [6]. Studies using neuroimaging techniques (EEG, MRI) point to characteristics that lead to varied responses to pharmacotherapy and psychotherapy [18], [19], [20]. However, no biomarker has yet been identified that is accurate enough for practical application [19], [21]. Ideally, a prediction method should be simple to perform, inexpensive, and ubiquitous, qualities that neuroimaging methods do not possess. Hornstein et al. evaluated the feasibility of using clinical and sociodemographic variables for treatment response prediction in depression and anxiety patients with machine learning [22]. Pedrelli et al. investigated the use of behavioral and physiological features obtained from wearable sensors for assessing depressive symptom severity [23]. The former only used sociodemographic and clinical variables, while the latter involved a relatively small sample (N = 31).
B. Machine Learning for Mental Health Through Digital Phenotyping
Depressive symptoms have been shown to be associated with daily behaviors such as phone usage, internet usage, movement, and sleep. Saeb et al. examined the connections between depression severity scores and passive sensing data, including location traces and phone usage, and discovered a strong relationship between depression severity and location characteristics such as the variation in sites visited and daily mobility patterns [24]. According to a study by [25], there is a direct correlation between internet usage and depression, with depressive symptoms being associated with much higher internet usage. Wang et al. found appreciable relationships between depression and sleep length, conversation duration, and frequency of co-locations [26]. Further examination of their data revealed substantial correlations between variations in depression ratings and characteristics such as sleep duration and speech duration. Machine learning analysis of passively collected data has shown potential for predicting treatment responses for individual patients, which could make treatment decisions more individually tailored and boost treatment effectiveness [19]. However, missing data arise frequently in digital phenotyping that involves active and passive sensing, as demonstrated by studies such as [27] and [28]. This missingness is particularly pronounced in patient populations [26] and may contain informative patterns. For example, empty values in call logs may result from certain participants using free calling apps like WeChat instead of traditional phone calls, and filling in missing values with averages may obscure important variations in social activities [29]. How to handle missing data remains a significant challenge; nevertheless, most previous statistical or network-based methods for digital phenotyping are either incapable of handling missing values or rely solely on simple average imputation [30].
C. Sequence Modeling With Missing Data
Digital phenotyping data are inherently sequential. Logistic regression (LR), naïve Bayes, support vector machines (SVM), and random forests (RF) are widely used machine learning algorithms in digital phenotyping. In several applications with sequential data, recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) [31] and the Gated Recurrent Unit (GRU) [32], have been shown to achieve state-of-the-art performance. When it comes to handling missing data, the simplest approach is to omit them and perform the analysis solely on the observed data. However, this does not work well when the missing rate is high and too few samples remain. Data imputation is another approach, which replaces missing entries with substitute values. Simple and effective techniques such as smoothing, interpolation, and splines, however, do not account for correlations between variables and may miss the intricate patterns needed for imputation. Recent advances in sequence prediction with missing values have been made with end-to-end deep learning algorithms [33]. Song et al., for instance, apply an attention-based strategy to healthcare applications without filling in any absent information [34]. Zhang et al. reduce the influence of uncorrelated dimensions with missing values to provide trustworthy features for prediction [35]. There has been a surge of Transformer-based solutions for time series forecasting [36], [37]; however, a recent study found that the performance demonstrated in previous work has little to do with the temporal relation extraction ability of the Transformer architecture [38]. There are inherent constraints in the application scenarios of digital phenotyping, including small sample sets and missing data. Therefore, how to achieve robust multivariate time series modeling and forecasting with small samples and high missing rates is worth studying. In this paper, we explore the modeling and prediction abilities of classical machine learning methods commonly used in digital phenotyping as well as deep learning models. We also utilize a variant of GRU, GRU-D [39], which introduces a decay mechanism to capture temporal correlations while modeling missing patterns.
Clinical Trial and Data Collection
In this section, we describe the clinical trial, data collection, and task definition in our study.
A. Participant Recruitment
The clinical trial (Trial Registration: Chinese Clinical Trial Registry ChiCTR1900021461; https://www.chictr.org.cn/showprojen.aspx?proj=36173) was designed for the management of MDD with mobile health technologies. This was a prospective multisite cohort study. The study was approved by the Independent Medical Ethics Committee of Beijing Anding Hospital (ethical approval no. 201917FS-2). In this paper, we focus on the prediction of treatment response. Participants were recruited from February 2019 to April 2020 from the outpatient clinics at 4 sites, including one psychiatric hospital and three psychiatric units in general hospitals. Participants were recruited according to the following eligibility criteria:
Outpatients aged 18-65 years, of either sex, meeting the diagnostic criteria for MDD in the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV), without any other psychotic symptoms, as assessed by MINI 5.0.0;
Primary school education or above, able to understand the content of the questionnaire;
Written informed consents were obtained;
No history of previously diagnosed bipolar disorder, schizophrenia, schizoaffective disorder, or mental disorders associated with other diseases;
No history of alcohol and substance dependence;
Not suffering from any serious physical disease that would make participation in this study unsuitable.
B. Data Collection
In outpatient clinics, clinicians informed patients who met the inclusion criteria about the study. Research assistants then informed the participants of the design of the app and the types of data being recorded. Each participant was given a wristband and guided through the installation of the self-developed app on their smartphone. All participants were requested to attend 4 follow-up visits and complete clinical assessments at 0, 2, 4, 8, and 12 weeks, which is analogous to a standard treatment course. No strict restriction was placed on their treatment; patients were treated psychopharmacologically according to the attending doctor's choice.
A previously developed app was used to passively record patients' daily phone usage, sleep, and step data with minimal human action. The design of the app was detailed in our previous work [40]. With users' consent, the app runs in the background to collect phone usage data, including call logs, text message logs, app usage logs, GPS, and screen status. In addition, users were asked to wear a digital wristband paired with the app to collect sleep, heart rate, and step count data. Strict confidentiality measures were implemented in the app to ensure data security and privacy protection. The app was evaluated for compatibility on more than 20 different phone models, including HUAWEI, Xiaomi, and OPPO, and was also tested on different versions of the Android operating system.
C. Task Definition
In this paper, we chose treatment response prediction as our primary target, which is of great clinical importance. If the treatment response can be predicted in advance, it would allow early and prompt adjustment of the treatment plan. The treatment responses were classified into two categories (labels): (1) responded (HAMD score decreased by at least 50% from baseline [41]); (2) stable or not responded. Specifically, our goal is to predict the treatment response of a patient with depression based on sequence modeling. Since we wish to have an early estimate of each patient's outcome, wearable data collected between the baseline and the first follow-up visit at two weeks, together with the two clinical assessment scores, were used as input.
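As a minimal illustration of this labelling rule, the Python sketch below derives the binary response label from the HAMD scores; the column names (hamd_baseline, hamd_week12) are hypothetical and this is not the study's actual preprocessing code.

import pandas as pd

def label_response(scores: pd.DataFrame) -> pd.Series:
    """Label 1 if the HAMD score dropped by at least 50% from baseline, else 0."""
    reduction = (scores["hamd_baseline"] - scores["hamd_week12"]) / scores["hamd_baseline"]
    return (reduction >= 0.5).astype(int)

# Example: 24 -> 10 is a 58% reduction (responded); 18 -> 15 is a 17% reduction (not responded).
print(label_response(pd.DataFrame({"hamd_baseline": [24, 18], "hamd_week12": [10, 15]})))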
D. Participant Flow
The flow of participants in this study is shown in Figure 1. At baseline, 358 MDD patients participated in the study. We excluded participants who were absent from more than 2 follow-up visits or who missed the last visit at the 12th week.
At the baseline assessment, we gathered a range of demographic information and clinical characteristics (detailed in Table I). All clinical assessments were administered by qualified clinicians. After dividing subjects into two groups, a multivariate analysis of variance (MANOVA) test was performed. Significance was reported as exact p-values rather than as inequality statements (e.g., p < 0.05).
Figure 2. Distribution of HAMD scores between the two groups: (a) assessed at baseline; (b) assessed at the 12th week.
Data Processing and Feature Extraction
Figure 3 illustrates the overall pipeline. Features with high-level behavioral meaning were extracted from the raw data. Four categories of features were defined and extracted, all on a daily basis (detailed in Table II).
Figure 3. Pipeline for data processing, feature extraction, and treatment response prediction.
Call logs are widely believed to be a key signal of one's social life. For each type of phone call (incoming, outgoing, missed, or rejected), the mean and SD of the call start times (encoded numerically), the mean and SD of call durations, and the number and entropy of phone calls were extracted as features.
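To illustrate, daily call-log features of this kind could be computed roughly as in the sketch below; the input schema (timestamp, contact, duration_s) is a hypothetical simplification, and the snippet handles a single day and a single call type.

import numpy as np
import pandas as pd

def daily_call_features(calls: pd.DataFrame) -> dict:
    """Features from one day's calls of one type: count, mean/SD of start time
    (hour of day) and duration, and entropy over contacts (hypothetical schema)."""
    start_hours = calls["timestamp"].dt.hour + calls["timestamp"].dt.minute / 60.0
    p = calls["contact"].value_counts(normalize=True).to_numpy()
    return {
        "count": len(calls),
        "time_mean": start_hours.mean(),
        "time_sd": start_hours.std(),
        "dur_mean": calls["duration_s"].mean(),
        "dur_sd": calls["duration_s"].std(),
        "entropy": float(-(p * np.log(p)).sum()),  # low entropy: calls concentrate on few contacts
    }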
Phone usage was estimated using screen on and off status as a proxy [43]. The frequency and duration of smartphone usage in a day were computed from the on and off status of the screen. In addition, the duration of phone usage in each period of the day was calculated, along with the ratio of phone usage duration in each period (6-12 pm, 12-6 pm, 6-0 am) to the whole day, and the earliest and latest phone usage times.
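A rough sketch of how such usage sessions could be reconstructed from screen events is given below; it assumes a simplified event log with alternating on/off rows and a hypothetical schema, as an illustration rather than the study's exact procedure.

import pandas as pd

def usage_sessions(events: pd.DataFrame) -> pd.DataFrame:
    """Pair each screen-on with the next screen-off to form usage sessions.
    `events` has columns ['timestamp', 'status'] with status in {'on', 'off'},
    sorted by time (hypothetical schema)."""
    ons = events.loc[events["status"] == "on", "timestamp"].reset_index(drop=True)
    offs = events.loc[events["status"] == "off", "timestamp"].reset_index(drop=True)
    n = min(len(ons), len(offs))
    sessions = pd.DataFrame({"start": ons[:n], "end": offs[:n]})
    sessions["duration_s"] = (sessions["end"] - sessions["start"]).dt.total_seconds()
    return sessions

def period_ratio(sessions: pd.DataFrame, start_h: int, end_h: int) -> float:
    """Share of the day's usage time whose session start falls in [start_h, end_h)."""
    total = sessions["duration_s"].sum()
    in_period = sessions["start"].dt.hour.ge(start_h) & sessions["start"].dt.hour.lt(end_h)
    return float(sessions.loc[in_period, "duration_s"].sum() / total) if total else 0.0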
For app usage, apps were grouped into 4 categories: social apps, content-providing apps, shopping apps, and entertainment apps. For each category, the total duration of usage, the total number of uses, the duration in each period of the day, and the ratio of that category's usage to all app usage were calculated.
Sleep and step data were collected with the wearable sensors embedded in the wristband. For sleep data, the duration and ratio of both light and deep sleep were calculated. The user's wake-up and sleep times were also taken into account to estimate their daily routine. For step data, the total step count and the maximum step count within any successive 5-minute period were extracted.
Note that text messages were not included, since texting is no longer widely used in China and has largely been replaced by instant messaging apps such as WeChat. In total, 71 features were extracted. These features are assumed to be potent indicators of social disengagement and decreased interest or enjoyment in nearly all activities, particularly social and professional ones [44].
A. Handling of Missing Data
Missing data can be caused by various reasons. For example, data from both the smartphone and the wearable sensors can be lost for technical reasons, such as the phone or app not working, the data not being transferred promptly, or the server being temporarily down. Missing data can also arise from user-related issues, which may be associated with depressed mood. We acknowledge that these two types should be handled differently; however, it is technically difficult to distinguish them. When a data loss occurred, all related feature items were lost. Rather than simply disregarding such days, because we cannot tell whether the data were never recorded or the behavior never occurred, we attempted to encode these missing patterns in our model. The threshold of 4 missing days within two weeks (28.57%) was determined empirically; it seems reasonable while retaining enough samples for analysis. In the end, we were left with roughly 101-175 participants for each task setting; the exact numbers differed due to the availability of the data (Figure 1 gives a detailed depiction). Otherwise, we did not remove data or participants, since we could not be sure whether the missingness was due to non-semantic reasons. While missing data are almost unavoidable in digital phenotyping for the above reasons, modeling without manual data removal helps ensure the robustness of a model.
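Concretely, the day-level handling described above can be summarized as in the sketch below (variable names and the 14x71 layout are illustrative): a day without sensed data is kept as a missing row with an all-zero mask rather than being imputed, and a participant is excluded only when more than 4 of the 14 days are missing.

import numpy as np

MAX_MISSING_DAYS = 4  # the "4 days missing in two weeks" threshold (28.57%)

def build_sequence(daily_features: dict, n_days: int = 14, n_feats: int = 71):
    """Map {day index: feature vector} to (values, mask); absent days stay missing."""
    values = np.full((n_days, n_feats), np.nan)
    mask = np.zeros((n_days, n_feats), dtype=np.float32)
    for day, vec in daily_features.items():
        values[day] = vec
        mask[day] = 1.0
    return values, mask

def keep_participant(mask: np.ndarray) -> bool:
    """Exclude a participant if more than MAX_MISSING_DAYS days have no data at all."""
    missing_days = int((mask.sum(axis=1) == 0).sum())
    return missing_days <= MAX_MISSING_DAYS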
B. Sequence Modeling for Digital Phenotyping
Besides machine learning methods commonly used in digital phenotyping (e.g., SVM, LR, and RF) and influential deep learning models (e.g., LSTM and GRU), a variant of GRU, namely GRU-D [39], was also adopted for sequence modeling. GRU-D introduces decay rates into the conventional GRU to control a decay mechanism. To exploit informative missing patterns, the decay rates differ from variable to variable, depending on the underlying characteristics of the features, and are learned from the training data. Two representations of informative missingness are considered. The masking vector indicates whether a variable is observed:\begin{align*} \mathrm{Mask}_{t}^{f}=\begin{cases} 0, & \text{if } x_{t}^{f} \text{ is missing}\\ 1, & \text{otherwise}\end{cases} \tag{1}\end{align*}
and the time interval $\delta_{t}^{f}$ counts how many days have elapsed since variable $f$ was last observed:\begin{align*} \delta_{t}^{f}=\begin{cases} 1+\delta_{t-1}^{f}, & t>1,\ \mathrm{Mask}_{t-1}^{f}=0\\ 1, & t>1,\ \mathrm{Mask}_{t-1}^{f}=1\\ 0, & t=1\end{cases} \tag{2}\end{align*}
The modified GRU functions are:\begin{align*} \begin{cases} r_{t}=\sigma\left(W_{r}\hat{x}_{t}+U_{r}\hat{h}_{t-1}+V_{r}m_{t}+b_{r}\right)\\ z_{t}=\sigma\left(W_{z}\hat{x}_{t}+U_{z}\hat{h}_{t-1}+V_{z}m_{t}+b_{z}\right)\\ \tilde{h}_{t}=\tanh\left(W\hat{x}_{t}+U\left(r_{t}\odot\hat{h}_{t-1}\right)+Vm_{t}+b\right)\\ h_{t}=\left(1-z_{t}\right)\odot\hat{h}_{t-1}+z_{t}\odot\tilde{h}_{t}\end{cases} \tag{3}\end{align*}
where $m_{t}$ is the masking vector, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication.
A decay rate is introduced for both the input and the hidden state; it is parameterized so that it lies in $(0,1]$ and decreases monotonically with the time since the last observation:\begin{equation*} \gamma_{t}=\exp\left\{-\max\left(0,W_{\gamma}\delta_{t}+b_{\gamma}\right)\right\} \tag{4}\end{equation*}
The input decay replaces a missing value with a combination of the last observed value $x_{t'}^{f}$ and the empirical mean $\tilde{x}^{f}$ of that variable:\begin{equation*} \hat{x}_{t}^{f}=\mathrm{Mask}_{t}^{f}x_{t}^{f}+\left(1-\mathrm{Mask}_{t}^{f}\right)\left(\gamma_{x_{t}}^{f}x_{t'}^{f}+\left(1-\gamma_{x_{t}}^{f}\right)\tilde{x}^{f}\right) \tag{5}\end{equation*}
There is also a hidden-state decay, which attenuates the previous hidden state before it enters the GRU update:\begin{equation*} \hat{h}_{t-1}=\gamma_{h_{t}}\odot h_{t-1} \tag{6}\end{equation*}
Thereafter, the final hidden state of the sequence is passed to a fully connected output layer to produce the treatment response prediction.
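To make the decay mechanism concrete, the following is a minimal PyTorch sketch of a GRU-D-style cell and classifier following Eqs. (1)-(6). It is a simplified illustration under assumed tensor shapes (batch, 14 days, 71 features), not the implementation used in this study; in practice the empirical mean x_tilde would be estimated from the training data.

import torch
import torch.nn as nn

class GRUDCell(nn.Module):
    """GRU cell with input and hidden-state decay in the spirit of GRU-D (Eqs. 1-6)."""

    def __init__(self, n_feats: int, n_hidden: int):
        super().__init__()
        # The mask is fed alongside the decayed input (one way to realize the V*m_t terms of Eq. 3).
        self.gru = nn.GRUCell(2 * n_feats, n_hidden)
        self.w_gamma_x = nn.Parameter(torch.zeros(n_feats))  # per-variable input decay (Eq. 4)
        self.b_gamma_x = nn.Parameter(torch.zeros(n_feats))
        self.lin_gamma_h = nn.Linear(n_feats, n_hidden)      # hidden-state decay (Eq. 4)
        self.register_buffer("x_mean", torch.zeros(n_feats))  # empirical mean, x_tilde in Eq. 5

    def forward(self, x, mask, delta, h, x_last):
        x = torch.nan_to_num(x)  # missing entries hold placeholders; the mask governs their use
        gamma_x = torch.exp(-torch.relu(self.w_gamma_x * delta + self.b_gamma_x))
        gamma_h = torch.exp(-torch.relu(self.lin_gamma_h(delta)))
        # Input decay (Eq. 5): fall back from the last observation toward the mean.
        x_hat = mask * x + (1 - mask) * (gamma_x * x_last + (1 - gamma_x) * self.x_mean)
        # Hidden-state decay (Eq. 6), then the standard GRU update (Eq. 3).
        h = self.gru(torch.cat([x_hat, mask], dim=-1), gamma_h * h)
        x_last = mask * x + (1 - mask) * x_last  # carry the last observed value forward
        return h, x_last

class GRUDClassifier(nn.Module):
    """Unrolls the cell over the daily steps and maps the final state to a response probability."""

    def __init__(self, n_feats: int, n_hidden: int = 32):
        super().__init__()
        self.cell = GRUDCell(n_feats, n_hidden)
        self.head = nn.Linear(n_hidden, 1)

    def forward(self, x, mask, delta):
        # x, mask, delta: (batch, time, n_feats), with mask and delta following Eqs. (1) and (2)
        b, t, f = x.shape
        h = x.new_zeros(b, self.head.in_features)
        x_last = x.new_zeros(b, f)
        for step in range(t):
            h, x_last = self.cell(x[:, step], mask[:, step], delta[:, step], h, x_last)
        return torch.sigmoid(self.head(h)).squeeze(-1)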
Evaluation
A. Evaluation Settings
We included the HAMD scores of the baseline and the 2-week follow-up visit as clinical input features alongside the passive sensing features.
B. Correlation Analysis
Correlation analysis was also performed (with IBM SPSS Statistics 24) between simple statistics (mean and SD over 14 days) of the extracted features and the reduction in HAMD scores, to assess the feasibility of using simple statistics of passive sensing features to predict treatment response. The bivariate Pearson correlation analysis showed that only four of the 71 feature statistics were significantly correlated with the percentage reduction in HAMD. A positive correlation was found between the mean of the latest phone usage time and the reduction in HAMD (
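Such an analysis can equally be reproduced outside SPSS; the snippet below is a small sketch using scipy, assuming a hypothetical table feature_stats (one column per feature statistic) and a series hamd_reduction aligned by participant.

from scipy.stats import pearsonr
import pandas as pd

def correlate_with_outcome(feature_stats: pd.DataFrame, hamd_reduction: pd.Series) -> pd.DataFrame:
    """Bivariate Pearson correlation of each feature statistic with HAMD reduction."""
    rows = []
    for col in feature_stats.columns:
        r, p = pearsonr(feature_stats[col], hamd_reduction)
        rows.append({"feature": col, "r": r, "p": p})
    return pd.DataFrame(rows).sort_values("p")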
Figure 4. Correlation analysis between simple statistics of extracted features and reduction in HAMD scores: (a) mean of the latest phone usage time; (b) mean of phone usage in 0:00-6:00 am; (c) mean of entertainment app usage in the afternoon; (d) SD of phone usage in 0:00-6:00 am.
C. Evaluation Results
The evaluation results of the various machine learning and deep learning classifiers are shown in Table III. In terms of recall, F1 score, and AUROC, the sequence model based on GRU-D achieves the best performance, with a relative improvement of about 10% in all three metrics over the best baseline. We further calculated the Pearson correlation coefficient between the predicted labels of GRU-D and the actual treatment response class as an analysis of criterion validity. The results showed that the correlation coefficient (
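The reported metrics correspond to standard definitions and can be computed, for any of the evaluated classifiers, along the lines of the sketch below; y_true, y_pred, and y_score are hypothetical arrays of actual labels, thresholded predictions, and predicted probabilities.

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def summarize(y_true, y_pred, y_score) -> dict:
    """Precision, recall, and F1 on hard labels; AUROC on predicted probabilities."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_score),
    }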
D. Feature Ablation Study
To select the features most indicative of depressive recovery from longitudinal data, we conducted a feature ablation study with each feature set. Since call logs, phone usage, and app usage are all collected with smartphones, we also evaluated all smartphone sensing features together. Only the best-performing method was used. As shown in Table IV, the models using phone usage, app usage, and sleep & step features achieve slightly better results than call logs, while the combined feature set performs best. The model with all smartphone sensing features performed slightly worse than phone usage or app usage alone, which indicates that adding call logs to the smartphone sensing features may perturb the modeling rather than assist it.
E. Effects of Duration of Passive Sensing
To investigate whether longer passive sensing could lead to better prediction performance, we experimented with 3 and 4 weeks of passive sensing data. The results in Table V show that longer passive sensing does improve performance on all metrics (see Figure 5). However, there is a trade-off between earlier prediction and better performance: it is clinically instructive to predict 2 weeks in advance, but if it takes 4 weeks, the results become less interesting.
Figure 5. Performance evaluation of different passive sensing durations: (a) precision; (b) recall; (c) F1 score; (d) AUROC. (Error bands: 95% confidence intervals.)
Discussion and Broader Impacts
By correlation analysis, only the statistics of a few features were found to be correlated with recovery from MDD. This is not surprising, since simple statistics discard the temporal information believed to be crucial for treatment response prediction. Therefore, we further conducted sequence modeling while also modeling the missing patterns.
As shown in the evaluation, the improved performance of the GRU-D-based method compared with GRU indicates that the missingness in digital phenotyping is informative and that there appears to be an inherent correlation between the missing patterns and the prediction task. In addition, to investigate the usefulness of each feature set, we report results for the different feature categories separately. Given the results of the feature ablation study in Table IV, feature selection would be worth investigating to further improve performance. Among digital phenotyping studies on depressive symptom prediction [15], [48], [49], [50], Wahle et al. predicted intervention outcomes in terms of PHQ-9 values from 2 weeks of smartphone sensing data and achieved an accuracy of 59.1-60.1%, and Lu et al. achieved an F1-score as high as 0.77, although their study was performed with college students and was not clinically validated. Neuroimaging methods using MRI or EEG to predict individual-level treatment response in patients with MDD achieve a recall (sensitivity) of around 0.77 [19]; against that benchmark, our results are modest for a binary classification task with a chance level of 0.5. However, neuroimaging-based methods usually rely on small samples, which makes the models easy to fit and prone to overfitting; this is a particular concern in MRI studies, which combine small sample sizes with a large amount of data to fit [19].
Ideally, a method for treatment response prediction should be simple to perform and inexpensive. We therefore aim to achieve this by combining digital phenotyping and machine learning, which could yield predictive models and may achieve even better performance when combined with other clinical modalities.
We discussed both recurrent neural networks and classic machine learning methods in the related work, but did not cover Convolutional Neural Networks (CNNs) or Spiking Neural Networks (SNNs). These models do have the potential to excel in sequence modeling tasks [51], [52], [53]. The primary distinction between a CNN and an RNN is their ability to process temporal information: RNNs are designed for this purpose and have been shown to perform well in a broad range of sequence-related tasks [54]. SNNs, meanwhile, are more biologically plausible and have the advantage of low power consumption. Although SNNs and RNNs both involve temporal and spatial dimensions, they differ in their connection patterns and in the modeling details within each neuron unit [55]. When compared on some vision datasets [56], SNNs perform similarly to LSTM and GRU; however, their performance on sequence modeling needs further exploration. In this paper, we used a modified GRU network to model both temporal information and informative missing patterns, which conventional CNNs and SNNs are not designed to do. We believe this modified GRU network is better suited for the task.
Although the current results are not yet satisfactory, there is room for improvement in several aspects, such as deeper feature mining, direct modeling of sequential data with recurrent models, and separate modeling of time-series and non-time-series data, all of which are promising directions for improving performance. This study is an exploratory step in this line of work. We present preliminary results and hope to promote further in-depth research. We believe effective digital phenotyping could become a support tool for clinical decision-making that increases treatment efficiency while reducing ineffective treatments.
A. Practical Implications for Interventions
Developing methods that can forecast treatment success at the early stage to improve patient care and decrease the length of illness is an urgent task. The clinical importance of digital phenotyping-based social activity sensing to predict treatment response has been emphasized by the COVID-19 pandemic and the widespread use of telehealth. Learning-based predictive modeling may allow clinicians to assess the efficacy of individual treatment responses. As a result, it might help in relieving the global burden of MDD by improving treatment effectiveness.
This study adds value to the research field by showing how unobtrusive mobile sensor data can help with the diagnosis, monitoring, and understanding of depression. Although we examine the proposed method in the context of depression, it may be generalizable to any persistent and long-term mental health issue.
B. Limitations and Future Work
We acknowledge that the current study has several limitations. First, technical issues with the compatibility of the smartphone and the study app significantly affect data quality. Although we tried our best to make the app compatible with various smartphone models and various versions of the Android operating system, data loss still occurred occasionally. Second, the prevalence of third-party messaging apps has a large impact on sociability features such as call logs, since calls made through WeChat cannot be captured due to its internal security settings. Missingness in the passive data may decrease the likelihood of capturing behavior that reflects social activities. Although digital health technologies have demonstrated benefits in scientific research, these technologies, including ours, have rarely been adopted in clinical settings [57]. One possible route to adoption could be to prescribe the app to patients. However, clinicians' attitudes towards this approach are quite divided, with some expressing optimism while others question its value [58]. Despite these limitations, our study represents one of the largest digital phenotyping studies for treatment response prediction in MDD in terms of sample size and duration.
Conclusion
The primary goal of the present study was to investigate the feasibility of treatment response prediction based on digital phenotyping and machine learning. Predicting the outcome 10 weeks in advance from 2 weeks of passively collected data (against the 12-week standard treatment course) is a highly challenging task of significant research value. This paper presents a sequence modeling method based on a variant of GRU that models the time-series features and handles missing values simultaneously. Based on extensive experiments with data collected from 245 MDD participants, we demonstrate that the proposed method outperforms all baseline classifiers. The results suggest that passive sensing aids prediction performance and that the missing pattern is informative for the prediction task. Although the results of the current study fall below those of neuroimaging methods and are far from practical deployment, we believe they demonstrate the potential of predicting the treatment response of depressed individuals at an early stage, so that patients can receive proper and timely clinical interventions.