Sequence Modeling of Passive Sensing Data for Treatment Response Prediction in Major Depressive Disorder | IEEE Journals & Magazine | IEEE Xplore




Abstract:

Major depressive disorder (MDD) is a prevalent mental health condition and has become a pressing societal challenge. Early prediction of treatment response may aid the rehabilitation engineering of depression, which is of great practical significance for relieving the suffering and burden of MDD. In this paper, we present a sequence modeling approach that uses data collected by passive sensing techniques to identify patients whose treatment will be effective, with response defined by the reduction in clinician-administered scales. Hundreds of patients with MDD were recruited from outpatient clinics at 4 psychiatric sites. Each was provided with a self-developed app that passively records daily phone usage and physical data with minimal human action. An unavoidable dilemma in passive sensing is missing values. To overcome this, the proposed approach combines feature extraction and sequence modeling methods to fully utilize the pattern of missing values in longitudinal data. With no constraints on treatment, it enables us to predict the treatment response of MDD 8–10 weeks before the completion of the treatment course, leaving time for preventative measures. Our work explores the feasibility of treatment response prediction using longitudinal passive sensing data and sparse ground truth, and also has the potential to help prevent depression by forecasting treatment outcomes weeks in advance.
Page(s): 1786 - 1795
Date of Publication: 22 March 2023


PubMed ID: 37030733


SECTION I.

Introduction

Major depressive disorder (MDD) is linked to diminished social functioning, along with a variety of chronic physical conditions [1], [2]. MDD is estimated to have a lifetime prevalence ranging from 2% to 21% worldwide [3]. Despite widespread agreement that effective depression therapy is essential for a patient’s health as well as for lowering the global burden of disease, the burden has remained constant over the past few decades [4]. This is partially because treatment options are chosen through trial and error, without methods to forecast the treatment response [5]. MDD recovery is a dynamic process, and non-response is determined differently for various treatment measures [6]. A typical course of treatment for MDD may last 12 weeks or longer [7]. The delay in receiving effective treatment often translates into unnecessary personal suffering and burden. If there were methods to identify, within the first few weeks, patients who will not respond, this burden might be alleviated.

Although strong inter-individual variability can be seen during the recovery of symptoms after treatment, it is hypothesized that there are predictive patterns that are crucial for successful treatment [8]. A well-established finding in outpatients is that early treatment response (e.g., within 2 weeks) is highly predictive of the longer course [9]. A review by Kudlow et al. [10] found that people are more likely to respond to therapy later on if they show early partial symptomatic improvement.

There is increasing evidence that digital phenotyping data collected throughout daily life can be used to monitor and manage MDD [11]. Sensors built into smartphones and wearable sensors can track core parts of daily living such as social communication, movement patterns, sleep, and physical exercise. These data can be used to extract characteristics corresponding to typical daily behaviors that describe behavioral depression symptoms [12], [13], [14]. However, up until now, the majority of research on identifying and inferring depression has focused on routinely measuring depressive status (e.g., bi-weekly) [15]. One unavoidable challenge in digital phenotyping is preserving important behavioral information while processing and aggregating raw data that contain missing values [16]. Most of the previous clinical studies with digital phenotyping are correlation analysis-based studies [17], while learning-based methods are often performed with a more general population. Moreover, their research questions did not concern the prediction of treatment response.

Therefore, we hypothesize that whether a treatment is effective can be inferred from the early trend of the curative effect, and that this trend can be captured through digital phenotyping, by which the treatment response at the end of the course can be predicted weeks in advance. In this study, we utilized digital phenotyping with sequence modeling to predict treatment responses in people with MDD. Extending our earlier work, which utilized passively collected data to track and monitor mood stability in MDD, here we sought to include further sequence modeling methods with missing values and clinically relevant features for treatment response prediction. To the best of our knowledge, this is the first attempt to predict treatment response 10 weeks in advance through digital phenotyping and sequence modeling with missing values. We highlight the main contributions of our work as follows:

  • The feasibility of treatment response prediction in MDD based on passive sensing is investigated. The passive sensing data were collected over a year and constitute a considerably large sample in clinical digital phenotyping research. The results show that passive sensing data aid the prediction task;

  • A sequence modeling method based on a variant of GRU with a decay mechanism is proposed to handle missing values in passive sensing;

  • A comprehensive evaluation shows that the best performance was achieved with the proposed method in comparison to other recurrent neural networks and machine learning classifiers, which indicates that the missing pattern is informative;

  • A feature ablation study shows that the effectiveness of each feature set is comparable, but combining all features indeed improves the performance.

SECTION II.

Related Work

A. Treatment Response Prediction in MDD

As a heterogeneous illness, MDD responds differently to a wide range of treatments, thus resulting in varying degrees of success [6]. Studies using neuroimaging techniques (EEG, MRI) seem to be pointing to characteristics that lead to varied responses to pharmacotherapy and psychotherapy [18], [19], [20]. However, no biomarker has yet been identified that is accurate enough for practical application [19], [21]. Ideally, a prediction method should be simple to perform, inexpensive, and ubiquitous, properties that neuroimaging methods lack. Furthermore, Hornstein et al. evaluated the feasibility of using clinical and sociodemographic variables for treatment response prediction in depression and anxiety patients with machine learning [22]. Pedrelli et al. investigated the performance of behavioral and physiological features obtained from wearable sensors for depressive symptom severity assessment [23]. The former used only sociodemographic and clinical variables, while the latter had a relatively small sample (N = 31).

B. Machine Learning for Mental Health Through Digital Phenotyping

Depressive symptoms are shown to be associated with social activities, such as phone usage, internet usage, movement, sleep, etc. Saeb et al. examined the connections between depression severity scores and passive sensing data, including location traces and phone usage, and discovered a strong relationship between depression severity and location characteristics like variation in the sites visited and daily mobility patterns [24]. According to a study by [25], there is a direct correlation between internet usage and depression, with depressive symptoms being associated with much higher internet usage. Wang et al. found appreciable relationships between depression and sleep length, conversation duration, and frequency of co-locations [26]. Additional examination of the data revealed substantial correlations between variations in depression ratings and characteristics like sleep duration, speech duration, etc. Machine learning analysis of passively collected data has shown potential in predicting treatment responses for individual patients, which might make treatment decisions more individually tailored and boost treatment effectiveness [19]. However, missing data arise frequently in digital phenotyping involving active and passive sensing, as demonstrated by studies such as [27] and [28]. This missingness is particularly significant for patient populations [26] and may contain informative patterns. For example, empty values in call logs may result from certain participants using free calling apps like WeChat instead of traditional phone calls, and filling in missing values with averages may obscure important variations in social activities [29]. How to handle missing data remains a significant challenge; nevertheless, most previous statistical or network-based methods for digital phenotyping are either incapable of handling missing values or rely solely on simple average imputation [30].

C. Sequence Modeling With Missing Data

Digital phenotyping data are inherently sequential. Logistic regression (LR), Naïve Bayes, support vector machines (SVM), and random forest (RF) are widely used machine learning algorithms in digital phenotyping. In several applications with sequential data, it has been demonstrated that recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) [31] and the Gated Recurrent Unit (GRU) [32], achieve state-of-the-art performance. When it comes to handling missing data, the simplest approach is to omit them and perform analysis solely on the observed data. However, this does not perform well when the missing rate is high and insufficient samples remain. Data imputation is another approach, replacing the missing information with alternative values. Simple and effective techniques like smoothing, interpolation, and splines, however, do not account for correlations between variables, and they may not capture the intricate patterns needed for imputation. Recent advances in sequence prediction with missing values have been made using end-to-end deep learning algorithms [33]. Song et al., for instance, apply an attention-based strategy to healthcare applications without filling in any absent information [34]. Zhang et al. reduce the influence of uncorrelated dimensions with missing values to provide trustworthy features for prediction [35]. There has been a surge of Transformer-based solutions for time series forecasting [36], [37]; however, a recent study found that the performance demonstrated in previous work has little to do with the temporal-relation extraction ability of the Transformer architecture [38]. There are inherent constraints in the application scenarios of digital phenotyping, including small sample sets and missing data. Therefore, how to achieve robust multivariate time series modeling and forecasting with small samples and high missing rates is worth studying.
In this paper, we explore the modeling and prediction ability of classical machine learning methods commonly used in digital phenotyping as well as deep learning models. We also utilize a variant of GRU, GRU-D [39], which introduces a decay mechanism to capture temporal correlations while modeling missing patterns.

SECTION III.

Clinical Trial and Data Collection

In this section, we describe the clinical trial, data collection, and task definition in our study.

A. Participant Recruitment

The clinical trial (Trial Registration: Chinese Clinical Trial Registry ChiCTR1900021461; https://www.chictr.org.cn/showprojen.aspx?proj=36173) was designed for the management of MDD with mobile health technologies. This was a prospective multisite cohort study. The study was approved by the Independent Medical Ethics Committee of Beijing Anding Hospital (ethical approval no. 201917FS-2). In this paper, we focus on the prediction of treatment response. Participants were recruited from February 2019 to April 2020 from the outpatient clinics at 4 sites, including one psychiatric hospital and three psychiatric units in general hospitals. The participants were recruited following the eligibility criteria:

  • Outpatient, aged between 18 and 65 years, either sex; meets the diagnostic criteria for MDD in the Diagnostic and Statistical Manual of Mental Disorders, 4th edition, without any other psychotic symptoms as assessed by MINI 5.0.0;

  • Primary school education or above, able to understand the content of the questionnaire;

  • Written informed consents were obtained;

  • No history of previously diagnosed bipolar disorder, schizophrenia, schizophrenic affective disorder, and mental disorders associated with other diseases;

  • No history of alcohol or substance dependence;

  • Not suffering from any serious physical disease that would make inclusion in this study unsuitable.

B. Data Collection

In outpatient clinics, clinicians informed patients who met the inclusion criteria. The participants were informed by research assistants of the design of the app and the type of data being recorded. Each participant was then given a wristband and guided through the installation of the self-developed app on their smartphone. All participants were requested to attend 4 follow-up visits and complete the clinical assessment at 0, 2, 4, 8, and 12 weeks, which is analogous to a standard treatment course. There were no strict restrictions on their treatment: patients were treated psychopharmacologically according to the attending doctor’s choice.

A previously developed app was used to passively record patients’ daily phone usage, sleep, and step data with minimal human action; the design of the app was detailed in our previous work [40]. With users’ consent, the app runs in the background to collect phone usage data, including call logs, text message logs, app usage logs, GPS, and screen status. An accompanying digital wristband was worn to collect sleep, heart rate, and step count data. Strict confidentiality measures are implemented in the app to ensure data security and privacy protection. The app was evaluated for compatibility on more than 20 different phone models from vendors such as HUAWEI, Xiaomi, and OPPO, and had also been tested on different Android operating system versions.

C. Task Definition

In this paper, we chose treatment response prediction as our primary target, which is of great clinical importance: if treatment response can be predicted in advance, it would be beneficial for early and acute adjustment of the treatment plan. The treatment responses were classified into two categories (labels): (1) treatment responded (HAMD score decreased by 50% from the baseline [41]), (2) stable or not responded. Specifically, our goal is to predict the treatment response of a patient with depression based on sequence modeling. Since we wish to have an early estimate of each patient’s outcome, wearable data collected between the baseline and the first follow-up visit at two weeks, as well as the two clinical assessment scores, were used as the input.
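As a minimal sketch, the response label defined above could be computed as follows; the function name is our own, and treating a reduction of exactly 50% as a response is an assumption, since the text only states "decreased by 50%":

```python
def treatment_response_label(hamd_baseline: float, hamd_week12: float) -> int:
    """1 = "treatment responded" (HAMD decreased by at least 50% from baseline),
    0 = "stable or not responded".
    Counting exactly 50% as a response is an assumption."""
    reduction = (hamd_baseline - hamd_week12) / hamd_baseline
    return 1 if reduction >= 0.5 else 0
```

For example, a baseline HAMD of 24 falling to 10 at week 12 (a 58% reduction) would be labeled a responder, while a fall to 14 (42%) would not.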

D. Participant Flow

The flow of participants in this study is shown in Figure 1. At the baseline, 358 MDD patients participated in the study. We excluded participants who missed follow-up visits more than 2 times or missed the last visit at the 12^{th} week. This left data from 245 participants. Next, considering our threshold for the missing data rate, participants with at least 10 days of available passive sensing data in the 2 weeks were selected as valid.

Fig. 1. Participant flow diagram.

At the baseline assessment, we gathered a range of demographic information and clinical characteristics (detailed in Table I). All of the clinical assessments were administered by qualified clinicians. After dividing subjects into two groups, a multivariate analysis of variance (MANOVA) test was performed. The significance p-values were reported as exact values rather than as statements of inequality (p < .05), unless p < .001. Effect sizes were reported in terms of partial eta squared (\eta _{p}^{2} ), with \eta _{p}^{2} values of 0.01, 0.06, and 0.14 considered small, moderate, and large effect sizes, respectively [42]. There was little difference at the baseline between the two groups, but significant differences in depression and anxiety-related scales at the 12^{th} week visit (p < .001). The distribution of HAMD scores between the two groups is shown in Figure 2.
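For reference, partial eta squared can be recovered from a reported F statistic and its degrees of freedom; this small helper illustrates the standard conversion and is not code from the study:

```python
def partial_eta_squared(f_stat: float, df_effect: int, df_error: int) -> float:
    """eta_p^2 = SS_effect / (SS_effect + SS_error)
               = (F * df_effect) / (F * df_effect + df_error)."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)
```

By the thresholds above, an F(1, 99) of 1.0 corresponds to \eta _{p}^{2} = 0.01, i.e. a small effect.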

TABLE I Demographics and Clinical Characteristics
Fig. 2. Distribution of HAMD scores between two groups: (a) assessed at baseline, (b) assessed at the 12th week.

SECTION IV.

Data Processing and Feature Extraction

Figure 3 illustrates the overall pipeline. Features with high-level behavioral meanings were extracted from the raw data. There were 4 categories of features defined and extracted, and all features were extracted on a daily basis (detailed in Table II).

TABLE II Listing of Category, Set, Details and Number of Extracted Features
Fig. 3. Pipeline for data processing, feature extraction, and treatment response prediction.

Call logs are widely believed to be a key feature reflecting one’s status in social life. For each type of phone call (incoming, outgoing, missed, or rejected), the mean and SD of the times at which calls were made (in numerical form), the mean and SD of call duration, and the number and entropy of phone calls were extracted as features.
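A sketch of how such daily call features might be computed; the exact entropy definition used in the study is not spelled out, so computing the entropy of the call-count distribution over contacts is our assumption:

```python
import math
from collections import Counter


def daily_call_features(calls):
    """Per-day features for one call type (e.g. outgoing).
    `calls` is a list of (contact_id, duration_seconds) tuples.
    The entropy over contacts is an assumed definition."""
    n = len(calls)
    if n == 0:
        return {"count": 0, "mean_dur": 0.0, "sd_dur": 0.0, "entropy": 0.0}
    durations = [d for _, d in calls]
    mean = sum(durations) / n
    sd = (sum((d - mean) ** 2 for d in durations) / n) ** 0.5  # population SD
    counts = Counter(contact for contact, _ in calls)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)  # Shannon entropy (nats)
    return {"count": n, "mean_dur": mean, "sd_dur": sd, "entropy": entropy}
```

A day with many calls spread over many contacts yields high entropy; a day with calls to a single contact yields zero, which is one plausible proxy for the breadth of social contact.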

Phone usage was estimated using the screen on and off status as a proxy [43]. The frequency and duration of smartphone usage in a day were computed from the on and off status of the screen. In addition, the duration of phone usage in each period was calculated. The ratio of the phone usage duration in each period (6–12 pm, 12–6 pm, 6–0 am) to the whole day was calculated, as well as the earliest and latest phone usage times.
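The screen-proxy computation can be sketched by pairing ON/OFF events into usage sessions; this is a simplified illustration, and the study's actual preprocessing (e.g. handling of unmatched events across midnight) may differ:

```python
def usage_sessions(events):
    """Pair screen ON/OFF events into usage sessions.
    `events` is a chronological list of (timestamp_seconds, state),
    with state "on" or "off"; unmatched events are dropped.
    Returns the (start, end) sessions and the total usage duration."""
    sessions, on_time = [], None
    for ts, state in events:
        if state == "on":
            on_time = ts
        elif state == "off" and on_time is not None:
            sessions.append((on_time, ts))
            on_time = None
    total = sum(end - start for start, end in sessions)
    return sessions, total
```

Per-period durations and ratios then follow by clipping each session to the period boundaries and dividing by the whole-day total.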

For app usage, apps were grouped into 4 sets, namely social apps, content-providing apps, shopping apps, and entertainment apps. For each set, the total duration of usage, the total number of uses, the duration in each period, and the ratio of usage of one set of apps to all apps were calculated.

The sleep and step data were collected with wearable sensors embedded in the wristband. The duration and ratio of both light and deep sleep were calculated from the sleep data. The user’s wake-up and sleep times were also taken into account when estimating their daily routine. For step data, the total step count and the maximum step count over successive 5-minute periods were extracted.
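The step features can be illustrated from per-minute counts; using non-overlapping 5-minute windows is our reading of "every successive 5 minutes period":

```python
def step_features(minute_steps, window=5):
    """`minute_steps`: per-minute step counts over one day.
    Returns the total step count and the maximum step count over
    successive non-overlapping `window`-minute periods (an assumption)."""
    total = sum(minute_steps)
    n = len(minute_steps) // window * window  # trim to a multiple of `window`
    window_sums = [sum(minute_steps[i:i + window]) for i in range(0, n, window)]
    return total, (max(window_sums) if window_sums else 0)
```

The windowed maximum captures short bursts of activity (e.g. a brisk walk) that a daily total alone would smooth away.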

One thing to note is that text messages were not included, since they are no longer widely used in China; text messaging has been largely replaced by instant messaging apps such as WeChat. In total, the number of features is 71. These features are assumed to be potent signs of social disengagement and of decreased interest or enjoyment in nearly all activities, particularly social and professional ones [44].

A. Handling of Missing Data

Missing data can arise for various reasons. For example, data from both the smartphone and the wearable sensors may be lost for technical reasons, such as the phone or app not working, the data not being transferred promptly, or the server being down temporarily. Missing data can also arise from user-related issues, which may be associated with depressed mood. We acknowledge that these two types should be dealt with differently; however, it is hard to distinguish them technically. When a data loss occurred, all related feature items were lost. Rather than simply disregarding them, because we cannot be sure whether a value was never recorded or never existed, we attempted to encode these missing patterns in our model. The threshold of “4 days missing in two weeks” (28.57%) was determined empirically in this study; it seems reasonable while retaining enough samples for analysis. In the end, we were left with roughly 101–175 participants for each task setting; the exact numbers differ due to the availability of the data. Figure 1 gives a detailed depiction. Otherwise, we did not remove data or participants, since we could not be sure whether the missingness was due to non-semantic reasons. While missing data are almost unavoidable in digital phenotyping for the above reasons, modeling without manual data removal helps assure the robustness of a model.
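The availability threshold described above (at most 4 missing days in the first 2 weeks) could be applied as in this sketch, with a hypothetical day-level availability mask per participant:

```python
def valid_participants(availability, max_missing=4, window_days=14):
    """`availability` maps participant id -> list of per-day booleans,
    True when any passive sensing data arrived that day.
    Keep participants with at most `max_missing` missing days in the
    first `window_days` days (4/14 = 28.57% in the study)."""
    kept = {}
    for pid, days in availability.items():
        window = days[:window_days]
        if window_days - sum(window) <= max_missing:
            kept[pid] = days
    return kept
```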

B. Sequence Modeling for Digital Phenotyping

Besides machine learning methods commonly used in digital phenotyping (e.g., SVM, LR, and RF) and influential deep learning models (e.g., LSTM, GRU), a variant of GRU, namely GRU-D [39], was also adopted for sequence modeling. GRU-D introduces decay rates into the conventional GRU to control the decay mechanism. To utilize informative missing patterns, the decay rates vary from variable to variable, depend on the underlying characteristics associated with the features, and are learnable from the training data. Two representations of informative missingness patterns are considered as follows: \begin{align*} \mathrm{Mask}_{t}^{f}=\begin{cases} 0, & \text{if } x_{t}^{f} \text{ is missing}\\ 1, & \text{otherwise}\end{cases} \tag{1}\end{align*}

where \mathrm{Mask}_{t}^{f} is a masking vector denoting whether the variable x of feature f is missing at day t . The other representation is the time interval, which is characterized daily. The equation below models the time interval \delta _{t}^{f} for each feature f : \begin{align*} \delta _{t}^{f}=\begin{cases} 1+\delta _{t-1}^{f}, & t>1,\ \mathrm{Mask}_{t-1}^{f}=0\\ 1, & t>1,\ \mathrm{Mask}_{t-1}^{f}=1\\ 0, & t=0\end{cases} \tag{2}\end{align*}

The modified GRU functions are: \begin{align*} \begin{cases} r_{t}=\sigma \left ({W_{r}\hat {x}_{t}+U_{r}\hat {h}_{t-1}+V_{r}m_{t}+b_{r} }\right) \\ z_{t}=\sigma \left ({W_{z}\hat {x}_{t}+U_{z}\hat {h}_{t-1}+V_{z}m_{t}+b_{z} }\right) \\ \tilde {h}_{t}=\tanh \left ({W\hat {x}_{t}+U\left ({r_{t}\odot \hat {h}_{t-1} }\right)+Vm_{t}+b }\right)\\ h_{t}=\left ({1-z_{t} }\right)\odot \hat {h}_{t-1}+z_{t}\odot \tilde {h}_{t}\end{cases} \tag{3}\end{align*}
which has a reset gate r_{t} and an update gate z_{t} to control the hidden state for each hidden unit, where W_{r} , W_{z} , W , U_{r} , U_{z} , U , V_{r} , V_{z} , V and vectors b_{r} , b_{z} , b are model parameters, \sigma denotes the element-wise sigmoid function, and \odot denotes element-wise multiplication.

A decay rate \gamma _{t} was introduced to decay the input over time toward the empirical mean: if a variable’s previous observation occurred a while ago, its value should be close to some default value, a property that holds in digital phenotyping. The decay rate is defined as: \begin{equation*} \gamma _{t}=\exp \left \{{-\max \left ({0,W_{\gamma }\delta _{t}+b_{\gamma } }\right) }\right \} \tag{4}\end{equation*}
where W_{\gamma } and b_{\gamma } are model parameters trained jointly with all the other parameters. The decayed input is then: \begin{equation*} \hat {x}^{f}_{t}=\mathrm{Mask}_{t}^{f}x_{t}^{f}+\left ({1-\mathrm{Mask}_{t}^{f} }\right)\left ({\gamma ^{f}_{x_{t}}x_{t^{\prime }}^{f}+\left ({1-\gamma ^{f}_{x_{t}} }\right)\tilde {x}^{f} }\right) \tag{5}\end{equation*}
where x_{t^{\prime }}^{f} is the last observation of the f\text{th} feature (t^{\prime } < t ) and \tilde {x}^{f} is its empirical mean.

There is also a hidden state decay \gamma _{h} to capture richer knowledge from missingness, implemented by multiplying the previous hidden state h_{t-1} to form the new hidden state: \begin{equation*} \hat {h}_{t-1}=\gamma _{h_{t}}\odot h_{t-1} \tag{6}\end{equation*}

Thereafter, the \hat {x}_{t} and \hat {h}_{t-1} in equation (3) are given by equations (5) and (6).
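To make Eqs. (1), (2), (4), and (5) concrete, the following NumPy sketch traces the input-decay path of GRU-D for a single feature; w_gamma and b_gamma stand in for the learned parameters, and falling back to the empirical mean before any observation exists is our assumption:

```python
import numpy as np


def decayed_inputs(x, w_gamma, b_gamma):
    """Input decay of GRU-D for one feature sequence `x`
    (1-D, with np.nan marking missing days)."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    mask = (~np.isnan(x)).astype(float)          # Eq. (1)
    delta = np.zeros(T)
    for t in range(1, T):                        # Eq. (2), daily steps
        delta[t] = 1.0 if mask[t - 1] == 1 else 1.0 + delta[t - 1]
    x_mean = np.nanmean(x)                       # empirical mean of observed values
    last = x_mean                                # fallback before first observation
    x_hat = np.zeros(T)
    for t in range(T):
        gamma = np.exp(-max(0.0, w_gamma * delta[t] + b_gamma))   # Eq. (4)
        # Eq. (5): keep observed values; decay missing ones from the last
        # observation toward the empirical mean.
        x_hat[t] = mask[t] * np.nan_to_num(x[t]) + (1 - mask[t]) * (
            gamma * last + (1 - gamma) * x_mean)
        if mask[t] == 1:
            last = x[t]
    return mask, delta, x_hat
```

With a long gap, gamma shrinks toward 0 and the imputed value approaches the empirical mean, which matches the intuition stated above.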

SECTION V.

Evaluation

A. Evaluation Settings

We included the HAMD scores of the baseline and the 2^{nd} -week follow-up visit as features in our models, since they are available under our task setting. All input variables were Z-score normalized. LR, Naïve Bayes, SVM, and RF were adopted as baseline classifiers, all implemented with default parameters using the Scikit-Learn toolkit [45]. Furthermore, a linear fit of the first two HAMD scores (baseline and the 2^{nd} -week visit) was computed and used as a trivial baseline to predict the outcome at the 12^{th} week. One-layer RNNs were used in the GRUs and LSTM, with a soft-max layer after the last hidden state to perform the classification (prediction). The number of hidden units in all RNN models was set to 72. Batch normalization and a dropout rate of 0.5 were deployed in the classification layer. All the deep learning models were trained with Adam optimization [46] with a learning rate of 0.001 and a batch size of 16, and early stopping was used to determine the best weights. All deep models were implemented with the Keras and PyPOTS libraries [47] in Python. To enable reproducibility, the core code of the evaluated methods has been made publicly available on GitHub.1 As real-time requirements were not necessary, the hardware requirements for data analysis are not demanding; the experiments were conducted using an 11^{th} Gen Intel Core i7 CPU and an Nvidia RTX 3070 Ti GPU. We report the mean of stratified 10-fold cross-validation, which is commonly adopted in treatment prediction tasks [48], in terms of precision, recall, F1 score, and area under the ROC curve (AUROC).
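For clarity on the reported metrics, here is a self-contained sketch computing them directly from predicted scores (the study itself would have used standard library implementations); AUROC is computed via the rank (Mann-Whitney) formulation:

```python
import numpy as np


def binary_metrics(y_true, y_score, threshold=0.5):
    """Precision, recall, F1, and AUROC for a binary task."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    y_pred = (y_score >= threshold).astype(float)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # AUROC: probability a random positive is scored above a random negative.
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    auroc = (greater + 0.5 * ties) / (len(pos) * len(neg))
    return precision, recall, f1, auroc
```

In stratified 10-fold cross-validation these metrics are computed per fold and averaged.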

B. Correlation Analysis

Correlation analysis was also performed (with IBM SPSS Statistics 24) between simple statistics (mean and SD over 14 days) of the extracted features and the reduction in HAMD scores, to assess the feasibility of using simple statistics of passive sensing features to predict treatment response. The bivariate Pearson correlation analysis shows that, among all feature statistics, only four (4 out of 71) were significantly correlated with the percentage reduction in HAMD. A positive correlation was found between the mean of the latest phone usage time and the reduction in HAMD (r(208) = .136, p = .049). Negative correlations were found for the mean (r(162) = −.214, p = .006) and SD (r(162) = −.174, p = .026) of phone usage in 0:00–6:00 am, as well as the mean of entertainment app usage in the afternoon (r(127) = −.183, p = .038). These results are reasonable, since the latest phone usage time and phone usage in 0:00–6:00 am may indicate insomnia and difficulty falling asleep, which are common depression symptoms. As shown in Figure 4, although some of the correlations reach the significance level of 0.05, the data points are rather scattered. There is a weak tendency for these features describing social activity to be correlated with recovery from MDD; however, these simple statistics would not be able to serve as markers for treatment response prediction. This is not surprising, since the simple statistical operations lose the temporal (tendency) information that is believed to be crucial for treatment response prediction. This is the intuitive motivation for sequence modeling, which may aid the prediction with improved performance on these passively collected data.
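The Pearson r values above follow the standard sample formula; a minimal reference implementation (not the SPSS procedure used in the study):

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation between paired sequences,
    computed directly from the definition r = cov(x, y) / (s_x * s_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

The degrees of freedom reported in parentheses, e.g. r(208), equal n − 2 for the corresponding pairwise-complete sample.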

Fig. 4. Correlation analysis between simple statistics of extracted features and reduction in HAMD scores: (a) Mean of the latest phone usage time; (b) Mean of phone usage in 0:00–6:00 am; (c) Mean of entertainment apps usage in the afternoon; (d) SD of phone usage in 0:00–6:00 am.

C. Evaluation Results

The evaluation results of the various machine learning and deep learning classifiers are shown in Table III. In terms of recall, F1 score, and AUROC, the sequence model based on GRU-D achieves the best performance, with a relative improvement of about 10% in all three metrics over the best of the baselines. We further calculated the Pearson correlation coefficient between the labels predicted by GRU-D and the actual treatment response classes as an analysis of criterion validity. The correlation coefficient (r(111) = .220, p = .019) reached a significant level, which means the established model had high criterion validity. This suggests that the proposed sequence modeling method can generate representations that capture the missing patterns, which helps the prediction of treatment response with digital phenotyping. As to precision, the trivial baseline obtains a value of 0.71, but both its recall and F1 are quite low, so such a straightforward indicator for treatment response prediction seems infeasible. Besides GRU-D, the other two recurrent models, LSTM and GRU, performed comparably to the machine learning classifiers, which is reasonable considering the small sample size. In comparison with the conventional GRU, we attribute the gain to the decay mechanism for missing-pattern modeling.

TABLE III Performance Comparison of Different Machine Learning and Deep Learning Classifiers

D. Feature Ablation Study

To select meaningful features indicative of depressive recovery from longitudinal data, we conducted a feature ablation study with each feature set. Since call logs, phone usage, and app usage are all collected with smartphones, we also evaluated all smartphone sensing features together. Only the best-performing method was used. As shown in Table IV, the models that use phone usage, app usage, and sleep & step features achieve slightly better results than call logs, while the combined feature set performs best. The model with all smartphone sensing features performed slightly worse than phone usage and app usage alone, which indicates that the addition of call logs to the smartphone sensing features may perturb the modeling capacity rather than assist it.

TABLE IV Performance Comparisons With Different Feature Sets

E. Effects of Duration of Passive Sensing

To investigate whether longer passive sensing could lead to better prediction performance, we experimented with 3 or 4 weeks of passive sensing data. The results in Table V show that longer passive sensing does improve performance on all metrics (see Figure 5). However, there is a trade-off between earlier prediction and better performance: it is clinically instructive to predict 2 weeks in advance, but if it takes 4 weeks, the results become less interesting.

TABLE V Performance With Different Lengths of Available Passive Sensing Weeks
Fig. 5. Performance evaluation of different passive sensing durations: (a) Precision; (b) Recall; (c) F1 score; (d) AUROC. (Error bands: 95% confidence interval).

SECTION VI.

Discussion and Broader Impacts

In the correlation analysis, only the statistics of a few features were found to correlate with recovery from MDD. This is not surprising, since simple statistical summaries discard the temporal information that we believe is crucial for treatment response prediction. Therefore, we further conducted sequence modeling while also modeling the missing patterns.
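A correlation screening of this kind can be sketched with a plain Pearson coefficient over per-patient feature statistics. This is an illustrative NumPy-only version; the threshold `r_min` is an assumption, and in practice the screening would be paired with a significance test, as in the paper:

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between a per-patient feature statistic and
    the treatment-response outcome."""
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def screen_features(stats, y, r_min=0.2):
    """stats: dict mapping feature-statistic name -> per-patient values.
    Returns the features whose correlation with the outcome y exceeds
    r_min in magnitude (significance testing not shown)."""
    return {name: pearson_r(values, y) for name, values in stats.items()
            if abs(pearson_r(values, y)) >= r_min}
```

Because each statistic collapses a patient's whole time series to a single number, even a feature with a strong temporal signature can screen out here, which motivates the sequence modeling approach.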

As shown in the evaluation, the improved performance of the GRU-D-based method over the GRU indicates that the missingness in digital phenotyping is informative and that there appears to be an inherent correlation between the missing patterns and the prediction task. In addition, to investigate the usefulness of each feature set, we reported results for each feature category separately. Considering the results of the feature ablation study in Table IV, feature selection would be worth investigating to further improve performance. Among digital phenotyping based studies on depressive symptom prediction [15], [48], [49], [50], Wahle et al. predicted intervention outcome in PHQ-9 values from 2 weeks of smartphone sensing data with an accuracy of 59.1-60.1%. Lu et al. achieved an F1 score as high as 0.77; however, their study was performed with college students and was not clinically validated. Compared with neuroimaging methods that use MRI or EEG to predict individual-level treatment response in patients with MDD and achieve a recall (sensitivity) of around 0.77 [19], our results are not impressive for a binary classification task with a chance level of 0.5. However, neuroimaging-based methods usually involve small samples, which makes them easy to fit and even prone to overfitting; this is a particular concern in MRI studies, which combine small sample sizes with high-dimensional data [19].

Any practical method for treatment response prediction should ideally be simple to perform and inexpensive. We therefore aim to achieve this by combining digital phenotyping and machine learning, which yields predictive models that may perform even better when combined with other clinical modalities.

We discussed both recurrent neural networks and classic machine learning methods in the related work, but did not cover Convolutional Neural Networks (CNNs) or Spiking Neural Networks (SNNs). Indeed, these models have the potential to excel in sequence modeling tasks [51], [52], [53]. The primary distinction between a CNN and an RNN is their ability to process temporal information. RNNs are designed for this specific purpose and have been shown to perform well in a broad range of sequence-related tasks [54]. Furthermore, SNNs are more biologically plausible and have the advantage of low power consumption. Although SNNs and RNNs both involve temporal and spatial dimensions, they differ in their connection patterns and modeling details within each neuron unit [55]. When compared on some vision datasets [56], SNNs perform similarly to LSTM and GRU. However, the performance of both models on sequence modeling needs further exploration. In this paper, we used a modified GRU network to model both temporal information and informative missing patterns, which conventional CNNs and SNNs are incapable of doing. We believe this specifically modified GRU network is better suited for this task.

Although the current results are not yet satisfactory, this work can be improved in several aspects, such as in-depth feature mining, direct modeling of sequential data with recurrent models, and separate modeling of time series and non-time-series data, all of which are promising directions for improving performance. This study is an exploratory step; we present preliminary results and hope to promote further in-depth research. We believe that effective digital phenotyping could become a support tool for clinical decision-making, increasing treatment efficiency while reducing ineffective treatments.

A. Practical Implications for Interventions

Developing methods that can forecast treatment success at an early stage, to improve patient care and decrease the length of illness, is an urgent task. The clinical importance of digital phenotyping based social activity sensing for predicting treatment response has been underscored by the COVID-19 pandemic and the widespread use of telehealth. Learning-based predictive modeling may allow clinicians to assess the likely treatment response of individual patients. As a result, it might help relieve the global burden of MDD by improving treatment effectiveness.

This study adds value to the research field by showing how unobtrusive mobile sensor data can help with the diagnosis, monitoring, and understanding of depression. Although we examine the proposed method in the context of depression, it may generalize to any persistent, long-term mental health issue.

B. Limitations and Future Work

We acknowledge that the current study has several limitations. First, technical issues with the compatibility between smartphones and the study app significantly affect data quality. Although we tried our best to make the app compatible with various smartphone models and versions of the Android operating system, data loss still occurs occasionally. Second, the prevalence of third-party messaging apps has a large impact on sociability features such as call logs, since phone calls made through WeChat cannot be captured due to its internal security settings. Missingness in the passive data may decrease the likelihood of capturing behavior that reflects social activity. Although digital health technologies have demonstrated benefits in scientific research, these technologies, including ours, have rarely been adopted in clinical settings [57]. One possible route to adoption is to prescribe the app to patients; however, clinicians' attitudes towards this approach are divided, with some expressing optimism while others question its value [58]. Despite these limitations, our study represents one of the largest digital phenotyping studies for treatment response prediction in MDD in terms of sample size and duration.

SECTION VII.

Conclusion

The primary goal of the present study is to investigate the feasibility of treatment response prediction based on digital phenotyping and machine learning. Predicting the outcome 10 weeks in advance from 2 weeks of passively collected data (against a 12-week standard treatment course) is a highly challenging task of significant research value. This paper presents a sequence modeling method based on a variant of the GRU that models time series features and handles missing values simultaneously. In extensive experiments on data collected from 245 participants with MDD, we demonstrate that the proposed method outperforms all baseline classifiers. The results suggest that passive sensing aids prediction and that the missing pattern is informative for the prediction task. Although the performance of the current approach is lower than that of neuroimaging methods and still far from practical use, we believe in its potential for predicting the treatment response of depressed individuals at an early stage, so that patients can receive proper and timely clinical interventions.
