Predicting Depression in Canada by Automatic Filling of Beck’s Depression Inventory Questionnaire

The risk for depression and anxiety increased as people adjusted to a new normal after the COVID-19 pandemic. Early detection and appropriate onset treatment and support can reduce the consequences of depression. Automatic detection of depression in social media has recently become an important area of investigation. However, because of the lack of extensive annotated data, we propose a method for using a model that learns to answer a depression questionnaire and apply it to make population-level predictions. We used the eRisk 2021 Task 3 training dataset to build an automated model to fill the Beck’s Depression Inventory (BDI) questionnaire. We selected the best performing model for each group of questions based on predefined metrics and consolidated those models into one model (called the <inline-formula> <tex-math notation="LaTeX">$BDI\_{}Multi\_{}Model$ </tex-math></inline-formula>). The <inline-formula> <tex-math notation="LaTeX">$BDI\_{}Multi\_{}Model$ </tex-math></inline-formula> achieved better performance than the state-of-the-art for this challenging task. Then, we used this model for inference on a Canadian population dataset and compared its predictions with the statistics of the most recent mental health survey conducted by Statistics Canada. The correlation between the inference of the answered questionnaire based on our <inline-formula> <tex-math notation="LaTeX">$BDI\_{}Multi\_{}Model$ </tex-math></inline-formula> and the official statistics showed a strong Pearson correlation of 0.90.

Recognizing persons suffering from depression and assisting those in need is a critical step toward building a better living environment. However, the process of identifying those who have a mental illness is a difficult task.

Various psychiatric scales are used to assess individuals' mental health. For example, researchers may use PHQ-9 (Patient Health Questionnaire) to quantify depressive symptoms. It is a commonly used tool for diagnosing and measuring the severity of depression, and it assesses behavioral characteristics, self-harm, and suicidal thoughts. Another option is the Beck Depression Inventory (BDI) questionnaire, developed by [1]. It is one of the most commonly used tools for estimating the severity of depression. It is a self-reported inventory of 21 multiple-choice questions that are grounded in the patient's thoughts rather than psychodynamic perspectives. Even though these measures are well-established psychiatric instruments, choosing the scale that will be most accurate for a given demographic sample and using it is a complicated task.

Meanwhile, the use of social media has significantly increased over the past few years. As a result, it has attracted many researchers to analyze its contents in different fields and attempt to predict mental health problems within a particular population. In this article, we aim to build a risk-mitigation tool that allows professionals and decision makers such as the Public Health Agency of Canada (PHAC) to detect the level

section IV for details about the evaluation measures).

The associate editor coordinating the review of this manuscript and approving it for publication was Amin Zehtabian.

1 https://www.who.int/news-room/fact-sheets/detail/depression

VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

The main contributions of this article are as follows:

In Section VI we present our population-level experiments and discuss the results. Finally, conclusions and future work are discussed in Section VII.

Depression has gained significant interest from researchers due to its effect on human beings and society. Current research shows essential associations between an individual's mental health and the linguistic content they share on social media. Recent advances in applying Natural Language Processing and other Machine Learning techniques to social media to address mental health are found in the following surveys [2], [3], [4], [5], [6], [7].

Analysis of social media for predicting mental disorders can be done on a post-level basis using explicit or implicit attributes of the post [8], [9], [10], at the user level by aggregating multiple posts as a single document or analyzing behavioral changes over time [9], [11], [12], [13], [14], [15], [16], or finally at the population level. For example, [17] developed a probabilistic model to detect the behavioral changes associated with the onset of depression. In contrast, [18] achieved an AUC score of 0.88 by training a random forest model using an estimated weight of psychological factors such as stress, depression, anxiety, hopelessness, loneliness, burdensomeness, insomnia, and sentiment polarity to predict suicide ideation in tweets.

On the population level, [19] represented the US counties as graph interactions between Linguistic Inquiry and Word Count (LIWC) features, then trained several graph neural networks (a graph convolutional network, a graph attention network, a hybrid network, and a graph isomorphism network) to learn the population health representation, and finally used logistic regression (LR) to estimate the health indices of 3,221 counties. A significant correlation was observed with six health measures, and models with linguistically analyzed Twitter data improved predictive accuracy for 20 community health measures. Note that this work is not about mental health.

Based on data from Reddit users who changed from a mental health condition to suicide ideation, [20] developed a statistical approach relying on three cognitive psychological integrative theories of suicide, including thinking, ambivalence, and decision making; they identified markers to detect this changeover episode.

From the Chinese Longitudinal Healthy Longevity Study (CLHLS) survey, [21] chose 1,538 senior persons. Six machine learning models, including a deep neural network (DNN), a gradient boosted decision tree (GBDT), SVM, and LR with lasso regularization, were used along with a multivariate long short-term memory (LSTM) network. Different depression risk indicators and the risk of depression in the older population have been studied using this LSTM.

The work most related to our research is eRisk 2021 Task 3, which is concerned with measuring the severity of depression signs. The dataset is explained in detail in Section III. The task is a continuation of Task 3 at eRisk 2019 and Task 2 at eRisk 2020, and its objective is to automatically estimate

These findings support the task's potential to extract specific depression-related data from social media behavior automatically. However, there is still room for improvement in the generalization process to advance toward a more comprehensive and adequate depression screening tool. The multi-model we will present in Section V is able to achieve higher scores, advancing the state of the art.

The datasets used in this research are listed in Table 1. We utilize the eRisk dataset (R1), explained in Section III-A, to train a machine learning model to answer the BDI depres-

A. eRisk DATASET
eRisk is an initiative to explore issues of evaluation methodologies, performance metrics, and other aspects related to building test collections and defining challenges for early risk detection related to health and safety. The dataset used in this research is based on the eRisk 2021 Task 3 (Measuring the severity of the signs of depression). The task is a continuation of Task 3 at eRisk 2019 and Task 2 at eRisk 2020, and it is the last in this series. Its objective is to automatically estimate a user's degree of depression by building machine learning models to answer a standard depression questionnaire (BDI) using the users' social media postings.

The dataset includes 170 social media users who have filled in the BDI questionnaire and voluntarily provided the reference to their Reddit forum posts. The history of their writings was extracted right after the user filled in the questionnaire.

The questionnaire contains 21 questions (see Appendix VII for the complete BDI questionnaire [1]) that assess the existence of depression signs such as sadness, pessimism, fatigue, and so on. Each question has four possible responses (0, 1, 2, 3) except for question 16

Figure 3 displays the distribution of the answers for BDI questions in the training data. When making these counts, branches (a) and (b) for questions 16 and 18 were grouped together because they all contribute the same number of points when determining a user's depression level.
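The grouping of branches (a) and (b) can be illustrated with a minimal scoring sketch. It assumes answers are stored as strings such as `"2"` or `"2b"`; that encoding and the helper names `answer_points` and `bdi_total` are illustrative assumptions, not taken from the dataset description above.

```python
import re


def answer_points(answer: str) -> int:
    """Map one BDI answer such as '2' or '2b' to its point value (0-3).

    Assumption: branch letters 'a' and 'b' only distinguish the direction
    of a change (questions 16 and 18) and carry the same number of points.
    """
    match = re.fullmatch(r"([0-3])([ab]?)", answer.strip().lower())
    if match is None:
        raise ValueError(f"unexpected BDI answer: {answer!r}")
    return int(match.group(1))


def bdi_total(answers):
    """Sum the points of the 21 answers into an overall BDI score (0-63)."""
    if len(answers) != 21:
        raise ValueError("the BDI questionnaire has 21 questions")
    return sum(answer_points(a) for a in answers)
```

Under this encoding, `"1a"` and `"1b"` both contribute one point, which is why the two branches can share a single bar when answer counts are plotted.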

The following preprocessing rules are applied to R1 Reddit posts:

probability distribution is adjusted by comparing the first name with Canada's birth records and the life tables that contain life expectancy and associated age and sex projections for Canada [26], [27].

The age and sex probability distribution is deduced for each user in 12 fields as follows:

The differences between the probabilities of each category vary. Thus, we decided to keep users with high confidence for both Age and Sex prediction based on the following rules:

• Sex: We assign to the user the sex of the maximum sex probability of all age groups with a probability more than 92.5%, using the following equation:

• Age: We assign to the user the age group of the maximum age group probability (P(α)), given that the difference between the largest and the second largest is greater than a threshold, where the threshold equals 2 * P(α_j), using the following steps:

A tweet is a short status update posted by the user with a limit of 140 characters, which doubled to 280 in 2017.

The P1 dataset is a subset of the ASI dataset with the following conditions:

• Each user must have a location mapped to a Canadian province/territory.
• Each user must have age or sex prediction with the minimum defined confidence.
• Each user must have a minimum of 5 posts.
• Each post must be at least 32 characters in length.
• The posts need to have timestamps during 2015.
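The conditions above can be sketched as a simple filter. The `User` and `Post` records, their field names, and the function `build_p1` are hypothetical; the sketch also assumes that the five-post minimum is checked after the post-level conditions (length and year) have been applied, which the text does not specify.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class User:
    user_id: str
    province: Optional[str]          # None when no Canadian province/territory was mapped
    has_confident_age_or_sex: bool   # outcome of the age/sex confidence rules


@dataclass
class Post:
    user_id: str
    text: str
    created_at: datetime


def build_p1(users, posts, min_posts=5, min_chars=32, year=2015):
    """Return {user_id: [posts]} for users and posts meeting all conditions."""
    eligible = {
        u.user_id
        for u in users
        if u.province is not None and u.has_confident_age_or_sex
    }
    kept = {}
    for post in posts:
        if (
            post.user_id in eligible
            and len(post.text) >= min_chars
            and post.created_at.year == year
        ):
            kept.setdefault(post.user_id, []).append(post)
    # Assumption: the minimum-posts rule counts only the posts that survived.
    return {uid: ps for uid, ps in kept.items() if len(ps) >= min_posts}
```

Applying the post-level checks first is the stricter reading of the rules; checking the five-post minimum on raw post counts would keep more users.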

After applying the above conditions, the number of posts decreased from 9,304,441 to 2,582,912 tweets and the number of users from 278,627 to 15,982.

We used four evaluation metrics on the R1 dataset level: the ones from the shared task explained in the task overview

questionnaire represent the well-established depression categories in psychology, the Closeness Rate (CR) computes the deviation, called the absolute difference (AD), between the real and the automated answers. The absolute difference is transformed into an effectiveness score as follows, where MAD is the maximum absolute difference, which is equal to the number of possible answers minus one:

$CR = \frac{MAD - AD}{MAD}$

The depression class distribution in the training dataset is displayed in Figure 3.

The association between the prediction output (PD) and the CCHS data is calculated using the Pearson correlation coefficient (ρ) based on the following equation:

$\rho = \frac{\mathrm{cov}(PD, CCHS)}{\sigma_{PD}\,\sigma_{CCHS}}$
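As a minimal sketch of how the association between model predictions and survey statistics could be computed, the function below implements the standard sample Pearson coefficient. The function name and the example values are hypothetical and are not results from this study.

```python
import math


def pearson(xs, ys):
    """Standard Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-group rates (percent), for illustration only.
predicted = [9.1, 10.4, 8.7, 11.2]
survey = [6.8, 7.9, 6.5, 8.1]
rho = pearson(predicted, survey)
```

Because the coefficient is invariant to scale and offset, it measures whether predicted and official rates rise and fall together across groups, not whether their absolute levels agree.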

IV. EVALUATION METRICS
The correlation coefficient ranges from −1 to 1. The higher

We employed three methods for filtering posts: topic-based, similarity-based, and a hybrid approach, as follows:

(4)
where x is the similarity score of question i and µ is the mean

Furthermore, it should be noted that not all the categories are discussed in the posts and some categories appear more often than others. If there are a limited number of posts for a specific user, we would consider all the posts for the learning process (all the posts are included in the top n posts). Table 7 shows the number of posts for each BDI question as per RoBERTa similarity with θ1 = 0.6. It shows that the posts related to eating and sleeping habits are significantly less common, and posts relating to guilt and punishment feelings are the most frequent.
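A minimal sketch of similarity-based selection for one BDI question follows. It assumes post and question embeddings are already computed (for example by a sentence encoder) and compared by cosine similarity. How the threshold and the top-n cut-off are combined is an assumption here, as are the names `select_posts`, `theta`, and `top_n`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_posts(post_vecs, question_vec, theta=0.6, top_n=10):
    """Return indices of the posts kept for one question.

    Users with at most top_n posts keep every post, mirroring the rule that
    all posts are used when only a limited number is available.
    """
    if len(post_vecs) <= top_n:
        return list(range(len(post_vecs)))
    scored = [(cosine(vec, question_vec), idx) for idx, vec in enumerate(post_vecs)]
    # Assumption: keep posts at or above the threshold, most similar first.
    above = sorted((pair for pair in scored if pair[0] >= theta), reverse=True)
    return [idx for _, idx in above[:top_n]]
```

With a fixed threshold, rarely discussed topics such as eating and sleeping habits yield few selected posts, which is consistent with the imbalance reported above.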

The Hybrid approach uses a combination of topic and similarity-based approaches based on different sentence transformer models and topic modeling. We used the all-mpnet-base-v2 sentence transformer, which was developed by HuggingFace.

on the sentence level is used with an attention mechanism to aggregate the most significant sentences to form the user-category vector, which is then passed on to a dense layer for text classification using softmax activation as shown in Figure 10.

HAN employs two levels of attention mechanisms, at the word and sentence levels. First, a word attention mechanism is utilized to identify keywords and then aggregate them to create a sentence vector. Then a sentence attention mechanism is used to emphasize the importance of a sentence.

applied to tune the model as explained in Table 4.
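The two attention levels can be illustrated with a small, dependency-free sketch of attention pooling in the usual hierarchical-attention formulation: score each vector against a learned context vector, normalize with softmax, and take the weighted sum. The recurrent encoders that normally precede each attention layer are omitted, and all parameter names and shapes are illustrative assumptions, not the configuration used in this work.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, weight, bias, context):
    """Pool vectors into one: u_t = tanh(W h_t + b), a = softmax(u_t . c)."""
    scores = []
    for h in vectors:
        u = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(weight, bias)
        ]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    alphas = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, vectors)) for d in range(dim)]
    return pooled, alphas


def han_encode(document, word_params, sentence_params):
    """document: list of sentences, each a list of word vectors."""
    # Word-level attention turns each sentence into one sentence vector.
    sentence_vecs = [attention_pool(words, *word_params)[0] for words in document]
    # Sentence-level attention turns the sentence vectors into one document vector.
    return attention_pool(sentence_vecs, *sentence_params)[0]
```

In a trained model the weights and context vectors are learned, so the attention weights concentrate on the words and sentences most indicative of the answer; the pooled vector would then feed the dense softmax classifier.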

As explained earlier, we adapted different deep learning models to learn the answers to each question in the BDI questionnaire based on the related posts. Figure 4 shows an example

Table 5 shows the results of the selected models Model i:

The BDI_Multi_Model is formed after several iterations of the above-mentioned deep learning models, ending up with five hierarchical attention networks (HAN), three LSTM models, and finally, two transformers. The models and the parameters were set based on the accuracy for each question, ending up with ten different versions. The ten models' parameters are included in Appendix VII. Although there is not much improvement in the ACR metric, the performance of BDI_Multi_Model exceeded the latest best model [22], using the same training and test dataset, in the following metrics: AHR, ADL, and DCHR, with a difference of more than 7% over that model. In addition, the ADL metric used to predict the depression level exceeded 84%, considering that ADL is the most critical metric for measuring depression at the population level.
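For reference, the per-user quantities behind these averages can be sketched as below, following the eRisk task definitions as commonly stated. The sketch assumes answers are already converted to integer points (so branches a and b are treated as equal) and that ADL denotes the averaged difference between overall depression levels; the function names are illustrative.

```python
def hit_rate(true_answers, predicted_answers):
    """Fraction of questions answered exactly as the user did."""
    hits = sum(t == p for t, p in zip(true_answers, predicted_answers))
    return hits / len(true_answers)


def closeness_rate(true_answers, predicted_answers, n_options=4):
    """Mean of (MAD - AD) / MAD, where MAD = number of options - 1."""
    mad = n_options - 1
    rates = [(mad - abs(t - p)) / mad for t, p in zip(true_answers, predicted_answers)]
    return sum(rates) / len(rates)


def overall_level_closeness(true_answers, predicted_answers, max_total=63):
    """(max_total - |true total - predicted total|) / max_total."""
    diff = abs(sum(true_answers) - sum(predicted_answers))
    return (max_total - diff) / max_total
```

Averaging each function over all test users gives the corresponding task-level score. The overall-level score depends only on the summed points, so opposite errors on different questions can cancel out, which is one reason it can be high even when the hit rate is modest.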

The model's performance is enhanced in small steps due to the lack of an adequate dataset for deep learning model training. The training dataset contains only 90 users, and the total number of training posts is less than 50,000, which is relatively small for a deep learning model. The quality of the posts can be enhanced by having experts filter the indicative posts. This labeling would help train a model to filter the data based on the extracted features, which may help enhance the post filtering process and the classifiers' accuracies.

In this research, we use a bottom-up technique for population-level detection [27], starting with individual models, then

The overall process is illustrated in Figure 12, and it can be summarized as follows: Starting with Reddit posts preprocessing and BDI questionnaire responses, each post is examined against the BDI questionnaire's topics and tagged

The inference of BDI_Multi_Model estimated that 9.75% of the sample is classified as depressed (|P1|), whereas 54% are classified as non-depressed or with minimal depression, as opposed to the 7.1% estimated official prevalence rate for depression in 2015. Table 6

Similarly, Table 6 shows the distribution of age demographics within 7 of the Canadian provinces and the NT territory, and the estimated depression on the P1 dataset.
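The bottom-up aggregation from individual predictions to group-level estimates can be sketched as follows. The score cut-off `min_score` that marks a user as depressed is a hypothetical parameter, since the text does not state which cut-off was used, and the function name and input layout are likewise assumptions.

```python
from collections import defaultdict


def prevalence_by_group(users, min_score):
    """Percentage of users per group whose predicted BDI total reaches min_score.

    users: iterable of (group, predicted_bdi_total) pairs, where group could be
    a province, an age band, or a sex category.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [depressed, total]
    for group, score in users:
        counts[group][1] += 1
        if score >= min_score:
            counts[group][0] += 1
    return {
        group: 100.0 * depressed / total
        for group, (depressed, total) in counts.items()
    }
```

The resulting per-group percentages are the kind of series that can then be compared against official survey rates for the same groups.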

The age distribution shows bias towards younger age in the population, mainly for the age group 18-24. This is due to the demographics of social media users. Thirty percent of internet users under the age of 50 use Twitter, compared to eleven percent of online users aged fifty and more [37].

However, the use of this tool differs in the psychiatric environment from social media settings. We suggest that the BDI questionnaire could be revised in future work to be adapted to social media characteristics. In addition, the eRisk dataset could be enhanced by experts to label the questionnaire-related posts so that a reliable automated questionnaire model can be trained to fill in the questionnaire for a better estimation of the depression level of a defined population.

This questionnaire (BDI-II) consists of 21 groups of statements. Please read each group of statements carefully, and then pick out the one statement in each group that best describes the way you have been feeling during the past two weeks, including today. Circle the number beside the statement you have picked. If several statements in the group seem to apply equally well, circle the highest number for that group. Be sure that you do not choose more than one statement for any group, including Item 16 (Changes in Sleeping Pattern) or Item 18 (Changes in Appetite).