MGL-CNN: A Hierarchical Posts Representations Model for Identifying Depressed Individuals in Online Forums

A growing number of users suffering from depression turn to online forums to express their problems and seek help. Such forums often contain a large volume of posts with sensitive content indicating that a user is at risk of suicide or self-harm. Early detection of depression using appropriate deep learning models and social media data can help prevent potential self-harm. However, existing depression detection models are not powerful enough to capture critical sentiment information from the large volume of posts published by each user, which limits their performance. To address this problem, we propose a hierarchical posts representations model named Multi-Gated LeakyReLU CNN (MGL-CNN) for identifying depressed individuals in online forums. The model consists of two parts: a post-level operation, which learns the representation of each of the user's posts, and a user-level operation, which obtains the overall representation of the user's emotional state. In addition, we propose another depression detection model, named Single-Gated LeakyReLU CNN (SGL-CNN), by changing the number of gated units in the MGL-CNN. We show how to use our models to identify depressed users from a large amount of posted content. Experimental results show that our models perform better than the previous state-of-the-art models on the Reddit Self-reported Depression Diagnosis dataset and also perform well on the Early Detection of Depression dataset.

media data for early detection of depression tasks has become an effective means. Meanwhile, the massive volume of social media data makes it difficult to manually identify users with depression or at risk of suicide, which makes the development of automatic depression detection technology more critical. Early detection of depression on social media is a continuous process of data accumulation, and a certain degree of depression risk can be identified only when there is sufficient evidence. This requires collecting a large number of posts with strong temporal associations and long time spans. However, most existing approaches to depression detection use data from a limited number of health centers, which raises privacy and confidentiality issues [13]. There are few methods for depression detection based on large-scale social media data. Therefore, building representations of users' emotional states and identifying critical sentiment information in a large amount of posted content is very important.
Self-expression and social support can help improve the psychological state of depressed people [14]. Moreover, the words people use on social media can reveal real and significant aspects of their social and psychological worlds [15]. Natural language is related to personality, mental state, and situational fluctuation. Therefore, identifying the linguistic style of the individuals involved is particularly important. A great deal of research has focused on depression detection and mental health problems, starting from the analysis of text extracted from social media. For example, Mowery et al. applied a range of machine learning algorithms to classify depressive symptoms from Twitter data for mental health [16]. Choudhury et al. collected several hundred Twitter users who had been diagnosed with depressive disorder and used a statistical classifier to estimate the risk of depression by measuring the users' emotions, language, and linguistic styles [17]. Moreno et al. used a large amount of data from Facebook, with reference to depression symptoms, to identify depressed users [18]. For large-scale data such as the Reddit dataset, Yates et al. presented a general neural network architecture for combining posts into a representation of a user's features to assess depression and self-harm risk [19]. Qing Cong et al. then proposed a deep learning-based approach for handling the imbalanced RSDD dataset [20]. In the Natural Language Processing field, text-based depression detection can also be considered a sentiment analysis task. In addition to depression detection, early detection techniques can also be used in many other health-related fields. For example, they might be used to identify potential pedophiles or people with suicidal tendencies, and to monitor the evolution of psychological disorders.
In this work, we propose two general hierarchical posts representations models for identifying depressed users on large-scale datasets. Each model includes two essential parts: the post-level operation and the user-level operation. We apply gating weights to construct the representation of the users' posts. The main contributions of our work are as follows:
• We introduce two hierarchical neural network models with gated units and convolutional networks for the depression detection task, named Multi-Gated LeakyReLU CNN (MGL-CNN) and Single-Gated LeakyReLU CNN (SGL-CNN). Each user's data is divided into a certain number of posts, and our models identify the genuinely crucial sentiment features of each user's posts while suppressing unimportant information as far as possible.
• The proposed models can encode the relations between posts in the user representation. Each model consists of two parts: a post-level operation, which learns the representation of each of the user's posts, and a user-level operation, which obtains the overall representation of the user's emotional state. Traditional convolutional neural networks are weak at identifying crucial depression features, so we add gated units, which markedly improve performance on this task.
• Empirical results on the RSDD dataset demonstrate that our models perform better than the state-of-the-art methods. To demonstrate the generality of our models, we also use the Early Detection of Depression dataset, collected from another online forum, to estimate the risk of depression. Our methods also perform well on this dataset compared to strong existing methods, demonstrating that our framework is robust and general.
The rest of this paper is organized as follows. Section II gives an overview of related works. Section III introduces our depression detection models. Section IV describes in detail how we conduct the experiment and discusses the experimental results of our proposed models. Section V is the conclusion of our depression detection work.

II. RELATED WORK
Many studies analyze mental health-related texts in social media to better identify and understand mental health-related issues. Some of these studies use traditional machine learning methods. For instance, Schwartz et al. built a regression model using Facebook data to predict multiple-granularity depression in individuals [21]. Thompson et al. used clinical notes and online social media data to build a model based on a Random Forest classifier [22] with bag-of-words features, detecting the risk of suicide in military personnel and veterans [23]. Furthermore, many traditional methods were used in the shared task on automatic identification of content in mental health forums at the 2016 Computational Linguistics and Clinical Psychology Workshop. For instance, Malmasi et al. used a Random Forest meta-classification approach on top of a set of base classifiers [24]. Brew used an SVM with a Radial Basis Function (RBF) kernel [25].
Aside from the machine learning explorations that have achieved sound results in depression detection, many deep learning methods have had impressive successes in text classification and sentiment analysis. These methods rely only on text and do not depend on any external features. For example, Long Short-term Memory (LSTM) [26] networks and their variants introduced memory units and a gating mechanism that decides whether to delete or add information from the memory units, so that longer dependency information can be learned. Much research on sentiment analysis is based on LSTM. For instance, Gui et al. [27] proposed a novel cooperative multi-agent model for depression detection on Twitter. The model included a text feature extraction part and an image feature extraction part. The text feature extraction part applied a gated recurrent unit and convolutional neural networks to extract the textual sentiment features. Tang et al. [28] developed two effective target-dependent models for sentiment classification on Twitter using the bidirectional LSTM. In addition to LSTM, Convolutional Neural Networks (CNNs) are actively exploited for text classification in the medical domain and other NLP tasks, where they learn to extract a hierarchy of crucial text elements. For instance, Kim first applied a simple CNN with one layer of convolution to sentence classification [29]. Yates et al. proposed a general neural network architecture for combining posts into a representation of a user's features to assess depression and self-harm risk [19]. Although CNNs were originally designed for Computer Vision, they are very successful in NLP tasks and are easily parallelized during training because they have no time dependency.
The Gated Convolutional Neural Network (GCNN) [30] first introduced gated units into CNNs for language modeling, providing a linear path for the gradients while retaining nonlinear capabilities, which reduces the vanishing gradient problem. This model uses one convolutional layer to produce gating weights and identify abstract features. Since the weights and the abstract features are convolved at the same level, the significant features identified by the gating weights are very monotonous. To better discover contextual information in text classification, Yang Liu et al. introduced a new CNN model (AGCNN) for sentence classification, which generates the gating weights with a variety of specialized convolution kernels to integrate the contextual information of a particular context window into the control weights [31]. To achieve better performance in aspect-based sentiment analysis, Wei Xue et al. proposed a model based on gated convolutional neural networks, which can selectively output the sentiment features according to the given aspect or entity [32].

III. THE PROPOSED METHOD
We propose two novel hierarchical depression detection models named MGL-CNN and SGL-CNN for identifying depressed individuals in online forums. Since a user's overall data consists of a list of posts, and each post consists of a list of words, the models consist of two parts: a post-level operation and a user-level operation. The first part produces continuous post representations from word representations. Afterward, the post representations are treated as inputs of the second part to get the user's overall emotional state representation. The users' activity representations are then used as features for depression classification. The architecture of our proposed depression detection models is shown in Fig. 1. The difference between the two models is the number of gated units in their two parts. Current natural language processing methods mostly use long short-term memory and attention mechanisms to predict the sentiment polarity of the concerned targets, which require more training time and computational cost. Our proposed models replace the recurrent connections typically used in recurrent networks with gated temporal convolutions. Meanwhile, special convolution encoders are used to convolve the inputs and obtain gating weights independently, and the hierarchical structure can reuse parameters. Therefore, the computations of our models have no time dependency and can be easily parallelized over the individual words of every post in the user's document.
In this section, we mainly describe the details of the post-level operation for the two models. The structure of the user-level operation is the same as that of the post-level operation. The input to the models passes through multiple layers of the convolutional neural network with gated units, making full use of limited contextual information to obtain the critical features of the post's representation. The first convolutional layer uses convolution kernels to obtain feature maps from the input. The second convolutional layer with the gated unit then uses convolution kernels to obtain different gating weights (padded where necessary). These gating weights are multiplied element-wise with the feature map generated by the first convolutional layer to get a post representation.
We first describe the process by which one feature is extracted from one filter. Each word is represented by an embedding stored in a word embedding matrix $L_w \in \mathbb{R}^{d \times |V|}$, where $|V|$ is the number of words in the vocabulary and $d$ is the dimension of the word vectors. Formally, let each post of a user consist of $n$ words $\{w_1, w_2, \ldots, w_i, \ldots, w_n\}$, and let $x_i \in \mathbb{R}^{d}$ be the $d$-dimensional word vector corresponding to the $i$-th word in the post. A post embedding of length $n$ is represented as

$$X_{1:n} = [x_1, x_2, \ldots, x_n].$$

In the first convolutional layer, we use a CNN with multiple convolutional filters of different widths [33] to produce the post's representation. The convolutional filters with different widths can be regarded as extractors that obtain multi-grained local information like N-grams. For example, a convolutional filter with a width of 2 essentially captures the semantics of bigrams in a user's post. Multiple convolutional filters with different window sizes are applied to obtain multiple feature maps. Let $K \in \mathbb{R}^{s \times d}$ be a convolutional filter with a stride of 1, which is applied to a window of $s$ words to produce a new feature. Let $[x_i, x_{i+1}, \ldots, x_{i+s-1}]$ refer to the concatenation of word embeddings in a fixed-length window of size $s$, denoted as $X_{i:i+s-1}$. A new feature $a_i$ is generated from $X_{i:i+s-1}$ by

$$a_i = f(K * X_{i:i+s-1} + b),$$

where $b \in \mathbb{R}$ is a bias term, $*$ denotes the convolution operation, and $f$ is an activation function (LeakyReLU). This filter is applied to each possible window of $s$ words in the post $\{X_{1:s}, X_{2:s+1}, \cdots, X_{n-s+1:n}\}$ to produce a feature map $A = [a_1, a_2, \ldots, a_{n-s+1}]$ with $A \in \mathbb{R}^{(n-s+1) \times 1}$. Each feature map $A$ obtained by the filters of different sizes is then fed into the second convolutional layer. The second convolutional layer consists of a convolutional layer and gated units. This layer is designed to produce different gating weights.
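The first-layer computation described above can be illustrated with a small pure-Python sketch. It is only a toy illustration under stated assumptions, not the authors' implementation: the function names, the LeakyReLU slope of 0.01, the random toy embeddings, and the use of a single filter per window size are all hypothetical choices made for readability. As is usual in deep learning code, the "convolution" is computed as a sliding dot product.

```python
import random


def leaky_relu(z, alpha=0.01):
    """LeakyReLU activation; alpha=0.01 is an assumed slope, not taken from the paper."""
    return z if z > 0 else alpha * z


def conv_feature_map(X, K, b=0.0):
    """Slide filter K (s x d) over post X (n x d) with stride 1, without padding.

    Returns A = [a_1, ..., a_{n-s+1}], where each a_i is the LeakyReLU of the
    sliding dot product between K and the window X[i:i+s], plus the bias b.
    """
    n, s, d = len(X), len(K), len(K[0])
    A = []
    for i in range(n - s + 1):
        z = sum(K[r][c] * X[i + r][c] for r in range(s) for c in range(d))
        A.append(leaky_relu(z + b))
    return A


rng = random.Random(0)
n, d = 7, 4  # toy post: 7 words with 4-dimensional embeddings
X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

# One hypothetical filter per window size; the paper uses several filters per size.
feature_maps = {}
for s in (2, 3, 4):
    K = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(s)]
    feature_maps[s] = conv_feature_map(X, K)

for s, A in feature_maps.items():
    print(s, len(A))  # each feature map has n - s + 1 entries
```

Wider filters yield shorter feature maps, which is why a pooling step is needed later to bring all maps to a common size before concatenation.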
We denote by $F \in \mathbb{R}^{h \times 1}$ the kernel of a convolution operation applied to the contextual features $A$. The kernel $F$ with window size $h$ (padded when necessary) slides over $A$ and, at each feature $a_l$, generates a gating weight $g_l \in \mathbb{R}$, $l = 1, 2, \cdots, n-s+1$. All gating weight elements generated by the feature map $A$ and the kernel $F$ form the gating weights matrix $G = [g_1, g_2, \ldots, g_{n-s+1}]$ with $G \in \mathbb{R}^{(n-s+1) \times 1}$. Let $m$ be the number of convolution kernels used in the second convolutional layer. We use the gated units of MGL-CNN to extract different gating weight matrices $G_1, G_2, \cdots, G_m$. Afterwards, we get the output feature map $O$ through the gating weight matrix $G$:

$$O = A \otimes G,$$

where $\otimes$ is the element-wise product between matrices. $O \in \mathbb{R}^{1 \times (n-s+1)}$ when we use the SGL-CNN, and $O \in \mathbb{R}^{m \times (n-s+1)}$ when we use the MGL-CNN. The output $A$ of the first convolutional layer is thus modulated by the gating weights $G$: the gating weights multiply the feature map $A$ and control what information should be propagated through the layers. To capture global information of a post, we then feed the outputs of the second convolutional layer to a global average pooling layer and, in the MGL-CNN model, concatenate all the outputs to get the post representation.
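The gating step and the pooling step can be sketched in the same toy style. Several details below are assumptions rather than facts from the text: the gate applies LeakyReLU (suggested by the model names but not confirmed here), the kernel width is odd and the feature map is zero-padded symmetrically so that the gate has the same length as the feature map, and no bias is used. The function names are hypothetical.

```python
def leaky_relu(z, alpha=0.01):
    """LeakyReLU activation; alpha=0.01 is an assumed slope."""
    return z if z > 0 else alpha * z


def gating_weights(A, F):
    """Slide kernel F (odd width h) over feature map A with zero padding.

    Assumption: symmetric zero padding, so that G has the same length as A.
    """
    h = len(F)
    pad = h // 2
    padded = [0.0] * pad + list(A) + [0.0] * pad
    return [
        leaky_relu(sum(F[k] * padded[l + k] for k in range(h)))
        for l in range(len(A))
    ]


def gated_post_features(A, kernels):
    """For each kernel, gate A element-wise and apply global average pooling.

    One kernel corresponds to the single-gated case; several kernels to the
    multi-gated case, where the pooled values are concatenated.
    """
    features = []
    for F in kernels:
        G = gating_weights(A, F)
        O = [a * g for a, g in zip(A, G)]  # element-wise product of A and G
        features.append(sum(O) / len(O))   # global average pooling
    return features


A = [1.0, 2.0, 3.0]
print(gated_post_features(A, [[1.0]]))                   # single gate, h = 1
print(gated_post_features(A, [[1.0], [1.0, 1.0, 1.0]]))  # two gates, h = 1 and h = 3
```

With the width-3 kernel of ones, the gate at each position is the sum of the feature and its neighbors, so features surrounded by strong activations are amplified while isolated ones are scaled down.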
The obtained post representations are fed to the user-level operation to calculate the user's activity representation, using the same method as the post-level operation. The obtained user features are then passed to a fully connected softmax layer whose output is the probability distribution over labels. Categorical cross-entropy is used as the model's loss function. Let $p^T$ be the target sentiment distribution for each document and $p$ be the predicted document sentiment distribution. The loss is

$$\text{loss} = -\sum_{i \in T} \sum_{j=1}^{C} p^{T}_{ij} \log p_{ij},$$

where $T$ is the training data, $C$ is the number of categories, $i$ is the index of the document, and $j$ is the index of the class. The goal of training is to minimize the cross-entropy error between $p^T$ and $p$ over all training documents.
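As a small worked example of the categorical cross-entropy objective, the sketch below sums the loss over documents and classes. The function name and the clipping constant `eps` are hypothetical additions for numerical safety; deep learning frameworks typically report the mean over a mini-batch rather than the sum.

```python
import math


def categorical_cross_entropy(targets, predictions, eps=1e-12):
    """Sum of -p_true * log(p_pred) over all documents and classes.

    targets and predictions are lists of per-document probability
    distributions; eps (an assumed constant) avoids log(0).
    """
    total = 0.0
    for p_true, p_pred in zip(targets, predictions):
        for t, p in zip(p_true, p_pred):
            total -= t * math.log(max(p, eps))
    return total


# Two documents, two classes (e.g. control, depressed), with one-hot targets.
targets = [[1.0, 0.0], [0.0, 1.0]]
predictions = [[0.9, 0.1], [0.2, 0.8]]
print(categorical_cross_entropy(targets, predictions))  # equals -(log 0.9 + log 0.8)
```

With one-hot targets, only the predicted probability of the true class contributes, so confident correct predictions drive the loss toward zero.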

IV. EXPERIMENTS AND RESULTS
Experiments are conducted based on the Reddit Self-reported Depression Diagnosis (RSDD) dataset and the Early Detection of Depression dataset (eRisk 2017). We evaluate the performance of our proposed models by comparing them with other strong baseline models and analyze the performance of our models. The reported results are on the test set.

A. EXPERIMENTAL DATASETS
The large-scale Reddit Self-reported Depression Diagnosis (RSDD) dataset [19] contains over 9,000 users diagnosed with depression, matched with approximately 107,000 control users who have a healthy mental state, so the data are imbalanced. On average, there are about 900 posts per user in the dataset, with 148 words per post. The dataset was created from the publicly available online forum Reddit and is used to train and test models for identifying users with depression. The RSDD dataset is an order of magnitude larger and more accurate than the self-reported diagnosis datasets created in prior work. Diagnosis posts, including false-positive diagnoses such as hypotheticals and negations, are all removed from the diagnosed users, and users who published fewer than 100 posts are also discarded. Meanwhile, to prevent diagnosed users from being easily identified through sensitive terms strongly associated with depression, posts containing depression terms are removed. The Early Detection of Depression dataset (eRisk 2017) [34] can be used to develop an exploratory task on early risk detection of depression. It is a collection of posts from a set of social media users in two categories: depressed users and mentally healthy users. The two categories are unbalanced (there are more mentally healthy users than depressed users). For each user, the collection contains a sequence of posts in chronological order. The number of users is not very high (about 486 users), but each user has a long history of writings (on average, hundreds of messages per user). Furthermore, the mean time from the first to the last submission is quite long (more than 500 days).

B. BASELINE MODELS
We compare our methods with the following baseline methods used on the Reddit Self-reported Depression Diagnosis dataset. The previous state-of-the-art model on the RSDD dataset is User model-CNN [19].
• BoW-SVM and BoW-MNB classifiers [35]. A Support Vector Machine (SVM) or Multinomial Naive Bayes (MNB) classifier is combined with posts represented as sparse bag-of-words features for the depression detection task.
• Feature-rich-SVM and Feature-rich-MNB. The two methods use multiple features such as a sparse bag of words features, external psycholinguistic features captured by LIWC5 [36] and emotion lexicon features [37].
• User model-CNN [19]. The depression detection model consisted of a shared architecture based on a CNN, a merge layer, model-specific loss functions, and an output layer. It was the previous state-of-the-art model on the RSDD dataset.
Besides, we introduce several popular models in natural language processing and compare our models with them.
• Long Short-Term Memory is a recurrent neural network with memory cells and three gate mechanisms, which is designed to address the long-term dependency problem [26]. In our depression detection task, it takes all the words of a post as a single sequence to obtain the post's representation and uses all the posts of a user to get the user's representation for detection.
• Bi-directional Long Short-Term Memory consists of two LSTMs, which can capture bidirectional semantic dependency and improve the abilities of memory [38].
• The GRU-Attention model consists of word- and sentence-level attention mechanisms and sequence encoders based on GRU for document classification [39]. We also replace GRU with LSTM (LSTM-Attention) and Bidirectional LSTM (Bi-LSTM-Attention) as additional baselines.
• CIFG-LSTM is a variant on Long Short Term Memory, which is designed to couple the input gate and the forget gate as one uniform gate [40]. Instead of individually deciding what to forget and add, the CIFG-LSTM makes those decisions together to simplify the structure of the LSTM.
• To consider the spatial structure between words in the user's posts, we also introduce Tree-LSTM [41] to achieve the representation of words to sentences over parse tree structures rather than in a sequential way. For the depression detection task, we firstly use the Stanford CoreNLP [42] to do tokenization and split sentences on the RSDD datasets and generate dependency parses using Stanford Neural Network Dependency Parser. Then we use Tree-LSTM to obtain the post representations and LSTM to get the user's activity representations.
• We also introduce BERT [43] for the depression detection task with some modifications. We use BERT to obtain post representations that integrate the context semantics. All posts published by a user are then fed to an LSTM to get the user's activity representation.
For the eRisk 2017 dataset, we choose the top methods [34] from the early detection of depression task as baselines and compare our methods against these baselines.

C. EXPERIMENTAL SETUP
The RSDD dataset consists of training, validation, and testing sets, each containing approximately 3,000 diagnosed users and 35,000 control users. The values of the hyperparameters of our models are shown in Table 2. The RSDD validation set is used to select the depression detection model's hyperparameters, and the test set is used to report the results. We do not initialize the embedding layer with pre-trained embeddings such as the publicly available GloVe or Word2Vec. The input of the depression detection models is composed of the original terms encoded as one-hot vectors. The input layer is then used to learn 50-dimensional and 100-dimensional embeddings of the terms (Embed_size). The learning rate (lr) is set to 0.001. For RSDD and eRisk 2017, we set the mini-batch size to 64 and 128, respectively. We define Maxm as the maximum number of posts of each user and Maxn as the maximum number of tokens of each post. When a user's document exceeds the maximum number of posts, we shuffle the posts and randomly select Maxm of them. For example, on the RSDD dataset our models receive up to 600 posts (Maxm = 600), each containing up to 100 words (Maxn = 100), for each target user. We also tried increasing the maximum number of posts, but the performance on the validation data did not improve significantly. Our proposed depression detection models are implemented in Keras.
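The post selection step can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `select_posts`, the `seed` parameter, and the choice to keep the first Maxn tokens of each post are assumptions, since the text only states the two limits and the shuffle-and-select rule.

```python
import random


def select_posts(posts, max_posts, max_tokens, seed=None):
    """Limit a user's document to max_posts posts of at most max_tokens tokens.

    posts is a list of token lists. If the user has more than max_posts posts,
    the posts are shuffled and max_posts of them are kept. Keeping the first
    max_tokens tokens of each post is an assumed truncation rule.
    """
    posts = list(posts)
    if len(posts) > max_posts:
        rng = random.Random(seed)
        rng.shuffle(posts)
        posts = posts[:max_posts]
    return [post[:max_tokens] for post in posts]


# Toy user with 5 posts, limited to 3 posts of at most 4 tokens each.
user_posts = [["w%d_%d" % (p, t) for t in range(6)] for p in range(5)]
selected = select_posts(user_posts, max_posts=3, max_tokens=4, seed=0)
print(len(selected), [len(post) for post in selected])
```

On RSDD the corresponding call would use `max_posts=600` and `max_tokens=100`. Users with fewer posts than the limit are passed through in their original order.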
For the post-level operation of the MGL-CNN, the window sizes of the first convolutional kernels (s) are set to 2, 3, 4, 5, and 6, with 30 different kernels for each window size. In the second convolutional layer, the window sizes of the kernels (h) are 1, 3, 5, and 7, with 30 different kernels for each window size. We use convolution kernels of size 3 in the second layer of SGL-CNN. For the user-level operation of our models, we set the same parameters as for the post-level operation. We do not perform any specific tuning on the datasets. Class balancing was performed with Categorical Cross Ent in our models, which uses a softmax function and categorical cross-entropy as its loss function. All models are trained using stochastic gradient descent with the Adam optimizer [44].
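The stated kernel settings can be collected in a small configuration sketch to make the resulting feature map sizes concrete. The dictionary layout and key names are hypothetical, and the computed lengths assume that the first layer uses no padding and that a post has exactly Maxn = 100 tokens.

```python
# Hyperparameters as stated in the text; the dictionary layout is hypothetical.
MGL_CNN_POST_LEVEL = {
    "first_layer_window_sizes": [2, 3, 4, 5, 6],  # s
    "first_layer_kernels_per_size": 30,
    "second_layer_window_sizes": [1, 3, 5, 7],    # h
    "second_layer_kernels_per_size": 30,
    "max_tokens_per_post": 100,                   # Maxn on RSDD
}


def first_layer_summary(cfg):
    """Return the feature map length n - s + 1 per window size and the map count."""
    n = cfg["max_tokens_per_post"]
    lengths = {s: n - s + 1 for s in cfg["first_layer_window_sizes"]}
    num_maps = (
        len(cfg["first_layer_window_sizes"]) * cfg["first_layer_kernels_per_size"]
    )
    return lengths, num_maps


lengths, num_maps = first_layer_summary(MGL_CNN_POST_LEVEL)
print(lengths)   # feature map length for each window size s
print(num_maps)  # total number of first-layer feature maps
```

Under these assumptions the first layer yields 5 x 30 = 150 feature maps per post, with lengths from 99 (s = 2) down to 95 (s = 6).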

D. RESULTS
The results on RSDD for identifying depressed users with our methods and the baselines are shown in Table 3. The differences between our models and the baselines are statistically significant (McNemar's test, p < 0.05). We compare our models against several baselines using MNB and SVM classifiers with two sets of features. Although the traditional SVM and MNB methods with rich features can achieve high precision, their recall and F1 are poor compared with the state-of-the-art User model-CNN and other popular models in NLP (e.g., CNN-based and LSTM-based methods). For instance, Feature-rich-SVM and BoW-SVM achieve high precision (0.71 and 0.72, respectively) but low recall (0.31 and 0.29, respectively).
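To see why high precision with low recall still gives a weak F1, recall that F1 is the harmonic mean of precision and recall. The short sketch below applies this standard formula to the precision and recall values quoted above. The function name is hypothetical, and because the inputs are already rounded to two decimals, the computed values may differ slightly from the F1 scores reported in the results table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baselines = [("Feature-rich-SVM", 0.71, 0.31), ("BoW-SVM", 0.72, 0.29)]
for name, precision, recall in baselines:
    print(name, round(f1_score(precision, recall), 2))
# prints:
# Feature-rich-SVM 0.43
# BoW-SVM 0.41
```

The harmonic mean is dominated by the smaller of the two values, so a classifier that rarely flags depressed users cannot reach a high F1 regardless of how precise its few positive predictions are.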
In addition, our models achieve competitive results against several popular models in natural language processing. The proposed MGL-CNN model achieved precision close to that of Bi-LSTM-Attention but performed better on recall and F1. Moreover, we can conclude from Table 3 that the seven selected sequence models achieved almost the same performance on the RSDD dataset (aside from Bi-LSTM-Attention). The Bi-LSTM-Attention model achieved the best performance among them. Compared to the User model-CNN, its precision increased by 5.1%, but its recall decreased. The bidirectional architecture can look forward and backward to capture bidirectional semantic dependency and improve the abilities of the memory. Therefore, Bi-LSTM performs better than unidirectional models. The experiments also indicate that the attention mechanism helps LSTM and its variants achieve good results in this task.
Compared with previous work, our proposed SGL-CNN model outperforms the state-of-the-art User model-CNN in terms of Recall and F1 on depressed users (increases of 24.4% and 3.9%, respectively). Our proposed MGL-CNN model outperforms the User model-CNN in terms of Precision, Recall, and F1 on depressed users (increases of 6.8%, 6.7%, and 5.9%, respectively). The overall result of MGL-CNN is slightly better than that of SGL-CNN. Our proposed models obtain an effective improvement over the User model-CNN and perform better than the other strong baseline models. We believe that the Multi-Gated (Single-Gated) LeakyReLU unit helps the CNN make full use of limited contextual information to obtain the critical features of the post's representation. In our models, the first convolutional layer captures the n-gram features of the text. The gated unit with different kernels then obtains gating weights to effectively identify language associated with negative sentiment across a user's posts and suppress the impact of unimportant information.
The results on the Early Detection of Depression dataset for our models and the current best-performing methods are shown in Table 4. The absolute values of the baselines' metrics illustrate that the early detection of depression task is difficult. In terms of F1, performance is low: the highest F1 is 0.64. Some methods, e.g., FHDO-BCSGB, opted to optimize precision but had low recall, while other methods, e.g., UNSLA, opted to optimize recall but had low precision. This might be related to the scale and creation of the dataset. Our proposed models (SGL-CNN and MGL-CNN) achieve performance close to several state-of-the-art methods in terms of Precision, Recall, and F1 on depressed users. Moreover, unlike the baseline models, our models are not aimed at improving a single indicator but perform well on all three indicators (Precision, Recall, and F1). The comparison between the results of our models and those of the latest methods suggests that our proposed general neural network architecture can also be applied to the early detection of depression in different online forums. In addition, Fig. 4 compares the change in training loss over 40 epochs. The models have almost the same convergence speed, and the results of our models are even slightly better than those of the state-of-the-art User model-CNN, indicating that although our models are more complex, the convergence speed does not decrease.

V. CONCLUSION
In this work, we proposed two hierarchical posts representations models for identifying depressed individuals, which are more accurate and efficient than general early depression detection models. The proposed models can effectively represent the user's overall emotional state through their posts. We applied our models to the large-scale Reddit Self-reported Depression Diagnosis dataset and found that they substantially outperformed strong existing methods in terms of Precision, Recall, and F1. However, the absolute values of the metrics illustrate that depression detection on large-scale social media datasets is still a challenging task and worthy of further exploration. To demonstrate that our models can learn representations of users' posts from different online forums, we also applied them to the Early Detection of Depression dataset, where they achieved performance close to strong previously proposed methods.
Our work is significant from several perspectives. We provide strong models for identifying depressed users on social media and a method for large-scale public mental health studies of depression, and we study the close connection between social media and mental health in more depth. We also demonstrate the possibility of sensitive applications that combine clinical care with users' online activities, where doctors could be notified and help in time if the activities of a user suggest symptoms of depression. In future work, we will explore the application of MGL-CNN and SGL-CNN to general document-level sentiment analysis.