Utilizing Neural Networks and Linguistic Metadata for Early Detection of Depression Indications in Text Sequences

Depression is ranked as the largest contributor to global disability and is also a major reason for suicide. Still, many individuals suffering from forms of depression are not treated for various reasons. Previous studies have shown that depression also has an effect on language usage and that many depressed individuals use social media platforms or the internet in general to get information or discuss their problems. This paper addresses the early detection of depression using machine learning models based on messages on a social platform. In particular, a convolutional neural network based on different word embeddings is evaluated and compared to a classification based on user-level linguistic metadata. An ensemble of both approaches is shown to achieve state-of-the-art results in a current early detection task. Furthermore, the currently popular ERDE score as a metric for early detection systems is examined in detail and its drawbacks in the context of shared tasks are illustrated. A slightly modified metric is proposed and compared to the original score. Finally, a new word embedding trained on a large corpus of the same domain as the described task is evaluated as well.


INTRODUCTION
According to the World Health Organization (WHO) [1], more than 300 million people worldwide are suffering from depression, which equals about 4.4% of the global population. While forms of depression are more common among females (5.1%) than males (3.6%) and prevalence differs between regions of the world, it occurs in any age group and is not limited to any specific life situation. Depression is therefore often described as being accompanied by paradoxes, caused by a contrast between the self-image of a depressed person and the actual facts [2]. The latest results from the 2016 National Survey on Drug Use and Health in the United States [3] report that, during the year 2016, 12.8% of adolescents between 12 and 17 years old and 6.7% of adults had suffered a major depressive episode (MDE).
Precisely defining depression is not an easy task, not only because several sub-types have been described and changed in the past [4], but also because the term "being depressed" has become frequently used in everyday language. In general, depression can be described to lead to an altered mood and may also be accompanied, for example, by a negative self-image, wishes to escape or hide, vegetative changes, and a lowered overall activity level [2, p. 8]. The symptoms experienced by depressed individuals can severely impact their ability to cope with any situation in daily life and therefore differ drastically from normal mood variations that anyone experiences. At the worst, depression can lead to suicide. WHO estimates that, in the year 2015, 788,000 people died by suicide and that it was the second most common cause of death for people between 15 and 29 years old worldwide [1]. In Europe, self-harm was even reported as the most common cause of death in the age group between 15 and 29 and the second most common between 30 and 49, again in results obtained by WHO in 2015 [5].
• The authors are with the Department of Computer Science, University of Applied Sciences and Arts Dortmund, Germany. E-mail: mtrotzek@stud.fh-dortmund.de, sven.koitka@fh-dortmund.de, and christoph.friedrich@fh-dortmund.de
• S. Koitka is also with the Department of Computer Science, TU Dortmund University, Germany, and with the Department of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Germany.
• C. M. Friedrich is also with the Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Germany.
This work has been submitted to the IEEE and has been accepted for future publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Although the severity of depression is well-known, only about half of the individuals affected by any mental disorder in Europe get treated [6]. The proportion of individuals seeking treatment for mood disorders during the first year ranges between 29-52% in Europe, 35% in the USA, and only 6% in Nigeria or China [7]. In addition to possible personal reasons for avoiding treatment, this is often due to a limited availability of mental health care, for example in conflict regions [8]. Via a telephone survey in Germany [9], researchers found out that shame and self-stigmatization seem to be much stronger reasons to not seek psychiatric help than actual perceived stigma and negative reactions of others. They further speculate that the fear of discrimination might be relatively unimportant in their study because people hope to keep their psychiatric treatment secret. Another study amongst people with severe mental illness in Washington D.C. showed that stigma and discrimination indeed exist, while they are not "commonly experienced problems" but rather "perceived as omnipresent potential problems" [10, p. 1].
While depression and other mental illnesses may lead to social withdrawal and isolation, it was found that social media platforms are indeed increasingly used by affected individuals to connect with others, share experiences, and support each other [11], [12]. Based on these findings, peer-to-peer communities on social media may be able to challenge stigma, increase the likelihood to seek professional help, and directly offer help online to people with mental illness [13]. A similar study in the USA [14] came to the conclusion that internet users with stigmatized illnesses like depression or urinary incontinence are more likely to use online resources for health-related information and for communication about their illness than people with another chronic illness. All this emphasizes the importance of research toward ways to assist depressed individuals on social media platforms and on the internet in general.
arXiv:1804.07000v3 [cs.CL] 21 Dec 2018
This paper is therefore focused on ways to classify indications of depression in written texts as early as possible based on machine learning methods. The work presented in this paper is structured as follows: Section 2 gives an overview of related work concerning depression, its influence on language, and natural language processing methods. Section 3 describes the dataset used in this work, analyzes the evaluation metric of the corresponding task, and proposes an alternative. Section 4 introduces the user-based metadata features used for classification, while Section 5 describes the neural network models utilized for this task. Section 6 contains an experimental evaluation of these models and compares them to published results. Finally, Section 7 concludes this work and summarizes the results.

RELATED WORK
This section describes the context of this work based on previous research concerning depression and its effects on language. Since social media research in general and health research in particular require ethical considerations, an overview of the current ethical discussion in the field of natural language processing is given. Finally, the practical basis of this work is described by investigating previous and current work in text classification using machine learning.

Depression and Language
Previous studies have already shown that depression also has an effect on the language used by affected individuals. For example, a more frequent use of first person singular pronouns in spoken language was first observed in 1981 [15], [16]. An examination of essays written by depressed, formerly-depressed, and non-depressed college students at the University of Texas [17] confirmed an elevated use of the word "I" in particular and also found more negative emotion words in the depressed group. Similarly, a Russian speech study [18] found a more frequent use of all pronouns and verbs in past tense among depression patients. A recent study based on English forum posts [19] observed a more frequent use of absolutist words (e.g. absolutely, completely, every, nothing 1 ) within forums related to depression, anxiety, and suicidal ideation than within completely unrelated forums as well as ones about asthma, diabetes, or cancer.
The knowledge that language can be an indicator of an individual's psychological state has, for example, led to the development of the Linguistic Inquiry and Word Count (LIWC) software [20], [21]. By utilizing a comprehensive dictionary, it allows researchers to evaluate written texts in several categories based on word counts. A more detailed description of LIWC and its features is given in Section 4. With a similar purpose, the Differential Language Analysis Toolkit (DLATK) [22], an open-source Python library, was created for text analysis with a psychological, health, or social focus.

Ethical Perspective
Driven by the growing availability of data, for example through social media, and the technological advances that allow researchers to work with this data, ethical considerations are becoming more and more important in the field of Natural Language Processing (NLP). Based on these developments, NLP has changed from being mostly focussed on improving linguistic analysis towards actually having an impact on individuals based on their writings. Still, a proper discussion about ethics in NLP has only been started in 2016 by Hovy and Spruit [23]. Although Institutional Review Boards (IRBs) have been well-established to enforce ethical guidelines on experiments that directly involve human subjects, the authors note that NLP and data sciences in general have not constructed such guidelines. They further argue that language "is a proxy for human behavior, and a strong signal of individual characteristics" and that, in addition, "the texts we use in NLP carry latent information about the author and situation" [23, p. 592]. On top of this direct connection to the individual, they also describe the social impact of NLP research [23, pp. 593-594]. A demographic bias in the selection of training texts can lead to the exclusion of specific groups, overgeneralization based on false positives can have serious consequences depending on the task, and research results can potentially cause or confirm biases and ultimately discrimination by topic overexposure. Even if all these factors are considered, they conclude that dual-use problems can exist for any research if results are used in a different way than originally intended. The same applies to pre-trained machine learning models that get published and could theoretically be used in unintended ways.
These discussions about ethics in NLP have led to the First Workshop on Ethics in Natural Language Processing 2 during the conference of the European Chapter of the Association for Computational Linguistics in 2017 (EACL 2017). Some interesting results of this workshop include, for example, a proposed process to make NLP research "ethical by design" [24] by installing an Ethics Review Board (ERB) in research organizations that has to approve or veto all steps during research, development, and deployment. Specifically for health research in social media, guidelines for ethical research have been proposed [25]. They include obtaining consent from users whenever possible, carefully considering the consequences of any interactions with users or modifications of the user experience, protecting the data during research and when sharing it with other researchers, and de-identifying users during analysis, presentation, and when linking data from several platforms. From another perspective, there are also ethical considerations to keep in mind for NLP shared tasks and shared tasks in general [26]. The competitive nature of such tasks may lead researchers to be secretive about their systems and methods, ethical concerns may be overlooked, and conflicts of interest may arise if organizers themselves participate in a task.
2. http://www.ethicsinnlp.org/, accessed on 2018-02-14
While most discussions about health research in social media focus on the important theoretical groundwork to establish guidelines, there has also been a qualitative study using focus group interviews with 16 depressed and 10 non-depressed participants [27] to investigate their opinion about population-level mental health monitoring on Twitter. Firstly, participants of this study were generally aware of the fact that their Twitter messages are public, but showed misconceptions about how access to them could be limited by deleting them, by limitations of the user interface, or by the sheer amount of messages on the platform. While the participants mainly accepted aggregated depression monitoring based on Twitter, some still found it "creepy" and a particular participant stated: "The fact that if it was an algorithm, and they were looking like, 'Hey, we think you're feeling low right now.' I feel like it might make me feel even more low." [27, p. 6] Similar to this statement, participants were concerned about the possibility of using population-based data to identify specific individuals, while others had the opinion that "pinpointing individuals could help them access much-needed mental health services by paying attention to cues that friends may ignore" [27, p. 7]. In general, participants supported the idea of using social media data as an additional source for professional therapists.

Natural Language Processing
The work described in this paper belongs to the area of Natural Language Processing (NLP) [28] and text classification in particular. The origins of text classification tasks can be found in early research to automatically categorize documents based on statistical analysis of specific clue words in 1961 [29]. Later, similar research goals led to rule-based text classification systems like CONSTRUE in 1990 [30] and finally the field began to shift more and more to machine learning algorithms around the year 2000 [31], [32]. In addition to text categorization, machine learning was also a driving force in other text-based tasks like sentiment analysis, which is focussed on extracting opinions and sentiment from text documents [33]. It was first used in combination with machine learning to find positive or negative opinions in movie reviews [34] and was then extended to other review domains [35], as well as completely different areas like social media monitoring and general analysis of consumer attitudes [33].
More recently, deep learning has been utilized for text classification [36], [37] in addition to its more common usages in image classification. State-of-the-art results in several text-based tasks could, for example, be achieved by transfer learning methods like Universal Language Model Fine-tuning (ULMFit) [38] and the Google research project Bidirectional Encoder Representations from Transformers (BERT) [39] for the training of language representations, which includes ULMFit and several other methods. The code of BERT and several pre-trained models are also available on GitHub 3 .
Based on these developments, research evolved to text classification tasks that extract more than just opinions from documents: Especially the availability of social media messages enabled researchers to extract population-based health information that made it possible to track diseases, symptoms, and medications [40]. More specifically, Twitter messages were used for population-level tracking of depression [41], detection of depression [42], [43], bipolar disorder [44], and post-traumatic stress disorder (PTSD) [45] for individuals. Depression detection from text documents in particular has become an increasingly important research area, with interesting methods and results reported for Twitter, Facebook, and forum posts [46]. To directly help depression patients, systems like Psychologist in a Pocket [47], [48], an Android smartphone app, are being developed: Users of this app can choose specific text inputs on their device that should be monitored (e.g. social media posts, mails, or text messages) to be informed about possibly alarming mood changes that they themselves might overlook. By installing an additional plugin, data can be shared with a third party, for example a therapist, and is otherwise password secured and only saved locally.
In addition to text-based depression detection, the second sub-task of the work described in this paper can be found in the area of early detection. Early detection based on text documents can be seen to originate from the idea of sequential reading to allow predictions based on as few documents as possible [49]. An approach using a modified naïve Bayes classifier was shown to be viable for text categorization and sexual predator detection with partial information [50]. Other interesting use cases of early detection applied in practice have been found in the detection of early signs of epidemics [51] or rumors [52] from social media messages.
The fields of depression detection and early detection were first combined by the publication of a dataset for early detection of depression in reddit messages [53] and research using this dataset was driven by the Conference and Labs of the Evaluation Forum (CLEF) 2017 conference 4 workshop on early risk detection on the internet [54], [55]. As this task and dataset are also utilized in this paper, further details can be found in Section 3. During the workshop in 2017, interesting results could be obtained using combinations of Information Retrieval (IR) and supervised learning based on bag of words and dictionaries [56], a two-step classification based first on posts and then on users [57], purely user-based features and random forests [58], lexicon word counts and medical concepts using Support Vector Machines (SVM) or Recurrent Neural Networks (RNN) with Gated Recurrent Units (GRU) [59], and graph models [60]. The Temporal Variation of Terms (TVT) model for early detection, based on the variation of vocabulary over time, was proposed [61] and successfully evaluated [62]. The authors of this paper participated in the task by using models that combined user-based linguistic metadata with bag of words, document embeddings, and RNNs using Long Short Term Memory (LSTM) layers [63]. Results from this task are used to evaluate the experiments in Section 6. Similar text classification research in a psychological context has been conducted at the CLPsych conferences 5 of the past years. In 2016 and 2017, for example, the conference presented a shared task [64], [65] that challenged participants to prioritize posts in an online peer-support forum to tell moderators how urgently a message needs their attention. The CLPsych shared task in 2018 [66] focused on an even more notable approach to early detection: Based on essays written by 11-year-olds, participants had to predict the current as well as future psychological state of the author at specific times in their life.

DATASET OVERVIEW
This section gives an overview of the dataset used for the experiments described in this paper and its main characteristics. It also details the corresponding task and the evaluation criteria.

Dataset
The dataset utilized in all experiments for this paper was first described in 2016 for research on depression and language use [53] and then finally published as part of the CLEF 2017 conference eRisk pilot task on early detection of depression [54]. It contains chronological sequences of posts and comments from reddit.com, collected for a total of 135 depressed users and a random control group of 752 users. Depressed users were identified by searching for posts that clearly mention a diagnosis (e.g. "I was diagnosed with depression"). Since there is no way to validate these statements and no further investigation of the users was possible, there could theoretically be non-depressed individuals in this group but also depressed ones in the control group. Any occurrence of user names has been replaced by an ID like train subject 1 to anonymize users. The number of messages collected for each user ranges from 10 to 2,000 due to API limitations and the fact that some of them have posted very rarely. The dataset has been split into a training and test set as displayed in Table 1.
Each message in the dataset may consist of a title, text, or both, depending on its type: Users on reddit are able to post content in terms of an image or URL (title only), as text content (title and optional text), or as comment on another message (text only). A total of 91 messages in the dataset are completely empty and can therefore be discarded. Since deleted messages are normally exchanged with the text "[deleted]", these seem to be caused by a fault in reddit, the API, or the preprocessing before publishing the dataset.
5. http://clpsych.org/, accessed on 2018-11-24
In addition, each message also contains a date attribute with the timestamp of when the user published it, accurate to the second. Since the reddit API returns all timestamps in UTC (or the local timezone of the reddit server 6 ), these timestamps can primarily be used to sort messages and search for time patterns of a single user. Comparing timestamps between different users would most likely give misleading results because their actual timezone is unknown and they could live anywhere in the world.
Since users for the control group were collected by selecting users that had posted recently when the dataset was collected, instead of using a distribution over time similar to the depressed users, the timestamps also contain a hidden feature that could be exploited: When using the time of the latest post per user (in seconds since epoch) as the only input for a logistic regression, this single feature was enough to obtain an $F_1$ score of 0.78 on the test data. This feature could easily be used as soon as the last data chunk (see Section 3.2) is available. As this is clearly not intended and not in the interest of this task, all models created for this paper completely discard the timestamp information and a detailed analysis of this fact has been sent to the organizers of eRisk to prevent this in future tasks 7 .

Task and Evaluation Criteria
The given dataset was explicitly published for research toward early detection of depression within the previously described eRisk task. To make earliness measurable, the data was also split into ten chunks by the organizers, each containing 10% of each user's messages in chronological order. During the test phase of the eRisk task, a single chunk of data was published each week, starting with the oldest messages of the users. Participants then had the possibility to classify a user as depressed, non-depressed, or delay the decision to see additional data in the next week. Submitted predictions were final and could not be reversed later. In the last week, a prediction had to be given for every user. In addition to the correct and wrong predictions, evaluations could therefore also take into account how many messages participants had seen for each user before giving a prediction. This information can be utilized by the organizers' early risk detection error (ERDE) measure for early detection systems that was defined in their dataset paper as well [53, pp. 7-8]: With a binary decision $d$ submitted for a user after reading $k$ of their messages, $\text{ERDE}_o$ is defined as:

$$
\text{ERDE}_o(d, k) =
\begin{cases}
c_{fp} & \text{for false positives (FP)} \\
c_{fn} & \text{for false negatives (FN)} \\
lc_o(k) \cdot c_{tp} & \text{for true positives (TP)} \\
0 & \text{for true negatives (TN)}
\end{cases}
\tag{1}
$$

The values of $c_{fp}$ and $c_{fn}$ can be used to adjust the severity of false positives and false negatives to the given domain, while $c_{tp}$ defines how late predictions of positive cases are punished. For the eRisk 2017 task, $c_{fn}$ was set to 1 and $c_{fp}$ to $\frac{n_{tp}}{n_u}$, with $n_{tp}$ denoting the number of positive cases in the test data and $n_u$ the total number of test users. Finally, $c_{tp}$ was set to 1 in order to treat late predictions equally to no prediction at all [54, p. 5]. The function $lc_o(k)$ determines after how many messages $k$ the cost for true positives starts to grow and is defined as:

$$
lc_o(k) = 1 - \frac{1}{1 + e^{k - o}}
\tag{2}
$$

where the free parameter $o$ controls around which point this logistic sigmoid function is centered. Results of the eRisk 2017 task were evaluated based on $\text{ERDE}_5$, $\text{ERDE}_{50}$, and $F_1$ score.
6. https://github.com/praw-dev/praw/issues/243, accessed on 2018-01-24
7. According to the organizers, this will already be done for the eRisk 2018 task.
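The ERDE definition above can be sketched in a few lines of Python. This is only an illustrative sketch and not the organizers' evaluation script: the function names `latency_cost` and `erde` are assumptions introduced here, and the final score is assumed to be the mean cost over all users. The sigmoid is evaluated in a numerically stable form because $k - o$ can become large for users with many messages.

```python
import math


def latency_cost(k, o):
    """Sigmoid cost lc_o(k) = 1 - 1 / (1 + exp(k - o)), evaluated stably."""
    x = k - o
    if x >= 0:
        # Algebraically identical form that avoids overflow for large k.
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def erde(decisions, labels, delays, o, c_fp, c_fn=1.0, c_tp=1.0):
    """Mean ERDE_o cost over all users (averaging is an assumption here).

    decisions: predicted labels (True = depressed)
    labels:    gold labels (True = depressed)
    delays:    number of messages k read before each decision
    """
    costs = []
    for d, y, k in zip(decisions, labels, delays):
        if d and not y:
            costs.append(c_fp)                          # false positive
        elif y and not d:
            costs.append(c_fn)                          # false negative
        elif d and y:
            costs.append(latency_cost(k, o) * c_tp)     # true positive
        else:
            costs.append(0.0)                           # true negative
    return sum(costs) / len(costs)
```

For a true positive submitted after exactly $k = o$ messages, the cost is already half of $c_{tp}$, and it approaches $c_{tp}$ within a few additional messages.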
Since the results given for the baseline experiments in the original paper [53, p. 11] were obtained by using systems that could submit a prediction after reading each message per user separately, they cannot be compared to results of the actual eRisk task that required reading a whole chunk of between one and 200 messages per user. As Fig. 1 illustrates, this means that depressed users with about ten and more (for $\text{ERDE}_5$) or about 55 and more (for $\text{ERDE}_{50}$) messages per chunk basically cannot be predicted correctly because the cost would be very close to $c_{fn}$. Table 2 shows the $\text{ERDE}_o$ scores of perfect predictions ($F_1 = 1.0$) submitted after reading $n$ chunks with no predictions submitted in the chunks before this one. It also includes the corresponding scores obtained from $\text{ERDE}^{\%}_o$ described at the end of this section. The scores obtained for $n = 1$ are the best possible $\text{ERDE}_o$ scores for this task, while $n = 10$ gives the best possible scores for a system that has read all messages. Only predicting the 18 depressed users with fewer than ten messages per chunk as early as possible and predicting every other user as negative results in an $F_1$ score of 0.51 (1.0 precision and 0.35 recall) but still obtains an $\text{ERDE}_5$ score of 10.61% and $\text{ERDE}_{50}$ of 8.48%. The additional $F_1$ score is therefore especially important to evaluate systems in the general task of depression detection. To achieve better $\text{ERDE}_o$ scores, systems not only have to be optimized for this task but also need optimized prediction thresholds to make early predictions without too many false positives. This twofold optimization makes this task especially challenging.
All experiments described in this paper are based on the exact same training and test data as the eRisk 2017 task and also process it by evaluating the same chunks of test data in chronological order. This ensures that the results are directly comparable to those of the pilot task.
Nevertheless, as the detailed look at the $\text{ERDE}_o$ function shows, there is a need to modify this function, especially for future work with data in chunks. A first modification of ERDE has already been proposed by Loyola et al. [67] and, in addition to making the score usable for multi-class problems, it mainly consists of an altered cost function, now defined as:

$$
lc_o(k) = \frac{k}{o}
\tag{3}
$$

where $o$ is no longer used to parameterize the cost but equals the number of documents per user and $k$ is the number of documents already read for this user. This ensures that the cost is actually based on the proportion of read documents instead of a fixed number. Still, there is no way to parameterize this function and it immediately grows linearly, without any way to predict a subject correctly with a cost of zero. We therefore propose a modification of the original sigmoid cost function:

$$
lc^{\%}_o(k, n_d) = 1 - \frac{1}{1 + e^{\frac{100 \cdot k}{n_d} - o}}
\tag{4}
$$

where $n_d$ is the total number of documents per user and $k$ still equals the number of documents already read per user. The cost can still be parameterized by using $o \in [0, 100]$ to make the cost grow around the point where $o$ percent of the data has been read. This results in a more intuitive cost that grows equally for all users independent of their number of messages. The newly proposed error function based on $lc^{\%}_o$ is denoted as $\text{ERDE}^{\%}_o$ and is evaluated in addition to the original function in Section 6. Table 2 shows how this score compares to the original one for perfect predictions of the eRisk 2017 test data. Since at least 10% of the messages per user have to be read and $o = 18$ is the minimum natural value to achieve an error of 0.00%, results are shown for $\text{ERDE}^{\%}_{20}$ instead of $\text{ERDE}^{\%}_5$. Concurrently with this work, another alternative to ERDE has been proposed by a team that contributed to the eRisk 2017 task as well [68].
Their $F_{\text{latency}}$ score is based on multiplying the standard $F_1$ score by a factor that is based on the latency of a system, defined as the median number of posts the system has read before predicting the positive cases. In addition, they substitute the sigmoid cost function of $\text{ERDE}_o$ by a function that increases more slowly and yields a penalty of 0.5 for the median number of posts in the dataset. Because this score is also tied to the absolute number of read messages, the variance of the available messages per user in the eRisk data would lead to the same problems as described above for $\text{ERDE}_o$.
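The percentage-based cost proposed in this section can be illustrated with a short Python sketch. It assumes that the cost is the original sigmoid with the absolute count $k$ replaced by the percentage $100 \cdot k / n_d$ of a user's documents that have been read; the function name `latency_cost_percent` is hypothetical and introduced only for this illustration.

```python
import math


def latency_cost_percent(k, n_d, o):
    """Sigmoid cost centered at o percent of a user's n_d documents.

    Assumed form: 1 - 1 / (1 + exp(100 * k / n_d - o)), evaluated stably.
    """
    x = 100.0 * k / n_d - o
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# A user with 10 messages and a user with 1,000 messages are charged the
# same cost once the same share (here 10%) of their messages has been read.
short_user = latency_cost_percent(1, 10, o=20)
long_user = latency_cost_percent(100, 1000, o=20)
```

Under this assumption, reading the first 10% of any user's messages with $o = 20$ costs less than 0.01% of $c_{tp}$, whereas the original $lc_5(k)$ would already be close to 1 for a user whose first chunk contains ten or more messages.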

LINGUISTIC METADATA
Augmenting the classification of text sequences with user-level metadata was one of the main ideas in this team's previous work for the eRisk task at CLEF 2017 [63]. This section builds upon this previously described set of metadata features and aims to further describe and extend it. All text-based metadata features are extracted from a concatenation of the text and title field (see Section 3.1) of each message, apart from obvious exceptions like the average length of these two fields. All features were calculated separately for each document of a user and then either averaged or summed up as described below.

Word and Grammar Usage
Several features based on counts of specific words or parts of speech (POS) have already been used for this team's work at the eRisk 2017 task and have been examined in the corresponding paper [63]. As described in Section 2.1, effects of depression on word and grammar usage are well-known and can include, for example, an increased usage of pronouns (especially personal pronouns), the word "I" in particular, and verbs in past tense. Based on these previous findings, occurrences of the word "I" were counted separately in the text and title of each message in the dataset. In addition, past tense verbs, personal pronouns, and possessive pronouns were counted in the concatenation of text and title by utilizing the default POS tagger of the Python NLTK framework 8 . As an alternative to this approach, a total of 93 lexicon-based features can be obtained from the LIWC 2015 tool [20], [21]. Besides features referring to a specific POS as well, this also includes categories like emotions, informal language, or time orientations. LIWC also calculates four summary variables that represent the authenticity, emotional tone, confidence or leadership, and the amount of analytical thinking of a text. Section 4.4 describes which of these features have been utilized for the experiments of this work. For future work, this approach could likely be enhanced by utilizing more modern POS tagging approaches [69], [70] or a POS tagger that was trained specifically on a social media text domain [71].
The five final word usage features have also already been used for the participation in eRisk 2017 and aim to count very specific, hand-picked phrases that could be a strong indicator of positive cases. They count the occurrences of the exact phrases "my depression", "my anxiety", and "my therapist" as well as names of some common antidepressants 9 and variations of the phrase "I was diagnosed with depression" 10 . These very explicit features are aimed less at finding early indications of depression than at predicting the obvious cases correctly, which is important for the given task as well. In contrast to the other metadata features, this count is summed up over all documents of a user to make this a strong feature even if only present in few or a single document.
8. http://www.nltk.org/book/ch05.html, accessed on 2018-03-13
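The difference between averaged and summed features described above can be sketched as follows. This is a minimal illustration using only the Python standard library: the regular expressions, the function names, and the decision to count "I" inside contractions such as "I've" are assumptions made here, and the antidepressant names used in the original features are deliberately not reproduced.

```python
import re

# Exact phrases named in the text; medication names are omitted here.
EXPLICIT_PHRASES = ("my depression", "my anxiety", "my therapist")

# Hypothetical pattern for variations of "I was diagnosed with depression".
DIAGNOSIS = re.compile(
    r"\bI(?: was|'ve been| have been) diagnosed with [^.!?]*?depressi(?:on|ve)",
    re.IGNORECASE,
)

# Case-sensitive match of the standalone pronoun "I".
FIRST_PERSON = re.compile(r"\bI\b")


def mean_first_person_count(documents):
    """Average number of occurrences of "I" per document of one user."""
    counts = [len(FIRST_PERSON.findall(doc)) for doc in documents]
    return sum(counts) / len(counts)


def explicit_phrase_count(documents):
    """Sum of explicit phrase and diagnosis matches over all documents."""
    total = 0
    for doc in documents:
        lowered = doc.lower()
        total += sum(lowered.count(phrase) for phrase in EXPLICIT_PHRASES)
        total += len(DIAGNOSIS.findall(doc))
    return total
```

Summing instead of averaging keeps a single explicit statement visible even for users with hundreds of unrelated messages, which matches the motivation given for these features.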

Readability
Measuring the readability or complexity of written text is a well-established idea and various measures exist, most of which return a result that corresponds to the school years in the US needed to understand a text. The given dataset cannot really be used as an indicator whether depressed persons write more or less complex texts because of the general difference of text quality between the classes. Since the control subjects were chosen randomly, they often differ drastically from the depressed subjects, who might use reddit to discuss their problems or generally talk to other people. Messages of the control group often simply consist of news headlines, a single short sentence, or even a single word. Readability metrics can therefore help to distinguish messages containing discussions and explanations from such simple content that is unlikely to help with the identification of depression. Several standard measures for text readability have been calculated for the text content and the four of them with the highest correlation to the class label in the training data have been selected as metadata features, namely Gunning Fog Index (FOG) [72], Flesch Reading Ease (FRE) [73], Linsear Write Formula (LWF) 11 [74], and New Dale-Chall Readability (DCR) [75], [76].
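As an illustration of how such readability scores are derived, the following sketch computes two of the selected measures from simple text statistics, using the commonly published formulas for Flesch Reading Ease and the Gunning Fog Index. The tokenization and the vowel-group syllable heuristic in `text_stats` are rough assumptions made for this example and are not the implementation used for the experiments.

```python
import re


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def gunning_fog(n_words, n_sentences, n_complex_words):
    """FOG = 0.4 * ((words/sentences) + 100 * (complex words/words))."""
    return 0.4 * ((n_words / n_sentences) + 100.0 * (n_complex_words / n_words))


def count_syllables(word):
    """Naive heuristic: one syllable per group of consecutive vowels."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def text_stats(text):
    """Return (words, sentences, syllables, words with >= 3 syllables)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    n_complex = sum(1 for s in syllables if s >= 3)
    return len(words), max(len(sentences), 1), sum(syllables), n_complex
```

A message consisting of a single word or headline produces extreme values under both formulas, which is in line with the observation that these scores mainly separate discussions from very short content. Empty messages have to be filtered beforehand because both formulas divide by the number of words.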

Emotions and Sentiment
As sentiment analysis is focussed on extracting opinions, affects, and emotions from written texts [33], it seems natural that knowledge from this area can also be very useful for finding emotional statements in the field of mental health text classification. Especially the emotions authors express towards their personal situation could be an important indicator. While it would be possible to use the output of any state-of-the-art sentiment classification model [77] as an additional feature, this work has focussed on the use of lexicons to quickly analyze the general helpfulness of sentiment features in this dataset. First of all, the already described LIWC tool includes two features for positive and negative emotions and separate features that indicate anxiety, anger, or sadness. In addition to this, the NRC Emotion Lexicon [78] and two general sentiment lexicons, namely the Opinion Lexicon (footnote 12) and the VADER Sentiment Lexicon [79], have been used. There also exist several other lexicons that have not been evaluated, for example from the World Well-Being Project at University of Pennsylvania (footnote 13). The NRC Emotion Lexicon contains 14,182 words that can be flagged as positive or negative and as belonging to one or more of the emotions anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The VADER lexicon includes 7,517 terms (including emoticons) and their mean sentiment value based on the judgement of ten human annotators on a scale from -4 (extremely negative) to 4 (extremely positive). Finally, the Opinion Lexicon consists of two lists with 2,006 positive and 4,783 negative words.
The corresponding counts or scores obtained from these lexicons for the eRisk 2017 dataset were again averaged over all documents of a user. Unfortunately, for this specific dataset no relevant correlation between these features and the class label could be observed. Indeed, the positive (depressed) class contains slightly more emotions and sentiments of all kinds, which might again indicate the general difference of text quality and content between the depressed subjects and the control group. As the emotion and sentiment features were of no use in this specific case, they were not included in the final set of metadata features used in the experiments of this work. Nevertheless, it can be assumed that they would be more meaningful when used with a text corpus that generally included more sentiment and emotion in both classes by using a control group that more closely resembles the target group.
Table 3 summarizes the 17 metadata features that have been described above excluding the features obtained from LIWC. In addition to these, the ten LIWC features with the highest correlation to the class label in the training data have been selected. Because the LIWC lexicon includes several variations, misspellings, and abbreviations, it is accepted that some of these features already occur in the previously described feature set and therefore introduce a slight redundancy. The selected LIWC features are the number of function words, variations of the word "I" (e.g. including abbreviations containing "I" as well as "me" and "myself"), all pronouns, personal pronouns, verbs, words indicating a cognitive process, words with a focus on the present, the total number of lexicon words found, and the two calculated summary variables indicating analytical thinking and authenticity. To build the user-level metadata vector for the experiments described in Section 6, these features were again averaged over all documents of the same user.
9. https://www.webmd.com/depression/guide/depression-medications-antidepressants, accessed on 2018-03-14
10. Only including the word "I" and an explicit diagnosis as in "I've been diagnosed with anxiety and depression" or "I was diagnosed with major depressive disorder"
11. Originally developed by the U.S. Air Force without any publicly available references
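Selecting the features with the highest correlation to the class label can be sketched as follows. The text does not state which correlation coefficient was used, so the Pearson coefficient and the ranking by absolute value are assumptions, and the feature names and values are invented toy data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant feature carries no linear information
    return cov / math.sqrt(vx * vy)


def top_k_by_correlation(features: dict[str, list[float]],
                         labels: list[int], k: int) -> list[str]:
    """Return the k feature names with the largest absolute correlation."""
    scored = {name: abs(pearson(vals, labels)) for name, vals in features.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Toy user-level features (hypothetical values) for four users.
toy_features = {
    "i_words": [0.1, 0.2, 0.8, 0.9],
    "verbs": [0.5, 0.4, 0.5, 0.4],
    "analytic": [0.9, 0.8, 0.2, 0.3],
}
toy_labels = [0, 0, 1, 1]  # 1 = depressed, 0 = control
print(top_k_by_correlation(toy_features, toy_labels, k=2))
```

Ranking by absolute value keeps features that are negatively related to the label, such as the toy "analytic" column, which would otherwise be discarded.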
The described metadata features result in a 27-dimensional vector per user. The concatenated feature vectors of all users are standardized as described in Section 6.1 before being used as input to a classifier. The five counts of specific terms are transformed into boolean flags by representing a value above 0 as 1 and a value of 0 as -1, similar to the previous work using LSTM networks [63]. As the experiments will show, this set of metadata features alone can lead to very good results on the eRisk 2017 dataset.
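The two transformations described here, the boolean flags and the standardization, can be sketched in a few lines. The use of the population standard deviation and the guard for constant columns are assumptions, and the numbers are toy values rather than real features.

```python
import statistics


def to_flag(count: float) -> float:
    """Map a summed term count to a flag: above 0 becomes 1, 0 becomes -1."""
    return 1.0 if count > 0 else -1.0


def standardize_columns(rows: list[list[float]]) -> list[list[float]]:
    """Scale every column to a mean of 0 and unit variance.

    Assumption: the population standard deviation is used, and constant
    columns are divided by 1 to avoid a division by zero.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy example: two averaged features and one summed term count per user.
averaged = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
term_counts = [0, 3, 1]

scaled = standardize_columns(averaged)
user_vectors = [row + [to_flag(c)] for row, c in zip(scaled, term_counts)]
for vector in user_vectors:
    print(vector)
```

The flags are appended after scaling because they already lie in the range of the standardized features.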

NEURAL NETWORK MODELS
The following sections describe the neural network models that were used in the experiments of this work. All of these models are based on a document vectorization using neural word embeddings. The general concept of word embeddings and the specific models utilized in this case are described in Section 5.1. The subsequent section then explains the type of network and the model architecture that was implemented for the experiments.

Word Embeddings
Neural word embeddings have become a popular and efficient way to model words and the interactions between them for purposes like text classification. They date back to the concept of distributed word representations [80] that, in contrast to local representations, do not handle each word separately with a single neuron, but use several neurons to represent a word and let each neuron be part of the description of several words. This enables distributed representations to learn general concepts of language instead of just independent word representations. One of the most important and still popular methods to train word embeddings was published by Google as word2vec [81], [82], which consists of two neural network architectures, namely the Continuous Bag of Words (CBoW) and the (Continuous) Skip-gram (SG) architecture. The concept of word2vec was developed further by Facebook and published as fastText in 2017 [83], [84], [85], which also directly offers text classification. While being based on the same two model architectures as word2vec, fastText represents words as bags of character n-grams and thus makes it possible to obtain vectorizations even for unknown words. A different approach to learn word embeddings, GloVe, has been published by researchers of the Stanford NLP group [86]. GloVe aims to combine the advantages of local context window approaches like Skip-gram with those of global matrix factorization models like Latent Semantic Analysis (LSA). Pre-trained word vectors obtained from large corpora are available for both fastText (footnote 14) and GloVe (footnote 15). For this work, a 50-dimensional GloVe model trained on Wikipedia and the Gigaword 5 news corpus as well as a 300-dimensional GloVe model trained on Common Crawl were chosen. In addition, three 300-dimensional pre-trained fastText models based on similar corpora were used.
Finally, to also examine word embeddings that better fit the domain of reddit messages (or social media platforms in general), a custom fastText model was trained on a dataset containing all reddit comments between October 2007 and May 2015 (footnote 16). The total dataset consists of about 1.7 billion messages that were preprocessed and tokenized in a way that preserves emoticons, punctuation, and words that include special characters (e.g. censored words). The preprocessing step also replaced any references to reddit users (in the form of /u/<username>) by a generic phrase "ref user" to prevent any connections to actual users in the resulting word embeddings. Similarly, any reference to a subreddit (in the form of /r/<subreddit>) was replaced by the phrase "ref subreddit <subreddit>" so that a vector representation of each subreddit, which can be regarded as its topic, could be learned as well. No stemming or stopword removal of any kind was done. The resulting tokens of each message were lowercased (with the exception of emoticons) and separated with a space to enable fastText to properly treat punctuation (footnote 17). Since the dataset also contains messages written in languages other than English and a sophisticated language detection classifier would have required too much time for so many documents, a simple language detection based on stopword counts was utilized: Only messages with more English stopwords than ones from other languages (footnote 18) were retained (thus also discarding messages without any stopwords). This resulted in a final training corpus of 1.37 billion messages and a total of 49.9 billion tokens. The C++ implementation of fastText was used to train a 300-dimensional CBoW model without subword information and with default values for all other parameters, which contains the 6 million unique tokens that occur at least five times in this corpus. Training took about 26 hours using 12 threads on an Intel Xeon E5-2687W 3.10GHz and needed about 17GB of RAM.
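The reference replacement and the stopword-based language filter can be illustrated with a short sketch. The regular expressions, the tiny stopword lists, and the decision to compare the English count against the best single other language are all assumptions made for this example; tokenization, emoticon handling, and lowercasing are not reproduced.

```python
import re

# Hypothetical patterns for user and subreddit references.
USER_REF = re.compile(r"/u/[A-Za-z0-9_-]+")
SUBREDDIT_REF = re.compile(r"/r/([A-Za-z0-9_]+)")

# Tiny illustrative stopword lists; real lists are much longer.
STOPWORDS = {
    "english": {"the", "and", "is", "to", "of", "a", "in", "that", "it"},
    "german": {"der", "die", "und", "ist", "das", "nicht", "zu", "ein", "in"},
    "spanish": {"el", "la", "y", "es", "de", "que", "en", "un", "no"},
}


def replace_references(text: str) -> str:
    """Replace user references generically and keep the subreddit name."""
    text = USER_REF.sub("ref user", text)
    return SUBREDDIT_REF.sub(r"ref subreddit \1", text)


def is_probably_english(text: str) -> bool:
    """Keep a message only if English stopwords strictly outnumber the others.

    A message without any stopwords yields 0 > 0 and is therefore discarded.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = {lang: sum(t in words for t in tokens)
              for lang, words in STOPWORDS.items()}
    best_other = max(c for lang, c in counts.items() if lang != "english")
    return counts["english"] > best_other


print(replace_references("thanks /u/some_user, see /r/depression"))
print(is_probably_english("The cat is in the house"))
print(is_probably_english("Der Hund ist nicht in der Stadt"))
print(is_probably_english("lol"))
```

The strict comparison is what removes very short messages without stopwords, which is the side effect mentioned in the text.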
The characteristics of all utilized word embeddings are shown in Table 4. It also lists how many of the 85,558 words that occur in writings of at least two users of the eRisk 2017 dataset, after the same preprocessing and tokenization steps as described above, are contained in each model. This set of words is used in the experiments described in Section 6.
As a qualitative analysis of the self-trained fastText model, it is possible to examine the nearest neighbors of some hand-picked exemplary tokens according to cosine similarity. Fig. 2 displays six word clouds with the corresponding example token in the middle and its ten (20 in the case of emoticons) nearest neighbors around it. These examples especially illustrate that the model trained on reddit messages is indeed able to identify similar emoticons and subreddits, neither of which is possible using the pre-trained fastText or GloVe models. The closest neighbors of the depression subreddit also include terms like "suicidewatch" and "mmfb" (make me feel better), illustrating relations that were learnt to terms besides other subreddits. It has also learnt similar embeddings for antidepressants like Zoloft. While this as well as similar words to the terms "depression" and "suicidal" can also be observed using the pre-trained models, their nearest neighbors seem slightly more medical and from a more neutral perspective (e.g. "ssri" or "sertraline" close to "zoloft", "melancholia" or "insomnia" for "depression", and "deranged" or "delusional" for "suicidal") than those of the model trained on reddit. Also, especially the fastText Crawl model returns neighboring terms like "depression.This" or "depression.What", which might indicate a preprocessing problem concerning punctuation.
To further investigate the extent to which the self-trained fastText reddit model has learnt a general model of the English language, the standard word analogy dataset [81] can be used as one indicator. Table 5 compares results on the word analogy dataset published for state-of-the-art models to the results of the new model trained on reddit messages. The dataset was originally created to evaluate the word embeddings of word2vec and consists of 8,869 semantic as well as 10,675 syntactic analogy questions: Given an analogy (e.g. Athens and Greece) and a third word (e.g. Oslo), word embeddings have to return the fourth word (Norway in this case) as closest vector to the result of vec("Greece") − vec("Athens") + vec("Oslo") according to cosine similarity. While the results are far from the ones obtained by the state-of-the-art fastText models trained on Wikipedia and news articles, especially the result in the syntactic category illustrates that even the training on these much less formal documents has led to a decent model of the English language.
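The analogy evaluation can be reproduced in miniature with hand-made vectors. The three-dimensional toy embedding below is purely illustrative, and excluding the three query words from the candidates follows the common convention for this dataset, which is stated here as an assumption about the evaluation protocol.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(emb, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c).

    The three query words themselves are excluded from the candidates.
    """
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, emb[w]))


# Toy vectors: (is a country, relates to Greece, relates to Norway).
toy_embedding = {
    "athens": [0.0, 1.0, 0.0],
    "greece": [1.0, 1.0, 0.0],
    "oslo": [0.0, 0.0, 1.0],
    "norway": [1.0, 0.0, 1.0],
    "berlin": [0.0, 0.5, 0.5],
    "germany": [1.0, 0.5, 0.5],
}
print(solve_analogy(toy_embedding, "athens", "greece", "oslo"))
```

An analogy question counts as answered correctly only if the top-ranked candidate equals the expected fourth word.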

Neural Network Architecture
Convolutional Neural Networks (CNN) [87] have been utilized to achieve outstanding results especially in the area of image classification and are generally viable for data with a grid-like structure [88]. Recently, studies have shown that they can also be used effectively for text classification tasks [89]. Fig. 3 displays the architecture of the simple CNN used for the experiments described in this paper, which is based on the one-layer CNN for sentence classification described by Zhang and Wallace [89]. Similar to this sentence classification network, it consists of only a single CNN layer but uses a total of 100 filters with a height of 2 and a width corresponding to the word vector dimensions. Concatenated Rectified Linear Units (CReLU) [90] are used as activation function for the convolutional layer as well as for the fully-connected layers, resulting in twice as many outputs due to the concatenation with the negated activation. 1-max pooling is used to obtain a single scalar from each filter, which results in a 200-dimensional vector due to the CReLU activation. This output is then propagated through three fully-connected layers with, again, CReLU activation, of which the first one applies dropout to its output. The fourth and final layer applies softmax to the output. As input, the network receives each document of a user individually in the form of the first 100 word vectors per document, while using zero-padding for documents with fewer than 100 words. This results in a 100 × 50 or 100 × 300 dimensional input matrix depending on the used word embedding as described in the previous section. The limitation to 100 words (or even fewer) is possible as the number of words per document ranges between 1 (when ignoring the empty documents) and 6,487 but has a mean of 34.58 according to the tokenization done for this work. As this results in a separate output for each document per user, the 98th percentile of these outputs is used as final prediction for the user.
This value is chosen instead of the mean to give more weight to documents with a higher probability. In addition to the described model architecture, attempts were made to use the metadata features directly as a second input to the network. An approach similar to the previous one at eRisk 2017 [63], where the metadata features were fed through a fully-connected layer and then concatenated with the output of an LSTM layer, did not lead to better results than just using the text input. The same applies to the idea of using the final 1 × 50 dimensional vector before the softmax layer as additional input for a metadata classifier like a logistic regression. Since the results of networks using the metadata differed only marginally from the text-only network described above, only results of the latter will be reported in the following to reduce the complexity. Future work will be needed to explore better ways to directly merge the CNN output with the metadata. This could, for example, be done by implementing a dedicated fusion component into the neural network, similar to work done for gender identification based on texts and images at the CLEF 2018 PAN workshop [91]. In this work, a simple late fusion ensemble will show that the best results so far can indeed be achieved by combining these features.
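The convolution with CReLU activation, the 1-max pooling, and the percentile-based aggregation per user can be sketched without any deep learning framework. This is only meant to show why each filter yields two pooled values and how the user-level score is formed; the order of the concatenated outputs, the linear interpolation used for the percentile, and all weights are assumptions.

```python
def crelu(x: float) -> tuple[float, float]:
    """Concatenated ReLU of a scalar: the ReLU of x and the ReLU of -x."""
    return max(x, 0.0), max(-x, 0.0)


def conv_crelu_max(doc, filters, biases):
    """Apply height-2 filters, CReLU, and 1-max pooling to one document.

    doc: list of word vectors; filters: list of 2 x d weight matrices.
    Returns 2 * len(filters) values (positive parts, then negated parts).
    """
    pooled_pos, pooled_neg = [], []
    for weights, bias in zip(filters, biases):
        best_pos = best_neg = 0.0
        for i in range(len(doc) - 1):  # slide over pairs of adjacent words
            z = bias + sum(w * x
                           for row, vec in zip(weights, doc[i:i + 2])
                           for w, x in zip(row, vec))
            pos, neg = crelu(z)
            best_pos, best_neg = max(best_pos, pos), max(best_neg, neg)
        pooled_pos.append(best_pos)
        pooled_neg.append(best_neg)
    return pooled_pos + pooled_neg


def percentile(values, q: float) -> float:
    """q-th percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Toy document with three 2-dimensional word vectors and two filters.
toy_doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_filters = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]]
print(conv_crelu_max(toy_doc, toy_filters, [0.0, 0.0]))

# Per-document probabilities of one user, aggregated to a single score.
doc_probs = [0.1, 0.2, 0.9]
print(percentile(doc_probs, 98), sum(doc_probs) / len(doc_probs))
```

With 100 filters the same function would return the 200-dimensional vector mentioned above, and the toy aggregation shows how a single high-probability document dominates the 98th percentile while it is diluted in the mean.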

EXPERIMENTS
This section describes the experiments done based on the convolutional neural network and the metadata features as well as their results. Results are compared to the best published results during the eRisk 2017 task as well as other results obtained after the ground truth was released. The scores of each model are reported according to ERDE_5, ERDE_50, and F_1, which are the official scores of this task, and also based on the newly proposed ERDE^%_20, ERDE^%_50, and F_latency.

Experiment Setup
For the experiments conducted during this work, the process used during the eRisk 2017 task was reproduced: The available test documents were processed in the same ten chunks, each containing 10% of the writings obtained from each user. Training is done once on the full training dataset. Afterwards, test chunks are processed in sequential order, while the documents of the previous chunks are always used again. The only exception to this process is the model called "Meta LR Wait" in the following evaluation section, which is a logistic regression based on metadata features that was configured to only submit a prediction after the final chunk. Similar "waiting" models were also utilized by some teams during eRisk 2017 and can be interesting to evaluate the possible F_1 score, while neglecting the early detection aspect and therefore the ERDE_o scores. Since the models based on metadata use features averaged over all documents of the same user, these features were recalculated for each chunk, again using documents from earlier chunks as well. An additional parameter for the early detection models is the prediction threshold that determines whether a model is confident enough to predict a subject as positive (depressed) or whether it waits for more data. While these thresholds were based on cross-validation using the training data for this team's participation in the eRisk 2017 task and included the number of documents already read in multiple threshold levels [63], the experiments in this work are based on a single threshold value that achieved the best test result for the specific model. This is likely to lead to an overfitting on the specific test data but also allows a comparison of the best possible results of the utilized models. Generally, prediction thresholds between 0.5 and 0.7 lead to a balanced result in all scores, while higher thresholds can often maximize ERDE_5 but severely decrease F_1.
This fits the observations described in Section 3.2: Since the correct prediction of so few depressed test subjects actually has an effect on ERDE_5, it is often best to submit fewer predictions overall and therefore simply minimize false positives. Negative (non-depressed) predictions were only submitted after seeing the final chunk. The models based on user metadata features all utilized the same logistic regression classifier. The 27 features described in Section 4.4 were first standardized to have a mean of 0 and unit variance, with the exception of the boolean flag features that already have a value of either -1 or 1. The resulting scaled feature vector was then used to train a logistic regression classifier and later predict probabilities for the test subjects of each chunk.
Table 6 displays the results achieved in this work in comparison to previously published results for the same dataset and task. The first three rows in this table represent the best results during the eRisk 2017 task and are therefore solely optimized based on cross-validation over the training data, while the next two results have been achieved after the ground truth was published. All results after these have been achieved as part of this work. The models corresponding to the name of a word embedding refer to a CNN using this embedding as input vectorization, the models named "Meta LR" refer to the logistic regression based on metadata, and the final four results were obtained by calculating the mean of the metadata probabilities and the neural network output. Although these outputs have not been calibrated (e.g. by using Platt scaling [92]), this simple late fusion ensemble led to the best achieved ERDE_o scores and recall. As expected, the best overall F_1 score could be obtained by waiting for the last chunk and only then submitting predictions based on the metadata LR.
Interestingly, this model would still have achieved the seventh best ERDE_50 score in the eRisk 2017 task out of 30 submissions, which again illustrates how difficult ERDE_o is to interpret because it is based on the absolute number of documents. The prediction thresholds have been chosen to represent the best possible ERDE_o scores that still include a viable F_1 score. A second threshold has been reported for the self-trained fastText reddit model and the metadata LR to illustrate to which extent slightly different thresholds can have an effect on ERDE_o scores. As already described, especially optimizing ERDE_5 often includes impairing the F_1 score. The reported results contain the best scores published for this task so far.
In addition to comparing the achieved results to previously published results for this task, Table 7 shows how the same models with the same prediction thresholds would have scored according to the newly proposed ERDE^%_o score as well as the F_latency score from [68]. While the "Meta LR Wait" model is now scored equally badly in both criteria because it had to read all documents, the CNN scores now tend to be better than the ones obtained for the metadata models alone. Still, the best overall ERDE^%_o scores could be achieved by the same ensemble. The additional models with higher thresholds that were previously included to obtain a better ERDE_5 score (namely fastText reddit with p > 0.8 and Meta LR with p > 0.55) now result in the worst overall ERDE^%_o scores next to the waiting model. This again indicates that especially optimizations of ERDE_5 do not necessarily mean a better classification result. The 50-dimensional GloVe model achieves the best F_latency score, which is also better than the best score reported in the original paper (0.389) for the same dataset [68].
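The interplay of chunk-wise processing, prediction threshold, and late fusion can be summarized in a small sketch. It assumes that one probability per chunk is already available for each model, computed from all documents read up to that chunk; the function names and the example values are hypothetical.

```python
def late_fusion(p_cnn: float, p_meta: float) -> float:
    """Unweighted mean of the two uncalibrated model probabilities."""
    return (p_cnn + p_meta) / 2.0


def first_positive_chunk(chunk_probs, threshold: float):
    """Return the 1-based chunk in which a positive prediction is submitted.

    Returns None if the threshold is never exceeded, in which case the user
    receives a negative prediction after the final chunk.
    """
    for index, prob in enumerate(chunk_probs, start=1):
        if prob > threshold:
            return index
    return None


# Hypothetical cumulative probabilities of one user after each of four chunks.
cnn_probs = [0.2, 0.4, 0.7, 0.9]
meta_probs = [0.3, 0.5, 0.6, 0.8]
fused = [late_fusion(c, m) for c, m in zip(cnn_probs, meta_probs)]

print(first_positive_chunk(fused, threshold=0.6))
print(first_positive_chunk(fused, threshold=0.9))
```

Raising the threshold delays or suppresses positive predictions, which is the mechanism behind the trade-off between fewer false positives and a lower recall discussed above.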

CONCLUSION
This work has examined the currently popular ERDE_o metric for early detection tasks in detail and has shown that especially ERDE_5 is not a meaningful metric for the described shared task. Only the correct prediction of few positive samples has an effect on this score and the best results can therefore often be obtained by only minimizing false positives. A modification of this metric, namely ERDE^%_o, has been proposed that is easier to interpret in the case of shared tasks that require information to be read in chunks. Exemplary scores using this metric have been shown in comparison to ERDE_o scores for the experiments in this work.
Previous experiments for the eRisk 2017 task for early detection of depression have been continued by examining additional user-level metadata features and evaluating a convolutional neural network as text-based depression classifier. State-of-the-art results have been reported for the eRisk 2017 dataset using these two approaches. A new fastText word embedding has been trained on a large corpus of reddit comments. The analysis of the resulting word vectors has shown that the model has learnt some features specific to this domain and is viable for general syntactic questions in the English language as shown based on the standard word analogy task.
As the results presented in this paper are optimized to obtain the best performance on the eRisk 2017 task for comparison to previously published results and among these models, future work will have to show how these models perform on yet unseen data. This has first been done during the eRisk 2018 task [93], which used the old dataset as training data and contained 820 new test subjects. In addition, eRisk 2018 contained an additional task aimed at the early detection of anorexia that this team has also participated in. The five submitted predictions achieved the best F_1 and ERDE_50 scores in both tasks and the CNN without metadata in particular achieved the best results in the new anorexia task [94]. The same working notes paper for this second participation has also been used to evaluate the modified ERDE^%_o metric for all participants and again shows how especially the original ERDE_5 metric favors systems that correctly predict test users with only few documents in total regardless of their overall performance.
As the detailed look at the current ERDE_o metric has shown, one priority of future work in this area should be to agree on a new metric for early detection tasks like eRisk. Ethical issues in this area of research have been reviewed and should receive more attention as well. Possibilities to publish the fastText model trained on reddit comments still have to be examined. Concerning the models presented in this work, additional experiments will be necessary to find better ways to integrate the metadata features directly into the neural network. On the other hand, utilizing ensembles of more than just two models and calibrating the resulting probabilities seems promising. Combining word embeddings of two models in a single neural network has also not been evaluated yet. Another possible improvement would be to use recently published language modeling methods like BERT as input for the network and to compare a self-trained model using this approach to the fastText word embeddings of this work.

ACKNOWLEDGMENTS
The work of Sven Koitka was partially funded by a PhD grant from University of Applied Sciences and Arts Dortmund, Germany.