Suicidal Ideation Cause Extraction From Social Texts

Suicide has become a major public health and social concern worldwide. Suicidal ideation cause extraction (SICE) in social texts can provide support for suicide prevention. This article summarizes the research on suicidal ideation causes (SICs) through the use of psychological and sociological analysis. Then, a social text-based SIC dataset is constructed and analyzed statistically for various features. A CRF model is provided along with Char-BiLSTM-CRF, which uses the concatenation of word embeddings and character embeddings as word representation inputs. Then, the effects of the word (W), part of speech (POS), dependence relationship (DP), suicidal psychology (PCS), emotion (ET), and language (LG) features on the task are explored in the CRF model. The experiment shows that the word features work best, that POS and DP are largely covered by word features, and that PCS, ET and LG features can improve SICE. It also shows that Char-BiLSTM-CRF is better than CRF in general, but CRF still has advantages in terms of precision. Adding character embeddings and CRF layers can significantly improve the extraction using Char-BiLSTM-CRF. The experiment also compares three word embeddings: Word2vec, ELMo and BERT. Compared with Word2vec, the F-value of ELMo is increased by approximately 5% on average, and compared with ELMo, the C_F and E_F of BERT are increased by 3.5% and 2.3%, respectively. Finally, the challenge of SICE is discussed based on the experimental results of Char-BiLSTM-CRF.

I. INTRODUCTION
In detecting suicidal ideation, traditional methods mainly rely on the professional knowledge of psychologists and on self-reported questionnaire surveys. A large number of suicidal ideation measures, evaluation tools and scales have been developed. These methods are effective and easy to operate, but they are prone to false negatives caused by participants' deliberate concealment, and they suffer from intrusiveness, high cost, poor timeliness, and difficulties with continuous tracking. They are difficult to carry out on a large scale and over a long time. With regard to the analysis of the SIC and the suicide cause, existing work has mainly focused on theoretical and empirical research. However, problems such as geographical bias, limitations of representation, short time span, deliberate concealment of subjects, and the reliability and validity of the scale limit empirical research.
The rise of social media has provided new opportunities for psychological research and new channels for monitoring suicidal ideation and cause analysis. Individuals in psychological crisis often show distrust and resistance to traditional mental health services and institutions. Research shows that individuals in psychological crisis tend to seek help from informal resources, such as social media, rather than from psychological experts [4], [5]. A large amount of research has confirmed that it is feasible and effective to use social media [6], [7], such as Twitter, Facebook, and Weibo, to predict suicidal ideation, and a small number of researchers have also started to use social media data to analyze the causes of mental health problems [8] and have even carried out suicide interventions using social media platforms [9].
Using the behavior and content posted by users on social media platforms to detect users' suicidal ideation helps make timely interventions and reduce the risk of suicide. The intervention measures must be user-centered, which requires psychological experts to read the posts on social media (referred to as ''social text'' in this paper), and obtain the SIC, as shown in Fig. 1.
However, due to the massive amount of social media content, fast updates, and the large number of users, it is difficult for psychologists to obtain the SIC through manual reading.
Suicidal ideation cause extraction (SICE) from social text can help psychological experts understand the intervention objects better and faster and reduce their workload. At the same time, it also provides support for statistical analysis of SIC using large-scale data and allows comparing SIC in different countries, regions, populations and time periods. However, according to our systematic review process, there is no relevant research on this issue in either psychology or computer science, which encourages in-depth research on it.
Based on Weibo data, this paper constructs a manually annotated dataset of SIC and discusses the methods of SICE from social text. Then, the feasibility and effectiveness of SICE models are demonstrated and tested. The main contributions of this paper are summarized as follows: 1. defining the problem of SICE by summarizing the present research results of psychological and sociological analysis of SIC, 2. constructing a social text-based SIC dataset by manual annotation based on a suicidal ideation recognition dataset constructed by the Institute of Psychology of the Chinese Academy of Sciences, 3. exploring the effects of word, part of speech, word dependence, suicidal psychology, emotion, language and other features in a CRF-based SICE method, 4. constructing a neural network-based model, Char-BiLSTM-CRF, for SICE, which integrates BiLSTM (bidirectional long short-term memory network) and CRF and concatenates word-level and character-level embeddings as the word representation input, and 5. exploring the performance of the CRF model and the Char-BiLSTM-CRF model and comparing three word embeddings (Word2vec, ELMo and BERT) in the Char-BiLSTM-CRF model.
The paper is organized as follows: Section I introduces the research background of SICE based on social media, Section II summarizes the related works, Section III introduces the strategy for constructing the suicidal ideation annotation dataset and the statistical analysis, Section IV and Section V describe the models for SICE from social text, including a traditional sequence labeling model called CRF and a neural network-based model called Char-BiLSTM-CRF, and Section VI presents the experimental analysis. Conclusions and future work are described in the last section.

II. RELATED WORK
At present, the analysis of SIC mainly comes from the field of psychology or sociology, and the methods are mostly case analysis or empirical analysis. There is a correlation between emotion and suicidal ideation, and the extraction of emotional cause is a concern among researchers.

A. THE CAUSE OF SUICIDAL IDEATION
Suicidal ideation is associated with many factors, such as population economics and psychosocial behavior. It involves the individual's internal and external environment, including the direct or indirect effects of biological, genetic, psychological, social, environmental, cultural, external pressure, and others. The specific factors include academic studies, interpersonal relationships, environment, emergencies, race and religion, mental state, marital status, financial status, health issues and others. The selection principle varies according to the researcher's perspective (biological, social, psychological, environmental) and the characteristics of the individuals being studied.

1) AGE
Kok et al. [10] explored the factors of suicide from the perspective of adolescents (ages 15-24) and summarized them as heterosexual relations, race and religion, academic pressure, competitive pressure, global economy, family education, original family (single parent family), and others. Chatterjee and Basu [11] explored the internal and external factors of suicidal ideation among female Indian college students and summarized four major factors: academic factors, interpersonal factors, environmental factors and emergency factors. Luo et al. [12] analyzed suicidal ideation among China's urban and rural elderly population based on data from the ''Tracking Survey on the Status of China's Urban and Rural Elderly Population'' from the 2010 China Aging Science Research Center (CASRC) and concluded that sociodemographic variables, physical health, mental health, and family were important factors leading to suicidal ideation in the elderly.

2) REGION
Maskill and Hodges [13] summarized the causes of suicide in New Zealand as marital status, family composition, education level, religion, socioeconomic status, occupational status, mental disorders, substance use disorders, physical diseases, and others. In the analysis of suicide in Australia, Hassan [14] explained suicide as an act involving a complex relationship between psychiatric and sociological factors and summarized eight categories of SIC, including psychosocial and social environmental factors, marital status (single, divorced), economic status, occupation (low income, low autonomy, low training opportunities), immigration and race, relationship issues (unhappy love, family and marriage issues, guilt), self-worth issues (economic and unemployment issues, feeling that life is a failure) and health problems.

3) COMPREHENSIVE PERSPECTIVE
Chesney et al. [15] systematically reviewed 407 suicide articles at a macro level and attributed suicide cause to depression, mental health causes such as schizophrenia, eating disorders, learning disabilities, loneliness, and childhood behavioral disorders (behavioral disorders and oppositional defiant disorder, attention deficit, hyperactivity), dementia, personality disorders, substance use disorders, and other factors. In addition, some causes of suicide are summarized from the conclusions of various media, psychological counseling and treatment platforms or institutions. Mental Health Daily [16] summarizes 15 causes of suicide: mental illness, painful experience, bullying, personality disorders, substance abuse, eating disorders, unemployment, social isolation/loneliness, interpersonal problems, genetics/family history or family with a history of suicide, existence of crisis/value deficiency, terminal illness, chronic pain, financial problems, and side effects of drugs including various prescription drugs such as antidepressants. Suicide.org (Suicide Survivors Forum) [17] lists 29 causes of suicide in detail, including: the death of a loved one, divorce, separation or break up of a relationship, a serious loss, such as loss of a job, house or money, disease or serious illness, a terminal illness, intense emotional pain, and others.
In short, as an extreme and complex behavior, suicide is caused by many factors [18], [19]. The traditional research on SIC can be summarized into two areas: theoretical research and empirical research.
Theoretical research mainly explores the intricate relationships among various biological, psychological, and sociological factors, and the main models used include the Diathesis-stress Model [20], the Aggregation Optimization Model [21], the Stress-Vulnerability Model of Suicidal Ideation and Behavior [22], and the Biopsychosocial Model of Suicidal Ideation [23]. Many theories have been used to study SIC, such as psychoanalytic theory (Sigmund Freud, 1895), biological theory [24], sociological theory [25], escape theory [26], evolutionary theory [27], attachment theory [28], theory of personality development stages [29], interpersonal theory of suicide [30], strain theory of suicide [31], and others. Most of the theoretical studies are done from the perspective of their respective disciplines, with narrow focus and lack of empirical analysis.
Based on suicidal ideation scales, empirical research is carried out mainly through questionnaires and statistical methods such as logistic regression to analyze the relations among single and multiple factors, and the relationship between suicidal attitude and the factors that influence it is discussed from the sample data [32], [33]. In recent years, some scholars have also used structural equation modeling to handle the complex interactions between latent variables and multiple variables that cannot be analyzed by other statistical methods [34], [35]. In the empirical study, there are still some problems, such as regional bias in the sample, limited representativeness, short time span, deliberate concealment by subjects, and the reliability and validity of the scale.
Traditional studies on SIC, both in theoretical and empirical research, obtain the cause from questionnaires, interviews, or researchers' analysis of patients. Although some researchers also used social media data to study SIC [8], [36], the extraction of SIC is still based on manual work. At present, according to our systematic review process, there is no research that uses computer technology to extract SIC automatically from notes (diaries, testaments, microblog posts, etc.) posted by patients, and it is difficult to expand the selection of patient samples and maintain continuous tracking.

B. DETECTION OF SUICIDAL IDEATION BASED ON SOCIAL MEDIA
In recent years, considerable progress has been made in the detection of suicidal ideation from social media data by utilizing computer technology. These methods are mainly based on related knowledge bases (Emotional Dictionary, Linguistic Inquiry and Word Count (LIWC), WordNet, HowNet, etc.) and various features (user attributes, user behavior, published content, user emotion or psychology, etc.), and they train a machine learning model with labeled samples.
The features used in related works can be classified into four categories.

1) ATTRIBUTE FEATURES
The attribute features, which are usually filled in when registering for social media, describe the basic personal information about social media users, such as gender, age, region, nickname, occupation, picture, etc. [37]-[39]. They also include some basic user settings, such as whether users enable private messages, whether users allow others to leave messages, whether users enable the geographical indicator on their account, whether users include ''me/myself'' in their self-description, etc. Guan et al. [40] used attribute features, such as the user's gender, username length, number of favorites, number of fans, and self-description text length, to detect the user's suicidal ideation, and found that there was a correlation between these attribute features and suicide risk.

2) BEHAVIORAL FEATURES
Behavioral features describe the behavior of users, including posting, forwarding, commenting, liking, following and other behaviors. Previous studies have shown that users with different degrees of suicidal tendencies, users who have committed suicide, and those without suicidal ideation behave differently when using Weibo [41], [42]. The common behavioral features used for suicidal ideation detection include the average number of days users stay active, the average number of comments on a single post, the average number of times a user's single post is forwarded, the average number of likes on a single post, the total number of interaction posts, the frequency of activities at night, the social activity (the number of friends/fans), and other factors [40]. For example, Huang et al. [6] conducted statistical analysis on the behavioral features of users with and without suicidal ideation and found that the post type (original, forwarding), post time and social relationships mentioned (@ others) are effective features for suicide risk detection.

3) CONTENT FEATURES
Content features mainly refer to user-generated content. A large number of studies have found that users with different mental health and suicide behaviors have different language expressions and diction styles. For example, Sueki [43] studied suicide-related posts and suicide behaviors of Twitter users and concluded that some specific phrases were closely related to users' suicidal ideation, such as ''want to commit suicide'' and others.

4) EMOTIONAL AND PSYCHOLOGICAL FEATURES
Emotional and psychological features describe the emotions, feelings or psychology in the posts, not only involving positive and negative emotions, such as joy, anger, sadness and happiness but also including friendliness, aggression, guilt, compunction, insecurity, fear, self-pity, self-mockery and other fine-grained emotions and personality traits. Studies have found that emotional and psychological activities and personality traits have a strong correlation with suicidal ideation or suicide behavior. For example, a mild manic state is related to multiple suicide attempts [51], and people with high suicide risk have a higher sense of guilt or shame [52], [53] and have a higher proportion of posts with anger, sadness, fear, disgust and loneliness [7].
Some researchers also use emotional features and other psychological features with personality features in the detection of suicidal ideation. Ren and Kang [54] used blog data to analyze the relationship between emotional features (such as anxiety and sadness) and suicide crises and used an emotional topic model to detect potential emotions in blogs to improve the detection of suicide risk. Huang et al. [6] and Lv et al. [46] used and expanded an emotional dictionary to detect suicide risk among Weibo users. Several similar studies detect suicidal ideation or suicide risk in text using emotional dictionaries or by analyzing text emotion. Burnap et al. [44] used SentiWordNet to study Twitter posts, and Aladağ et al. [48] established an emotional matrix to study Reddit posts.

5) SUICIDAL IDEATION DATASET
The generalized suicidal ideation dataset includes questionnaires, suicide notes, suicide blogs, electronic health records, online social texts [55], and other information, but the suicidal ideation dataset in this article refers to a collection of social texts expressing suicidal ideation or suicidal tendencies. It mainly comes from Reddit, Twitter, ReachOut, Sina Weibo and other social network platforms or forums.

a: REDDIT
Reddit is an online forum community, and its subreddit called ''Suicide Watch'' is intensively used for further annotation of positive samples. As a control set, most of the samples without suicidal ideation come from other subreddits. For example, Ji et al. [56] released a dataset containing 3,549 suicidal ideation posts. Shing et al. [57] released the UMD Reddit suicidal ideation dataset, which included 11,129 users and 1,556,194 posts, and sampled 934 users for further annotation. Aladağ et al. [48] used Google Cloud BigQuery to collect 508,398 posts and manually annotated 785 of them. In addition, the CLPsych2019 shared task is to predict a user's suicide risk level, and its dataset comes from ''Suicide Watch'' posts from 2005 to 2015. The training set includes more than 31,553 posts by 496 users, and the test set contains 9,610 posts by 125 users.

b: TWITTER
Twitter is quite different from Reddit in post length, anonymity, communication and interaction methods. Coppersmith et al. [58] collected a dataset of Twitter users with suicidal ideation and depression. Ji et al. [56] collected 594 tweets with suicidal ideation from 10,288 tweets and obtained an unbalanced dataset.

c: REACHOUT
The ReachOut forum is a peer support platform provided by an Australian mental health care organization. The CLPsych2017 shared task is to identify the self-harm risk level in posts [59], and the official dataset comes from the ReachOut forum. The training set contains 65,756 posts, including 1,188 manual annotations. The test set includes 92,207 forum posts, including 400 manual annotations.

d: SINA WEIBO
Sina Weibo is a Chinese social network platform. The Institute of Psychology of the Chinese Academy of Sciences collected 65,352 posts and recruited 6 graduate students in psychology to manually annotate these posts for suicidal ideation [9]. They generated a suicidal ideation dataset consisting of 6,251 users and 13,013 posts. This dataset does not involve users' personal data, and the posts are manually annotated only according to the words or sentences in the posts. In this paper, we manually annotate an SIC dataset based on the Sina Weibo suicidal ideation dataset mentioned above.
Related works show that it is feasible to use social media to detect suicidal ideation. Although these models cannot completely replace the detailed examination and professional evaluation of psychological experts, they can still be used as a preliminary screening tool to help detect individuals with a potential suicide crisis on a large scale, saving the efforts of psychological counselors and improving the efficiency of large-scale monitoring of suicidal ideation.

C. EXTRACTION OF EMOTIONAL CAUSE
There is a strong correlation between emotion and suicidal ideation and their cause. In recent years, some researchers shifted their attention from emotional text analysis to the extraction of emotional cause. The widely used methods are reviewed below.

1) KNOWLEDGE-BASED AND RULE-BASED METHODS
Following Talmy's [60] viewpoint, emotion is a kind of internal response triggered by understanding external events. Lee et al. [61], [62], Chen et al. [63], Neviarouskaya and Aono [64], and Li and Xu [65] extracted the events that trigger an emotion through linguistic clues. At the same time, some scholars have proposed the commonsense-based method (CB) [65], [66], which constructs rules by combining a linguistic model, psychological knowledge and other domain knowledge.

2) FEATURE-BASED MACHINE LEARNING METHODS
These methods consider emotion cause extraction to be a candidate sentence or clause classification problem [63], [65], [67], [68] or a sequence labeling problem [69], [70]. The classification problem is to decide whether a candidate sentence or clause is the cause through its own features and contextual features. Compared with the classification model, the sequence labeling model can not only conveniently add intrasentence features and contextual features but also consider the sequence features between the cause clauses. Various features are used for emotional cause extraction, including word features, word category features (such as the categories of words in the LIWC or Emotional Dictionary), POS features, grammatical features, word embedding features, distance from emotional words, context features and linguistic rule features, among others.
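As a minimal sketch of how such features might be assembled for a sequence labeling model, the following builds one feature dictionary per token from word, POS, neighboring-token, and distance-to-emotion-word features. The function names, feature keys, and the toy emotion lexicon are hypothetical and are not taken from any of the cited works.

```python
from typing import Dict, List, Set, Union

Feature = Union[str, bool, float, int]


def token_features(words: List[str], pos_tags: List[str], i: int,
                   emotion_words: Set[str]) -> Dict[str, Feature]:
    """Build a feature dict for token i (hypothetical feature names)."""
    feats: Dict[str, Feature] = {
        "bias": 1.0,
        "word": words[i],                        # word feature
        "pos": pos_tags[i],                      # POS feature
        "is_emotion": words[i] in emotion_words  # word category feature
    }
    # Context features: previous and next token, or sentence boundary flags.
    if i > 0:
        feats["-1:word"] = words[i - 1]
        feats["-1:pos"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True
    if i < len(words) - 1:
        feats["+1:word"] = words[i + 1]
        feats["+1:pos"] = pos_tags[i + 1]
    else:
        feats["EOS"] = True
    # Distance to the nearest emotion word; -1 when the sentence has none.
    positions = [j for j, w in enumerate(words) if w in emotion_words]
    feats["emotion_distance"] = min((abs(i - j) for j in positions), default=-1)
    return feats


def sentence_features(words: List[str], pos_tags: List[str],
                      emotion_words: Set[str]) -> List[Dict[str, Feature]]:
    """One feature dict per token, in sentence order."""
    return [token_features(words, pos_tags, i, emotion_words)
            for i in range(len(words))]
```

A list of dictionaries of this kind is a common input shape for feature-based sequence labelers; which features actually help is an empirical question for each task.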

3) DEEP LEARNING METHODS
Knowledge-based and rule-based methods have several disadvantages, such as the heavy workload of constructing the knowledge and rules, the limited coverage, and the poor adaptability to different corpora, whereas feature-based machine learning methods require rich experience for feature selection. Gui et al. [71] and Mu et al. [72] used deep learning technology to obtain the semantic representation through a word embedding model called Word2vec and then measured the effect of words for detecting emotional cause through an attention mechanism, which captures semantic continuity by integrating contextual semantic information. Xia et al. introduced a relative position augmented embedding learning algorithm and transformed the task from an independent prediction problem to a reordered prediction problem in which the dynamic global label information is incorporated [73]. To consider the relationships among sentences, they also proposed an emotion cause extraction framework, the RNN-Transformer hierarchical network (RTHN) [74]. To avoid manually annotating emotional clauses, they also proposed a joint extraction model of both emotional cause clauses and emotional clauses based on hierarchical BiLSTM and interactive multitask learning [75] and a joint extraction model based on a 2D square matrix.
Other deep learning-based work on emotional cause extraction includes Tang et al. [76], Fan et al., Song et al. [77], Bi and Liu [78], and Yu et al. [79].
Although SICE is similar to emotional cause extraction, there are still many differences. First, in emotional cause extraction, the extraction object is usually an event with sentence-level or clause-level granularity, and the location of the emotional cause is usually closely related to the position of emotional words. In SICE, the extracted objects include events, emotions, psychophysiological disorders, and others, and the extraction granularity is smaller, such as word-level and phrase-level. Second, although there is a Chinese Suicide Dictionary [46] built by the Institute of Psychology of the Chinese Academy of Sciences, the suicide words are more complex, and their relationship with SICE is not clear. Third, emotional words mainly describe the emotional tendency and are independent of the emotional cause. However, the situation is more complicated with suicide words because they cover a wider range of words, such as suicidal tendency or suicidal ideation words (such as death, to die, not want to live, give up, want to commit suicide), psychological words (such as annoyance, resentment, regret, sadness, disappointment, loneliness), physiological words (such as illness, neurasthenia, schizophrenia, cancer, malignant lymph nodes cancer), interpersonal relationship words (such as mom, love you, friends, family, parents), and others. Harder still, some of these words can be cause words or clue words themselves.

III. CONSTRUCTION OF THE SIC DATASET
To our knowledge, there are no publicly available corpus resources for SIC in social texts, so we constructed an SIC dataset based on our proposed annotation scheme under the guidance of psychological professionals.
The SIC dataset is built on the Weibo suicidal ideation dataset [9] introduced in Section II-B-5, which contains only Chinese posts. Since this dataset is a collection of posts that were manually annotated as expressing suicidal ideation, we do not further assess whether the posting users have suicidal ideation.

A. DATA PREPROCESSING
The preprocessing of the dataset mainly includes data cleaning and data transformation. Data cleaning includes removing post IDs, @usernames, illegal characters, hyperlinks, emoticons, and non-Chinese characters. The data transformation mainly refers to the conversion between Simplified Chinese and Traditional Chinese text. The data preprocessing also includes Chinese word segmentation, part-of-speech tagging, and dependency parsing, for which we used LTP, a language cloud platform from Harbin Institute of Technology. Although there are similar tools such as jieba word segmentation, THULAC, and others, LTP has more complete functions than other tools, including Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency syntax analysis, semantic role tagging, and others. It is a popular and widely used platform for Chinese text processing.
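The cleaning step described above could be approximated as follows. This is only a sketch under stated assumptions: the regular expressions and the set of punctuation marks that are kept are hypothetical, since the exact cleaning rules are not specified, and the Simplified/Traditional conversion and the LTP-based segmentation, tagging, and parsing are deliberately left out.

```python
import re

# Hypothetical patterns; the exact cleaning rules are not given in the text.
URL_RE = re.compile(r"https?://\S+")
# A mention is assumed to end at whitespace. Note that \w also matches
# Chinese characters, so "@name" directly followed by text is removed whole.
MENTION_RE = re.compile(r"@[\w\-]+")
# Keep CJK Unified Ideographs plus a few full-width punctuation marks
# (an assumption); everything else, including emoji, is dropped.
NON_CHINESE_RE = re.compile(r"[^\u4e00-\u9fff，。！？、；：]")


def clean_post(text: str) -> str:
    """Remove hyperlinks, @usernames, and non-Chinese characters."""
    text = URL_RE.sub("", text)       # hyperlinks first: they may contain '@'
    text = MENTION_RE.sub("", text)   # @username mentions
    text = NON_CHINESE_RE.sub("", text)
    return text
```

For example, `clean_post("@user123 我很难过 http://t.cn/abc tired")` leaves only the Chinese text `我很难过`.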

B. ANALYSIS OF SUICIDAL IDEATION CAUSE
Manual annotation of the SIC dataset is based on the main classifications of ''Chinese Classification and Diagnostic Standards for Mental Disorders'' (CCMD-3) [80] and other previous research. Under the guidance of Professor Renheng Wu, executive director of the Jiangxi Psychological Society, and three other psychological professionals, the SIC was classified into two groups: individual internal cause and external environmental cause. The internal causes included physical factors (including physical diseases, eating disorders, nonorganic sleep disorders, nonorganic sexual dysfunction, genetic factors, etc.) and psychological factors (three types according to degree: general psychological problems that require psychological consultation, severe psychological problems that require psychiatrist treatment, and psychiatric patients who need psychiatrist intervention). The external causes included chronic stress (including economic stress, work stress, academic progression pressure, and interpersonal stress), stress events and emergencies (including severe or acute stress disorder, acute stress psychosis, posttraumatic stress disorder, adaptation disorder, sharp and severe mental strikes, and traumatic psychological events), psychoactive substances (including alcohol, opioid, cannabis, hypnotic, anxiolytic, anesthetic, stimulant, hallucinogen, tobacco, etc.), and culture (including religion, race, immigration, etc.). Because the boundaries between some classes are not obvious and some classes are correlated, the annotation is subjective. For example, mental health problems such as depression involve both psychological factors and physiological factors arising from organic changes in the brain.

C. MANUAL ANNOTATION SCHEME
Because of the informality and colloquialism of language in social text, the SIC is usually expressed implicitly, that is, there are no linguistic clue words such as ''because'', ''as'', ''so'', which increases the difficulty of annotation work.
The annotation scheme we proposed adopts the principle of ''Terse and concise, No reasoning, Continuity of cause words, and Unique category''. ''Terse and concise'' refers to annotating only the smallest string that can reflect the SIC. ''No reasoning'' requires that the annotation is based on literal meaning only, without guessing or reasoning, and without considering indirect causes. ''Continuity of cause words'' requires that the words covered by the same cause must be continuous and uninterrupted. ''Unique category'' means that each cause belongs to only one cause category. The tagging scheme follows the classical 4-tag method (B: begin, the first part of the cause; E: end, the last part of the cause; M: middle, the middle part of the cause; S: the cause only contains a single word). The annotation work was completed by four graduate students who had received psychological training and passed the test. The annotation results were checked, and inconsistent annotations were discussed and voted on by the group. Annotations that remained controversial were sent to a psychological expert for arbitration. Annotation consistency was measured with the standard Kappa statistic, and the average Kappa value between two annotators was 0.617. This finding showed that human judgments of the SIC differ and that informal expressions in Chinese posts increase the complexity and ambiguity of semantic understanding.
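To make the 4-tag scheme and the agreement measure concrete, the sketch below converts cause spans into B/M/E/S tags and computes Cohen's kappa between two annotators' tag sequences. Two points are assumptions rather than details from the paper: tokens outside any cause are tagged ''O'', and kappa is computed at the token level. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def bmes_tags(n_tokens: int, spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Tag each token with B/M/E/S for cause spans given as [start, end).

    Tokens outside every span get "O" (an assumed outside tag).
    """
    tags = ["O"] * n_tokens
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"          # single-word cause
        else:
            tags[start] = "B"          # first word of the cause
            for i in range(start + 1, end - 1):
                tags[i] = "M"          # middle words
            tags[end - 1] = "E"        # last word of the cause
    return tags


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two equally long label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label distribution.
    expected = sum(count_a[label] * count_b[label]
                   for label in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For a six-token post with a three-word cause at positions 1 to 3 and a single-word cause at position 5, `bmes_tags(6, [(1, 4), (5, 6)])` gives `["O", "B", "M", "E", "O", "S"]`.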
In the end, causes were annotated in 5,994 of the 13,013 posts. The statistical results for POS are shown in Fig. 2 (statistics in terms of words). In the expression of an SIC, verbs (v) account for the largest proportion (33.5%), followed in order by adjectives (a), nouns (n), and adverbs (d).
Comparing the POS in an SIC with that in a non-SIC, we found that adjectives (a) showed the greatest difference (except for abbreviations (j) and conjunctions (c); the former was very rare, whereas the latter was a function word), accounting for 20.8% (PinS) and 4.8% (PinN), respectively, and the value of PinS/PinN was 4.3. Generally, the distribution of POS in the SIC and non-SIC was quite different.
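A comparison of this kind can be sketched as below. The sketch assumes that PinS and PinN denote the share of a POS tag among words inside and outside annotated causes, which is our reading of the figures quoted above; the function names and the input format are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def pos_proportions(
    tagged: Iterable[Tuple[str, bool]]
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return POS shares inside causes and outside causes.

    `tagged` yields (pos_tag, is_in_cause) pairs, one per word.
    """
    in_sic: Counter = Counter()
    non_sic: Counter = Counter()
    for pos, is_in_cause in tagged:
        (in_sic if is_in_cause else non_sic)[pos] += 1

    def normalize(counts: Counter) -> Dict[str, float]:
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()} if total else {}

    return normalize(in_sic), normalize(non_sic)


def proportion_ratio(p_in_s: Dict[str, float], p_in_n: Dict[str, float],
                     pos: str) -> float:
    """Ratio of the in-cause share to the out-of-cause share for one tag."""
    denominator = p_in_n.get(pos, 0.0)
    if denominator == 0.0:
        return float("inf")
    return p_in_s.get(pos, 0.0) / denominator
```

With the adjective figures reported above, the same arithmetic gives 20.8 / 4.8, which rounds to 4.3.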

3) THE DISTRIBUTION OF DEPENDENCY IN SIC
The analysis of dependency in SIC can help us understand the expression of the SIC and the structural information in the phrases or clauses used. The statistical results are shown in Fig. 3 (statistics in terms of words). Fig. 3 shows that in SIC, the verb-object phrase (VOB) and coordinate phrase (COO) account for 24% and 22.2%, respectively. By comparing dependency in the SIC with that in the non-SIC, we found that the biggest difference is in the core relationship (HED) (except for the indirect object (IOB) and right adjunct (RAD); the former was very rare, whereas the latter was a function word), which accounted for 4.8% (DPinS) and 2.1% (DPinN), respectively, with a DPinS/DPinN value of 2.3, followed by the verb-object phrase (VOB), which accounted for 24% (DPinS) and 11.2% (DPinN), with a DPinS/DPinN value of nearly 2.2.

4) THE DISTRIBUTION OF EMOTIONAL WORDS IN SIC
There is a strong correlation between emotion and suicidal ideation. Negative emotion is one of the important causes of suicidal ideation. Statistics on the distribution of emotional words in SIC help us understand the correlation between them.
In our study, the National Taiwan University Sentiment Dictionary (NTUSD) and the Dalian University of Technology Sentiment Dictionary (DUTSD) are employed as sentiment dictionaries. NTUSD divides 11,086 words into 2,810 positive emotion words and 8,276 negative emotion words. The DUTSD includes 11,107 positive emotion words and 10,713 negative emotion words after removing the emotional words with ambiguous polarity. The distribution of emotional words in SICs is shown in Table 2. Table 2 shows that the distributions of emotion words based on the DUTSD and the NTUSD are basically the same in the SIC dataset. In general, the proportion of emotion words in SICs was higher than that in non-SICs. Within SIC phrases, negative emotion words were significantly more frequent than positive emotion words. For example, using the DUTSD, the proportion of positive emotion words was 2.4%, whereas that of negative emotion words was nearly 7.9%, about 3.3 times as high. Using the NTUSD, the proportion of positive emotion words was 4.5%, whereas that of negative emotion words was as high as 23.9%, about 5.3 times as high.
There are some differences between the distribution of emotional words in SICs and that in non-SICs, especially for negative emotion words. Specifically, the proportion of negative emotion words in SICs was 7.2 times (based on the DUTSD) and 5.6 times (based on the NTUSD) that in non-SICs.

5) THE DISTRIBUTION OF LINGUISTIC FEATURES IN SIC
The analysis of the distribution of language features in SICs is helpful for understanding language style and the diction used in the expression of an SIC. Based on the Simplified Chinese LIWC (SC-LIWC), which is a translation and revision of LIWC by Taiwanese scholar Jinlan Huang et al. [81] under the authorization of Pennebaker, the founder of LIWC, we statistically analyzed the distribution of language features. SC-LIWC includes 71 word categories, such as emoticons, first-person singular and plural.
In general, the distributions of SC-LIWC in SICs and non-SICs were quite different. According to the definition of SICs, the SC-LIWC words with higher difference rankings, such as ''sad'', ''anger'', ''negemo'' and ''anx'', belong to psychological factor IPS, which is an internal cause of suicidal ideation.

6) THE DISTRIBUTION OF SUICIDE WORDS IN SIC
For emotional cause extraction, the causes are usually located by emotional words, and emotional cause clauses usually appear near the emotional words. A simple idea is that there could also be a similar relationship between suicide words and SICs. Suicide words indicate the tendency for suicide, and the SIC is the cause of this tendency. Studying the distribution of suicide words around the SIC can help to further understand this relationship.
Based on the Chinese Suicide Dictionary developed by the Institute of Psychology, Chinese Academy of Sciences, the distribution of suicide words around SICs was calculated, as shown in Table 4. The distribution showed that suicide words appeared in 67.9% of phrases in SICs, and 64.3% of them had one suicide word, whereas approximately 87.9% of the posts contained suicide words outside the SICs, and as many as 68.8% of the posts contained at least two suicide words outside the SICs. This finding is quite different from emotional cause extraction, where emotional cause clauses and emotional words are usually separated.
The above statistics showed that the distinction between suicide words and SICs was not clear. Further analysis showed that suicide words in the Suicide Dictionary included not only words that described suicidal ideation but also words for the causes that led to suicidal ideation, such as ''depression'', ''illness'', and ''cancer'', and even some interpersonal words. This finding showed that it was difficult to extract SICs by means of suicide words, and it also reflected that SICE was more challenging than emotional cause extraction.

7) THE DISTRIBUTION OF CAUSE CLUE WORDS
Clue words such as ''because'', ''as'', and ''make'' are good indicators for detecting the SIC. It has been shown in emotional cause extraction that linguistic clue words play an important role in the connection between emotional keywords and emotional causes. Researchers use linguistic clue words to construct linguistic rules to extract emotional causes in text [63], [69]. The cause clue words used in this paper are shown in Table 5. The cause clue words were divided into four categories according to grammar and function: conjunctions and prepositions, causative verbs, perception verbs, and other words. In the SIC dataset, the closest distance between cause clue words and cause phrases was calculated. The distance is measured in number of words: it is the closest distance between a cause clue word and the word that begins or ends the cause phrase, that is, the word labeled ''B'', ''E'' or ''S''. The results are shown in Table 6. In the SIC dataset, 1,187 posts (19.8%) did not have any cause clue words, and these posts contained 1,958 cause phrases, accounting for 14.2% of the total number of cause phrases.
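The closest-distance statistic can be sketched as below. The function names are hypothetical, the English words stand in for segmented Chinese posts, and treating the distance as an absolute difference of word positions is an assumption about how the count was made.

```python
from typing import List, Optional, Set, Tuple


def cause_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Recover inclusive (start, end) spans of cause phrases from B/M/E/S/O labels."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "S":
            spans.append((i, i))
        elif lab == "B":
            start = i
        elif lab == "E" and start is not None:
            spans.append((start, i))
            start = None
    return spans


def closest_clue_distances(
    words: List[str], labels: List[str], clue_words: Set[str]
) -> List[Optional[int]]:
    """For each cause phrase, distance in words from its boundary to the nearest clue word.

    Returns None for a phrase when the post contains no clue word at all.
    """
    clue_positions = [i for i, w in enumerate(words) if w in clue_words]
    distances: List[Optional[int]] = []
    for start, end in cause_spans(labels):
        if not clue_positions:
            distances.append(None)
            continue
        # Only the boundary words (labeled B, E or S) are used as reference points.
        distances.append(min(min(abs(c - start), abs(c - end)) for c in clue_positions))
    return distances


if __name__ == "__main__":
    words = ["because", "of", "exam", "failure", "I", "feel", "lost"]
    labels = ["O", "O", "B", "E", "O", "O", "S"]
    print(closest_clue_distances(words, labels, {"because"}))
```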
Although 80.2% of the posts in the SIC dataset had cause clue words, it was found that many of them had different uses and did not always express the cause. For example, in the example of the cause clue word ''make'' in the SIC dataset given below, only (make or let) in the first example is a clue word that truly expresses the cause.
To sum up, our research is similar to emotion cause extraction in that both extract causes from text and use statistical analysis, and some of our results are similar to those in the literature [68]-[70], [72], [82]. For example, the features that differ significantly between cause phrases and noncause phrases are verbs (v), adjectives (a) and nouns (n) in POS, and the core relationship HED (verb) and verb-object phrases (VOB) in dependency [82]. However, there are still some differences or challenges in SICE, which include the following three issues.

a: THE GRANULARITY OF EXTRACTED OBJECTS
The extracted objects of emotional causes are usually events, and the granularity is usually sentence-level or clause-level. The location of the emotional cause is usually close to the emotional words. However, the extracted objects from SICs include more fine-grained cause phrases or words, such as events, emotions, psychophysiological disorders, etc.

b: THE ROLE OF EMOTIONAL WORDS AND SUICIDE WORDS
First, emotion dictionaries have much bigger vocabularies than suicide dictionaries. Although different emotion dictionaries contain different emotional words, the definition of emotional words is relatively clear, and most emotional words are annotated at word-level granularity. However, suicide dictionaries are very rare at present, and expressions of suicidal ideation go far beyond the word level. The suicide dictionary built by the Institute of Psychology of the Chinese Academy of Sciences aims to detect suicidal ideation and includes a large number of words, phrases and even sentences.
Second, emotional cause extraction usually involves extracting emotional causes around emotional words. Emotional words and emotional causes do not overlap, and the boundaries are clear. However, suicide words include words not only describing suicidal ideation but also leading to suicidal ideation, and even some interpersonal relationship words, which have unclear boundaries with suicidal ideation and more complex relationships.

c: THE EFFECT OF CAUSE CLUE WORDS
Datasets for emotional cause extraction generally use more formal news text, such as that in [71], and there are clear syntactic rules linking the clue words, emotional words and emotional causes. However, in addition to the different roles of suicide words and emotional words discussed above, the relationship between cause clue words and SICs also becomes ambiguous because of the informal expression of posts. In addition, nearly 20% of the posts that contain SICs have no cause clue words. There is a large amount of ambiguity in the use of cause clue words, which makes it even more difficult to analyze SICs through clue words or syntactic analysis.

IV. CRF MODEL AND FEATURE ENGINEERING
As the types of SICs are complex and interrelated, to simplify the problem, only the SIC is extracted in our work, without distinguishing its specific type.
The task of SICE can be regarded as a sequence labeling problem. At present, the classic sequence labeling models are mainly the Hidden Markov Model (HMM), the Maximum Entropy Markov Model (MEMM) and the Conditional Random Field (CRF), among others. Unlike HMM, CRF does not make strict independence assumptions, so it can accommodate arbitrary context information and allows flexible feature design. Compared with MEMM, CRF computes the conditional probability of the globally optimal output sequence, so it also overcomes the label bias problem of MEMM. Therefore, CRF is widely used in sequence labeling problems and performs well. This paper attempts to use CRF to extract SICs.
Given the text $x = \{x_1, x_2, \ldots, x_t\}$, where $x_i$ ($i = 1, \ldots, t$) is a word (including punctuation) in the text and $t$ is the number of words in $x$, the label sequence $y = \{y_1, y_2, \ldots, y_t\}$ is the SIC labeling of $x$, and its calculation is shown in formula (1). The feature function $s_l(y_i, x, i)$ depends only on the current position and is called a state feature; that is, it relates the label of the current word in $x$ to the features of $x$ (such as POS, syntax, etc.). The feature function $t_k(y_{i-1}, y_i, x, i)$ depends on the labels of the current and previous positions and is called a transition feature. The label sequence $y$ that maximizes $P(y|x)$ is the final labeling result.
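For reference, the standard linear-chain CRF form that is consistent with the state and transition features defined above can be written as follows. The weights $\lambda_k$ and $\mu_l$ and the normalizer $Z(x)$ are assumed notation from the standard formulation and are not taken from the original formula (1).

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\left(
    \sum_{i=1}^{t} \sum_{k} \lambda_k \, t_k(y_{i-1}, y_i, x, i)
  + \sum_{i=1}^{t} \sum_{l} \mu_l \, s_l(y_i, x, i)
\right),
\qquad
Z(x) = \sum_{y'} \exp\left(
    \sum_{i=1}^{t} \sum_{k} \lambda_k \, t_k(y'_{i-1}, y'_i, x, i)
  + \sum_{i=1}^{t} \sum_{l} \mu_l \, s_l(y'_i, x, i)
\right)
```

Because $Z(x)$ sums over all candidate label sequences $y'$, the probability is normalized over the whole sequence rather than position by position.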
The CRF model requires the definition of the features, the selection of the feature functions and the design of the feature template. In this paper, the open source tool CRF++ (https://www.softpedia.com/get/Science-CAD/Taku-CRF.shtml) was used. There are four main parameters that need to be set in the experiments. One parameter was varied at a time while the others were held fixed, and after many experiments, the parameter values with the better results on the validation set were selected. The final main parameter settings and meanings are as follows.
1. -a (CRF-L2 or CRF-L1): This parameter selects the regularization algorithm used in training, and CRF-L2 is finally chosen.
2. -f (NUM): This parameter means that features with a frequency less than NUM are ignored in the training process, and the final choice is f = 4.
Word (WD). The WD of $x_i$ refers to the word itself in position $i$, which is recorded as $W_i$.
Part of speech (POS). The POS of $x_i$ refers to the POS of the word in position $i$, which is recorded as $P_i$.
Dependency (DP). The DP of $x_i$ refers to the word that $x_i$ depends on and the dependency relation, which is recorded as $D_i$.
Psychological Characteristics of Suicide (PCS). The PCS of $x_i$ refers to whether $x_i$ belongs to the ''suicide words'', which is recorded as $S_i$. The Chinese Suicide Dictionary published by the Institute of Psychology of the Chinese Academy of Sciences is adopted here.
Emotional Features (ETs). The ET of $x_i$ refers to the emotion polarity of $x_i$, which is recorded as $E_i = (E_{ni}, E_{pi})$, where $E_{ni}$ and $E_{pi}$ indicate whether $x_i$ is a negative or a positive emotion word, respectively. Emotional polarity was determined by the NTUSD.
Linguistic Features (LGs). The LG of $x_i$ refers to the word category of $x_i$ in the SC-LIWC, which is recorded as $L_i$. There are 71 word categories in the SC-LIWC. According to the statistical analysis in Section III-D, the categories with the top five LIWCinS/LIWCinN values were adopted, namely ''sad'', ''anger'', ''health'', ''negemo'' and ''anx''. The LG of $x_i$ only considers whether $x_i$ belongs to these five categories. Since $x_i$ may belong to multiple categories at the same time, $L_i$ is expressed as $(L_{si}, L_{ai}, L_{hi}, L_{ni}, L_{xi})$, whose components indicate whether $x_i$ belongs to ''sad'', ''anger'', ''health'', ''negemo'' and ''anx'', respectively.
In addition to the abovementioned six atomic features, we also combine these atomic features in two ways: the same types of features are combined in a given window to form the context of the feature and the different types of atomic features are combined to form mixed features.
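For example, given a window length of 5, the context of the word feature includes unitary words ($W_{i-2}$, $W_{i-1}$, $W_i$, $W_{i+1}$, $W_{i+2}$), binary words, ternary words, and more. Similarly, the POS includes unitary POS ($P_{i-2}$, $P_{i-1}$, $P_i$, $P_{i+1}$, $P_{i+2}$), binary POS, ternary POS, and more, as do the other types of features.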
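A minimal sketch of how unitary, binary and ternary context features could be enumerated in a window of length 5 is given below. The function `window_features`, the padding symbol and the restriction to contiguous n-grams are assumptions for illustration; the actual templates are those listed in Table 7.

```python
from typing import List, Tuple

Feature = Tuple[int, int, Tuple[str, ...]]  # (n-gram order, start offset, values)


def window_features(
    seq: List[str], i: int, window: int = 5, max_n: int = 3, pad: str = "<PAD>"
) -> List[Feature]:
    """Enumerate contiguous n-gram features of one feature type around position i."""
    half = window // 2
    # Values at positions i-half .. i+half, padded beyond the sequence ends.
    ctx = [seq[j] if 0 <= j < len(seq) else pad for j in range(i - half, i + half + 1)]
    feats: List[Feature] = []
    for n in range(1, max_n + 1):  # unitary, binary, ternary
        for s in range(0, window - n + 1):
            feats.append((n, s - half, tuple(ctx[s:s + n])))
    return feats


if __name__ == "__main__":
    words = ["I", "failed", "the", "exam", "again"]
    for feat in window_features(words, 2):
        print(feat)
```

The same function can be applied to a POS sequence or a dependency sequence, since it only looks at the values of one feature type.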
The POS, the dependency, and the word categories (whether a word is a positive or negative emotion word, whether it is a suicide word, and which LIWC category it belongs to) all depend on the word itself, and compared with the POS and dependency, the ambiguity of a word category is relatively small. Therefore, the mixed features only consider the combination of word, POS, and dependency.
Based on the experiments, the candidate features we finally designed are shown in Table 7.

B. FEATURE SELECTION STRATEGY
A greedy feature selection strategy was adopted for the CRF. First, we selected the single feature type that gave the best results on its own. Then, we added each of the remaining five feature types in turn and selected the one that gave the best results, and this process was repeated until the combination of all six types of features had been examined.
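The greedy strategy can be sketched as a forward selection loop. The callback `evaluate` is hypothetical: in practice it would train a CRF with the given feature types and return a validation score such as C_F. The toy scores in the usage example are invented for illustration and are not experimental results.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_select(
    feature_types: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> List[Tuple[List[str], float]]:
    """Add one feature type per round, always the one that scores best with those already chosen."""
    selected: List[str] = []
    remaining = list(feature_types)
    history: List[Tuple[List[str], float]] = []
    while remaining:
        # Score every candidate extension of the current combination.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        best_score, best = max(scored, key=lambda pair: pair[0])
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), best_score))
    return history


if __name__ == "__main__":
    # Toy additive scores, purely illustrative.
    gain = {"W": 0.60, "ET": 0.05, "LG": 0.04, "PCS": 0.03, "DP": 0.02, "POS": 0.01}
    for combo, score in greedy_select(list(gain), lambda fs: sum(gain[f] for f in fs)):
        print(" + ".join(combo), round(score, 2))
```

The returned history contains one combination per round, so the best combination overall can be picked afterwards even if it is not the full set of six types.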

V. CHAR-BiLSTM-CRF MODEL
The Char-BiLSTM-CRF model does not require feature design and has achieved good results in sequence labeling tasks, such as part-of-speech tagging, chunking, and named entity recognition. Therefore, we attempted to use this model to extract the SICs. Char-BiLSTM-CRF [83]-[85] is a neural network model with a three-layer architecture, as shown in Fig. 5.

A. CHAR LAYER
The Char layer extends the input of the word embeddings, and the char embeddings and the word embeddings are concatenated as the input layer of this model, which can represent the internal structure of the word, making up for the deviation in word embeddings caused by Chinese word segmentation or insufficient training data.

B. BiLSTM LAYER
The BiLSTM layer can capture the long-distance dependence of words in a sentence, and can learn more about the context of the words.

C. CRF LAYER
The CRF layer tries to optimize the output label sequence. The label of the current word is affected not only by the features of the word itself and its context but also by the labels of the context. For example, the label ''M'' must be preceded by ''B'' or ''M'', and the label ''E'' must be preceded by ''M'' or ''B''. These labels are mutually constrained. BiLSTM cannot fully consider the dependence between labels, which may result in irregular label sequences. The CRF layer can learn the label transition probabilities over the entire sentence and fully consider the dependencies between the context labels.
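The label constraints mentioned above can be made concrete with a small validity check. Only the rules for ''M'' and ''E'' come from the text; the remaining entries of `ALLOWED_PREV` are assumptions that follow from the 4-tag scheme. This is a hard check on label sequences, not the learned transition probabilities of the CRF layer.

```python
from typing import List, Optional

# Which labels may precede each label; None stands for the start of the sentence.
ALLOWED_PREV = {
    "M": {"B", "M"},            # stated in the text
    "E": {"B", "M"},            # stated in the text
    "B": {"O", "E", "S", None},  # assumed: a new cause cannot start inside another
    "S": {"O", "E", "S", None},  # assumed
    "O": {"O", "E", "S", None},  # assumed: an open cause must be closed by E first
}


def is_valid_sequence(labels: List[str]) -> bool:
    """Return True if the label sequence respects the B/M/E/S/O constraints."""
    prev: Optional[str] = None
    for lab in labels:
        if prev not in ALLOWED_PREV[lab]:
            return False
        prev = lab
    # A sentence must not end in the middle of a cause.
    return prev not in {"B", "M"}


if __name__ == "__main__":
    print(is_valid_sequence(["O", "B", "M", "E", "S"]))
    print(is_valid_sequence(["O", "M", "E"]))
```

A per-position classifier such as a plain BiLSTM output layer can emit the second kind of sequence, which is the irregularity the CRF layer is meant to reduce.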
The hyperparameter settings used in the Char-BiLSTM-CRF model in this paper are shown in Table 8. We also compared three word embedding models: Word2vec, ELMo and BERT. Word2vec is a distributed, low-dimensional, dense word embedding representation that is learned from the contexts of words and maps words with similar semantics to nearby positions in the vector space. To address Word2vec's insufficient understanding of the contextual meaning of words, ELMo produces different representations of the same word in different contexts through pretraining and fine-tuning. BERT further enhances the generalization ability of the word embedding model, captures information at the character level, the word level, the sentence level and even the relationships between sentences, and better solves the problem of polysemy that Word2vec cannot solve.
In this article, Word2vec is pretrained on Chinese Wikipedia, and ELMo is pretrained on the 13,013 posts with suicidal ideation and uses Word2vec for initialization. The dimensions of the vectors trained by both Word2vec and ELMo are set to 300. The main training parameters of the ELMo model are shown in Table 9. BERT is pretrained by Google on 3.3 billion words (2.5 billion from Wikipedia and 800 million from other text corpora), and its word vector dimension is 768. When BiLSTM is used to obtain the char embeddings for concatenation, the parameter dim_char is set to 256.

VI. EXPERIMENT ANALYSIS

A. DATASET AND EVALUATION METRICS

1) DATASET
After manual annotation of the suicidal ideation dataset in Section III-C, a total of 5,994 posts containing SICs were obtained (subsequent work does not take posts without SICs into account). They were randomly divided into the training set, the validation set, and the test set at an 8:1:1 proportion. The distribution of posts and labels in the SIC dataset is listed in Table 10.
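A random 8:1:1 split can be sketched as follows. The function name, the fixed seed and the rounding rule (integer division, with the remainder going to the last part) are assumptions, so the resulting counts need not match those in Table 10.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_8_1_1(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the posts and cut them into 80% / 10% / 10% parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n = len(shuffled)
    n_train = n * 8 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_8_1_1(range(5994))
    print(len(train), len(val), len(test))
```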
It can be seen from Table 10 that the dataset has a label category imbalance problem, and the number of labels ''O'' is much larger than the number of other labels. This experiment has not yet dealt with this imbalance problem, which will also be one of the follow-up research projects.

2) EVALUATION METRICS
We used precision, recall, and F1-score as the evaluation metrics. Because an SIC may contain more than one word, to investigate the performance of the SICE model in detail, this paper uses four methods in two categories to determine whether the results of model extraction are correct.
a. Exact Match: A word is counted as correct only if its label category (B, M, E, or S) is correctly labeled. The corresponding precision, recall and F1 values are recorded as E_P, E_R, and E_F, respectively.
b. Fuzzy Match: Unlike the exact match method, the fuzzy match method does not distinguish the label category, such as B, M, E, or S. The fuzzy match is equivalent to the evaluation of binary classification problems. The corresponding precision, recall and F1 values are recorded as F_P, F_R, and F_F, respectively.
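The difference between the two word-level methods can be sketched as below. The function `word_level_prf` is hypothetical, and the exact-match rule used here (a cause word counts as correct only when its B/M/E/S label equals the gold label) is an assumption; the fuzzy variant collapses B, M, E and S into a single cause class as described above.

```python
from typing import List, Tuple


def word_level_prf(gold: List[str], pred: List[str], fuzzy: bool = False) -> Tuple[float, float, float]:
    """Precision, recall and F1 over cause words (labels other than 'O')."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    def norm(label: str) -> str:
        # Fuzzy match ignores the label category and only asks: cause word or not?
        return "C" if (fuzzy and label != "O") else label

    g = [norm(x) for x in gold]
    p = [norm(x) for x in pred]
    tp = sum(1 for a, b in zip(g, p) if a != "O" and a == b)
    n_pred = sum(1 for b in p if b != "O")
    n_gold = sum(1 for a in g if a != "O")
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["O", "B", "M", "E", "O", "S"]
    pred = ["O", "B", "E", "O", "O", "S"]
    print(word_level_prf(gold, pred))              # exact match
    print(word_level_prf(gold, pred, fuzzy=True))  # fuzzy match
```

In this toy case the prediction ends the cause one word early, so the exact match penalizes the mislabeled boundary word, whereas the fuzzy match still credits it as a cause word.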

B. EVALUATION RESULTS AND ANALYSIS
The evaluation results from the CRF and Char-BiLSTM-CRF using the SIC dataset are shown in Table 11.

1) FEATURE ANALYSIS
Word features should be used as the most basic feature. In the CRF model, the effects of various types of atomic features on the task are validated. The word feature, W, performs best, with a C_F value of 0.661, followed by the POS features and the dependency features, DP, whose C_F values are 0.659 and 0.455, respectively. In general, the word feature, W, is the best and most basic feature, because the expression of an SIC depends more on the words themselves.
Expert knowledge-based features can effectively improve the performance of the CRF model. The C_F values of both ''W + ET'' and ''W + LG'' are 0.8% higher than that of W alone. When examining combined features, combining the best individual atomic features does not produce the best results. For example, the combined features ''W + POS'' and ''W + DP'' perform clearly worse than ''W + ET'' and ''W + LG''. This finding shows that the POS and DP features are largely covered by the word feature itself, which is easy to explain: the POS of a word and its dependency in a sentence depend largely on the word itself. On the other hand, the linguistic features (LGs) and emotional features (ETs), which are derived from the LIWC dictionary and the emotion dictionary, perform better than POS and DP because they encode expert knowledge, which also shows that the SIC is closely related to the expression of emotions or feelings. Moreover, ''W + ET'' and ''W + LG'' perform almost the same, but adding LG to ''W + ET'' or ET to ''W + LG'' lowers performance, which shows that in SICE the LGs and ETs are basically equivalent.
The suicide dictionary has an auxiliary effect on the CRF model. Compared with ''W + ET'', after adding the PCS features, the C_F value increased by approximately 1%, which indicates that although the suicide words are not well-distinguished in terms of SIC and non-SIC phrases, they are still good indicators for SICE. In addition, based on combining all the features, ''W + ET + PCS + DP + LG + POS'', removing PCS significantly reduced extraction performance, especially recall, which dropped by 1.9% on average.
Dependency features can improve the performance of CRF models. Compared with ''W + ET + PCS'', the combined feature ''W + ET + PCS + DP'' produced no significant improvement in C_F but a 0.3%-0.7% improvement in P_F, E_F and F_F. In addition, starting from all features, ''W + ET + PCS + DP + LG + POS'', precision, C_P, was basically unchanged after removing DP, whereas recall, C_R, decreased by 2.3%. This finding shows that adding DP can improve the performance of SICE by capturing the SIC and the dependencies between the SIC and other words.
We also tried to add the features used in the CRF model to the Char-BiLSTM-CRF model in two ways, adding them in the CRF layer and adding them in the Char layer, but the performance was not significantly improved. This result indicates that the Char-BiLSTM-CRF model already captures semantics well; how to make full use of feature engineering in deep learning models remains to be further studied.

2) MODEL ANALYSIS
The results of the neural network-based models, compared with the CRF model, are shown in Table 11 (the default word embedding is Word2vec). Although neural network-based models perform well in a variety of tasks, they do not have a consistent advantage if the model structure is not optimized for a specific task. For example, the basic BiLSTM in Table 11 is not better than the CRF. Specifically, compared to BiLSTM, the C_P, C_R, and C_F of the CRF were 9.2%, 2.4% and 5.4% higher, respectively.
The improved Char-BiLSTM-CRF model has more obvious advantages:
(1) The Char-BiLSTM-CRF model is generally better than the CRF model, but CRF still has advantages in precision.
The Char-BiLSTM-CRF model performed better than the CRF model on the four F1 values C_F, P_F, E_F, and F_F, and it mainly improved recall. For example, the C_R, P_R, E_R and F_R of the Char-BiLSTM-CRF model were 4.8%, 4.7%, 5.1%, and 5.2% higher than those of the CRF model, respectively. However, the experimental results also showed that the C_P, F_P and P_P of the Char-BiLSTM-CRF model were 0.2%, 0.5%, and 1.1% lower than those of the CRF model, respectively. This finding showed that although the neural network-based model performed well on classification tasks, the CRF still had a competitive advantage on sequence labeling tasks, especially in situations where training samples were very limited, such as the SICE task.
(2) Adding character embeddings can significantly improve the performance of SICE.
Comparing Char-BiLSTM with BiLSTM and Char-BiLSTM-CRF with BiLSTM-CRF, adding the character embeddings increased the F1 value of each evaluation method by approximately 2% on average and recall by approximately 4% on average. This result showed that, compared with using only word embeddings, adding character embeddings enriched the semantic representation of words. It significantly improved the performance of SICE because, with informal expressions such as Weibo posts, the accuracy of word segmentation is reduced, and word-based text representations become less accurate due to word segmentation errors and the increase in unknown words. The use of character embeddings can effectively alleviate this problem.
(3) Adding the CRF layer can optimize sequence labeling results.
Comparing BiLSTM with BiLSTM-CRF and Char-BiLSTM with Char-BiLSTM-CRF, after adding the CRF layer, the precision values C_P and P_P, which take a complete SIC as the evaluation granularity, increased significantly. In particular, the complete match precision, C_P, increased by 8.9%, and C_F increased by 6.1%. This finding showed that CRF better captured the dependency relationship between adjacent labels by learning transition probabilities, and made up for the poor performance of sequence detection by BiLSTM and Char-BiLSTM in SICE.
(4) After replacing the Word2vec embeddings with the well-trained ELMo and BERT, the Char-BiLSTM-CRF model achieved better results. Compared with the Char-BiLSTM-CRF using Word2vec, the C_F, P_F, E_F and F_F of Char-BiLSTM-CRF (ELMo) increased by 4.7%, 5.6%, 5.4% and 5.7%, respectively. The explanation is that ELMo representations change dynamically according to the context, so they better capture the contextual meaning of words and help to identify the cause of suicidal ideation. In the same way, compared with the Char-BiLSTM-CRF model using Word2vec, the C_F, P_F, E_F and F_F of Char-BiLSTM-CRF (BERT) increased by 8.2%, 5%, 7.7% and 5.9%, respectively. Compared with ELMo, the C_F and E_F of Char-BiLSTM-CRF (BERT) increased by 3.5% and 2.3%, respectively. This finding indicated that BERT was more advantageous under the exact match and that BERT understood the semantics of words better than ELMo, especially when dealing with polysemy.

3) DEEPER-ANALYSIS OF THE CHAR-BILSTM-CRF MODEL
By analyzing the results of Char-BiLSTM-CRF, it was found that the model performed better on short causes, concise expressions and abstract SICs. Because these expressions have more in common, the dataset contains more samples with similar semantics and structure, and the model could be fully trained on them.
SICE can help psychological consultants or researchers quickly understand the causes of psychological crisis through social media and provide a basis for intervention or decision-making. In this paper, based on texts with suicidal ideation in Chinese posts, the definition of the SICE task is given based on related works on SIC and the suggestions of psychological experts. To generate the SIC dataset, we manually annotated the causes in the posts with suicidal ideation collected and annotated by the Psychology Institute of the Chinese Academy of Sciences. We proposed two models, the CRF model and the Char-BiLSTM-CRF model, to extract SICs in posts. We also analyzed the role of various features in the CRF model in detail and compared the traditional CRF model based on feature engineering with the Char-BiLSTM-CRF model based on a neural network. Finally, examples are given to illustrate the challenge of SICE.
The experiment shows that although SICE needs to be further improved, the precision and recall of the proposed models can still reach 0.933 and 0.798, respectively. This finding shows that it is highly feasible to monitor large-scale psychological crises and analyze the cause of the crisis through social media.
At the same time, we also see that SICE from social texts still faces great challenges, and we need to conduct more in-depth research on data annotation, the extraction model and making better use of expert knowledge or common sense. The future work we will focus on is listed as follows:
(1) Automatic or semiautomatic data annotation remains to be explored. This article uses manual annotation, which is time-consuming, labor-intensive, and subjective. Exploring automatic or semiautomatic annotation methods and few-shot or zero-shot learning for SICE will be one of our next research directions.
(2) The generation of suicidal ideation is strongly time-dependent and has multiple causes, so timeliness and other factors need more consideration.
In this research, we only extracted SICs from a single post. However, the generation of suicidal ideation is often gradual and has many causes. Therefore, how to utilize more information posted by users, how to investigate the time dimension, and how to comprehensively consider the potential causes form the second direction for future research.
(3) Introduce expert knowledge and common sense into the model.
As seen from the examples in Section VI-B-3, ''smoking'', ''father's illness'', ''Zoloft'' (an antidepressant drug), ''buried in marriage'', ''gastrorrhagia'' and other causes need to be connected with depression, disease, family relationships, etc. With the help of expert knowledge or common sense, extraction would be improved. At the same time, the specific cause should be further generalized and abstracted to solve the problem of insufficient model training through expert knowledge or common sense.
(4) Optimize the model to deal with the samples with long sentences, irregular sentences, long causes and mixed causes.
The examples in Section VI-B-3 also show that when the sentence is long and irregular or the cause is long and has multiple reasons, the model's ability to capture the SIC decreases. The reason is that the semantics of the sentence or SIC cannot be clearly detected. It may be because the deep learning model with character-level embeddings or word-level embeddings cannot fully learn the semantics at the sentence level or phrase level. Therefore, making targeted improvements to the model is a direction to improve the extraction.
RENQIANG LIU received the B.S. degree in educational technology from Gannan Normal University, Ganzhou, China, in 2017. He is currently pursuing the M.S. degree in computer technology with the Jiangxi University of Finance and Economics, Nanchang. His current research interests include data mining, machine learning, and social media processing.
VOLUME 8, 2020