An Effective Emotional Expression and Knowledge-Enhanced Method for Detecting Adverse Drug Reactions

I. INTRODUCTION
More than 50 million posts are published every day according to Twitter's official reports. Twitter therefore provides rich large-scale multimedia data for various research opportunities [1], including ADR detection, which aims to automatically classify posts as ADR (positive) or non-ADR (negative) given the post content. ADR detection from social texts is an important task for discovering ADRs [2] because of the limitations of clinical experiments. Since ADRs may be exposed when people share their feelings about taking medication on social media, social texts may contain more timely and a wider range of ADRs. However, due to the colloquialism of social texts and the sparseness of posts describing ADRs or drugs, some approaches that perform well on other written biomedical texts, such as PubMed, cannot be applied directly to social texts. Hence, researchers have attempted to find ADRs in social texts. Text mining and partially supervised learning methods [3] are integrated to classify ADR (positive instances) and non-ADR messages (negative instances), and researchers employ various features such as word embeddings [4], position features [5] and medical knowledge [6] to improve the overall performance of their methods. Moreover, researchers utilize attention mechanisms [7], transfer learning [8], co-training [9], broad learning [10] and multi-task learning [11] to learn deep dominant features [12]. Medical resources and emotional scores are merged into features that represent the semantic meaning of text segments in different methods. However, it is difficult to automatically capture the semantic representation of short social texts.
Thus, enhancing the ability of information representation is especially important for short social texts. People often express abundant emotions and feelings in social media posts. Therefore, the innate emotional elements implied in social texts are an important cue for detecting ADRs. Some studies introduce emotional analysis of social texts collected by crawlers [13]. In addition, term frequency-inverse document frequency (TF-IDF) has been used as an emotional feature [14] for the ADR detection task. The scores of emotional words [15] are exploited to find posts containing ADRs in social texts. However, extant experimental results show that word-level emotional scores are insufficient for capturing richer emotional expression [16]. Moreover, researchers [14], [17] have suggested that sentiment analysis is effective in extracting ADRs from social texts. In fact, some emotions are implied only in the whole semantic representation of a post. For instance, one post stated, "I reaaaallly need to take my Paxil, but it makes me feel so delirious and just messed up"; the obvious negative emotion can be found even when explicit emotional words are absent from the post. The whole emotional representation of posts may therefore contribute to further distinguishing between ADR and non-ADR posts.
Moreover, standard medical knowledge bases such as the Unified Medical Language System (UMLS) have been employed in prior studies to detect mentions of ADRs. Adverse event entities have been extracted from patient forums [2] using drug safety databases [18] such as MedEffect and COSTART. Recent studies have also adopted MedDRA and SIDER to better understand and match users' expressions of drugs and ADRs in social media [19]. However, these methods only use medical resources to supplement features rather than to enhance information representation for short social texts, overlooking the insufficient information representation caused by text length limitations.
To tackle the aforementioned limitations, we propose an effective emotional expression and knowledge-enhanced method, which integrates word-level emotional scores and sentence-level emotional context information. Moreover, the proposed model strengthens the potential relationship between drugs and adverse reactions via medical resources. Inspired by BioBERT [20], a pre-trained biomedical language representation model for biomedical text mining, we pre-train a new BERT on a large-scale sentiment analysis corpus to extract sentence-level emotional context information from tweets. The word-level emotional score is calculated from a sentiment dictionary and used as a weight coefficient on the subsequent input. The tweets considered for discovering ADRs generally contain at least one drug name, so posts in which drug names and adverse reactions co-occur are the main targets for extracting adverse reactions. Medical resources such as MedDRA and DrugBank [21] facilitate the construction of co-occurrence pairs of drugs and their adverse reactions. In this paper, we mainly build a drug-ADR co-occurrence pair dictionary from MedDRA and supplement it with drug-related ADR data crawled from a drug website. In addition, the drug-ADR pairs extracted with this co-occurrence dictionary are fed into the model as co-occurrence sub-sentences, helping it focus on the key drug names and adverse reactions and thus improving model performance. The experimental results demonstrate that the drug-ADR co-occurrence pairs increase the recall rate while preserving precision as much as possible. In addition, the word-level emotional score and sentence-level emotional context information help the model improve its overall performance.
The main contributions of this paper are summarized as follows: • The emotional context information extracted by our pre-trained BERT is introduced into our neural network architecture. This helps extract the positive or negative emotions that distinguish ADR posts from non-ADR posts, improving overall performance. Furthermore, the word-level emotional score, used as a weight coefficient on words, contributes to discovering ADRs associated with potential emotional words.
• The co-occurrence sub-sentences generated by the drug-ADR co-occurrence dictionary clearly specify what the model should focus on. These co-occurrence pairs improve the accuracy of positive example classification and lead to an increase in the recall rate.
• State-of-the-art results are obtained on two real-world Twitter datasets (PSB2016 and SMM4H, with F1-scores of 72.64% and 64.98%, respectively) compared to other methods in pharmacovigilance.

II. RELATED WORK
Social texts contain not only abundant emotions but also people's feelings after taking medicines. Researchers use social texts to conduct emotional analysis and detect ADRs.

A. SENTIMENTAL ANALYSIS IN SOCIAL MEDIA
Sentiment analysis involves various research fields such as product recommendation [22], flight service [17] and opinion mining [23]. The data used in sentiment analysis are collected from online networks such as micro-blogs [24] and health forums [25]. Methods for sentiment analysis are roughly divided into pattern-based and machine learning-based approaches. Researchers extract a small number of features from domain knowledge [26] and contextual semantics [27] to train their classifiers. Although these methods achieve good results on different corpora, they are limited by domain dependence. Other researchers have recently turned to studying sentiment analysis via machine learning-based methods [3], [28]. Combined CNN and LSTM models [2], attention mechanisms [29], BERT [30] and Emoji embeddings [1] have successively been applied in sentiment analysis research, greatly improving performance. Researchers have also found a potential relationship between ADRs and emotional analysis in social texts [31]. They employ features such as emotional scores and emotional word frequencies [2]-[4] to classify and extract ADRs. Moreover, researchers analyse in depth the contribution of sentiment analysis to ADR detection [14], [16], [32]. Therefore, deep potential emotional analysis features may enhance the performance of ADR detection from social texts.

B. AUTOMATIC ADR DETECTION FROM SOCIAL TEXTS
In addition to traditional feature-/kernel-based approaches [33], [34], several neural models have been proposed to detect ADRs from social texts in PSB Tasks 1 and 2 [4], including embedding-based models, semi-supervised CNN-based models [35] and RNN-based models [36]. Recently, attentive RNNs [29], [37] have also been used to improve the performance of ADR identification. Multi-head self-attention with various features [38] has some advantages over CNN, CRNN and CNN with an attention mechanism on ADR tweet classification. Transfer learning [8], co-training [9] and multi-task learning [39] have been adopted to extract ADRs, classify tweets mentioning ADRs and normalize ADR concepts, with multi-task learning achieving the state-of-the-art result. With BERT performing well on many NLP tasks, researchers have introduced a knowledge base and a conditional random field (CRF) into BERT for the automatic classification of ADRs (text classification) and the extraction of ADRs (NER), respectively, in SMM4H Shared Task 2019 [19].

III. METHODS
In this study, the social texts collected for detecting ADRs usually contain at least one drug name that co-occurs with symptoms regarded as ADRs, which differs from other text classification datasets. Therefore, the drug-ADR co-occurrence sub-sentence is fed into basic BERT as an auxiliary sentence to enhance sentence representation (Section A). Social media posts contain abundant emotions and feelings; hence, emotional elements are an important cue for detecting ADRs. The sentiment scores of words are multiplied with the output features of basic BERT, and the product is fed into a transformer component to further extract a deep representation of sentences containing co-occurring drugs and ADRs (Section B). Moreover, our pre-trained BERT, obtained by pre-training on a large emotional analysis corpus collected from Twitter, is employed to fully express emotional information (Section C). Finally, the concatenation of the output of the transformer and the [CLS] output of our pre-trained BERT is used as the input of a convolutional neural network, and the final classification result is obtained via a Softmax operation (Section D). The architecture of our model is illustrated in Figure 1.

A. INPUT OF BASIC BERT
The input of basic BERT consists of the masked tweet and the drug-ADR co-occurrence sub-sentence. The drug-ADR co-occurrence sub-sentence is employed to enhance sentence representation, focusing on tweets containing drugs and ADRs. The reason for building the co-occurrence sub-sentence is that the tweets for detecting ADRs are collected from a large number of social media posts according to a pre-defined drug dictionary, and drugs usually co-occur with symptoms regarded as ADRs in positive tweets. Therefore, first, a co-occurrence dictionary, built mainly from the MedDRA database containing approximately 1,430 drugs and their known side effects, is constructed to extract drug-ADR co-occurrence pairs from tweets. Second, our analysis of the experimental data shows that the drug name list provided by MedDRA does not fully cover the drug names in the experimental data. Hence, we crawl the "more common" and "less common" content from the drug site (www.drugs.com) using URLs of the form "https://www.drugs.com/sfx/#drug-sideeffects.html" (where #drug is replaced with the actual drug being crawled) to obtain additional drug-ADR co-occurrence pairs as a supplement to the co-occurrence dictionary. Eventually, a list of 1,494 drugs and their adverse reactions is obtained, and more than 33,000 drug-ADR co-occurrence pairs are extracted, as shown in Figure 2. Then, co-occurrence pairs are matched in tweets using the above-mentioned dictionary. After a drug name is found exactly, we use a greedy algorithm to match the maximum number of words in the tweet that appear in the corresponding ADR part of the co-occurrence entry. For instance, "fluoxetine #ac(h)e" and "citalopram #ac(h)e" are extracted from the tweet "@notquitereal yeah, I mean, fluoxetine made me feel like shit, and citalopram makes me feel ac(h)e, so worth considering if you ever have to".
The sub-sentence is represented as "fluoxetine #ac(h)e, citalopram #ac(h)e", as taking Fluoxetine or Citalopram can cause headaches.
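The dictionary lookup and greedy matching described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the dictionary entries and the `extract_pairs` helper are hypothetical, and the real dictionary is built from MedDRA plus the crawled drugs.com side effects.

```python
# Hypothetical miniature of the drug-ADR co-occurrence dictionary;
# the real one covers 1,494 drugs and over 33,000 pairs.
cooccurrence_dict = {
    "fluoxetine": {"ache", "nausea", "insomnia"},
    "citalopram": {"ache", "dizziness"},
}

def extract_pairs(tweet, dictionary):
    """Return (drug, ADR) pairs whose drug and adverse reaction both occur
    in the tweet. ADR phrases are tried longest-first, mirroring the greedy
    maximum-word matching described above."""
    tokens = tweet.lower().split()
    text = " ".join(tokens)
    pairs = []
    for drug, adrs in dictionary.items():
        if drug in tokens:  # the drug name must be matched exactly
            for adr in sorted(adrs, key=len, reverse=True):  # greedy: longest first
                if adr in text:
                    pairs.append((drug, adr))
                    break  # keep only the maximal match for this drug
    return pairs

tweet = ("@notquitereal yeah, I mean, fluoxetine made me feel like shit, "
         "and citalopram makes me feel ache, so worth considering")
pairs = extract_pairs(tweet, cooccurrence_dict)  # both drugs pair with "ache"
```

The extracted pairs are then concatenated into the auxiliary sub-sentence fed to BERT.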
The final input of basic BERT is denoted as x = ([CLS], s1, [SEP], s2, [SEP]). Here, s1 = (w1, w2, ..., wn) is a piece of text corresponding to a social media post consisting of a sequence of n words, and each wi represents a word in the vocabulary of size V. Moreover, s2 = ((d1, cop1), ..., (dm, copm)), called the co-occurrence sub-sentence, is a sequence of m drugs and their corresponding ADR co-occurrence pairs; each dm and copm represents a drug and its pth ADR in a vocabulary of size m + Diff, where Diff denotes the number of co-occurrence pairs after removing repeated pairs in the experimental dataset.
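As a concrete illustration, the packing of s1 and s2 into the BERT sentence-pair input can be sketched as below. This is a simplified sketch with whitespace-split tokens; real inputs go through WordPiece tokenization first, and the helper name is hypothetical.

```python
def build_bert_input(tweet_tokens, cooccurrence_tokens):
    """Pack the masked tweet (s1) and the co-occurrence sub-sentence (s2)
    into BERT's sentence-pair format: [CLS] s1 [SEP] s2 [SEP]."""
    tokens = ["[CLS]"] + tweet_tokens + ["[SEP]"] + cooccurrence_tokens + ["[SEP]"]
    # segment (token-type) ids: 0 for the tweet part, 1 for the sub-sentence
    segment_ids = [0] * (len(tweet_tokens) + 2) + [1] * (len(cooccurrence_tokens) + 1)
    return tokens, segment_ids

tokens, segments = build_bert_input(
    ["citalopram", "makes", "me", "feel", "ache"],
    ["citalopram", "ache"],
)
```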

B. WORD-LEVEL EMOTIONAL SCORE AND TRANSFORMER
Social texts usually contain more or less positive or negative emotion. Researchers utilize emotional features for social NLP tasks such as sentiment classification [40] and public opinion analysis [41]. Therefore, the emotional score of each word is calculated using SentiWordNet 3.0 as follows:

Score(w) = Score_dict^pos(w) − Score_dict^neg(w) (1)

where Score_dict^neg and Score_dict^pos are the negative and positive scores, respectively, in SentiWordNet 3.0. Then, the product of Score(w) and the sequence output of BERT is fed into the transformer component to further enhance the semantic representation of tweets. The output of the transformer component serves as a partial input to the downstream model.
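A minimal sketch of the word-level scoring and weighting step follows. The lexicon entries are made-up stand-ins for SentiWordNet 3.0, and scaling each vector by (1 + Score(w)), so that neutral words pass through unchanged, is our assumption about how the product with the BERT output is formed.

```python
# Made-up stand-in entries; real scores come from SentiWordNet 3.0.
senti_lexicon = {
    "delirious": {"pos": 0.0, "neg": 0.625},
    "worth": {"pos": 0.25, "neg": 0.0},
}

def word_score(word, lexicon):
    """Score(w) = positive score minus negative score; words absent from
    the lexicon are treated as emotionally neutral."""
    entry = lexicon.get(word, {"pos": 0.0, "neg": 0.0})
    return entry["pos"] - entry["neg"]

def weight_embeddings(words, vectors, lexicon):
    """Scale each word's BERT output vector by (1 + Score(w)) before the
    transformer component (the 1 + offset is an assumption)."""
    return [[x * (1.0 + word_score(w, lexicon)) for x in vec]
            for w, vec in zip(words, vectors)]
```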

C. SENTENCE-LEVEL EMOTIONAL CONTEXT
The innate emotional elements implied in social texts are useful for social NLP tasks [32]. Nevertheless, word-level emotional scores alone are insufficient for capturing richer emotional expression, because some emotions are implied only in the whole semantic representation of a post. Hence, emotional context information is extracted from tweets via BERT to compensate for the deficiency of word-level emotion. However, the performance of BERT depends mainly on the size and quality of the corpora on which it is pre-trained. Since the BERT provided by Google is designed as a general-purpose language model pre-trained on the English Wikipedia and Books Corpus, its training texts are formal and almost unemotional. Conversely, the datasets of our task consist of tweets, which contain richer emotion. Moreover, users' inputs are freer, more irregular and noisier on Twitter than in official texts, resulting in grammatical errors, spelling mistakes and manual abbreviations of words. Therefore, basic BERT, designed for general-purpose natural language understanding, is not suitable for extracting emotional context information from tweets. To obtain the required sentence-level emotional context, we pre-train our BERT using the Sentiment140 dataset (https://www.kaggle.com), which contains 1,600,000 automatically tagged tweets (half positive and half negative). Then, the final hidden output of this BERT is taken as the sentence-level context information, which is also a partial input of our downstream model.

D. OUTPUT LAYER AND LOSS FUNCTION
Finally, the output of the max pooling operation, denoted as h_o = maxpooling(h), is fed into the output layer. This final vector can be regarded as a high-level representation of the tweet and is used as a feature for the ADR classification task:

p = Softmax(W_class · h_o + b_class) (2)

where W_class and b_class are learnable parameters. The class imbalance problem is common in social NLP tasks. According to the results of Wang et al. [28] and our analysis of the numbers of positive and negative examples, the imbalance ratios (the number of negative examples vs the number of positive examples) of both datasets are approximately 10:1 [43]. Inspired by Lin et al. [44], a balanced factor is used to make the model focus more on the under-represented positive examples. The loss function for ADR detection is described in equation (3):

L = − γ Σ_{i=1..S+} log p_i − Σ_{j=1..S−} log(1 − p_j) (3)

where S+ and S− are the numbers of positive and negative examples, respectively, p_i and p_j are the predicted ADR probabilities of the ith positive and jth negative example, and γ is the balanced factor.

IV. EXPERIMENTS
A. DATASETS
The experiments are conducted on two Twitter corpora, PSB2016 and SMM4H2018. The SMM4H2018 dataset [19], which is an extension of the PSB2016-Task1 dataset, consists of approximately 15,000 annotated training tweets and 9,000 annotated test tweets. The tweets, related to drugs prescribed for chronic diseases and the prevalence of drug use, were annotated by two domain experts under the guidance of a pharmacology expert [2]. Both experimental corpora provide only the tweet and user IDs and do not allow sharing of the actual raw tweet text, in order to protect user privacy. Hence, we re-crawled the original texts using the tweet and user IDs via Twitter's Streaming API; only 6,700 (61.9%) and 17,000 (70.8%) tweets are still publicly available in PSB2016 and SMM4H2018, respectively. The datasets and source code for PSB2016 and SMM4H2018 are available at https://github.com/dllzg2012/Co-Senti-BERTCNN.git.

B. DATA FOR CROSS-VALIDATION
There may be no positive examples in the training, validation or test sets when cross-validation data are generated randomly, so the folds are generated keeping the positive-to-negative ratio of the original data unchanged. Precision, recall and F1-score over the ADR class are used as the evaluation metrics:

P_ADR = TP_ADR / (TP_ADR + FP_ADR) (4)

R_ADR = TP_ADR / (TP_ADR + FN_ADR) (5)

F1-score_ADR = 2 · P_ADR · R_ADR / (P_ADR + R_ADR) (6)

where TP_ADR is the number of correctly identified ADR tweets, FN_ADR is the number of ADR tweets misclassified as non-ADR, FP_ADR is the number of non-ADR tweets misclassified as ADR, and M and N are the numbers of ADR and non-ADR tweets, respectively. I(P_ADR, P_non-ADR) is described in equation (8).
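The metrics above can be computed directly from the confusion counts; a small sketch with illustrative counts:

```python
def adr_metrics(tp, fp, fn):
    """Precision, recall and F1-score over the ADR (positive) class,
    following equations (4)-(6); tp, fp and fn are confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 60 ADR tweets found, 20 false alarms, 40 ADR tweets missed
p, r, f = adr_metrics(60, 20, 40)
```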
To demonstrate the effectiveness of our proposed model, we compare it against multiple baseline methods and state-of-the-art approaches for the ADR classification task.

1) TextCNN
This is a classic convolutional neural network for the sentence classification task. TextCNN [42] consists of an input layer, convolution, max pooling, a fully connected layer and a Softmax (output) layer. It serves as the downstream model of our entire model.
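The TextCNN pipeline (convolution, ReLU, max-over-time pooling, fully connected layer, Softmax) can be illustrated with a minimal NumPy sketch; the shapes and random parameters here are arbitrary and purely illustrative, not the trained model.

```python
import numpy as np

def textcnn_forward(embeddings, filters, W, b):
    """Minimal TextCNN forward pass: 1-D convolution over token embeddings,
    ReLU, max-over-time pooling, then a fully connected layer and Softmax."""
    n = embeddings.shape[0]              # number of tokens
    k, w, _ = filters.shape              # k filters of width w
    conv = np.array([
        [np.maximum(0.0, np.sum(embeddings[i:i + w] * f)) for i in range(n - w + 1)]
        for f in filters
    ])                                   # shape (k, n - w + 1), ReLU applied
    pooled = conv.max(axis=1)            # max-over-time pooling -> shape (k,)
    logits = pooled @ W + b              # fully connected output layer
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()               # Softmax over {ADR, non-ADR}

rng = np.random.default_rng(0)
emb = rng.standard_normal((10, 8))        # 10 tokens with 8-dim BERT outputs
filters = rng.standard_normal((4, 3, 8))  # 4 convolutional filters of width 3
W, b = rng.standard_normal((4, 2)), np.zeros(2)
probs = textcnn_forward(emb, filters, W, b)
```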

2) BiLSTM AND BiGRU
Bidirectional long short-term memory networks (BiLSTM, with LSTM as the basic RNN unit) and bidirectional gated recurrent unit networks (BiGRU, with GRU as the basic RNN unit) are natural language processing models applied to the pharmacovigilance task [45].

3) CRNN AND CNNA
CRNN and CNNA [46] were both proposed by Huynh et al. for ADR classification. CRNN is a convolutional neural network concatenated with a recurrent neural network, with GRU as the basic RNN unit and ReLU in the convolutional layer. CNNA is a CNN integrated with an attention mechanism.

4) SEMI-MULTI-CNN
Lee et al. [35] train multiple models on various self-collected tweets and then use majority voting to classify ADR and non-ADR tweets.

5) MT-ATTEN-COV
This is a state-of-the-art model for performing ADR-related tasks on the PSB2016 corpus. MT-Atten-Cov [39] is a multi-task neural network model that collectively learns the ADR-classification, ADR-labelling and ADR-indication tasks with different levels of supervision.

6) BERT+KNOWLEDGE
This is a state-of-the-art model for performing the ADR classification task on SMM4H [19]. The model builds <drug, ADR> pairs, generates binary features and then integrates the features with the output of BERT.

7) BERTCNN
This is our base model for the ADR detection task, integrating BERT and CNN: the output of BERT is used as the input of TextCNN.

8) CO-SENTI-BERTCNN
This is our proposed framework for ADR classification, combining the drug-ADR co-occurrence sub-sentence with sentence-level emotional context information. We use the output of our pre-trained BERT as the sentence-level emotional context information and the word-level emotional score as a weight on each word's influence on the overall classification. Moreover, the drug-ADR co-occurrence sub-sentences allow the model to attend to the dominant parts of tweets when distinguishing between ADR and non-ADR posts.

C. PERFORMANCE COMPARISON WITH OTHER EXISTING METHODS
To show the validity of the proposed model, we report results on the official data split and on our cross-validation split of SMM4H. Table 1 presents the performance comparison between our Co-Senti-BERTCNN method and other state-of-the-art methods on the PSB2016 and SMM4H corpora. Note that in our experiments, the number of tweets we could crawl is not consistent with the number of tweets used by the existing methods, but the proportion of positive and negative examples remains basically unchanged; therefore, the results are comparable to some extent. First, on PSB2016, TextCNN achieves only 42.74% precision, 50.00% recall, a 46.08% F1-score and an AUC of 0.7127. The recall obtained by CNNA is higher thanks to the attention mechanism, which helps focus on positive ADR tweets. Semi-Multi-CNN achieved state-of-the-art results in 2017 owing to its variety of tweets, which are useful for improving precision. Furthermore, MT-Atten-Cov achieved state-of-the-art performance in 2018 by employing an attention mechanism and multi-task learning, mainly because its ADR-labelling task helps improve the recall rate and thus the F1-score. Compared with MT-Atten-Cov, the proposed method reduces precision by 2 percentage points while increasing the recall rate by 4 percentage points, bringing the F1-score to 72.64%. The likely reasons are that the co-occurrence sub-sentence helps the model focus on positive tweets and that sentence-level context information is useful for improving precision; in other words, the two components balance the overall performance. Second, on SMM4H, the proposed model achieves 0.6373 precision, 0.6628 recall and a 0.6498 F1-score. Compared with BERT+Knowledge, however, the model loses 2.6 percentage points of recall and gains 3 percentage points of precision.
We suspect that the co-occurrence sub-sentence introduces noise when the dataset contains more non-ADR tweets (note that SMM4H contains more noisy data and fewer positive tweets [28]). Nevertheless, sufficient emotional expression (word-level emotional score and sentence-level emotional context) improves the F1-score.
To verify the generalization ability of our method, Tables 2 and 3 compare our Co-Senti-BERTCNN method with the other baseline methods on the cross-validation data of SMM4H. From Table 2, we observe that BERTCNN achieves the best recall rate and AUC value, whereas Co-Senti-BERTCNN obtains the best precision and F1-score. Moreover, the precision and recall of our method are close to each other, while the recall of BERTCNN is higher than its precision. This suggests that, in generalization, our method reduces the recall rate and improves precision to achieve better overall performance, and that emotional expression has a certain impact on the recall rate. Table 3 presents the results of each fold using our method. On the 5-fold cross-validation data, the maximum differences are 13.58, 9.07 and 3.95 percentage points in precision, recall and F1-score, respectively. This shows that, with the positive-to-negative ratio kept unchanged, our method fluctuates most in precision, followed by recall, and remains relatively stable in F1-score and AUC. Therefore, the F1-score of our method is stable in terms of generalization performance.

D. THE EFFECT OF WORD-LEVEL EMOTIONAL SCORE, SENTENCE-LEVEL EMOTIONAL CONTEXT AND CO-OCCURRENCE SUB-SENTENCE
The effect of three key components on the performance of our model is investigated on the PSB2016 and SMM4H datasets, namely, the drug-ADR co-occurrence sub-sentence (CoSen), the word-level emotional score (WEmoS) and the sentence-level emotional context (SEmoCTX), described in Sections III.A, III.B and III.C, respectively, as shown in Table 4. BERTCNN, which feeds the final hidden state of the BERT provided by Google into the downstream TextCNN, is regarded as the baseline. The three key components are then gradually introduced into the baseline model.
When CoSen is added to the baseline, the recall rate on PSB2016 and SMM4H increases markedly, while precision decreases by at least 10%. This result shows that CoSen can indeed help the proposed model focus on ADR tweets, but to a certain extent it also misleads the model into concentrating on tweets containing co-occurrence pairs. When SEmoCTX is then introduced, the recall rate decreases by 10% and precision increases by 10% on SMM4H; on PSB2016, however, SEmoCTX yields a 15% increase in precision and a 1% increase in recall. These results show that SEmoCTX mainly contributes to improving precision. Moreover, SEmoCTX has a clearer advantage on datasets with little noise and relatively balanced data. This implies that SEmoCTX carries abundant global context information, which can compensate for the misleading effect of CoSen and improve overall performance. Finally, WEmoS highlights the dominant emotional words to further balance precision and recall, leading to a small increase in the recall rate. As a result, the F1-score is improved.

E. DIFFERENT PERFORMANCE OF DIFFERENT BALANCED FACTORS
The experimental datasets are unbalanced, and the loss function of the proposed model therefore introduces a balanced factor γ. This section examines performance under different values of γ. Experiments are conducted on PSB2016 and SMM4H with γ set to 1, 2, 3, 4 and 5, as shown in Figures 4 and 5. The model obtains the best F1-score when γ = 3, which is equivalent to a positive-to-negative ratio of 1:3. A similar conclusion is reached by Liu et al. [47].
First, on PSB2016, Figure 4 shows that the proposed model obtains the best precision when γ = 2, while the best recall rate and AUC are achieved when γ = 3. Second, on SMM4H, Figure 5 shows that the model obtains the best precision when γ = 1 but achieves the best recall rate and AUC when γ = 4. The proposed method behaves differently under different values of γ on the two datasets because the proportions of positive (ADR) and negative (non-ADR) tweets differ (1:9.6 for PSB2016 and 1:16 for SMM4H), and γ itself serves to balance the proportions of positive and negative examples.
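As an illustration of how the balanced factor enters the objective, a γ-weighted cross-entropy can be sketched as follows. The exact form of the paper's loss is an assumption here; the sketch only shows γ up-weighting the loss on positive (ADR) examples.

```python
import math

def balanced_bce(probs, labels, gamma=3.0):
    """Cross-entropy with a balanced factor: losses on positive (ADR)
    examples are multiplied by gamma, so gamma = 3 re-weights the positive
    class threefold. probs are predicted P(ADR); labels are 1 for ADR
    tweets and 0 for non-ADR tweets."""
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            total += -gamma * math.log(p)    # up-weighted positive term
        else:
            total += -math.log(1.0 - p)      # standard negative term
    return total / len(labels)
```

With γ = 1 this reduces to ordinary cross-entropy; increasing γ shifts the optimum toward higher recall on the minority ADR class.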

VI. ERROR ANALYSIS
To quantitatively analyse the effect of emotional expression on ADR classification, we download the code of the combined CNN and LSTM model in [48] from https://github.com/pmsosa/CS291K and feed the test set into this emotional classification model to obtain the corresponding emotional labels. As shown in Table 5, the sentiment labels of the 2nd, 5th and 7th tweets are obtained from this model, along with other tweets for which our best method made prediction errors or for which our method and the baseline BERTCNN disagree.
In the first section of Table 5, we show two examples for which our method (Co-Senti-BERTCNN) and the baseline disagree in their predictions. The proposed method predicts the first tweet as a true ADR tweet, but BERTCNN gives it a non-ADR label. The reason for this difference is that our model utilizes the drug-ADR co-occurrence pair "humira#ad red sick sic vomiting sting hot". In addition, the sentiment label of the second tweet is predicted as negative, and the tweet is indeed an ADR tweet; the proposed method obtains the right label by capturing the negative emotions hidden in the tweet.
In the second part, we first present a tweet that the model predicts incorrectly due to mislabelling. Although rich emotional expression and the co-occurrence sub-sentence are useful for identifying ADR posts containing co-occurrence pairs or negative emotions, they also introduce some noise and an excessive focus on co-occurrence pairs, which results in classification errors on posts such as the 4th (containing a drug-ADR co-occurrence pair) and 5th (containing negative emotions) tweets. In fact, not all tweets containing co-occurrence pairs or negative emotions describe an ADR, even when they also contain a drug. Nevertheless, our model labels them as ADR tweets, making them false positive examples, such as the 6th and 7th tweets.

VII. CONCLUSION
Discovering ADRs on social media has recently become a major research trend due to the widespread and real-time nature of social media and the limitations and lags of clinical experiments. However, due to insufficient expression of emotion and inadequate information representation in short social texts, existing methods do not achieve satisfactory performance. In this paper, we propose a neural network model for the ADR detection task. The model uses the word-level emotional score and the sentence-level emotional context obtained by our pre-trained BERT to capture tweets that contain negative emotions; these negative emotions may be an inherent clue to ADR posts. Moreover, we generate co-occurrence sub-sentences using the drug-ADR medical dictionary. These sub-sentences help the proposed method extract the dominant hidden features for distinguishing between ADR and non-ADR tweets, resulting in an increase in the recall rate and overall performance. The experimental results and analysis show that the word-level emotional score and sentence-level emotional context contribute to improving precision and the overall performance of ADR classification. In addition, the co-occurrence sub-sentence reduces precision in part, but it improves the recall rate and the F1-score. However, further improvement is needed on datasets such as SMM4H that contain more non-ADR tweets. Therefore, improved BERT variants and additional features will be considered in future work. In addition, more medical knowledge may be incorporated into our model.