A Graph Attention Model for Dictionary-Guided Named Entity Recognition

The lack of human annotations has been one of the main obstacles for neural named entity recognition in low-resource domains. To address this problem, there have been many efforts on automatically generating silver annotations according to domain-specific dictionaries. However, the information in domain dictionaries is usually limited, and the generated annotations may be noisy, which poses significant challenges for learning effective models. In this work, we try to alleviate these issues by introducing a dictionary-guided graph attention model. First, domain-specific dictionaries are utilized to extract entity mention candidates via a graph matching algorithm, which can capture word patterns of domain entities. Furthermore, a word-mention interactive graph is leveraged to integrate the semantic and boundary information of entities into their context. We evaluated our model on biomedical-domain datasets for recognizing chemical and disease entities, namely the BC5CDR and NCBI disease corpora. The results show that our model outperforms several state-of-the-art models with different methodologies, such as feature-based models (e.g., BANNER), ensemble models (e.g., CollaboNet), multi-task learning models (e.g., MTM-CW), and dictionary-based models (e.g., AutoNER). Moreover, the performance of our model is also comparable with that of BioBERT, which has a huge number of parameters and requires large-scale pre-training.


I. INTRODUCTION
Named entity recognition (NER) is a task that extracts entity mentions from texts and classifies them into predefined types, such as person and location in the general domain [1], [2], and disease and chemical in the biomedical domain [3], [4]. NER is a fundamental task in natural language processing (NLP) and bioinformatics, and is crucial for many downstream applications including relation extraction [5] and event extraction [6]. A growing number of studies have investigated this task with various approaches [2], [7]. Recently, neural networks, especially BiLSTM-CRF [2], have become popular approaches and achieved state-of-the-art performance. However, neural networks need a large amount of annotated data. When the data are not enough (e.g., in low-resource domains), the performance degrades significantly [8].

(The associate editor coordinating the review of this manuscript and approving it for publication was Navanietha Krishnaraj K. Rathinam.)
To address this issue, domain-specific dictionaries have been used to generate silver annotations for NER in low-resource domains. A domain-specific dictionary is a manually curated list of entity names that belong to a specific entity type. The dictionary can be employed to extract entity mentions by string matching [9] or regular expression matching [10]. However, we observed two limitations in such methods. The first is that the information in a dictionary is not fully utilized. For example, consider a new disease name ''ulcer of nose'' that is not in the dictionary; it cannot be correctly extracted if simple string matching is used. Nevertheless, the disease dictionary contains two other disease entities: ''ulcer of anus'' and ''cancer of nose''. As shown in Figure 1, if we represent their word formation with a graph, we can infer the new disease name ''ulcer of nose'' via the graph. This observation motivates us to leverage a word formation graph to extract entity mention candidates.
The second limitation is that most NER errors come from inaccurate boundary detection. As shown in Figure 2, the sentence ''Provocation of postural hypotension by nitroglycerin in diabetic autonomic neuropathy.'' contains two disease entities, but these entities can match seven mention candidates in the dictionary. The widely-used NER model, BiLSTM-CRF [11], may suffer from sparse boundary tags [12], and ambiguous matching makes the problem worse. Liu et al. [13] observed no significant performance improvement when directly using the entity mentions matched with the dictionary as additional training data.
In this work, we explored the NER task in low-resource domains using domain-specific dictionaries. In particular, we studied disease and chemical entity recognition using related dictionaries in the biomedical domain. We propose a graph attention BiLSTM-CRF model (GAT-BiLSTM-CRF), whose architecture is illustrated in Figure 2. Our model first extracts entity mention candidates from text using a word formation graph, which is designed to capture word formation patterns of entities and model the connections between the words in domain dictionaries. From our observation of the datasets, the words in an entity play different roles in NER.
Some words are pivot words, which are usually located at the head or tail of an entity. In contrast, other words are decorative words, which are located in the middle of an entity. Hence, we label each node using the label scheme ''BIES'' (i.e., beginning, inside, end and single) [14] according to its position in an entity. Furthermore, to integrate both the semantic and boundary information of entity mention candidates into the neural network, we construct a word-mention interactive graph, which models the connections between mentions and words. Afterwards, a graph attention mechanism is used to alleviate the influence of false-negative candidates.
We evaluate our approach on two NER datasets, namely the BC5CDR and NCBI disease corpora [15], [16]. The experimental results show that our model significantly outperforms different kinds of baselines [8]-[10], [14], [17], [18] and is even comparable with a pre-trained model, BioBERT [19]. To summarize, our main contributions include: 1) We propose a graph attention BiLSTM-CRF model (GAT-BiLSTM-CRF) using a word-mention interactive graph to integrate the semantic and boundary information of entities.
2) We propose a graph matching algorithm to improve the quality and coverage of matched mention candidates in low-resource domains.

II. RELATED WORK
Early studies on NER heavily focused on feature-based methods [20]. A number of recent neural network approaches have also been applied to NER, such as [21]-[24]. Such neural models are increasingly common in domain NER tasks [25], [26]. However, the performance of these methods degrades significantly due to the lack of resources in some domains such as biomedicine and social media.

Figure 2. The architecture of our NER model with the graph attention mechanism and domain-specific dictionary. The left part shows the overall network architecture, including the embedding layer, bidirectional LSTM layer, graph layer and CRF layer. The right part shows the process of extracting entity mention candidates and building word formation graphs.

VOLUME 8, 2020

A. LOW-RESOURCE NER
Many recent techniques have been proposed to tackle low-resource domain NER, such as pre-trained embeddings [19], multi-task learning [8], multiple models [14], and transfer learning [27]. Instead of using the above techniques, we exploit a domain-specific dictionary for NER. There have been recent attempts at the distantly supervised NER task. SwellShark [10] utilizes a collection of dictionaries, ontologies, and, optionally, heuristic rules to generate annotations and predict entity mentions in the biomedical domain without human-annotated data. Shang et al. [28] use exact string matching to generate pseudo-annotated data and apply high-quality phrases to reduce the number of false-negative annotations. However, their annotation quality is limited because of the low coverage of dictionaries, which leads to a relatively low recall. Compared with distantly supervised NER, there are two main differences in our work. One is that a domain-specific dictionary is used to create a word formation graph, which can capture variants of entities and discover as many new entity mentions as possible. The other is that matched entities are used to enrich the words' representations instead of generating pseudo-annotated training data.

B. GRAPH CONVOLUTIONAL NETWORK
There are a number of recent graph convolutional network (GCN) architectures [29]-[31] for learning over graphs. Our work is closely related to graph attention networks (GAT), introduced by [30], which leverage masked self-attention layers to assign different importance to neighboring nodes. Sui et al. [32] presented a collaborative GAT to integrate self-matched lexical knowledge in Chinese NER, which aims to eliminate the ambiguity of Chinese text. Inspired by Sui's work, we use a GAT to incorporate the information of mention candidates and filter noisy false-negative candidates.

III. APPROACH
In this section, we first introduce the construction of the word-mention interactive graph. Then we detail the architecture of the GAT-BiLSTM-CRF model for the domain NER task.

A. WORD-MENTION INTERACTIVE GRAPH
As shown on the left in Figure 2, to construct a word-mention interactive graph, we need to first create a word formation graph of entities and extract mention candidates using the graph matching algorithm.

1) WORD FORMATION GRAPH
Given a domain-specific dictionary, the vertex set consists of all words contained in the dictionary. We use the lemmas of words as vertices to cover variants of entities in our experiment. If two words i and j are in the same entity, they are connected with an undirected edge instead of a directed edge as in Figure 1, which can expand the coverage of the dictionary. Besides, to capture the positional relationship of words in an entity, we introduce a tag set ''BIES'', which refers to beginning, inside, end and single respectively. For example, in the word formation graph of Figure 2, the word ''hypotension'', labeled with ''BIES'', may appear at the beginning, middle or end of an entity, and may also be a single-word entity.
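The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, whitespace tokenization, and lowercasing as a stand-in for lemmatization are our own simplifying assumptions.

```python
from collections import defaultdict

def build_word_formation_graph(dictionary, lemmatize=lambda w: w.lower()):
    """Build an undirected word formation graph from a list of entity names.

    Vertices are word lemmas; an undirected edge links adjacent words of the
    same entity, and each vertex accumulates BIES tags according to the
    positions in which the word occurs across the dictionary entries.
    """
    edges = defaultdict(set)   # lemma -> set of neighboring lemmas
    tags = defaultdict(set)    # lemma -> subset of {"B", "I", "E", "S"}
    for entity in dictionary:
        words = [lemmatize(w) for w in entity.split()]
        if len(words) == 1:
            tags[words[0]].add("S")       # a single-word entity
            edges.setdefault(words[0], set())
            continue
        for i, w in enumerate(words):
            tags[w].add("B" if i == 0 else "E" if i == len(words) - 1 else "I")
        for a, b in zip(words, words[1:]):  # undirected adjacency
            edges[a].add(b)
            edges[b].add(a)
    return edges, tags
```

With the dictionary {''ulcer of anus'', ''cancer of nose''} from the introduction, ''of'' becomes adjacent to both ''ulcer'' and ''nose'', which is exactly what allows the unseen name ''ulcer of nose'' to be matched later.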

2) MENTION EXTRACTION VIA GRAPH MATCHING
After building a word formation graph, we extract mention candidates using a graph matching algorithm. Given a sentence, all word sub-sequences are scanned to detect whether they are entity mentions. Specifically, by walking through the word formation graph, a word sub-sequence is regarded as a candidate if it can be fully matched in the graph while satisfying the constraints of the nodes' tags. The details of the matching algorithm are shown in Algorithm 1.

Algorithm 1 Mention Candidate Detection Algorithm
Input: a word sequence S = (w_1, w_2, . . . , w_m), a word formation graph G(v, e, l). Output: S is a mention or not.
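Since the body of Algorithm 1 is not reproduced here, the following is only a plausible Python sketch of the check it describes, assuming the graph representation from the previous subsection (edge sets plus BIES tag sets per lemma); the position-constraint policy shown is our reading of the text, not the authors' exact rule.

```python
def is_mention_candidate(words, edges, tags, lemmatize=lambda w: w.lower()):
    """Check whether a word sub-sequence can be matched as a mention candidate.

    The sub-sequence is accepted if every consecutive word pair is connected
    in the word formation graph and every word carries a BIES tag consistent
    with its position in the sub-sequence.
    """
    lemmas = [lemmatize(w) for w in words]
    if any(l not in tags for l in lemmas):
        return False
    if len(lemmas) == 1:
        return "S" in tags[lemmas[0]]
    # Position constraint: first word must allow "B", last "E", middle "I".
    expected = ["B"] + ["I"] * (len(lemmas) - 2) + ["E"]
    if any(t not in tags[l] for l, t in zip(lemmas, expected)):
        return False
    # Walk the graph: each adjacent word pair must share an edge.
    return all(b in edges.get(a, set()) for a, b in zip(lemmas, lemmas[1:]))
```

Scanning every sub-sequence of a sentence through this check yields the set of mention candidates passed on to the interactive graph.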

3) WORD-MENTION INTERACTIVE GRAPH CONSTRUCTION
After extracting all possible entity mention candidates, we need to construct a word-mention interactive graph to integrate the formation of mentions into their sentences. The vertex set of the graph is made up of the words in the input sentence and the mention candidates extracted from it. For example, as shown in Figure 2, the vertex set is V = V_word ∪ V_candidate, where V_word = {Provocation, of, . . . , neuropathy} and V_candidate = {postural hypotension, hypotension, . . . , neuropathy}. To represent the edge set, an adjacency matrix is introduced, whose elements indicate whether pairs of vertices are adjacent in the graph. As shown in Figure 2, if a mention i contains a word j, the (i, j)-entry is assigned a value of 1, and if a word i is the neighbor of word j, the (i, j)-entry is also assigned a value of 1. Intuitively, word-word entries capture contextual information, and word-mention entries capture the semantic and boundary information of the mentions.
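The adjacency matrix described above can be sketched as follows. This is an illustrative construction under our own conventions: mention candidates are given as (start, end) word spans with exclusive end, and the self-loops on the diagonal are our assumption (a common choice for graph neural networks), not something the text specifies.

```python
import numpy as np

def build_interactive_adjacency(n_words, mentions):
    """Build the adjacency matrix of the word-mention interactive graph.

    The first n_words rows/columns are word nodes, each linked to its
    neighbors in the word sequence; the remaining rows/columns are mention
    nodes, each linked to every word it contains. `mentions` is a list of
    (start, end) word spans with exclusive end.
    """
    n = n_words + len(mentions)
    A = np.zeros((n, n), dtype=np.float32)
    for i in range(n_words - 1):           # word-word: sequence neighbors
        A[i, i + 1] = A[i + 1, i] = 1.0
    for k, (start, end) in enumerate(mentions):
        m = n_words + k                    # index of the mention node
        for j in range(start, end):        # word-mention containment
            A[m, j] = A[j, m] = 1.0
    np.fill_diagonal(A, 1.0)               # self-loops (our assumption)
    return A
```

For the example sentence, the mention ''postural hypotension'' would be a span covering its two word positions, producing two word-mention entries in the matrix.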

B. MODEL
The overall architecture of our proposed model is shown in Figure 2. It includes an embedding layer, a BiLSTM layer, a graph layer, and a CRF layer. The embedding layer and the BiLSTM layer encode the contextual information of the sentence. The graph layer is based on GAT [30] for modeling over word-mention interactive graphs. Finally, a standard CRF model is used for decoding labels.

1) ENCODING
The input of the model is a sentence and all mention candidates that match consecutive sub-sequences of the sentence. We denote the sentence as s = {w_1, w_2, . . . , w_n}, where w_i is the i-th word, and denote the mention candidates as l = {m_1, m_2, . . . , m_m}. A word is represented by concatenating its word embedding and its character representation: x_i = [e_w(w_i); x^c_i], where e_w denotes a word embedding lookup table and x^c_i denotes the character representation. Following [2], [33], we adopt a Bi-LSTM for character encoding.
To capture contextual information, the word representations x_1, x_2, . . . , x_n are fed into a bidirectional LSTM layer. Finally, the forward and backward hidden states are concatenated as the final contextual representation.
Each mention m_i is represented as a semantic vector, denoted as w_{m_i}. We concatenate the contextual representations and the mention embeddings as the output of this layer, denoting it as Node_f.

2) GRAPH ATTENTION NETWORK
We use a Graph Attention Network (GAT) to model over the word-mention interactive graph. In an M-layer GAT, the input of the j-th layer is a set of node features, NF^j = {f_1, f_2, . . . , f_N}, together with an adjacency matrix A, f_i ∈ R^F, A ∈ R^{N×N}, where N denotes the number of nodes and F is the dimension of the features at the j-th layer. The output of the j-th layer is a new set of node features, NF^{j+1} = {f'_1, f'_2, . . . , f'_N}. A GAT operation with K independent attention heads can be written as:

f'_i = ⊕_{k=1}^{K} σ( Σ_{j∈N_i} α^k_{ij} W^k f_j ),

α^k_{ij} = exp( LeakyReLU( a^T [W^k f_i ⊕ W^k f_j] ) ) / Σ_{u∈N_i} exp( LeakyReLU( a^T [W^k f_i ⊕ W^k f_u] ) ),

where ⊕ denotes the concatenation operation, σ is a non-linear activation function, N_i is the neighborhood of node i in the graph, α^k_{ij} are the attention coefficients, W^k ∈ R^{F×F}, and a ∈ R^{2F} parameterizes a single-layer feed-forward neural network. Note that the dimension of the output f'_i is KF. At the last layer, averaging is adopted instead of concatenation, and the dimension of the final output features is F.
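The multi-head attention operation above can be sketched in NumPy as follows. This is a hedged illustration of one GAT layer, not the authors' code: the choice of tanh for σ, the large negative constant used to mask non-neighbors, and the splitting of the attention vector a into two halves are our own implementation conventions.

```python
import numpy as np

def gat_layer(NF, A, W_list, a_list, leaky=0.2):
    """One multi-head GAT layer over node features NF (N x F).

    W_list[k] is the F x F projection of head k and a_list[k] its 2F
    attention vector; neighborhoods are read off the adjacency matrix A.
    Head outputs are concatenated, giving N x (K*F) features.
    """
    heads = []
    for W, a in zip(W_list, a_list):
        H = NF @ W                                  # projected features, N x F
        F = H.shape[1]
        # e[i, j] = LeakyReLU(a^T [W f_i || W f_j]) for all node pairs,
        # computed by splitting a into its two F-dimensional halves.
        e = (H @ a[:F])[:, None] + (H @ a[F:])[None, :]
        e = np.where(e > 0, e, leaky * e)           # LeakyReLU
        e = np.where(A > 0, e, -1e9)                # mask non-neighbors
        alpha = np.exp(e - e.max(axis=1, keepdims=True))
        alpha = alpha / alpha.sum(axis=1, keepdims=True)  # row-wise softmax
        heads.append(np.tanh(alpha @ H))            # sigma = tanh here
    return np.concatenate(heads, axis=1)            # N x (K*F)
```

Stacking such layers with the word-mention adjacency matrix lets each word attend over the mention candidates that contain it, which is how mention semantics flow into the word representations.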
The output of all node features is denoted as G, where G ∈ R^{F×(n+m)}. We keep the first n columns of this matrix and discard the last m columns, because only the word representations are used to decode labels.
Finally, the input of the CRF layer is denoted as:

r_i = W_1 h_i + W_2 g_i,

where W_1 and W_2 are trainable matrices, and h_i and g_i are the BiLSTM and GAT outputs of the i-th word. The new representation R of the sentence integrates the contextual information from the BiLSTM layer and the semantic and boundary information of the entities from the GAT layer.

3) DECODING AND TRAINING
A standard CRF layer is used to capture the dependencies between successive labels. The input of the CRF layer is R = {r_1, r_2, . . . , r_n}, and the conditional probability of the gold tag sequence y = {l_1, l_2, . . . , l_n} is

p(y | s) = exp( Σ_i ( W^{l_i}_{CRF} r_i + T^{l_{i-1}, l_i}_{CRF} ) ) / Σ_{y'} exp( Σ_i ( W^{l'_i}_{CRF} r_i + T^{l'_{i-1}, l'_i}_{CRF} ) ).

Here y' is an arbitrary label sequence, W^{l_i}_{CRF} models the emission potential for the i-th word in the sentence, and T^{l_{i-1}, l_i}_{CRF} is the transition matrix storing the score of transferring from l_{i-1} to l_i.
We use the Viterbi algorithm to find the highest-scoring label sequence. Given manually annotated training data {(s_1, y_1), (s_2, y_2), . . . , (s_n, y_n)}, a sentence-level log-likelihood loss with L2 regularization is used to train the model. The loss function is defined as:

L = − Σ_i log p(y_i | s_i) + (λ/2) ||θ||^2,

where λ denotes the L2 regularization parameter and θ is the set of all trainable parameters.
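The Viterbi decoding step can be illustrated with a short sketch. The function and variable names are ours: the emission matrix plays the role of the scores W_CRF r_i and the transition matrix that of T_CRF, under standard linear-chain CRF assumptions.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label sequence under a linear-chain CRF.

    emissions: n x L matrix of per-word label scores; transitions: L x L
    matrix where transitions[p, q] scores moving from label p to label q.
    Returns the best label index sequence of length n.
    """
    n, L = emissions.shape
    score = emissions[0].copy()            # best score ending in each label
    backptr = np.zeros((n, L), dtype=int)  # best previous label per step
    for t in range(1, n):
        # candidate[p, q]: best path ending at t-1 with p, then p -> q at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):          # follow back-pointers
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```

The dynamic program runs in O(n L^2) time, which is why exact decoding remains practical even with the BIES-style tag sets used here.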

IV. EXPERIMENTS
In this section, we carry out the experiments to investigate the effectiveness of our approach.

A. EXPERIMENTAL SETTINGS 1) DATASETS
We evaluate the performance of our models on two datasets: the BioCreative V Chemical Disease Relation (BC5CDR) dataset [16] and the NCBI disease dataset [15]. The BC5CDR dataset consists of 1,500 PubMed abstracts, which have been equally separated into a training set (500), a development set (500) and a test set (500). The dataset contains 12,852 disease and 15,935 chemical mentions. The NCBI disease dataset consists of 793 PubMed abstracts, which have been separated into a training set (593), a development set (100), and a test set (100). The dataset contains 6,892 disease mentions. We use two types of dictionaries to recognize disease and chemical entities, respectively. The overall statistics of the datasets are provided in Table 1.

2) DOMAIN-SPECIFIC DICTIONARY
We build two types of dictionaries: a disease dictionary for disease mention recognition and a chemical dictionary for chemical mention recognition. We construct and process dictionaries and high-quality phrases in the same way as in [9]. We combine the MeSH database 1 and the Comparative Toxicogenomic Database (CTD) Chemical and Disease vocabularies 2 as dictionaries. The phrases are mined from titles and abstracts of PubMed papers using the phrase mining method proposed in [9]. As suggested in [9], we apply tailored dictionaries to reduce false-positive matching.

3) METRIC
Following the standard setting, we evaluate the methods using the micro-averaged F1 score and also report precision (P) and recall (R) in percentage. All the reported scores are averaged over ten different runs. Table 2 shows the values of the hyperparameters of our models. We do not use hand-crafted features; only words and characters are considered as inputs. We stop the training when we find the best results on the development set.

B. BASELINES
To further demonstrate the effectiveness of our model, we compare it with previous state-of-the-art methods for disease and chemical mention recognition. The compared models are as follows: • BANNER [17] is a CRF-based recognition model using a rich feature set.
• TaggerOne [18] uses a semi-Markov linear classifier while performing normalization and NER during training and prediction.
• CollaboNet [14] consists of multiple BiLSTM-CRF models for biomedical named entity recognition. The target model can obtain information from collaborator models trained on different biomedical datasets.
• BioBERT [19] is a domain-specific language representation model pre-trained on large-scale biomedical corpora, including 18B words, which is based on the BERT architecture.
• MTM-CW [8] is a multi-task BiLSTM-CRF learning framework for BioNER, which shares character- and word-level information among relevant biomedical entities across differently labeled corpora.
• SwellShark [10] is a distantly supervised method designed for the biomedical domain, especially the BC5CDR and NCBI-Disease datasets. It requires human effort to customize regular expression rules and hand-tune candidates.
• AutoNER [9] is a recent state-of-the-art distantly supervised method. After dictionary matching, it trains a BiLSTM-Softmax architecture.

C. RESULTS
We present recall, precision, and F1 scores on all datasets in Table 3. Since MTM-CW, SwellShark and AutoNER did not report results on recognizing disease and chemical entities on BC5CDR separately, we ran their models on the BC5CDR-Chemical and BC5CDR-Disease datasets for a fair comparison with the other models. These scores are denoted with asterisks. From Table 3, one can find that our model achieves the best performance on BC5CDR-Disease, and the second-best scores on BC5CDR-Chemical and NCBI-Disease, compared with baselines using popular techniques, including feature engineering, multiple models, multi-task learning, pre-training, and distant supervision. For a fair comparison with the BioBERT model [19], we also use the word representations generated by BioBERT to compute the word and mention representations of our model, which are not fine-tuned during NER training. Compared with BioBERT, currently the best model in the biomedical domain, our model GAT-BiLSTM-CRF (BioBERT) significantly outperforms it on the BC5CDR-Disease dataset and achieves comparable results on the BC5CDR-Chemical and NCBI-Disease datasets. Compared with both the SwellShark and AutoNER models, GAT-BiLSTM-CRF achieves a significant improvement. This demonstrates that our model is more effective than the distantly supervised models in reducing the impact of noisy data. We also provide further analysis of disease and chemical recognition, respectively.

1) DISEASE RECOGNITION
According to Table 3, GAT-BiLSTM-CRF (BioBERT) achieves F1-scores of 89.95% and 89.41% on the BC5CDR-disease and NCBI-disease test sets, respectively. Compared with BioBERT, GAT-BiLSTM-CRF (BioBERT) improves the F1-score by 3.39% on the BC5CDR-disease test set and is slightly higher by 0.05% on the NCBI-disease test set. The improvement on NCBI-disease is thus lower than that on BC5CDR-disease. A possible reason is that the coverage rate of matched mentions is higher on BC5CDR-disease than on NCBI-disease, which will be analyzed in the next section.

2) CHEMICAL RECOGNITION
In Table 3, GAT-BiLSTM-CRF (BioBERT) achieves an F1-score of 93.50% on the BC5CDR-chemical test set for chemical recognition. It is slightly higher (by 0.06%) than BioBERT, while it is significantly higher than all the other models.

1) EFFECTIVENESS OF THE DICTIONARY
The dictionary is used to extract entity mention candidates from the datasets. The extraction algorithm greatly affects the quality of the extracted mentions. We use precision and recall to evaluate the accuracy and the coverage of mention candidates on the training set. We compare our graph matching algorithm with the exact string matching algorithm in the experiments. As shown in Table 4, the graph matching algorithm is able to boost the recall (coverage) by a large margin, increasing it by 18.72%, 21.63%, and 14.86% on the BC5CDR-Disease, BC5CDR-Chemical and NCBI-Disease datasets respectively, while greatly reducing the precision (which is inevitable due to the additional noise introduced by the graph matching algorithm).
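The candidate-quality evaluation described here amounts to span-level precision and recall over the extracted mentions, which can be sketched as follows; the helper is hypothetical, and exact-boundary matching of spans is our assumption about how candidates are scored against gold mentions.

```python
def span_precision_recall(predicted, gold):
    """Span-level precision/recall of extracted mention candidates.

    predicted and gold are sets of (sentence_id, start, end) spans; a
    candidate counts as a true positive only on an exact boundary match.
    """
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

Under this metric, the graph matching algorithm trades precision for recall: it proposes more spans, so more gold mentions are covered, at the cost of extra false-positive candidates.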
According to Table 4, the word-mention interactive graph constructed by the graph matching algorithm contains more noise than the graph constructed by the exact string matching algorithm. We evaluate the effectiveness of the two interactive graphs on GAT-BiLSTM-CRF. Table 5 shows the performance on the development and test sets of the three datasets. It can be seen that, despite the additional noise, the graph matching algorithm helps GAT-BiLSTM-CRF achieve better recall with only a slight decrease in precision. There are two possible reasons. One is that the coverage of entity mention candidates on the test data has an important effect on model performance.
Intuitively, the higher the coverage of matched entities on the test data, the better the performance. To investigate the effect of the coverage on our model, we manually construct six dictionaries by randomly removing or adding entity names from the training and test data, with coverage between 0% and 100%. We show the F1-scores on the BC5CDR-disease and BC5CDR-chemical datasets in Figure 3. We find that the F1-score is 79.16% when the coverage is 0% on BC5CDR-disease, while the F1-score is 93.78% when the coverage is 100% on BC5CDR-chemical. On both datasets, the higher the coverage, the larger the F1-score. Moreover, we also see that the F1-scores of our GAT-BiLSTM-CRF and BiLSTM-CRF are equal when the coverage is 57.63% and 62.17% for BC5CDR-disease and BC5CDR-chemical, respectively. This is presumably due to a higher false-positive rate introduced by the matched mentions. The other reason is that the graph attention module can effectively alleviate the impact of noise, which is analyzed in the next section.

2) EFFECTIVENESS OF THE MODEL
The graph attention network (GAT) is the core module of the GAT-BiLSTM-CRF model, which integrates the extracted mention information and filters out noisy mentions. To test the validity of the GAT, we compare GAT-BiLSTM-CRF with models obtained by removing the GAT module and by using a graph convolutional network (GCN) instead of GAT. Table 6 gives the scores on the three datasets. The third row shows the result of the BiLSTM-CRF model without the GAT. We can see that there is a big drop in all of the cases, which demonstrates that the GAT module can help integrate effective information. The fourth row is GCN-BiLSTM-CRF with a GCN instead of GAT. It demonstrates that the GAT can effectively filter the noisy mentions, validating the results shown in Table 5.
Furthermore, we also give a visualization case of the graph attention module. Figure 4 shows the attention weights in the sub-graphs of nodes w_4 and w_8 in the word-mention interactive graph of Figure 2. According to Figure 4, for the sub-graph of the word w_4 ''hypotension'', we can see that m_2 ''hypotension'' has the highest attention weight and the word ''by'' has the lowest attention weight. In fact, it is intuitive that the word w_4 ''hypotension'' is similar to m_2, which is consistent with Figure 2.

3) CASE STUDY
In order to verify the ability of our model to discover new mentions, we compare our model GAT-BiLSTM-CRF with BiLSTM-CRF on the disease recognition task. We randomly choose four examples from the BC5CDR-disease test set, as shown in Table 7. Column 1 presents sentences containing new mentions, which are neither in the BC5CDR-disease training set nor in the disease dictionary. Column 2 presents the gold mentions in the sentences. Columns 3-4 present the results predicted by BiLSTM-CRF and GAT-BiLSTM-CRF, respectively. From Table 7, we can see that both models can correctly predict the mentions in examples 1 and 2, in which the mentions have obvious indicator features '' s'' and ''of''. But BiLSTM-CRF can only detect part of the gold mentions in examples 3 and 4, since ''glomerular'' and ''duct destruction'' have appeared in the training set but the gold mentions have not, while our model can correctly discover the mentions because it can extract the gold mentions from sentences using the graph matching algorithm.

V. CONCLUSION
In this paper, we propose a graph attention-based BiLSTM-CRF model for low-resource domain NER with a domain-specific dictionary. The domain-specific dictionary is used to extract entity mention candidates through a graph matching-based entity mention extraction algorithm, which is able to boost the coverage by a large margin. Although it introduces more noise, the graph attention module can effectively alleviate the impact of this noise. The experiments on disease and chemical datasets in the biomedical domain show that our model has complementary strengths to the state-of-the-art models, and that both the word formation graph and the word-mention interactive graph are effective. In the future, we plan to further extract and incorporate dictionary knowledge using other learning principles and empower more sequence labeling tasks in low-resource domains.