Biomedical Word Sense Disambiguation Based on Graph Attention Networks

Many biomedical words have more than one meaning. Biomedical word sense disambiguation (WSD), the process of determining the meaning of an ambiguous word from its context, is an important research problem in the biomedical field and is widely applied to the processing, translation, and retrieval of biomedical texts. To improve WSD accuracy in biomedicine, this paper proposes a new WSD method based on the graph attention network (GAT). Words, parts of speech, and semantic categories in the context of the ambiguous word are used as disambiguation features. The disambiguation features and the sentence are used as nodes to construct a WSD graph. GAT is used to extract discriminative features, and the softmax function is applied to determine the semantic category of the biomedical ambiguous word. The MSH dataset is used to optimize the GAT-based WSD classifier and to test its accuracy. Experiments show that the proposed method reaches an average accuracy of 0.8424, higher than the GCN-based and CNN-based classifiers it is compared with. In addition, a majority voting strategy is adopted to further optimize the GAT-based WSD classifier, raising the average accuracy to 0.8510.


I. INTRODUCTION
With the rapid development of biomedicine, the biomedical vocabulary keeps growing, and specific tools are needed to process biomedical texts. Such processing is often difficult because many biomedical words have multiple meanings, which makes biomedical text ambiguous. An effective tool is therefore needed to resolve the ambiguity of biomedical words. For example, the biomedical word 'BLM' has two senses, 'Bureau of Land Management' and 'Bleomycin', so the correct meaning of a biomedical word must be determined from its context. Biomedical WSD is the process of assigning an unambiguous sense to an ambiguous word.
We extract two sentences containing ADA from the corpus. The first sentence is 'Third dental therapeutics guide debuts at ADA session'. The second sentence is 'We isolated a novel ADA inhibitor from a culture of Bacillus spJ89 and evaluated its anti proliferative activity on human cancer cell lines'. ADA has two senses, American Dental Association and Adenosine deaminase, both of which are abbreviated to ADA. In the first sentence ADA means American Dental Association; in the second it means Adenosine deaminase. The abbreviation is therefore ambiguous in the original text and must be disambiguated from its context. Biomedical WSD is now widely applied to document classification, information extraction, and document retrieval in the biomedical field. Biomedical WSD methods fall into three categories: supervised, unsupervised, and knowledge-based.
The associate editor coordinating the review of this manuscript and approving it for publication was Amjad Ali.
In supervised WSD methods, human-annotated instances are used to train a WSD classifier, which assigns a semantic category to each test instance [1]. In unsupervised methods, structural knowledge is learned from unlabeled instances to determine the category of a biomedical ambiguous word [2]. In knowledge-based methods, thesauri and sense inventories are applied to disambiguate biomedical ambiguous words. For example, WordNet and the Unified Medical Language System (UMLS) are important thesauri that define different senses and their corresponding synonyms [3]. We propose a new biomedical WSD method based on GAT. The main innovations and contributions of this paper are summarized as follows:
• Words, parts of speech, and semantic categories from the context, together with the sentence containing the biomedical ambiguous word, are used as disambiguation features. The Word2vec and doc2vec tools are adopted to extract feature vectors from the disambiguation features.
• A WSD graph is constructed. We represent WSD data as a graph and solve the WSD problem on graph-structured data. Disambiguation features are used as nodes in the graph. Edges are established between word nodes and sentence nodes, between word nodes and part-of-speech nodes, and between word nodes and semantic category nodes.
• A multi-head graph attention mechanism is adopted to dynamically adjust the weight between two neighboring nodes.
This paper is organized as follows. Related work is reported in Section II. WSD feature extraction is given in Section III. WSD based on GAT is described in Section IV. Experimental results are given and analyzed in Section V. Conclusion is described in Section VI.

II. RELATED WORK
McInnes was the first to study biomedical WSD [4]. WSD methods are divided into supervised, unsupervised, and knowledge-based ones.
In supervised WSD methods, labeled data is used to train a WSD classifier. Wang gives an interactive learning algorithm in which experts label instances and features [5]. Experts provide supervision in three ways: labeling instances, specifying indicative words of a sense, and highlighting the supporting evidence in a labeled instance. Zhang proposes two supervised WSD models based on deep learning [6]. One is based on a Bi-directional Long Short-Term Memory (BiLSTM) network, and the other is based on the self-attention mechanism. Yepes evaluates several features from the context of the ambiguous word and uses word embeddings to explore global features from MEDLINE [7]. Festag investigates WSD based on word embeddings and recurrent convolutional neural networks, focusing on terms mapped to multiple concepts of the UMLS [8]. Antunes gives a supervised biomedical WSD method that uses bag-of-words as local features and word embeddings as global features [9]. Bis proposes a novel deep neural network for supervised medical WSD based on a layered bidirectional LSTM network, which performs max-pooling along multiple time steps to create a dense representation of the context [10]. Lui suggests that the notion of destination is a strong predictor of pedestrian trajectories and proposes a novel enhancement of the data-driven approach for pedestrian tracking in public buildings [11]. Qiao presents the DEep contextualized biomedical abbreviation expansion model, which automatically collects substantial and relatively clean annotated contexts for 950 ambiguous abbreviations from PubMed abstracts using a simple heuristic [12]. Supervised methods usually achieve high accuracy, but the training data must be labeled, and the cost of labeling is relatively high.
In unsupervised WSD method, unlabeled corpus is clustered to determine semantic category of ambiguous word. Pesaranghader designs deepBioWSD model which leverages 1 single bidirectional LSTM network to predict sense of ambiguous term [13]. Smalheiser gives similarity metrics for relating two medical subject headings with each other [14]. Li takes account of word order and presents a novel language model based on Bi-LSTM to embed sentential context in continuous space [15]. The proposed model generates contextual representations in an unsupervised manner. Duque presents a graph-based unsupervised biomedical WSD method, in which knowledge base is a graph built with co-occurrence information from medical concepts in scientific abstracts [16]. Ren proposes biomedical WSD method based on Convolutional Neural Network [17]. A large scale of relevant corpus from MEDLINE is crawled for training and contextual feature vectors are obtained. El-Rab applies six relation types of UMLS to build a graph for ambiguous word and gives a graph-based algorithm to disambiguate terms in biomedical text [18]. Moon discusses feature selection for disambiguation of acronyms and abbreviations in clinical domain [19]. Ren predefines the number of senses [20]. He uses kernel fuzzy C-means clustering method to group terms with the same sense into a set. Each set is mapped to a sense. Ahmad proposes the optimized gloss vector relatedness, the adapted gloss vector similarity measures, two enhanced semantic measures [21]. The effectiveness over WSD in biomedical domain is evaluated. Cao proposes an enhanced deep clustering network, which is composed of feature extractor, conditional generator, discriminator and siamese network [22]. The obtained pseudo-labels will be used to generate realistic data by generator. Finally, discriminator is used to model real joint distribution of data and corresponding latent representations for feature extractor enhancement. 
Ren proposes an abbreviation disambiguation method based on a convolutional neural network to solve the abbreviation disambiguation problem in the biomedical field when no labeled corpus exists [23]. Unsupervised methods use unlabeled data, which is easy to obtain, but their WSD accuracy is not high.
In knowledge-based WSD methods, lexical resources such as machine-readable dictionaries, thesauri, and ontologies are applied. Mohammed presents a simple modified version of the SenseRelate algorithm for biomedical WSD, which ignores distance by treating all terms in the context as equally distant [24]. Antunes applies results from machine learning and knowledge-based algorithms to biomedical WSD [25]. The method represents textual definitions of biomedical concepts from the UMLS as word embeddings and combines them with concept associations from MeSH term co-occurrences. Sabbir exploits recent advances in neural word/concept embeddings to improve the performance of biomedical WSD on the MSH dataset [26]. Duque gives a biomedical WSD system based on co-occurrence graphs that contain biomedical concepts and textual information [27]. Rais exploits semantic similarity and relatedness measures from biomedical resources to evaluate the influence of context window size on WSD [28]. Pashuk disambiguates biomedical terms based on bags of words from the context, definitions, and information on related terms from the UMLS [29]. McInnes uses semantic similarity and relatedness measures to determine the semantic category of a biomedical term, which does not require a human-annotated corpus and yields high accuracy [30]. Garla gives a knowledge-based WSD method that uses semantic similarity from the UMLS and evaluates the contribution of WSD to clinical text classification [31]. Kim suggests the link topic model, inspired by the latent Dirichlet allocation model, in which each document is perceived as a random mixture of topics and each topic is characterized by a distribution over words [32]. Knowledge-based methods can mine large-scale data and organize useful information, but the required knowledge is difficult to obtain.
Kang applies graph attention network to learn heterogeneous information [33]. Wang develops a GAT-based scheduler to learn features of scheduling problems automatically [34]. Wang introduces graph attention network based on syntactic dependency graph into natural language processing tasks [35]. Xie adopts attention architecture to learn representations of single views and uses regularization term to constrain the network's parameters [36]. Long designs graph convolutional network with node-level attention to learn embeddings for microbes and drugs [37].
Each of these three kinds of methods has its own shortcomings. Although supervised WSD methods achieve better performance, they need a large annotated corpus, which is time-consuming and laborious to build. Unsupervised WSD methods do not require manual labeling of the corpus, but their disambiguation accuracy is not high. In knowledge-based WSD methods, linguistic resources provide disambiguation information, but dictionaries are expensive to construct. Because the MSH dataset is an annotated corpus and supervised methods are usually more accurate than unsupervised ones, we choose a supervised method for WSD. Previous supervised WSD methods do not use linguistic knowledge such as parts of speech and semantic categories. However, the semantic category of an ambiguous word is closely related to the linguistic knowledge of its context in biomedical text. When words, parts of speech, and semantic categories from the context of the ambiguous word are combined to determine its meaning, more discriminative information is provided and WSD accuracy is improved. In this paper, we use a 2-layer GAT to extract discriminative features from the context of the biomedical ambiguous word, and the softmax function is adopted to determine its semantic category.

IV. WORD SENSE DISAMBIGUATION BASED ON GAT
A graph is composed of nodes and edges. In a graph attention network, an attention mechanism is introduced into the graph neural network to measure the importance of nodes.
A WSD graph is constructed in which words, parts of speech, semantic categories, and the sentence are used as nodes, and their relationships are used as edges between nodes. The WSD graph is composed of the word set $W = \{w_1, w_2, w_3, \ldots\}$, the part-of-speech set $P = \{p_1, p_2, p_3, \ldots\}$, the semantic category set $S = \{s_1, s_2, s_3, \ldots\}$, and the sentence set $D = \{d_1, d_2, d_3, \ldots\}$. The WSD graph also contains the edge set $WP = \{wp_1, wp_2, wp_3, \ldots\}$ between words and parts of speech, the edge set $WS = \{ws_1, ws_2, ws_3, \ldots\}$ between words and semantic categories, the edge set $WD = \{wd_1, wd_2, wd_3, \ldots\}$ between words and sentences, and the edge set $WW = \{ww_1, ww_2, ww_3, \ldots\}$ between words and words. Adjacency matrix $A$ is constructed from the WSD graph. When the number of nodes is $N$, the size of $A$ is $N \times N$. If the dimension of the feature vector is $M$, the size of feature matrix $X$ is $N \times M$. Adjacency matrix $A$ and feature matrix $X$ are input into GAT to extract discriminative features, and the softmax function is used to determine the semantic category of the biomedical ambiguous word, as shown in Figure 2.
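The construction of $A$ and $X$ can be illustrated with a minimal sketch. It is not the authors' implementation: the node names, the toy edges, the 4-dimensional random vectors standing in for Word2Vec/Doc2Vec features, and the helper `build_wsd_graph` are all hypothetical, and every edge is given weight 1 for simplicity.

```python
import numpy as np

def build_wsd_graph(nodes, edges, features):
    """Return adjacency matrix A (N x N) and feature matrix X (N x M)."""
    index = {name: k for k, name in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for u, v in edges:
        A[index[u], index[v]] = 1.0   # undirected edge, stored symmetrically
        A[index[v], index[u]] = 1.0
    X = np.stack([features[name] for name in nodes])
    return A, X

# Hypothetical toy graph: one sentence, two words, one POS tag, one semantic category.
nodes = ["d1", "w:milk", "w:protein", "p:NN", "s:food"]
edges = [
    ("w:milk", "d1"), ("w:protein", "d1"),       # word-sentence edges (WD)
    ("w:milk", "p:NN"), ("w:protein", "p:NN"),   # word-POS edges (WP)
    ("w:milk", "s:food"),                        # word-semantic category edge (WS)
    ("w:milk", "w:protein"),                     # word-word edge (WW)
]
rng = np.random.default_rng(0)
# Random vectors are placeholders for the real Word2Vec/Doc2Vec embeddings.
features = {name: rng.normal(size=4) for name in nodes}

A, X = build_wsd_graph(nodes, edges, features)
print(A.shape, X.shape)
```

With $N = 5$ nodes and $M = 4$ feature dimensions, `A` is $5 \times 5$ and `X` is $5 \times 4$, matching the $N \times N$ and $N \times M$ sizes stated above.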
For the above sentence containing the ambiguous word 'Milk', 16 disambiguation features are extracted. The Word2Vec and Doc2Vec tools are used to vectorize the disambiguation features. The resulting feature matrix $X$ of size $16 \times 200$ is shown in Figure 3 and is input into GAT.
We use TF-IDF to determine whether there is an edge between a word node and a sentence node, as shown in formula (1), where $\mathrm{TF}(t_i, d_j)$ is the frequency of word $t_i$ in sentence $d_j$, $N$ is the number of sentences in the document, and $n_i$ is the number of sentences in the document that contain word $t_i$.
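For reference, the standard TF-IDF weighting that matches the symbols defined for formula (1) can be written as follows. This is an assumption based on the usual definition; the exact variant used by the authors (for example, the logarithm base or any smoothing term) is not stated in the text.

```latex
\text{TF-IDF}(t_i, d_j) = \mathrm{TF}(t_i, d_j) \times \log \frac{N}{n_i}
```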
PMI is used to determine whether there is an edge between two word nodes, as shown in formula (2),
where $p(w_i, w_j)$ is the co-occurrence probability of $w_i$ and $w_j$, $p(w_i)$ is the occurrence probability of $w_i$, and $p(w_j)$ is the occurrence probability of $w_j$. Each node is regarded as its own neighbor so that it retains its own information: a self-loop is added to adjacency matrix $A$ by setting its diagonal elements to 1.
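The PMI computation can be sketched as below, using the standard definition $\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\,p(w_j)}$. Several choices here are assumptions rather than details from the paper: probabilities are estimated as the fraction of sentences containing a word or a word pair, the natural logarithm is used, and the helper `pmi_scores` and the toy sentences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences):
    """Return PMI for every word pair that co-occurs in at least one sentence."""
    n = len(sentences)
    word_count = Counter()
    pair_count = Counter()
    for tokens in sentences:
        vocab = sorted(set(tokens))                 # count each word once per sentence
        word_count.update(vocab)
        pair_count.update(combinations(vocab, 2))   # pairs in alphabetical order
    scores = {}
    for (wi, wj), c in pair_count.items():
        p_ij = c / n
        p_i = word_count[wi] / n
        p_j = word_count[wj] / n
        scores[(wi, wj)] = math.log(p_ij / (p_i * p_j))
    return scores

# Hypothetical toy corpus of tokenized sentences.
sentences = [
    ["milk", "protein", "diet"],
    ["milk", "protein"],
    ["diet", "exercise"],
    ["exercise", "sleep"],
]
scores = pmi_scores(sentences)
# Assumed edge rule: connect two word nodes when their PMI is positive.
word_edges = [pair for pair, value in scores.items() if value > 0]
```

Under the assumed rule, 'milk' and 'protein' are connected because they co-occur more often than chance would predict, while 'diet' and 'milk' (PMI of zero in this toy corpus) are not.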
Adjacency matrix A is shown in formula (3).
Assuming that $v_j$ is a neighbor node of $v_i$, the attention weight $\alpha_{ij}$ between $v_i$ and $v_j$ is computed as shown in formula (4),
where $W$ is the weight matrix, $h_i^{(l)}$ and $h_j^{(l)}$ are the feature vectors of nodes $v_i$ and $v_j$ in the $l$-th layer, $N_i$ is the set of all nodes adjacent to $v_i$, $\Vert$ denotes the concatenation operation, LeakyReLU is the activation function, and $a$ is the weight vector of the attention mechanism.
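For reference, the attention weight of the standard graph attention network, written with the symbols defined above and with the layer index denoted by $l$, has the following form. It is given as an assumption that formula (4) follows this standard definition.

```latex
\alpha_{ij} =
\frac{\exp\left(\mathrm{LeakyReLU}\left(a^{\top}\left[W h_i^{(l)} \,\Vert\, W h_j^{(l)}\right]\right)\right)}
     {\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(a^{\top}\left[W h_i^{(l)} \,\Vert\, W h_k^{(l)}\right]\right)\right)}
```

The denominator normalizes over the neighborhood $N_i$, so the weights that $v_i$ assigns to its neighbors sum to 1.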
The computation of attention weight $\alpha_{ij}$ is illustrated in Figure 4. We use K-head attention to stabilize the self-attention learning process. The feature vector $h_i^{(l+1)}$ of $v_i$ in the $(l+1)$-th layer is computed as shown in formula (5),
where $\sigma$ is a nonlinear activation function. The process of computing $h_1^{(l+1)}$ with 3 attention heads is shown in Figure 5.
VOLUME 10, 2022
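The multi-head update can be sketched in NumPy as follows. This is a minimal illustration of a standard GAT layer, not the authors' code: it assumes that each head computes its own attention weights, aggregates the projected neighbor features, applies ReLU as $\sigma$, and that the $K$ head outputs are concatenated. The function names, the LeakyReLU slope of 0.2, and the random toy inputs are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_weights(H, A, W, a):
    """Attention weights alpha[i, j] of one head; zero where A[i, j] == 0."""
    Z = H @ W                                  # projected features, shape (N, F)
    f = Z.shape[1]
    # a^T [z_i || z_j] separates into a part for node i and a part for node j.
    e = leaky_relu((Z @ a[:f])[:, None] + (Z @ a[f:])[None, :])
    e = np.where(A > 0, e, -np.inf)            # keep only neighbours (A has self-loops)
    e = e - e.max(axis=1, keepdims=True)       # stabilise the softmax
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)

def gat_layer(H, A, Ws, As):
    """K-head layer: per-head aggregation, ReLU, then concatenation."""
    heads = []
    for W, a in zip(Ws, As):
        alpha = attention_weights(H, A, W, a)
        heads.append(np.maximum(alpha @ (H @ W), 0.0))
    return np.concatenate(heads, axis=1)

rng = np.random.default_rng(0)
N, M, F, K = 4, 6, 2, 3                        # nodes, input dim, per-head dim, heads
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)      # chain graph with self-loops
H = rng.normal(size=(N, M))
Ws = [rng.normal(size=(M, F)) for _ in range(K)]
As = [rng.normal(size=2 * F) for _ in range(K)]

H_next = gat_layer(H, A, Ws, As)               # shape (N, K * F)
alpha = attention_weights(H, A, Ws[0], As[0])  # each row sums to 1
```

The sketch relies on the diagonal of $A$ being 1, as described earlier, so that every node has at least one neighbor and the softmax over its neighborhood is well defined.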
We extract disambiguation features from the sentence containing the biomedical ambiguous word $m$ and construct the WSD graph. The feature matrix and the adjacency matrix are then built and input into the GAT layers to extract discriminative features. The softmax function is used to calculate the probability $p(s_i \mid m)$ of $m$ under semantic category $s_i$. Then, the semantic category $s$ of the ambiguous word $m$ is determined as shown in formula (6).
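The final decision step can be sketched as follows, assuming that formula (6) selects the category with the highest softmax probability, $s = \arg\max_{s_i} p(s_i \mid m)$. The helper names and the example scores are hypothetical.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())        # subtract the maximum for numerical stability
    return e / e.sum()

def predict_sense(scores, senses):
    """Return the most probable sense and the full probability vector."""
    p = softmax(scores)
    return senses[int(np.argmax(p))], p

# Hypothetical output scores of the network for the ambiguous word 'ADA'.
senses = ["American Dental Association", "Adenosine deaminase"]
sense, probs = predict_sense([0.3, 1.7], senses)
```

Here the second score is larger, so `sense` is 'Adenosine deaminase'.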

V. EXPERIMENTAL RESULTS AND ANALYSIS
The MSH dataset in the biomedical field is used to train and test the proposed method. The US National Library of Medicine has developed a medical language system. The data generated
In the fifth group of experiments, a majority voting strategy is adopted to optimize the GAT-based WSD classifier, in which a CNN-based classifier and a GCN-based classifier are also used. Average accuracy is used to evaluate the WSD classifier, as shown in formula (7),
where $N$ is the number of ambiguous words, $m_i$ is the number of test sentences correctly classified for the $i$-th ambiguous word, $n_i$ is the number of test sentences containing the $i$-th ambiguous word, $p_i$ is the disambiguation accuracy of the $i$-th ambiguous word, and $p_{avg}$ is the average accuracy.
The first group of experiments includes Experiment 1, Experiment 2, and Experiment 3. In these three experiments, the learning rate is 0.01, the dropout rate is 0.5, and the number of training epochs is 100. In Experiment 1 and Experiment 2, words, parts of speech, and semantic categories are extracted from the context of the ambiguous word as disambiguation features. The disambiguation features and the sentence containing the ambiguous word are used as nodes to construct the WSD graph. GAT and GCN, respectively, are used to determine the semantic category of the ambiguous word on the WSD graph. In Experiment 3, words, parts of speech, and semantic categories are extracted as disambiguation features from the two units to the left and the two units to the right of the ambiguous word, and CNN is used to determine its semantic category. The activation function of the GAT, GCN, and CNN layers is ReLU. A softmax layer is adopted to determine the semantic category of the ambiguous word. The training corpus is used to optimize GAT, GCN, and CNN, and the test corpus is used to evaluate the optimized models, as shown in Table 1.
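Returning to the evaluation metric of formula (7), a minimal sketch is given below. It assumes, from the symbol definitions, that $p_i = m_i / n_i$ and that $p_{avg}$ is the unweighted mean of the $p_i$ over the $N$ ambiguous words. The function name and the counts are hypothetical.

```python
def average_accuracy(correct, total):
    """Per-word accuracies p_i = m_i / n_i and their unweighted mean p_avg."""
    if len(correct) != len(total) or not correct:
        raise ValueError("correct and total must be non-empty and of equal length")
    per_word = [m / n for m, n in zip(correct, total)]
    return per_word, sum(per_word) / len(per_word)

# Hypothetical counts for two ambiguous words.
correct = [45, 30]   # m_i: correctly classified test sentences
total = [50, 40]     # n_i: test sentences containing the word
per_word, p_avg = average_accuracy(correct, total)
```

With these counts the per-word accuracies are 0.9 and 0.75, so `p_avg` is 0.825. Under this assumed reading every ambiguous word contributes equally, regardless of how many test sentences it has; pooling all sentences would instead give 75/90.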
It can be seen from Table 1 that the average accuracy of Experiment 1 is the best and is higher than that of Experiment 2. Both GCN and GAT aggregate the features of neighbor nodes into the center node, but GCN uses the Laplacian matrix whereas GAT uses attention coefficients. GAT can therefore extract more effective features than GCN, and the correlation between nodes is better integrated into the WSD model. Experiment 2 achieves a higher average accuracy than Experiment 3 because its disambiguation features are extracted from all units to the left and right of the ambiguous word, whereas those of Experiment 3 are extracted only from the two units to the left and the two units to the right. More linguistic knowledge is thus integrated into the WSD classifier in Experiment 2.
The ambiguous words of Experiment 1, Experiment 2, and Experiment 3 in Table 1 are grouped according to their category number, and the average accuracy of the ambiguous words with the same category number is calculated, as shown in Figure 6. Figure 6 shows that the average accuracy of the WSD classifier decreases as the category number increases. The reason is that the prediction has more possible outcomes when there are more categories, which raises the error rate of the WSD classifier. The average accuracy of GAT is higher than that of GCN for both 2 and 3 categories, because GAT has a better feature extraction ability than GCN. GAT assigns different weights to neighbor nodes of the same order, and the information from neighbor nodes is aggregated and scaled according to the attention weights, so the correlation between nodes is better integrated into the WSD classifier. GCN assigns the same weight to neighbor nodes of the same order, so the way GCN fuses the features of adjacent nodes depends on the graph structure, and a GCN-based WSD classifier generalizes poorly to graphs with different structures. The average accuracy of GAT is also higher than that of CNN for both 2 and 3 categories. The reason is that GAT extracts disambiguation features from the whole context of the ambiguous word, whereas CNN extracts them only from the two units to the left and the two units to the right. The average accuracy of GCN is higher than that of CNN for 2 categories, because GCN uses the information of all units whereas CNN uses only that of the two left and two right units. The average accuracy of GCN is lower than that of CNN for 3 categories, because the accuracy of GCN is considerably lower than that of CNN for some biomedical ambiguous words, for example DDS and DI.
Attention head number affects the performance of the proposed network. The second group of experiments are performed to investigate the influence of head number on biomedical WSD. In these 4 experiments, the learning rate is 0.01, the dropout rate is 0.5, and the number of training epochs is 100. Activation function of GAT layer is Relu and softmax layer is adopted to determine semantic category of ambiguous word. Head number is respectively set to 3, 4, 5 and 6. Training corpus is used to optimize the proposed network. Test corpus is adopted to evaluate the optimized network as shown in Table 2.
It can be seen from Table 2 that the average accuracy of the proposed network first increases and then decreases as the head number increases. The proposed network with 5 attention heads performs best, with an average accuracy of 0.8424. Different attention heads consider relevance at different levels and are computed independently. When the head number is larger, information at more levels can be considered to obtain effective features. When the head number is smaller, the dimension of the feature vector for each head is larger, the network's structure is more complicated, and overfitting occurs easily. With 5 heads, the disambiguation feature is divided into 5 parts, the 5 heads are computed independently, and their outputs are concatenated to obtain effective discriminative features.
The scale of training corpus influences the performance of the proposed network. The third group of experiments are performed where the ratio of training corpus and test one is respectively set to 4:3, 7:3, 20:13, 5:3. In these 4 experiments, the learning rate is 0.01, the dropout rate is 0.5, and the number of training epochs is 100. Activation function of GAT layer is Relu and softmax layer is adopted to determine semantic category of ambiguous word. Head number of the proposed network is set to 5. Training corpus is used to optimize the proposed network. Test corpus is adopted to evaluate the optimized network as shown in Table 3.
It can be seen from Table 3 that the average accuracy of the proposed network first increases and then decreases as the scale of the training corpus increases. The proposed network performs best when the ratio of training corpus to test corpus is 7:3, where its average accuracy reaches 0.8424. The reason is that the network is optimized adequately at this ratio and can extract more discriminative features, so the WSD classifier performs better. When the ratio is less than 7:3, there is less training data and GAT is optimized inadequately. When the ratio is larger than 7:3, there is more training data and more noise is introduced into the process of optimizing GAT.
The number of GAT layers influences the performance of the proposed network. The fourth group of experiments is conducted with the layer number set to 1, 2, 3, and 4, respectively. The head number of the proposed network is set to 5. The learning rate is 0.01, the dropout rate is 0.5, and the number of training epochs is 100. The activation function of the GAT layer is ReLU, and a softmax layer is adopted to determine the semantic category of the ambiguous word. The training corpus is used to optimize the proposed network, and the test corpus is used to evaluate the optimized network, as shown in Table 4.
It can be seen from Table 4 that the proposed network performs best when the layer number is 2, with an average accuracy of 0.8424. When the layer number is too small, information between GAT nodes cannot be fused well. With more GAT layers, more extensive information is collected. However, when the layer number is too large, information is diffused excessively across the graph's nodes: each node's representation is smoothed and its discriminative ability decreases. Information from high-order neighbor nodes is fused, which results in excessive fusion of information and reduces the performance of the biomedical WSD classifier.
The ambiguous words in Table 4 are grouped according to category number and layer number, and the average accuracy of the ambiguous words with the same category number and layer number is calculated, as shown in Figure 7. Figure 7 shows that the average accuracy of the proposed network decreases as the category number increases. This is because the prediction has more possible outcomes when there are more categories, which raises the error rate of the proposed network. The average accuracy first increases and then decreases as the layer number increases. The proposed network with 2 layers performs best for both 2 and 3 categories. If the layer number is too small, information between nodes cannot be fused well. If there are more GAT layers, information from high-order nodes is fused and effective features cannot be extracted.
In the fifth group of experiments, the optimized CNN-based, GCN-based, and GAT-based classifiers are each used to determine the semantic category of the ambiguous word $m$ in a test sentence. This yields three predicted semantic categories, and we select the category with the highest frequency as the semantic category of $m$. The disambiguation accuracies of the ambiguous words are shown in Table 5.
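The voting step can be sketched as below. The text does not say what happens when the three classifiers predict three different categories; falling back to the GAT prediction in that case is an assumption made only for this sketch, and the function name and labels are hypothetical.

```python
from collections import Counter

def majority_vote(cnn_pred, gcn_pred, gat_pred):
    """Return the category predicted by at least two of the three classifiers.

    Assumption: if all three predictions differ, the GAT prediction is used.
    """
    counts = Counter([cnn_pred, gcn_pred, gat_pred])
    label, freq = counts.most_common(1)[0]
    return label if freq >= 2 else gat_pred

# Hypothetical predictions for three test sentences.
agree = majority_vote("M1", "M2", "M2")   # two classifiers agree on M2
split = majority_vote("M1", "M2", "M3")   # no majority, falls back to GAT
same = majority_vote("M1", "M1", "M1")    # unanimous
```

For ambiguous words with only two categories a majority always exists, so the fallback matters only when there are three or more categories.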
Table 5 shows that the WSD method based on the majority voting strategy achieves a higher average accuracy than the GAT-based WSD method, reaching 0.8510. The reason is that CNN, GCN, and GAT each have their own advantages and disadvantages in feature extraction. When CNN and GCN are used alongside GAT to determine the semantic category of the ambiguous word under the majority voting strategy, the average accuracy is improved.

VI. CONCLUSION AND FUTURE WORKS
In this paper, a biomedical WSD network is proposed that consists of an input layer, a feature extraction layer, a graph construction layer, a GAT layer, and an output layer. Words, parts of speech, and semantic categories are extracted as disambiguation features from the context of the biomedical ambiguous word. A WSD graph is constructed in which the disambiguation features and the sentence are used as nodes. The relationships between word and sentence, word and part of speech, and word and semantic category are viewed as edges. Then, the adjacency matrix and the feature matrix are built. GAT is used to extract discriminative features, and the softmax function is adopted to determine the semantic category of the biomedical ambiguous word. Experiments conducted on the MSH dataset show that the average accuracy of the proposed method reaches 0.8424. When the majority voting strategy is adopted to optimize the GAT-based WSD classifier, its average accuracy improves to 0.8510.
In the future, more linguistic and biomedical knowledge will be introduced into biomedical WSD, for example parsing information of the context and common knowledge in the biomedical field. In addition, because we apply GAT to the WSD graph to disambiguate biomedical ambiguous words, the WSD graph is key to the proposed network. When the adjacency matrix is constructed, PMI is adopted to evaluate the relevance of two words, which is not precise. In the future, we will try other methods to compute the relevance of two words.
CHUN-XIANG ZHANG received the Graduate and Ph.D. degrees from the MOE-MS Key Laboratory of Natural Language Processing and Speech, School of Computer Science and Technology, Harbin Institute of Technology, in 2007. He is currently a Professor at the School of Computer Science and Technology, Harbin University of Science and Technology. His research interests include natural language processing, machine translation, machine learning, computer graphics and CAD, and 3D model retrieval. He has authored or coauthored more than 60 journals and conference papers in these areas.
MING-LEI WANG received the B.S. degree from the Qilu University of Technology, in 2019. He is currently pursuing the master's degree with the School of Computer Science and Technology, Harbin University of Science and Technology. His research interests include natural language processing and word sense disambiguation.
XUE-YAO GAO received the Graduate and Ph.D. degrees from the School of Computer Science and Technology, Harbin University of Science and Technology, in 2009. She is currently a Professor at the School of Computer Science and Technology, Harbin University of Science and Technology. Her research interests include computer graphics and CAD, 3D model retrieval, natural language processing, and machine learning. She has authored or coauthored more than 50 journals and conference papers in these areas.