Interactive Self-Attentive Siamese Network for Biomedical Sentence Similarity



I. INTRODUCTION
With the growth of biomedical information, increasing numbers of medical texts have accumulated. In these data, many sentences convey similar semantic meaning through completely different surface descriptions, which creates considerable unnecessary effort for medical research. Evaluating the textual similarity between biomedical texts is therefore an important step in extracting useful biomedical information, such as drug-drug interactions (DDIs) [1]. Although some researchers have utilized biomedical resources [2] or corpora [3] to improve similarity evaluation, the generalization of these methods is poor because of limitations in the resources and corpora. Machine learning-based methods [4] have therefore also been proposed for this task.
Some researchers have utilized word embedding [5], sentence embedding [6], [7] and shared sentence encoders [8] to obtain semantic representations and estimate similarity. Neculoiu et al. [9] and Mueller [10] computed the similarity between texts via Siamese neural networks (SNNs) containing dual recurrent neural networks. Although Siamese neural networks perform well on text similarity tasks, researchers have proposed improved models based on the standard SNN. Zhu et al. [11] presented a replicated Siamese LSTM model for estimating similarity in asymmetric texts and achieved better performance. Zhu then proposed a dependency-based Siamese LSTM network for sentence similarity; the model captured richer semantic information about a sentence than the standard LSTM model and learned an efficient sentence representation. However, these improved SNNs mainly rely on modifying a single network structure, such as a hierarchical LSTM, or on introducing other mechanisms, such as dependency information. Pontes [12] extracted crucial semantic vectors via two convolutional neural networks (CNNs) and then fed the semantic vectors into the following Siamese network. This combination of networks helped preserve the relevant information of sentences and improved the calculated similarity between sentences. Nevertheless, these methods ignore the influence of different words on the overall semantics. An attention mechanism can calculate a weight for each word, representing the influence of that word on sentence semantics. The attentive Siamese LSTM network, instead of relying on numerous manual features derived from prior knowledge or external resources, has been presented for measuring semantic textual similarity [13].
It is insufficient to estimate the similarity between sentences by calculating the similarity/distance of two separately obtained sentence semantic representations produced by deep neural network models. In fact, different words contribute differently to the overall semantics of a single sentence. Moreover, when evaluating the similarity of two sentences, the semantics of one sentence also affects the semantics of the other. Inspired by self-attention [14] and interactive attention [15], interactive self-attention (ISA), which integrates self-attention with interactive attention, is proposed in this paper. ISA is then integrated with a Siamese neural network, yielding the interactive self-attentive Siamese neural network (ISA-SNN). The model learns the weights of words in a single sentence by using a self-attention mechanism and the shared parameters of the SNN, and captures the interactive semantic information between two sentences via an interactive attention mechanism.
The main contributions can be summarized as follows:
• An interactive self-attention (ISA) mechanism is proposed that integrates self-attention with interactive attention.
• The proposed model utilizes ISA to obtain the interactive semantic information between biomedical sentences, and this interactive semantic information may enhance the sentence semantic representation learned by self-attention.
• Three biomedical datasets are employed to verify the effectiveness of the proposed mechanism and model. Analysis of Siamese networks with different mechanisms shows that ISA helps to enhance the semantic information of a sentence pair and improves the performance of textual similarity estimation.
The rest of the paper is organized as follows. Section II gives a brief review of related work. Section III presents the different attention mechanisms and ISA-SNN for evaluating the similarity between sentences. Section IV describes the experimental details, parameter settings, and evaluation metrics. Detailed analysis and a summary of the experimental results are given in Section V. In Section VI, a conclusion is drawn and future work is discussed.

II. RELATED WORK
Evaluating textual similarity, one of the main tasks in NLP, refers to calculating the similarity between term pairs or sentence pairs. Researchers have evaluated textual similarity using information retrieval methods [16] and automatic text summarization [17]. To overcome the difficult operation and common errors of information retrieval methods, an automatic evaluation algorithm based on word co-occurrence [18] was proposed, but common words in sentences affected the scores. Therefore, knowledge-based [19] and corpus-based methods were applied to measure semantic similarity [3] and decrease the error rate. However, the performance of these methods depends on the quality of the knowledge base/corpus. Hence, machine learning-based methods such as SVM [4] were used for the task. Neural network methods such as DBN [20] and broad learning [21] achieve even better performance. A Siamese network, which shares parameters between two RNNs or other neural network models [9], performs better than a single network. These methods mainly focused on the semantics of words or of a single sentence without considering the contribution of different words to the overall semantics, until attention mechanisms were adopted in NLP tasks.
In some NLP tasks, some parts of the input are more relevant than others [22]. Bahdanau et al. [23] first introduced the attention mechanism for machine translation. Researchers have since used the mechanism and improved attention methods in different NLP tasks such as text classification [24], named entity recognition (NER) [25], and sentiment classification [26]. First, for a single sequence, in contrast to soft attention [23], a hard attention model [27], in which the context vector is computed from stochastically sampled hidden states of the input sequence, was used, while local and global attention [28] were used to capture the local and global context. To capture relationships within a sequence, self-attention [29], also known as inner attention, was proposed. Second, for multiple sequences, co-attention [30] combines attention over multiple information sources so that they guide each other, and interactive attention [15] represents a target and its collective context well, which contributes to classification tasks. The interactive attention mechanism is effective in sentence/word pair tasks such as question answering owing to its interactive semantics [15]. Estimating the similarity between sentences is a typical sentence pair task. Although some researchers have improved the Siamese network [31] and utilized different deep neural networks [8] to obtain better results, interactive attention and self-attention have not been integrated with other deep neural network-based methods for calculating the similarity between sentences. Therefore, we introduce self-attention to obtain the weights of words and interactive attention to enhance the semantic representation.

III. METHODS
Figure 1 shows the overall architecture of the proposed model: an embedding layer and a dual bi-RNN layer extract the basic semantic information of the sentence pair; interactive self-attention then produces the final sentence semantic representation; finally, a distance measure such as the Manhattan distance is employed to estimate the semantic similarity between the sentences.

A. SIAMESE NEURAL NETWORK
Siamese neural networks (SNNs) [9], [10] are used to evaluate the similarity between texts. They are dual-branch networks with shared weights, consisting of two copies of the same network merged with an energy function [9]. Bidirectional recurrent neural networks (bi-RNNs) have achieved good results on other biomedical NLP tasks such as named entity recognition (NER) [32]; thus, the Siamese network in this paper employs bi-RNNs as the branch networks. In addition, double bi-LSTMs are employed in each branch network for textual similarity estimation owing to the syntactic complexity of biomedical texts [34]. LSTM and GRU are the two most widely used RNN cells. A GRU contains an update gate z that determines how much of the previous cell's memory to keep and a reset gate r that determines how to combine the new input with the previous memory. Nevertheless, slight differences exist between LSTM and GRU. First, an LSTM cell has three gates, whereas a GRU has only two. Second, the forget and input gates of an LSTM are combined into the single update gate z of the GRU, and the reset gate r is applied directly to the previous hidden state [33]. In this work, our experimental datasets consist of long and syntactically complex sentences. Hence, we use LSTM cells as the units of our bi-RNNs, because the extra gating function allows them to capture more complex patterns [34]. The hidden output of each LSTM cell can be calculated via equations 1-6.
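For reference, a standard LSTM parameterization consistent with the notation defined below is given here; the recurrent matrices U_i, U_f, U_o, U_c and the candidate-state parameters W_c, b_c are assumptions added for completeness, since only W_i, W_f, W_o and the corresponding biases are defined explicitly in the text:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)        (1)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)        (2)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)        (3)
C̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)     (4)
C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t             (5)
h_t = o_t ∘ tanh(C_t)                        (6)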
where W_i, W_f, W_o ∈ R^{d×d} are the weight matrices and b_i, b_f, b_o ∈ R^d are the biases of the LSTM to be learned during training, x_t ∈ R^d is the input at time-step t, d is the feature dimension of each word, σ is the elementwise sigmoid function, and ∘ is the elementwise product. C_t is the memory cell, designed to lower the risk of vanishing/exploding gradients, and it therefore enables learning of dependencies over a longer period of time than is feasible with traditional recurrent networks.
C̃_t is the temporary state at time-step t. The forget gate f_t resets the memory cell. i_t and o_t denote the input and output gates, respectively, and essentially control the input and output of the memory cell. tanh is the activation function.
B. SELF-ATTENTION
The SNN ignores the different contributions of different words to the overall sentence semantic representation while it extracts semantics through shared parameters. Therefore, some researchers have utilized attention mechanisms to learn the weights of different words in a sentence [13], [35]. Additionally, the semantic similarity between words is useful for biomedical sentence similarity estimation, because the sentences obtained from biomedical databases, such as PubMed and the Conserved Domain Database, are syntactically complex, and it is difficult to obtain significant semantic features from them. Self-attention obtains the weights of words via an attention operation performed between each word and all words in the sentence. These attention weights not only reflect the different contributions to the overall sentence semantics but also reflect the similarity between words in a sentence. Therefore, self-attention is more suitable than other attention mechanisms for biomedical sentence similarity estimation.
Given an input sequence s = (x_1, x_2, . . . , x_n), self-attention calculates the weight α_i of each word x_i with respect to the other words in the sequence s. The weight vector α_i represents the contribution of word x_i to the semantics of the sequence. The weight matrix of the sequence can be described as α = (α_1, α_2, . . . , α_n), where n is the number of tokens in the sentence and α_i is calculated by equations 7-10.
where d is the dimension of q and k, the weight matrix W is a parameter of the model, and ∘ is the elementwise product. q_i, k_i and v_i denote the vectors of the i-th word. α_{i,j} and α̃_{i,j} are the attention values between the i-th and j-th words obtained using the elementwise product and the softmax function, respectively. Finally, the output of self-attention is C = (c_1, c_2, . . . , c_n), where c_i = α_i v_i represents the enhanced semantic vector of each word in the sentence s.
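As a concrete illustration, a minimal NumPy sketch of standard scaled dot-product self-attention is given below. The projection matrices Wq, Wk, Wv and the dot-product scoring are assumptions for illustration only, since equations 7-10 describe an elementwise-product variant:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d) word vectors of one sentence; Wq/Wk/Wv: (d, d) projections (assumed)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values for every word
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise word-to-word attention scores
    alpha = softmax(scores, axis=-1)          # normalized weights alpha_{i,j}
    return alpha @ V                          # enhanced semantic vector c_i per word

# toy usage
rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
C = self_attention(X, Wq, Wk, Wv)             # shape (5, 8)
```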

C. INTERACTIVE SELF-ATTENTION
However, focusing only on semantic information within a single sentence is insufficient for computing the semantic similarity between sentences. In the biomedical literature, researchers describe the same content/opinion using sentences composed of the same words (or synonyms and near-synonyms) in different positions, e.g., ''Kalirin and Trio are encoded by separate genes in mammals and by a single one in invertebrates.'' and ''In mammals, Kalirin and Trio are encoded by separate genes, but invertebrates have a single homologous gene.'' Therefore, semantic information in one sentence contributes to the semantic representation of the other sentence in a biomedical sentence pair. Inspired by the interactive attention network (IAN) [15], one sentence is regarded as the context and the other as the target.
In this way, interactive attention can naturally be used to compute the semantic similarity of sentence pairs. In this paper, self-attention is integrated with interactive attention, and the result is named interactive self-attention (ISA). ISA extracts sentence semantics via self-attention and enhances the semantic representation through interactive attention. Given a sequence X = (x_1, x_2, . . . , x_n) and another sequence Y = (y_1, y_2, . . . , y_n), where X and Y are the two semantic feature sequences output by the Siamese network and n is the sequence length, each input has its own attention function in the interactive self-attention network, defined in equations 11 and 12.
where W_x and W_y are the weight matrices of the X and Y sequences, respectively, x̄ and ȳ are the average hidden states of all elements in the sequences X and Y, and ; is the concatenation operator. The attention weight vectors of the two sequences are described as α = (α_1, α_2, . . . , α_n, α_ȳ) and β = (β_1, β_2, . . . , β_n, β_x̄). Here, α_i, α_ȳ, β_i, and β_x̄ are calculated according to equations 8-10. Then, α_ȳ and β_x̄ are removed from α and β. Finally, the sequences X and Y are multiplied by their respective attention weight vectors to produce the output of interactive self-attention, as described in equations 13 and 14.
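A simplified NumPy sketch of the interactive step under these definitions is shown below; each word of X is scored against the average hidden state ȳ of Y and vice versa, and the exact form of equations 11-14 may differ from this reading:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def interactive_self_attention(X, Y, Wx, Wy):
    """X, Y: (n, d) branch outputs; Wx, Wy: (2d,) scoring weights (assumed shapes)."""
    y_bar = Y.mean(axis=0)                                   # average hidden state of Y
    x_bar = X.mean(axis=0)                                   # average hidden state of X
    # score each word against the other sentence's summary vector
    alpha = softmax(np.array([np.concatenate([x, y_bar]) @ Wx for x in X]))
    beta = softmax(np.array([np.concatenate([y, x_bar]) @ Wy for y in Y]))
    Cx = alpha[:, None] * X                                  # weighted sequence X
    Cy = beta[:, None] * Y                                   # weighted sequence Y
    return Cx, Cy
```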

D. SIMILARITY ESTIMATION
In this section, the final process of similarity estimation is presented. The training set for the Siamese neural network consists of triplets (s_1, s_2, y), where s_1 and s_2 are word sequences and y ∈ [0, 1] indicates the similarity between s_1 and s_2. Given a sentence pair s_1 = (w_1, w_2, . . . , w_n) and s_2 = (w_1, w_2, . . . , w_m), where n and m are the lengths of s_1 and s_2, respectively, s_1 and s_2 are fed into the upper and lower branch networks of our SNN, and the corresponding outputs of the SNN are the hidden state sequences h_t^1 and h_t^2. These are used as the X and Y of Section III-C to calculate the interactive self-attentive weights, and the vectors C_x and C_y used for calculating the similarity are obtained via equations 13 and 14. Finally, the similarity Dis(C_x, C_y) is measured by the Manhattan distance formula (equation 17), and the evaluation score on the test set is computed with the Pearson correlation coefficient.
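A small sketch of this final scoring step is given below, assuming the exponential Manhattan similarity used in MaLSTM-style models; whether equation 17 includes the exponential is an assumption:

```python
import numpy as np

def manhattan_similarity(cx, cy):
    """cx, cy: pooled sentence vectors obtained from interactive self-attention."""
    return float(np.exp(-np.abs(cx - cy).sum()))   # in (0, 1]; 1 means identical

# toy usage: identical vectors give similarity 1.0
v = np.ones(4)
print(manhattan_similarity(v, v))   # 1.0
```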
1) PEARSON CORRELATION COEFFICIENT
The accuracy of the predicted results is evaluated with the Pearson correlation coefficient, which reflects the degree of linear correlation between two variables [37]. Given P = (p_1, p_2, . . . , p_n) and Y = (y_1, y_2, . . . , y_n), where P is the set of scores for n sentence pairs generated by our model and Y is the set of scores from the human assessors' judgment, each y_i in Y represents the annotated semantic textual similarity of the i-th sentence pair, whose predicted score is p_i in P. The Pearson correlation coefficient r is defined as follows.
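The standard definition, which we assume matches the paper's formula, is:

r = Σ_{i=1}^{n} (p_i − p̄)(y_i − ȳ) / sqrt( Σ_{i=1}^{n} (p_i − p̄)² · Σ_{i=1}^{n} (y_i − ȳ)² )

where p̄ and ȳ denote the means of P and Y, respectively.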

2) SPEARMAN CORRELATION COEFFICIENT
When the joint distribution of (P, Y) is bivariate normal, the Pearson correlation coefficient provides a complete summary of the association between P and Y. However, the Pearson correlation coefficient is extremely sensitive to even slight departures from normality: a single outlier can easily conceal the underlying association [38]. Therefore, a robust correlation coefficient, namely the Spearman correlation coefficient, is also used to evaluate the estimated similarity of sentence pairs. The Spearman correlation coefficient ρ is simply the Pearson correlation coefficient r computed between the ranked variables, and it is defined as follows.
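In the absence of tied ranks, the usual closed form (assumed here) is:

ρ = 1 − (6 Σ_{i=1}^{n} d_i²) / (n(n² − 1))

where d_i is the difference between the ranks of p_i and y_i.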
The degree of correlation is commonly divided into five levels. In addition, because the mean square error is adopted as the loss function during training, the mean square error (MSE) is also used as an evaluation criterion.

3) PRECISION, RECALL, F1-SCORE, AND ACCURACY
Measuring the similarity between sentences is also converted into a binary classification task to verify the generalization performance of the proposed method and of existing methods. To perform the experiment, both the annotated similarity score and the predicted score are converted to a classification label. First, the annotated scores of DBMI (range [0, 5]) and CDD-ful/-ref (range [1, 5]) are mapped to the [0, 1] interval via equation 20.
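Assuming equation 20 is the standard linear rescaling implied by the boundary description below, it takes the form:

Map(x, a, b, c, d) = a + (x − c)(b − a) / (d − c)

For example, a CDD score of x = 3 with a = 0, b = 1, c = 1, d = 5 maps to 0 + (3 − 1)(1 − 0)/(5 − 1) = 0.5.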
In Map(x, a, b, c, d), a and b are the lower and upper boundaries, respectively, of the target interval, and c and d are the lower and upper boundaries, respectively, of the original interval. Here, a = 0 and b = 1; for the DBMI dataset, c = 0 and d = 5, and for the CDD-ful/-ref datasets, c = 1 and d = 5. Then, as in [35], the predicted label is 1 when the output Dis(C_x, C_y) ≥ 0.5; otherwise, the predicted label is 0. Finally, to evaluate the performance of the binary classification task, accuracy, precision, recall, and the F1-score are adopted, which are defined as follows.
Here, TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
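In terms of these counts, the standard definitions are:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)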

C. TRAINING DETAILS
Our implementation of all models is based on the open-source deep learning package Keras.
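As a rough illustration of the Siamese backbone described in Section III (a shared embedding and stacked bi-LSTM branches with a Manhattan similarity output), a minimal Keras sketch might look as follows; the vocabulary size, dimensions, and the omission of the ISA layer are assumptions, not the authors' exact configuration:

```python
import tensorflow.keras.backend as K
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Lambda
from tensorflow.keras.models import Model

max_len, vocab_size, emb_dim, hidden = 100, 20000, 200, 50   # assumed hyperparameters

# Shared layers: both branches reuse the same weights (the Siamese property).
embed = Embedding(vocab_size, emb_dim)
bilstm_1 = Bidirectional(LSTM(hidden, return_sequences=True))
bilstm_2 = Bidirectional(LSTM(hidden))          # second bi-LSTM pools to a sentence vector

def encode(x):
    return bilstm_2(bilstm_1(embed(x)))

left = Input(shape=(max_len,), dtype="int32")
right = Input(shape=(max_len,), dtype="int32")
h1, h2 = encode(left), encode(right)

# exp(-L1 distance) similarity in (0, 1]; the ISA layer would sit before this step.
similarity = Lambda(
    lambda t: K.exp(-K.sum(K.abs(t[0] - t[1]), axis=1, keepdims=True))
)([h1, h2])

model = Model(inputs=[left, right], outputs=similarity)
model.compile(loss="mse", optimizer="adam")
model.summary()
```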

V. RESULTS AND ANALYSIS A. BASELINES AND OUR MODELS
To demonstrate the effectiveness of our proposed model, we compare it against multiple baseline methods and state-of-the-art approaches for the sentence pair similarity estimation task on other corpora.
• MaLSTM: proposed by Mueller [10]; it achieved state-of-the-art results on the SICK corpus [39].
• SCNN: like MaLSTM, but with the recurrent neural network replaced by a convolutional neural network.
• ImprovedSNN: proposed by Chi and Zhang [35]; it employed hierarchical attention [29] to give different words different attention weights and achieved better results on a large dataset downloaded from the Stanford website.
• AttentiveSNN: proposed by Bao et al. [13]; it regarded the attention weights as the coefficient of the Manhattan distance and achieved a higher Pearson correlation score than other methods on a cross-lingual textual similarity corpus.
• SNN: our baseline, similar to MaLSTM, but with double bi-LSTMs employed in each branch network.
• IA-SNN: our baseline introducing interactive attention into a Siamese neural network.
The six methods mentioned above, which we implemented, are our baselines for comparison; the following methods are the ones presented in this paper.
• SA-SNN: introducing self-attention into a Siamese neural network.
• ISA-SNN: our full model, introducing interactive self-attention into a Siamese neural network.
For a more detailed analysis of the proposed attention mechanism's ability to assess the similarity between biomedical sentences, the Manhattan distances of the worst- and best-predicted sentence pairs are shown in Table 2 (rescaled to a similarity scale of 1-5 via equation 20). The first sentence pair, containing 93 tokens, is the worst-predicted dissimilar example. IA-SNN obtains the best score of 2.7 (predicted score 0.45), and ISA-SNN achieves 3.4. This example shows that interactive self-attention may introduce slight noise for long sentences. The second example, consisting of one sentence with 42 tokens and another with 14 tokens, is the best-predicted dissimilar sentence pair; the Manhattan distance obtained by ISA-SNN is clearly better than that of the other methods, including IA-SNN. Finally, the third and fourth instances are the most similar sentence pairs. In both, the query is exactly the same as the answer sentence, and all methods achieve similar scores on the two pairs (the third pair is the worst-predicted and the fourth the best-predicted). The scores of these two sentence pairs show that IA-SNN and ISA-SNN perform better than the other methods on the most similar sentence pairs owing to the enhanced sentence semantic representation provided by interactive attention.
We also implement the hybrid system of [40] without additional features, which was used for OHNLP 2018 task 2, to further verify the effectiveness of ISA. First, the attention mechanism in ABCNN [41] is replaced with ISA, yielding ISA-BCNN. ABCNN achieves a MAP and MRR of 0.647 and 0.658, respectively, whereas the MAP and MRR of ISA-BCNN are 0.676 and 0.689, respectively. Second, ISA-BCNN serves as a substitute for ABCNN in [40]. Finally, the results of [40] and of our implemented hybrid system are shown in Table 3. The performance of ISA-BCNN is better than that of ABCNN, which demonstrates the effectiveness of ISA.

C. THE EFFECT OF SELF-ATTENTION AND INTERACTIVE SELF-ATTENTION
We also investigate the effect of self-attention and interactive self-attention on the performance of our model, as shown in Table 4. First, the proposed method ISA-SNN achieves the best performance on CDD-ref and DBMI across all three evaluation metrics. In contrast, SA-SNN attains the best MSE score on the CDD-ful dataset. Second, self-attention decreases the MSE by nearly 1.232 compared with the baseline, but interactive self-attention increases the MSE by 0.002 relative to self-attention on CDD-ful. These results indicate that the semantic features (attention weights) learned by self-attention help to improve the semantic representation of a single sentence, but that interactive attention introduces a small amount of noise on CDD-ful. Nevertheless, interactive self-attention yields better results than self-attention on CDD-ref and DBMI. The main reason may be that interactive attention enlarges the semantic difference between dissimilar sentences and reduces the semantic difference between similar sentences, which improves the discrimination ability in textual similarity estimation.

D. ANALYSIS OF ATTENTION SCORE VISUALIZATION BETWEEN ISA-SNN AND SA-SNN
To compare the difference between self-attention and interactive self-attention, we visualize the attention scores of a similar and a dissimilar sentence pair, as shown in Figure 2 and Tables 5 and 6. Figure 2(a) shows the comparison of attention scores on a similar sentence pair, namely, ''Patient has contact information and understands the need to call with any questions or concerns.'' (sentence A) and ''The patient knows how to reach us any time with questions or concerns.'' (sentence B). For ISA-SNN, the gap between the maximum and minimum attention weights within each sentence is reduced for the similar pair, whereas the opposite is observed for the dissimilar pair, as shown in Tables 5 and 6. These results reveal that the difference between the maximum and minimum values is reduced in similar pairs but enlarged in dissimilar sentence pairs. Therefore, interactive self-attention helps to reduce the semantic difference within a similar sentence pair and enlarge the semantic difference within a dissimilar sentence pair.

E. PERFORMANCE COMPARISON OF OUR METHOD AND BASELINE WHEN REGARDING AS BINARY CLASSIFICATION TASK
Estimating the similarity between sentences can also be regarded as a sentence pair classification task. The performance on the converted binary classification task, as described above, is shown in Table 7. The SNN achieves the best recall rate, but the proposed method shows the best F1-score and accuracy on DBMI and CDD-ref. SA-SNN obtains the best recall rate, F1-score, and accuracy on CDD-ful, and ISA-SNN obtains the best precision and recall rate on CDD-ref and DBMI. The SNN extracts the semantic information between sentences via shared weights, so it introduces almost no noise. In contrast, the introduction of self-attention and interactive attention reduces the number of false negative examples but causes a small amount of noise, which increases the number of false positive examples and results in a decrease in the recall rate. Overall, despite a small decrease in the recall rate compared with the baseline, the F1-score and accuracy are significantly improved, with improvements of 0.022 and 0.044 on the CDD-ref corpus, and 0.153 and 0.243 on the DBMI corpus. However, on CDD-ful, the performance of SA-SNN is better than that of ISA-SNN; the reasons are discussed below.
Moreover, to analyze the generalization performance of the three methods, we plot the ROC curves and compute the AUC scores on the DBMI dataset, as shown in Figure 3. The figure demonstrates that ISA-SNN outperforms the other methods on the classification task and is more suitable for application to other sentence pair datasets.

F. ANALYSIS OF CDD-FUL/-REF
The aforementioned results show that the similarity and classification performance of SA-SNN on CDD-ful is better than that of ISA-SNN. The direct reason may be that the interactive semantic information introduces a slight amount of noise on this corpus. However, on the CDD-ref corpus, which is drawn from the same literature [36], ISA-SNN performs better. Therefore, we compare the two datasets with respect to the labeled score distribution, sentence length, and text quality. First, we compute statistics on the distribution of similarity levels of the judged sentence pairs in CDD-ful/-ref. Figure 4 shows that the number of sentence pairs with scores of 4-5 in CDD-ful is almost twice that in CDD-ref, while the number of pairs with scores of 2-3 shows the opposite pattern. Second, as shown in Figure 5, CDD-ful contains more sentence pairs with more than 120 words than CDD-ref does. Third, CDD-ful contains lower-quality groups, such as query numbers 1983390001 and 1935450007. These differences may be due to the different sources of CDD-ful and CDD-ref (i.e., the PubMed Central database and the Conserved Domain Database, according to the literature [35]). Hence, the reasons for the different performance on the two datasets may be that i) interactive semantic information introduces noise when handling similar or long sentences, which reduces the discrimination ability between sentences; and ii) interactive attention slightly reduces the ability to estimate similarity on noisy corpora.

VI. CONCLUSION AND FUTURE WORK
In this paper, we propose an interactive self-attention mechanism and introduce it into a Siamese neural network with bidirectional LSTM units (ISA-SNN), which enhances the semantic representation of a single sentence via self-attention on the basis of the shared parameters of the Siamese network. Furthermore, the inherent interactive semantic information between sentences is learned by means of interactive attention, which contributes to reducing the semantic difference within a similar sentence pair and enlarging the semantic difference within a dissimilar sentence pair. We conduct experiments on three biomedical datasets, and the results indicate that the proposed model achieves higher performance in measuring biomedical textual similarity on all three. However, further improvement is needed for long sentences and noisy datasets. Therefore, we will investigate in future work whether a multi-head self-attention mechanism can further optimize our model. In addition, external biomedical knowledge may be incorporated into our model.

ZHENGGUANG LI received the master's degree in computer science from Dalian Jiaotong University, in 2007. He is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Dalian University of Technology. His research interests include text mining and natural language processing in social media and biomedical fields.
HONGFEI LIN received the B.Sc. degree from Northeast Normal University, in 1983, the M.Sc. degree from the Dalian University of Technology, in 1992, and the Ph.D. degree from Northeastern University, in 2000. He is currently a Professor with the School of Computer Science and Technology, Dalian University of Technology, where he is also the Director of the Information Retrieval Laboratory. He has published over 100 research articles in various journals, conferences, and books. His research interests include information retrieval, text mining, natural language processing, and affective computing. In recent years, he has focused on text mining for biomedical literature, biomedical hypothesis generation, information extraction from large biomedical resources, learning to rank, sentiment analysis, and opinion mining. His research projects are funded by the National Natural Science Foundation of China.

He has published more than 20 research articles on topics in biomedical literature data mining. His current research interests include biomedical literature data mining and information retrieval.
JIAN WANG received the B.S., M.S., and Ph.D. degrees from the Dalian University of Technology, China, in 1988, 1993, and 2014, respectively. She is currently a Professor with the School of Computer Science and Technology, Dalian University of Technology. Her current research interests include biomedical literature data mining, information retrieval, and natural language processing.