Biomedical Text Similarity Evaluation Using Attention Mechanism and Siamese Neural Network

Estimating the similarity of biomedical sentence pairs is a crucial task. Siamese neural networks (SNN) achieve good performance on non-biomedical corpora. However, an SNN alone cannot obtain satisfactory biomedical text similarity results because biomedical sentences are long and syntactically complex. In this paper, a cross self-attention (CSA) is proposed and used to design a new attention mechanism, namely self2self-attention (S2SA). The S2SA is then introduced into the SNN to construct a novel self2self-attentive Siamese neural network, namely S2SA-SNN. In the S2SA-SNN, self-attention is used to learn the different weights of words and the complex syntactic features within a single sentence. The CSA is used to learn the inherent interactive semantic information between sentences; it employs self-attention instead of global attention to perform cross attention between sentences. Finally, three biomedical benchmark datasets are used to test the effectiveness of the S2SA-SNN, which achieves Pearson correlations of 0.66 on DBMI and 0.72/0.66 on CDD-ful/-ref. The experimental results show that the S2SA-SNN achieves better performance with pre-trained word embeddings and better generalization ability than the compared methods.


I. INTRODUCTION
With the growing amount of biomedical information, more and more medical texts have been accumulated. Biomedical text similarity evaluation is a critical task in drug-drug interaction (DDI), question answering [1]-[3], etc. Although some researchers utilized biomedical resources [4] to improve similarity evaluation performance, the generalization of these methods is poor due to the limitation of resources and corpora. Therefore, deep neural network-based methods, such as character-based [5] and inter-weighted alignment [6] models, have been proposed. Furthermore, some researchers utilized word embedding [7], sentence embedding [8], [9] and shared sentence encoders [10] to obtain sentence semantic representations and estimate the similarity. Meanwhile, some researchers employed Siamese neural networks (SNN) [11], consisting of dual recurrent neural networks with shared parameters, to model sentence pairs and compute the similarity via a distance function. In addition, attention mechanisms [12] are integrated with SNN to focus on crucial words, which have an important impact on the sentence semantic representation. These neural networks with attention mechanisms achieve good results, but they ignore the importance of interactive information between sentences. Therefore, some methods apply interactive attention [13] and cross attention [14] mechanisms to obtain the interactive semantic information between sentences. This interaction enhances the semantic information of the two sentences and improves semantic similarity estimation performance.
The associate editor coordinating the review of this manuscript and approving it for publication was Jesus Felez.
Even though these methods with interactive/cross attention mechanisms are effective on non-biomedical datasets, their performance on biomedical corpora is unsatisfactory owing to the long-range dependencies [15] and complex syntactical structure [16] of biomedical text. Inspired by self-attention [15] and interactive attention [17], interactive self-attention was proposed in our previous work [18]. Other methods have also been proposed for this field [19]-[24]. However, semantic loss might be introduced by the semantic vector averaging operation in the interactive attention. Therefore, interactive attention is replaced by cross attention in this paper, forming a novel attention mechanism named cross self-attention (CSA). Meanwhile, a hybrid attention mechanism that integrates self-attention and CSA is proposed, i.e., self2self-attention (S2SA). The S2SA mechanism is introduced into SNN to evaluate the similarity of sentence pairs and to verify the effectiveness of S2SA. The proposed attention mechanism consists of self-attention within a single sentence and CSA between sentences. Firstly, it learns the attention weights between words and the complex syntactical features of long/complex biomedical sentences via self-attention within a single sentence. Secondly, it employs CSA to obtain interactive semantic information. This interactive information helps to enhance the sentence semantic representation and to alleviate the semantic loss of the interactive attention network [13].
The main contributions of this paper can be summarized as follows:
• A cross self-attention mechanism is proposed to realize semantic interaction between sentences and reduce semantic loss to a certain extent.
• A self2self-attention mechanism, composed of self-attention within a single sentence and cross self-attention between sentences, is proposed to estimate the semantic textual similarity.

II. S2SA-SNN
Semantic textual similarity estimation at the sentence level involves two sentences. Given one sentence $X$ and another sentence $Y$, the goal of the proposed model is i) to learn the sentence semantic representations of $X$ and $Y$, and ii) to calculate a score that measures their similarity, or to obtain the output of a Softmax activation function, from the semantic representations. As shown in FIGURE 1, the model first learns a basic semantic representation via the Siamese neural network, whose branches are double-layer bi-LSTM networks and which takes biomedical word embeddings as inputs to obtain context information for each word (Sec. A). In this paper, biomedical texts are mainly collected from biomedical literature or clinical notes, and their sentences are of medium or long length and syntactically complex. Learning long-range dependencies is a key challenge in these sentence pairs. Thus, the self-attention mechanism is introduced into our model to learn the semantic vector of each word in a sentence (Sec. B). Moreover, in the biomedical literature, researchers describe the same contents/opinions using sentences that consist of the same words (or synonyms and near-synonyms) in different positions. Although both interactive attention [17] and cross attention [14] can learn interactive semantic information, interactive attention might introduce semantic loss owing to the semantic vector averaging. Therefore, cross self-attention is proposed in Sec. C to obtain interactive semantic information. The hybrid attention, self2self-attention, consists of self-attention within a single sentence and CSA between sentences, because the semantic information within a single sentence and the interactive semantic information play different roles (Sec. D). Finally, the prediction of the proposed model is given by a similarity measure or an activation function (Sec. E).

A. SIAMESE NEURAL NETWORK
Siamese neural networks (SNN) consist of dual-branch networks with shared weights [11]. Therefore, they are applied to sentence/word pair tasks, such as textual similarity [25]. Moreover, bi-directional long short-term memory (bi-LSTM) has achieved good results on other biomedical NLP tasks such as Named Entity Recognition (NER) [26]. Furthermore, LSTM helps to solve the vanishing gradient problem of standard RNNs, in which backpropagated gradients become vanishingly small over long sequences. Hence, bi-LSTM networks are usually chosen as the branch networks of SNN. However, unlike the standard SNN, each branch network in this paper is a double-layer bi-LSTM network, owing to the syntactical complexity and sentence length of biomedical texts. Consider a sentence pair $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_m)$, where $n$ and $m$ are the lengths of $X$ and $Y$, respectively. Sentences $X$ and $Y$ are converted into word embedding matrices and then separately fed into the upper and lower bi-LSTM branch networks. At each time-step $t$, each branch network produces a forward and a backward hidden vector, from which the output of the SNN corresponding to $X$ and $Y$ is formed. Here, the hidden output of each LSTM cell is calculated by equations (3)∼(8).
where $x_t \in \mathbb{R}^d$ is the input at time-step $t$, $d$ is the feature dimension of each word, $\sigma$ is the element-wise sigmoid function, and $\odot$ is the element-wise product. $C_t$ is the memory cell, designed to lower the risk of vanishing/exploding gradients and thereby enable learning of dependencies over longer periods of time than is feasible with traditional recurrent networks. $\tilde{C}_t$ is the temporary state at time-step $t$.
The forget gate $f_t$ resets the memory cell. $i_t$ and $o_t$ denote the input and output gates, which control the input and output of the memory cell. tanh is the activation function.
FIGURE 1. SNN is used to learn basic semantic information of sentence pairs. S2SA is employed to capture local semantic information in a sentence and learn mutual semantics between sentences. Finally, we adopt the Manhattan distance formula to evaluate the semantic similarity, and Softmax to predict the classes.
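For reference, a standard LSTM cell that is consistent with the symbols defined above can be written as follows. This is the common textbook formulation and is only assumed to correspond to equations (3)∼(8); the weight matrices $W_\ast$, $U_\ast$ and the biases $b_\ast$ are notation introduced here for illustration, $h_{t-1}$ is the previous hidden output, $\tilde{C}_t$ is the temporary state, and $\odot$ denotes the element-wise product.

```latex
\begin{aligned}
i_t         &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
f_t         &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
o_t         &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{C}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
C_t         &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
h_t         &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
```

The additive update of $C_t$ is what allows gradients to flow across many time-steps, which is the property the memory cell description relies on.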

B. SELF-ATTENTION
In practice, some parts of a sequence are more relevant than others [27]; that is, each word contributes differently to the semantic representation of a sentence. Therefore, some researchers proposed attention mechanisms that assign different weights to denote these different contributions [28]. On the other hand, syntactic structure is relevant to the sentence semantic representation, i.e., the relationships between words in different positions influence the semantics differently. Furthermore, an appropriate mechanism must be chosen for biomedical sentences with complex syntactic structure. Self-attention obtains the weights of words via an attention operation performed from each word towards all words in the sentence. These weights represent the contributions of different words and syntactic structures to the semantic representation. Thus, self-attention is more suitable than other attention mechanisms for biomedical sentence semantic representation. An attention function can be regarded as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors [15]. The attention weight of each value is calculated by a compatibility function of the query with the corresponding key. Let the matrices $Q$, $K$ and $V$ denote a set of queries, keys, and values, respectively, with $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$ and $V \in \mathbb{R}^{m \times d_k}$, where $d_k$ is the dimension of each query $q$ in $Q$ and each key $k$ in $K$. The output matrix of self-attention is computed from $Q$, $K$ and $V$; the weight matrix $W$ is a learnable parameter and $X$ is the input matrix.
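The attention function described above can be sketched in NumPy as follows. This is a minimal illustration that assumes the scaled dot-product form (softmax of $QK^\top/\sqrt{d_k}$ applied to $V$) and separate projection matrices `W_q`, `W_k`, `W_v`; these names and the random inputs are hypothetical, and the paper's own parameterization of $W$ may differ.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sentence (assumed form).

    X: (n, d) hidden states of one sentence.
    W_q, W_k, W_v: (d, d_k) projections (hypothetical names).
    Returns the attended output (n, d_k) and the weight matrix (n, n).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 words, 8-dimensional hidden states
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape, attn.shape)
```

Row $i$ of `attn` holds the weights of word $i$ towards every word of the same sentence, which is the per-word contribution the section refers to.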

C. CROSS SELF-ATTENTION
Our previous work has verified the impact of semantic interaction on semantic textual similarity estimation between sentences. Although our previously proposed interactive self-attention improves performance owing to its semantic interaction, semantic loss might exist due to the averaging of semantic vectors. Therefore, cross self-attention (CSA) is proposed to reduce semantic loss. It directly adopts an operation similar to self-attention to implement the semantic interaction, replacing the interactive attention, as shown in Figure 2. Figure 2(a) shows the basic framework of interactive self-attention, while the architecture of S2SA is shown in Figure 2(b). Parts A and B in Figure 2(a) and Figure 2(b) both denote the self-attention operation. Part C in Figure 2(a) denotes the interactive operation. Parts Cxy and Cyx in Figure 2(b) denote CSA. Thus, CSA replaces the averaging of semantic vectors with a self-attention-like operation to reduce semantic loss. Given the two semantic vectors $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$, according to the definition of self-attention, the $Q$, $K$, $V$ of $X$ and $Y$ are defined through $W_X$ and $W_Y$, the weight matrices of $X$ and $Y$, respectively. Then, the cross self-attention is described by equation 13,
where $CSA(X, Y)$ denotes the interactive semantic information from $Y$ towards $X$, and $d_X$ represents the dimension of each vector in $X$. To obtain the interactive semantic vector from $Y$ towards $X$, the dimensions of $X$ and $Y$ must be equal.
To describe clearly how the cross self-attention works, an example is illustrated in Figure 3. We take a sentence pair (sentence X: ''The patient was transferred to the Title.''; sentence Y: ''The plan was discussed with patient/family and they are in agreement.'') as an example. Part (1) in Figure 3 describes the CSA operation from sentence X towards sentence Y. The semantic similarity between a word in sentence X and each word in sentence Y is calculated, forming a vector. Repeating this for every word in sentence X yields a matrix. The output of the Softmax operation on this matrix denotes the attention weights of sentence X towards sentence Y. Part (2) shows the opposite direction. Part (3) denotes the self-attention operation of sentence X, in which the semantic similarity between a word in sentence X and each word in sentence X is calculated.
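The procedure in this example can be sketched as follows. The code is an illustration under stated assumptions rather than the authors' equation 13: queries are taken from one sentence, keys and values from the other, scores are scaled by the square root of the vector dimension, and a single projection per sentence (`W_x`, `W_y`, hypothetical names) produces $Q$, $K$ and $V$. Random vectors stand in for the hidden states of the 7-word sentence X and the 11-word sentence Y (whitespace tokenization).

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_self_attention(X, Y, W_x, W_y):
    """Attend from the words of X to the words of Y (assumed form).

    X: (n, d), Y: (m, d); W_x, W_y: (d, d) projections (hypothetical).
    Returns an (n, d) interactive representation and (n, m) weights.
    """
    Q_x = X @ W_x                  # queries from sentence X
    K_y = V_y = Y @ W_y            # keys and values from sentence Y
    d = Q_x.shape[-1]
    weights = softmax(Q_x @ K_y.T / np.sqrt(d))  # row i: word i of X over Y
    return weights @ V_y, weights

rng = np.random.default_rng(1)
X = rng.normal(size=(7, 8))    # stand-in for sentence X (7 words)
Y = rng.normal(size=(11, 8))   # stand-in for sentence Y (11 words)
W_x, W_y = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

out_xy, w_xy = cross_self_attention(X, Y, W_x, W_y)  # X towards Y
out_yx, w_yx = cross_self_attention(Y, X, W_y, W_x)  # Y towards X
print(w_xy.shape, w_yx.shape)
```

Because no vectors are averaged, every word keeps its own row of weights over the other sentence, which is the motivation given for replacing the interactive attention.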

D. SELF2SELF-ATTENTION
Existing methods put more emphasis on learning the separate sentence representations. For instance, Tang et al. [10] utilized the rich annotation data of a rich-resource language to perform semantic textual similarity between sentences. Zhu et al. [29] proposed a dependency-based LSTM model to learn sentence representations. Their experimental results show that the semantic information of a single sentence also plays an important role. The CSA helps the two sentences enhance each other's semantic representation. Meanwhile, self-attention can precisely capture the semantic information of long biomedical sentences and reduce the long-range dependency problem. In other words, the two are highly complementary. Therefore, it is necessary to combine the advantages of self-attention and CSA. The self2self-attention, integrating self-attention with CSA, is proposed in this paper. Furthermore, to avoid the semantic loss caused by pooling the hidden states, the attention is applied directly over the final hidden state of our Siamese neural network. The hybrid attention thus contains i) self-attention within $X$ or $Y$, and ii) self-attention between $X$ and $Y$, namely CSA. CSA stands for the mutual attention between $X$ and $Y$ and consists of two parts: the CSA of $X$ towards $Y$ and the CSA of $Y$ towards $X$. Here, $X$ and $Y$ refer to the final hidden states of the corresponding branch networks in the Siamese neural network.
According to the definition in section C, the final output matrix of self2self-attention is given by equations 14 and 15, which denote the basic semantic representation of sentence $X$ and $Y$, respectively. $\lambda$ is a learning parameter or hyperparameter, and $d$ is the length of a sequence.
VOLUME 9, 2021

E. OUTPUT LAYER
The aforementioned outputs of self2self-attention are the final semantic representations of the sentence pair. Moreover, the sentence pair tasks are generally regarded as similarity estimation or prediction classification. Therefore, the evaluation functions in the output layer are defined as follows.
where $C_x$ and $C_y$ are the outputs of the self2self-attention layer as shown in equations 14 and 15. Then the evaluation score on the test dataset is computed by evaluation functions such as Pearson, Spearman, Jaccard coefficient, etc.
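As an illustration of a Manhattan-distance similarity between the two representations, the sketch below uses the exponentiated negative L1 distance, the form used by MaLSTM [25]. Whether the paper's own similarity equation has exactly this form is an assumption; the function name and the toy vectors are hypothetical.

```python
import numpy as np

def manhattan_similarity(c_x, c_y):
    """exp(-||c_x - c_y||_1), which maps the L1 distance into (0, 1].

    Assumed form of the similarity function; inputs are flattened so
    that vectors and matrices of equal shape are both accepted.
    """
    c_x = np.ravel(np.asarray(c_x, dtype=float))
    c_y = np.ravel(np.asarray(c_y, dtype=float))
    return float(np.exp(-np.abs(c_x - c_y).sum()))

same = manhattan_similarity([0.2, 0.5], [0.2, 0.5])   # identical inputs
far = manhattan_similarity([0.0, 0.0], [1.0, 1.0])    # L1 distance of 2
print(same, far)
```

Identical representations give a similarity of 1, and the value decays towards 0 as the L1 distance grows; predictions on this scale would need rescaling before comparison with annotated scores such as [0-5].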

2) PREDICTION CLASSIFICATION
To predict the classes, the sentence semantic representations of a sentence pair are concatenated to form the final vector, which is then fed into a Softmax layer to predict the result as shown in equation 17,
where $\hat{y}$ is the predicted class and $pair(x, y)$ denotes a sentence pair.
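The classification step can be sketched as concatenation followed by a Softmax layer. The dense weights `W` and bias `b` below are hypothetical placeholders for learned parameters, and the toy inputs are chosen only to make the arithmetic easy to follow; this is not a reproduction of equation 17.

```python
import numpy as np

def predict_class(c_x, c_y, W, b):
    """Concatenate two sentence representations and apply a Softmax layer.

    W: (num_classes, len(c_x) + len(c_y)), b: (num_classes,) are
    hypothetical learned parameters. Returns (class index, probabilities).
    """
    pair_vec = np.concatenate([np.ravel(c_x), np.ravel(c_y)])
    logits = W @ pair_vec + b
    logits = logits - logits.max()           # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(probs.argmax()), probs

c_x = np.array([1.0, 0.0])
c_y = np.array([0.0, 1.0])
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 3.0]])
b = np.zeros(2)
label, probs = predict_class(c_x, c_y, W, b)  # logits are [1, 3]
print(label, probs)
```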

III. EXPERIMENTS
In this section, the proposed model, named S2SA-SNN, is evaluated using three biomedical datasets. Firstly, the experimental datasets and evaluation metrics are introduced. Then we describe the hyperparameters and related resources. Finally, we list the results of our model and other methods.
Moreover, the three corpora mentioned above are converted into binary classification datasets for conducting classification experiments. The annotated scores of DBMI ([0-5]) and CDD-ful/-ref ([1-5]) are converted into class labels (0 or 1). The middle value is $m_v = \frac{a+b}{2}$, where $a$ and $b$ are the upper and lower boundaries of the annotated interval, respectively. The class label is 1 when the annotated score is larger than $m_v$; otherwise, the class label is 0.
Finally, the Pearson ($r$) and Spearman ($\rho$) correlation coefficients and the mean square error (MSE) are used to evaluate the performance of similarity estimation. Meanwhile, accuracy, precision, recall, and F1-score are adopted to evaluate the performance of the binary classification task.
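The score-to-label conversion described above can be sketched directly; the function name and the sample scores are illustrative only.

```python
def binarize_scores(scores, lower, upper):
    """Label a score 1 if it is strictly above the interval midpoint, else 0."""
    mid = (lower + upper) / 2
    return [1 if s > mid else 0 for s in scores]

# DBMI scores lie in [0, 5], so the midpoint is 2.5.
dbmi_labels = binarize_scores([0.0, 2.5, 3.0, 5.0], lower=0, upper=5)
# CDD-ful/-ref scores lie in [1, 5], so the midpoint is 3.0.
cdd_labels = binarize_scores([1, 3, 3.5, 5], lower=1, upper=5)
print(dbmi_labels)  # [0, 0, 1, 1]
print(cdd_labels)   # [0, 0, 1, 1]
```

Note that a score exactly equal to the midpoint falls into class 0, because the rule requires the score to be larger than $m_v$.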
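For readers who want to reproduce the metrics, the sketch below computes them with NumPy using their usual definitions. Tie handling in the Spearman ranks (average ranks) and the convention of returning 0 when a denominator is zero are choices made here, and the toy data are invented for illustration.

```python
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def average_ranks(v):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    v = np.asarray(v, float)
    ranks = np.empty(len(v))
    ranks[np.argsort(v, kind="mergesort")] = np.arange(1, len(v) + 1)
    for value in np.unique(v):
        tied = v == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mse(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(((x - y) ** 2).mean())

def classification_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    accuracy = float((y_true == y_pred).mean())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

gold = [0.0, 1.5, 3.0, 4.0, 5.0]
pred = [0.5, 1.0, 2.5, 4.5, 4.0]
print(pearson(gold, pred), spearman(gold, pred), mse(gold, pred))
print(classification_scores([1, 1, 0, 0], [1, 0, 1, 0]))
```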

B. HYPERPARAMETER AND RELATED RESOURCES
We implemented our models using Keras running on top of a TensorFlow backend with Python 3.6. The pre-trained biomedical word embeddings can be obtained from http://evexdb.org/pmresources/ngrams/PubMed/ on the website http://bio.nlplab.org/. The mean square error and the cross-entropy error are used as the loss functions of the similarity estimation and classification tasks, respectively. $\lambda$ in eq. 14 and eq. 15 is set to 0.5 due to the interchangeability of the textual semantic similarity estimation task. Other hyperparameters are shown in TABLE 2.

IV. RESULTS AND ANALYSIS

A. BASELINES AND OUR MODELS
To demonstrate the effectiveness of the proposed model, we compare it against multiple baseline methods and state-of-the-art approaches for the sentence pair similarity estimation task on other corpora.
• MaLSTM: proposed by Mueller [25], achieving the state of the art results on SICK [30] corpus.
• ImprovedSNN: proposed by Chi and Zhang [31], employing hierarchical attention [28] to give different words different attention weights and achieving better results on a large dataset downloaded from Stanford Web.
• AttentiveSNN: proposed by Bao et al. [32], regarding the attention weights as the coefficients of the Manhattan distance and achieving a higher Pearson correlation score than other methods on a cross-lingual textual similarity corpus.
• SNN: our baseline, similar to MaLSTM, but with double bi-LSTMs employed in each branch network.
• IA-SNN: our baseline introducing interactive attention [17] into the Siamese neural network.

... attention and improves semantic representation by means of interactive attention. Therefore, compared with AttentiveSNN, the MSE score of IAN increases by 5%, and it achieves the best Spearman correlation coefficient on CDD-ref. Moreover, its overall performance is better than that of the other three methods on DBMI, while there is a decrease on CDD-ful. Furthermore, IAN attains the worst performance among all the methods with an attention mechanism on CDD-ful. This shows that the performance of IAN may depend on the quality of the corpus and the complexity of the sentences in the datasets. The analysis of the corpora will be given later. The ISA-SNN outperforms all the methods above on the three corpora owing to alleviating long-range dependencies by self-attention. Finally, the proposed model achieves the best results by means of the self2self-attention on the three datasets, except for the Spearman correlation coefficient on CDD-ref. The final Pearson correlation coefficient is increased to 0.661, approaching the official value of 0.678 on CDD-ref [33]. The reason for the better performance of our model on the three corpora may be that the proposed self2self-attention not only helps our model represent the semantics more precisely via self-attention within a single sentence, but also enhances the sentence semantic representation through cross self-attention. Moreover, the results of the proposed model are higher than those of our previously proposed model. This shows that the vector averaging in IAN causes semantic loss and that the CSA is useful for reducing the effect of this problem.

C. THE EFFECT OF SELF-ATTENTION AND CROSS SELF-ATTENTION
We also investigate the effect of self-attention and cross self-attention on the performance of our model. Table 4 shows the results of the two attention mechanisms on CDD-ful/-ref and DBMI. The baseline is our Siamese neural network (SNN), in which each branch network contains a two-layer bi-LSTM. Then, self-attention is introduced into our SNN (named SA-SNN). Finally, cross self-attention (CSA) is integrated with self-attention to form self2self-attention, which is added into the SNN (named S2SA-SNN). Firstly, with self-attention, the three evaluation scores on the three datasets are improved significantly. Moreover, the Pearson and Spearman correlation coefficients are increased by 0.16 and 0.17, respectively, on CDD-ref. This excellent performance may benefit from the precise semantic representation, i.e., the weights of words and the syntactic structure learned by self-attention within the query and answer. Secondly, on the DBMI, CDD-ref and CDD-ful datasets, the cross self-attention yields a boost of up to 0.1, 0.09 and 0.06 in Pearson correlation coefficient over self-attention, respectively. Therefore, the enhanced sentence semantic representation gained by cross self-attention between the query and the answer is useful for improving performance. Although S2SA-SNN achieves the best results on CDD-ref, its increase over self-attention is the smallest one. On the contrary, the increase of SA-SNN in the Pearson and Spearman correlation coefficients over the baseline is the largest. Therefore, the differences among the three datasets are investigated in the following part of this section.
The largest difference among the datasets is the proportion of long sentence pairs. As shown in FIGURE 4, the proportion of sentence pairs with more than 80 words is higher in CDD-ful than in CDD-ref and DBMI (20.6%, 10.7%, 4.6%, respectively). In addition, more special tokens such as ''Figure'' and ''()'' are found in CDD-ful. This indicates that i) self-attention is more effective than other attention mechanisms for estimating the similarity between long and complex sentences, and ii) although cross self-attention helps to improve the performance on different corpora, it is more effective for short and medium sentences. Furthermore, S2SA-SNN also improves the results on corpora with a small amount of noise. To further illustrate the performance of the proposed method on long sentence pairs, the test sets are divided into long and other sentence pairs and the evaluation scores are recalculated, as shown in Table 5. This table shows that the improvement of SA-SNN is larger than that of the other three attentive methods on the long sentence pairs of CDD-ful, and that S2SA-SNN achieves the best results. On the short/medium sentence pairs, S2SA-SNN outperforms the other methods, while the performance of ImprovedSNN exceeds that of SA-SNN. Meanwhile, the results obtained by S2SA-SNN are better than those of the other methods on the long sentence pairs of CDD-ref, but the results of SA-SNN are the worst relative to the other methods. Furthermore, neither SA-SNN nor S2SA-SNN attains the best results on the DBMI corpus.

D. PERFORMANCE COMPARISON OF OUR METHOD AND BASELINE WHEN REGARDING AS BINARY CLASSIFICATION TASK
To investigate the classification ability on sentence pairs, experiments with the converted binary classification datasets are conducted using SNN, SA-SNN, and S2SA-SNN. The performance on the tasks mentioned above is shown in Table 6. The SNN achieves the best precision on CDD-ful, but the worst F1-score and accuracy on the three datasets. However, the recall, F1-score, and accuracy of the S2SA-SNN are higher than those of SNN and SA-SNN on CDD-ful. The results reveal that applying self2self-attention to corpora with mainly long and syntactically complex sentences has some advantages over methods without attention and with self-attention only, due to the weights and complex syntactic features learned by self-attention and the interactive semantic information obtained by cross self-attention between sentences. However, the recall of the S2SA-SNN outperforms that of the SA-SNN, while the precision of the SA-SNN is better than that of the S2SA-SNN on DBMI. Meanwhile, S2SA-SNN only attains the best precision and accuracy relative to other methods. Therefore, the overall classification results achieved by S2SA-SNN are worse than those of SA-SNN on short/medium sentence pairs. The reason may be that the introduction of cross self-attention causes a small amount of noise that has an impact on the classification performance.
In addition, to analyze the generalization performance of the three methods, we draw the ROC curves and compute the AUC scores on the CDD-ful dataset, as shown in Figure 5. The AUC of the S2SA-SNN (0.90) exceeds that of the SNN without attention (0.80) and that of the SA-SNN with self-attention (0.86) in the classification task. Thus, the S2SA-SNN is more suitable for application to other sentence pair datasets with long or syntactically complex sentences.

... Table 7. Firstly, there is no obvious difference in the time efficiency of the six methods, only a few milliseconds. The evaluation of S2SA-SNN took three milliseconds longer than that of SNN. Moreover, in terms of CPU and GPU occupancy, there is almost no difference. The GPU and CPU occupancy of SNN is slightly higher than that of S2SA-SNN.
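For reference, an AUC score of the kind reported for Figure 5 can be computed from labels and predicted scores as below. The sketch uses the rank-statistic definition (the probability that a randomly chosen positive pair is scored above a randomly chosen negative pair, with ties counted as one half), which equals the area under the ROC curve; the labels and scores are toy values, not data from the paper.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]           # all positive/negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

The pairwise form needs memory proportional to the number of positive/negative pairs, so a sorting-based implementation is preferable for large test sets.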
Combining the results on short/long sentences, the binary classification results, and the computational efficiency analysis, S2SA has the following advantages: i) it is more suitable for sentence pair corpora with long or syntactically complex sentences; ii) the generalization of S2SA is better than that of SNN and SA-SNN; iii) its computational efficiency is not lower than that of other methods. However, it also has some limitations. First, its performance is not better than that of the simple SNN on datasets with more short sentences. Second, it is sensitive to noisy data; thus, it is not recommended for datasets with noisy texts.

V. CONCLUSION AND FUTURE WORK
In this paper, a cross self-attention is proposed and integrated with self-attention to design a novel hybrid attention mechanism, namely the self2self-attention mechanism. The proposed hybrid attention is introduced into the Siamese neural network with bidirectional LSTM, called the self2self-attentive Siamese neural network (S2SA-SNN). It can represent the semantics of a single sentence more precisely via self-attention on the basis of the shared parameters of the Siamese network. Moreover, inherent interactive semantic information between sentences is learned via the cross self-attention. The semantic loss is alleviated by CSA owing to the removal of the vector averaging operation. Consequently, the interactive information learned by CSA contributes to enhancing the sentence semantic representation and improving the overall performance. Furthermore, we conduct experiments on three biomedical datasets. Experimental results indicate that the proposed model for measuring biomedical textual similarity and classifying sentence pairs performs better on the three datasets. The analyses of long/short sentences and of the corpora indicate that self2self-attention is more suitable for datasets with long or syntactically complex sentences and little noise. Our model depends on traditional context-independent word embeddings only, in order to verify the effectiveness of cross self-attention and self2self-attention. In addition, external biomedical knowledge can be combined into our model.

Since 2012, she has been an Associate Professor with China West Normal University, Nanchong. Her research interests include artificial intelligence, optimization method, and image processing.