Re-Ranking System with BERT for Biomedical Concept Normalization

In recent years, various neural network architectures have been successfully applied to natural language processing (NLP) tasks such as named entity normalization. Named entity normalization is a fundamental task for extracting information from free text: it aims to map entity mentions in a text to gold-standard entities in a given domain-specific ontology. However, the normalization task in the biomedical domain remains challenging because of multiple synonyms, various acronyms, and numerous lexical variations. In this study, we regard biomedical entity normalization as a ranking problem and propose an approach to rank normalized concepts. We additionally employ two factors that can notably affect normalization performance: task-specific pre-training (Task-PT) and a calibration approach. On five different biomedical benchmark corpora, our experimental results show that our proposed model achieved significant improvements over previous methods and advanced the state-of-the-art performance for biomedical entity normalization, with up to a 0.5% increase in accuracy and a 1.2% increase in F-score.


I. INTRODUCTION
With the rapid development of computational technology, a large amount of literature has accumulated on various topics regardless of domain. Based on this large amount of text data, many researchers have constructed knowledge bases (KBs) of domain-specific ontologies. Such knowledge bases are generally useful in many applications, from the general domain to specialized domains such as biomedicine, and are beneficial for extracting key information related to entities of interest [1]. Because newly discovered biomedical evidence is written in natural language, accurate and efficient extraction of information from unstructured data has become important in natural language processing (NLP) [2], [3].
Named entities are meaningful terms or multi-word phrases, and named entity recognition (NER) is an important task for identifying named entities and classifying the domain of pre-defined entities or entity types from informal texts [4]. After named entities in texts have been recognized, the next step is named entity normalization, which maps recognized mentions to suitable identifiers (IDs) in a pre-defined dictionary. The entity normalization task in the biomedical domain is necessary to resolve semantic ambiguity, as each biomedical entity may be written in numerous forms [5]. For example, although 'cancer' and 'tumor' are apparently different forms in the text, they can be normalized to 'neoplasms' with the same concept ID (MeSH:D009369). On the other hand, 'AS' can be expanded to various words after abbreviation resolution, such as 'Angelman Syndrome (MeSH:D017204)' or 'Ammonium Sulfate (MeSH:D000645).' Although many researchers consider ambiguity resolution to avoid these difficulties, the normalization task in the biomedical domain is still challenging because of multiple synonyms, various acronyms, and numerous lexical variations [6].
The goal of this study is to improve the performance of biomedical entity normalization by utilizing different scoring schemes between mentions and concept names. To achieve this goal, we generate a list of candidate concept names for each input biomedical mention, sorted by similarity score, and then re-rank the retrieved candidate concept names with our scoring systems. Additionally, our architecture focuses on the following two points: a semi-supervised learning model using unlabeled task-specific data [7] and a calibrated classification model [8]. We evaluate our system using five biomedical corpora with four entity types and various assessment methods. Our experimental results show that the proposed method significantly outperforms the existing state-of-the-art (SOTA) models on biomedical corpora for the task of normalization. The main contributions of our proposed study are as follows: (i) we demonstrate the effectiveness of word representations from pre-trained language models (LMs) rather than context-independent representations; (ii) we utilize pre-trained LMs with task-specific sentences for ranking tasks in biomedical normalization; (iii) we show that our models employing the calibration method achieve significant improvements in normalization performance; and (iv) we show that a simple but effective strategy of incorporating two different scoring systems is a key factor in the performance improvement of our models.

II. RELATED WORKS
Biomedical entity normalization is a long-standing and important task in the biomedical NLP domain [9]-[11], and the goal of a biomedical normalization task is to map a mention in a document to a unique concept ID in a biomedical ontology [12]. Various challenges have been organized to solve the normalization problem, and many researchers have participated in these assessments of NLP methods. As one of the representative challenges, the BioCreative workshops provided a set of biomedical tasks to encourage NLP research and related applications. Focusing specifically on the normalization track, the BioCreative I, II, and III workshops were designed to address the normalization of gene names [13]-[15], and the BioCreative V workshop aimed to normalize disease and chemical mentions from MEDLINE abstracts [16]. CLEF eHealth has been running an annual evaluation campaign in the medical and biomedical domain, and the Shared Annotated Resources (ShARe) project has created a disorder mention corpus from clinical texts. Accordingly, the ShARe/CLEF eHealth 2013 Challenge offered an NER task for disorder mentions in clinical notes, along with a normalization task mapping them to unique identifiers [17]. Furthermore, a part of the SemEval workshop was designed as a follow-up to the ShARe/CLEF eHealth 2013 Challenge. Using the ShARe/CLEF corpus, the SemEval-2014 Task 7 (Task B) [18] and the SemEval-2015 Task 14 (Task 1) [19] organized open challenges to recognize the span of a disorder mention in clinical text and to normalize the disorder to a unique CUI in the SNOMED-CT subset of the UMLS terminology. The Text Analysis Conference (TAC) is another series of workshops organized to assess a variety of NLP methods. To detect the adverse drug reactions (ADRs) described in the structured product labels of drugs, the TAC 2017 challenge consisted of several intermediate tracks, including ADR extraction from drug labels and normalization through the MedDRA terminology [20].
To provide various NLP tasks with annotated data in the clinical domain, the Informatics for Integrating Biology and the Bedside (i2b2) project has organized a series of shared tasks since 2006. In 2010, i2b2, together with the VA Salt Lake City Health Care System, ran a medical NLP workshop for clinical records, called the 2010 i2b2/VA challenge, and released a manually annotated corpus of patient reports [21]. The MCN (medical concept normalization) corpus is a subset of discharge summaries from the fourth i2b2/VA 2010 shared task [22], and this corpus was utilized as a shared-task dataset in the 2019 National NLP Clinical Challenges (n2c2)/Open Health NLP (OHNLP) track 3 [23].
The community-wide tasks have greatly promoted biomedical NLP research by building benchmark datasets and innovative methods. Through these challenges, many researchers have examined various techniques, such as dictionary-based, rule-based, machine learning-based, and deep learning-based methods.
The most common traditional normalization approaches are dictionary-based and rule-based methods, which use pattern matching based on dictionary lookup and heuristic matching rules, respectively. The sieve-based system [12] is a cascade architecture based on ten kinds of manual rules, and Apache Lucene [24] is a Java-based text indexing and search engine library that ranks results by calculating the similarity between a document and a query. Although these approaches can be easily applied to broad areas such as disease, gene, and chemical name normalization tasks [25]-[29], they are often inefficient and less accurate for words not in the dictionary or for ungrammatical text with typos.
Although many normalization tools still tend to rely on the accuracy of well-constructed dictionaries or domain-specific rules, several studies have applied machine learning techniques to overcome these limitations. DNorm [10] proposed a pairwise learning-to-rank method to measure the similarities between entity mentions and candidate concepts. TaggerOne [30] is a machine learning-based system that jointly performs disease NER and normalization by utilizing semi-Markov models. Another machine learning technique for biomedical normalization is to utilize word representations in vector space. For instance, the Word2Vec-based method [6], the convolutional neural network (CNN)-based ranking method [31], BNE [32] using a long short-term memory (LSTM) network, and NormCo [33] using a gated recurrent unit (GRU) network proposed entity representation architectures to calculate semantic similarities between biomedical mentions and candidate concepts.
Along with the success of deep learning, recent studies have focused on a paradigm shift in NLP from task-specific training methods to fine-tuning approaches based on general-purpose LMs. Following this trend, the most commonly used pre-trained model is bidirectional encoder representations from Transformers (BERT) [34], based on the Transformer architecture [35]. BERT is a contextual language representation model that uses deep bidirectional representations pre-trained on unlabeled text. Recently, BERT has been adapted to the biomedical domain by further pre-training on additional corpora as follows:
BioBERT: BioBERT [36] is a domain-specific language representation model designed for biomedical text. The model is initialized with the checkpoint of BERT and then further pre-trained on PubMed abstracts and PubMed Central full-text articles. BioBERT achieves SOTA performance on various biomedical NLP tasks with task-specific fine-tuning while requiring only minimal architectural modifications.
SciBERT: Similar to BioBERT, SciBERT [37] is another BERT-based model following the same architecture as BERT. Although BERT was pre-trained using general-domain corpora, SciBERT was pre-trained from scratch using several scientific papers that consisted of the full text of computer science and biomedical domains. Furthermore, they constructed a new in-domain vocabulary on their scientific text corpora, called SciVocab.
PubMedBERT: PubMedBERT [38] is another pre-trained LM, following the same architecture as BERT. However, unlike the mixed-domain pre-training models, the weights of the PubMedBERT model were not initialized with those of BERT during pre-training. They constructed an in-domain vocabulary of the target biomedical domain and pre-trained from scratch on PubMed abstracts and additional data from PubMed Central full-text articles.
These fine-tuned versions of BERT-based models are often combined with various machine learning approaches to deliver good performance in biomedical normalization tasks. Ji et al. [39] applied an ensemble approach based on Lucene and a pair-wise BERT classifier, and Xu et al. [40] also proposed a hybrid system based on Lucene or a multi-class BERT classifier for the candidate generation, and a list-wise BERT classifier for ranking. BIOSYN [41] utilized entity representation from the BERT-based model and developed a synonym marginalization method with marginal maximum likelihood.

III. METHODOLOGY
In this section, we propose a method for entity normalization using the BERT-based model. First, we assume that an input mention m has its own concept ID c, and each c has at least one concept name n according to the dictionary. Our goal in this study is to assign a biomedical mention m to its unique concept ID c in the target dictionary. Formally, given a list of biomedical mentions M = {m_1, m_2, ...} from a document and a set of concept IDs C = {c_1, c_2, ...} and concept names N = {n_1, n_2, ...} from the ontology, the goal of concept normalization is to map the i-th mention m_i to its correct concept c* through a normalization function f:

c* = f(m_i; θ) = ID(argmax_{n ∈ N} Score(m_i, n))    (1)

where ID(n) is a function that returns the unique ID of the concept name n, and θ denotes the parameters of our normalization model. As shown in Fig. 1, our system for biomedical entity normalization consists of three steps: candidate concept generation, candidate concept ranking, and entity disambiguation. A detailed description of these steps is provided in the following sections.

A. CANDIDATE CONCEPT GENERATION
Traditional word embeddings such as Word2Vec [42] and GloVe [43] are context-independent representations, which assign a single vector to each word regardless of the word's meaning and position in the sentence. With advances in contextualized representations, including ELMo [44] and BERT [34], the ability to share contextual information of words in sentences has further improved performance on various NLP tasks and demonstrated that relatively simple models using contextualized embeddings can outperform complex models using non-contextualized embeddings [45]. Therefore, we employed contextual representation models (i.e., BERT-based models) to extract feature embeddings for candidate concept generation.
Recent studies suggest further pre-training of a pre-trained LM on in-domain data for task adaptation and show improved performance and effectiveness on downstream tasks in each target domain [7], [36], [37], [46]. To employ this strategy, we first collect corresponding texts from the same target task, using only the in-task text without any labels as task-specific pre-training (Task-PT) data. Subsequently, we take the original BERT-based models and run an additional phase of pre-training with the masked language model (MLM) and next sentence prediction (NSP) objectives on the Task-PT data.
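As a rough, self-contained illustration of the MLM objective used in this pre-training phase, the sketch below applies BERT-style dynamic masking: 15% of positions are selected, and of those, 80% become [MASK], 10% a random token, and 10% remain unchanged. The toy vocabulary and function names are our assumptions, not the paper's implementation.

```python
import random

MASK = "[MASK]"
VOCAB = ["cancer", "tumor", "gene", "protein", "cell"]  # toy vocabulary (assumption)

def mask_tokens(tokens, rng, mask_prob=0.15):
    """BERT-style dynamic masking for the MLM objective.

    Returns (inputs, labels): labels[i] is the original token at each
    selected position (the prediction target) and None elsewhere.
    """
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK                 # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(VOCAB)    # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels
```

During Task-PT, the model would be trained to recover the original tokens at the labeled positions.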
The candidate concept generation step constructs a list of candidate names N_m ⊆ N, which consists of the k most likely concept names in the ontology for a given mention m ∈ M. Based on the BERT-based model, we first extract embeddings e_m and e_n for the mention m and each concept name n ∈ N, respectively. BERT uses WordPiece tokenization [47], which splits an input word into pre-defined subword units to reflect rare words and morphological variation. To retain linguistic information, we sum the subword embeddings produced by the BERT encoder into one vector as the embedding of the input word. To retrieve the k most relevant candidate concept names for each mention m, we define the scoring function as a static BERT-score (Score_SB) of each pair (m, n) as follows:

Score_SB(m, n) = sim(e_m, e_n) ∈ [0, 1]    (2)

where sim(e_m, e_n) is the cosine similarity between the two vectors e_m and e_n.
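The retrieval step can be sketched as follows, assuming the mention and concept-name embeddings have already been extracted; the toy vectors and concept names used in the test are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_candidates(e_m, name_embs, k):
    """Rank concept names by Score_SB(m, n) = cos(e_m, e_n), descending,
    and return the k best candidate names for the mention embedding e_m."""
    ranked = sorted(name_embs.items(),
                    key=lambda item: cosine(e_m, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

In the full system, `name_embs` would hold one summed-subword BERT vector per concept name in the ontology.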

B. CANDIDATE CONCEPT RANKING
In this section, we re-rank the list of candidate concepts by fine-tuning the BERT-based models, where we transform a binary classification task into a ranking task. Suppose there are k candidates in N_m = {n_1, ..., n_k} for the input mention m. We can generate all mention-candidate name pairs {(m, n_1), (m, n_2), ..., (m, n_k)} from the mention m and its candidate names N_m. We design the labels of the mention-candidate pairs as binary classes, 'correct' or 'incorrect.' If the i-th candidate concept name n_i maps to the concept ID c of the target mention m, it is labeled '1'; otherwise it is labeled '0', meaning that the candidate n_i is irrelevant to the mention m and serves as a negative sample. Concretely, for each pair of the mention m and the i-th candidate name n_i, we take the input sequence '[CLS] m [SEP] n_i' for the fine-tuning procedure, where '[CLS]' is a special classification token and '[SEP]' is a special separator token between m and n_i. To apply BERT to the classification task, we use the final hidden state of the special token '[CLS]' of the mention-candidate pair to compute the probability distribution over the binary classes, and the output probability of each class is calculated using the softmax function.
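A minimal sketch of the pair construction described above, assuming a dictionary `id_of` that maps concept names to their concept IDs (a hypothetical helper, not part of the paper's code):

```python
def build_pairs(mention, candidates, gold_id, id_of):
    """Build '[CLS] m [SEP] n_i' input sequences with binary labels:
    1 if the candidate name maps to the mention's gold concept ID,
    else 0 (a negative sample)."""
    pairs = []
    for name in candidates:
        text = f"[CLS] {mention} [SEP] {name}"
        label = 1 if id_of[name] == gold_id else 0
        pairs.append((text, label))
    return pairs
```

These (sequence, label) pairs would then be fed to the BERT fine-tuning procedure as a binary classification dataset.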
However, these hard labels may adversely affect model generalization, as probabilistic models become overconfident about their predictions and overfit the training data with hard targets [48]. To solve this problem, we employ the confidence penalty as a regularization term in the loss function to alleviate peaked distributions [8]. The conditional distribution p_θ(y|x) is described as:

p_θ(y_i | x) = exp(z_i) / Σ_j exp(z_j)    (3)

where p_θ(y_i | x) is the probability of class y_i given an input sequence x, z_i is the logit of the i-th class, and i indicates the index of each class. The confidence penalty loss (CPL) function penalizes low-entropy output distributions by adding the negative entropy H(p_θ(y|x)) to the negative log-likelihood training objective as follows:

L(θ) = −Σ log p_θ(y|x) − β H(p_θ(y|x))    (4)

where β controls the strength of the confidence penalty. Thus, the incorporation of negative entropy into the original loss function mitigates overfitting and improves generalization performance [49]. Similar to the previous method [39], we define the probability of label '1' obtained from the softmax function as a ranking BERT-score (Score_RB) of each mention-candidate pair (m, n) as follows:

Score_RB(m, n) = P(label = 1 | m, n) ∈ R    (5)
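For a single example, the confidence-penalized objective can be sketched in a few lines; `probs` is assumed to be the softmax output over the binary classes and `beta` the penalty weight described above.

```python
import math

def confidence_penalty_loss(probs, gold, beta):
    """NLL with a confidence penalty (Pereyra et al. style):
    L = -log p(gold) - beta * H(p), where H(p) = -sum_i p_i log p_i.
    Subtracting beta*H rewards higher-entropy (less peaked) outputs."""
    nll = -math.log(probs[gold])
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return nll - beta * entropy
```

With `beta = 0` this reduces to the ordinary cross-entropy loss; larger `beta` discourages over-confident predictions.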

C. ENTITY DISAMBIGUATION
The final score for each mention-candidate pair is calculated from the two aforementioned scores, Score_SB and Score_RB, as follows:

Score(m, n) = Score_SB(m, n) + Score_RB(m, n)    (6)

Using the above equation, we re-calculate the scores of all retrieved pairs and re-order the list of candidates in decreasing order of the final ranking score. Therefore, we can predict the proper concept ID c as the one with the highest score:

c = ID(argmax_{n ∈ N_m} Score(m, n))    (7)
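Putting the two scores together, the disambiguation step reduces to an argmax over the candidate list; the toy score values and names below are assumptions for illustration only.

```python
def predict_concept(candidates, score_sb, score_rb, id_of):
    """Ensemble scoring: Score(m, n) = Score_SB(m, n) + Score_RB(m, n).
    Returns the concept ID of the highest-scoring candidate name."""
    best = max(candidates, key=lambda n: score_sb[n] + score_rb[n])
    return id_of[best]
```

A candidate that is merely second-best under each individual score can still win under the summed score, which is the behavior the ensemble relies on.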

IV. EXPERIMENTAL SETUP

A. DATASETS
We evaluated our normalization approach on the English biomedical benchmark corpora described in Table 1: the National Center for Biotechnology Information disease (NCBI) corpus [50], the BioCreative V Chemical Disease Relation (CDR) corpus [16], the BioCreative II Gene Normalization (GN) corpus [14], and the plant (Plant) corpus [6]. These corpora cover four major biological entity types: disease, chemical, gene, and plant. We briefly explain each corpus in the following sections.
NCBI for disease names: The NCBI corpus is the gold standard dataset for disease name recognition and normalization tasks, which consists of disease mentions mapped to their concept IDs in the MEDIC [51] vocabulary of the Comparative Toxicogenomics Database (CTD) project [52]. This corpus is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/. In our experiments, we used the July 2012 version of MEDIC, which contains 11,915 MeSH and OMIM identifiers and 71,923 disease names with synonyms.
CDR for disease and chemical names: The CDR corpus is a dataset used for the BioCreative V challenge based on disease and chemical entity recognition and chemical-induced disease relation extraction tasks. It can be downloaded from http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/. It should be noted that we denote the subsets with disease entities and chemical entities as 'CDR-DIS' and 'CDR-CHEM', respectively. In this study, we used the MEDIC version published in June 2015 and the CTD chemical vocabulary published in July 2015 for the experiments on CDR-DIS and CDR-CHEM, respectively.
GN for human gene names: The GN corpus is the gold standard for the gene normalization task in the BioCreative II challenge to determine human genes or gene products mentioned in PubMed abstracts and to map them to the unique concept IDs. This corpus can be found at https://biocreative.bioinformatics.udel.edu/tasks/biocreativeiii/gn/. In this study, we used the EntrezGene [53] list, which contains 32,975 identifiers and 182,989 gene names and synonyms.
Plant for plant names: The plant corpus is a manually annotated abstract-based corpus for the plant normalization task with plant mentions and their unique concept IDs. The plant corpus is freely available for download (http://gcancer.org/plant/). Following a previous study [6], we also used the viridiplantae ontology from the NCBI taxonomy database [54].

B. IMPLEMENTATION DETAILS

1) PREPROCESSING
We applied several pre-processing steps to each mention and each concept in the KB as follows: (i) combine mention information in the training set to increase the coverage of the ontology [12]; (ii) resolve abbreviations using the abbreviation resolution module [55]; (iii) lowercase all characters; (iv) remove all characters except lowercase alphanumerics for both mentions and concept names in the dictionary; and (v) split composite mentions into separate mentions using heuristic rules [12]. For example, we can separate 'breast and ovarian cancer' into 'breast cancer' and 'ovarian cancer'.
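Steps (iii)-(v) can be sketched as follows; the composite-splitting regex is a deliberately minimal stand-in for the heuristic rules of [12], not the actual rule set.

```python
import re

def normalize_surface(text):
    """Steps (iii)-(iv): lowercase, then keep only lowercase
    alphanumerics and spaces."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def split_composite(mention):
    """Step (v), a minimal heuristic for 'X and Y Z' composites:
    'breast and ovarian cancer' -> ['breast cancer', 'ovarian cancer'].
    Non-composite mentions are returned unchanged."""
    m = re.match(r"^(\w+) and (\w+) (\w+)$", mention)
    if not m:
        return [mention]
    a, b, head = m.groups()
    return [f"{a} {head}", f"{b} {head}"]
```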

2) TRAINING
Our experiments were conducted on a workstation with an Intel(R) Xeon(R) Gold 5120 CPU, 265 GB of RAM, and three Tesla V-100-SXM2-32GB GPUs. Our model was implemented in Python and based on the source code of BERT, which was built with TensorFlow as the backend.
To re-rank the list of candidate concepts, we performed fine-tuning using the BERT-based models, transforming a binary classification task into a ranking task. To fine-tune our models, we derived our training and development datasets from the candidate concept generation step. We heuristically selected k = {100, 10} candidates per input mention to generate mention-candidate name pairs for the training and development sets, respectively, and then employed the pairs as relation classification data. Note that the number of top candidates k was set to 20, as proposed by [41].
To set the training epoch and the weight of the CPL, we performed a grid search over epochs from 1 to 10 and CPL weight values in {0.05, 0.1, 0.2, 0.3, 0.4, 0.5}, and then selected the hyperparameters with the best performance on the development dataset of each corpus. The hyperparameters of our proposed model are described in Table 2. We set the maximum sequence length to 64 and the batch size to 64, based on the recommended options in BERT; other training settings, including the optimizer, are the same as those of the original BERT.
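The grid search can be sketched as below, where `train_eval` is a hypothetical callback that trains a model with the given epoch count and CPL weight and returns its development-set accuracy.

```python
import itertools

def grid_search(train_eval,
                epochs=range(1, 11),
                cpl_weights=(0.05, 0.1, 0.2, 0.3, 0.4, 0.5)):
    """Exhaustively try every (epoch, CPL weight) pair and return the
    configuration with the best development accuracy."""
    return max(itertools.product(epochs, cpl_weights),
               key=lambda cfg: train_eval(*cfg))
```

In practice each call to `train_eval` would be a full fine-tuning run, so the search is run once per corpus.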
Although it takes approximately 8.5 hours to pre-train our models regardless of the length of the task-specific data, each fine-tuning epoch takes a different amount of time, ranging from 10 minutes to 1.3 hours, depending on the size of the training data.

3) EVALUATION
We applied two evaluation methods, namely accuracy and F-measure, for comparison with previous studies that used different evaluation metrics.
Accuracy: We assessed our model using the accuracy of the top k predictions for the disease and chemical name normalization tasks. We define Accuracy@n (Acc@n) as the percentage of mentions in the corpus for which the correct concept ID appears within the top n retrieved candidates. We set n = {1, 5}, denoted Acc@1 and Acc@5, respectively.
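Acc@n as defined above can be computed directly from the ranked candidate lists; the function and variable names here are ours.

```python
def accuracy_at_n(predictions, gold_ids, n):
    """Acc@n: the fraction of mentions whose gold concept ID appears
    within the top-n ranked candidate IDs."""
    hits = sum(1 for ranked, gold in zip(predictions, gold_ids)
               if gold in ranked[:n])
    return hits / len(gold_ids)
```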
F-score: We assessed our model on the gene and plant name normalization tasks based on the following performance measures: precision (p), recall (r), and F-score (f). Precisely, a prediction is counted in the following manner: (i) a true positive (TP) if the predicted identifier matches the answer; (ii) a false positive (FP) if the predicted identifier does not match the answer; and (iii) a false negative (FN) if a gold standard identifier is not matched by any prediction. The formulas are as follows:

p = TP / (TP + FP),  r = TP / (TP + FN),  f = 2pr / (p + r)    (8)
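The measures above follow the standard definitions, sketched here for completeness:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-score from TP/FP/FN counts:
    p = TP/(TP+FP), r = TP/(TP+FN), f = 2pr/(p+r)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```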

V. RESULTS

A. MAIN RESULTS
In Table 3, we compare the performance of our proposed model (Task-PT PubMedBERT with CPL) with previous normalization methods in terms of accuracy and F-score on a set of benchmark corpora. More precisely, our models with 'Score_SB' and 'Score_RB' use only the static BERT-scores and only the ranking BERT-scores, respectively, while our model with 'Score' represents the complete model with the default scoring method. It should be noted that we use the same list of candidates to test the different scoring types of our models. Furthermore, we independently ran each test ten times per corpus from scratch and report the mean accuracy and standard deviation for each evaluation type.
We evaluated the effectiveness of our proposed scoring algorithm experimentally, and the results show that our approach improves the performance of the biomedical normalization task. First, we compare the mean accuracy of our models with that of existing methods on the NCBI, CDR-DIS, and CDR-CHEM test datasets. As shown in the top section of Table 3, our proposed model achieves new SOTA performance on NCBI and CDR-DIS. In particular, our experiments showed that the ensemble scoring algorithm yields significant improvements over the scoring methods using Score_SB and Score_RB separately. One example of such improvements is that 'deficiency of the second component of complement (OMIM:217000)' was correctly predicted as 'deficiency of complement protein c2 (OMIM:217000)' by the final ensemble scoring algorithm, while 'deficiency of the fifth component of complement (OMIM:609536)' incorrectly had the highest score under Score_SB alone. As another example, 'autosomal recessive alport syndrome (MeSH:C536587)' was correctly mapped to 'alport syndrome autosomal recessive (MeSH:C536587)' by the final ensemble scoring algorithm, while 'alport syndrome (MeSH:D009394)' incorrectly had the highest score under Score_RB alone.
Compared with the current SOTA model on CDR-CHEM, our model shows slightly lower accuracy measured by Acc@1 (approximately 0.5%); however, measured by Acc@5, our system obtained the same performance as the SOTA system. In the bottom section of Table 3, the F-score is used to compare the performance of our model with that of existing methods on the GN and Plant test corpora. From the results, it can also be seen that our model consistently outperforms previous models and achieves new SOTA performance in terms of F-score, by 1.2% and 3.5% on the GN and Plant test datasets, respectively. Note that we use the BM25 similarity measure [59] provided by Lucene for gene and plant name retrieval, whereas the performance of the other models was obtained from their respective studies.

B. TASK-PT EVALUATION
We illustrate the impact of Task-PT using several kinds of in-task data with five pre-trained LMs: BERT, BioBERT, SciBERT, PubMedBERT, and PubMedBERT-fulltext. It should be noted that we empirically set the training epoch to 3 and the CPL weight to 0.2 in this experiment. In Table 4, we describe our experiments in terms of three questions: (i) Which model shows the best performance? (ii) How good is the performance of Task-PT with in-task data? (iii) Which type of in-task data is suitable for this approach?
With respect to Acc@1 among the vanilla BERT-based models on the NCBI development set, we observed that a significant benefit is achieved by using the PubMedBERT model instead of PubMedBERT-fulltext, even though the latter is pre-trained on more data. The surprisingly poor performance of the SciBERT model compared to the other models may be explained by the fact that, although SciBERT is an adaptation of BERT for the scientific domain, its computer science text is clearly out-of-domain from the perspective of biomedical applications.
As in-task data, the training, development, and test sets of each corpus were used; we denote the sentences of the training set, of the combined training and development sets, and of all three sets as 'TRAIN', 'TRAIN+DEV', and 'ALL', respectively. Although the performance of BioBERT with TRAIN is slightly lower than expected in terms of Acc@1, all Acc@5 results show reasonable gains, ranging from 0.2% to 2.1%, compared to the vanilla models. In the case of PubMedBERT, the variants with TRAIN+DEV and ALL appear approximately equivalent in Acc@1 and Acc@5. To break the tie between models with the same highest scores, we introduce an additional evaluation, Acc@3 (n = 3). Because the best accuracy is achieved at Acc@3, we adopt Task-PT PubMedBERT with 'ALL' and use it to test the other corpora for the normalization tasks. (Table 4 compares several BERT-based models with and without Task-PT using different in-task data on the NCBI development dataset; values in bold denote the best performance of each corpus.)

C. CALIBRATION EVALUATION
To evaluate the calibration performance, we compared each of our approaches with respect to the expected calibration error (ECE) and over-confidence error (OE) [60], [61]. Owing to the relatively small number of test mentions, we skipped the grouping into interval bins and slightly modified the formulas for calculating ECE and OE as follows:

ECE = |acc(M) − conf(M)| / |M|    (9)
OE = max(conf(M) − acc(M), 0) / |M|    (10)

where M is the list of test mentions, acc(M) is the number of correct predictions for the given M, and conf(M) is the sum of the winning softmax scores over M. When the ECE of a model is close to zero, the model is well-calibrated because its accuracy and confidence are almost the same. In Table 5, it can be seen that the PubMedBERT model using CPL is better calibrated than vanilla PubMedBERT. We additionally visualized the distribution of the output labels for each model. As shown in Fig. 2, the confidence penalty promotes more dispersed distributions, which may lead to better performance.
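Under our reading of the bin-free simplification described in this section, ECE and OE can be computed as follows; the exact published formulas may differ, so this is a reconstruction, and the function names are ours.

```python
def calibration_errors(correct, winning_probs):
    """Bin-free ECE and OE over a test set: acc counts correct
    predictions (correct is a list of 0/1), conf sums the winning
    softmax scores; ECE = |acc - conf| / |M|, and OE penalizes only
    the over-confident direction, max(conf - acc, 0) / |M|."""
    m = len(correct)
    acc = sum(correct)
    conf = sum(winning_probs)
    ece = abs(acc - conf) / m
    oe = max(conf - acc, 0.0) / m
    return ece, oe
```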

D. ERROR ANALYSIS
We performed an error analysis on the NCBI corpus, dividing the false positive predictions produced by our model into four major causes.

1) MISPREDICTIONS IN THE CANDIDATE CONCEPT GENERATION STEP
The majority of the errors (26.7%) were attributed to wrong candidate selection, where the list of candidate names for a mention did not include any concept name of the gold standard concept ID. If appropriate candidates are not extracted from the dictionary in the candidate concept generation step, it becomes difficult to improve performance even with additional post-processing.

2) MISPREDICTIONS BIASED TOWARDS Score SB
About 21.4% of the errors occurred when the Score_SB of a candidate was significantly higher than those of the other candidate names. Although we expect Score_SB to reflect linguistic regularity between pairs of words, slight spelling differences or overlapping forms between mention and candidate pairs can make prediction challenging. For example, although 'desmoid tumor' is annotated with 'MeSH:C535944' as the gold standard concept ID in the NCBI training set (PubMed ID:1351034), 'desmoid tumors' in ''No desmoid tumors were found in these kindreds.'' (PubMed ID:9585611) is annotated with 'MeSH:D018222' in the NCBI test set.
As a special case, Score_SB is 1.0 when the input mention and the predicted name are identical. In this case, the concept ID with the identical name is placed at the top rank, because Score_SB is too dominant in the final score for the list of candidates to be re-ordered. This could be due to an annotation error in the corpus construction, or the same mention could have been interpreted differently depending on the context. These error cases would be equally represented in other studies using four of the corpora (i.e., NCBI, CDR-DIS, CDR-CHEM, and GN), the Plant corpus being the exception. When we eliminated such annotation concerns from the test sets, we obtained improved Acc@1 of up to 93.60% for NCBI, 94.79% for CDR-DIS, and 97.34% for CDR-CHEM, and an F-score of up to 92.4% for GN (see the Supplementary Material for more details).

3) MISPREDICTIONS BIASED TOWARDS Score RB
Contrary to the above-mentioned error, 21.4% of the errors occurred when the Score_RB of a candidate was significantly higher than those of the others, pushing a semantically unrelated concept name to the top. For example, although 'heart abnormalities (MeSH:D006330)' is the gold standard concept name for the input 'cardiac defects (MeSH:D006330)', 'cardiac abnormalities (MeSH:D018376)' has a higher Score_RB than 'heart abnormalities' (0.985 versus 0.008, respectively).

4) MISPREDICTIONS IN THE ENTITY DISAMBIGUATION STEP
Although neither Score_SB nor Score_RB is individually decisive during the scoring and ranking step, the final rank of candidates can change slightly when the sum of both scores is used as the total score. To address this problem, instead of a simple summation, an appropriate weight parameter could be used to balance the relative importance of Score_SB and Score_RB.
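Such a weighted combination could be a hypothetical one-liner, with alpha tuned on the development set; this is our illustration, not part of the paper's system.

```python
def weighted_score(score_sb, score_rb, alpha=0.5):
    """A weighted ensemble, Score = alpha * Score_SB + (1 - alpha) * Score_RB,
    in place of the plain sum; alpha = 0.5 recovers an equal weighting
    (up to scale) and alpha would otherwise be tuned on the dev set."""
    return alpha * score_sb + (1 - alpha) * score_rb
```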

VI. CONCLUSION
In this study, we applied and evaluated pre-trained language representation models for the biomedical entity normalization task framed as a re-ranking problem, which takes advantage of pre-trained LMs in modeling two different scoring strategies between entity mentions and candidate concepts. Across the five biomedical corpora, our experiments showed that our model achieved SOTA performance on four corpora and obtained promising performance on the chemical entity normalization task. We found that PubMedBERT-based models outperformed other BERT-based models. Moreover, performance can be further improved by additional Task-PT with in-task data, and the calibration approach can significantly improve the performance of PubMedBERT-based models. In the future, our approach will be evaluated on more biomedical NLP tasks covering various biological entities.